Search for: All records

Award ID contains: 1943371

  1. Abstract

    Premise

    Robust standards to evaluate quality and completeness are lacking in eukaryotic structural genome annotation, as genome annotation software is developed using model organisms and typically lacks benchmarking to comprehensively evaluate the quality and accuracy of the final predictions. The annotation of plant genomes is particularly challenging due to their large sizes, abundant transposable elements, and variable ploidies. This study investigates the impact of genome quality, complexity, sequence read input, and method on protein‐coding gene predictions.

    Methods

    The impact of repeat masking, long‐read and short‐read inputs, and de novo and genome‐guided protein evidence was examined in the context of the popular BRAKER and MAKER workflows for five plant genomes. The annotations were benchmarked for structural traits and sequence similarity.

    Results

    Benchmarks that reflect gene structures, reciprocal similarity search alignments, and mono‐exonic/multi‐exonic gene counts provide a more complete view of annotation accuracy. Transcripts derived from RNA‐read alignments alone are not sufficient for genome annotation. Gene prediction workflows that combine evidence‐based and ab initio approaches are recommended, and a combination of short and long reads can improve genome annotation. Adding protein evidence from de novo assemblies, genome‐guided transcriptome assemblies, or full‐length proteins from OrthoDB generates more putative false positives as implemented in the current workflows. Post‐processing with functional and structural filters is highly recommended.

    Discussion

    While the annotation of non‐model plant genomes remains complex, this study provides recommendations for inputs and methodological approaches. We discuss a set of best practices to generate an optimal plant genome annotation and present a more robust set of metrics to evaluate the resulting predictions.
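One of the benchmarks named in the Results, the count of mono-exonic versus multi-exonic models, can be sketched from a GFF3 annotation. This is a minimal illustration and not the study's own pipeline: it assumes standard nine-column GFF3 with `exon` features carrying a `Parent` attribute, it counts per transcript rather than per gene, and the function name `count_mono_multi` is hypothetical.

```python
from collections import Counter


def count_mono_multi(gff3_lines):
    """Count transcripts with exactly one exon and with more than one exon.

    Assumes nine-column GFF3 where each exon row has a Parent=<transcript id>
    attribute. Returns a (mono_exonic, multi_exonic) tuple.
    """
    exons_per_transcript = Counter()
    for line in gff3_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines, comments, and directives
        fields = line.split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue  # only exon features are counted
        attrs = dict(
            item.split("=", 1) for item in fields[8].split(";") if "=" in item
        )
        # An exon may be shared by several transcripts (comma-separated parents).
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                exons_per_transcript[parent] += 1
    mono = sum(1 for n in exons_per_transcript.values() if n == 1)
    multi = sum(1 for n in exons_per_transcript.values() if n > 1)
    return mono, multi


# Toy annotation (made-up coordinates): t1 has two exons, t2 has one.
example = [
    "##gff-version 3",
    "chr1\tsrc\texon\t1\t100\t.\t+\t.\tID=e1;Parent=t1",
    "chr1\tsrc\texon\t200\t300\t.\t+\t.\tID=e2;Parent=t1",
    "chr1\tsrc\texon\t500\t900\t.\t-\t.\tID=e3;Parent=t2",
]
mono, multi = count_mono_multi(example)
print(f"mono-exonic: {mono}, multi-exonic: {multi}")
```

A mono-exonic share far above what is expected for the lineage is one signal of false-positive predictions, which is why a structural filter of this kind is usually applied after prediction.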

     
    Free, publicly-accessible full text available July 1, 2024
  2. Abstract

    We present a high-quality assembly and annotation of the periodical cicada species, Magicicada septendecula (Hemiptera: Auchenorrhyncha: Cicadidae). Periodical cicadas have a significant ecological impact, serving as a food source for many mammals, reptiles, and birds. Magicicada are well known for their massive emergences of 1 to 3 species that appear in different locations in the eastern United States nearly every year. These year classes (“broods”) emerge dependably every 13 or 17 yr in a given location. Recently, it has become clear that 4-yr early or late emergences of a sizeable portion of a population are an important part of the history of brood formation; however, the biological mechanisms by which they track the passage of time remain a mystery. Using PacBio HiFi reads in conjunction with Hi-C proximity ligation data, we have assembled and annotated the first whole genome for a periodical cicada, an important resource for future phylogenetic and comparative genomic analysis. This also represents the first quality genome assembly and annotation for the Hemipteran superfamily Cicadoidea. With a scaffold N50 of 518.9 Mb and a complete BUSCO score of 96.7%, we are confident that this assembly will serve as a vital resource toward uncovering the genomic basis of periodical cicadas’ long, synchronized life cycles and will provide a robust framework for further investigations into these insects.
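The scaffold N50 quoted above is the length of the shortest scaffold in the smallest set of longest scaffolds that together cover at least half of the assembly. A minimal sketch of that calculation follows; the function name `n50` and the toy lengths are illustrative assumptions and are not taken from the cicada assembly.

```python
def n50(lengths):
    """Return the N50 of a list of scaffold (or contig) lengths.

    Scaffolds are visited from longest to shortest; the N50 is the length of
    the scaffold at which the running total first reaches half of the
    assembly size.
    """
    if not lengths:
        raise ValueError("at least one sequence length is required")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:  # integer comparison avoids float rounding
            return length


# Toy example: total is 20, and the scaffolds of length 6 and 5 reach 11 >= 10.
print(n50([2, 3, 4, 5, 6]))
```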

     
  3. Abstract

    Background

    De novo phased (haplo)genome assembly using long-read DNA sequencing data has improved the detection and characterization of structural variants (SVs) in plant and animal genomes. Able to span across haplotypes, long reads allow phased, haplogenome assembly in highly outbred organisms such as forest trees. Eucalyptus tree species and interspecific hybrids are the most widely planted hardwood trees with F1 hybrids of Eucalyptus grandis and E. urophylla forming the bulk of fast-growing pulpwood plantations in subtropical regions. The extent of structural variation and its effect on interspecific hybridization is unknown in these trees. As a first step towards elucidating the extent of structural variation between the genomes of E. grandis and E. urophylla, we sequenced and assembled the haplogenomes contained in an F1 hybrid of the two species.

    Findings

    Using Nanopore sequencing and a trio-binning approach, we assembled the separate haplogenomes (566.7 Mb and 544.5 Mb) to 98.0% BUSCO completion. High-density SNP genetic linkage maps of both parents allowed scaffolding of 88.0% of the haplogenome contigs into 11 pseudo-chromosomes (scaffold N50 of 43.8 Mb and 42.5 Mb for the E. grandis and E. urophylla haplogenomes, respectively). We identify 48,729 SVs between the two haplogenomes providing the first detailed insight into genome structural rearrangement in these species. The two haplogenomes have similar gene content, 35,572 and 33,915 functionally annotated genes, of which 34.7% are contained in genome rearrangements.

    Conclusions

    Knowledge of SV and haplotype diversity in the two species will form the basis for understanding the genetic basis of hybrid superiority in these trees.

     
  4. Abstract

    Human activity changes multiple factors in the environment, which can have positive or negative synergistic effects on organisms. However, few studies have explored the causal effects of multiple anthropogenic factors, such as urbanization and invasive species, on animals and the mechanisms that mediate these interactions. This study examines the influence of urbanization on the detrimental effect of invasive avian vampire flies (Philornis downsi) on endemic Darwin's finches in the Galápagos Islands. We experimentally manipulated nest fly abundance in urban and non‐urban locations and then characterized nestling health, fledging success, diet, and gene expression patterns related to host defense. Fledging success of non‐parasitized nestlings from urban (79%) and non‐urban (75%) nests did not differ significantly. However, parasitized, non‐urban nestlings lost more blood, and fewer nestlings survived (8%) compared to urban nestlings (50%). Stable isotopic values (δ15N) from urban nestling feces were higher than those from non‐urban nestlings, suggesting that urban nestlings are consuming more protein. δ15N values correlated negatively with parasite abundance, which suggests that diet might influence host defenses (e.g., tolerance and resistance). Parasitized, urban nestlings differentially expressed genes within pathways associated with red blood cell production (tolerance) and pro‐inflammatory response (innate immunological resistance), compared to parasitized, non‐urban nestlings. In contrast, parasitized non‐urban nestlings differentially expressed genes within pathways associated with immunoglobulin production (adaptive immunological resistance). Our results suggest that urban nestlings are investing more in pro‐inflammatory responses to resist parasites but also recovering more blood cells to tolerate blood loss. Although non‐urban nestlings are mounting an adaptive immune response, it is likely a last effort by the immune system rather than an effective defense against avian vampire flies since few nestlings survived.

     
  5. Abstract

    With the advent of affordable and more accurate third-generation sequencing technologies, and the associated bioinformatic tools, it is now possible to sequence, assemble, and annotate more species of conservation concern than ever before. Juglans cinerea, commonly known as butternut or white walnut, is a member of the walnut family, native to the Eastern United States and Southeastern Canada. The species is currently listed as Endangered on the IUCN Red List due to decline from an invasive fungus known as Ophiognomonia clavigignenti-juglandacearum (Oc-j) that causes butternut canker. Oc-j creates visible sores on the trunk, which essentially starve and slowly kill the tree. Natural resistance to this pathogen is rare. Conserving butternut is of utmost priority due to its critical ecosystem role and cultural significance. As part of an integrated undergraduate and graduate student training program in biodiversity and conservation genomics, the first reference genome for Juglans cinerea is described here. This chromosome-scale 539 Mb assembly was generated from over 100× coverage of Oxford Nanopore long reads and scaffolded with the Juglans mandshurica genome. Scaffolding with a closely related species oriented and ordered the sequences in a manner more representative of the structure of the genome without altering the sequence. Comparisons with sequenced Juglandaceae revealed high levels of synteny and further supported J. cinerea's recent phylogenetic placement. Comparative assessment of gene family evolution revealed a significant number of contracting families, including several associated with biotic stress response.

     
  6. Holland, J. (Ed.)
    Abstract

    Douglas-fir (Pseudotsuga menziesii) is native to western North America. It grows in a wide range of environmental conditions and is an important timber tree. Although there are several studies on the gene expression responses of Douglas-fir to abiotic cues, the absence of high-quality transcriptome and genome data is a barrier to further investigation. As with most conifers, the available transcriptome and genome reference dataset for Douglas-fir remains fragmented and requires refinement. We aimed to generate a highly accurate and complete reference transcriptome and genome annotation. We deep-sequenced the transcriptome of Douglas-fir needles from seedlings that were grown under nonstress control conditions or a combination of heat and drought stress conditions using long-read (LR) and short-read (SR) sequencing platforms. We used 2 computational approaches, namely de novo and genome-guided LR transcriptome assembly. Using the LR de novo assembly, we identified 1.3X more high-quality transcripts, 1.85X more “complete” genes, and 2.7X more functionally annotated genes compared to the genome-guided assembly approach. We predicted 666 long noncoding RNAs and 12,778 unique protein-coding transcripts including 2,016 putative transcription factors. We leveraged the LR de novo assembled transcriptome with paired-end SR and a published single-end SR transcriptome to generate an improved genome annotation. This was conducted with BRAKER2 and refined based on functional annotation, repetitive content, and transcriptome alignment. This high-quality genome annotation has 51,419 unique gene models derived from 322,631 initial predictions. Overall, our informatics approach provides a new reference Douglas-fir transcriptome assembly and genome annotation with considerably improved completeness and functional annotation.

     
  7. Green plants play a fundamental role in ecosystems, human health, and agriculture. As de novo genomes are being generated for all known eukaryotic species as advocated by the Earth BioGenome Project, increasing genomic information on green land plants is essential. However, setting standards for the generation and storage of the complex set of genomes that characterize the green lineage of life is a major challenge for plant scientists. Such standards will need to accommodate the immense variation in green plant genome size, transposable element content, and structural complexity while enabling research into the molecular and evolutionary processes that have resulted in this enormous genomic variation. Here we provide an overview and assessment of the current state of knowledge of green plant genomes. To date fewer than 300 complete chromosome-scale genome assemblies representing fewer than 900 species have been generated across the estimated 450,000 to 500,000 species in the green plant clade. These genomes range in size from 12 Mb to 27.6 Gb and are biased toward agricultural crops with large branches of the green tree of life untouched by genomic-scale sequencing. Locating suitable tissue samples of most species of plants, especially those taxa from extreme environments, remains one of the biggest hurdles to increasing our genomic inventory. Furthermore, the annotation of plant genomes is at present undergoing intensive improvement. It is our hope that this fresh overview will help in the development of genomic quality standards for a cohesive and meaningful synthesis of green plant genomes as we scale up for the future. 
  8. A global international initiative, such as the Earth BioGenome Project (EBP), requires both agreement and coordination on standards to ensure that the collective effort generates rapid progress toward its goals. To this end, the EBP initiated five technical standards committees comprising volunteer members from the global genomics scientific community: Sample Collection and Processing, Sequencing and Assembly, Annotation, Analysis, and IT and Informatics. The current versions of the resulting standards documents are available on the EBP website, with the recognition that opportunities, technologies, and challenges may improve or change in the future, requiring flexibility for the EBP to meet its goals. Here, we describe some highlights from the proposed standards, and areas where additional challenges will need to be met. 