
Title: Leveraging histone modifications to improve genome annotations
Abstract: Accurate genome annotations are essential to modern biology; however, they remain challenging to produce. Variation in gene structure and expression across species, as well as within an organism, makes correctly annotating genes arduous, an issue exacerbated by pitfalls in current in silico methods. These issues necessitate complementary approaches to add confidence and rectify potential misannotations. Integration of epigenomic data into genome annotation is one such approach. In this study, we utilized sets of histone modification data, which are precisely distributed at either gene bodies or promoters, to evaluate the annotation of the Zea mays genome. We leveraged these data genome wide, allowing for identification of annotations discordant with empirical data. In total, 13,159 annotation discrepancies were found in Z. mays upon integrating data across three different tissues, which were corroborated using RNA-based approaches. Upon correction, genes were extended by an average of 2128 base pairs, and we identified 2529 novel genes. Application of this method to five additional plant genomes identified a series of misannotations, as well as novel genes, including 13,836 in Asparagus officinalis, 2724 in Setaria viridis, 2446 in Sorghum bicolor, 8631 in Glycine max, and 2585 in Phaseolus vulgaris. This study demonstrates that histone modification data can be leveraged to rapidly improve current genome annotations across diverse plant lineages.
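The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes gene-body histone-mark domains have already been called as intervals, uses half-open BED-like coordinates, and all names (`Interval`, `find_discrepancies`) are hypothetical. A domain that overlaps an annotated gene but reaches past its boundaries suggests the gene should be extended; a domain that overlaps no gene is a candidate novel gene.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    """A genomic interval with half-open coordinates [start, end)."""
    chrom: str
    start: int
    end: int
    name: str = ""


def find_discrepancies(genes, domains):
    """Compare gene-body mark domains with annotated genes.

    Returns (extended, novel):
      extended: gene name -> (new_start, new_end) for genes whose
                overlapping domain reaches past an annotated boundary
      novel:    domains that overlap no annotated gene
    This is a simplification: a domain spanning two genes extends both,
    whereas a real analysis would need to decide how to resolve that case.
    """
    extended = {}
    novel = []
    for d in domains:
        overlapping = [
            g for g in genes
            if g.chrom == d.chrom and g.start < d.end and d.start < g.end
        ]
        if not overlapping:
            novel.append(d)
            continue
        for g in overlapping:
            if d.start < g.start or d.end > g.end:
                prev_start, prev_end = extended.get(g.name, (g.start, g.end))
                extended[g.name] = (min(prev_start, d.start),
                                    max(prev_end, d.end))
    return extended, novel


# Toy data, invented purely for illustration.
genes = [
    Interval("chr1", 100, 500, "geneA"),
    Interval("chr1", 2000, 3000, "geneB"),
]
domains = [
    Interval("chr1", 50, 700),     # reaches past both ends of geneA
    Interval("chr1", 2100, 2900),  # fully inside geneB, so concordant
    Interval("chr1", 5000, 6000),  # overlaps no gene
    Interval("chr2", 100, 500),    # different chromosome, overlaps no gene
]
extended, novel = find_discrepancies(genes, domains)
```

The quadratic scan keeps the sketch short; at genome scale an interval tree or a sorted sweep would be the usual choice.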
Hufford, M
Award ID(s):
1856627 1905869
Publication Date:
Journal Name:
G3 Genes|Genomes|Genetics
Sponsoring Org:
National Science Foundation
More Like this
  1. Accurate annotation of plant genomes remains complex due to the presence of many pseudogenes arising from whole-genome duplication-generated redundancy or the capture and movement of gene fragments by transposable elements. Machine learning on genome-wide epigenetic marks, informed by transcriptomic and proteomic training data, could be used to improve annotations through classification of all putative protein-coding genes as either constitutively silent or able to be expressed. Expressed genes were subclassified as able to express both mRNAs and proteins or only RNAs, and CG gene body methylation was associated only with the former subclass. More than 60,000 protein-coding genes have been annotated in the reference genome of maize inbred B73. About two-thirds of these genes are transcribed and are designated the filtered gene set (FGS). Classification of genes by our trained random forest algorithm was accurate and relied only on histone modifications or DNA methylation patterns within the gene body; promoter methylation was unimportant. Other inbred lines are known to transcribe significantly different sets of genes, indicating that the FGS is specific to B73. We accurately classified the sets of transcribed genes in additional inbred lines, arising from inbred-specific DNA methylation patterns. This approach highlights the potential of using chromatin information to improve annotations of functional genes.
  2. The contemporary capacity of genome sequence analysis significantly lags behind the rapidly evolving sequencing technologies. Retrieving biologically meaningful information from an ever-increasing amount of genome data would be significantly beneficial for functional genomic studies. For example, the duplication, organization, evolution, and function of superfamily genes are arguably important in many aspects of life. However, the incompleteness of annotations in many sequenced genomes often results in biased conclusions in comparative genomic studies of superfamilies. Here, we present a Perl program, called Closing Target Trimming (CTT), for automatically identifying most, if not all, members of a gene family in any sequenced genome on the CentOS 7 platform. To benefit a broader application on other operating systems, we also created a Docker application package, CTTdocker. Our test data on the F-box gene superfamily showed 78.2 and 79% gene finding accuracies in two well annotated plant genomes, Arabidopsis thaliana and rice, respectively. To further demonstrate the effectiveness of this program, we ran it through 18 plant genomes and five non-plant genomes to compare the expansion of the F-box and the BTB superfamilies. The program discovered that on average 12.7 and 9.3% of the total F-box and BTB members, respectively, are new loci in plant genomes, while it only found a small number of new members in vertebrate genomes. Therefore, different evolutionary and regulatory mechanisms of cullin-RING ubiquitin ligases may be present in plants and animals. We also annotated and compared the Pkinase family members across a wide range of organisms, including 10 fungi, 10 metazoa, 10 vertebrates, and 10 additional plants, which were randomly selected from the Ensembl database.
Our CTT annotation recovered on average 14% more loci, including pseudogenes, of the Pkinase superfamily in these 40 genomes, demonstrating its robust replicability and scalability in annotating superfamily members in any genome.
  3. The stiff-stalk heterotic group in maize (Zea mays L.) is an important source of inbreds used in U.S. commercial hybrid production. Founder inbreds B14, B37, B73, and, to a lesser extent, B84, are found in the pedigrees of a majority of commercial seed parent inbred lines. We created high-quality genome assemblies of B84 and four expired Plant Variety Protection (ex-PVP) lines LH145 representing B14, NKH8431 of mixed descent, PHB47 representing B37, and PHJ40, which is a Pioneer Hi-Bred International (PHI) early stiff-stalk type. Sequence was generated using long-read sequencing achieving highly contiguous assemblies of 2.13–2.18 Gbp with N50 scaffold lengths >200 Mbp. Inbred-specific gene annotations were generated using a core five-tissue gene expression atlas, whereas transposable element (TE) annotation was conducted using de novo and homology-directed methodologies. Compared with the reference inbred B73, synteny analyses revealed extensive collinearity across the five stiff-stalk genomes, although unique components of the maize pangenome were detected. Comparison of this set of stiff-stalk inbreds with the original Iowa Stiff Stalk Synthetic breeding population revealed that these inbreds represent only a proportion of variation in the original stiff-stalk pool and there are highly conserved haplotypes in released public and ex-Plant Variety Protection inbreds. Despite the reduction in variation from the original stiff-stalk population, substantial genetic and genomic variation was identified supporting the potential for continued breeding success in this pool. The assemblies described here represent stiff-stalk inbreds that have historical and commercial relevance and provide further insight into the emerging maize pangenome.
  4. INTRODUCTION Transposable elements (TEs), repeat expansions, and repeat-mediated structural rearrangements play key roles in chromosome structure and species evolution, contribute to human genetic variation, and substantially influence human health through copy number variants, structural variants, insertions, deletions, and alterations to gene transcription and splicing. Despite their formative role in genome stability, repetitive regions have been relegated to gaps and collapsed regions in human genome reference GRCh38 owing to the technological limitations during its development. The lack of linear sequence in these regions, particularly in centromeres, resulted in the inability to fully explore the repeat content of the human genome in the context of both local and regional chromosomal environments. RATIONALE Long-read sequencing supported the complete, telomere-to-telomere (T2T) assembly of the pseudo-haploid human cell line CHM13. This resource affords a genome-scale assessment of all human repetitive sequences, including TEs and previously unknown repeats and satellites, both within and outside of gaps and collapsed regions. Additionally, a complete genome enables the opportunity to explore the epigenetic and transcriptional profiles of these elements that are fundamental to our understanding of chromosome structure, function, and evolution. Comparative analyses reveal modes of repeat divergence, evolution, and expansion or contraction with locus-level resolution. RESULTS We implemented a comprehensive repeat annotation workflow using previously known human repeats and de novo repeat modeling followed by manual curation, including assessing overlaps with gene annotations, segmental duplications, tandem repeats, and annotated repeats. Using this method, we developed an updated catalog of human repetitive sequences and refined previous repeat annotations.
We discovered 43 previously unknown repeats and repeat variants and characterized 19 complex, composite repetitive structures, which often carry genes, across T2T-CHM13. Using precision nuclear run-on sequencing (PRO-seq) and CpG methylated sites generated from Oxford Nanopore Technologies long-read sequencing data, we assessed RNA polymerase engagement across retroelements genome-wide, revealing correlations between nascent transcription, sequence divergence, CpG density, and methylation. These analyses were extended to evaluate RNA polymerase occupancy for all repeats, including high-density satellite repeats that reside in previously inaccessible centromeric regions of all human chromosomes. Moreover, using both mapping-dependent and mapping-independent approaches across early developmental stages and a complete cell cycle time series, we found that engaged RNA polymerase across satellites is low; in contrast, TE transcription is abundant and serves as a boundary for changes in CpG methylation and centromere substructure. Together, these data reveal the dynamic relationship between transcriptionally active retroelement subclasses and DNA methylation, as well as potential mechanisms for the derivation and evolution of new repeat families and composite elements. Focusing on the emerging T2T-level assembly of the HG002 X chromosome, we reveal that a high level of repeat variation likely exists across the human population, including composite element copy numbers that affect gene copy number. Additionally, we highlight the impact of repeats on the structural diversity of the genome, revealing repeat expansions with extreme copy number differences between humans and primates while also providing high-confidence annotations of retroelement transduction events. 
CONCLUSION The comprehensive repeat annotations and updated repeat models described herein serve as a resource for expanding the compendium of human genome sequences and reveal the impact of specific repeats on the human genome. In developing this resource, we provide a methodological framework for assessing repeat variation within and between human genomes. The exhaustive assessment of the transcriptional landscape of repeats, at both the genome scale and locally, such as within centromeres, sets the stage for functional studies to disentangle the role transcription plays in the mechanisms essential for genome stability and chromosome segregation. Finally, our work demonstrates the need to increase efforts toward achieving T2T-level assemblies for nonhuman primates and other species to fully understand the complexity and impact of repeat-derived genomic innovations that define primate lineages, including humans. Telomere-to-telomere assembly of CHM13 supports repeat annotations and discoveries. The human reference T2T-CHM13 filled gaps and corrected collapsed regions (triangles) in GRCh38. Combining long read–based methylation calls, PRO-seq, and multilevel computational methods, we provide a compendium of human repeats, define retroelement expression and methylation profiles, and delineate locus-specific sites of nascent transcription genome-wide, including previously inaccessible centromeres. SINE, short interspersed element; SVA, SINE–variable number tandem repeat–Alu; LINE, long interspersed element; LTR, long terminal repeat; TSS, transcription start site; pA, xxxxxxxxxxxxxxxx.
  5. Polyacetylenic lipids accumulate in various Apiaceae species after pathogen attack, suggesting that these compounds are naturally occurring pesticides and potentially valuable resources for crop improvement. These compounds also promote human health and slow tumor growth. Even though polyacetylenic lipids were discovered decades ago, the biosynthetic pathway underlying their production is largely unknown. To begin filling this gap and ultimately enable polyacetylene engineering, we studied polyacetylenes and their biosynthesis in the major Apiaceae crop carrot (Daucus carota subsp. sativus). Using gas chromatography and mass spectrometry, we identified three known polyacetylenes and assigned provisional structures to two novel polyacetylenes. We also quantified these compounds in carrot leaf, petiole, root xylem, root phloem, and root periderm extracts. Falcarindiol and falcarinol predominated and accumulated primarily in the root periderm. Since the multiple double and triple carbon-carbon bonds that distinguish polyacetylenes from ubiquitous fatty acids are often introduced by Δ12 oleic acid desaturase (FAD2)-type enzymes, we mined the carrot genome for FAD2 genes. We identified a FAD2 family with an unprecedented 24 members and analyzed public, tissue-specific carrot RNA-Seq data to identify coexpressed members with root periderm-enhanced expression. Six candidate genes were heterologously expressed individually and in combination in yeast and Arabidopsis (Arabidopsis thaliana), resulting in the identification of one canonical FAD2 that converts oleic to linoleic acid, three divergent FAD2-like acetylenases that convert linoleic into crepenynic acid, and two bifunctional FAD2s with Δ12 and Δ14 desaturase activity that convert crepenynic into the further desaturated dehydrocrepenynic acid, a polyacetylene pathway intermediate.
These genes can now be used as a basis for discovering other steps of falcarin-type polyacetylene biosynthesis, to modulate polyacetylene levels in plants, and to test the in planta function of these molecules. Many organisms implement specialized biochemical pathways to convert ubiquitous metabolites into bioactive chemical compounds. Since plants comprise the majority of the human diet, specialized plant metabolites play crucial roles not only in crop biology but also in human nutrition. Some asterids produce lipid compounds called polyacetylenes (for review, see Negri, 2015) that exhibit antifungal activity (Garrod et al., 1978; Kemp, 1978; Harding and Heale, 1980, 1981; Olsson and Svensson, 1996) and accumulate in response to fungal phytopathogen attack (De Wit and Kodde, 1981; Elgersma and Liem, 1989). These observations have led to the longstanding hypothesis that polyacetylenes are natural pesticides. These same lipid compounds exhibit cytotoxic activity against human cancer cell lines and slow tumor growth (Fujimoto and Satoh, 1988; Matsunaga et al., 1989, 1990; Cunsolo et al., 1993; Bernart et al., 1996; Kobaek-Larsen et al., 2005; Zidorn et al., 2005), making them important nutritional compounds. The major source of polyacetylenes in the human diet is carrot (Daucus carota L.). Carrot is one of the most important crop species in the Apiaceae, with rapidly increasing worldwide cultivation (Rubatzky et al., 1999; Dawid et al., 2015). The most common carrot polyacetylenes are C17 linear aliphatic compounds containing two conjugated carbon-carbon triple bonds, one or two carbon-carbon double bonds, and a diversity of additional in-chain oxygen-containing functional groups. In carrot, the most abundant of these compounds are falcarinol and falcarindiol (Dawid et al., 2015). Based on their structures, it has been hypothesized that these compounds (alias falcarin-type polyacetylenes) are derived from ubiquitous fatty acids. 
Indeed, biochemical investigations (Haigh et al., 1968; Bohlman, 1988), radio-chemical tracer studies (Barley et al., 1988), and the discovery of pathway intermediates (Jones et al., 1966; Kawazu et al., 1973) implicate a diversion of flux away from linolenate biosynthesis as the entry point into falcarin-type polyacetylene biosynthesis (for review, see Minto and Blacklock, 2008). The final steps of linolenate biosynthesis are the conversion of oleate to linoleate, mediated by fatty acid desaturase 2 (FAD2), and linoleate to linolenate, catalyzed by FAD3. Some plant species contain divergent forms of FAD2 that, instead of or in addition to converting oleate to linoleate, catalyze the installation of unusual in-chain functional groups such as hydroxyl groups, epoxy groups, conjugated double bonds, or carbon-carbon triple bonds into the acyl chain (Badami and Patil, 1980) and thus divert flux from linolenate production into the accumulation of unusual fatty acids. Previous work in parsley (Petroselinum crispum; Apiaceae) identified a divergent form of FAD2 that (1) was up-regulated in response to pathogen treatment and (2) when expressed in soybean embryos resulted in production of the monoyne crepenynate and, by the action of an unassigned enzyme, dehydrocrepenynate (Kirsch et al., 1997; Cahoon et al., 2003). The results of the parsley studies are consistent with a pathogen-responsive, divergent FAD2-mediated pathway that leads to acetylenic fatty acids. However, information regarding the branch point into acetylenic fatty acid production in agriculturally relevant carrot is still largely missing, in particular, the identification and functional characterization of enzymes that can divert carbon flux away from linolenate biosynthesis into the production of dehydrocrepenynate and ultimately falcarin-type polyacetylenes. 
Such genes, once identified, could be used in the future design of transgenic carrot lines with altered polyacetylene content, enabling direct testing of in planta polyacetylene function and potentially the engineering of pathogen-resistant, more nutritious carrots. These genes could also provide the foundation for further investigations of more basic aspects of plant biology, including the evolution of fatty acid-derived natural product biosynthesis pathways across the Asterid clade, as well as the role of these pathways and compounds in plant ecology and plant defense. Recently, a high-quality carrot genome assembly was released (Iorizzo et al., 2016), providing a foundation for genome-enabled studies of Apiaceous species. This study also provided publicly accessible RNA sequencing (RNA-Seq) data from diverse carrot tissues. Using these resources, this study aimed to provide a detailed gas chromatography-based quantification of polyacetylenes in carrot tissues for which RNA-Seq data are available, then combine this information with bioinformatics analysis and heterologous expression to identify and characterize biosynthetic genes that underlie the major entry point into carrot polyacetylene biosynthesis. To achieve these goals, thin-layer chromatography (TLC) was combined with gas chromatography-mass spectrometry (GC-MS) and gas chromatography-flame ionization detection to identify and quantify polyacetylenic metabolites in five different carrot tissues. Then the sequences and tissue expression profiles of potential FAD2 and FAD2-like genes annotated in the D. carota genome were compared with the metabolite data to identify candidate pathway genes, followed by biochemical functionality tests using yeast (Saccharomyces cerevisiae) and Arabidopsis (Arabidopsis thaliana) as heterologous expression systems.