
Title: Araport11: a complete reannotation of the Arabidopsis thaliana reference genome

The flowering plant Arabidopsis thaliana is a dicot model organism for research in many aspects of plant biology. A comprehensive annotation of its genome paves the way for understanding the functions and activities of all types of transcripts, including mRNA, the various classes of non‐coding RNA, and small RNA. The TAIR10 annotation update had a profound impact on Arabidopsis research but was released more than 5 years ago. Maintaining the accuracy of the annotation continues to be a prerequisite for future progress. Using an integrative annotation pipeline, we assembled tissue‐specific RNA‐Seq libraries from 113 datasets and constructed 48 359 transcript models of protein‐coding genes in eleven tissues. In addition, we annotated various classes of non‐coding RNA including microRNA, long intergenic RNA, small nucleolar RNA, natural antisense transcript, small nuclear RNA, and small RNA using published datasets and in‐house analytic results. Altogether, we identified 635 novel protein‐coding genes, 508 novel transcribed regions, 5178 non‐coding RNAs, and 35 846 small RNA loci that were formerly unannotated. Analysis of the splicing events and RNA‐Seq based expression profiles revealed the landscapes of gene structures, untranslated regions, and splicing activities to be more intricate than previously appreciated. Furthermore, we present 692 uniformly expressed housekeeping genes, 43% of whose human orthologs are also housekeeping genes. This updated Arabidopsis genome annotation with a substantially increased resolution of gene models will not only further our understanding of the biological processes of this plant model but also of other species.

Journal Name:
The Plant Journal
Page Range / eLocation ID:
p. 789-804
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    Plant small RNAs (sRNAs) modulate key physiological mechanisms through post‐transcriptional and transcriptional silencing of gene expression. Small RNAs fall into two major categories: those that are reliant on RNA‐dependent RNA polymerases (RDRs) for biogenesis and those that are not. Known RDR1/2/6‐dependent sRNAs include phased and repeat‐associated short interfering RNAs, while known RDR1/2/6‐independent sRNAs are primarily microRNAs (miRNA) and other hairpin‐derived sRNAs. In this study we produced and analyzed sRNA‐seq libraries from rdr1/rdr2/rdr6 triple mutant plants. We found 58 previously annotated miRNA loci that were reliant on RDR1, ‐2, or ‐6 function, casting doubt on their classification. We also found 38 RDR1/2/6‐independent sRNA loci that are not MIRNAs or otherwise hairpin‐derived, and did not fit into other known paradigms for sRNA biogenesis. These 38 sRNA‐producing loci have as‐yet‐undescribed biogenesis mechanisms, and are frequently located in the vicinity of protein‐coding genes. Altogether, our analysis suggests that these 38 loci represent one or more undescribed types of sRNA in Arabidopsis thaliana.

  2. Abstract

    Identifying genes that interact to confer a biological function to an organism is one of the main goals of functional genomics. High‐throughput technologies for assessment and quantification of genome‐wide gene expression patterns have enabled systems‐level analyses to infer pathways or networks of genes involved in different functions under many different conditions. Here, we leveraged the publicly available, information‐rich RNA‐Seq datasets of the model plant Arabidopsis thaliana to construct a gene co‐expression network, which was partitioned into clusters or modules that harbor genes correlated by expression. Gene ontology and pathway enrichment analyses were performed to assess functional terms and pathways that were enriched within the different gene modules. By interrogating the co‐expression network for genes in different modules that associate with a gene of interest, diverse functional roles of the gene can be deciphered. By mapping genes differentially expressed under a certain condition in Arabidopsis onto the co‐expression network, we demonstrate the ability of the network to uncover novel genes that are likely transcriptionally active but prone to be missed by standard statistical approaches due to their falling outside of the confidence zone of detection. To our knowledge, this is the first A. thaliana co‐expression network constructed using the entire mRNA‐Seq datasets (>20,000) available at the NCBI SRA database. The developed network can serve as a useful resource for the Arabidopsis research community to interrogate specific genes of interest within the network, retrieve the respective interactomes, decipher gene modules that are transcriptionally altered under a certain condition or stage, and gain understanding of gene functions.

  3. Summary

    Spirodela polyrhiza is a fast‐growing aquatic monocot with highly reduced morphology, genome size and number of protein‐coding genes. Considering these biological features of Spirodela and its basal position in the monocot lineage, understanding its genome architecture could shed light on plant adaptation and genome evolution. Like many draft genomes, however, the 158‐Mb Spirodela genome sequence has not been resolved to chromosomes, and important genome characteristics have not been defined. Here we deployed rapid genome‐wide physical maps combined with high‐coverage short‐read sequencing to resolve the 20 chromosomes of Spirodela and to empirically delineate its genome features. Our data revealed a dramatic reduction in the number of the rDNA repeat units in Spirodela to fewer than 100, which is even fewer than that reported for yeast. Consistent with its unique phylogenetic position, small RNA sequencing revealed 29 Spirodela‐specific microRNA, with only two being shared with Elaeis guineensis (oil palm) and Musa balbisiana (banana). Combining DNA methylation data and small RNA sequencing enabled the accurate prediction of 20.5% long terminal repeats (LTRs) that doubled the previous estimate, and revealed a high Solo:Intact LTR ratio of 8.2. Interestingly, we found that Spirodela has the lowest global DNA methylation levels (9%) of any plant species tested. Taken together our results reveal a genome that has undergone reduction, likely through eliminating non‐essential protein coding genes, rDNA and LTRs. In addition to delineating the genome features of this unique plant, the methodologies described and large‐scale genome resources from this work will enable future evolutionary and functional studies of this basal monocot family.

  4. Abstract

    Alternatively spliced genes produce multiple spliced isoforms, called transcript variants. In differential alternative splicing, transcript variant abundance differs across sample types. Differential alternative splicing is common in animal systems and influences cellular development in many processes, but its extent and significance are not as well known in plants. To investigate differential alternative splicing in plants, we examined RNA‐Seq data from rice seedlings. The data included three biological replicates per sample type, approximately 30 million sequence alignments per replicate, and four sample types: roots and shoots treated with exogenous cytokinin delivered hydroponically or a mock treatment. Cytokinin treatment triggered expression changes in thousands of genes but had negligible effect on splicing patterns. However, many genes were differentially spliced between mock‐treated roots and shoots, indicating that our methods were sufficiently sensitive to detect differential splicing between data sets. Quantitative fragment analysis of reverse transcriptase‐PCR products made from newly prepared rice samples confirmed 9 of 10 differential splicing events between rice roots and shoots. Differential alternative splicing typically changed the relative abundance of splice variants that co‐occurred in a data set. Analysis of a similar (but less deeply sequenced) RNA‐Seq data set from Arabidopsis showed the same pattern. In both the Arabidopsis and rice RNA‐Seq data sets, most genes annotated as alternatively spliced had small minor variant frequencies. Of splicing choices with abundant support for minor forms, most alternative splicing events were located within the protein‐coding sequence and maintained the annotated reading frame. A tool for visualizing protein annotations in the context of genomic sequence (ProtAnnot), together with a genome browser (Integrated Genome Browser), was used to visualize and assess effects of differential splicing on gene function. In general, differentially spliced regions coincided with conserved protein domains, indicating that differential alternative splicing is likely to affect protein function between root and shoot tissue in rice.

  5. Summary

    Rice is an important cereal crop, being a staple food for over half of the world's population, and sexual reproduction resulting in grain formation underpins global food security. However, despite considerable research efforts, many of the genes, especially long intergenic non‐coding RNA (lincRNA) genes, involved in sexual reproduction in rice remain uncharacterized. With an increasing number of public resources becoming available, information from different sources can be combined to perform gene functional annotation. We report the development of MCRiceRepGP, a machine learning framework which integrates heterogeneous evidence and employs multicriteria decision analysis and machine learning to predict coding and lincRNA genes involved in sexual reproduction in rice. The rice genome was reannotated using deep‐sequencing transcriptomic data from reproduction‐associated tissue/cell types identifying previously unannotated putative protein‐coding genes and lincRNAs. MCRiceRepGP was used for genome‐wide discovery of sexual reproduction associated coding and lincRNA genes. The protein‐coding and lincRNA genes identified have distinct expression profiles, with a large proportion of lincRNAs reaching maximum expression levels in the sperm cells. Some of the genes are potentially linked to male‐ and female‐specific fertility and heat stress tolerance during the reproductive stage. MCRiceRepGP can be used in combination with other genome‐wide studies, such as genome‐wide association studies, giving greater confidence that the genes identified are associated with the biological process of interest. As more data, especially about mutant plant phenotypes, become available, the power of MCRiceRepGP will grow, providing researchers with a tool to identify candidate genes for future experiments. MCRiceRepGP is available as a web application (
