Title: MCRiceRepGP: a framework for the identification of genes associated with sexual reproduction in rice

Rice is an important cereal crop, being a staple food for over half of the world's population, and sexual reproduction resulting in grain formation underpins global food security. However, despite considerable research efforts, many of the genes, especially long intergenic non‐coding RNA (lincRNA) genes, involved in sexual reproduction in rice remain uncharacterized. With an increasing number of public resources becoming available, information from different sources can be combined to perform gene functional annotation. We report the development of MCRiceRepGP, a machine learning framework which integrates heterogeneous evidence and employs multicriteria decision analysis and machine learning to predict coding and lincRNA genes involved in sexual reproduction in rice. The rice genome was reannotated using deep‐sequencing transcriptomic data from reproduction‐associated tissue/cell types identifying previously unannotated putative protein‐coding genes and lincRNAs. MCRiceRepGP was used for genome‐wide discovery of sexual reproduction associated coding and lincRNA genes. The protein‐coding and lincRNA genes identified have distinct expression profiles, with a large proportion of lincRNAs reaching maximum expression levels in the sperm cells. Some of the genes are potentially linked to male‐ and female‐specific fertility and heat stress tolerance during the reproductive stage. MCRiceRepGP can be used in combination with other genome‐wide studies, such as genome‐wide association studies, giving greater confidence that the genes identified are associated with the biological process of interest. As more data, especially about mutant plant phenotypes, become available, the power of MCRiceRepGP will grow, providing researchers with a tool to identify candidate genes for future experiments. MCRiceRepGP is available as a web application (

Journal Name:
The Plant Journal
Page Range / eLocation ID:
p. 188-202
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    CHESS 3 represents an improved human gene catalog based on nearly 10,000 RNA-seq experiments across 54 body sites. It significantly improves current genome annotation by integrating the latest reference data and algorithms, machine learning techniques for noise filtering, and new protein structure prediction methods. CHESS 3 contains 41,356 genes, including 19,839 protein-coding genes and 158,377 transcripts, with 14,863 protein-coding transcripts not in other catalogs. It includes all MANE transcripts and at least one transcript for most RefSeq and GENCODE genes. On the CHM13 human genome, the CHESS 3 catalog contains an additional 129 protein-coding genes. CHESS 3 is available at

  2. Abstract

    Rice, an important food resource, is highly sensitive to salt stress, which is directly related to food security. Although many studies have identified physiological mechanisms that confer tolerance to the osmotic effects of salinity, the link between rice genotype and salt tolerance is not yet clear. Association of gene co‐expression network and rice phenotypic data under stress has the potential to identify stress‐responsive genes, but there is no standard method to associate stress phenotype with gene co‐expression network. A novel method for integration of gene co‐expression network and stress phenotype data was developed to conduct a system analysis to link genotype to phenotype. We applied a LASSO‐based method to the gene co‐expression network of rice with salt stress to discover key genes and their interactions for salt tolerance‐related phenotypes. Submodules in gene modules identified from the co‐expression network were selected by the LASSO regression, which establishes a linear relationship between gene expression profiles and physiological responses, that is, sodium/potassium condenses under salt stress. Genes in these submodules have functions related to ion transport, osmotic adjustment, and oxidative tolerance. We argued that these genes in submodules are biologically meaningful and useful for studies on rice salt tolerance. This method can be applied to other studies to efficiently and reliably integrate co‐expression network and phenotypic data.

  3. Summary

    Alternative polyadenylation (APA) is a widespread post‐transcriptional mechanism that regulates gene expression through mRNA metabolism, playing a pivotal role in modulating phenotypic traits in rice (Oryza sativa L.). However, little is known about the APA‐mediated regulation underlying the distinct characteristics between two major rice subspecies, indica and japonica. Using a poly(A)‐tag sequencing approach, polyadenylation (poly(A)) site profiles were investigated and compared pairwise from germination to the mature stage between indica and japonica, and extensive differentiation in APA profiles was detected genome‐wide. Genes with subspecies‐specific poly(A) sites were found to contribute to subspecies characteristics, particularly in disease resistance of indica and cold‐stress tolerance of japonica. In most tissues, differential usage of APA sites exhibited an apparent impact on the gene expression profiles between subspecies, and genes with those APA sites were significantly enriched in quantitative trait loci (QTL) related to yield traits, such as spikelet number and 1000‐seed weight. In leaves of the booting stage, APA site‐switching genes displayed global shortening of 3′ untranslated regions with increased expression in indica compared with japonica, and they were overrepresented in the porphyrin and chlorophyll metabolism pathways. This phenomenon may lead to a higher chlorophyll content and photosynthesis in indica than in japonica, being associated with their differential growth rates and yield potentials. We further constructed an online resource for querying and visualizing the poly(A) atlas in these two rice subspecies. Our results suggest that APA may be largely involved in developmental differentiations between two rice subspecies, especially in leaf characteristics and the stress response, broadening our knowledge of the post‐transcriptional genetic basis underlying the divergence of rice traits.

  4. INTRODUCTION

    Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons.

    RATIONALE

    Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types.

    RESULTS

    To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated.

    CONCLUSION

    TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution.

    Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic]
  5. Heřmanský–Pudlák syndrome (HPS), a rare autosomal recessive disorder, manifests with oculocutaneous albinism and a bleeding diathesis. However, severity of disease can be variable and is typically related to the genetic subtype of HPS; HPS type 6 (HPS‐6) is an uncommon subtype generally associated with mild disease. A Caucasian adult female presented with a history of severe bleeding; ophthalmologic examination indicated occult oculocutaneous albinism. The patient was diagnosed with a platelet storage pool disorder, and platelet whole mount electron microscopy demonstrated absent delta granules. Genome‐wide SNP analysis showed regions of homozygosity that included the HPS1 and HPS6 genes. Full length HPS1 transcript was amplified by PCR of genomic DNA. Targeted next‐generation sequencing identified a novel homozygous missense variant in HPS6 (c.383 T > C; p.V128A); this was associated with significantly reduced HPS6 mRNA and protein expression in the patient's fibroblasts compared to control cells. These findings highlight the variable severity of disease manifestations in patients with HPS, and illustrate that HPS can be diagnosed in patients with excessive bleeding and occult oculocutaneous albinism. Genetic analysis and platelet electron microscopy are useful diagnostic tests in evaluating patients with suspected HPS.

    Clinical Trial registration:

    Registration Numbers: NCT00001456 and NCT00084305.
