skip to main content


Title: Predicting transcriptional responses to cold stress across plant species
Although genome-sequence assemblies are available for a growing number of plant species, gene-expression responses to stimuli have been cataloged for only a subset of these species. Many genes show altered transcription patterns in response to abiotic stresses. However, orthologous genes in related species often exhibit different responses to a given stress. Accordingly, data on the regulation of gene expression in one species are not reliable predictors of orthologous gene responses in a related species. Here, we trained a supervised classification model to identify genes that transcriptionally respond to cold stress. A model trained with only features calculated directly from genome assemblies exhibited only modest decreases in performance relative to models trained by using genomic, chromatin, and evolution/diversity features. Models trained with data from one species successfully predicted which genes would respond to cold stress in other related species. Cross-species predictions remained accurate when training was performed in cold-sensitive species and predictions were performed in cold-tolerant species and vice versa. Models trained with data on gene expression in multiple species provided at least equivalent performance to models trained and tested in a single species and outperformed single-species models in cross-species prediction. These results suggest that classifiers trained on stress data from well-studied species may suffice for predicting gene-expression patterns in related, less-studied species with sequenced genomes.  more » « less
Award ID(s):
1845175
NSF-PAR ID:
10320744
Author(s) / Creator(s):
; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
118
Issue:
10
ISSN:
0027-8424
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Transcription factors (TFs) play a central role in regulating molecular level responses of plants to external stresses such as water limiting conditions, but identification of such TFs in the genome remains a challenge. Here, we describe a network-based supervised machine learning framework that accurately predicts and ranks all TFs in the genome according to their potential association with drought tolerance. We show that top ranked regulators fall mainly into two ‘age’ groups; genes that appeared first in land plants and genes that emerged later in the Oryza clade. TFs predicted to be high in the ranking belong to specific gene families, have relatively simple intron/exon and protein structures, and functionally converge to regulate primary and secondary metabolism pathways. Repeated trials of nested cross-validation tests showed that models trained only on regulatory network patterns, inferred from large transcriptome datasets, outperform models trained on heterogenous genomic features in the prediction of known drought response regulators. A new R/Shiny based web application, called the DroughtApp, provides a primer for generation of new testable hypotheses related to regulation of drought stress response. Furthermore, to test the system we experimentally validated predictions on the functional role of the rice transcription factor OsbHLH148, using RNA sequencing of knockout mutants in response to drought stress and protein-DNA interaction assays. Our study exemplifies the integration of domain knowledge for prioritization of regulatory genes in biological pathways of well-studied agricultural traits. 
    more » « less
  2. INTRODUCTION Diverse phenotypes, including large brains relative to body size, group living, and vocal learning ability, have evolved multiple times throughout mammalian history. These shared phenotypes may have arisen repeatedly by means of common mechanisms discernible through genome comparisons. RATIONALE Protein-coding sequence differences have failed to fully explain the evolution of multiple mammalian phenotypes. This suggests that these phenotypes have evolved at least in part through changes in gene expression, meaning that their differences across species may be caused by differences in genome sequence at enhancer regions that control gene expression in specific tissues and cell types. Yet the enhancers involved in phenotype evolution are largely unknown. Sequence conservation–based approaches for identifying such enhancers are limited because enhancer activity can be conserved even when the individual nucleotides within the sequence are poorly conserved. This is due to an overwhelming number of cases where nucleotides turn over at a high rate, but a similar combination of transcription factor binding sites and other sequence features can be maintained across millions of years of evolution, allowing the function of the enhancer to be conserved in a particular cell type or tissue. Experimentally measuring the function of orthologous enhancers across dozens of species is currently infeasible, but new machine learning methods make it possible to make reliable sequence-based predictions of enhancer function across species in specific tissues and cell types. RESULTS To overcome the limits of studying individual nucleotides, we developed the Tissue-Aware Conservation Inference Toolkit (TACIT). Rather than measuring the extent to which individual nucleotides are conserved across a region, TACIT uses machine learning to test whether the function of a given part of the genome is likely to be conserved. More specifically, convolutional neural networks learn the tissue- or cell type–specific regulatory code connecting genome sequence to enhancer activity using candidate enhancers identified from only a few species. This approach allows us to accurately associate differences between species in tissue or cell type–specific enhancer activity with genome sequence differences at enhancer orthologs. We then connect these predictions of enhancer function to phenotypes across hundreds of mammals in a way that accounts for species’ phylogenetic relatedness. We applied TACIT to identify candidate enhancers from motor cortex and parvalbumin neuron open chromatin data that are associated with brain size relative to body size, solitary living, and vocal learning across 222 mammals. Our results include the identification of multiple candidate enhancers associated with brain size relative to body size, several of which are located in linear or three-dimensional proximity to genes whose protein-coding mutations have been implicated in microcephaly or macrocephaly in humans. We also identified candidate enhancers associated with the evolution of solitary living near a gene implicated in separation anxiety and other enhancers associated with the evolution of vocal learning ability. We obtained distinct results for bulk motor cortex and parvalbumin neurons, demonstrating the value in applying TACIT to both bulk tissue and specific minority cell type populations. To facilitate future analyses of our results and applications of TACIT, we released predicted enhancer activity of >400,000 candidate enhancers in each of 222 mammals and their associations with the phenotypes we investigated. CONCLUSION TACIT leverages predicted enhancer activity conservation rather than nucleotide-level conservation to connect genetic sequence differences between species to phenotypes across large numbers of mammals. TACIT can be applied to any phenotype with enhancer activity data available from at least a few species in a relevant tissue or cell type and a whole-genome alignment available across dozens of species with substantial phenotypic variation. Although we developed TACIT for transcriptional enhancers, it could also be applied to genomic regions involved in other components of gene regulation, such as promoters and splicing enhancers and silencers. As the number of sequenced genomes grows, machine learning approaches such as TACIT have the potential to help make sense of how conservation of, or changes in, subtle genome patterns can help explain phenotype evolution. Tissue-Aware Conservation Inference Toolkit (TACIT) associates genetic differences between species with phenotypes. TACIT works by generating open chromatin data from a few species in a tissue related to a phenotype, using the sequences underlying open and closed chromatin regions to train a machine learning model for predicting tissue-specific open chromatin and associating open chromatin predictions across dozens of mammals with the phenotype. [Species silhouettes are from PhyloPic] 
    more » « less
  3. Abstract Changes in gene expression are important for responses to abiotic stress. Transcriptome profiling of heat- or cold-stressed maize genotypes identifies many changes in transcript abundance. We used comparisons of expression responses in multiple genotypes to identify alleles with variable responses to heat or cold stress and to distinguish examples of cis- or trans-regulatory variation for stress-responsive expression changes. We used motifs enriched near the transcription start sites (TSSs) for thermal stress-responsive genes to develop predictive models of gene expression responses. Prediction accuracies can be improved by focusing only on motifs within unmethylated regions near the TSS and vary for genes with different dynamic responses to stress. Models trained on expression responses in a single genotype and promoter sequences provided lower performance when applied to other genotypes but this could be improved by using models trained on data from all three genotypes tested. The analysis of genes with cis-regulatory variation provides evidence for structural variants that result in presence/absence of transcription factor binding sites in creating variable responses. This study provides insights into cis-regulatory motifs for heat- and cold-responsive gene expression and defines a framework for developing models to predict expression responses across multiple genotypes. 
    more » « less
  4. null (Ed.)
    Like animals, plants have internal biological clocks that allow them to adapt to daily and yearly changes, such as day-night cycles or seasons turning. Unlike animals, however, plants cannot move when their environment becomes different, so they need to be able to weather these changes by adjusting which genes they switch on and off. To do this, plants keep track of how long days are using external cues such as light or temperature. One of the effects of climate change is that these cues become less reliable, making it harder for plants to adapt to their environment and survive. This is a potential problem for crop species, like Brassica rapa . This plant has many edible forms, including Chinese cabbage, oilseed, pak choi, and turnip. It is also a close relative of the well-studied model plant, Arabidopsis . Since evolving away from Arabidopsis , the genome of B. rapa tripled, meaning it has one, two, or three copies of each gene. This has allowed the extra gene copies to mutate and adapt to different purposes. The question is, what impact has this genome expansion had on the plant's biological clock? One way to find out is to perform RNA-sequencing experiments, which record the genes a plant is using at any one time. Here, Greenham, Sartor et al. report the results of a series of RNA-sequencing experiments performed every two hours across two days. Plants were first exposed to light-dark or temperature cycles and then samples were taken when the plants were in constant light and temperature. This revealed which genes B. rapa turned on and off in response to signals from the internal biological clock. It turns out that the biological clock of B. rapa controls close to three quarters of its genes. These genes showed distinct phases, increasing or decreasing in regular patterns. But the different copies of duplicated and triplicated genes did not necessarily all behave in the same way. Many of the copies had different rhythms, and some increased and decreased in patterns totally opposite to their counterparts. Not only did the daily patterns differ, but responses to stressors like drought were also altered. Comparing these patterns to the patterns seen in Arabidopsis revealed that often, one B. rapa gene behaved just like its Arabidopsis equivalent, while its copies had evolved new behaviors. The different behaviors of the copies of each gene in B. rapa relative to its biological clock allow this plant to grow in different environments with varying temperatures and day lengths. Understanding how these adaptations work opens new avenues of research into how plants detect and respond to environmental signals. This could help to guide future work into targeting genes to improve crop growth and stress resilience. 
    more » « less
  5. Abstract Background

    Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at high resolution, within fewer than hundreds of base pairs. Most genetic mapping studies have generally lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we use genomic annotations to accurately predict nucleotide conservation across angiosperms, as a proxy for fitness effect of mutations.

    Results

    Using only sequence analysis, we annotate nonsynonymous mutations in 25,824 maize gene models, with information from bioinformatics and deep learning. Our predictions are validated by experimental information: within-species conservation, chromatin accessibility, and gene expression. According to gene ontology and pathway enrichment analyses, predicted nucleotide conservation points to genes in central carbon metabolism. Importantly, it improves genomic prediction for fitness-related traits such as grain yield, in elite maize panels, by stringent prioritization of fewer than 1% of single-site variants.

    Conclusions

    Our results suggest that predicting nucleotide conservation across angiosperms may effectively prioritize sites most likely to impact fitness-related traits in crops, without being limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Our approach—Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC)—could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing. The trained PICNC models and predicted nucleotide conservation at protein-coding SNPs in maize are publicly available in CyVerse (https://doi.org/10.25739/hybz-2957).

     
    more » « less