Title: Robust predictions of specialized metabolism genes through machine learning
Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
Award ID(s):
1811055
PAR ID:
10083806
Publisher / Repository:
Proceedings of the National Academy of Sciences
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
116
Issue:
6
ISSN:
0027-8424
Page Range / eLocation ID:
p. 2344-2353
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Marshall-Colon, Amy (Ed.)
    Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. Plant models like Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, where knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato with fewer experimentally annotated genes. The first tomato SM/GM prediction model using only tomato data performs well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but from manually curating 88 SM/GM genes, we found many mis-predicted entries were likely mis-annotated. When the SM/GM prediction models built with A. thaliana data were used to filter out genes where the A. thaliana-based model predictions disagreed with tomato annotations, the new tomato model trained with filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example for transfer learning in genomics where knowledge can be transferred from an information-rich species to an information-poor one.
  2. Wallqvist, Anders (Ed.)
    Bacterial pathogens adapt their metabolism to the plant environment to successfully colonize their hosts. In our efforts to uncover the metabolic pathways that contribute to the colonization of Arabidopsis thaliana leaves by Pseudomonas syringae pv. tomato DC3000 (Pst DC3000), we created iPst19, an ensemble of 100 genome-scale network reconstructions of Pst DC3000 metabolism. We developed a novel approach for gene essentiality screens, leveraging the predictive power of iPst19 to identify core and ancillary condition-specific essential genes. Constraining the metabolic flux of iPst19 with Pst DC3000 gene expression data obtained from naïve-infected or pre-immunized-infected plants revealed changes in bacterial metabolism imposed by plant immunity. Machine learning analysis revealed that, among other amino acids, branched-chain amino acid (BCAA) metabolism significantly contributed to the overall metabolic status of each gene-expression-contextualized iPst19 simulation. These predictions were tested and confirmed experimentally. Pst DC3000 growth and gene expression analysis showed that BCAAs suppress virulence gene expression in vitro without affecting bacterial growth. In planta, however, an excess of BCAAs suppresses the expression of virulence genes at the early stages of infection and significantly impairs the colonization of Arabidopsis leaves. Our findings suggest that BCAA catabolism is necessary to express virulence and colonize the host. Overall, this study provides valuable insights into how plant immunity impacts Pst DC3000 metabolism, and how bacterial metabolism impacts the expression of virulence.
  3. FatPlants, an open-access, web-based database, consolidates data, annotations, analysis results, and visualizations of lipid-related genes, proteins, and metabolic pathways in plants. Serving as a minable resource, FatPlants offers a user-friendly interface for facilitating studies into the regulation of plant lipid metabolism and supporting breeding efforts aimed at increasing crop oil content. This web resource, developed using data derived from our own research, curated from public resources, and gleaned from academic literature, comprises information on known fatty-acid-related proteins, genes, and pathways in multiple plants, with an emphasis on Glycine max, Arabidopsis thaliana, and Camelina sativa. Furthermore, the platform includes machine-learning based methods and navigation tools designed to aid in characterizing metabolic pathways and protein interactions. Comprehensive gene and protein information cards, a Basic Local Alignment Search Tool search function, similar structure search capacities from AlphaFold, and ChatGPT-based query for protein information are additional features. Database URL: https://www.fatplants.net/
  4. Identifying genes that interact to confer a biological function to an organism is one of the main goals of functional genomics. High‐throughput technologies for assessment and quantification of genome‐wide gene expression patterns have enabled systems‐level analyses to infer pathways or networks of genes involved in different functions under many different conditions. Here, we leveraged the publicly available, information‐rich RNA‐Seq datasets of the model plant Arabidopsis thaliana to construct a gene co‐expression network, which was partitioned into clusters or modules that harbor genes correlated by expression. Gene ontology and pathway enrichment analyses were performed to assess functional terms and pathways that were enriched within the different gene modules. By interrogating the co‐expression network for genes in different modules that associate with a gene of interest, diverse functional roles of the gene can be deciphered. By mapping genes differentially expressed under a certain condition in Arabidopsis onto the co‐expression network, we demonstrate the ability of the network to uncover novel genes that are likely transcriptionally active but prone to be missed by standard statistical approaches due to their falling outside of the confidence zone of detection. To our knowledge, this is the first A. thaliana co‐expression network constructed using the entire mRNA‐Seq datasets (>20,000) available at the NCBI SRA database. The developed network can serve as a useful resource for the Arabidopsis research community to interrogate specific genes of interest within the network, retrieve the respective interactomes, decipher gene modules that are transcriptionally altered under a certain condition or stage, and gain understanding of gene functions.
  5. This study investigated the generalizability of Arabidopsis thaliana immune responses across diverse pathogens, including Botrytis cinerea, Sclerotinia sclerotiorum, and Pseudomonas syringae, using a data-driven, machine learning approach. Machine learning models were trained to predict disease development from early transcriptional responses. Feature selection techniques based on network science and topology were used to train models employing only a fraction of the transcriptome. Machine learning models trained on one pathosystem were then validated by predicting disease development in new pathosystems. The identified feature selection gene sets were enriched for pathways related to biotic, abiotic, and stress responses, though the specific genes involved differed between feature sets. This suggests common immune responses to diverse pathogens that operate via different gene sets. The study demonstrates that machine learning can uncover both established and novel components of the plant's immune response, offering insights into disease resistance mechanisms. These predictive models highlight the potential to advance our understanding of multigenic outcomes in plant immunity and can be further refined for applications in disease prediction.