skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Robust predictions of specialized metabolism genes through machine learning
Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.  more » « less
Award ID(s):
1811055
PAR ID:
10171276
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
116
Issue:
6
ISSN:
0027-8424
Page Range / eLocation ID:
2344 to 2353
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Marshall-Colon, Amy (Ed.)
    Abstract Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. Plant models like Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, where knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato with fewer experimentally annotated genes. The first tomato SM/GM prediction model using only tomato data performs well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but from manually curating 88 SM/GM genes, we found many mis-predicted entries were likely mis-annotated. When the SM/GM prediction models built with A. thaliana data were used to filter out genes where the A. thaliana-based model predictions disagreed with tomato annotations, the new tomato model trained with filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example for transfer learning in genomics where knowledge can be transferred from an information-rich species to an information-poor one. 
    more » « less
  2. Abstract FatPlants, an open-access, web-based database, consolidates data, annotations, analysis results, and visualizations of lipid-related genes, proteins, and metabolic pathways in plants. Serving as a minable resource, FatPlants offers a user-friendly interface for facilitating studies into the regulation of plant lipid metabolism and supporting breeding efforts aimed at increasing crop oil content. This web resource, developed using data derived from our own research, curated from public resources, and gleaned from academic literature, comprises information on known fatty-acid-related proteins, genes, and pathways in multiple plants, with an emphasis on Glycine max, Arabidopsis thaliana, and Camelina sativa. Furthermore, the platform includes machine-learning based methods and navigation tools designed to aid in characterizing metabolic pathways and protein interactions. Comprehensive gene and protein information cards, a Basic Local Alignment Search Tool search function, similar structure search capacities from AphaFold, and ChatGPT-based query for protein information are additional features. Database URL: https://www.fatplants.net/ 
    more » « less
  3. Wallqvist, Anders (Ed.)
    Bacterial pathogens adapt their metabolism to the plant environment to successfully colonize their hosts. In our efforts to uncover the metabolic pathways that contribute to the colonization ofArabidopsis thalianaleaves byPseudomonas syringaepvtomatoDC3000 (PstDC3000), we created iPst19, an ensemble of 100 genome-scale network reconstructions ofPstDC3000 metabolism. We developed a novel approach for gene essentiality screens, leveraging the predictive power of iPst19 to identify core and ancillary condition-specific essential genes. Constraining the metabolic flux of iPst19 withPstDC3000 gene expression data obtained from naïve-infected or pre-immunized-infected plants, revealed changes in bacterial metabolism imposed by plant immunity. Machine learning analysis revealed that among other amino acids, branched-chain amino acids (BCAAs) metabolism significantly contributed to the overall metabolic status of each gene-expression-contextualized iPst19 simulation. These predictions were tested and confirmed experimentally.PstDC3000 growth and gene expression analysis showed that BCAAs suppress virulence gene expressionin vitrowithout affecting bacterial growth.In planta, however, an excess of BCAAs suppress the expression of virulence genes at the early stages of infection and significantly impair the colonization of Arabidopsis leaves. Our findings suggesting that BCAAs catabolism is necessary to express virulence and colonize the host. Overall, this study provides valuable insights into how plant immunity impactsPstDC3000 metabolism, and how bacterial metabolism impacts the expression of virulence. 
    more » « less
  4. This study investigated the generalizability of Arabidopsis thaliana immune responses across diverse pathogens, including Botrytis cinerea, Sclerotinia sclerotiorum, and Pseudomonas syringae, using a data-driven, machine learning approach. Machine learning models were trained to predict disease development from early transcriptional responses. Feature selection techniques based on network science and topology were used to train models employing only a fraction of the transcriptome. Machine learning models trained on one pathosystem where then validated by predicting disease development in new pathosystems. The identified feature selection gene sets were enriched for pathways related to biotic, abiotic, and stress responses, though the specific genes involved differed between feature sets. This suggests common immune responses to diverse pathogens that operate via different gene sets.The study demonstrates that machine learning can uncover both established and novel components of the plant's immune response, offering insights into disease resistance mechanisms. These predictive models highlight the potential to advance our understanding of multigenic outcomes in plant immunity and can be further refined for applications in disease prediction. 
    more » « less
  5. null (Ed.)
    Plant growth, development, and nutritional quality depends upon amino acid homeostasis, especially in seeds. However, our understanding of the underlying genetics influencing amino acid content and composition remains limited, with only a few candidate genes and quantitative trait loci identified to date. Improved knowledge of the genetics and biological processes that determine amino acid levels will enable researchers to use this information for plant breeding and biological discovery. Toward this goal, we used genomic prediction to identify biological processes that are associated with, and therefore potentially influence, free amino acid (FAA) composition in seeds of the model plant Arabidopsis thaliana . Markers were split into categories based on metabolic pathway annotations and fit using a genomic partitioning model to evaluate the influence of each pathway on heritability explained, model fit, and predictive ability. Selected pathways included processes known to influence FAA composition, albeit to an unknown degree, and spanned four categories: amino acid, core, specialized, and protein metabolism. Using this approach, we identified associations for pathways containing known variants for FAA traits, in addition to finding new trait-pathway associations. Markers related to amino acid metabolism, which are directly involved in FAA regulation, improved predictive ability for branched chain amino acids and histidine. The use of genomic partitioning also revealed patterns across biochemical families, in which serine-derived FAAs were associated with protein related annotations and aromatic FAAs were associated with specialized metabolic pathways. Taken together, these findings provide evidence that genomic partitioning is a viable strategy to uncover the relative contributions of biological processes to FAA traits in seeds, offering a promising framework to guide hypothesis testing and narrow the search space for candidate genes. 
    more » « less