Title: Within- and cross-species predictions of plant specialized metabolism genes using transfer learning
Abstract: Plant specialized metabolites mediate interactions between plants and the environment and have significant agronomical/pharmaceutical value. Most genes involved in specialized metabolism (SM) are unknown because of the large number of metabolites and the challenge in differentiating SM genes from general metabolism (GM) genes. Plant models like Arabidopsis thaliana have extensive, experimentally derived annotations, whereas many non-model species do not. Here we employed a machine learning strategy, transfer learning, where knowledge from A. thaliana is transferred to predict gene functions in cultivated tomato with fewer experimentally annotated genes. The first tomato SM/GM prediction model using only tomato data performs well (F-measure = 0.74, compared with 0.5 for random and 1.0 for perfect predictions), but from manually curating 88 SM/GM genes, we found many mis-predicted entries were likely mis-annotated. When the SM/GM prediction models built with A. thaliana data were used to filter out genes where the A. thaliana-based model predictions disagreed with tomato annotations, the new tomato model trained with filtered data improved significantly (F-measure = 0.92). Our study demonstrates that SM/GM genes can be better predicted by leveraging cross-species information. Additionally, our findings provide an example for transfer learning in genomics where knowledge can be transferred from an information-rich species to an information-poor one.
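The filtering step described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the nearest-centroid classifier is a toy stand-in (the abstract does not name the learning algorithm), the two-value feature vectors are invented, and the function names `fit_centroids`, `predict`, `filter_by_source_model`, and `f_measure` are hypothetical. The F-measure is computed in its standard form, the harmonic mean of precision and recall.

```python
from statistics import mean


def f_measure(y_true, y_pred, positive="SM"):
    """Harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_centroids(X, y):
    """Toy classifier (assumption): store the mean feature vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, X):
    """Assign each gene to the class with the nearest centroid."""
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return [min(centroids, key=lambda lab: dist2(x, centroids[lab])) for x in X]


def filter_by_source_model(source_model, X_target, y_target):
    """Keep only target genes whose annotation agrees with the source model."""
    preds = predict(source_model, X_target)
    keep = [i for i, (p, lab) in enumerate(zip(preds, y_target)) if p == lab]
    return [X_target[i] for i in keep], [y_target[i] for i in keep]


# Invented feature vectors for the information-rich species (A. thaliana).
X_source = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y_source = ["SM", "SM", "GM", "GM"]

# Invented tomato data; the third gene looks like SM but is annotated GM.
X_tomato = [[0.9, 0.2], [0.1, 0.8], [0.85, 0.1], [0.2, 0.9]]
y_tomato = ["SM", "GM", "GM", "GM"]

source_model = fit_centroids(X_source, y_source)
X_filtered, y_filtered = filter_by_source_model(source_model, X_tomato, y_tomato)

# The new tomato model is trained only on genes that survived the filter.
tomato_model = fit_centroids(X_filtered, y_filtered)
```

In this toy data the source model flags the third tomato gene as a likely mis-annotation, so the tomato model is retrained on the remaining three genes. A real analysis would use held-out data to compare the F-measure before and after filtering.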
Award ID(s):
1655386 1546617
PAR ID:
10222402
Author(s) / Creator(s):
Editor(s):
Marshall-Colon, Amy
Date Published:
Journal Name:
in silico Plants
Volume:
2
Issue:
1
ISSN:
2517-5025
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
  2. Summary: Plant metabolites from diverse pathways are important for plant survival, human nutrition and medicine. The pathway memberships of most plant enzyme genes are unknown. While co‐expression is useful for assigning genes to pathways, expression correlation may exist only under specific spatiotemporal and conditional contexts. Utilising >600 tomato (Solanum lycopersicum) expression data combinations, three strategies for predicting memberships in 85 pathways were explored. Optimal predictions for different pathways require distinct data combinations indicative of pathway functions. Naive prediction (i.e. identifying pathways with the most similarly expressed genes) is error prone. In 52 pathways, unsupervised learning performed better than supervised approaches, possibly due to limited training data availability. Using gene‐to‐pathway expression similarities led to prediction models that outperformed those based simply on expression levels. Using 36 experimentally validated genes, the pathway‐best model prediction accuracy is 58.3%, significantly better than that for predicting annotated genes without experimental evidence (37.0%) or random guess (1.2%), demonstrating the importance of data quality. Our study highlights the need to extensively explore expression‐based features and prediction strategies to maximise the accuracy of metabolic pathway membership assignment. The prediction framework outlined here can be applied to other species and serves as a baseline model for future comparisons.
  3. Plants collectively synthesize a huge repertoire of metabolites. General metabolites, also referred to as primary metabolites, are conserved across the plant kingdom and are required for processes essential to growth and development. These include amino acids, sugars, lipids, and organic acids. In contrast, specialized metabolites, historically termed secondary metabolites, are structurally diverse, exhibit lineage-specific distribution, and provide a selective advantage to host species by facilitating reproduction and environmental adaptation. Due to their potent bioactivities, plant specialized metabolites attract considerable attention for use as flavorings, fragrances, pharmaceuticals, and bio-pesticides. The Solanaceae (nightshade family) consists of approximately 2700 species and includes crops of significant economic, cultural, and scientific importance, among them potato, tomato, pepper, eggplant, tobacco, and petunia. The Solanaceae has emerged as a model family for studying the biochemical evolution of plant specialized metabolism, and multiple examples exist of lineage-specific metabolites that influence the senses and physiology of commensal and harmful organisms, including humans. These include alcohols, phenylpropanoids, and carotenoids that contribute to fruit aroma and color in tomato (fruity), glandular trichome-derived terpenoids and acylsugars that contribute to plant defense (stinky and sticky, respectively), capsaicinoids in chili peppers that influence seed dispersal (spicy), and steroidal glycoalkaloids (bitter) from Solanum, nicotine (addictive) from tobacco, as well as tropane alkaloids (deadly) from Deadly Nightshade that deter herbivory. Advances in genomics and metabolomics, coupled with the adoption of comparative phylogenetic approaches, have resulted in deeper knowledge of the biosynthesis and evolution of these metabolites. This review highlights recent progress in this area and outlines the opportunities for, and challenges of, developing a more comprehensive understanding of Solanaceae metabolism.
  4. Plants produce diverse metabolites to cope with the challenges presented by complex and ever-changing environments. These challenges drive the diversification of specialized metabolites within and between plant species. However, we are just beginning to understand how frequently new alleles controlling specialized metabolite diversity arise and how the geographic distribution of these alleles may be structured by ecological and demographic pressures. Here we measure the variation in specialized metabolites across a population of 797 natural Arabidopsis thaliana accessions. We show that geography, environmental parameters, demography, and different genetic processes combine to influence the specific chemotypes and their distribution. This analysis showed that causal loci in specialized metabolism contain frequent, independently generated alleles with patterns suggesting potential within-species convergence. This provides a new perspective on the complexity of the selective forces and mechanisms that shape the generation and distribution of allelic variation that may influence local adaptation.
  5. Plant growth, development, and nutritional quality depend upon amino acid homeostasis, especially in seeds. However, our understanding of the underlying genetics influencing amino acid content and composition remains limited, with only a few candidate genes and quantitative trait loci identified to date. Improved knowledge of the genetics and biological processes that determine amino acid levels will enable researchers to use this information for plant breeding and biological discovery. Toward this goal, we used genomic prediction to identify biological processes that are associated with, and therefore potentially influence, free amino acid (FAA) composition in seeds of the model plant Arabidopsis thaliana. Markers were split into categories based on metabolic pathway annotations and fit using a genomic partitioning model to evaluate the influence of each pathway on heritability explained, model fit, and predictive ability. Selected pathways included processes known to influence FAA composition, albeit to an unknown degree, and spanned four categories: amino acid, core, specialized, and protein metabolism. Using this approach, we identified associations for pathways containing known variants for FAA traits, in addition to finding new trait-pathway associations. Markers related to amino acid metabolism, which are directly involved in FAA regulation, improved predictive ability for branched-chain amino acids and histidine. The use of genomic partitioning also revealed patterns across biochemical families, in which serine-derived FAAs were associated with protein-related annotations and aromatic FAAs were associated with specialized metabolic pathways. Taken together, these findings provide evidence that genomic partitioning is a viable strategy to uncover the relative contributions of biological processes to FAA traits in seeds, offering a promising framework to guide hypothesis testing and narrow the search space for candidate genes.