Robust predictions of specialized metabolism genes through machine learning

Moore, Bethany M. (ORCID:0000000221047292); Wang, Peipei; Fan, Pengxiang (ORCID:0000000245603783); Leong, Bryan (ORCID:0000000340421160); Schenck, Craig A.; Lloyd, John P.; Lehti-Shiu, Melissa D.; Last, Robert L. (ORCID:0000000169749587); Pichersky, Eran; Shiu, Shin-Han (ORCID:000000016470235X)

doi:10.1073/pnas.1817074116

Plant specialized metabolism (SM) enzymes produce lineage-specific metabolites with important ecological, evolutionary, and biotechnological implications. Using Arabidopsis thaliana as a model, we identified distinguishing characteristics of SM and GM (general metabolism, traditionally referred to as primary metabolism) genes through a detailed study of features including duplication pattern, sequence conservation, transcription, protein domain content, and gene network properties. Analysis of multiple sets of benchmark genes revealed that SM genes tend to be tandemly duplicated, coexpressed with their paralogs, narrowly expressed at lower levels, less conserved, and less well connected in gene networks relative to GM genes. Although the values of each of these features significantly differed between SM and GM genes, any single feature was ineffective at predicting SM from GM genes. Using machine learning methods to integrate all features, a prediction model was established with a true positive rate of 87% and a true negative rate of 71%. In addition, 86% of known SM genes not used to create the machine learning model were predicted. We also demonstrated that the model could be further improved when we distinguished between SM, GM, and junction genes responsible for reactions shared by SM and GM pathways, indicating that topological considerations may further improve the SM prediction model. Application of the prediction model led to the identification of 1,220 A. thaliana genes with previously unknown functions, each assigned a confidence measure called an SM score, providing a global estimate of SM gene content in a plant genome.
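As a rough illustration of the two rates quoted in the abstract, the sketch below computes a true positive rate and a true negative rate from binary labels. The function name `tpr_tnr` and the toy label lists are hypothetical and chosen only for illustration; they are not taken from the paper's data or code, and the paper's own evaluation procedure may differ.

```python
def tpr_tnr(y_true, y_pred):
    """Return (TPR, TNR) for binary labels, assuming 1 = SM gene and 0 = GM gene."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # SM predicted as SM
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # SM predicted as GM
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # GM predicted as GM
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # GM predicted as SM
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative label")
    return tp / (tp + fn), tn / (tn + fp)


# Toy labels (hypothetical): four SM genes followed by four GM genes.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]

tpr, tnr = tpr_tnr(y_true, y_pred)
print(f"TPR = {tpr:.2f}, TNR = {tnr:.2f}")
```

Reporting both rates, rather than a single accuracy figure, shows separately how well each class is recovered.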

Award ID(s): 1811055

PAR ID: 10083806

Author(s) / Creator(s): Moore, Bethany M.; Wang, Peipei; Fan, Pengxiang; Leong, Bryan; Schenck, Craig A.; Lloyd, John P.; Lehti-Shiu, Melissa D.; Last, Robert L.; Pichersky, Eran; Shiu, Shin-Han

Publisher / Repository: Proceedings of the National Academy of Sciences

Date Published: 2019-02-05

Journal Name: Proceedings of the National Academy of Sciences

Volume: 116

Issue: 6

ISSN: 0027-8424

Page Range / eLocation ID: p. 2344-2353

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1073/pnas.1817074116
