

Title: Using graph neural networks for site-of-metabolism prediction and its applications to ranking promiscuous enzymatic products
Abstract

Motivation

Site-of-metabolism (SOM) prediction has traditionally been used to identify site-specific metabolic activity within a compound in order to alter its interaction with a metabolizing enzyme, but it is also essential for analyzing the promiscuity of enzymes on substrates. The successful prediction of SOMs and the relevant promiscuous products has a wide range of applications, including the creation of extended metabolic models (EMMs) that account for enzyme promiscuity and the construction of novel heterologous synthesis pathways. There is therefore a need to develop generalized methods that can predict molecular SOMs for a wide range of metabolizing enzymes.

Results

This article develops a Graph Neural Network (GNN) model that classifies whether an atom (or a bond) is an SOM. Our model, GNN-SOM, is trained on enzymatic interactions, available in the KEGG database, that span all enzyme commission numbers. We demonstrate that GNN-SOM consistently outperforms baseline machine learning models, whether trained on all enzymes, on Cytochrome P450 (CYP) enzymes, or on non-CYP enzymes. We showcase the utility of GNN-SOM in prioritizing the products predicted to arise from enzyme promiscuity for two biological applications: the construction of EMMs and the construction of synthesis pathways.
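To make the idea of atom-level SOM classification concrete, the following is a minimal sketch of one message-passing round followed by a per-atom sigmoid score. It is not the GNN-SOM architecture: the mean-aggregation layer, the two-element one-hot features, the untrained random weights, and the names `mean_aggregate` and `atom_som_scores` are all assumptions chosen for illustration.

```python
import numpy as np

def mean_aggregate(adj, feats):
    """One message-passing round: average each atom's features with its neighbours'."""
    adj_self = adj + np.eye(adj.shape[0])        # self-loops keep each atom's own features
    degree = adj_self.sum(axis=1, keepdims=True)
    return (adj_self @ feats) / degree

def atom_som_scores(adj, feats, w_hidden, w_out):
    """Return one score in (0, 1) per atom (hypothetical, untrained model)."""
    hidden = np.tanh(mean_aggregate(adj, feats) @ w_hidden)
    logits = hidden @ w_out
    return 1.0 / (1.0 + np.exp(-logits))         # sigmoid: per-atom binary classification

# Heavy atoms of ethanol (C-C-O) as a toy molecular graph.
adj = np.array([[0, 1, 0],
                [1, 0, 1],
                [0, 1, 0]], dtype=float)
feats = np.array([[1, 0],                        # one-hot element type: [C, O]
                  [1, 0],
                  [0, 1]], dtype=float)

rng = np.random.default_rng(0)                   # random weights stand in for trained ones
w_hidden = rng.normal(size=(2, 4))
w_out = rng.normal(size=4)

scores = atom_som_scores(adj, feats, w_hidden, w_out)
print(scores)                                    # one value per atom
```

In a trained model, atoms whose score exceeds a chosen threshold would be labelled as SOMs; here the scores are meaningless beyond showing the shape of the computation.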

Availability and implementation

A Python implementation of the trained SOM predictor model can be found at https://github.com/HassounLab/GNN-SOM.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
NSF-PAR ID:
10400645
Publisher / Repository:
Oxford University Press
Journal Name:
Bioinformatics
Volume:
39
Issue:
3
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Motivation

    Despite experimental and curation efforts, the extent of enzyme promiscuity on substrates continues to be largely unexplored and underdocumented. Providing computational tools for the exploration of the enzyme–substrate interaction space can expedite experimentation and benefit applications such as constructing synthesis pathways for novel biomolecules, identifying products of metabolism on ingested compounds, and elucidating xenobiotic metabolism. Recommender systems (RS), which are currently unexplored for the enzyme–substrate interaction prediction problem, can be utilized to provide enzyme recommendations for substrates, and vice versa. The performance of Collaborative-Filtering (CF) RSs, however, hinges on the quality of the embedding vectors of users and items (enzymes and substrates in our case). Importantly, enhancing CF embeddings with heterogeneous auxiliary data, especially relational data (e.g. hierarchical, pairwise or groupings), remains a challenge.

    Results

    We propose an innovative general RS framework, termed Boost-RS, that enhances RS performance by ‘boosting’ embedding vectors through auxiliary data. Specifically, Boost-RS is trained and dynamically tuned on multiple relevant auxiliary learning tasks. Boost-RS utilizes contrastive learning tasks to exploit relational data. To show the efficacy of Boost-RS for the enzyme–substrate interaction prediction problem, we apply the Boost-RS framework to several baseline CF models. We show that each of our auxiliary tasks boosts learning of the embedding vectors, and that contrastive learning using Boost-RS outperforms attribute concatenation and multi-label learning. We also show that Boost-RS outperforms similarity-based models. Ablation studies and visualization of learned representations highlight the importance of using contrastive learning on some of the auxiliary data in boosting the embedding vectors.

    Availability and implementation

    A Python implementation for Boost-RS is provided at https://github.com/HassounLab/Boost-RS. The enzyme-substrate interaction data is available from the KEGG database (https://www.genome.jp/kegg/).

     
  2. Martelli, Pier Luigi (Ed.)
    Abstract

    Motivation

    As experimental efforts are costly and time-consuming, computational characterization of enzyme capabilities is an attractive alternative. We present and evaluate several machine-learning models to predict which of 983 distinct enzymes, as defined via the Enzyme Commission (EC) numbers, are likely to interact with a given query molecule. Our data consists of enzyme-substrate interactions from the BRENDA database. Some interactions are attributed to natural selection and involve the enzyme’s natural substrates. The majority of the interactions, however, involve non-natural substrates, thus reflecting promiscuous enzymatic activities.

    Results

    We frame this ‘enzyme promiscuity prediction’ problem as a multi-label classification task. We maximally utilize inhibitor and unlabeled data to train prediction models that can take advantage of known hierarchical relationships between enzyme classes. We report that a hierarchical multi-label neural network, EPP-HMCNF, is the best model for solving this problem, outperforming k-nearest neighbors similarity-based and other machine-learning models. We show that inhibitor information during training consistently improves predictive power, particularly for EPP-HMCNF. We also show that all promiscuity prediction models perform worse under a realistic data split than under a random data split, and worse on non-natural substrates than on natural substrates.

    Availability and implementation

    We provide Python code and data for EPP-HMCNF and other models in a repository termed EPP (Enzyme Promiscuity Prediction) at https://github.com/hassounlab/EPP.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
  3. BACKGROUND

    Diverse organisms, from archaea and bacteria to plants and humans, use receptor systems to recognize both pathogens and dangerous self-derived or environmentally derived stimuli. These intricate, well-coordinated immune systems, composed of innate and adaptive components, ensure host survival. In the late 20th century, researchers identified the Toll/interleukin-1/resistance gene (TIR) domain as an evolutionarily conserved component of animal and plant innate immune systems. Today, TIR-domain proteins are known to be broadly distributed across the tree of life. The TIR domain was first recognized as an adaptor for the assembly of macromolecular signaling complexes in mammalian innate immune pathways. Work on axon degeneration in animals, as well as on plant, archaeal, and bacterial immune systems, has uncovered additional enzymatic activities for TIR domains.

    ADVANCES

    Mammalian axons initiate a self-destruct program upon injury and during disease that is mediated by the sterile alpha and TIR motif containing 1 (SARM1) protein. The SARM1 TIR domain enzymatically consumes the essential metabolic cofactor nicotinamide adenine dinucleotide (NAD+) to promote axonal death. Identification of the SARM1 NAD+-consuming enzyme (NADase) revealed that TIR domains can function as enzymes. Given the evolutionary conservation of TIR domains, studies investigated whether the SARM1 TIR NADase was also conserved. Indeed, bacteria, archaea, and plant TIR domains possess NADase activity. In prokaryotes, TIR NADase activity is found in an ancient antiphage immune system. In plants, identification of TIR NADase activity and linkage of TIR enzymatic products to downstream signaling components addressed the question of how nucleotide-binding, leucine-rich repeat (NLR) receptors trigger hypersensitive cell death during an immune response. Studies in plants show that their TIR domains can cleave nucleic acids and possess 2′,3′ cyclic adenosine monophosphate (2′,3′-cAMP) and 2′,3′ cyclic guanosine monophosphate (2′,3′-cGMP) synthetase activity that aids cell death programs in plant innate immunity. Thus, TIR domains constitute an ancient family of enzymes that are activated in immune and cell death pathways.

    OUTLOOK

    The discovery of TIR-domain enzyme activities carries implications for innate immunity and neurodegeneration. The identification of the SARM1 NADase defined a drug target for a wide number of neurodegenerative diseases that is being exploited in both preclinical and clinical studies. Hyperactive mutations in the SARM1 NADase have been discovered in amyotrophic lateral sclerosis (ALS) patients. Future work will seek to clarify the contribution of the SARM1 axon degeneration pathway to ALS pathogenesis. NAD+ biology influences cellular processes from metabolism to DNA repair to aging. How TIR enzymes influence the NAD+ metabolome and its associated pathways in bacteria, archaea, plants, and animals will be an exciting area for upcoming investigation. The discovery of the diversity of TIR enzymatic products is revealing signaling pathways across kingdoms. Discovery of TIR enzymatic function in plants and animals may yet inspire studies of enzymatic functions for Toll-like receptors in animals. We anticipate that cross-kingdom studies of TIR-domain function will guide interventions that will span the tree of life, from treating human neurodegenerative disorders and bacterial infections to preventing plant diseases.

    Figure: Conserved TIR-domain enzymatic activity. TIR-domain proteins from prokaryotes and eukaryotes cleave NAD+ into nicotinamide (Nam), ADP-ribose (ADPR), cyclic ADP-ribose (cADPR), isomers of cyclic ADP-ribose (2′ or 3′cADPR), and related molecules [e.g., phosphoribosyl adenosine monophosphate (pRib-AMP)]. Plant TIR domains also possess a nuclease activity, can degrade DNA and RNA, and can function as a 2′,3′-cAMP or 2′,3′-cGMP synthetase. TIR enzymatic activity drives cell death and immune pathways across kingdoms. TIR activity can kill cells directly through NAD+ depletion or indirectly using enzymatic products as signal molecules. The representative TIR domain structure shown here is Protein Data Bank ID 6O0Q. EDS1, enhanced disease susceptibility 1; ThsA, Thoeris A.
  4. Abstract

    Current pathway synthesis tools identify possible pathways that can be added to a host to produce the desired target molecule through the exploration of abstract metabolic and reaction network space. However, not many of these tools explore gene‐level information required to physically realize the identified synthesis pathways, and none explore enzyme‐host compatibility. Developing tools that address this disconnect between abstract reactions/metabolic design space and physical genetic sequence design space will enable expedited experimental efforts that avoid exploring unprofitable synthesis pathways. This work describes a workflow, termed Probabilistic Pathway Assembly with Solubility Confidence Scores (ProPASS), which links synthesis pathway construction with the exploration of the physical design space as imposed by the availability of enzymes with predicted characterized activities within the host. Predicted protein solubility propensity scores are used as a confidence level to quantify the compatibility of each pathway enzyme with the host Escherichia coli (E. coli). This study also presents a database, termed Protein Solubility Database (ProSol DB), which provides solubility confidence scores in E. coli for 240,016 characterized enzymes obtained from UniProtKB/Swiss‐Prot. The utility of ProPASS is demonstrated by generating genetic implementations of heterologous synthesis pathways in E. coli that target several commercially useful biomolecules.

     
  5. Abstract

    Motivation

    Accurate modeling of protein–protein interaction interfaces is essential for high-quality protein complex structure prediction. Existing approaches for estimating the quality of a predicted protein complex structural model utilize only the physicochemical properties or energetic contributions of the interacting atoms, ignoring evolutionary information and inter-atomic multimeric geometries, including interaction distances and orientations.

    Results

    Here, we present PIQLE, a deep graph learning method for protein–protein interface quality estimation. PIQLE leverages multimeric interaction geometries and evolutionary information, along with sequence- and structure-derived features, to estimate the quality of individual interactions between the interfacial residues using a multi-head graph attention network, and then probabilistically combines the estimated qualities to score the overall interface. Experimental results show that PIQLE consistently outperforms existing state-of-the-art methods including DProQA, TRScore, GNN-DOVE and DOVE on multiple independent test datasets across a wide range of evaluation metrics. Our ablation study and comparison with the self-assessment module of AlphaFold-Multimer repurposed for protein complex scoring reveal that the performance gains are connected to the effectiveness of the multi-head graph attention network in leveraging multimeric interaction geometries and evolutionary information along with other sequence- and structure-derived features adopted in PIQLE.

    Availability and implementation

    An open-source software implementation of PIQLE is freely available at https://github.com/Bhattacharya-Lab/PIQLE.

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

     