Title: Calculated volatilities for all metabolites and their related genomes in the KEGG database
Volatility describes the tendency of a compound to partition into the gas phase, and volatile metabolites facilitate unique biological interactions that influence Earth's atmospheric physics and chemistry. Estimating which metabolites may be volatile is difficult, especially for those without measured vapor pressures. Volcalc is a newly developed vapor pressure estimation tool that uses the SIMPOL.1 method, allowing users to rapidly identify plausible volatile metabolites within the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. Here, we estimate the volatilities of all KEGG metabolites and associate them with KEGG reactions, enzymes, orthologs (KOs) and whole genomes within the KEGG database. This information may be used to identify which genes or species may be linked to particular forms of volatile metabolism, for the purpose of hypothesis generation and integration into additional bioinformatics pipelines. This data is listed as a complement to the publication "Automating methods for estimating metabolite volatility". The column "Paper" indicates whether the listed species is one of the subset analyzed within the data for Figure 3. For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
Award ID(s):
2045332
PAR ID:
10542217
Author(s) / Creator(s):
Publisher / Repository:
University of Arizona Research Data Repository
Date Published:
Subject(s) / Keyword(s):
Bioinformatic methods development Proteomics and metabolomics
Format(s):
Medium: X Size: 337736220 Bytes
Size(s):
337736220 Bytes
Right(s):
Creative Commons Attribution 4.0 International
Sponsoring Org:
National Science Foundation
More Like this
  1. The volatility of metabolites can influence their biological roles and inform optimal methods for their detection. Yet, volatility information is not readily available for the large number of described metabolites, limiting the exploration of volatility as a fundamental trait of metabolites. Here, we adapted methods to estimate vapor pressure from the functional group composition of individual molecules (SIMPOL.1) to predict the gas-phase partitioning of compounds in different environments. We implemented these methods in a new open pipeline called volcalc that uses chemoinformatic tools to automate these volatility estimates for all metabolites in an extensive and continuously updated pathway database: the Kyoto Encyclopedia of Genes and Genomes (KEGG) that connects metabolites, organisms, and reactions. We first benchmark the automated pipeline against a manually curated data set and show that the same category of volatility (e.g., nonvolatile, low, moderate, high) is predicted for 93% of compounds. We then demonstrate how volcalc might be used to generate and test hypotheses about the role of volatility in biological systems and organisms. Specifically, we estimate that 3.4 and 26.6% of compounds in KEGG have high volatility depending on the environment (soil vs. clean atmosphere, respectively) and that a core set of volatiles is shared among all domains of life (30%) with the largest proportion of kingdom-specific volatiles identified in bacteria. With volcalc, we lay a foundation for uncovering the role of the volatilome using an approach that is easily integrated with other bioinformatic pipelines and can be continually refined to consider additional dimensions to volatility. The volcalc package is an accessible tool to help design and test hypotheses on volatile metabolites and their unique roles in biological systems.
  2. Data and video files of the behavioral responses of individuals of canopy ant species to volatile odors of Azteca trigona ants or to control ambient air, collected at Barro Colorado Island in Panama. Two videos provide general examples of the experimental arena and protocol, and of the difference between responses to control air and A. trigona headspace air. The single .csv file provides the specific behavioral responses of the individual ants observed and recorded during the behavioral trials. For inquiries regarding the contents of this dataset, please contact the Corresponding Author listed in the README.txt file. Administrative inquiries (e.g., removal requests, trouble downloading, etc.) can be directed to data-management@arizona.edu
  3. The Plant Metabolic Network (PMN) is a free online database of plant metabolism available at https://plantcyc.org. The latest release, PMN 16, provides metabolic databases representing >1200 metabolic pathways, 1.3 million enzymes, >8000 metabolites, >10 000 reactions and >15 000 citations for 155 plant and green algal genomes, as well as a pan-plant reference database called PlantCyc. This release contains 29 additional genomes compared with PMN 15, including species listed by the African Orphan Crop Consortium and nonflowering plant species. Furthermore, 52 new enzymes with experimentally supported function information have been included in this release. The single-species databases contain a combination of experimental information from the literature and computationally predicted information obtained through PMN’s database generation pipeline for a single species, while PlantCyc contains only experimental information but for any species within Viridiplantae. PMN is a comprehensive resource for querying, visualizing, analyzing and interpreting omics data with metabolic knowledge. It also serves as a useful and interactive tool for teaching plant metabolism.
  4. Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. But many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address these shortcomings, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old KEGG-based metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, which was provided by XGBoost binary classification models for 11 KEGG-defined pathway categories.
  5. The reconstruction of complete microbial metabolic pathways using ’omics data from environmental samples remains challenging. Computational pipelines for pathway reconstruction that utilize machine learning methods to predict the presence or absence of KEGG modules in incomplete genomes are lacking. Here, we present MetaPathPredict, a software tool that incorporates machine learning models to predict the presence of complete KEGG modules within bacterial genomic datasets. Using gene annotation data and information from the KEGG module database, MetaPathPredict employs deep learning models to predict the presence of KEGG modules in a genome. MetaPathPredict can be used as a command line tool or as a Python module, and both options are designed to be run locally or on a compute cluster. Benchmarks show that MetaPathPredict makes robust predictions of KEGG module presence within highly incomplete genomes.