skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: md_harmonize: A Python Package for Atom-Level Harmonization of Public Metabolic Databases
A major challenge to integrating public metabolic resources is the use of different nomenclatures by individual databases. This paper presents md_harmonize, an open-source Python package for harmonizing compounds and metabolic reactions across various metabolic databases. The md_harmonize package utilizes a neighborhood-specific graph coloring method for generating a unique identifier for each compound via atom identifiers based on a compound’s chemical structure. The resulting harmonized compounds and reactions can be used for various downstream analyses, including the construction of atom-resolved metabolic networks and models for metabolic flux analysis. Parts of the md_harmonize package have been optimized using a variety of computational techniques to allow certain NP-complete problems handled by the software to be tractable for these specific use-cases. The software is available on GitHub and through the Python Package Index, with end-user documentation hosted on GitHub Pages.  more » « less
Award ID(s):
2020026
PAR ID:
10508122
Author(s) / Creator(s):
;
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Metabolites
Volume:
13
Issue:
12
ISSN:
2218-1989
Page Range / eLocation ID:
1199
Subject(s) / Keyword(s):
metabolite database harmonization maximum common substructure Python package
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Metabolic models have been proven to be useful tools in system biology and have been successfully applied to various research fields in a wide range of organisms. A relatively complete metabolic network is a prerequisite for deriving reliable metabolic models. The first step in constructing metabolic network is to harmonize compounds and reactions across different metabolic databases. However, effectively integrating data from various sources still remains a big challenge. Incomplete and inconsistent atomistic details in compound representations across databases is a very important limiting factor. Here, we optimized a subgraph isomorphism detection algorithm to validate generic compound pairs. Moreover, we defined a set of harmonization relationship types between compounds to deal with inconsistent chemical details while successfully capturing atom-level characteristics, enabling a more complete enabling compound harmonization across metabolic databases. In total, 15,704 compound pairs across KEGG (Kyoto Encyclopedia of Genes and Genomes) and MetaCyc databases were detected. Furthermore, utilizing the classification of compound pairs and EC (Enzyme Commission) numbers of reactions, we established hierarchical relationships between metabolic reactions, enabling the harmonization of 3856 reaction pairs. In addition, we created and used atom-specific identifiers to evaluate the consistency of atom mappings within and between harmonized reactions, detecting some consistency issues between the reaction and compound descriptions in these metabolic databases. 
    more » « less
  2. null (Ed.)
    Metabolic flux analysis requires both a reliable metabolic model and reliable metabolic profiles in characterizing metabolic reprogramming. Advances in analytic methodologies enable production of high-quality metabolomics datasets capturing isotopic flux. However, useful metabolic models can be difficult to derive due to the lack of relatively complete atom-resolved metabolic networks for a variety of organisms, including human. Here, we developed a neighborhood-specific graph coloring method that creates unique identifiers for each atom in a compound facilitating construction of an atom-resolved metabolic network. What is more, this method is guaranteed to generate the same identifier for symmetric atoms, enabling automatic identification of possible additional mappings caused by molecular symmetry. Furthermore, a compound coloring identifier derived from the corresponding atom coloring identifiers can be used for compound harmonization across various metabolic network databases, which is an essential first step in network integration. With the compound coloring identifiers, 8865 correspondences between KEGG (Kyoto Encyclopedia of Genes and Genomes) and MetaCyc compounds are detected, with 5451 of them confirmed by other identifiers provided by the two databases. In addition, we found that the Enzyme Commission numbers (EC) of reactions can be used to validate possible correspondence pairs, with 1848 unconfirmed pairs validated by commonality in reaction ECs. Moreover, we were able to detect various issues and errors with compound representation in KEGG and MetaCyc databases by compound coloring identifiers, demonstrating the usefulness of this methodology for database curation. 
    more » « less
  3. Abstract For over 10 years, ModelSEED has been a primary resource for the construction of draft genome-scale metabolic models based on annotated microbial or plant genomes. Now being released, the biochemistry database serves as the foundation of biochemical data underlying ModelSEED and KBase. The biochemistry database embodies several properties that, taken together, distinguish it from other published biochemistry resources by: (i) including compartmentalization, transport reactions, charged molecules and proton balancing on reactions; (ii) being extensible by the user community, with all data stored in GitHub; and (iii) design as a biochemical ‘Rosetta Stone’ to facilitate comparison and integration of annotations from many different tools and databases. The database was constructed by combining chemical data from many resources, applying standard transformations, identifying redundancies and computing thermodynamic properties. The ModelSEED biochemistry is continually tested using flux balance analysis to ensure the biochemical network is modeling-ready and capable of simulating diverse phenotypes. Ontologies can be designed to aid in comparing and reconciling metabolic reconstructions that differ in how they represent various metabolic pathways. ModelSEED now includes 33,978 compounds and 36,645 reactions, available as a set of extensible files on GitHub, and available to search at https://modelseed.org/biochem and KBase. 
    more » « less
  4. In recent years, the FAIR guiding principles and the broader concept of open science has grown in importance in academic research, especially as funding entities have aggressively promoted public sharing of research products. Key to public research sharing is deposition of datasets into online data repositories, but it can be a chore to transform messy unstructured data into the forms required by these repositories. To help generate Metabolomics Workbench depositions, we have developed the MESSES (Metadata from Experimental SpreadSheets Extraction System) software package, implemented in the Python 3 programming language and supported on Linux, Windows, and Mac operating systems. MESSES helps transform tabular data from multiple sources into a Metabolomics Workbench specific deposition format. The package provides three commands, extract, validate, and convert, that implement a natural data transformation workflow. Moreover, MESSES facilitates richer metadata capture than is typically attempted by manual efforts. The source code and extensive documentation is hosted on GitHub and is also available on the Python Package Index for easy installation. 
    more » « less
  5. In the space physics community, processing and combining observational and modeling data from various sources is a demanding task because they often have different formats and use different coordinate systems. The Python package GeospaceLAB has been developed to provide a unified, standardized framework to process data. The package is composed of six core modules, including DataHub as the data manager, Visualization for generating publication quality figures, Express for higher-level interfaces of DataHub and Visualization , SpaceCoordinateSystem for coordinate system transformations, Toolbox for various utilities, and Configuration for preferences. The core modules form a standardized framework for downloading, storing, post-processing and visualizing data in space physics. The object-oriented design makes the core modules of GeospaceLAB easy to modify and extend. So far, GeospaceLAB can process more than twenty kinds of data products from nine databases, and the number will increase in the future. The data sources include, e.g., measurements by EISCAT incoherent scatter radars, DMSP, SWARM, and Grace satellites, OMNI solar wind data, and GITM simulations. In addition, the package provides an interface for the users to add their own data products. Hence, researchers can easily collect, combine, and view multiple kinds of data for their work using GeospaceLAB. Combining data from different sources will lead to a better understanding of the physics of the studied phenomena and may lead to new discoveries. GeospaceLAB is an open source software, which is hosted on GitHub. We welcome everyone in the community to contribute to its future development. 
    more » « less