Search for: All records

Award ID contains: 2227112


  1. This data set contains 194778 quasireaction subgraphs extracted from CHO transition networks with 2-6 non-hydrogen atoms (CxHyOz, 2 <= x + z <= 6).

     The complete table of subgraphs (including file locations) is in the CHO-6-atoms-subgraphs.csv file. The subgraphs are in GraphML format (http://graphml.graphdrawing.org) and are compressed using bzip2. All subgraphs are undirected and unweighted. The reactant and product nodes (initial and final) are labeled in the "type" node attribute. The nodes are represented as multi-molecule SMILES strings. The edges are labeled by the reaction rules in SMARTS representation. The forward and backward readings of a SMARTS string should be considered equivalent.

     The generation and analysis of this data set are described in D. Rappoport, Statistics and Bias-Free Sampling of Reaction Mechanisms from Reaction Network Models, 2023, submitted. Preprint at ChemRxiv, DOI: 10.26434/chemrxiv-2023-wltcr

     Simulation parameters
     - CHO networks were constructed using the polar bond break/bond formation rule set for CHO.
     - High-energy nodes were excluded using the following rules: (i) more than 3 rings, (ii) triple and allene bonds in rings, (iii) double bonds at bridge atoms, (iv) double bonds in fused 3-membered rings.
     - Neutral nodes were defined as containing only neutral molecules.
     - Shortest path lengths were determined for all pairs of neutral nodes.
     - Pairs of neutral nodes with shortest-path length > 8 were excluded.
     - Additionally, pairs of neutral nodes connected only by shortest paths passing through additional neutral nodes (reducible paths) were excluded.

     For background and additional details, see the paper above.

     This work was supported in part by the National Science Foundation under Grant No. CHE-2227112.
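The file layout described above (bzip2-compressed GraphML with a "type" node attribute) can be read with the Python standard library alone. The following is a minimal sketch, not the data set's own tooling: the function names `read_graphml_bz2` and `nodes_of_type` are hypothetical, and the attribute values `"initial"` and `"final"` are an assumption based on the description, so check them against an actual file before relying on them.

```python
import bz2
import xml.etree.ElementTree as ET

# XML namespace used by GraphML documents.
GRAPHML_NS = "{http://graphml.graphdrawing.org/xmlns}"


def read_graphml_bz2(path):
    """Read a bzip2-compressed GraphML file (hypothetical helper).

    Returns (nodes, edges): nodes maps node id -> {attribute name: value},
    edges is a list of (source, target, {attribute name: value}) tuples.
    """
    with bz2.open(path, "rb") as handle:
        root = ET.parse(handle).getroot()

    # GraphML declares attribute names in <key> elements; <data> elements
    # refer to them by key id.
    key_names = {
        key.get("id"): key.get("attr.name")
        for key in root.iter(GRAPHML_NS + "key")
    }

    def attributes(element):
        return {
            key_names.get(data.get("key"), data.get("key")): data.text
            for data in element.findall(GRAPHML_NS + "data")
        }

    nodes = {n.get("id"): attributes(n) for n in root.iter(GRAPHML_NS + "node")}
    edges = [
        (e.get("source"), e.get("target"), attributes(e))
        for e in root.iter(GRAPHML_NS + "edge")
    ]
    return nodes, edges


def nodes_of_type(nodes, type_value):
    """Return sorted ids of nodes whose "type" attribute equals type_value.

    The values "initial" and "final" are assumed, not confirmed by the source.
    """
    return sorted(
        node_id for node_id, attrs in nodes.items()
        if attrs.get("type") == type_value
    )
```

Because the subgraphs are undirected, each `(source, target)` pair should be treated as unordered when building an adjacency structure from the returned edge list.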
  2. Selection bias is inevitable in manually curated computational reaction databases but can have a significant impact on generalizability of quantum chemical methods and machine learning models derived from these data sets. Here, we propose quasireaction subgraphs as a discrete, graph-based representation of reaction mechanisms that has a well-defined associated probability space and admits a similarity function using graph kernels. Quasireaction subgraphs are thus well suited for constructing representative or diverse data sets of reactions. Quasireaction subgraphs are defined as subgraphs of a network of formal bond breaks and bond formations (transition network) composed of all shortest paths between reactant and product nodes. However, due to their purely geometric construction, they do not guarantee that the corresponding reaction mechanisms are thermodynamically and kinetically feasible. As a result, a binary classification of feasible (reaction subgraphs) and infeasible (non-reactive subgraphs) must be applied after sampling. In this paper, we describe the construction and properties of quasireaction subgraphs and characterize the statistics of quasireaction subgraphs from CHO transition networks with up to six nonhydrogen atoms. We explore their clustering using Weisfeiler–Lehman graph kernels. 
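The definition above (the subgraph composed of all shortest paths between a reactant node and a product node) can be illustrated with two breadth-first searches on an undirected, unweighted graph. This is a sketch of the definition under stated assumptions, not the authors' implementation: the function names and the adjacency-dictionary input format are hypothetical. A node lies on some shortest path exactly when its distances to the two endpoints add up to the endpoint-to-endpoint distance.

```python
from collections import deque


def bfs_distances(adjacency, source):
    """Hop distances from source to every reachable node."""
    distances = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distances:
                distances[neighbor] = distances[node] + 1
                queue.append(neighbor)
    return distances


def shortest_path_subgraph(adjacency, reactant, product):
    """Nodes and edges lying on at least one shortest reactant-product path.

    adjacency maps each node to an iterable of its neighbors (undirected).
    Returns (nodes, edges), with edges as frozensets of two endpoints.
    """
    from_reactant = bfs_distances(adjacency, reactant)
    if product not in from_reactant:
        return set(), set()  # endpoints are not connected
    from_product = bfs_distances(adjacency, product)
    length = from_reactant[product]

    # Keep nodes whose two distances sum to the shortest-path length.
    nodes = {
        n for n in from_reactant
        if n in from_product and from_reactant[n] + from_product[n] == length
    }
    # Keep edges that advance exactly one step along a shortest path.
    edges = {
        frozenset((u, v))
        for u in nodes
        for v in adjacency.get(u, ())
        if v in nodes and abs(from_reactant[u] - from_reactant[v]) == 1
    }
    return nodes, edges
```

Edges between two retained nodes at the same distance from the reactant are dropped, since they cannot belong to any shortest path.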
  3. This dataset contains sequence information, three-dimensional structures (from the AlphaFold2 model), and substrate classification labels for 358 short-chain dehydrogenase/reductases (SDRs) and 953 S-adenosylmethionine dependent methyltransferases (SAM-MTases).

     The amino acid sequences of these enzymes were obtained from the UniProt Knowledgebase (https://www.uniprot.org). The sets of proteins were obtained by querying using InterPro protein family/domain identifiers corresponding to each family: IPR002347 (SDRs) and IPR029063 (SAM-MTases). The query results were filtered by UniProt annotation score, keeping only entries with a score above 4 out of 5, and deduplicated by exact sequence matches.

     The sequences were submitted to the publicly available AlphaFold2 protein structure predictor (J. Jumper et al., Nature, 2021, 596, 583) using the ColabFold notebook (https://colab.research.google.com/github/sokrypton/ColabFold/blob/v1.1-premultimer/batch/AlphaFold2_batch.ipynb, M. Mirdita, S. Ovchinnikov, M. Steinegger, Nature Meth., 2022, 19, 679, https://github.com/sokrypton/ColabFold). The model settings used were msa_model = MMSeq2(Uniref+Environmental), num_models = 1, use_amber = False, use_templates = True, do_not_overwrite_results = True. The resulting PDB structures are included as ZIP archives.

     The classification labels were obtained from the substrate and product annotations of the enzyme UniProtKB records. Two approaches were used: substrate clustering based on molecular fingerprints and manual substrate type classification. For the substrate clustering, Morgan fingerprints were generated for all enzymatic substrates and products with known structures (excluding cofactors) with radius = 3 using RDKit (https://rdkit.org). The fingerprints were projected onto two-dimensional space using the UMAP algorithm (L. McInnes, J. Healy, 2018, arXiv 1802.03426) with the Jaccard metric and clustered using k-means. This procedure generated 9 clusters for SDR substrates and 13 clusters for SAM-MTases. The SMILES representations of the substrates are listed in the SDR_substrates_to_cluster_map_2DIMUMAP.csv and SAM_substrates_to_13clusters_map_2DIMUMAP.csv files.

     The following manually defined classification tasks are included for SDRs: NADP/NAD cofactor classification; phenol substrate, sterol substrate, coenzyme A (CoA) substrate. For SAM-MTases, the manually defined classification tasks are: biopolymer (protein/RNA/DNA) vs. small molecule substrate, phenol substrates, sterol substrates, nitrogen heterocycle substrates. The SMARTS strings used to define the substrate classes are listed in substructure_search_SMARTS.docx.
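The Jaccard metric named in the clustering step compares two binary fingerprints by the overlap of their set bits. The sketch below shows only that distance, in plain Python, on sets of on-bit indices. The function name and the example bit indices are hypothetical, and fingerprint generation, UMAP projection, and k-means clustering are not reproduced here.

```python
def jaccard_distance(bits_a, bits_b):
    """Jaccard distance between two fingerprints given as on-bit indices.

    Returns 1 - |A & B| / |A | B|; two empty fingerprints are treated as
    identical (distance 0.0), which is a convention chosen for this sketch.
    """
    a, b = set(bits_a), set(bits_b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical on-bit indices for two substrates (illustrative values only).
substrate_1 = {12, 87, 301, 945}
substrate_2 = {12, 87, 512, 945}
distance = jaccard_distance(substrate_1, substrate_2)  # 3 shared of 5 total bits
```

A distance of 0 means the two fingerprints set exactly the same bits, and a distance of 1 means they share none.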