Title: NNKcat: deep neural network to predict catalytic constants (Kcat) by integrating protein sequence and substrate structure with enhanced data imbalance handling
Abstract: The catalytic constant (Kcat) describes the efficiency of an enzyme-catalyzed reaction. The Kcat value of an enzyme-substrate pair indicates the rate at which an enzyme converts saturating substrate into product during catalysis. However, it is challenging to construct robust prediction models for this important property. Most existing models, including the one recently published in Nature Catalysis (Li et al.), suffer from overfitting. In this study, we propose a novel protocol for constructing Kcat prediction models that introduces an intermediate step in which the substrate and protein processors are developed separately. The substrate processor analyzes Simplified Molecular Input Line Entry System (SMILES) strings using a graph neural network model, Attentive FP, while the protein processor abstracts protein sequence information using a long short-term memory (LSTM) architecture. This protocol not only mitigates the impact of data imbalance in the original dataset but also provides greater flexibility in customizing the general-purpose Kcat prediction model to improve prediction accuracy for specific enzyme classes. Our general-purpose Kcat prediction model demonstrates significantly enhanced stability and slightly better accuracy (R² value of 0.54 versus 0.50) than Li et al.’s model on the same dataset. Additionally, our modeling protocol allows the general-purpose Kcat model to be fine-tuned for specific enzyme categories through focused learning. Using Cytochrome P450 (CYP450) enzymes as a case study, we achieved a best R² value of 0.64 for the focused model. The high-quality performance and expandability of the model guarantee its broad applicability in enzyme engineering and drug research and development.
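For readers comparing the reported accuracy figures, the following is a minimal sketch of how a coefficient of determination (R²) is conventionally computed. It assumes the standard definition, 1 - SS_res / SS_tot; the abstract does not state which definition the authors used (some studies report the squared Pearson correlation instead). The function name `r_squared` and the sample values are hypothetical and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - SS_res / SS_tot.

    Assumption: this is the conventional definition; the source abstract
    does not specify how its R² values were computed.
    """
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: spread of the measured values around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R² is undefined when all true values are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical values for illustration only (not data from the study).
measured = [0.0, 1.0, 2.0, 3.0]
predicted = [0.5, 1.0, 1.5, 3.0]
score = r_squared(measured, predicted)
print(f"R² = {score:.2f}")
```

Under this definition, a value of 1 means perfect prediction and a value of 0 means the model does no better than always predicting the mean of the measured values.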
Award ID(s):
1955260
PAR ID:
10590797
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
26
Issue:
3
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This dataset consists of 800 coordinate files (in the CHARMM psf/cor format) for the QM/MM minimum energy pathways of the acylation reactions between a Class A beta-lactamase (Toho-1) and two beta-lactam antibiotic molecules (ampicillin and cefalexin). The files are:
    - toho_amp.r1-ae.zip: the R1-AE acylation pathways for Toho-1/Ampicillin (200 pathways);
    - toho_amp.r2-ae.zip: the R2-AE acylation pathways for Toho-1/Ampicillin (200 pathways);
    - toho_cex.r1-ae.zip: the R1-AE acylation pathways for Toho-1/Cefalexin (200 pathways);
    - toho_cex.r2-ae.zip: the R2-AE acylation pathways for Toho-1/Cefalexin (200 pathways);
    - energies.zip: the replica energies at the B3LYP-D3/6-31+G**/C36 level;
    - chelpgs.zip: the ChElPG charges of all reactant replicas at the B3LYP-D3/6-31+G**/C36 level;
    - farrys.zip: the featurized NumPy arrays for model training;
    - peephole.zip: an example file showing what the optimized MEPs look like;
    - dftb3_benchmark.zip: the reference calculations that justify the use of DFTB3/3OB-F/C36 in MEP optimizations; the reference level of theory is B3LYP-D3/6-31G**/C36.
    In the R1-AE pathways, acylation uses Glu166 as the general base; in the R2-AE pathways, it uses Lys73 and Glu166 as the concerted base.
    All QM/MM pathways are optimized at the DFTB3/3OB-f/CHARMM36 level of theory.
    Z. Song et al. Mechanistic Insights into Enzyme Catalysis from Explaining Machine-Learned Quantum Mechanical and Molecular Mechanical Minimum Energy Pathways. ACS Physical Chemistry Au, in press. DOI: 10.1021/acsphyschemau.2c00005
  2. Third-state dynamics (Angluin et al. 2008; Perron et al. 2009) is a well-known process for quickly and robustly computing approximate majority through interactions between randomly chosen pairs of agents. In this paper, we consider this process in a new model with persistent-state catalytic inputs, as well as in the presence of transient leak faults. Based on models considered in recent protocols for populations with persistent-state agents (Dudek et al. 2017; Alistarh et al. 2017; Alistarh et al. 2020), we formalize a Catalytic Input (CI) model comprising n input agents and m worker agents. For m = Θ(n), we show that computing the parity of the input population with high probability requires at least Ω(n²) total interactions, demonstrating a strong separation between the CI and standard population protocol models. On the other hand, we show that the third-state dynamics can be naturally adapted to this new model to solve approximate majority in O(n log n) total steps with high probability when the input margin is Ω(√(n log n)), which preserves the time and space efficiency of the corresponding protocol in the original model. We then show the robustness of third-state dynamics protocols to the transient leak faults considered by Alistarh et al. (2017; 2020). In both the original and CI models, these protocols successfully compute approximate majority with high probability in the presence of leaks occurring at each time step with probability β ≤ O(√(n log n)/n). The resilience of these dynamics to adversarial leaks exhibits a subtle connection to previous results involving Byzantine agents.
  3. Metformin is one of the most regularly prescribed Type II diabetes drugs in the world, and its use is likely to expand as diabetes diagnoses rise globally. This drug and its main degradation byproduct, guanylurea, are not fully metabolized by humans and cannot be removed through conventional water treatment processes. These compounds have been detected in coastal waters around the world and are currently considered emerging pollutants. The goal of this research was to examine the catalytic mechanism and substrate specificity of Guanylurea Hydrolase (GuuH), a recently discovered enzyme that converts guanylurea to ammonia and guanidine. Bioinformatic analyses were conducted to predict the active site and three-dimensional structure of GuuH. Site-directed mutagenesis was performed to construct mutants at amino acids predicted to be part of the enzyme's catalytic triad and substrate binding site. The mutants created were K138R, N141K, E211D, E211Q, and E211N. The wild-type and mutant enzymes were purified using His-tag affinity chromatography. Enzyme activity was assessed by measuring the ammonia released using Berthelot assays. The results showed that the K138R mutant had specific activity similar to that of wild-type GuuH when reacting with guanylurea, while E211N and E211D showed low specific activity under the same conditions. None of the enzymes had detectable activity when reacting with biuret, which suggests they have low affinity for this substrate. Future work will focus on kinetic analyses of the wild-type and K138R enzymes and on additional mutagenesis to identify the amino acids that determine the substrate specificity of the enzyme. Understanding GuuH's catalytic activity and substrate specificity is essential to using this enzyme in the development of biotechnological applications for water treatment.
  4. A reliable prediction of liquefaction-induced damage typically requires nonlinear deformation analyses with an advanced constitutive soil model calibrated to the site conditions. The calibration of constitutive models can be performed by relying primarily on a combination of commonly available properties and empirical or semi-empirical relationships, on laboratory tests on site-specific soils, on in-situ penetration tests, or a combination thereof. Chiaradonna et al. (2022) described a laboratory-based calibration approach of the PM4Sand constitutive model and evaluated the prediction accuracy against the response of a centrifuge experiment of a submerged slope. This paper addresses an alternate calibration approach in which the PM4Sand model is calibrated using centrifuge in-situ CPT data. The model performance for the resulting calibration is evaluated against the centrifuge experimental data and prior simulations from Chiaradonna et al. (2022). In this case, the CPT-based calibration resulted in more accurate estimations of the dynamic response and permanent displacements. 
  5. Prediction of the spatial-temporal dynamics of the fluid flow in complex subsurface systems, such as geologic storage, is typically performed using advanced numerical simulation methods that solve the underlying governing physical equations. However, numerical simulation is computationally demanding and can limit the implementation of standard field management workflows, such as model calibration and optimization. Standard deep learning models, such as RUNET, have recently been proposed to alleviate the computational burden of physics-based simulation models. Despite their powerful learning capabilities and computational appeal, deep learning models have important limitations, including lack of interpretability, extensive data needs, weak extrapolation capacity, and physical inconsistency that can affect their adoption in practical applications. We develop a Fluid Flow-based Deep Learning (FFDL) architecture for spatial-temporal prediction of important state variables in subsurface flow systems. The new architecture consists of a physics-based encoder to construct physically meaningful latent variables, and a residual-based processor to predict the evolution of the state variables. It uses physical operators that serve as nonlinear activation functions and imposes the general structure of the fluid flow equations to facilitate its training with data pertaining to the specific subsurface flow application of interest. A comprehensive investigation of FFDL, based on a field-scale geologic storage model, is used to demonstrate the superior performance of FFDL compared to RUNET as a standard deep learning model. The results show that FFDL outperforms RUNET in terms of prediction accuracy, extrapolation power, and training data needs.