This content will become publicly available on December 16, 2025

Title: Machine Learning Guided Rational Design of a Non‐Heme Iron‐Based Lysine Dioxygenase Improves its Total Turnover Number
Abstract: Highly selective C−H functionalization remains an ongoing challenge in organic synthetic methodologies. Biocatalysts are robust tools for achieving these difficult chemical transformations. Improving biocatalyst activity has often required directed evolution or structure‐based rational design campaigns. In recent years, machine learning has been integrated into these workflows to improve the discovery of beneficial enzyme variants. In this work, we combine a structure‐based self‐supervised machine learning framework, MutComputeX, with classical molecular dynamics simulations to down‐select mutations for rational design of a non‐heme iron‐dependent lysine dioxygenase, LDO. This approach consistently resulted in functional LDO mutants and circumvented the need for extensive study of mutational activity beforehand. Our rationally designed single mutants purified with up to 2‐fold higher expression yields than WT and displayed higher total turnover numbers (TTN). Combining five such single mutations into a pentamutant variant, LPNYI LDO, leads to a 40 % improvement in the TTN (218±3) as compared to WT LDO (TTN=160±2). Overall, this work offers a low‐barrier approach for those seeking to synergize machine learning algorithms with pre‐existing protein engineering strategies.
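As a quick check on the figures quoted in the abstract, the relative TTN improvement can be recomputed from the two reported means. The sketch below is illustrative only: the function name `relative_improvement` is hypothetical, and treating the ± values as independent standard uncertainties is an assumption made here, not something the abstract states.

```python
import math


def relative_improvement(new, new_err, ref, ref_err):
    """Return (fractional improvement, propagated uncertainty) of new over ref.

    Assumes the two uncertainties are independent (an assumption of this sketch).
    """
    ratio = new / ref
    # First-order propagation for a quotient: relative errors add in quadrature.
    ratio_err = ratio * math.sqrt((new_err / new) ** 2 + (ref_err / ref) ** 2)
    return ratio - 1.0, ratio_err


# TTN values quoted in the abstract: pentamutant 218±3, WT 160±2.
improvement, uncertainty = relative_improvement(218, 3, 160, 2)
print(f"{improvement:.1%} ± {uncertainty:.1%}")
```

With these inputs the ratio is 218/160 ≈ 1.36, a gain of roughly 36 %, which is consistent with the abstract's rounded figure of about 40 %.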
Award ID(s):
2046527 2505865
PAR ID:
10602985
Publisher / Repository:
ChemBioChem
Journal Name:
ChemBioChem
Volume:
25
Issue:
24
ISSN:
1439-4227
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ribozymes are RNA molecules that catalyze biochemical reactions. Self-cleaving ribozymes are a common naturally occurring class of ribozymes that catalyze site-specific cleavage of their own phosphodiester backbone. In addition to their natural functions, self-cleaving ribozymes have been used to engineer control of gene expression because they can be designed to alter RNA processing and stability. However, the rational design of ribozyme activity remains challenging, and many ribozyme-based systems are engineered or improved by random mutagenesis and selection (in vitro evolution). Improving a ribozyme-based system often requires several mutations to achieve the desired function, but extensive pairwise and higher-order epistasis prevent a simple prediction of the effect of multiple mutations that is needed for rational design. Recently, high-throughput sequencing-based approaches have produced data sets on the effects of numerous mutations in different ribozymes (RNA fitness landscapes). Here we used such high-throughput experimental data from variants of the CPEB3 self-cleaving ribozyme to train a predictive model through machine learning approaches. We trained models using either a random forest or long short-term memory (LSTM) recurrent neural network approach. We found that models trained on a comprehensive set of pairwise mutant data could predict active sequences at higher mutational distances, but the correlation between predicted and experimentally observed self-cleavage activity decreased with increasing mutational distance. Adding sequences with increasingly higher numbers of mutations to the training data improved the correlation at increasing mutational distances. Systematically reducing the size of the training data set suggests that a wide distribution of ribozyme activity may be the key to accurate predictions.
     Because the model predictions are based only on sequence and activity data, the results demonstrate that this machine learning approach allows readily obtainable experimental data to be used for RNA design efforts even for RNA molecules with unknown structures. The accurate prediction of RNA functions will enable a more comprehensive understanding of RNA fitness landscapes for studying evolution and for guiding RNA-based engineering efforts.
  2. Machine learning has been proposed as an alternative to theoretical modeling when dealing with complex problems in biological physics. However, in this perspective, we argue that a more successful approach is a proper combination of these two methodologies. We discuss how ideas coming from physical modeling of neuronal processing led to early formulations of computational neural networks, e.g., Hopfield networks. We then show how modern learning approaches like Potts models, Boltzmann machines, and the transformer architecture are related to each other, specifically, through a shared energy representation. We summarize recent efforts to establish these connections and provide examples of how each of these formulations integrating physical modeling and machine learning has been successful in tackling recent problems in biomolecular structure, dynamics, function, evolution, and design. Instances include protein structure prediction; improvement in computational complexity and accuracy of molecular dynamics simulations; better inference of the effects of mutations in proteins, leading to improved evolutionary modeling; and, finally, how machine learning is revolutionizing protein engineering and design. Going beyond naturally existing protein sequences, a connection to protein design is discussed where synthetic sequences are able to fold to naturally occurring motifs driven by a model rooted in physical principles. We show that this model is “learnable” and propose its future use in the generation of unique sequences that can fold into a target structure.
  3. Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. We show that ECNet predicts the sequence-function relationship more accurately as compared to existing machine learning algorithms by using ~50 deep mutational scanning and random mutagenesis datasets. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates.
  4. Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes identifying the scarce number of mutations that will improve thermodynamic stability challenging. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Our method, Mutate Everything, is capable of predicting the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which was trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets.
  5. Background: The giant sarcomere protein titin is important in both heart health and disease. Mutations in the gene encoding for titin (TTN) are the leading known cause of familial dilated cardiomyopathy. The uneven distribution of these mutations within TTN motivated us to seek a more complete understanding of this gene and the isoforms it encodes in cardiomyocyte (CM) sarcomere formation and function. Methods: To investigate the function of titin in human CMs, we used CRISPR/Cas9 to generate homozygous truncations in the Z disk (TTN-Z−/−) and A-band (TTN-A−/−) regions of the TTN gene in human induced pluripotent stem cells. The resulting CMs were characterized with immunostaining, engineered heart tissue mechanical measurements, and single-cell force and calcium measurements. Results: After differentiation, we were surprised to find that despite the more upstream mutation, TTN-Z−/−-CMs had sarcomeres and visibly contracted, whereas TTN-A−/−-CMs did not. We hypothesized that sarcomere formation was caused by the expression of a recently discovered isoform of titin, Cronos, which initiates downstream of the truncation in TTN-Z−/−-CMs. Using a custom Cronos antibody, we demonstrate that this isoform is expressed and integrated into myofibrils in human CMs. TTN-Z−/−-CMs exclusively express Cronos titin, but these cells produce lower contractile force and have perturbed myofibril bundling compared with controls expressing both full-length and Cronos titin. Cronos titin is highly expressed in human fetal cardiac tissue, and when knocked out in human induced pluripotent stem cell derived CMs, these cells exhibit reduced contractile force and myofibrillar disarray despite the presence of full-length titin. Conclusions: We demonstrate that Cronos titin is expressed in developing human CMs and is able to support partial sarcomere formation in the absence of full-length titin. Furthermore, Cronos titin is necessary for proper sarcomere function in human induced pluripotent stem cell derived CMs. Additional investigation is necessary to understand the molecular mechanisms of this novel isoform and how it contributes to human cardiac disease.