Title: Predicting a Protein's Stability under a Million Mutations
Stabilizing proteins is a foundational step in protein engineering. However, the evolutionary pressure of all extant proteins makes identifying the scarce number of mutations that will improve thermodynamic stability challenging. Deep learning has recently emerged as a powerful tool for identifying promising mutations. Existing approaches, however, are computationally expensive, as the number of model inferences scales with the number of mutations queried. Our main contribution is a simple, parallel decoding algorithm. Our method, Mutate Everything, is capable of predicting the effect of all single and double mutations in one forward pass. It is even versatile enough to predict higher-order mutations with minimal computational overhead. We build Mutate Everything on top of ESM2 and AlphaFold, neither of which was trained to predict thermodynamic stability. We trained on the Mega-Scale cDNA proteolysis dataset and achieved state-of-the-art performance on single and higher-order mutations on S669, ProTherm, and ProteinGym datasets.
Award ID(s):
2505865
PAR ID:
10631850
Author(s) / Creator(s):
Publisher / Repository:
https://doi.org/10.48550/arXiv.2310.12979
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Engineering stabilized proteins is a fundamental challenge in the development of industrial and pharmaceutical biotechnologies. We present Stability Oracle: a structure-based graph-transformer framework that achieves SOTA performance on accurately identifying thermodynamically stabilizing mutations. Our framework introduces several innovations to overcome well-known challenges in data scarcity and bias, generalization, and computation time, such as: Thermodynamic Permutations for data augmentation, structural amino acid embeddings to model a mutation with a single structure, and a protein structure-specific attention-bias mechanism that makes transformers a viable alternative to graph neural networks. We provide training/test splits that mitigate data leakage and ensure proper model evaluation. Furthermore, to examine our data engineering contributions, we fine-tune ESM2 representations (Prostata-IFML) and achieve SOTA for sequence-based models. Notably, Stability Oracle outperforms Prostata-IFML even though it was pretrained on 2000X fewer proteins and has 548X fewer parameters. Our framework establishes a path for fine-tuning structure-based transformers to virtually any phenotype, a necessary task for accelerating the development of protein-based biotechnologies.
  2. Accurate prediction of protein stability changes resulting from amino acid substitutions is of utmost importance in medicine, as it helps distinguish deleterious, disease-causing mutations from neutral ones. Because wet lab experiments on protein mutations are costly and time consuming, and because the number of possible mutations is huge, computational methods that can accurately predict the effects of amino acid mutations are greatly needed. In this research, we present a robust methodology to predict the energy changes of a protein upon mutation. The proposed prediction scheme is based on a two-step algorithm: a Holdout Random Sampler followed by a neural network model for regression. The Holdout Random Sampler is used to analyze the energy change and the corresponding uncertainty, and to obtain a set of admissible energy changes, expressed as a cumulative distribution function. These values are then used to train a simple neural network model that can predict the energy changes. Results were blindly tested (validated) against experimental energy changes, giving Pearson correlation coefficients of 0.66 for Single Point Mutations and 0.77 for Multiple Point Mutations. These results confirm the success of our method, since it outperforms the majority of previous studies in this field.
  3. Packing interaction is a critical driving force in the folding of helical membrane proteins. Despite this importance, packing defects (i.e., cavities including voids, pockets, and pores) are prevalent in membrane-integral enzymes, channels, transporters, and receptors, playing essential roles in function. Then, a question arises regarding how the two competing requirements, packing for stability vs. cavities for function, are reconciled in membrane protein structures. Here, using the intramembrane protease GlpG of Escherichia coli as a model and cavity-filling mutation as a probe, we tested the impacts of native cavities on the thermodynamic stability and function of a membrane protein. We find several stabilizing mutations which induce substantial activity reduction without distorting the active site. Notably, these mutations are all mapped onto the regions of conformational flexibility and functional importance, indicating that the cavities facilitate functional movement of GlpG while compromising the stability. Experiment and molecular dynamics simulation suggest that the stabilization is induced by the coupling between enhanced protein packing and weakly unfavorable lipid desolvation, or solely by favorable lipid solvation on the cavities. Our result suggests that, stabilized by the relatively weak interactions with lipids, cavities are accommodated in membrane proteins without severe energetic cost, which, in turn, serve as a platform to fine-tune the balance between stability and flexibility for optimal activity.
  4. To build effective therapeutics, biologists iteratively mutate antibody sequences to improve binding and stability. Proposed mutations can be informed by previous measurements or by learning from large antibody databases to predict only typical antibodies. Unfortunately, the space of typical antibodies is enormous to search, and experiments often fail to find suitable antibodies on a budget. We introduce Clone-informed Bayesian Optimization (CloneBO), a Bayesian optimization procedure that efficiently optimizes antibodies in the lab by teaching a generative model how our immune system optimizes antibodies. Our immune system makes antibodies by iteratively evolving specific portions of their sequences to bind their target strongly and stably, resulting in a set of related, evolving sequences known as a clonal family. We train a large language model, CloneLM, on hundreds of thousands of clonal families and use it to design sequences with mutations that are most likely to optimize an antibody within the human immune system. We propose to guide our designs to fit previous measurements with a twisted sequential Monte Carlo procedure. We show that CloneBO optimizes antibodies substantially more efficiently than previous methods in realistic in silico experiments and designs stronger and more stable binders in in vitro wet lab experiments. 
  5. Proteins are constantly undergoing folding and unfolding transitions, with rates that determine their homeostasis in vivo and modulate their biological function. The ability to optimize these rates without affecting overall native stability is hence highly desirable for protein engineering and design. The great challenge is, however, that mutations generally affect folding and unfolding rates with inversely complementary fractions of the net free energy change they inflict on the native state. Here we address this challenge by targeting the folding transition state (FTS) of chymotrypsin inhibitor 2 (CI2), a very slow and stable two‐state folding protein with an FTS known to be refractory to change by mutation. We first discovered that CI2's FTS is energetically taxed by the desolvation of several highly conserved charges that form a buried salt bridge network in the native structure. Based on these findings, we designed a CI2 variant that bears just four mutations and aims to selectively stabilize the FTS. This variant has >250‐fold faster rates in both directions and hence identical native stability, demonstrating the success of our FTS‐centric design strategy. With an optimized FTS, CI2 also becomes 250‐fold more sensitive to proteolytic degradation by its natural substrate chymotrypsin, and completely loses its activity as an inhibitor. These results indicate that CI2 has been selected through evolution to have a very unstable FTS in order to attain the kinetic stability needed to effectively function as a protease inhibitor. Moreover, the CI2 case showcases that protein (un)folding rates can critically pivot around a few key residue interactions, which can strongly modify the general effects of known structural factors such as domain size and fold topology. From a practical standpoint, our results suggest that future efforts should perhaps focus on identifying such critical residue interactions in proteins as the best strategy to significantly improve our ability to predict and engineer protein (un)folding rates.