Computational models of evolution are valuable for understanding the dynamics of sequence variation, for inferring phylogenetic relationships or potential evolutionary pathways, and for biomedical and industrial applications. Despite these benefits, few have been validated for their propensity to generate outputs with in vivo functionality, which would enhance their value as accurate and interpretable evolutionary algorithms. We demonstrate the power of epistasis inferred from natural protein families to evolve sequence variants in an algorithm we developed called sequence evolution with epistatic contributions (SEEC). Utilizing the Hamiltonian of the joint probability of sequences in the family as a fitness metric, we sampled and experimentally tested for in vivo β-lactamase activity in Escherichia coli TEM-1 variants. These evolved proteins can have dozens of mutations dispersed across the structure while preserving sites essential for both catalysis and interactions. Remarkably, these variants retain family-like functionality while being more active than their wild-type predecessor. We found that depending on the inference method used to generate the epistatic constraints, different parameters simulate diverse selection strengths. Under weaker selection, local Hamiltonian fluctuations reliably predict relative changes to variant fitness, recapitulating neutral evolution. SEEC has the potential to explore the dynamics of neofunctionalization, characterize viral fitness landscapes, and facilitate vaccine development.
This content will become publicly available on September 1, 2026
Evaluation of machine learning-assisted directed evolution across diverse combinatorial landscapes
Various machine learning-assisted directed evolution (MLDE) strategies have been shown to identify high-fitness protein variants more efficiently than typical wet-lab directed evolution approaches. However, limited understanding of the factors influencing MLDE performance across diverse proteins has hindered optimal strategy selection for wet-lab campaigns. To address this, we systematically analyzed multiple MLDE strategies, including active learning and focused training using six distinct zero-shot predictors, across 16 diverse protein fitness landscapes. By quantifying landscape navigability with six attributes, we found that MLDE offers a greater advantage on landscapes that are more challenging for directed evolution, especially when focused training is combined with active learning. Despite varying levels of advantage across landscapes, focused training with zero-shot predictors leveraging distinct evolutionary, structural, and stability knowledge sources consistently outperforms random sampling for both binding interactions and enzyme activities. Our findings provide practical guidelines for selecting MLDE strategies for protein engineering.
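As a rough illustration of the focused-training idea mentioned in the abstract, the sketch below draws a training set from the top-scoring fraction of a combinatorial library instead of from the whole library. It is a minimal toy under stated assumptions, not the authors' pipeline: the function name `focused_training_sample`, the `top_fraction` parameter, the four-letter alphabet, and the `toy_score` stand-in for a zero-shot predictor are all hypothetical.

```python
import random


def focused_training_sample(variants, zero_shot_score, n_train,
                            top_fraction=0.25, seed=0):
    """Sample training variants at random from the top-scoring part of a library.

    `zero_shot_score` is any callable mapping a variant to a number
    (higher means predicted fitter). All names here are illustrative.
    """
    rng = random.Random(seed)
    # Rank the whole library by the zero-shot score, best first.
    ranked = sorted(variants, key=zero_shot_score, reverse=True)
    # Keep the top fraction, but never fewer variants than we need to draw.
    n_keep = max(n_train, int(len(ranked) * top_fraction))
    pool = ranked[:n_keep]
    # Draw the training set uniformly from the enriched pool.
    return rng.sample(pool, n_train)


# Toy library: every two-site combination over a four-letter alphabet.
ALPHABET = "ACDE"
library = [a + b for a in ALPHABET for b in ALPHABET]


def toy_score(variant):
    # Hypothetical stand-in for a real zero-shot predictor.
    return variant.count("A")


train_set = focused_training_sample(library, toy_score, n_train=4)
print(train_set)
```

Random sampling corresponds to setting `top_fraction=1.0`, so the same function can serve as the baseline in a side-by-side comparison.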
- Award ID(s):
- 1937902
- PAR ID:
- 10635592
- Publisher / Repository:
- Cell Systems
- Date Published:
- Journal Name:
- Cell Systems
- ISSN:
- 2405-4712
- Page Range / eLocation ID:
- 101387
- Subject(s) / Keyword(s):
- combinatorial mutagenesis, directed evolution, epistasis, fitness prediction, machine learning, protein engineering, zero-shot predictor
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We are seeking to incorporate authentic inquiry into an undergraduate biochemistry lab course. Students on six campuses are combining computational (“in silico”) and wet lab (“in vitro”) techniques as they characterize proteins whose three-dimensional structures are known but to which functions have not been previously ascribed. The in silico modules include protein visualization with PyMOL, structural alignment using Dali and ProMOL, sequence exploration with BLAST and Pfam, and ligand docking with PyRX and Autodock Vina. The goal is to predict the function of the protein and to identify the most promising substrates for the active sites. In the wet lab, students express and purify their target proteins, then conduct enzyme kinetics with substrates selected from their docking studies. Their learning as students and their growth as scientists are being assessed in terms of research methods, visualization, biological context, and mechanism of protein function. The lab course is an extension of successful undergraduate research efforts at RIT and Dowling College. The modules that are developed will be disseminated to the scientific community via a web site (promol.org), including both protocols and captioned video instruction in the techniques involved. Over the course of the project, we will also be following changes in faculty and teaching assistant competence in two areas: effective teaching with structural biology tools and the development of skills in measuring learning gains by students. As we conduct the lab on these different campuses, we will also focus on the advantages of our approach and the barriers to implementation that exist on each campus, from student acceptance and faculty training, to the resources that are needed, to changes in the culture at the departmental and institutional levels. As we analyze the feasibility of this approach on other campuses, we will seek input from other potential adopters about their level of interest and the barriers that they anticipate on their campuses.
-
The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY’s potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution.
-
Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing is thus unjustified. It also results from the fact that, even though many lower cost sequencing strategies have been developed, they often require at least some sequencing or computational resources, both of which can be barriers to access. Here, we present every variant sequencing (evSeq), a method and collection of tools/standardized components for sequencing a variable region within every variant gene produced during a protein engineering campaign at a cost of cents per variant. evSeq was designed to democratize low-cost sequencing for protein engineers and, indeed, anyone interested in engineering biological systems. Execution of its wet-lab component is simple, requires no sequencing experience to perform, relies only on resources and services typically available to biology labs, and slots neatly into existing protein engineering workflows. Analysis of evSeq data is likewise made simple by its accompanying software (found at github.com/fhalab/evSeq, documentation at fhalab.github.io/evSeq), which can be run on a personal laptop and was designed to be accessible to users with no computational experience. Low-cost and easy to use, evSeq makes collection of extensive protein variant sequence-fitness data practical.
-
Cells adapt to changing environments. Perturb a cell and it returns to a point of homeostasis. Perturb a population and it evolves toward a fitness peak. We review quantitative models of the forces of adaptation and their visualizations on landscapes. While some adaptations result from single mutations or few-gene effects, others are more cooperative, more delocalized in the genome, and more universal and physical. For example, homeostasis and evolution depend on protein folding and aggregation, energy and protein production, protein diffusion, molecular motor speeds and efficiencies, and protein expression levels. Models provide a way to learn about the fitness of cells and cell populations by making and testing hypotheses.
