NSF PAR Search | NSF Public Access Repository

Evaluation of machine learning-assisted directed evolution across diverse combinatorial landscapes

https://doi.org/10.1016/j.cels.2025.101387

Li, Francesca-Zhoufan; Yang, Jason; Johnston, Kadina E; Gürsoy, Emre; Yue, Yisong; Arnold, Frances H (September 2025, Cell Systems)

Various machine learning-assisted directed evolution (MLDE) strategies have been shown to identify high-fitness protein variants more efficiently than typical wet-lab directed evolution approaches. However, limited understanding of the factors influencing MLDE performance across diverse proteins has hindered optimal strategy selection for wet-lab campaigns. To address this, we systematically analyzed multiple MLDE strategies, including active learning and focused training using six distinct zero-shot predictors, across 16 diverse protein fitness landscapes. By quantifying landscape navigability with six attributes, we found that MLDE offers a greater advantage on landscapes which are more challenging for directed evolution, especially when focused training is combined with active learning. Despite varying levels of advantage across landscapes, focused training with zero-shot predictors leveraging distinct evolutionary, structural, and stability knowledge sources consistently outperforms random sampling for both binding interactions and enzyme activities. Our findings provide practical guidelines for selecting MLDE strategies for protein engineering.
Free, publicly-accessible full text available September 1, 2026
Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering

https://doi.org/10.1021/acscentsci.3c01275

Yang, Jason; Li, Francesca-Zhoufan; Arnold, Frances H (February 2024, ACS Central Science)
evSeq: Cost-Effective Amplicon Sequencing of Every Variant in a Protein Library

https://doi.org/10.1021/acssynbio.1c00592

Wittmann, Bruce J.; Johnston, Kadina E.; Almhjell, Patrick J.; Arnold, Frances H. (February 2022, ACS Synthetic Biology)

Widespread availability of protein sequence-fitness data would revolutionize both our biochemical understanding of proteins and our ability to engineer them. Unfortunately, even though thousands of protein variants are generated and evaluated for fitness during a typical protein engineering campaign, most are never sequenced, leaving a wealth of potential sequence-fitness information untapped. Primarily, this is because sequencing is unnecessary for many protein engineering strategies; the added cost and effort of sequencing is thus unjustified. It also results from the fact that, even though many lower cost sequencing strategies have been developed, they often require at least some sequencing or computational resources, both of which can be barriers to access. Here, we present every variant sequencing (evSeq), a method and collection of tools/standardized components for sequencing a variable region within every variant gene produced during a protein engineering campaign at a cost of cents per variant. evSeq was designed to democratize low-cost sequencing for protein engineers and, indeed, anyone interested in engineering biological systems. Execution of its wet-lab component is simple, requires no sequencing experience to perform, relies only on resources and services typically available to biology labs, and slots neatly into existing protein engineering workflows. Analysis of evSeq data is likewise made simple by its accompanying software (found at github.com/fhalab/evSeq, documentation at fhalab.github.io/evSeq), which can be run on a personal laptop and was designed to be accessible to users with no computational experience. Low-cost and easy to use, evSeq makes collection of extensive protein variant sequence-fitness data practical.
Full Text Available
Protein sequence design with deep generative models

https://doi.org/10.1016/j.cbpa.2021.04.004

Wu, Zachary; Johnston, Kadina E.; Arnold, Frances H.; Yang, Kevin K. (December 2021, Current Opinion in Chemical Biology)
Full Text Available
Informed training set design enables efficient machine learning-assisted directed protein evolution

https://doi.org/10.1016/j.cels.2021.07.008

Wittmann, Bruce J.; Yue, Yisong; Arnold, Frances H. (August 2021, Cell Systems)
Full Text Available
Advances in machine learning for directed evolution

https://doi.org/10.1016/j.sbi.2021.01.008

Wittmann, Bruce J; Johnston, Kadina E; Wu, Zachary; Arnold, Frances H (August 2021, Current Opinion in Structural Biology)
Full Text Available
Signal Peptides Generated by Attention-Based Neural Networks

https://doi.org/10.1021/acssynbio.0c00219

Wu, Zachary; Yang, Kevin K.; Liszka, Michael J.; Lee, Alycia; Batzilla, Alina; Wernick, David; Weiner, David P.; Arnold, Frances H. (July 2020, ACS Synthetic Biology)

Short (15–30 residue) chains of amino acids at the amino termini of expressed proteins known as signal peptides (SPs) specify secretion in living cells. We trained an attention-based neural network, the Transformer model, on data from all available organisms in Swiss-Prot to generate SP sequences. Experimental testing demonstrates that the model-generated SPs are functional: when appended to enzymes expressed in an industrial Bacillus subtilis strain, the SPs lead to secreted activity that is competitive with industrially used SPs. Additionally, the model-generated SPs are diverse in sequence, sharing as little as 58% sequence identity to the closest known native signal peptide and 73% ± 9% on average.
Full Text Available