Search for: All records

Award ID contains: 2019897

  1. Abstract The effective design of combinatorial libraries to balance fitness and diversity facilitates the engineering of useful enzyme functions, particularly those that are poorly characterized or unknown in biology. We introduce MODIFY, a machine learning (ML) algorithm that learns from natural protein sequences to infer evolutionarily plausible mutations and predict enzyme fitness. MODIFY co-optimizes predicted fitness and sequence diversity of starting libraries, prioritizing high-fitness variants while ensuring broad sequence coverage. In silico evaluation shows that MODIFY outperforms state-of-the-art unsupervised methods in zero-shot fitness prediction and enables ML-guided directed evolution with enhanced efficiency. Using MODIFY, we engineer generalist biocatalysts derived from a thermostable cytochrome c to achieve enantioselective C-B and C-Si bond formation via a new-to-nature carbene transfer mechanism, leading to biocatalysts six mutations away from previously developed enzymes while exhibiting superior or comparable activities. These results demonstrate MODIFY’s potential in solving challenging enzyme engineering problems beyond the reach of classic directed evolution. 
  2. Abstract Motivation: Despite the advances in sequencing technology, massive proteins with known sequences remain functionally unannotated. Biological network alignment (NA), which aims to find the node correspondence between species’ protein–protein interaction (PPI) networks, has been a popular strategy to uncover missing annotations by transferring functional knowledge across species. Traditional NA methods assumed that topologically similar proteins in PPIs are functionally similar. However, it was recently reported that functionally unrelated proteins can be as topologically similar as functionally related pairs, and a new data-driven or supervised NA paradigm has been proposed, which uses protein function data to discern which topological features correspond to functional relatedness. Results: Here, we propose GraNA, a deep learning framework for the supervised NA paradigm for the pairwise NA problem. Employing graph neural networks, GraNA utilizes within-network interactions and across-network anchor links for learning protein representations and predicting functional correspondence between across-species proteins. A major strength of GraNA is its flexibility to integrate multi-faceted non-functional relationship data, such as sequence similarity and ortholog relationships, as anchor links to guide the mapping of functionally related proteins across species. Evaluating GraNA on a benchmark dataset composed of several NA tasks between different pairs of species, we observed that GraNA accurately predicted the functional relatedness of proteins and robustly transferred functional annotations across species, outperforming a number of existing NA methods. When applied to a case study on a humanized yeast network, GraNA also successfully discovered functionally replaceable human–yeast protein pairs that were documented in previous studies. Availability and implementation: The code of GraNA is available at https://github.com/luo-group/GraNA. 
  3. Abstract Many of the greatest challenges facing society today likely have molecular solutions that await discovery. However, the process of identifying and manufacturing such molecules has remained slow and highly specialist dependent. Interfacing the fields of artificial intelligence (AI) and synthetic organic chemistry has the potential to powerfully address both limitations. The Molecule Maker Lab Institute (MMLI) brings together a team of chemists, engineers, and AI‐experts from the University of Illinois Urbana‐Champaign (UIUC), Pennsylvania State University, and the Rochester Institute of Technology, with the goal of accelerating the discovery, synthesis and manufacture of complex organic molecules. Advanced AI and machine learning (ML) methods are deployed in four key thrusts: (1) AI‐enabled synthesis planning, (2) AI‐enabled catalyst development, (3) AI‐enabled molecule manufacturing, and (4) AI‐enabled molecule discovery. The MMLI's new AI‐enabled synthesis platform integrates chemical and enzymatic catalysis with literature mining and ML to predict the best way to make new molecules with desirable biological and material properties. The MMLI is transforming chemical synthesis and generating use‐inspired AI advances. Simultaneously, the MMLI is also acting as a training ground for the next generation of scientists with combined expertise in chemistry and AI. Outreach efforts aimed toward high school students and the public are being used to show how AI‐enabled tools can help to make chemical synthesis accessible to nonexperts. 
  4. Proc. 2023 ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (Ed.)
    Instead of relying on human-annotated training samples to build a classifier, weakly supervised scientific paper classification aims to classify papers only using category descriptions (e.g., category names, category-indicative keywords). Existing studies on weakly supervised paper classification are less concerned with two challenges: (1) Papers should be classified into not only coarse-grained research topics but also fine-grained themes, and potentially into multiple themes, given a large and fine-grained label space; and (2) full text should be utilized to complement the paper title and abstract for classification. Moreover, instead of viewing the entire paper as a long linear sequence, one should exploit the structural information such as citation links across papers and the hierarchy of sections and paragraphs in each paper. To tackle these challenges, in this study, we propose FuTex, a framework that uses the cross-paper network structure and the in-paper hierarchy structure to classify full-text scientific papers under weak supervision. A network-aware contrastive fine-tuning module and a hierarchy-aware aggregation module are designed to leverage the two types of structural signals, respectively. Experiments on two benchmark datasets demonstrate that FuTex significantly outperforms competitive baselines and is on par with fully supervised classifiers that use 1,000 to 60,000 ground-truth training samples. 
  5. Abstract Machine learning has been increasingly used for protein engineering. However, because the general sequence contexts they capture are not specific to the protein being engineered, the accuracy of existing machine learning algorithms is rather limited. Here, we report ECNet (evolutionary context-integrated neural network), a deep-learning algorithm that exploits evolutionary contexts to predict functional fitness for protein engineering. This algorithm integrates local evolutionary context from homologous sequences that explicitly model residue-residue epistasis for the protein of interest with the global evolutionary context that encodes rich semantic and structural features from the enormous protein sequence universe. As such, it enables accurate mapping from sequence to function and provides generalization from low-order mutants to higher-order mutants. We show that ECNet predicts the sequence-function relationship more accurately as compared to existing machine learning algorithms by using ~50 deep mutational scanning and random mutagenesis datasets. Moreover, we used ECNet to guide the engineering of TEM-1 β-lactamase and identified variants with improved ampicillin resistance with high success rates. 
  6. Free, publicly-accessible full text available December 1, 2026
  7. Free, publicly-accessible full text available December 1, 2026
  8. Small molecule solutions to many contemporary societal challenges await discovery, but the artisanal and manual process via which this class of chemical matter is typically accessed limits the discovery of new functions. Automated assembly of (N‐methyl iminodiacetic acid) MIDA or (tetramethyl N‐methyl iminodiacetic acid) TIDA boronate building blocks via iterative C─C bond formation, an approach we call “block chemistry”, alternatively enables generalized and automated preparation of many different types of small molecules in a modular fashion. But in its current form, this engine cannot also leverage nitrogen atoms as iteration handles. Here, we disclose a new iteration‐enabling group, CbzT (p‐TIDA boronate‐substituted carboxybenzyl), that reversibly attenuates the reactivity of nitrogen atoms and enables generalized catch‐and‐release purification. CbzT is leveraged to achieve the automated modular synthesis of Imatinib (Gleevec), an archetypical clinically approved kinase inhibitor, in which building blocks are iteratively linked by both N─C and C─C bonds. This work substantially expands the types of small molecules that can be iteratively assembled in an automated modular fashion. It also advances the concept of intentionally developing chemistry that machines can do. 
    Free, publicly-accessible full text available August 11, 2026
  9. Raghunathan, Anu (Ed.)
    Computational pathway design and retro-biosynthetic approaches can facilitate the development of innovative biochemical production routes, biodegradation strategies, and the funneling of multiple precursors into a single bioproduct. However, effective pathway design necessitates a comprehensive understanding of biochemistries, enzyme activities, and thermodynamic feasibility. Herein, we introduce novoStoic2.0, an integrated platform that combines tools for estimating overall stoichiometry, designing de novo synthesis pathways, assessing thermodynamic feasibility, and selecting enzymes. novoStoic2.0 offers a unified web-based interface as a part of the AlphaSynthesis platform (http://novostoic.platform.moleculemaker.org/) tailored for the synthesis of thermodynamically viable pathways as well as the selection of enzymes for re-engineering required for novel reaction steps. We exemplify the utility of the platform to identify novel pathways for hydroxytyrosol synthesis, which are shorter than the known pathways and require reduced cofactor usage. In summary, novoStoic2.0 aims to streamline the process of pathway design contributing to the development of sustainable biotechnological solutions. 
    Free, publicly-accessible full text available August 6, 2026