

This content will become publicly available on April 1, 2026

Title: A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks
Predicting the evolutionary patterns of emerging and endemic viruses is key for mitigating their spread. In particular, it is critical to rapidly identify mutations with the potential for immune escape or increased disease burden. Knowing which circulating mutations pose a concern can inform treatment or mitigation strategies such as alternative vaccines or targeted social distancing. In 2021, Hie et al. (Hie B, Zhong ED, Berger B, Bryson B. 2021 Learning the language of viral evolution and escape. Science 371, 284–288. doi:10.1126/science.abd7331) proposed that variants of concern can be identified using two quantities extracted from protein language models, grammaticality and semantic change. These quantities are defined by analogy to concepts from natural language processing. Grammaticality is intended to be a measure of whether a variant viral protein is viable, and semantic change is intended to be a measure of potential for immune escape. Here, we systematically test this hypothesis, taking advantage of several high-throughput datasets that have become available, and also comparing this model with several more recently published machine learning models. We find that grammaticality can be a measure of protein viability, though methods that are trained explicitly to predict mutational effects appear to be more effective. By contrast, we do not find compelling evidence that semantic change is a useful tool for identifying immune escape mutations.
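The two quantities named in the abstract can be illustrated with a minimal sketch. It assumes the definitions commonly attributed to Hie et al. (2021), which this abstract does not spell out: grammaticality as the language model's probability of the substituted residue at the mutated position, semantic change as the L1 distance between wild-type and mutant sequence embeddings, and a combined ranking formed as a weighted rank sum. All function names, the weight `beta`, and the array inputs are hypothetical, and no real language model is called.

```python
import numpy as np

def grammaticality(probs, pos, aa_index):
    """Assumed definition: model probability of residue `aa_index` at position `pos`.

    `probs` is a (sequence_length, alphabet_size) array of per-position
    probabilities that a language model would emit for the wild-type context.
    """
    return float(np.asarray(probs)[pos, aa_index])

def semantic_change(z_wt, z_mut):
    """Assumed definition: L1 distance between wild-type and mutant embeddings."""
    return float(np.abs(np.asarray(z_mut) - np.asarray(z_wt)).sum())

def rank_ascending(values):
    """Return ranks starting at 1 for the smallest value (stable tie-breaking)."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=int)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks

def combined_score(sem_changes, grammaticalities, beta=1.0):
    """Weighted rank sum; higher scores flag candidates for closer inspection."""
    return rank_ascending(sem_changes) + beta * rank_ascending(grammaticalities)
```

Under these assumptions, a mutation scores highly only when it ranks well on both quantities, which is why the abstract's finding that semantic change carries little signal matters for the combined ranking.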
Award ID(s):
2505865
PAR ID:
10631132
Publisher / Repository:
Journal of The Royal Society Interface
Journal Name:
Journal of The Royal Society Interface
Volume:
22
Issue:
225
ISSN:
1742-5662
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Mosquitoes can change their feeding behaviours based on past experiences, such as shifting from biting animals to biting humans or avoiding defensive hosts (Wolff & Riffell 2018 J. Exp. Biol. 221, jeb157131. (doi:10.1242/jeb.157131)). Dopamine is a critical neuromodulator for insects, allowing flexibility in their feeding preferences, but its role in the primary olfactory centre, the antennal lobe (AL), remains unclear (Vinauger et al. 2018 Curr. Biol. 28, 333–344.e8. (doi:10.1016/j.cub.2017.12.015)). It is also unknown whether mosquitoes can learn some odours and not others, or whether different species learn the same odour cues. We assayed aversive olfactory learning in four mosquito species with different host preferences, and found that they differentially learn odours salient to their preferred host. Mosquitoes that prefer humans learned odours found in mammalian skin, but not a flower odour, and a nectar-feeding species only learned a floral odour. Comparing the brains of these four species revealed significantly different innervation patterns in the AL by dopaminergic neurons. Calcium imaging in the Aedes aegypti AL and three-dimensional image analyses of dopaminergic innervation show that glomeruli tuned to learnable odours have significantly higher dopaminergic innervation. Changes in dopamine expression in the insect AL may be an evolutionary mechanism to adapt olfactory learning circuitry without changing brain structure and confer to mosquitoes an ability to adapt to new hosts.
  2. Top-down rather than bottom-up change. The Larsen-B Ice Shelf in Antarctica collapsed in 2002 because of a regional increase in surface temperature. This finding, reported by Rebesco et al., will surprise many who supposed that the shelf's disintegration probably occurred because of thinning of the ice shelf and the resulting loss of support by the sea floor beneath it. The authors mapped the sea floor beneath the ice shelf before it fell apart, which revealed that the modern ice sheet grounding line was established around 12,000 years ago and has since remained unchanged. If the ice shelf did not collapse because of thinning from below, then it must have been caused by warming from above. Science, this issue p. 1354
  3. Abstract: Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2’s pre-trained model. We have made ISM’s structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM.
  4. Abstract: Background: Protein S-nitrosylation (SNO) plays a key role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as an important mechanism for regulating protein functions and cell signaling of all main classes of protein. It is involved in several biological processes including immune response, protein stability, transcription regulation, post-translational regulation, DNA damage repair, redox regulation, and is an emerging paradigm of redox signaling for protection against oxidative stress. The development of robust computational tools to predict protein SNO sites would contribute to further interpretation of the pathological and physiological mechanisms of SNO. Results: Using an intermediate fusion-based stacked generalization approach, we integrated embeddings from a supervised embedding layer and a contextualized protein language model (ProtT5) and developed a tool called pLMSNOSite (protein language model-based SNO site predictor). On an independent test set of experimentally identified SNO sites, pLMSNOSite achieved values of 0.340, 0.735 and 0.773 for MCC, sensitivity and specificity respectively. These results show that pLMSNOSite performs better than the compared approaches for the prediction of S-nitrosylation sites. Conclusion: Together, the experimental results suggest that pLMSNOSite achieves significant improvement in the prediction performance of S-nitrosylation sites and represents a robust computational approach for predicting protein S-nitrosylation sites. pLMSNOSite could be a useful resource for further elucidation of SNO and is publicly available at https://github.com/KCLabMTU/pLMSNOSite.
  5. Abstract: The glycosylation on the spike (S) protein of the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the virus that causes COVID-19, modulates the viral infection by altering conformational dynamics, receptor interaction and host immune responses. Several variants of concern (VOCs) of SARS-CoV-2 have evolved during the pandemic, and crucial mutations on the S protein of the virus have led to increased transmissibility and immune escape. In this study, we compare the site-specific glycosylation and overall glycomic profiles of the wild type Wuhan-Hu-1 strain (WT) S protein and five VOCs of SARS-CoV-2: Alpha, Beta, Gamma, Delta and Omicron. Interestingly, both N- and O-glycosylation sites on the S protein are highly conserved among the spike mutant variants, particularly at the sites on the receptor-binding domain (RBD). The conservation of glycosylation sites is noteworthy, as over 2 million SARS-CoV-2 S protein sequences have been reported with various amino acid mutations. Our detailed profiling of the glycosylation at each of the individual sites of the S protein across the variants revealed an intriguing possible association between the glycosylation pattern of the variants and their previously reported infectivity. While the sites are conserved, we observed changes in the N- and O-glycosylation profile across the variants. The newly emerged variants, which showed higher resistance to neutralizing antibodies and vaccines, displayed a decrease in the overall abundance of complex-type glycans with both fucosylation and sialylation and an increase in the oligomannose-type glycans across the sites. Among the variants, the glycosylation sites with significant changes in glycan profile were observed at both the N-terminal domain and RBD of S protein, with Omicron showing the highest deviation. The increase in oligomannose-type glycans happens sequentially from Alpha through Delta. Interestingly, Omicron does not contain more oligomannose-type glycans compared to Delta but does contain more compared to the WT and other VOCs. O-glycosylation at the RBD showed lower occupancy in the VOCs in comparison to the WT. Our study on the sites and pattern of glycosylation on the SARS-CoV-2 S proteins across the VOCs may help to understand how the virus evolved to trick the host immune system. Our study also highlights how the SARS-CoV-2 virus has conserved both N- and O-glycosylation sites on the S protein of the most successful variants even after undergoing extensive mutations, suggesting a correlation between infectivity/transmissibility and glycosylation.