Protein language models learn evolutionary statistics of interacting sequence motifs

Zhang, Zhidian; Wayment-Steele, Hannah K; Brixi, Garyk; Wang, Haobo; Kern, Dorothee; Ovchinnikov, Sergey

doi:10.1073/pnas.2406285121

This content will become publicly available on November 5, 2025

Protein language models (pLMs) have emerged as potent tools for predicting and designing protein structure and function, and the degree to which these models fundamentally understand the inherent biophysics of protein structure stands as an open question. Motivated by a finding that pLM-based structure predictors erroneously predict nonphysical structures for protein isoforms, we investigated the nature of sequence context needed for contact predictions in the pLM Evolutionary Scale Modeling (ESM-2). We demonstrate by use of a “categorical Jacobian” calculation that ESM-2 stores statistics of coevolving residues, analogously to simpler modeling approaches like Markov Random Fields and Multivariate Gaussian models. We further investigated how ESM-2 “stores” information needed to predict contacts by comparing sequence masking strategies, and found that providing local windows of sequence information allowed ESM-2 to best recover predicted contacts. This suggests that pLMs predict contacts by storing motifs of pairwise contacts. Our investigation highlights the limitations of current pLMs and underscores the importance of understanding the underlying mechanisms of these models.
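
The "categorical Jacobian" idea mentioned in the abstract can be illustrated with a minimal sketch: substitute each token at each position, record how the model's output logits change at every other position, and summarize each position pair by the size of that change. Everything below is an assumption made for illustration. `toy_logits` is a hypothetical stand-in for a pLM with one planted coevolving pair, not ESM-2 or its API, and the norm-plus-symmetrization summary is one plausible reduction, since this record does not give the paper's exact post-processing.

```python
import math

ALPHABET = 4  # toy alphabet size (assumption); real proteins use 20 amino acids


def toy_logits(seq, coupled=((0, 3),)):
    """Hypothetical stand-in for a pLM: returns an L x ALPHABET logit table.

    Logits at one position depend on the token at another position only when
    the two positions form a pair in `coupled`, mimicking coevolving residues.
    """
    logits = [[0.0] * ALPHABET for _ in seq]
    for i, j in coupled:
        for src, dst in ((i, j), (j, i)):
            # Favour, at the partner position, the token currently seen at src.
            logits[dst][seq[src]] += 1.0
    return logits


def categorical_jacobian(model, seq):
    """J[i][a][j][b]: change in logit b at position j when position i is set to token a."""
    length = len(seq)
    base = model(seq)
    jac = [[None] * ALPHABET for _ in range(length)]
    for i in range(length):
        for a in range(ALPHABET):
            mutated = list(seq)
            mutated[i] = a
            out = model(mutated)
            jac[i][a] = [
                [out[j][b] - base[j][b] for b in range(ALPHABET)]
                for j in range(length)
            ]
    return jac


def coupling_scores(jac):
    """Reduce each (i, j) block to a Frobenius norm, then symmetrize (assumed summary)."""
    length = len(jac)
    raw = [
        [
            math.sqrt(sum(jac[i][a][j][b] ** 2
                          for a in range(ALPHABET) for b in range(ALPHABET)))
            for j in range(length)
        ]
        for i in range(length)
    ]
    return [
        [0.0 if i == j else 0.5 * (raw[i][j] + raw[j][i]) for j in range(length)]
        for i in range(length)
    ]


seq = [0, 1, 2, 3, 1]  # arbitrary toy sequence of token indices
scores = coupling_scores(categorical_jacobian(toy_logits, seq))
```

In this toy setting only the planted pair (positions 0 and 3) receives a nonzero score, which is the sense in which such a calculation can expose pairwise coupling statistics stored in a model. With a real pLM, `model` would wrap a forward pass that returns per-position logits.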

Award ID(s): 2032259

PAR ID: 10631241

Author(s) / Creator(s): Zhang, Zhidian; Wayment-Steele, Hannah K; Brixi, Garyk; Wang, Haobo; Kern, Dorothee; Ovchinnikov, Sergey

Publisher / Repository: PNAS

Date Published: 2024-11-05

Journal Name: Proceedings of the National Academy of Sciences

Volume: 121

Issue: 45

ISSN: 0027-8424

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1073/pnas.2406285121