NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

In the twilight zone of protein sequence homology: do protein language models learn protein structure?

https://doi.org/10.1093/bioadv/vbae119

Kabir, Anowarul; Moldwin, Asher; Bromberg, Yana; Shehu, Amarda; Gogovi, ed., Gideon (August 2024, Bioinformatics Advances)

Abstract MotivationProtein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. ResultsWe address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementationWe believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
more » « less
Better AI For Understanding Life on Earth: Predict First, Design Later

https://doi.org/10.1137/1.9781611978520.48

Bromberg, Yana; Shehu, Amarda (January 2025, Society for Industrial and Applied Mathematics)

Free, publicly-accessible full text available January 1, 2026
Variant Effect Prediction in the Age of Machine Learning

https://doi.org/10.1101/cshperspect.a041467

Bromberg, Yana; Prabakaran, R; Kabir, Anowarul; Shehu, Amarda (July 2024, Cold Spring Harbor Perspectives in Biology)

Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Canmachines learn the language of life fromthe unannotated protein sequence data well enough to identify significant errors in the protein “sentences”? Our analysis suggests that some unsupervised methods perform as well or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, their performance varies by both evaluation metrics and by the type of variant effect being predicted.We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.
more » « less
Full Text Available

Search for: All records