In the twilight zone of protein sequence homology: do protein language models learn protein structure?

Kabir, Anowarul (ORCID:0000000180602084); Moldwin, Asher (ORCID:0000000292155653); Bromberg, Yana (ORCID:0000000283510844); Shehu, Amarda; Gogovi, ed., Gideon

doi:10.1093/bioadv/vbae119

Abstract MotivationProtein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. ResultsWe address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementationWe believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.

More Like this