Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three stateof-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.
more »
« less
Evaluating the Significance of Embedding-Based Protein Sequence Alignment with Clustering and Double Dynamic Programming for Remote Homology
Accurate detection of protein sequence homology is essential for understanding evolutionary relationships and predicting protein functions, particularly for detecting remote homology in the “twilight zone” (20-35% sequence similarity), where traditional sequence alignment methods often fail. Recent studies show that embeddings from protein language models (pLM) can improve remote homology detection over traditional methods. Alignment-based approaches, such as those combining pLMs with dynamic programming alignment, further improve performance but often suffer from noise in the resulting similarity matrices. To address this, we evaluate a newly developed embedding-based sequence alignment approach that refines residue-level embedding similarity using K-means clustering and double dynamic programming (DDP). We show that the incorporation of clustering and DDP contributes substantially to the improved performance in detecting remote homology. Experimental results demonstrate that our approach outperforms both traditional and state-of-the-art approaches based on pLMs on several benchmarks. Our study illustrates embedding-based alignment refined with clustering and DDP offers a powerful approach for identifying remote homology, with potential to evolve further as pLMs continue to advance.
more »
« less
- Award ID(s):
- 2435093
- PAR ID:
- 10645242
- Publisher / Repository:
- bioRxiv
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three stateof-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.more » « less
-
Gogovi, Gideon (Ed.)Abstract MotivationProtein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. ResultsWe address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementationWe believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.more » « less
-
Protein language models (PLMs) are often assumed to capture evolutionary information by training on large protein sequence datasets. Yet it remains unclear whether PLMs can reason about evolution—that is, infer evolutionary relationships between sequences. We test this capability by evaluating whether standard PLM usage, frozen or fine-tuned embeddings with distance-based comparison, supports evolutionary reasoning. Existing PLMs consistently fail to recover phylogenetic structure, despite strong performance on sequence-level tasks such as masked-token and contact prediction. We present PHYLA, a hybrid state-space and transformer model that jointly processes multiple sequences and is trained using a tree-based objective across 3,000 phylogenies spanning diverse protein families. PHYLA outperforms the next-best PLM by 9% on tree reconstruction and 23% on taxonomic clustering while remaining alignment- and guide-tree-free. Although classical alignment pipelines achieve higher absolute accuracy, PHYLA narrows the gap and achieves markedly lower end-to-end runtime. Applied to real data, PHYLA reconstructs biologically accurate clades in the tree of life and resolves genome-scale relationships among Mycobacterium tuberculosis isolates. These findings suggest that, under standard usage, evolutionary reasoning does not reliably emerge from large-scale sequence modeling. Instead, PHYLA shows that models trained with phylogenetic supervision can reason about evolution more effectively, offering a biologically grounded path toward evolutionary foundation models.more » « less
-
null (Ed.)Sequence-based protein homology detection has emerged as one of the most sensitive and accurate approaches to protein structure prediction. Despite the success, homology detection remains very challenging for weakly homologous proteins with divergent evolutionary profile. Very recently, deep neural network architectures have shown promising progress in mining the coevolutionary signal encoded in multiple sequence alignments, leading to reasonably accurate estimation of inter-residue interaction maps, which serve as a rich source of additional information for improved homology detection. Here, we summarize the latest developments in protein homology detection driven by inter-residue interaction map threading. We highlight the emerging trends in distant-homology protein threading through the alignment of predicted interaction maps at various granularities ranging from binary contact maps to finer-grained distance and orientation maps as well as their combination. We also discuss some of the current limitations and possible future avenues to further enhance the sensitivity of protein homology detection.more » « less
An official website of the United States government

