Evaluating the Significance of Embedding-Based Protein Sequence Alignment with Clustering and Double Dynamic Programming for Remote Homology

Spicer, Robert; Raychawdhary, Nilanjana; Seals, Cheryl; Bhattacharya, Sutanu (ORCID: 0000-0002-3780-5174)

doi:10.1101/2025.07.28.666913

Citation Details

Accurate detection of protein sequence homology is essential for understanding evolutionary relationships and predicting protein function, particularly for remote homology in the “twilight zone” (20-35% sequence similarity), where traditional sequence alignment methods often fail. Recent studies show that embeddings from protein language models (pLMs) can improve remote homology detection over traditional methods. Alignment-based approaches, such as those combining pLMs with dynamic programming alignment, further improve performance but often suffer from noise in the resulting similarity matrices. To address this, we evaluate a newly developed embedding-based sequence alignment approach that refines residue-level embedding similarity using K-means clustering and double dynamic programming (DDP). We show that incorporating clustering and DDP contributes substantially to the improved performance in detecting remote homology. Experimental results demonstrate that our approach outperforms both traditional and state-of-the-art pLM-based approaches on several benchmarks. Our study illustrates that embedding-based alignment refined with clustering and DDP offers a powerful approach for identifying remote homology, with the potential to evolve further as pLMs continue to advance.
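As a rough illustration of the baseline idea the abstract builds on (dynamic programming alignment over a residue-level embedding similarity matrix), the sketch below may help. It is not the authors' method: the K-means clustering and DDP refinement steps are omitted because this record does not describe them in detail. The function names, the cosine similarity measure, the linear gap penalty of -0.5, and the toy 2-dimensional vectors standing in for pLM embeddings are all assumptions made for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(emb_a, emb_b):
    """Residue-by-residue similarity matrix from two lists of per-residue embeddings."""
    return [[cosine(u, v) for v in emb_b] for u in emb_a]

def global_align(sim, gap=-0.5):
    """Needleman-Wunsch global alignment over a precomputed similarity matrix.

    Returns (score, pairs), where pairs lists the aligned (i, j) residue indices.
    The linear gap penalty is a hypothetical choice, not a value from the paper.
    """
    n, m = len(sim), len(sim[0])
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sim[i - 1][j - 1],  # align residue i with residue j
                score[i - 1][j] + gap,                    # gap in sequence B
                score[i][j - 1] + gap,                    # gap in sequence A
            )
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if score[i][j] == score[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i -= 1
            j -= 1
        elif score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return score[n][m], pairs

# Toy stand-ins for per-residue pLM embeddings (real embeddings have many more dimensions).
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[1.0, 0.0], [1.0, 1.0]]
score, pairs = global_align(similarity_matrix(emb_a, emb_b))
print(score, pairs)
```

In this baseline, every entry of the similarity matrix feeds directly into the alignment score, which is where noisy similarities can mislead the alignment; the abstract's clustering and DDP steps are described as a refinement of that matrix.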

Award ID(s): 2435093

PAR ID: 10645242

Author(s) / Creator(s): Spicer, Robert; Raychawdhary, Nilanjana; Seals, Cheryl; Bhattacharya, Sutanu

Publisher / Repository: bioRxiv

Date Published: 2025-07-31

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Posted Content:
https://doi.org/10.1101/2025.07.28.666913
