A More Informative and Reproducible Remote Homology Evaluation for Protein Language Models

Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda


Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance, one that supports detailed analyses of performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three state-of-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.

Award ID(s): 2310113

PAR ID: 10526869

Author(s) / Creator(s): Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda

Publisher / Repository: LLMs4Bio

Date Published: 2024-02-26

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Workshop Report: The DOI is not currently available.
