Distilling Structural Representations into Protein Sequence Models

Ouyang-Zhang, Jeffrey; Gong, Chengyue; Zhao, Yue; Krähenbühl, Philipp; Klivans, Adam R; Diaz, Daniel J

doi:10.1101/2024.11.08.622579

Citation Details

Abstract: Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2’s pre-trained model. We have made ISM’s structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM.

Award ID(s): 2505865

PAR ID: 10631086

Author(s) / Creator(s): Ouyang-Zhang, Jeffrey; Gong, Chengyue; Zhao, Yue; Krähenbühl, Philipp; Klivans, Adam R; Diaz, Daniel J

Publisher / Repository: bioRxiv

Date Published: 2024-11-11

Format(s): Medium: X

Institution: bioRxiv

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Posted Content:
https://doi.org/10.1101/2024.11.08.622579
