Accurate prediction of nucleic acid binding proteins using protein language model

Wu, Siwen; Xu, Jinbo; Guo, Jun-tao

doi:10.1093/bioadv/vbaf008

Abstract

Motivation: Nucleic acid binding proteins (NABPs) play critical roles in many essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these methods are of limited use for predicting the NABP type of an arbitrary protein with unknown function, owing to factors such as dataset construction, prediction scope, and the features used for training and testing. In addition, single-stranded DNA binding proteins (SSBs), a subclass of DNA binding proteins (DBPs), have not been extensively investigated with the aim of identifying novel SSBs among proteins with unknown functions.

Results: To improve the prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models that combine machine learning methods with a feature extracted from the protein language model ESM2. Our results show that combining the ESM2 feature with machine learning methods achieves prediction accuracy of up to 95% at each stage of the hierarchical approach and an overall prediction accuracy of 85% for the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict SSBs, which remain underexplored.

Availability and implementation: The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
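The hierarchical approach mentioned in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the three-stage layout (NABP vs. non-NABP, then DBP vs. RBP, then SSB vs. double-stranded DBP), the stage names, the nearest-centroid classifier (a stand-in for whatever machine learning method the authors used), and the 3-dimensional toy vectors (placeholders for ESM2 embeddings, which are much higher-dimensional).

```python
from math import dist
from statistics import fmean

def fit_centroids(vectors, labels):
    """Nearest-centroid 'training': average the vectors of each label."""
    groups = {}
    for vec, lab in zip(vectors, labels):
        groups.setdefault(lab, []).append(vec)
    return {lab: [fmean(col) for col in zip(*vecs)] for lab, vecs in groups.items()}

def predict(centroids, vec):
    """Return the label whose centroid is closest to vec (Euclidean distance)."""
    return min(centroids, key=lambda lab: dist(centroids[lab], vec))

def build_stage(data, relabel, next_stage):
    """Train one stage on the leaf classes listed in relabel (hypothetical helper)."""
    vecs = [vec for vec, leaf in data if leaf in relabel]
    labs = [relabel[leaf] for vec, leaf in data if leaf in relabel]
    return {"centroids": fit_centroids(vecs, labs), "next": next_stage}

def predict_hierarchical(stages, vec, start="nabp_vs_other"):
    """Walk the cascade until a stage returns a label with no follow-up stage."""
    stage = start
    while True:
        label = predict(stages[stage]["centroids"], vec)
        following = stages[stage]["next"].get(label)
        if following is None:
            return label
        stage = following

# Toy vectors standing in for per-protein embeddings (assumption, not real data).
toy_data = [
    ([0.0, 0.0, 0.0], "non-NABP"), ([0.1, 0.0, 0.0], "non-NABP"),
    ([1.0, 0.0, 0.0], "RBP"),      ([1.0, 0.1, 0.0], "RBP"),
    ([1.0, 1.0, 0.0], "dsDBP"),    ([1.0, 1.0, 0.1], "dsDBP"),
    ([1.0, 1.0, 1.0], "SSB"),      ([1.0, 0.9, 1.0], "SSB"),
]

# Assumed stage layout: each stage merges leaf classes into two coarser labels.
stages = {
    "nabp_vs_other": build_stage(
        toy_data,
        {"non-NABP": "non-NABP", "RBP": "NABP", "dsDBP": "NABP", "SSB": "NABP"},
        {"NABP": "dbp_vs_rbp"},
    ),
    "dbp_vs_rbp": build_stage(
        toy_data,
        {"RBP": "RBP", "dsDBP": "DBP", "SSB": "DBP"},
        {"DBP": "ssb_vs_dsdbp"},
    ),
    "ssb_vs_dsdbp": build_stage(toy_data, {"dsDBP": "dsDBP", "SSB": "SSB"}, {}),
}

if __name__ == "__main__":
    print(predict_hierarchical(stages, [1.0, 1.0, 0.9]))
```

A multi-class variant would instead fit a single classifier over all leaf labels at once. The cascade trades that simplicity for stage-specific decisions, each of which can be evaluated separately, as the per-stage accuracies quoted in the abstract suggest.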
