This content will become publicly available on October 1, 2026

Title: Prediction of nucleic acid binding residues in protein sequences: Recent advances and future prospects
Computational prediction of DNA-binding residues (DBRs) and RNA-binding residues (RBRs) in protein sequences is an active area of research, with about 90 predictors, 20 of which were published over the last two years. The new predictors rely on sophisticated deep neural networks and protein language models, produce accurate predictions, and are conveniently available as code and/or web servers. However, we identified a shortage of tools that predict these interactions in intrinsically disordered regions and of tools capable of predicting residues that interact with specific RNA and DNA types. Moreover, cross-predictions between RBRs and DBRs should be quantified and minimized to ensure that future tools accurately differentiate between these two distinct types of nucleic acids.
Award ID(s):
2125218 2146027
PAR ID:
10646941
Author(s) / Creator(s):
; ;
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
Diabetes metabolic syndrome
ISSN:
1878-0334
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups: those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) and those trained on annotations from intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
  2. Abstract Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area. 
  3. Abstract Dozens of impactful methods have been developed to predict intrinsically disordered regions (IDRs) in protein sequences that interact with proteins and/or nucleic acids. Their training and assessment rely on the IDR‐level binding annotations, while the equivalent structure‐trained methods predict more granular annotations of binding amino acids (AA). We compiled a new benchmark dataset that annotates binding AA in IDRs and applied it to complete a first‐of‐its‐kind assessment of predictions of the disordered binding residues. We evaluated a representative collection of 14 methods, used several hundred low‐similarity test proteins, and focused on the challenging task of differentiating these binding residues from other disordered AA and considering ligand type‐specific predictions (protein–protein vs. protein–nucleic acid interactions). We found that current methods struggle to accurately predict binding IDRs among disordered residues; however, better‐than‐random tools predict disordered binding residues significantly better than binding IDRs. We identified at least one relatively accurate tool for predicting disordered protein‐binding and disordered nucleic acid‐binding AA. Analysis of cross‐predictions between interactions with protein and nucleic acids revealed that most methods are ligand‐type‐agnostic. Only two predictors of the nucleic acid‐binding IDRs and two predictors of the protein‐binding IDRs can be considered ligand‐type‐specific. We also discussed several potential future directions that would move this field forward by producing more accurate methods that target the prediction of binding residues, reduce cross‐predictions, and cover a broader range of ligand types.
  4. Disordered binding regions (DBRs), which are embedded within intrinsically disordered proteins or regions (IDPs or IDRs), enable IDPs or IDRs to mediate multiple protein-protein interactions. DBR-protein complexes were collected from the Protein Data Bank for which two or more DBRs having different amino acid sequences bind to the same (100% sequence identical) globular protein partner, a type of interaction herein called many-to-one binding. Two distinct binding profiles were identified: independent and overlapping. For the overlapping binding profiles, the distinct DBRs either interact by means of almost identical binding sites (herein called “similar”), or their binding sites contain both common and divergent interaction residues (herein called “intersecting”). Further analysis of the sequence and structural differences among these three groups indicates how IDP flexibility allows different segments to adjust to similar, intersecting, and independent binding pockets.
  5. Abstract RNA‐protein interactions play essential roles in regulating gene expression. While some RNA‐protein interactions are “specific”, that is, the RNA‐binding proteins preferentially bind to particular RNA sequence or structural motifs, others are “non‐RNA specific.” Deciphering the protein‐RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein‐RNA interfaces, there is a need for computational methods to identify RNA‐binding residues in proteins. While most of the existing computational methods for predicting RNA‐binding residues in RNA‐binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner‐specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner‐specific protein‐RNA interface prediction tools, PS‐PRIP, and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, RNA‐specificity metric (RSM), for quantifying the RNA‐specificity of the RNA binding residues predicted by such tools. Our results show that the RNA‐binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner‐agnostic metrics, RNA partner‐specific methods are outperformed by the state‐of‐the‐art partner‐agnostic methods. We conjecture that either (a) the protein‐RNA complexes in PDB are not representative of the protein‐RNA interactions in nature, or (b) the current methods for partner‐specific prediction of RNA‐binding residues in proteins fail to account for the differences in RNA partner‐specific versus partner‐agnostic protein‐RNA interactions, or both. 