Abstract Dozens of impactful methods have been developed to predict intrinsically disordered regions (IDRs) in protein sequences that interact with proteins and/or nucleic acids. Their training and assessment rely on IDR‐level binding annotations, whereas the equivalent structure‐trained methods predict more granular annotations of binding amino acids (AAs). We compiled a new benchmark dataset that annotates binding AAs in IDRs and used it to complete a first‐of‐its‐kind assessment of predictions of disordered binding residues. We evaluated a representative collection of 14 methods on several hundred low‐similarity test proteins, focusing on the challenging task of differentiating these binding residues from other disordered AAs and on ligand type‐specific predictions (protein–protein vs. protein–nucleic acid interactions). We found that current methods struggle to accurately predict binding IDRs among disordered residues; however, better‐than‐random tools predict disordered binding residues significantly better than binding IDRs. We identified at least one relatively accurate tool each for predicting disordered protein‐binding and disordered nucleic acid‐binding AAs. Analysis of cross‐predictions between interactions with proteins and nucleic acids revealed that most methods are ligand‐type‐agnostic. Only two predictors of nucleic acid‐binding IDRs and two predictors of protein‐binding IDRs can be considered ligand‐type‐specific. We also discuss several potential future directions that would move this field forward by producing more accurate methods that target the prediction of binding residues, reduce cross‐predictions, and cover a broader range of ligand types.
DR-BERT: A protein language model to annotate disordered regions
Despite their lack of a rigid structure, intrinsically disordered regions (IDRs) in proteins play important roles in cellular functions, including mediating protein-protein interactions. Therefore, it is important to computationally annotate IDRs with high accuracy. In this study, we present Disordered Region prediction using Bidirectional Encoder Representations from Transformers (DR-BERT), a compact protein language model. Unlike most popular tools, DR-BERT is pretrained on unannotated proteins and trained to predict IDRs without relying on explicit evolutionary or biophysical data. Despite this, DR-BERT demonstrates significant improvement over existing methods on the Critical Assessment of protein Intrinsic Disorder (CAID) evaluation dataset and outperforms competitors on two out of four test cases in the CAID 2 dataset, while maintaining competitiveness in the others. This performance is due to the information learned during pretraining and DR-BERT’s ability to use contextual information.
- Award ID(s):
- 2107344
- PAR ID:
- 10549924
- Publisher / Repository:
- Cell Press
- Date Published:
- Journal Name:
- Structure
- Volume:
- 32
- Issue:
- 8
- ISSN:
- 0969-2126
- Page Range / eLocation ID:
- 1260 to 1268.e3
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Phase separation processes facilitate the formation of membrane-less organelles and involve interactions within structured domains and intrinsically disordered regions (IDRs) in protein sequences. The literature suggests that the involvement of proteins in phase separation can be predicted from their sequences, leading to the development of over 30 computational predictors. We focused on intrinsic disorder due to its fundamental role in related diseases, and because recent analysis has shown that phase separation can be accurately predicted for structured proteins. We evaluated eight representative amino acid-level predictors of phase separation, capable of identifying phase-separating IDRs, using a well-annotated, low-similarity test dataset under two complementary evaluation scenarios. Several methods generate accurate predictions in the easier scenario that includes both structured and disordered sequences. However, we demonstrate that modern disorder predictors perform equally well in this scenario by effectively differentiating phase-separating IDRs from structured regions. In the second, more challenging scenario, which considers only predictions in disordered regions, disorder predictors underperform, and most phase separation predictors produce only modestly accurate results. Moreover, some predictors are broadly biased to classify disordered residues as phase-separating, which results in low predictive performance in this scenario. Finally, we recommend PSPHunter as the most accurate tool for identifying phase-separating IDRs in both scenarios.
-
Intrinsically disordered regions (IDRs) are important components of protein functionality, with their charge distribution serving as a key factor in determining their roles. Notably, many proteins possess IDRs that are highly negatively charged, characterized by sequences rich in aspartate (D) or glutamate (E) residues. Bioinformatic analyses indicate that negatively charged low-complexity IDRs are significantly more common than their positively charged counterparts rich in arginine (R) or lysine (K). For instance, sequences of 10 or more consecutive negatively charged residues (D or E) are present in 268 human proteins. In contrast, corresponding sequences of 10 or more consecutive positively charged residues (K or R) are present in only 12 human proteins. Interestingly, about 50% of proteins containing D/E tracts function as DNA-binding or RNA-binding proteins. Negatively charged IDRs can electrostatically mimic nucleic acids and dynamically compete with them for the DNA-binding domains (DBDs) or RNA-binding domains (RBDs) that are positively charged. This leads to a phenomenon known as autoinhibition, in which the negatively charged IDRs inhibit binding to nucleic acids by occupying the binding interfaces within the proteins through intramolecular interactions. Rather than merely reducing binding activity, negatively charged IDRs offer significant advantages for the functions of DNA/RNA-binding proteins. The dynamic competition between negatively charged IDRs and nucleic acids can accelerate the target search processes for these proteins. When a protein encounters DNA or RNA, the electrostatic repulsion force between the nucleic acids and the negatively charged IDRs can trigger conformational changes that allow the nucleic acids to access DBDs or RBDs. Additionally, when proteins are trapped at high-affinity non-target sites on DNA or RNA ("decoys"), the electrostatic repulsion from the negatively charged IDRs can rescue the proteins from these traps. 
Negatively charged IDRs act as gatekeepers, rejecting nonspecific ligands while allowing the target to access the molecular interfaces of DBDs or RBDs, which increases binding specificity. These IDRs can also promote proper protein folding, facilitate chromatin remodeling by displacing other proteins bound to DNA, and influence phase separation, affecting local pH. The functions of negatively charged IDRs can be regulated through protein-protein interactions, post-translational modifications, and proteolytic processing. These characteristics can be harnessed as tools for protein engineering. Some frame-shift mutations that convert negatively charged IDRs into positively charged ones are linked to human diseases. Therefore, it is crucial to understand the physicochemical properties and functional roles of negatively charged IDRs that compete with nucleic acids.
-
Intrinsically disordered regions (IDRs) carry out many cellular functions and vary in length and placement in protein sequences. This diversity leads to variations in the underlying compositional biases, which were demonstrated for short vs. long IDRs. We analyze compositional biases across four classes of disorder: fully disordered proteins; short IDRs; long IDRs; and binding IDRs. We identify three distinct biases: for the fully disordered proteins, for the short IDRs, and for the long and binding IDRs combined. We also investigate the compositional bias of putative disorder produced by leading disorder predictors and find that it is similar to the bias of the native disorder. Interestingly, the accuracy of disorder predictions across different methods is correlated with the correctness of the compositional bias of their predictions, highlighting the importance of the compositional bias. The predictive quality is relatively low for the disorder classes whose compositional bias differs most from the “generic” disorder bias, while being much higher for the classes with the most similar bias. We discover that different predictors perform best across different classes of disorder. This suggests that no single predictor is universally best and motivates the development of new architectures that combine models that target specific disorder classes.
-
Disordered binding regions (DBRs), which are embedded within intrinsically disordered proteins or regions (IDPs or IDRs), enable IDPs or IDRs to mediate multiple protein-protein interactions. DBR-protein complexes were collected from the Protein Data Bank for which two or more DBRs having different amino acid sequences bind to the same (100% sequence identical) globular protein partner, a type of interaction herein called many-to-one binding. Two distinct binding profiles were identified: independent and overlapping. For the overlapping binding profiles, the distinct DBRs interact by means of almost identical binding sites (herein called “similar”), or the binding sites contain both common and divergent interaction residues (herein called “intersecting”). Further analysis of the sequence and structural differences among these three groups indicates how IDP flexibility allows different segments to adjust to similar, intersecting, and independent binding pockets.
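
The abstract above on negatively charged IDRs counts proteins containing runs of 10 or more consecutive D/E residues (or K/R residues). As an illustration only, such a scan might be sketched as follows. The function name `find_charged_tracts`, its parameters, and the example sequence are hypothetical and are not taken from any of the listed works, whose actual procedures may differ.

```python
import re


def find_charged_tracts(sequence, residues="DE", min_length=10):
    """Return (start, end, tract) for each run of at least `min_length`
    consecutive residues drawn from `residues`.

    Positions are 0-based and `end` is exclusive. This is a hypothetical
    helper for illustration, not the method used in the cited study.
    """
    # Build a pattern such as "[DE]{10,}": a character class repeated
    # min_length or more times.
    pattern = re.compile(f"[{re.escape(residues)}]{{{min_length},}}")
    return [(m.start(), m.end(), m.group())
            for m in pattern.finditer(sequence.upper())]


# Made-up sequence: a 12-residue D/E tract and a 5-residue K tract.
example = "MKT" + "DEDEDEDEDEDE" + "GAS" + "KKKKK"
negative_tracts = find_charged_tracts(example, residues="DE")
positive_tracts = find_charged_tracts(example, residues="KR")
```

Here `negative_tracts` holds the single D/E tract, while `positive_tracts` is empty because the K run is shorter than the 10-residue threshold. Counting proteins rather than tracts would only require checking whether the returned list is non-empty for each sequence.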

