Phase separation processes facilitate the formation of membrane-less organelles and involve interactions within structured domains and intrinsically disordered regions (IDRs) in protein sequences. The literature suggests that the involvement of proteins in phase separation can be predicted from their sequences, leading to the development of over 30 computational predictors. We focused on intrinsic disorder due to its fundamental role in related diseases, and because recent analysis has shown that phase separation can be accurately predicted for structured proteins. We evaluated eight representative amino acid-level predictors of phase separation, capable of identifying phase-separating IDRs, using a well-annotated, low-similarity test dataset under two complementary evaluation scenarios. Several methods generate accurate predictions in the easier scenario that includes both structured and disordered sequences. However, we demonstrate that modern disorder predictors perform equally well in this scenario by effectively differentiating phase-separating IDRs from structured regions. In the second, more challenging scenario—considering only predictions in disordered regions—disorder predictors underperform, and most phase separation predictors produce only modestly accurate results. Moreover, some predictors are broadly biased to classify disordered residues as phase-separating, which results in low predictive performance in this scenario. Finally, we recommend PSPHunter as the most accurate tool for identifying phase-separating IDRs in both scenarios.
Assessment of Disordered Linker Predictions in the CAID2 Experiment
Disordered linkers (DLs) are intrinsically disordered regions that facilitate movement between adjacent functional regions/domains, contributing to many key cellular functions. The recently completed second Critical Assessment of protein Intrinsic Disorder prediction (CAID2) experiment evaluated DL predictions under a rather narrow scenario, using 40 proteins that are already known to have DLs. We expand this evaluation by using a much larger set of nearly 350 test proteins from CAID2 and by investigating three distinct scenarios: (1) prediction of residues in DLs vs. in non-DL regions (typical use of DL predictors); (2) prediction of residues in DLs vs. other disordered residues (to evaluate whether predictors can differentiate residues in DLs from other types of intrinsically disordered residues); and (3) prediction of proteins harboring DLs. We find that several methods provide relatively accurate predictions of DLs in the first scenario. However, only one method, APOD, accurately identifies DLs among other types of disordered residues (scenario 2) and predicts proteins harboring DLs (scenario 3). We also find that APOD’s predictive performance is modest, motivating further research into the development of new and more accurate DL predictors. We note that these efforts will benefit from a growing amount of training data and the availability of sophisticated deep network models, and we emphasize that future methods should provide accurate results across all three scenarios.
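The three scenarios differ only in which residues (or proteins) count as positives and negatives. The following minimal Python sketch illustrates that difference on toy data. It is not the authors' evaluation code: the choice of AUC as the metric, the residue state labels ("DL", "IDR", "S"), the use of the maximum residue score as a protein-level score, and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# One protein = (per-residue predicted scores, per-residue states).
# States are hypothetical labels: "DL" = disordered linker,
# "IDR" = other disordered residue, "S" = structured residue.
Protein = Tuple[List[float], List[str]]


def auc(scores: List[float], labels: List[int]) -> float:
    """ROC AUC by pairwise comparison of positives and negatives (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def scenario1(proteins: Dict[str, Protein]) -> float:
    """DL residues vs. all non-DL residues."""
    scores, labels = [], []
    for res_scores, states in proteins.values():
        scores.extend(res_scores)
        labels.extend(1 if st == "DL" else 0 for st in states)
    return auc(scores, labels)


def scenario2(proteins: Dict[str, Protein]) -> float:
    """DL residues vs. other disordered residues; structured residues are excluded."""
    scores, labels = [], []
    for res_scores, states in proteins.values():
        for s, st in zip(res_scores, states):
            if st in ("DL", "IDR"):
                scores.append(s)
                labels.append(1 if st == "DL" else 0)
    return auc(scores, labels)


def scenario3(proteins: Dict[str, Protein]) -> float:
    """Proteins harboring DLs vs. proteins without DLs.

    Assumption: a protein-level score is the maximum residue-level score.
    """
    scores = [max(res_scores) for res_scores, _ in proteins.values()]
    labels = [1 if "DL" in states else 0 for _, states in proteins.values()]
    return auc(scores, labels)


# Toy data, invented for illustration only.
toy = {
    "P1": ([0.9, 0.8, 0.4, 0.1], ["DL", "DL", "IDR", "S"]),
    "P2": ([0.3, 0.85, 0.2], ["IDR", "IDR", "S"]),
}
s1, s2, s3 = scenario1(toy), scenario2(toy), scenario3(toy)
print(s1, s2, s3)
```

On the toy data, the score drops from scenario 1 to scenario 2 because removing the easily separated structured residues leaves a harder set of negatives, which mirrors the pattern the abstract describes.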
- PAR ID: 10536295
- Publisher / Repository: MDPI
- Date Published:
- Journal Name: Biomolecules
- Volume: 14
- Issue: 3
- ISSN: 2218-273X
- Page Range / eLocation ID: 287
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract: Dozens of impactful methods have been developed that predict intrinsically disordered regions (IDRs) in protein sequences that interact with proteins and/or nucleic acids. Their training and assessment rely on IDR‐level binding annotations, while the equivalent structure‐trained methods predict more granular annotations of binding amino acids (AA). We compiled a new benchmark dataset that annotates binding AA in IDRs and applied it to complete a first‐of‐its‐kind assessment of predictions of the disordered binding residues. We evaluated a representative collection of 14 methods, used several hundred low‐similarity test proteins, and focused on the challenging task of differentiating these binding residues from other disordered AA while considering ligand type‐specific predictions (protein–protein vs. protein–nucleic acid interactions). We found that current methods struggle to accurately predict binding IDRs among disordered residues; however, better‐than‐random tools predict disordered binding residues significantly better than binding IDRs. We identified at least one relatively accurate tool for predicting disordered protein‐binding and disordered nucleic acid‐binding AA. Analysis of cross‐predictions between interactions with proteins and nucleic acids revealed that most methods are ligand‐type‐agnostic. Only two predictors of the nucleic acid‐binding IDRs and two predictors of the protein‐binding IDRs can be considered ligand‐type‐specific. We also discussed several potential future directions that would move this field forward by producing more accurate methods that target the prediction of binding residues, reduce cross‐predictions, and cover a broader range of ligand types.
-
Abstract: Current predictors of DNA-binding residues (DBRs) from protein sequences belong to two distinct groups: those trained on binding annotations extracted from structured protein-DNA complexes (structure-trained) and those trained on intrinsically disordered proteins (disorder-trained). We complete the first empirical analysis of predictive performance across the structure- and disorder-annotated proteins for a representative collection of ten predictors. The majority of the structure-trained tools perform well on the structure-annotated proteins while doing relatively poorly on the disorder-annotated proteins, and vice versa. Several methods make accurate predictions for the structure-annotated proteins or the disorder-annotated proteins, but none performs highly accurately for both annotation types. Moreover, most predictors make excessive cross-predictions for the disorder-annotated proteins, where residues that interact with non-DNA ligand types are predicted as DBRs. Motivated by these results, we design, validate and deploy an innovative meta-model, hybridDBRpred, that uses a deep transformer network to combine predictions generated by the three best current predictors. HybridDBRpred provides accurate predictions and low levels of cross-predictions across the two annotation types, and is statistically more accurate than each of the ten tools and baseline meta-predictors that rely on averaging and logistic regression. We deploy hybridDBRpred as a convenient web server at http://biomine.cs.vcu.edu/servers/hybridDBRpred/ and provide the corresponding source code at https://github.com/jianzhang-xynu/hybridDBRpred.
-
Computational prediction of DNA-binding residues (DBRs) and RNA-binding residues (RBRs) in protein sequences is an active area of research, with about 90 predictors, 20 of which were published over the last two years. The new predictors rely on sophisticated deep neural networks and protein language models, produce accurate predictions, and are conveniently available as code and/or web servers. However, we identified a shortage of tools that predict these interactions in intrinsically disordered regions and of tools capable of predicting residues that interact with specific RNA and DNA types. Moreover, cross-predictions between RBRs and DBRs should be quantified and minimized to ensure that future tools accurately differentiate between these two distinct types of nucleic acids.
-
Abstract: Intrinsically disordered proteins, defying the traditional protein structure–function paradigm, are a challenge to study experimentally. Because a large part of our knowledge rests on computational predictions, it is crucial that their accuracy is high. The Critical Assessment of protein Intrinsic Disorder prediction (CAID) experiment was established as a community-based blind test to determine the state of the art in prediction of intrinsically disordered regions and the subset of residues involved in binding. A total of 43 methods were evaluated on a dataset of 646 proteins from DisProt. The best methods use deep learning techniques and notably outperform physicochemical methods. The top disorder predictor has Fmax = 0.483 on the full dataset and Fmax = 0.792 following filtering out of bona fide structured regions. Disordered binding regions remain hard to predict, with Fmax = 0.231. Interestingly, computing times among methods can vary by up to four orders of magnitude.
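For readers unfamiliar with the Fmax metric quoted above, it is commonly understood as the highest F1 score obtained over all score thresholds. The Python sketch below shows that generic definition on invented toy data. It is an assumption-laden illustration, not the CAID evaluation code: the abstract does not describe how residues are pooled or how thresholds are chosen, and the function name `f_max` is hypothetical.

```python
from typing import List


def f_max(scores: List[float], labels: List[int]) -> float:
    """Maximum F1 over thresholds; a residue is predicted positive if score >= threshold."""
    best = 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        if tp == 0:
            continue  # F1 is zero (or undefined) without true positives
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best


# Toy per-residue scores and binary disorder labels, invented for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.2]
toy_labels = [1, 0, 1, 0, 0]
toy_fmax = f_max(toy_scores, toy_labels)
print(round(toy_fmax, 3))
```

Because the threshold is chosen after the fact to maximize F1, Fmax is an optimistic summary of a predictor; it describes the best achievable operating point rather than performance at a preset cutoff.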

