

Title: Partner‐specific prediction of RNA‐binding residues in proteins: A critical assessment
Abstract

RNA‐protein interactions play essential roles in regulating gene expression. While some RNA‐protein interactions are “specific”, that is, the RNA‐binding proteins preferentially bind to particular RNA sequence or structural motifs, others are “non‐RNA specific.” Deciphering the protein‐RNA recognition code is essential for comprehending the functional implications of these interactions and for developing new therapies for many diseases. Because of the high cost of experimental determination of protein‐RNA interfaces, there is a need for computational methods to identify RNA‐binding residues in proteins. While most of the existing computational methods for predicting RNA‐binding residues in RNA‐binding proteins are oblivious to the characteristics of the partner RNA, there is growing interest in methods for partner‐specific prediction of RNA binding sites in proteins. In this work, we assess the performance of two recently published partner‐specific protein‐RNA interface prediction tools, PS‐PRIP and PRIdictor, along with our own new tools. Specifically, we introduce a novel metric, the RNA‐specificity metric (RSM), for quantifying the RNA‐specificity of the RNA binding residues predicted by such tools. Our results show that the RNA‐binding residues predicted by previously published methods are oblivious to the characteristics of the putative RNA binding partner. Moreover, when evaluated using partner‐agnostic metrics, RNA partner‐specific methods are outperformed by the state‐of‐the‐art partner‐agnostic methods. We conjecture that either (a) the protein‐RNA complexes in PDB are not representative of the protein‐RNA interactions in nature, or (b) the current methods for partner‐specific prediction of RNA‐binding residues in proteins fail to account for the differences in RNA partner‐specific versus partner‐agnostic protein‐RNA interactions, or both.

 
Award ID(s):
1640834
NSF-PAR ID:
10461980
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Proteins: Structure, Function, and Bioinformatics
Volume:
87
Issue:
3
ISSN:
0887-3585
Page Range / eLocation ID:
p. 198-211
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Binding sites in proteins can be either specifically functional binding sites (active sites) that bind specific substrates with high affinity or regulatory binding sites (allosteric sites) that modulate the activity of functional binding sites through effector molecules. Owing to their significance in determining protein function, the identification of protein functional and regulatory binding sites is widely acknowledged as an important biological problem. In this work, we present a novel binding site prediction method, Active and Regulatory site Prediction (AR‐Pred), which supplements protein geometry, evolutionary, and physicochemical features with information about protein dynamics to predict putative active and allosteric site residues. As the intrinsic dynamics of globular proteins plays an essential role in controlling binding events, we find it to be an important feature for the identification of protein binding sites. We train and validate our predictive models on multiple balanced training and validation sets with random forest machine learning and obtain an ensemble of discrete models for each prediction type. Our models for active site prediction yield a median area under the curve (AUC) of 91% and Matthews correlation coefficient (MCC) of 0.68, whereas the less well‐defined allosteric sites are predicted at a lower level with a median AUC of 80% and MCC of 0.48. When tested on an independent set of proteins, our models for active site prediction show comparable performance to two existing methods and gains compared to two others, while the allosteric site models show gains when tested against three existing prediction methods. AR‐Pred is available as a free downloadable package at https://github.com/sambitmishra0628/AR-PRED_source.

     
  2. Abstract

    The barley MLA nucleotide-binding leucine-rich-repeat (NLR) receptor and its orthologs confer recognition specificity to many fungal diseases, including powdery mildew, stem-, and stripe rust. We used interolog inference to construct a barley protein interactome (Hordeum vulgare predicted interactome, HvInt) comprising 66,133 edges and 7,181 nodes, as a foundation to explore signaling networks associated with MLA. HvInt was compared with the experimentally validated Arabidopsis interactome of 11,253 proteins and 73,960 interactions, verifying that the 2 networks share scale-free properties, including a power-law distribution and small-world network. Then, by successive layering of defense-specific “omics” datasets, HvInt was customized to model cellular response to powdery mildew infection. Integration of HvInt with expression quantitative trait loci (eQTL) enabled us to infer disease modules and responses associated with fungal penetration and haustorial development. Next, using HvInt and infection–time–course RNA sequencing of immune signaling mutants, we assembled resistant and susceptible subnetworks. The resulting differentially coexpressed (resistant – susceptible) interactome is essential to barley immunity, facilitates the flow of signaling pathways, and is linked to mildew resistance locus a (Mla) through trans eQTL associations. Lastly, we anchored HvInt with new and previously identified interactors of the MLA coiled coil + nucleotide-binding domains and extended these to additional MLA alleles, orthologs, and NLR outgroups to predict receptor localization and conservation of signaling response. These results link genomic, transcriptomic, and physical interactions during MLA-specified immunity.

     
  3. Musier-Forsyth, Karin (Ed.)
    RNA-binding proteins play crucial roles in various cellular functions, and contain abundant disordered protein regions. The disordered regions in RNA-binding proteins are rich in repetitive sequences, such as poly-K/R, poly-N/Q, poly-A, and poly-G residues. Our bioinformatic analysis identified a largely neglected repetitive sequence family we define as electronegative clusters (ENCs) that contain acidic residues and/or phosphorylation sites. The abundance and length of ENCs exceed those of other known repetitive sequences. Despite their abundance, the functions of ENCs in RNA-binding proteins are still elusive. To investigate the impacts of ENCs on protein stability, RNA-binding affinity, and specificity, we selected one RNA-binding protein, the ribosomal biogenesis factor 15 (Nop15), as a model. We found that the Nop15 ENC increases protein stability and inhibits nonspecific RNA binding, but minimally interferes with specific RNA binding. To investigate the effect of ENCs on sequence specificity of RNA binding, we grafted an ENC to another RNA-binding protein, Ser/Arg-rich splicing factor 3 (SRSF3). Using RNA Bind-n-Seq, we found that the engineered ENC inhibits disparate RNA motifs differently, instead of weakening all RNA motifs to the same extent. The motif site directly involved in electrostatic interaction is more susceptible to the ENC inhibition. These results suggest that one of the functions of ENCs is to regulate RNA binding via electrostatic interaction. This is consistent with our finding that ENCs are also overrepresented in DNA-binding proteins, while underrepresented in halophiles, in which nonspecific nucleic acid binding is inhibited by high concentrations of salts.
  4. Abstract

    Protein S-sulfenylation, which results from oxidation of free thiols on cysteine residues, has recently emerged as an important post-translational modification that regulates the structure and function of proteins involved in a variety of physiological and pathological processes. By altering the size and physiochemical properties of modified cysteine residues, sulfenylation can impact the cellular function of proteins in several different ways. Thus, the ability to rapidly and accurately identify putative sulfenylation sites in proteins will provide important insights into redox-dependent regulation of protein function in a variety of cellular contexts. Though bottom-up proteomic approaches, such as tandem mass spectrometry (MS/MS), provide a wealth of information about global changes in the sulfenylation state of proteins, MS/MS-based experiments are often labor-intensive, costly and technically challenging. Therefore, to complement existing proteomic approaches, researchers have developed a series of computational tools to identify putative sulfenylation sites on proteins. However, existing methods often suffer from low accuracy, specificity, and/or sensitivity. In this study, we developed SVM-SulfoSite, a novel sulfenylation prediction tool that uses support vector machines (SVM) to identify key determinants of sulfenylation among five feature classes: binary code, physiochemical properties, k-space amino acid pairs, amino acid composition and high-quality physiochemical indices. Using 10-fold cross-validation, SVM-SulfoSite achieved 95% sensitivity and 83% specificity, with an overall accuracy of 89% and Matthews correlation coefficient (MCC) of 0.79. Likewise, using an independent test set of experimentally identified sulfenylation sites, our method achieved scores of 74%, 62%, 80% and 0.42 for accuracy, sensitivity, specificity and MCC, respectively, with an area under the receiver operating characteristic (ROC) curve of 0.81. Moreover, in side-by-side comparisons, SVM-SulfoSite performed as well as or better than existing sulfenylation prediction tools. Together, these results suggest that our method represents a robust and complementary technique for advanced exploration of protein S-sulfenylation.
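
The sensitivity, specificity, accuracy, and MCC figures quoted in the abstract above are all functions of one confusion matrix, so they can be cross-checked against each other. The sketch below is illustrative only and is not taken from the SVM-SulfoSite paper. The class sizes are assumptions chosen because they reproduce the reported accuracies (equal numbers of positives and negatives for cross-validation, and one positive per two negatives for the independent test set), and the helper names are hypothetical.

```python
import math


def confusion_from_rates(sensitivity, specificity, n_pos, n_neg):
    """Build (tp, fp, tn, fn) from rates and ASSUMED class sizes."""
    tp = sensitivity * n_pos
    fn = n_pos - tp
    tn = specificity * n_neg
    fp = n_neg - tn
    return tp, fp, tn, fn


def accuracy(tp, fp, tn, fn):
    """Fraction of all sites that are classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Assumed class sizes, not stated in the abstract.
cross_val = confusion_from_rates(0.95, 0.83, n_pos=100, n_neg=100)
independent = confusion_from_rates(0.62, 0.80, n_pos=100, n_neg=200)

for name, cm in (("cross-validation", cross_val), ("independent", independent)):
    print(name, round(accuracy(*cm), 2), round(mcc(*cm), 2))
```

Under these assumed class ratios, the computed accuracy and MCC round to the values reported in the abstract (0.89 and 0.79 for cross-validation, 0.74 and 0.42 for the independent set). A different class ratio would change accuracy and MCC while leaving sensitivity and specificity fixed.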

     
  5. Abstract Motivation

    Alternative polyadenylation (polyA) sites near the 3′ end of a pre-mRNA create multiple mRNA transcripts with different 3′ untranslated regions (3′ UTRs). The sequence elements of a 3′ UTR are essential for many biological activities such as mRNA stability, sub-cellular localization, protein translation, protein binding and translation efficiency. Moreover, numerous studies in the literature have reported the correlation between diseases and the shortening (or lengthening) of 3′ UTRs. As alternative polyA sites are common in mammalian genes, several machine learning tools have been published for predicting polyA sites from sequence data. These tools either consider limited sequence features or use relatively old algorithms for polyA site prediction. Moreover, none of the previous tools consider RNA secondary structures as a feature to predict polyA sites.

    Results

    In this paper, we propose a new deep learning model, called DeepPASTA, for predicting polyA sites from both sequence and RNA secondary structure data. The model is then extended to predict tissue-specific polyA sites. Moreover, the tool can predict the most dominant (i.e. frequently used) polyA site of a gene in a specific tissue and relative dominance when two polyA sites of the same gene are given. Our extensive experiments demonstrate that DeepPASTA significantly outperforms the existing tools for polyA site prediction and tissue-specific relative and absolute dominant polyA site prediction.

    Availability and implementation

    https://github.com/arefeen/DeepPASTA

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     