

Title: Improved prediction of DNA and RNA binding proteins with deep learning models
Abstract

Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing as well as the prediction scopes in these studies have limited their applications. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning-based methods including both hierarchical and multi-class approaches to predict the types of NABPs for any given protein. The deep learning models employ two layers of convolutional neural network and one layer of long short-term memory. Our approaches outperform existing DBP and RBP predictors with a balanced prediction between DBPs and RBPs, and are more practically useful in identifying novel NABPs. The multi-class approach greatly improves the prediction accuracy of DBPs and RBPs, especially for the DBPs with ~12% improvement. Moreover, we explored the prediction accuracy of single-stranded DNA binding proteins and their effect on the overall prediction accuracy of NABP predictions.
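
The difference between the hierarchical and multi-class approaches mentioned in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the abstract does not spell out the stages or the class set, so the two-stage split (NABP vs. non-NABP, then DBP vs. RBP), the three labels, the 0.5 threshold, and the function names `hierarchical_predict` and `multiclass_predict` are all hypothetical. The probabilities stand in for the outputs of trained models such as the CNN plus LSTM networks the abstract describes.

```python
from typing import Dict

# Assumed label set for illustration; the paper's actual classes may differ.
LABELS = ("DBP", "RBP", "non-NABP")


def hierarchical_predict(p_nabp: float, p_dbp_given_nabp: float,
                         threshold: float = 0.5) -> str:
    """Two-stage decision (assumed layout).

    Stage 1 decides NABP vs. non-NABP from p_nabp.
    Stage 2 runs only for predicted NABPs and decides DBP vs. RBP.
    """
    if p_nabp < threshold:
        return "non-NABP"
    return "DBP" if p_dbp_given_nabp >= threshold else "RBP"


def multiclass_predict(probs: Dict[str, float]) -> str:
    """Single-stage decision: pick the label with the highest probability."""
    return max(LABELS, key=lambda label: probs[label])


if __name__ == "__main__":
    # Hypothetical model outputs for one protein.
    print(hierarchical_predict(p_nabp=0.8, p_dbp_given_nabp=0.1))
    print(multiclass_predict({"DBP": 0.2, "RBP": 0.7, "non-NABP": 0.1}))
```

In the hierarchical scheme an error in the first stage cannot be corrected by the second, whereas the multi-class scheme compares all classes in one step; which one performs better is an empirical question that the paper addresses.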

 
PAR ID: 10513546
Publisher / Repository: Oxford University Press
Journal Name: Briefings in Bioinformatics
Volume: 25
Issue: 4
ISSN: 1467-5463
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1.
    RNAs transmit information from DNA to encode proteins that perform all cellular processes and regulate gene expression in multiple ways. From the time of synthesis to degradation, RNA molecules are associated with proteins called RNA-binding proteins (RBPs). The RBPs play diverse roles in many aspects of gene expression including pre-mRNA processing and post-transcriptional and translational regulation. In the last decade, the application of modern techniques to identify RNA–protein interactions with individual proteins, RNAs, and the whole transcriptome has led to the discovery of a hidden landscape of these interactions in plants. Global approaches such as RNA interactome capture (RIC) to identify proteins that bind protein-coding transcripts have led to the identification of close to 2000 putative RBPs in plants. Interestingly, many of these were found to be metabolic enzymes with no known canonical RNA-binding domains. Here, we review the methods used to analyze RNA–protein interactions in plants thus far and highlight the understanding of plant RNA–protein interactions these techniques have provided us. We also review some recent protein-centric, RNA-centric, and global approaches developed with non-plant systems and discuss their potential application to plants. We also provide an overview of results from classical studies of RNA–protein interaction in plants and discuss the significance of the increasingly evident ubiquity of RNA–protein interactions for the study of gene regulation and RNA biology in plants. 
  2. Abstract

    Motivation

    Human immunodeficiency virus type 1 (HIV-1) genome integration is closely related to clinical latency and viral rebound. In addition to human DNA sequences that directly interact with the integration machinery, the selection of HIV integration sites has also been shown to depend on the heterogeneous genomic context around a large region, which greatly hinders the prediction and mechanistic studies of HIV integration.

    Results

    We have developed an attention-based deep learning framework, named DeepHINT, to simultaneously provide accurate prediction of HIV integration sites and mechanistic explanations of the detected sites. Extensive tests on a high-density HIV integration site dataset showed that DeepHINT can outperform conventional modeling strategies by automatically learning the genomic context of HIV integration from primary DNA sequence alone or together with epigenetic information. Systematic analyses on diverse known factors of HIV integration further validated the biological relevance of the prediction results. More importantly, in-depth analyses of the attention values output by DeepHINT revealed intriguing mechanistic implications in the selection of HIV integration sites, including potential roles of several DNA-binding proteins. These results established DeepHINT as an effective and explainable deep learning framework for the prediction and mechanistic study of HIV integration.

    Availability and implementation

    DeepHINT is available as open-source software and can be downloaded from https://github.com/nonnerdling/DeepHINT.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract

    Protein language models (pLMs) trained on a large corpus of protein sequences have shown unprecedented scalability and broad generalizability in a wide range of predictive modeling tasks, but their power has not yet been harnessed for predicting protein–nucleic acid binding sites, critical for characterizing the interactions between proteins and nucleic acids. Here, we present EquiPNAS, a new pLM-informed E(3) equivariant deep graph neural network framework for improved protein–nucleic acid binding site prediction. By combining the strengths of pLM and symmetry-aware deep graph learning, EquiPNAS consistently outperforms the state-of-the-art methods for both protein–DNA and protein–RNA binding site prediction on multiple datasets across a diverse set of predictive modeling scenarios ranging from using experimental input to AlphaFold2 predictions. Our ablation study reveals that the pLM embeddings used in EquiPNAS are sufficiently powerful to dramatically reduce the dependence on the availability of evolutionary information without compromising on accuracy, and that the symmetry-aware nature of the E(3) equivariant graph-based neural architecture offers remarkable robustness and performance resilience. EquiPNAS is freely available at https://github.com/Bhattacharya-Lab/EquiPNAS.

     
  4. Abstract

    New drug production, from target identification to marketing approval, takes over 12 years and can cost around $2.6 billion. Furthermore, the COVID-19 pandemic has unveiled the urgent need for more powerful computational methods for drug discovery. Here, we review the computational approaches to predicting protein–ligand interactions in the context of drug discovery, focusing on methods using artificial intelligence (AI). We begin with a brief introduction to proteins (targets), ligands (e.g. drugs) and their interactions for nonexperts. Next, we review databases that are commonly used in the domain of protein–ligand interactions. Finally, we survey and analyze the machine learning (ML) approaches implemented to predict protein–ligand binding sites, ligand-binding affinity and binding pose (conformation) including both classical ML algorithms and recent deep learning methods. After exploring the correlation between these three aspects of protein–ligand interaction, it has been proposed that they should be studied in unison. We anticipate that our review will aid exploration and development of more accurate ML-based prediction strategies for studying protein–ligand interactions.
  5. Abstract: Newer technologies such as data mining, machine learning, artificial intelligence and data analytics have revolutionized the medical sector by using existing big data to predict patterns emerging from the datasets available in healthcare repositories. Predictions based on existing healthcare datasets have yielded several benefits, such as helping clinicians make accurate and informed decisions when managing patients’ health, leading to better management of patients’ wellbeing and healthcare coordination. Millions of people have been affected by coronary artery disease (CAD). Several machine learning algorithms, including ensemble learning approaches and deep neural network-based algorithms, have shown promising outcomes in improving prediction accuracy for early diagnosis of CAD. This paper analyses the deep neural network variants Deep Residual Network (DRN), Rider Optimization Algorithm-Neural Network (RideNN) and Deep Neural Network-Fuzzy Neural Network (DNFN), with an ensemble learning method applied to improve the prediction accuracy of CAD. The experimental outcomes showed that the proposed ensemble classifier achieved the highest accuracy compared with the other machine learning models. Keywords: Heart disease prediction, Deep Residual Network (DRN), Ensemble classifiers, coronary artery disease.