
Title: EnsembleSplice: ensemble deep learning model for splice site prediction
Abstract Background

Identifying splice site regions is an important step in the genomic DNA sequencing pipelines of biomedical and pharmaceutical research. Efficient and accurate splice site detection is therefore highly desirable, and a variety of computational models have been developed toward this end. Neural network architectures have recently been shown to outperform classical machine learning approaches for the task of splice site prediction. Despite these advances, there is still considerable potential for improvement, especially in model prediction accuracy and error rate.


Given these deficits, we propose EnsembleSplice, an ensemble learning architecture that combines four (4) distinct convolutional neural network (CNN) model architectures and outperforms existing splice site detection methods on the experimental evaluation metrics considered, including accuracy and error rate. We trained and tested a variety of ensembles made up of CNNs and DNNs using five-fold cross-validation to identify the model that performed best across the evaluation and diversity metrics. As a result, we developed a diverse and highly effective splice site (SS) detection model, which we evaluated using two (2) genomic Homo sapiens datasets and the Arabidopsis thaliana dataset. The results showed that, for one of the Homo sapiens datasets, EnsembleSplice achieved accuracies of 94.16% for acceptor splice sites and 95.97% for donor splice sites, with corresponding error rates on the same Homo sapiens dataset of 5.84% for acceptor splice sites and 4.03% for donor splice sites.
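
The five-fold cross-validation procedure and the accuracy and error-rate figures above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`k_fold_indices`, `cross_validate`, `majority_baseline`), the toy labels, and the majority-label baseline are all hypothetical stand-ins for a real splice site model, and the error rate is assumed to be the complement of accuracy, which is consistent with the reported pairs (94.16% and 5.84%, 95.97% and 4.03%).

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Shuffle sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cross_validate(labels, fit_predict, k=5, seed=0):
    """Return one (accuracy, error_rate) pair per held-out fold.

    fit_predict(train_idx, test_idx) is a hypothetical callback that trains
    on train_idx and returns one predicted label per index in test_idx.
    """
    folds = k_fold_indices(len(labels), k, seed)
    results = []
    for held_out, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        preds = fit_predict(train_idx, test_idx)
        acc = accuracy([labels[j] for j in test_idx], preds)
        results.append((acc, 1.0 - acc))  # error rate is the complement of accuracy
    return results


# Toy stand-in for a real model: always predict the majority training label.
toy_labels = [1] * 60 + [0] * 40


def majority_baseline(train_idx, test_idx):
    ones = sum(toy_labels[j] for j in train_idx)
    majority = 1 if ones * 2 >= len(train_idx) else 0
    return [majority] * len(test_idx)


fold_results = cross_validate(toy_labels, majority_baseline)
for fold, (acc, err) in enumerate(fold_results, start=1):
    print(f"fold {fold}: accuracy={acc:.2%} error={err:.2%}")
```

Comparing the per-fold accuracies is one simple way to check that performance is consistent across folds; a real evaluation would replace `majority_baseline` with the trained ensemble.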


Five-fold cross-validation ensured that the prediction accuracy of our models is consistent. For reproducibility, all the datasets used, models generated, and results in our work are publicly available in our GitHub repository here:

Publisher / Repository:
Springer Science + Business Media
Journal Name:
BMC Bioinformatics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Binding sites in proteins can be either specifically functional binding sites (active sites) that bind specific substrates with high affinity or regulatory binding sites (allosteric sites) that modulate the activity of functional binding sites through effector molecules. Owing to their significance in determining protein function, the identification of protein functional and regulatory binding sites is widely acknowledged as an important biological problem. In this work, we present a novel binding site prediction method, Active and Regulatory site Prediction (AR‐Pred), which supplements protein geometry, evolutionary, and physicochemical features with information about protein dynamics to predict putative active and allosteric site residues. As the intrinsic dynamics of globular proteins play an essential role in controlling binding events, we find them to be an important feature for the identification of protein binding sites. We train and validate our predictive models on multiple balanced training and validation sets with random forest machine learning and obtain an ensemble of discrete models for each prediction type. Our models for active site prediction yield a median area under the curve (AUC) of 91% and Matthews correlation coefficient (MCC) of 0.68, whereas the less well‐defined allosteric sites are predicted at a lower level with a median AUC of 80% and MCC of 0.48. When tested on an independent set of proteins, our models for active site prediction show comparable performance to two existing methods and gains compared to two others, while the allosteric site models show gains when tested against three existing prediction methods. AR‐Pred is available as a free downloadable package at

  2. Abstract Background

    Protein S-nitrosylation (SNO) plays a key role in transferring nitric oxide-mediated signals in both animals and plants and has emerged as an important mechanism for regulating protein functions and cell signaling of all main classes of protein. It is involved in several biological processes, including immune response, protein stability, transcription regulation, post-translational regulation, DNA damage repair, and redox regulation, and is an emerging paradigm of redox signaling for protection against oxidative stress. The development of robust computational tools to predict protein SNO sites would contribute to further interpretation of the pathological and physiological mechanisms of SNO.


    Using an intermediate fusion-based stacked generalization approach, we integrated embeddings from a supervised embedding layer and a contextualized protein language model (ProtT5) and developed a tool called pLMSNOSite (protein language model-based SNO site predictor). On an independent test set of experimentally identified SNO sites, pLMSNOSite achieved values of 0.340, 0.735, and 0.773 for MCC, sensitivity, and specificity, respectively. These results show that pLMSNOSite performs better than the compared approaches for the prediction of S-nitrosylation sites.


    Together, the experimental results suggest that pLMSNOSite achieves significant improvement in the prediction performance of S-nitrosylation sites and represents a robust computational approach for predicting protein S-nitrosylation sites. pLMSNOSite could be a useful resource for further elucidation of SNO and is publicly available at

  3. Abstract

    Immunotherapies have shown promising results in treating patients with hematological malignancies like multiple myeloma, which is an incurable but treatable bone marrow-resident plasma cell cancer. Choosing the most efficacious treatment for a patient remains a challenge in such cancers. However, pre-clinical assays involving patient-derived tumor cells co-cultured in an ex vivo reconstruction of the immune-tumor micro-environment have gained considerable notoriety over the past decade. Such assays can characterize a patient’s response to several therapeutic agents, including immunotherapies, in a high-throughput manner, where bright-field images of tumor (target) cells interacting with effector cells (T cells, Natural Killer (NK) cells, and macrophages) are captured once every 30 minutes for up to six days. Cell detection, tracking, and classification of thousands of cells of two or more types in each frame is bound to test the limits of some of the most advanced computer vision tools developed to date and requires a specialized approach. We propose TLCellClassifier (time-lapse cell classifier) for live cell detection, cell tracking, and cell type classification, with enhanced accuracy and efficiency obtained by integrating convolutional neural networks (CNN), metric learning, and long short-term memory (LSTM) networks, respectively. State-of-the-art computer vision software like KTH-SE and YOLOv8 are compared with TLCellClassifier, which shows improved accuracy in detection (CNN) and tracking (metric learning). A two-stage LSTM-based cell type classification method is implemented to distinguish between multiple myeloma (tumor/target) cells and macrophages/monocytes (immune/effector cells). Validation of cell type classification was done using both synthetic datasets and ex vivo experiments involving patient-derived tumor/immune cells.

    Availability and implementation classification ml

  4. Abstract Background

    Fusion of RNA-binding proteins (RBPs) to RNA base-editing enzymes (such as APOBEC1 or ADAR) has emerged as a powerful tool for the discovery of RBP binding sites. However, current methods that analyze sequencing data from RNA-base editing experiments are vulnerable to false positives due to off-target editing, genetic variation and sequencing errors.


    We present FLagging Areas of RNA-editing Enrichment (FLARE), a Snakemake-based pipeline that builds on the outputs of the SAILOR edit site discovery tool to identify regions statistically enriched for RNA editing. FLARE can be configured to analyze any type of RNA editing, including C-to-U and A-to-I. We applied FLARE to C-to-U editing data from an RBFOX2-APOBEC1 STAMP experiment to show that our approach attains high specificity for detecting RBFOX2 binding sites. We also applied FLARE to detect regions of exogenously introduced as well as endogenous A-to-I editing.


    FLARE is a fast and flexible workflow that identifies significantly edited regions from RNA-seq data. The FLARE codebase is available at

  5. Li, Jinyan (Ed.)

    Artificial intelligence-powered protein structure prediction methods have led to a paradigm-shift in computational structural biology, yet contemporary approaches for predicting the interfacial residues (i.e., sites) of protein-protein interaction (PPI) still rely on experimental structures. Recent studies have demonstrated benefits of employing graph convolution for PPI site prediction, but ignore symmetries naturally occurring in 3-dimensional space and act only on experimental coordinates. Here we present EquiPPIS, an E(3) equivariant graph neural network approach for PPI site prediction. EquiPPIS employs symmetry-aware graph convolutions that transform equivariantly with translation, rotation, and reflection in 3D space, providing richer representations for molecular data compared to invariant convolutions. EquiPPIS substantially outperforms state-of-the-art approaches based on the same experimental input, and exhibits remarkable robustness by attaining better accuracy with predicted structural models from AlphaFold2 than what existing methods can achieve even with experimental structures. Freely available at, EquiPPIS enables accurate PPI site prediction at scale.
