Studying miRNA activity at the single-cell level is challenging because existing single-cell technologies are limited in their ability to capture miRNAs. To address this, we introduce two deep learning models, cross-modality (CM) and single-modality (SM), both based on encoder-decoder architectures. These models predict miRNA expression at both bulk and single-cell levels using mRNA data. We evaluated the performance of CM and SM against the state-of-the-art miRSCAPE approach, using both bulk and single-cell datasets. Our results demonstrate that both CM and SM outperform miRSCAPE in accuracy. Furthermore, incorporating miRNA target information substantially enhanced performance compared to models that utilized all genes. These models provide powerful tools for predicting miRNA expression from single-cell mRNA data.
A deep learning method to integrate extracellular miRNA with mRNA for cancer studies
Abstract. Motivation: Extracellular miRNAs (exmiRs) and intracellular mRNAs both can serve as promising biomarkers and therapeutic targets for various diseases. However, exmiR expression data is often noisy, and obtaining intracellular mRNA expression data usually involves intrusive procedures. To gain valuable insights into disease mechanisms, it is thus essential to improve the quality of exmiR expression data and develop noninvasive methods for assessing intracellular mRNA expression. Results: We developed CrossPred, a deep-learning multi-encoder model for the cross-prediction of exmiRs and mRNAs. Utilizing contrastive learning, we created a shared embedding space to integrate exmiRs and mRNAs. This shared embedding was then used to predict intracellular mRNA expression from noisy exmiR data and to predict exmiR expression from intracellular mRNA data. We evaluated CrossPred on three types of cancers and assessed its effectiveness in predicting the expression levels of exmiRs and mRNAs. CrossPred outperformed the baseline encoder-decoder model, exmiR or mRNA-based models, and variational autoencoder models. Moreover, the integration of exmiR and mRNA data uncovered important exmiRs and mRNAs associated with cancer. Our study offers new insights into the bidirectional relationship between mRNAs and exmiRs. Availability and implementation: The datasets and tool are available at https://doi.org/10.5281/zenodo.13891508.
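The abstract above does not say which contrastive objective CrossPred uses. As a minimal sketch of the general idea only, the block below computes a symmetric InfoNCE-style loss over paired exmiR and mRNA embeddings from the same samples. The function names, the temperature value, and the choice of InfoNCE are all assumptions made for illustration, not the published method.

```python
import math

def _normalize(v):
    # Scale a vector to unit length (guarding against an all-zero vector).
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def _cross_entropy_diag(sim):
    # Mean cross-entropy where the correct column for row i is column i.
    total = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        total += log_z - row[i]
    return total / len(sim)

def contrastive_loss(exmir_emb, mrna_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss (assumed objective, hypothetical names).

    exmir_emb[i] and mrna_emb[i] are embeddings of the same sample; matching
    pairs are pulled together and mismatched pairs are pushed apart.
    """
    a = [_normalize(v) for v in exmir_emb]
    b = [_normalize(v) for v in mrna_emb]
    # Cosine similarity of every exmiR embedding with every mRNA embedding.
    sim = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_diag(sim) + _cross_entropy_diag(sim_t))
```

With this sketch, a batch whose paired embeddings point in the same direction yields a loss near zero, while a batch whose pairs are swapped yields a large loss; a real model would minimize such a loss with respect to the encoder weights.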
- Award ID(s):
- 2120907
- PAR ID:
- 10555215
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 40
- Issue:
- 11
- ISSN:
- 1367-4811
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract. Background: Alzheimer’s Disease (AD) is a widespread neurodegenerative disease with Mild Cognitive Impairment (MCI) acting as an interim phase between normal cognitive state and AD. The irreversible nature of AD and the difficulty in early prediction present significant challenges for patients, caregivers, and the healthcare sector. Deep learning (DL) methods such as Recurrent Neural Networks (RNN) have been utilized to analyze Electronic Health Records (EHR) to model disease progression and predict diagnosis. However, these models do not address some inherent irregularities in EHR data such as irregular time intervals between clinical visits. Furthermore, most DL models are not interpretable. To address these issues, we developed a novel DL architecture called Time‐Aware RNN (TA‐RNN) to predict MCI to AD conversion at the next clinical visit. Method: TA‐RNN comprises a time embedding layer, an attention‐based RNN, and a prediction layer based on a multi‐layer perceptron (MLP) (Figure 1). For interpretability, a dual‐level attention mechanism within the RNN identifies significant visits and features impacting predictions. TA‐RNN addresses irregular time intervals by incorporating time embedding into longitudinal cognitive and neuroimaging data based on attention weights to create a patient embedding. The MLP, trained on demographic data and the patient embedding, predicts AD conversion. TA‐RNN was evaluated on the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and National Alzheimer’s Coordinating Center (NACC) datasets based on F2 score and sensitivity. Result: Multiple TA‐RNN models were trained with two, three, five, or six visits to predict the diagnosis at the next visit. In one setup, the models were trained and tested on ADNI. In another setup, the models were trained on the entire ADNI dataset and evaluated on the entire NACC dataset. The results indicated superior performance of TA‐RNN compared to state‐of‐the‐art (SOTA) and baseline approaches for both setups (Figure 2A and 2B). Based on attention weights, we also highlighted significant visits (Figure 3A) and features (Figure 3B) and observed that the CDRSB and FAQ features and the most recent visit had the highest influence on predictions. Conclusion: We propose TA‐RNN, an interpretable model to predict MCI to AD conversion while handling irregular time intervals. TA‐RNN outperformed SOTA and baseline methods in multiple experiments.
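The TA‐RNN abstract reports F2 score and sensitivity as its evaluation metrics. As a small sketch of what those metrics measure, the block below computes both from confusion-matrix counts using the standard F-beta definition with beta = 2, which weights recall (sensitivity) more heavily than precision. The function names and the example counts are illustrative and do not come from the paper.

```python
def f_beta(tp, fp, fn, beta=2.0):
    """F-beta score from true positives, false positives, and false negatives.

    With beta = 2 this is the F2 score, which emphasizes recall.
    Returns 0.0 when the score is undefined (no positives at all).
    """
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0

def sensitivity(tp, fn):
    """Sensitivity (recall): fraction of actual positives that were detected."""
    total = tp + fn
    return tp / total if total else 0.0

# Illustrative counts only: 8 converters caught, 2 false alarms, 2 missed.
print(f_beta(8, 2, 2))    # F2 score
print(sensitivity(8, 2))  # sensitivity
```

Favoring recall in this way fits a screening setting, where missing a patient who will convert to AD is usually considered costlier than a false alarm.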
-
Abstract. Motivation: Recent advancements in natural language processing have highlighted the effectiveness of global contextualized representations from protein language models (pLMs) in numerous downstream tasks. Nonetheless, strategies to encode the site-of-interest leveraging pLMs for per-residue prediction tasks, such as crotonylation (Kcr) prediction, remain largely uncharted. Results: Herein, we adopt a range of approaches for utilizing pLMs by experimenting with different input sequence types (full-length protein sequence versus window sequence), assessing the implications of utilizing per-residue embedding of the site-of-interest as well as embeddings of window residues centered around it. Building upon these insights, we developed a novel residual ConvBiLSTM network designed to process window-level embeddings of the site-of-interest generated by the ProtT5-XL-UniRef50 pLM using full-length sequences as input. This model, termed T5ResConvBiLSTM, surpasses existing state-of-the-art Kcr predictors in performance across three diverse datasets. To validate our approach of utilizing full sequence-based window-level embeddings, we also delved into the interpretability of ProtT5-derived embedding tensors in two ways: firstly, by scrutinizing the attention weights obtained from the transformer’s encoder block; and secondly, by computing SHAP values for these tensors, providing a model-agnostic interpretation of the prediction results. Additionally, we enhance the latent representation of ProtT5 by incorporating two additional local representations, one derived from amino acid properties and the other from a supervised embedding layer, through an intermediate fusion stacked generalization approach, using an n-mer window sequence (or peptide/fragment). The resultant stacked model, dubbed LMCrot, exhibits a more pronounced improvement in predictive performance across the tested datasets. Availability and implementation: LMCrot is publicly available at https://github.com/KCLabMTU/LMCrot.
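The abstract above distinguishes full-length sequences from window sequences centered on the site of interest. As a rough sketch of the windowing step alone, the block below extracts a fixed-length window around each lysine (K) in a protein sequence. The flank size, the "X" padding character, and the function names are assumptions made for illustration and are not taken from LMCrot.

```python
def window_around_site(sequence, site_index, flank=15, pad="X"):
    """Return the (2 * flank + 1)-residue window centered on site_index.

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention; the actual tool may handle ends differently).
    """
    if not 0 <= site_index < len(sequence):
        raise IndexError("site_index outside sequence")
    left = sequence[max(0, site_index - flank):site_index]
    right = sequence[site_index + 1:site_index + 1 + flank]
    return (pad * (flank - len(left)) + left + sequence[site_index]
            + right + pad * (flank - len(right)))

def lysine_windows(sequence, flank=15):
    """List (index, window) pairs for every lysine in the sequence."""
    return [(i, window_around_site(sequence, i, flank))
            for i, residue in enumerate(sequence) if residue == "K"]
```

In the window-level embedding setting described above, the same index range would instead be used to slice the per-residue embeddings of the full-length sequence, so that each residue's representation still reflects its whole-protein context.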
-
Abstract. Motivation: Alternative polyadenylation (polyA) sites near the 3′ end of a pre-mRNA create multiple mRNA transcripts with different 3′ untranslated regions (3′ UTRs). The sequence elements of a 3′ UTR are essential for many biological activities such as mRNA stability, sub-cellular localization, protein translation, protein binding and translation efficiency. Moreover, numerous studies in the literature have reported the correlation between diseases and the shortening (or lengthening) of 3′ UTRs. As alternative polyA sites are common in mammalian genes, several machine learning tools have been published for predicting polyA sites from sequence data. These tools either consider limited sequence features or use relatively old algorithms for polyA site prediction. Moreover, none of the previous tools consider RNA secondary structures as a feature to predict polyA sites. Results: In this paper, we propose a new deep learning model, called DeepPASTA, for predicting polyA sites from both sequence and RNA secondary structure data. The model is then extended to predict tissue-specific polyA sites. Moreover, the tool can predict the most dominant (i.e. frequently used) polyA site of a gene in a specific tissue and relative dominance when two polyA sites of the same gene are given. Our extensive experiments demonstrate that DeepPASTA significantly outperforms the existing tools for polyA site prediction and tissue-specific relative and absolute dominant polyA site prediction. Availability and implementation: https://github.com/arefeen/DeepPASTA. Supplementary information: Supplementary data are available at Bioinformatics online.
-
Abstract. Bacteria use a multi-layered regulatory strategy to precisely and rapidly tune gene expression in response to environmental cues. Small RNAs (sRNAs) form an important layer of gene expression control and most act post-transcriptionally to control translation and stability of mRNAs. We have shown that at least five different sRNAs in Escherichia coli regulate the cyclopropane fatty acid synthase (cfa) mRNA. These sRNAs bind at different sites in the long 5’ untranslated region (UTR) of cfa mRNA and previous work suggested that they modulate RNase E-dependent mRNA turnover. Recently, the cfa 5’ UTR was identified as a site of Rho-dependent transcription termination, leading us to hypothesize that the sRNAs might also regulate cfa transcription elongation. In this study we find that a pyrimidine-rich region flanked by sRNA binding sites in the cfa 5’ UTR is required for premature Rho-dependent termination. We discovered that both the activating sRNA RydC and repressing sRNA CpxQ regulate cfa primarily by modulating Rho-dependent termination of cfa transcription, with only a minor effect on RNase E-mediated turnover of cfa mRNA. A stem-loop structure in the cfa 5’ UTR sequesters the pyrimidine-rich region required for Rho-dependent termination. CpxQ binding to the 5’ portion of the stem increases Rho-dependent termination whereas RydC binding downstream of the stem decreases termination. These results reveal the versatile mechanisms sRNAs use to regulate target gene expression at transcriptional and post-transcriptional levels and demonstrate that regulation by sRNAs in long UTRs can involve modulation of transcription elongation. Importance: Bacteria respond to stress by rapidly regulating gene expression. Regulation can occur through control of messenger RNA (mRNA) production (transcription elongation), stability of mRNAs, or translation of mRNAs. Bacteria can use small RNAs (sRNAs) to regulate gene expression at each of these steps, but we often do not understand how this works at a molecular level. In this study, we find that sRNAs in Escherichia coli regulate gene expression at the level of transcription elongation by promoting or inhibiting transcription termination by a protein called Rho. These results help us understand new molecular mechanisms of gene expression regulation in bacteria.
