Abstract Motivation Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, existing approaches to predicting isoform functions are far from satisfactory for at least two reasons: (i) unlike gene-level annotations, isoform-level functional annotations are scarce, and (ii) information about isoform functions is concealed in various types of data, including isoform sequences and co-expression relationships among isoforms. Results In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework that first uses a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refines the prediction using a conditional random field (CRF) based on co-expression relationships. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE can effectively predict the functions of isoforms and genes. It achieves an average area under the receiver operating characteristics curve of 0.840 and area under the precision–recall curve of 0.581 over 4184 GO functional categories, both significantly higher than those of state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences.
Availability and implementation https://github.com/haochenucr/DIFFUSE. Supplementary information Supplementary data are available at Bioinformatics online.
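The averaged metrics quoted in this abstract can be illustrated with a minimal sketch. It assumes each GO term is scored separately and the per-term values are then averaged with equal weight; the abstract does not state the exact averaging scheme. The function names, the term identifiers `term_A` and `term_B`, and the toy labels and scores are hypothetical and are not taken from the DIFFUSE code base.

```python
from statistics import mean
from typing import Dict, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve; tied scores are not handled specially."""
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("average precision needs at least one positive example")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, ap = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            ap += hits / rank  # precision at each rank where a positive is found
    return ap / total_pos


def macro_average(
    per_term: Dict[str, Tuple[Sequence[int], Sequence[float]]]
) -> Tuple[float, float]:
    """Average AUROC and AUPRC over terms, giving every term equal weight."""
    aurocs = [auroc(y, s) for y, s in per_term.values()]
    auprcs = [average_precision(y, s) for y, s in per_term.values()]
    return mean(aurocs), mean(auprcs)


# Hypothetical toy data: two terms, each with (labels, predicted scores).
toy = {
    "term_A": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
    "term_B": ([0, 1], [0.2, 0.6]),
}
mean_auroc, mean_auprc = macro_average(toy)
```

Averaging per term rather than pooling all predictions keeps rare GO terms from being swamped by frequent ones, which is one common reason for reporting results this way.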
DeepIsoFun: a deep domain adaptation approach to predict isoform functions
Abstract Motivation Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from satisfactory, primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The latter technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms. Results We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method achieved an improvement of at least 26%, and in terms of area under the precision-recall curve, an improvement of at least 10% over the state-of-the-art methods. In addition, we study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions. Availability and implementation https://github.com/dls03/DeepIsoFun/ Supplementary information Supplementary data are available at Bioinformatics online.
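The multiple instance learning setting mentioned in this abstract can be sketched as follows. The sketch uses the standard assumption that a gene (the bag) has a function if at least one of its isoforms (the instances) has it, so the gene-level score is the maximum isoform score. This is a generic illustration, not the DeepIsoFun architecture, and it omits the domain adaptation component entirely. All names and numbers below are hypothetical.

```python
import math
from typing import Dict, List


def gene_scores(
    isoform_scores: Dict[str, float], gene_to_isoforms: Dict[str, List[str]]
) -> Dict[str, float]:
    """Aggregate isoform-level scores to gene level with max pooling (standard MIL assumption)."""
    return {
        gene: max(isoform_scores[iso] for iso in isoforms)
        for gene, isoforms in gene_to_isoforms.items()
    }


def bag_level_loss(
    scores: Dict[str, float], gene_labels: Dict[str, int], eps: float = 1e-12
) -> float:
    """Mean binary cross-entropy between gene-level scores and gene-level annotations."""
    total = 0.0
    for gene, y in gene_labels.items():
        p = min(max(scores[gene], eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(gene_labels)


# Hypothetical toy data for a single GO term.
isoform_scores = {"g1.a": 0.2, "g1.b": 0.9, "g2.a": 0.1}
gene_to_isoforms = {"g1": ["g1.a", "g1.b"], "g2": ["g2.a"]}
gene_labels = {"g1": 1, "g2": 0}  # only gene-level annotations are available

pooled = gene_scores(isoform_scores, gene_to_isoforms)
loss = bag_level_loss(pooled, gene_labels)
```

Because supervision enters only through the pooled gene score, isoform `g1.a` can keep a low score while `g1.b` accounts for the gene's annotation, which is how isoform-level predictions can diverge within one gene.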
- Award ID(s): 1646333
- PAR ID: 10112684
- Date Published:
- Journal Name: Bioinformatics
- Volume: 35
- Issue: 15
- ISSN: 1367-4803
- Page Range / eLocation ID: 2535 to 2544
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Background: Cell type specialization is a hallmark of complex multicellular organisms and is usually established through implementation of cell-type-specific gene expression programs. The multicellular green alga Volvox carteri has just two cell types, germ and soma, that have previously been shown to have very different transcriptome compositions which match their specialized roles. Here we interrogated another potential mechanism for differentiation in V. carteri, cell type specific alternative transcript isoforms (CTSAI). Methods: We used pre-existing predictions of alternative transcripts and de novo transcript assembly with HISAT2 and Ballgown software to compile a list of loci with two or more transcript isoforms, identified a small subset that were candidates for CTSAI, and manually curated this subset of genes to remove false positives. We experimentally verified three candidates using semi-quantitative RT-PCR to assess relative isoform abundance in each cell type. Results: Of the 1978 loci with two or more predicted transcript isoforms, 67 also showed cell type isoform expression biases. After curation, 15 strong candidates for CTSAI were identified, three of which were experimentally verified, and their predicted gene product functions were evaluated in light of potential cell type specific roles. A comparison of genes with predicted alternative splicing from Chlamydomonas reinhardtii, a unicellular relative of V. carteri, identified little overlap between ortholog pairs with alternative splicing in both species. Finally, we interrogated cell type expression patterns of 126 V. carteri predicted RBP encoding genes and found 40 that showed either somatic or germ cell expression bias. These RBPs are potential mediators of CTSAI in V. carteri and suggest possible pre-adaptation for cell type specific RNA processing and a potential path for generating CTSAI in the early ancestors of metazoans and plants.
Conclusions: We predicted numerous instances of alternative transcript isoforms in Volvox, only a small subset of which showed cell type specific isoform expression bias. However, the validated examples of CTSAI supported existing hypotheses about cell type specialization in V. carteri, and also suggested new hypotheses about mechanisms of functional specialization for their gene products. Our data imply that CTSAI operates as a minor but important component of V. carteri cellular differentiation and could be used as a model for how alternative isoforms emerge and co-evolve with cell type specialization.
-
Alternative splicing extends the coding potential of genomes by creating multiple isoforms from one gene. Isoforms can render transcript specificity and diversity to initiate multiple responses required during transcriptome adjustments in stressed environments. Although the prevalence of alternative splicing is widely recognized, how diverse isoforms facilitate stress adaptation in plants that thrive in extreme environments remains unexplored. Here we examine how an extremophyte model, Schrenkiella parvula, coordinates alternative splicing in response to high salinity compared to a salt-stress sensitive model, Arabidopsis thaliana. We use Iso-Seq to generate full length reference transcripts and RNA-seq to quantify differential isoform usage in response to salinity changes. We find that single-copy orthologs where S. parvula has a higher number of isoforms than A. thaliana, as well as S. parvula genes observed and predicted using machine learning to have multiple isoforms, are enriched in stress associated functions. Genes that showed differential isoform usage were largely mutually exclusive from genes that were differentially expressed in response to salt. S. parvula transcriptomes maintained specificity in isoform usage, assessed via a measure of expression disorderedness, during transcriptome reprogramming under salt. Our study adds a novel resource and insight to study plant stress tolerance evolved in extreme environments.
-
Abstract Single-cell RNA sequencing is a powerful technique that continues to expand across various biological applications. However, incomplete 3′-UTR annotations can impede single-cell analysis, resulting in genes that are partially or completely uncounted. Performing single-cell RNA sequencing with incomplete 3′-UTR annotations can hinder the identification of cell identities and gene expression patterns and lead to erroneous biological inferences. We demonstrate that performing single-cell isoform sequencing in tandem with single-cell RNA sequencing can rapidly improve 3′-UTR annotations. Using threespine stickleback fish (Gasterosteus aculeatus), we show that gene models resulting from a minimal embryonic single-cell isoform sequencing dataset retained 26.1% more single-cell RNA sequencing reads than gene models from Ensembl alone. Furthermore, pooling our single-cell sequencing isoforms with a previously published adult bulk Iso-Seq dataset from stickleback, and merging the annotation with the Ensembl gene models, resulted in a marginal improvement (+0.8%) over the single-cell isoform sequencing only dataset. In addition, isoforms identified by single-cell isoform sequencing included thousands of new splicing variants. The improved gene models obtained using single-cell isoform sequencing led to successful identification of cell types and increased the number of reads assigned to many genes in our single-cell RNA sequencing stickleback dataset. Our work illuminates single-cell isoform sequencing as a cost-effective and efficient mechanism to rapidly annotate genomes for single-cell RNA sequencing.
-
Martelli, Pier Luigi (Ed.) Abstract Motivation Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tied to a network for a single species, and many species lack biological networks. Results In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critical Assessment of Functional Annotation. We demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out, we can use our multispecies method to make predictions for the left-out organism with good performance. Availability and implementation The code is freely available at https://github.com/nowittynamesleft/NetQuilt.
The data, including sequences, PPI networks and GO annotations, are available at https://string-db.org/. Supplementary information Supplementary data are available at Bioinformatics online.

