skip to main content

Title: DIFFUSE: predicting isoform functions from sequences and expression profiles via deep learning
Abstract Motivation Alternative splicing generates multiple isoforms from a single gene, greatly increasing the functional diversity of a genome. Although gene functions have been well studied, little is known about the specific functions of isoforms, making accurate prediction of isoform functions highly desirable. However, the existing approaches to predicting isoform functions are far from satisfactory due to at least two reasons: (i) unlike genes, isoform-level functional annotations are scarce. (ii) The information of isoform functions is concealed in various types of data including isoform sequences, co-expression relationship among isoforms, etc. Results In this study, we present a novel approach, DIFFUSE (Deep learning-based prediction of IsoForm FUnctions from Sequences and Expression), to predict isoform functions. To integrate various types of data, our approach adopts a hybrid framework by first using a deep neural network (DNN) to predict the functions of isoforms from their genomic sequences and then refining the prediction using a conditional random field (CRF) based on co-expression relationship. To overcome the lack of isoform-level ground truth labels, we further propose an iterative semi-supervised learning algorithm to train both the DNN and CRF together. Our extensive computational experiments demonstrate that DIFFUSE could effectively predict the functions of isoforms and genes. more » It achieves an average area under the receiver operating characteristics curve of 0.840 and area under the precision–recall curve of 0.581 over 4184 GO functional categories, which are significantly higher than the state-of-the-art methods. We further validate the prediction results by analyzing the correlation between functional similarity, sequence similarity, expression similarity and structural similarity, as well as the consistency between the predicted functions and some well-studied functional features of isoform sequences. Availability and implementation Supplementary information Supplementary data are available at Bioinformatics online. « less
; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Page Range or eLocation-ID:
i284 to i294
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation Isoforms are mRNAs produced from the same gene locus by alternative splicing and may have different functions. Although gene functions have been studied extensively, little is known about the specific functions of isoforms. Recently, some computational approaches based on multiple instance learning have been proposed to predict isoform functions from annotated gene functions and expression data, but their performance is far from being desirable primarily due to the lack of labeled training data. To improve the performance on this problem, we propose a novel deep learning method, DeepIsoFun, that combines multiple instance learning with domain adaptation. The lattermore »technique helps to transfer the knowledge of gene functions to the prediction of isoform functions and provides additional labeled training data. Our model is trained on a deep neural network architecture so that it can adapt to different expression distributions associated with different gene ontology terms. Results We evaluated the performance of DeepIsoFun on three expression datasets of human and mouse collected from SRA studies at different times. On each dataset, DeepIsoFun performed significantly better than the existing methods. In terms of area under the receiver operating characteristics curve, our method acquired at least 26% improvement and in terms of area under the precision-recall curve, it acquired at least 10% improvement over the state-of-the-art methods. In addition, we also study the divergence of the functions predicted by our method for isoforms from the same gene and the overall correlation between expression similarity and the similarity of predicted functions. Availability and implementation Supplementary information Supplementary data are available at Bioinformatics online.« less
  2. Valencia, Alfonso (Ed.)
    Abstract Motivation Protein function prediction, based on the patterns of connection in a protein–protein interaction (or association) network, is perhaps the most studied of the classical, fundamental inference problems for biological networks. A highly successful set of recent approaches use random walk-based low-dimensional embeddings that tend to place functionally similar proteins into coherent spatial regions. However, these approaches lose valuable local graph structure from the network when considering only the embedding. We introduce GLIDER, a method that replaces a protein–protein interaction or association network with a new graph-based similarity network. GLIDER is based on a variant of our previous GLIDEmore »method, which was designed to predict missing links in protein–protein association networks, capturing implicit local and global (i.e. embedding-based) graph properties. Results GLIDER outperforms competing methods on the task of predicting GO functional labels in cross-validation on a heterogeneous collection of four human protein–protein association networks derived from the 2016 DREAM Disease Module Identification Challenge, and also on three different protein–protein association networks built from the STRING database. We show that this is due to the strong functional enrichment that is present in the local GLIDER neighborhood in multiple different types of protein–protein association networks. Furthermore, we introduce the GLIDER graph neighborhood as a way for biologists to visualize the local neighborhood of a disease gene. As an application, we look at the local GLIDER neighborhoods of a set of known Parkinson’s Disease GWAS genes, rediscover many genes which have known involvement in Parkinson’s disease pathways, plus suggest some new genes to study. Availability and implementation All code is publicly available and can be accessed here: Supplementary information Supplementary data are available at Bioinformatics online.« less
  3. Martelli, Pier Luigi (Ed.)
    Abstract Motivation Transferring knowledge between species is challenging: different species contain distinct proteomes and cellular architectures, which cause their proteins to carry out different functions via different interaction networks. Many approaches to protein functional annotation use sequence similarity to transfer knowledge between species. These approaches cannot produce accurate predictions for proteins without homologues of known function, as many functions require cellular context for meaningful prediction. To supply this context, network-based methods use protein-protein interaction (PPI) networks as a source of information for inferring protein function and have demonstrated promising results in function prediction. However, most of these methods are tiedmore »to a network for a single species, and many species lack biological networks. Results In this work, we integrate sequence and network information across multiple species by computing IsoRank similarity scores to create a meta-network profile of the proteins of multiple species. We use this integrated multispecies meta-network as input to train a maxout neural network with Gene Ontology terms as target labels. Our multispecies approach takes advantage of more training examples, and consequently leads to significant improvements in function prediction performance compared to two network-based methods, a deep learning sequence-based method and the BLAST annotation method used in the Critial Assessment of Functional Annotation. We are able to demonstrate that our approach performs well even in cases where a species has no network information available: when an organism’s PPI network is left out we can use our multi-species method to make predictions for the left-out organism with good performance. Availability and implementation The code is freely available at The data, including sequences, PPI networks and GO annotations are available at Supplementary information Supplementary data are available at Bioinformatics online.« less
  4. Abstract Background Identification of genes responsible for anatomical entities is a major requirement in many fields including developmental biology, medicine, and agriculture. Current wet lab techniques used for this purpose, such as gene knockout, are high in resource and time consumption. Protein–protein interaction (PPI) networks are frequently used to predict disease genes for humans and gene candidates for molecular functions, but they are rarely used to predict genes for anatomical entities. Moreover, PPI networks suffer from network quality issues, which can be a limitation for their usage in predicting candidate genes. Therefore, we developed an integrative framework to improve themore »candidate gene prediction accuracy for anatomical entities by combining existing experimental knowledge about gene-anatomical entity relationships with PPI networks using anatomy ontology annotations. We hypothesized that this integration improves the quality of the PPI networks by reducing the number of false positive and false negative interactions and is better optimized to predict candidate genes for anatomical entities. We used existing Uberon anatomical entity annotations for zebrafish and mouse genes to construct gene networks by calculating semantic similarity between the genes. These anatomy-based gene networks were semantic networks, as they were constructed based on the anatomy ontology annotations that were obtained from the experimental data in the literature. We integrated these anatomy-based gene networks with mouse and zebrafish PPI networks retrieved from the STRING database and compared the performance of their network-based candidate gene predictions. Results According to evaluations of candidate gene prediction performance tested under four different semantic similarity calculation methods (Lin, Resnik, Schlicker, and Wang), the integrated networks, which were semantically improved PPI networks, showed better performances by having higher area under the curve values for receiver operating characteristic and precision-recall curves than PPI networks for both zebrafish and mouse. Conclusion Integration of existing experimental knowledge about gene-anatomical entity relationships with PPI networks via anatomy ontology improved the candidate gene prediction accuracy and optimized them for predicting candidate genes for anatomical entities.« less
  5. Abstract

    Haplotype phasing maize genetic variants is important for genome interpretation, population genetic analysis and functional analysis of allelic activity. We performed an isoform-level phasing study using two maize inbred lines and their reciprocal crosses, based on single-molecule, full-length cDNA sequencing. To phase and analyze transcripts between hybrids and parents, we developed IsoPhase. Using this tool, we validated the majority of SNPs called against matching short-read data from embryo, endosperm and root tissues, and identified allele-specific, gene-level and isoform-level differential expression between the inbred parental lines and hybrid offspring. After phasing 6907 genes in the reciprocal hybrids, we annotated themore »SNPs and identified large-effect genes. In addition, we identified parent-of-origin isoforms, distinct novel isoforms in maize parent and hybrid lines, and imprinted genes from different tissues. Finally, we characterized variation in cis- and trans-regulatory effects. Our study provides measures of haplotypic expression that could increase accuracy in studies of allelic expression.

    « less