

Title: siVAE: interpretable deep generative models for single-cell transcriptomes
Abstract

Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.

 
Award ID(s):
1846559
NSF-PAR ID:
10397951
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer Science + Business Media
Date Published:
Journal Name:
Genome Biology
Volume:
24
Issue:
1
ISSN:
1474-760X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Deep neural networks implementing generative models for dimensionality reduction have been extensively used for the visualization and analysis of genomic data. One of their key limitations is lack of interpretability: it is challenging to quantitatively identify which input features are used to construct the embedding dimensions, thus preventing insight into why cells are organized in a particular data visualization, for example. Here we present a scalable, interpretable variational autoencoder (siVAE) that is interpretable by design: it learns feature embeddings that guide the interpretation of the cell embeddings in a manner analogous to factor loadings of factor analysis. siVAE is as powerful and nearly as fast to train as the standard VAE but achieves full interpretability of the embedding dimensions. Using siVAE, we exploit a number of connections between dimensionality reduction and gene network inference to identify gene neighborhoods and gene hubs, without the explicit need for gene network inference. We observe a systematic difference in the gene neighborhoods identified by dimensionality reduction methods and gene network inference algorithms in general, suggesting they provide complementary information about the underlying structure of the gene co-expression network. Finally, we apply siVAE to implicitly learn gene networks for individual iPSC lines and uncover a correlation between neuronal differentiation efficiency and loss of co-expression of several mitochondrial complexes, including NADH dehydrogenase, cytochrome C oxidase, and cytochrome b. 
  2. Abstract

    Natural populations are characterized by abundant genetic diversity driven by a range of different types of mutation. The tractability of sequencing complete genomes has allowed new insights into the variable composition of genomes, summarized as a species pan‐genome. These analyses demonstrate that many genes are absent from the first reference genomes, whose analysis dominated the initial years of the genomic era. Our field now turns towards understanding the functional consequence of these highly variable genomes. Here, we analysed weighted gene coexpression networks from leaf transcriptome data for drought response in the purple false brome Brachypodium distachyon and the differential expression of genes putatively involved in adaptation to this stressor. We specifically asked whether genes with variable “occupancy” in the pan‐genome – genes which are either present in all studied genotypes or missing in some genotypes – show different distributions among coexpression modules. Coexpression analysis united genes expressed in drought‐stressed plants into nine modules covering 72 hub genes (87 hub isoforms), and genes expressed under controlled water conditions into 13 modules, covering 190 hub genes (251 hub isoforms). We find that low‐occupancy pan‐genes are under‐represented among several modules, while other modules are over‐enriched for low‐occupancy pan‐genes. We also provide new insight into the regulation of drought response in B. distachyon, specifically identifying one module with an apparent role in primary metabolism that is strongly responsive to drought. Our work shows the power of integrating pan‐genomic analysis with transcriptomic data using factorial experiments to understand the functional genomics of environmental response.

     
  3. Abstract Motivation

    High-throughput sequencing technologies, in particular RNA sequencing (RNA-seq), have become the basic practice for genomic studies in biomedical research. In addition to studying genes individually, for example, through differential expression analysis, investigating co-ordinated expression variations of genes may help reveal the underlying cellular mechanisms to derive better understanding and more effective prognosis and intervention strategies. A variety of co-expression network based methods exist to analyze microarray data for this purpose, but blindly extending these microarray methods to RNA-seq data may introduce unnecessary bias. It is therefore crucial to develop methods well adapted to RNA-seq data to identify the functional modules of genes with similar expression patterns.

    Results

    We have developed a fully Bayesian covariate-dependent negative binomial factor analysis (dNBFA) method for RNA-seq count data, to capture coordinated gene expression changes while considering effects from covariates reflecting different influencing factors. Unlike existing co-expression network based methods, our proposed model does not require multiple ad-hoc choices of data processing, transformation, and co-expression measures, and can be directly applied to RNA-seq data. Furthermore, being capable of incorporating covariate information, the proposed method can tackle setups with complex confounding factors in different experiment designs. Finally, the natural model parameterization removes the need for a normalization preprocessing step, as commonly adopted to compensate for the effect of sequencing-depth variations. Efficient Bayesian inference of model parameters is derived by exploiting conditional conjugacy via novel data augmentation techniques. Experimental results on several real-world RNA-seq datasets on complex diseases suggest dNBFA as a powerful tool for discovering the gene modules with significant differential expression and meaningful biological insight.

    Availability and implementation

    dNBFA is implemented in R language and is available at https://github.com/siamakz/dNBFA.

     
  4. Summary

    Cultivated cotton (Gossypium hirsutum) is the most important fibre crop in the world. Cotton leaf curl disease (CLCuD) is the major limiting factor and a threat to textile industry in India and Pakistan. All the local cotton cultivars exhibit moderate to no resistance against CLCuD. In this study, we evaluated an exotic cotton accession Mac7 as a resistance source to CLCuD by challenging it with viruliferous whiteflies and performing qPCR to evaluate the presence/absence and relative titre of CLCuD‐associated geminiviruses/betasatellites. The results indicated that replication of pathogenicity determinant betasatellite is significantly attenuated in Mac7 and probably responsible for resistance phenotype. Afterwards, to decipher the genetic basis of CLCuD resistance in Mac7, we performed RNA sequencing on CLCuD‐infested Mac7 and validated RNA‐Seq data with qPCR on 24 independent genes. We performed co‐expression network and pathway analysis for regulation of geminivirus/betasatellite‐interacting genes. We identified nine novel modules with 52 hubs of highly connected genes in network topology within the co‐expression network. Analysis of these hubs indicated the differential regulation of auxin stimulus and cellular localization pathways in response to CLCuD. We also analysed the differential regulation of geminivirus/betasatellite‐interacting genes in Mac7. We further performed the functional validation of selected candidate genes via virus‐induced gene silencing (VIGS). Finally, we evaluated the genomic context of resistance responsive genes and found that these genes are not specific to A or D sub‐genomes of G. hirsutum. These results have important implications in understanding CLCuD resistance mechanism and developing a durable resistance in cultivated cotton.

     
  5. Abstract

    Deciphering the relationship between a gene and its genomic context is fundamental to understanding and engineering biological systems. Machine learning has shown promise in learning latent relationships underlying the sequence-structure-function paradigm from massive protein sequence datasets. However, to date, limited attempts have been made in extending this continuum to include higher order genomic context information. Evolutionary processes dictate the specificity of genomic contexts in which a gene is found across phylogenetic distances, and these emergent genomic patterns can be leveraged to uncover functional relationships between gene products. Here, we train a genomic language model (gLM) on millions of metagenomic scaffolds to learn the latent functional and regulatory relationships between genes. gLM learns contextualized protein embeddings that capture the genomic context as well as the protein sequence itself, and encode biologically meaningful and functionally relevant information (e.g. enzymatic function, taxonomy). Our analysis of the attention patterns demonstrates that gLM is learning co-regulated functional modules (i.e. operons). Our findings illustrate that gLM’s unsupervised deep learning of the metagenomic corpus is an effective and promising approach to encode functional semantics and regulatory syntax of genes in their genomic contexts and uncover complex relationships between genes in a genomic region.

     