

Title: Discovering a sparse set of pairwise discriminating features in high-dimensional data
Abstract

Motivation

Recent technological advances produce a wealth of high-dimensional descriptions of biological processes, yet extracting meaningful insight and mechanistic understanding from these data remains challenging. For example, in developmental biology, the dynamics of differentiation can now be mapped quantitatively using single-cell RNA sequencing, yet it is difficult to infer molecular regulators of developmental transitions. Here, we show that discovering informative features in the data is crucial for statistical analysis as well as making experimental predictions.

Results

We identify features based on their ability to discriminate between clusters of the data points. We define a class of problems in which linear separability of clusters is hidden in a low-dimensional space. We propose an unsupervised method to identify the subset of features that define a low-dimensional subspace in which clustering can be conducted. This is achieved by averaging over discriminators trained on an ensemble of proposed cluster configurations. We then apply our method to single-cell RNA-seq data from mouse gastrulation, and identify 27 key transcription factors (out of 409 total), 18 of which are known to define cell states through their expression levels. In this inferred subspace, we find clear signatures of known cell types that eluded classification prior to discovery of the correct low-dimensional subspace.

Availability and implementation

https://github.com/smelton/SMD.

Supplementary information

Supplementary data are available at Bioinformatics online.
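
The idea of averaging over discriminators trained on an ensemble of proposed cluster configurations can be illustrated with a small sketch. This is not the authors' SMD implementation (see the repository linked in the abstract for that). It is a toy version built on several assumptions that the abstract does not state: cluster proposals come from 2-means on random feature subsets, the discriminator is a diagonal-LDA weight vector rather than whatever classifier the paper uses, and all function names (`two_means`, `discriminator_weights`, `ensemble_feature_scores`) are hypothetical.

```python
import numpy as np


def two_means(X, rng, n_iter=25):
    """Minimal 2-means clustering (Lloyd's algorithm); returns 0/1 labels."""
    centers = X[rng.choice(len(X), size=2, replace=False)].astype(float)
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Squared distance from every point to each of the two centers.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in (0, 1):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def discriminator_weights(X, labels, eps=1e-8):
    """Assumed stand-in discriminator: mean difference / pooled variance."""
    a, b = X[labels == 0], X[labels == 1]
    pooled_var = 0.5 * (a.var(axis=0) + b.var(axis=0)) + eps
    return (b.mean(axis=0) - a.mean(axis=0)) / pooled_var


def ensemble_feature_scores(X, n_trials=200, subset_size=5, seed=0):
    """Average normalised |weights| over many proposed two-cluster splits."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    scores = np.zeros(n_features)
    n_used = 0
    for _ in range(n_trials):
        # Propose a cluster configuration from a random feature subset.
        subset = rng.choice(n_features, size=subset_size, replace=False)
        labels = two_means(X[:, subset], rng)
        if min(np.sum(labels == 0), np.sum(labels == 1)) < 2:
            continue  # skip degenerate proposals
        # Train the discriminator on all features for this proposal.
        w = np.abs(discriminator_weights(X, labels))
        scores += w / w.sum()
        n_used += 1
    return scores / max(n_used, 1)


# Synthetic data (illustrative only): 20 features, of which only
# features 0 and 1 separate the two groups of 100 points each.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 20))
X[100:, :2] += 8.0

scores = ensemble_feature_scores(X)
top = np.argsort(scores)[::-1][:2]
print("highest-scoring features:", sorted(top.tolist()))
```

In this toy setting, features that repeatedly carry discriminator weight across many proposed splits accumulate a high average score, while features that matter only for an occasional noise-driven split are averaged down. Real expression data would likely need normalisation and more than two clusters, which this sketch does not attempt.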
Award ID(s):
1764269
NSF-PAR ID:
10328966
Editor(s):
Luigi Martelli, Pier
Journal Name:
Bioinformatics
Volume:
37
Issue:
2
ISSN:
1367-4803
Page Range / eLocation ID:
202 to 212
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Methods for the global measurement of transcript abundance such as microarrays and RNA-Seq generate datasets in which the number of measured features far exceeds the number of observations. Extracting biologically meaningful and experimentally tractable insights from such data therefore requires high-dimensional prediction. Existing sparse linear approaches to this challenge have been stunningly successful, but some important issues remain. These methods can fail to select the correct features, predict poorly relative to non-sparse alternatives or ignore any unknown grouping structures for the features.

    Results

    We propose a method called SuffPCR that yields improved predictions in high-dimensional tasks including regression and classification, especially in the typical context of omics with correlated features. SuffPCR first estimates sparse principal components and then estimates a linear model on the recovered subspace. Because the estimated subspace is sparse in the features, the resulting predictions will depend on only a small subset of genes. SuffPCR works well on a variety of simulated and experimental transcriptomic data, performing nearly optimally when the model assumptions are satisfied. We also demonstrate near-optimal theoretical guarantees.

    Availability and implementation

    Code and raw data are freely available at https://github.com/dajmcdon/suffpcr. Package documentation may be viewed at https://dajmcdon.github.io/suffpcr.

    Contact

    daniel@stat.ubc.ca

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

     
  2. Abstract Motivation

    Due to their high genomic variability, RNA viruses and retroviruses present a unique opportunity for detailed study of molecular evolution. Lentiviruses, with HIV being a notable example, are one of the best studied viral groups: hundreds of thousands of sequences are available together with experimentally resolved three-dimensional structures for most viral proteins. In this work, we use these data to study specific patterns of evolution of the viral proteins, and their relationship to protein interactions and immunogenicity.

    Results

    We propose a method for identifying two types of surface residue clusters with abnormal conservation: extremely conserved and extremely variable clusters. We identify them on the surface of proteins from HIV and other animal immunodeficiency viruses. Both types of clusters are overrepresented on the interaction interfaces of viral proteins with other proteins, nucleic acids or low molecular-weight ligands, both in the viral particle and between the virus and its host. In the immunodeficiency viruses, the interaction interfaces are not more conserved than the corresponding proteins on average, and we show that extremely conserved clusters coincide with protein–protein interaction hotspots, predicted as the residues with the largest energetic contribution to the interaction. Extremely variable clusters have been identified here for the first time. In the HIV-1 envelope protein gp120, they overlap with known antigenic sites. These antigenic sites also contain many residues from extremely conserved clusters, hence representing a unique interacting interface enriched both in extremely conserved and in extremely variable clusters of residues. This observation may have important implications for antiretroviral vaccine development.

    Availability and Implementation

    A Python package is available at https://bioinf.mpi-inf.mpg.de/publications/viral-ppi-pred/

    Contact

    voitenko@mpi-inf.mpg.de or kalinina@mpi-inf.mpg.de

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Abstract Motivation

    Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.

    Results

    Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.

    Availability and implementation

    The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997.

    Supplementary information

    Supplementary data are available at Bioinformatics online.
  4. Birol, Inanc (Ed.)
    Abstract Motivation

    Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge.

    Results

    Here, we introduce CellMeSH—a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches.

    Availability and implementation

    Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  5. INTRODUCTION

    Neurons are by far the most diverse of all cell types in animals, to the extent that “cell types” in mammalian brains are still mostly heterogeneous groups, and there is no consensus definition of the term. The Drosophila optic lobes, with approximately 200 well-defined cell types, provide a tractable system with which to address the genetic basis of neuronal type diversity. We previously characterized the distinct developmental gene expression program of each of these types using single-cell RNA sequencing (scRNA-seq), with one-to-one correspondence to the known morphological types.

    RATIONALE

    The identity of fly neurons is determined by temporal and spatial patterning mechanisms in stem cell progenitors, but it remained unclear how these cell fate decisions are implemented and maintained in postmitotic neurons. It was proposed in Caenorhabditis elegans that unique combinations of terminal selector transcription factors (TFs) that are continuously expressed in each neuron control nearly all of its type-specific gene expression. This model implies that it should be possible to engineer predictable and complete switches of identity between different neurons just by modifying these sustained TFs. We aimed to test this prediction in the Drosophila visual system.

    RESULTS

    Here, we used our developmental scRNA-seq atlases to identify the potential terminal selector genes in all optic lobe neurons. We found unique combinations of, on average, 10 differentially expressed and stably maintained (across all stages of development) TFs in each neuron. Through genetic gain- and loss-of-function experiments in postmitotic neurons, we showed that modifications of these selector codes are sufficient to induce predictable switches of identity between various cell types. Combinations of terminal selectors jointly control both developmental (e.g., morphology) and functional (e.g., neurotransmitters and their receptors) features of neurons.

    The closely related Transmedullary 1 (Tm1), Tm2, Tm4, and Tm6 neurons (see the figure) share a similar code of terminal selectors, but can be distinguished from each other by three TFs that are continuously and specifically expressed in one of these cell types: Drgx in Tm1, Pdm3 in Tm2, and SoxN in Tm6. We showed that the removal of each of these selectors in these cell types reprograms them to the default Tm4 fate. We validated these conversions using both morphological features and molecular markers. In addition, we performed scRNA-seq to show that ectopic expression of pdm3 in Tm4 and Tm6 neurons converts them to neurons with transcriptomes that are nearly indistinguishable from that of wild-type Tm2 neurons. We also show that Drgx expression in Tm1 neurons is regulated by Klumpfuss, a TF expressed in stem cells that instructs this fate in progenitors, establishing a link between the regulatory programs that specify neuronal fates and those that implement them. We identified an intronic enhancer in the Drgx locus whose chromatin is specifically accessible in Tm1 neurons and in which Klu motifs are enriched. Genomic deletion of this region knocked down Drgx expression specifically in Tm1 neurons, leaving it intact in the other cell types that normally express it. We further validated this concept by demonstrating that ectopic expression of Vsx (visual system homeobox) genes in Mi15 neurons not only converts them morphologically to Dm2 neurons, but also leads to the loss of their aminergic identity. Our results suggest that selector combinations can be further sculpted by receptor tyrosine kinase signaling after neurogenesis, providing a potential mechanism for postmitotic plasticity of neuronal fates. Finally, we combined our transcriptomic datasets with previously generated chromatin accessibility datasets to understand the mechanisms that control brain wiring downstream of terminal selectors.

    We built predictive computational models of gene regulatory networks using the Inferelator framework. Experimental validations of these networks revealed how selectors interact with ecdysone-responsive TFs to activate a large and specific repertoire of cell surface proteins and other effectors in each neuron at the onset of synapse formation. We showed that these network models can be used to identify downstream effectors that mediate specific cellular decisions during circuit formation. For instance, reduced levels of cut expression in Tm2 neurons, because of its negative regulation by pdm3, controls the synaptic layer targeting of their axons. Knockdown of cut in Tm1 neurons is sufficient to redirect their axons to the Tm2 layer in the lobula neuropil without affecting other morphological features.

    CONCLUSION

    Our results support a model in which neuronal type identity is primarily determined by a relatively simple code of continuously expressed terminal selector TFs in each cell type throughout development. Our results provide a unified framework of how specific fates are initiated and maintained in postmitotic neurons and open new avenues to understanding synaptic specificity through gene regulatory networks. The conservation of this regulatory logic in both C. elegans and Drosophila makes it likely that the terminal selector concept will also be useful in understanding and manipulating the neuronal diversity of mammalian brains.

    Terminal selectors enable predictive cell fate reprogramming. Tm1, Tm2, Tm4, and Tm6 neurons of the Drosophila visual system share a core set of TFs continuously expressed by each cell type (simplified). The default Tm4 fate is overridden by the expression of a single additional terminal selector to generate Tm1 (Drgx), Tm2 (pdm3), or Tm6 (SoxN) fates.