Title: Clustering single-cell RNA-seq data by rank constrained similarity learning
Abstract

Motivation

Recent breakthroughs of single-cell RNA sequencing (scRNA-seq) technologies offer an exciting opportunity to identify heterogeneous cell types in complex tissues. However, the unavoidable biological noise and technical artifacts in scRNA-seq data as well as the high dimensionality of expression vectors make the problem highly challenging. Consequently, although numerous tools have been developed, their accuracy remains to be improved.

Results

Here, we introduce a novel clustering algorithm and tool RCSL (Rank Constrained Similarity Learning) to accurately identify various cell types using scRNA-seq data from a complex tissue. RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types. RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbor representation of a cell as its local similarity. The overall similarity of a cell to other cells is a linear combination of its global similarity and local similarity. RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized. Each block-diagonal submatrix is a cell cluster/type, corresponding to a connected component in the cognate similarity graph. When tested on 16 benchmark scRNA-seq datasets in which the cell types are well-annotated, RCSL substantially outperformed six state-of-the-art methods in accuracy and robustness as measured by three metrics.

Availability and implementation

The RCSL algorithm is implemented in R and can be freely downloaded at

Supplementary information

Supplementary data are available at Bioinformatics online.
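As a rough illustration of the pipeline the abstract describes (global similarity from Spearman correlations, a local neighbor-based similarity, a linear combination of the two, and clusters read off as connected components), here is a minimal Python sketch on toy data. It is not the RCSL implementation, which is written in R. Two steps are simplified and should be read as assumptions: the adaptively learned local similarity is replaced by a fixed k-nearest-neighbor indicator, and the rank-constrained block-diagonal optimization is replaced by thresholding the combined similarity matrix. The names `ALPHA`, `K`, and `THRESHOLD` and their values are hypothetical.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def global_similarity(cells):
    """Spearman correlation (Pearson on ranks) for every pair of cells."""
    ranked = [ranks(c) for c in cells]
    return [[pearson(a, b) for b in ranked] for a in ranked]

def local_similarity(cells, k):
    """ASSUMPTION: symmetric k-nearest-neighbor indicator, used here as a
    stand-in for the adaptively learned neighbor representation."""
    n = len(cells)
    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(cells[i], cells[j]))
        for j in others[:k]:
            sim[i][j] = sim[j][i] = 1.0
    return sim

def connected_components(sim, threshold):
    """ASSUMPTION: edges are entries above a fixed threshold; this replaces
    the rank-constrained block-diagonal optimization."""
    n = len(sim)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j in range(n):
                if labels[j] == -1 and sim[i][j] > threshold:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels

# Toy expression vectors (rows = cells, columns = genes): two groups of three.
cells = [
    [10, 9, 8, 7, 6, 1, 2, 3, 4, 5],
    [9, 10, 8, 7, 6, 1, 2, 3, 5, 4],
    [10, 9, 7, 8, 6, 2, 1, 3, 4, 5],
    [1, 2, 3, 4, 5, 10, 9, 8, 7, 6],
    [2, 1, 3, 4, 5, 9, 10, 8, 7, 6],
    [1, 2, 4, 3, 5, 10, 9, 8, 6, 7],
]
ALPHA, K, THRESHOLD = 0.5, 2, 0.5  # hypothetical values for illustration

G = global_similarity(cells)
L = local_similarity(cells, K)
n = len(cells)
S = [[ALPHA * G[i][j] + (1 - ALPHA) * L[i][j] for j in range(n)] for i in range(n)]

labels = connected_components(S, THRESHOLD)
n_clusters = max(labels) + 1
print(labels, n_clusters)  # prints [0, 0, 0, 1, 1, 1] 2
```

In this sketch the number of clusters is simply the number of connected components in the thresholded graph, so it depends on the chosen threshold. According to the abstract, RCSL instead estimates the number of cell types from the similarity matrix and finds the closest block-diagonal matrix.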
Award ID(s):
Author(s) / Creator(s):
; ;
Mathelier, Anthony
Date Published:
Journal Name:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Birol, Inanc (Ed.)
    Abstract

    Motivation

    Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge.

    Results


    Here, we introduce CellMeSH—a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches.

    Availability and implementation

    Web server at and API at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  2. Abstract

    Motivation

    Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data.

    Results

    Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data.

    Availability and implementation

    The R package is open-access and available at
    The data used in this work are available at Zenodo:

    Supplementary information

    Supplementary data are available at Bioinformatics online.
  3. INTRODUCTION

    Neurons are by far the most diverse of all cell types in animals, to the extent that “cell types” in mammalian brains are still mostly heterogeneous groups, and there is no consensus definition of the term. The Drosophila optic lobes, with approximately 200 well-defined cell types, provide a tractable system with which to address the genetic basis of neuronal type diversity. We previously characterized the distinct developmental gene expression program of each of these types using single-cell RNA sequencing (scRNA-seq), with one-to-one correspondence to the known morphological types.

    RATIONALE

    The identity of fly neurons is determined by temporal and spatial patterning mechanisms in stem cell progenitors, but it remained unclear how these cell fate decisions are implemented and maintained in postmitotic neurons. It was proposed in Caenorhabditis elegans that unique combinations of terminal selector transcription factors (TFs) that are continuously expressed in each neuron control nearly all of its type-specific gene expression. This model implies that it should be possible to engineer predictable and complete switches of identity between different neurons just by modifying these sustained TFs. We aimed to test this prediction in the Drosophila visual system.

    RESULTS

    Here, we used our developmental scRNA-seq atlases to identify the potential terminal selector genes in all optic lobe neurons. We found unique combinations of, on average, 10 differentially expressed and stably maintained (across all stages of development) TFs in each neuron. Through genetic gain- and loss-of-function experiments in postmitotic neurons, we showed that modifications of these selector codes are sufficient to induce predictable switches of identity between various cell types. Combinations of terminal selectors jointly control both developmental (e.g., morphology) and functional (e.g., neurotransmitters and their receptors) features of neurons.

    The closely related Transmedullary 1 (Tm1), Tm2, Tm4, and Tm6 neurons (see the figure) share a similar code of terminal selectors, but can be distinguished from each other by three TFs that are continuously and specifically expressed in one of these cell types: Drgx in Tm1, Pdm3 in Tm2, and SoxN in Tm6. We showed that the removal of each of these selectors in these cell types reprograms them to the default Tm4 fate. We validated these conversions using both morphological features and molecular markers. In addition, we performed scRNA-seq to show that ectopic expression of pdm3 in Tm4 and Tm6 neurons converts them to neurons with transcriptomes that are nearly indistinguishable from that of wild-type Tm2 neurons. We also show that Drgx expression in Tm1 neurons is regulated by Klumpfuss, a TF expressed in stem cells that instructs this fate in progenitors, establishing a link between the regulatory programs that specify neuronal fates and those that implement them. We identified an intronic enhancer in the Drgx locus whose chromatin is specifically accessible in Tm1 neurons and in which Klu motifs are enriched. Genomic deletion of this region knocked down Drgx expression specifically in Tm1 neurons, leaving it intact in the other cell types that normally express it. We further validated this concept by demonstrating that ectopic expression of Vsx (visual system homeobox) genes in Mi15 neurons not only converts them morphologically to Dm2 neurons, but also leads to the loss of their aminergic identity. Our results suggest that selector combinations can be further sculpted by receptor tyrosine kinase signaling after neurogenesis, providing a potential mechanism for postmitotic plasticity of neuronal fates.

    Finally, we combined our transcriptomic datasets with previously generated chromatin accessibility datasets to understand the mechanisms that control brain wiring downstream of terminal selectors. We built predictive computational models of gene regulatory networks using the Inferelator framework. Experimental validations of these networks revealed how selectors interact with ecdysone-responsive TFs to activate a large and specific repertoire of cell surface proteins and other effectors in each neuron at the onset of synapse formation. We showed that these network models can be used to identify downstream effectors that mediate specific cellular decisions during circuit formation. For instance, reduced levels of cut expression in Tm2 neurons, because of its negative regulation by pdm3, control the synaptic layer targeting of their axons. Knockdown of cut in Tm1 neurons is sufficient to redirect their axons to the Tm2 layer in the lobula neuropil without affecting other morphological features.

    CONCLUSION

    Our results support a model in which neuronal type identity is primarily determined by a relatively simple code of continuously expressed terminal selector TFs in each cell type throughout development. Our results provide a unified framework of how specific fates are initiated and maintained in postmitotic neurons and open new avenues to understanding synaptic specificity through gene regulatory networks. The conservation of this regulatory logic in both C. elegans and Drosophila makes it likely that the terminal selector concept will also be useful in understanding and manipulating the neuronal diversity of mammalian brains.

    Figure caption: Terminal selectors enable predictive cell fate reprogramming. Tm1, Tm2, Tm4, and Tm6 neurons of the Drosophila visual system share a core set of TFs continuously expressed by each cell type (simplified). The default Tm4 fate is overridden by the expression of a single additional terminal selector to generate Tm1 (Drgx), Tm2 (pdm3), or Tm6 (SoxN) fates.
  4. Abstract

    Summary

    With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection, RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcode (CB) and UMI selection, and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-based scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification, and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5.
    Single cell RNA-sequencing (scRNA-seq) technology enables comprehensive transcriptomic profiling of thousands of cells with distinct phenotypic and physiological states in a complex tissue. Substantial efforts have been made to characterize single cells of distinct identities from scRNA-seq data, including various cell clustering techniques. While existing approaches can handle single cells in terms of different cell (sub)types at a high resolution, identification of the functional variability within the same cell type remains unsolved. In addition, there is a lack of robust methods to handle the inter-subject variation that often brings severe confounding effects for the functional clustering of single cells. In this study, we developed a novel data denoising and cell clustering approach, namely CIBS, to provide biologically explainable functional classification for scRNA-seq data. CIBS is based on a systems biology model of transcriptional regulation that assumes a multi-modality distribution of the cells’ activation status, and it utilizes a Boolean matrix factorization approach on the discretized expression status to robustly derive functional modules. CIBS is empowered by a novel fast Boolean Matrix Factorization method, namely PFAST, to increase the computational feasibility on large scale scRNA-seq data. Application of CIBS on two scRNA-seq datasets collected from cancer tumor micro-environment successfully identified subgroups of cancer cells with distinct expression patterns of epithelial-mesenchymal transition and extracellular matrix marker genes, which were not revealed by the existing cell clustering analysis tools. The identified cell groups were significantly associated with the clinically confirmed lymph-node invasion and metastasis events across different patients.

    Index Terms: Cell clustering analysis, Data denoising, Boolean matrix factorization, Cancer microenvironment, Metastasis.
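The last related record above describes CIBS as applying Boolean matrix factorization to discretized expression status. The Python sketch below only illustrates what that objective looks like: expression values are binarized, and a pair of binary factors is scored by how well their Boolean product (logical OR of ANDs) reconstructs the binary matrix. It does not implement CIBS or PFAST. The toy matrix, the threshold of 1.0, and the hand-picked factors `A` and `B` are assumptions made for illustration.

```python
def binarize(expr, threshold):
    """Discretize expression: 1 if a gene is 'on' in a cell, else 0."""
    return [[1 if v > threshold else 0 for v in row] for row in expr]

def boolean_product(a, b):
    """Boolean matrix product: entry (i, j) is OR over k of a[i][k] AND b[k][j]."""
    return [[int(any(a[i][k] and b[k][j] for k in range(len(b))))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def hamming_error(x, y):
    """Number of entries where two binary matrices disagree."""
    return sum(p != q for rx, ry in zip(x, y) for p, q in zip(rx, ry))

# Toy expression matrix (rows = cells, columns = genes); values are made up.
expr = [
    [5.0, 4.2, 3.1, 0.0],
    [6.1, 3.9, 2.8, 0.2],
    [0.0, 0.1, 7.3, 5.5],
    [4.8, 5.0, 6.9, 6.2],
]
X = binarize(expr, threshold=1.0)

# Hand-picked factors: A assigns cells to modules, B assigns genes to modules.
A = [[1, 0], [1, 0], [0, 1], [1, 1]]   # cells x modules
B = [[1, 1, 1, 0], [0, 0, 1, 1]]       # modules x genes

reconstructed = boolean_product(A, B)
error = hamming_error(X, reconstructed)
print(reconstructed, error)
```

Gene 3 (the third column) belongs to both modules. For the last cell an ordinary matrix product would give 2 in that position, while the Boolean product gives 1, which is why overlapping modules can still reconstruct a binary matrix exactly.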