Title: A model-based constrained deep learning clustering approach for spatially resolved single-cell data
Spatially resolved scRNA-seq (sp-scRNA-seq) technologies provide the potential to comprehensively profile gene expression patterns in tissue context. However, the development of computational methods lags behind the advances in these technologies, which limits the fulfillment of their potential. In this study, we develop a deep learning approach for clustering sp-scRNA-seq data, named Deep Spatially constrained Single-cell Clustering (DSSC). In this model, we integrate the spatial information of cells into the clustering process in two steps: (1) the spatial information is encoded by using a graph neural network model, and (2) cell-to-cell constraints are built based on the spatial expression pattern of the marker genes and added to the model to guide the clustering process. Then, deep embedding clustering is performed on the bottleneck layer of the autoencoder by minimizing a Kullback–Leibler (KL) divergence while the feature representation is learned. DSSC is the first model that can use information from both spatial coordinates and marker genes to guide cell/spot clustering. Extensive experiments on both simulated and real data sets show that DSSC boosts clustering performance significantly compared with state-of-the-art methods. It has robust performance across different data sets with various cell type/tissue organization and/or cell type/tissue spatial dependency. We conclude that DSSC is a promising tool for clustering sp-scRNA-seq data.
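The KL-divergence clustering step mentioned in the abstract can be illustrated with the standard deep embedding clustering (DEC) objective. The sketch below is not the DSSC implementation: it assumes the usual DEC formulation (Student's t soft assignments and a sharpened target distribution), operates on an already-computed embedding, and leaves out the graph encoder and the marker-gene constraints entirely. The function names and the `alpha` parameter are illustrative choices, not taken from the paper.

```python
import numpy as np

def soft_assign(z, centers, alpha=1.0):
    """Soft cluster assignments q[i, k] from a Student's t kernel.

    z: (n_cells, d) bottleneck embeddings; centers: (n_clusters, d).
    """
    # Squared Euclidean distance between every embedding and every center.
    d2 = ((z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    q = (1.0 + d2 / alpha) ** (-(alpha + 1.0) / 2.0)
    # Normalize each row so it is a distribution over clusters.
    return q / q.sum(axis=1, keepdims=True)

def target_distribution(q):
    """Sharpened target p: square q, divide by soft cluster size, renormalize."""
    w = q ** 2 / q.sum(axis=0)
    return w / w.sum(axis=1, keepdims=True)

def kl_clustering_loss(p, q, eps=1e-12):
    """KL(P || Q) summed over cells and clusters; eps guards against log(0)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Hypothetical toy data: 8 "cells" in a 2-D embedding, 3 cluster centers.
rng = np.random.default_rng(0)
z = rng.normal(size=(8, 2))
centers = rng.normal(size=(3, 2))
q = soft_assign(z, centers)
p = target_distribution(q)
loss = kl_clustering_loss(p, q)
```

In a full training loop this loss would be minimized with respect to both the encoder weights and the cluster centers, with `p` recomputed periodically; those details are outside what the abstract specifies.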
Award ID(s): 1659472
PAR ID: 10378518
Journal Name: Genome Research
ISSN: 1088-9051
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The selection of marker gene panels is critical for capturing the cellular and spatial heterogeneity in the expanding atlases of single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data. Most current approaches to marker gene selection operate in a label-based framework, which is inherently limited by its dependency on predefined cell type labels or clustering results. In contrast, existing label-free methods often struggle to identify genes that characterize rare cell types or subtle spatial patterns, and they frequently fail to scale efficiently with large data sets. Here, we introduce geneCover, a label-free combinatorial method that selects an optimal panel of minimally redundant marker genes based on gene-gene correlations. Our method demonstrates excellent scalability to large data sets and identifies marker gene panels that capture distinct correlation structures across the transcriptome. This allows geneCover to distinguish cell states in various tissues of living organisms effectively, including those associated with rare or otherwise difficult-to-identify cell types. We evaluate the performance of geneCover across various scRNA-seq and spatial transcriptomics data sets, comparing it to other label-free algorithms to highlight its utility and potential in diverse biological contexts. 
  2. With the advent of single-cell RNA sequencing (scRNA-seq) technologies, there has been a spike in studies involving scRNA-seq of several tissues across diverse species including Drosophila. Although a few databases exist for users to query genes of interest within the scRNA-seq studies, search tools that enable users to find orthologous genes and their cell type-specific expression patterns across species are limited. Here, we built a new search database, DRscDB (https://www.flyrnai.org/tools/single_cell/web/), to address this need. DRscDB serves as a comprehensive repository for published scRNA-seq datasets for Drosophila and relevant datasets from human and other model organisms. DRscDB is based on manual curation of Drosophila scRNA-seq studies of various tissue types and their corresponding analogous tissues in vertebrates including zebrafish, mouse, and human. Of note, our search database provides most of the literature-derived marker genes, thus preserving the original analysis of the published scRNA-seq datasets. Finally, DRscDB serves as a web-based user interface that allows users to mine gene expression data from scRNA-seq studies and perform cell cluster enrichment analyses pertaining to various scRNA-seq studies, both within and across species.
  3. Single cell RNA-sequencing (scRNA-seq) technology enables comprehensive transcriptomic profiling of thousands of cells with distinct phenotypic and physiological states in a complex tissue. Substantial efforts have been made to characterize single cells of distinct identities from scRNA-seq data, including various cell clustering techniques. While existing approaches can handle single cells in terms of different cell (sub)types at a high resolution, identification of the functional variability within the same cell type remains unsolved. In addition, there is a lack of robust methods to handle the inter-subject variation that often brings severe confounding effects for the functional clustering of single cells. In this study, we developed a novel data denoising and cell clustering approach, namely CIBS, to provide biologically explainable functional classification for scRNA-seq data. CIBS is based on a systems biology model of transcriptional regulation that assumes a multi-modality distribution of the cells’ activation status, and it utilizes a Boolean matrix factorization approach on the discretized expression status to robustly derive functional modules. CIBS is empowered by a novel fast Boolean Matrix Factorization method, namely PFAST, to increase the computational feasibility on large scale scRNA-seq data. Application of CIBS on two scRNA-seq datasets collected from the tumor microenvironment successfully identified subgroups of cancer cells with distinct expression patterns of epithelial-mesenchymal transition and extracellular matrix marker genes, which were not revealed by the existing cell clustering analysis tools. The identified cell groups were significantly associated with the clinically confirmed lymph-node invasion and metastasis events across different patients. Index Terms: Cell clustering analysis, Data denoising, Boolean matrix factorization, Cancer microenvironment, Metastasis.
  4. When analyzing scRNA-seq data with clustering algorithms, annotating the clusters with cell types is an essential step toward biological interpretation of the data. Annotations can be performed manually using known cell type marker genes. Annotations can also be automated using knowledge-driven or data-driven machine learning algorithms. The majority of cell type annotation algorithms are designed to predict cell types for individual cells in a new dataset. Since biological interpretation of scRNA-seq data is often made on cell clusters rather than individual cells, several algorithms have been developed to annotate cell clusters. In this study, we compared five cell type annotation algorithms, Azimuth, SingleR, Garnett, scCATCH, and SCSA, which cover the spectrum of knowledge-driven and data-driven approaches to annotate either individual cells or cell clusters. We applied these five algorithms to two scRNA-seq datasets of peripheral blood mononuclear cell (PBMC) samples from COVID-19 patients and healthy controls, and evaluated their annotation performance. From this comparison, we observed that methods for annotating individual cells outperformed methods for annotating cell clusters. We applied the cell-based annotation algorithm Azimuth to the two scRNA-seq datasets to examine the immune response during COVID-19 infection. Both datasets presented significant depletion of plasmacytoid dendritic cells (pDCs), where differential expression in this cell type and pathway analysis revealed strong activation of the type I interferon signaling pathway in response to the infection.
  5. Definition of cell classes across the tissues of living organisms is central in the analysis of growing atlases of single-cell RNA sequencing (scRNA-seq) data across biomedicine. Marker genes for cell classes are most often defined by differential expression (DE) methods that serially assess individual genes across landscapes of diverse cells. This serial approach has been extremely useful, but is limited because it ignores possible redundancy or complementarity across genes that can only be captured by analyzing multiple genes simultaneously. Interrogating binarized expression data, we aim to identify discriminating panels of genes that are specific to, not only enriched in, individual cell types. To efficiently explore the vast space of possible marker panels, leverage the large number of cells often sequenced, and overcome zero-inflation in scRNA-seq data, we propose viewing marker gene panel selection as a variation of the “minimal set-covering problem” in combinatorial optimization. Using scRNA-seq data from blood and brain tissue, we show that this new method, CellCover, performs as well as or better than DE and other methods in defining cell-type discriminating gene panels, while reducing gene redundancy and capturing cell-class-specific signals that are distinct from those defined by DE methods. Transfer learning experiments across mouse, primate, and human data demonstrate that CellCover identifies markers of conserved cell classes in neocortical neurogenesis, as well as developmental progression in both progenitors and neurons. Exploring markers of human outer radial glia (oRG, or basal RG) across mammals, we show that transcriptomic elements of this key cell type in the expansion of the human cortex likely appeared in gliogenic precursors of the rodent before the full program emerged in neurogenic cells of the primate lineage.
We have assembled the public datasets we use in this report within the NeMO Analytics multi-omic data exploration environment [1], where the expression of individual genes (NeMO: Individual genes in cortex and NeMO: Individual genes in blood) and marker gene panels (NeMO: Telley 3 CellCover Panels, NeMO: Telley 12 CellCover Panels, NeMO: Sorted Brain Cell CellCover Panels, and NeMO: Blood 34 CellCover Panels) can be freely explored without coding expertise. CellCover is available in CellCover R and CellCover Python. 
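Item 5 above frames marker panel selection as a variation of the minimal set-covering problem on binarized expression data. As a rough illustration of the underlying idea only, the sketch below applies the classic greedy set-cover heuristic: each gene "covers" the cells in which it is detected, and genes are added until every cell of the target class is covered. This is not CellCover's algorithm, whose exact formulation is not given here; the function name, the `max_genes` parameter, and the toy matrix are hypothetical.

```python
import numpy as np

def greedy_cover(expr, max_genes=None):
    """Greedy set cover over a binarized expression matrix.

    expr: boolean array (n_cells, n_genes); True where a gene is detected
    in a cell of the target class. Returns the chosen gene indices and a
    boolean mask of cells that remain uncovered.
    """
    expr = np.asarray(expr, dtype=bool)
    n_cells, n_genes = expr.shape
    uncovered = np.ones(n_cells, dtype=bool)
    panel = []
    limit = n_genes if max_genes is None else max_genes
    while uncovered.any() and len(panel) < limit:
        # Number of still-uncovered cells each gene would cover.
        gains = expr[uncovered].sum(axis=0)
        best = int(gains.argmax())
        if gains[best] == 0:
            break  # remaining cells express none of the genes
        panel.append(best)
        uncovered &= ~expr[:, best]
    return panel, uncovered

# Hypothetical toy matrix: 4 cells (rows) x 3 genes (columns).
toy = np.array([[1, 0, 0],
                [1, 0, 0],
                [0, 1, 0],
                [0, 1, 1]], dtype=bool)
panel, uncovered = greedy_cover(toy)
```

The greedy heuristic only addresses coverage within one class; requiring that the chosen genes also be specific to that class, as the abstract describes, would need additional constraints that this sketch does not model.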