The selection of marker gene panels is critical for capturing the cellular and spatial heterogeneity in the expanding atlases of single-cell RNA sequencing (scRNA-seq) and spatial transcriptomics data. Most current approaches to marker gene selection operate in a label-based framework, which is inherently limited by its dependency on predefined cell type labels or clustering results. In contrast, existing label-free methods often struggle to identify genes that characterize rare cell types or subtle spatial patterns, and they frequently fail to scale efficiently with large data sets. Here, we introduce geneCover, a label-free combinatorial method that selects an optimal panel of minimally redundant marker genes based on gene-gene correlations. Our method demonstrates excellent scalability to large data sets and identifies marker gene panels that capture distinct correlation structures across the transcriptome. This allows geneCover to distinguish cell states in various tissues of living organisms effectively, including those associated with rare or otherwise difficult-to-identify cell types. We evaluate the performance of geneCover across various scRNA-seq and spatial transcriptomics data sets, comparing it to other label-free algorithms to highlight its utility and potential in diverse biological contexts.
Optimal marker gene selection for cell type discrimination in single cell analyses
Single-cell technologies characterize complex cell populations across multiple data modalities at unprecedented scale and resolution. Multi-omic data for single cell gene expression, in situ hybridization, or single cell chromatin states are increasingly available across diverse tissue types. When isolating specific cell types from a sample of dissociated cells or performing in situ sequencing in collections of heterogeneous cells, one challenging task is to select a small set of informative markers that robustly enable the identification and discrimination of specific cell types or cell states as precisely as possible. Given single cell RNA-seq data and a set of cellular labels to discriminate, scGeneFit selects gene markers that jointly optimize cell label recovery using label-aware compressive classification methods. This results in a substantially more robust and less redundant set of markers than existing methods, most of which identify markers that separate each cell label from the rest. When applied to a data set with a hierarchy of cell types as labels, the markers found by our method improve the recovery of the cell type hierarchy with fewer markers than existing methods, using a computationally efficient and principled optimization.
- PAR ID: 10214427
- Publisher / Repository: Nature Publishing Group
- Date Published:
- Journal Name: Nature Communications
- Volume: 12
- Issue: 1
- ISSN: 2041-1723
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Single-cell RNA sequencing (scRNAseq) is a powerful tool to describe cell types in multicellular organisms across the animal kingdom. In standard scRNAseq analysis pipelines, clusters of cells with similar transcriptional signatures are given cell type labels based on marker genes that indicate known specialized characteristics. Since these analyses are designed for model organisms, such as humans and mice, problems arise when attempting to label cell types of distantly related, non-model species that have unique or divergent cell types. This leads to limited discovery of novel species-specific cell types and potential mis-annotation of cell types in non-model species when using scRNAseq. To address this problem, we discuss recently published approaches that help annotate scRNAseq clusters for any non-model organism. First, we suggest that annotating with an evolutionary context of cell lineages will aid in the discovery of novel cell types and provide a marker-free approach to compare cell types across distantly related species. Second, machine learning has greatly improved bioinformatic analyses, so we highlight some open-source programs that use reference-free approaches to annotate cell clusters. Lastly, we propose the use of unannotated genes as potential cell markers for non-model organisms, as many do not have fully annotated genomes and these data are often disregarded. Improving single-cell annotations will aid the discovery of novel cell types and enhance our understanding of non-model organisms at a cellular level. By unifying approaches to annotate cell types in non-model organisms, we can increase the confidence of cell annotation label transfer and the flexibility to discover novel cell types.
-
Intermediate cell states (ICSs) during the epithelial–mesenchymal transition (EMT) are emerging as a driving force of cancer invasion and metastasis. ICSs typically exhibit hybrid epithelial/mesenchymal characteristics as well as cancer stem cell (CSC) traits including proliferation and drug resistance. Here, we analyze several single-cell RNA-seq (scRNA-seq) datasets to investigate the relation between several axes of cancer progression including EMT, CSC traits, and cell–cell signaling. To accomplish this task, we integrate computational methods for clustering and trajectory inference with analysis of EMT gene signatures, CSC markers, and cell–cell signaling pathways, and highlight conserved and specific processes across the datasets. Our analysis reveals that “standard” measures of pluripotency often used in developmental contexts do not necessarily correlate with EMT progression and expression of CSC-related markers. Conversely, an EMT circuit energy that quantifies the co-expression of epithelial and mesenchymal genes consistently increases along EMT trajectories across different cancer types and anatomical locations. Moreover, despite the high context specificity of signal transduction across different cell types, cells undergoing EMT always increased their potential to send and receive signals from other cells.
-
Single cell data integration methods aim to integrate cells across data batches and modalities. Data integration tasks can be categorized into horizontal, vertical, diagonal, and mosaic integration, where mosaic integration is the most general and challenging case, with few methods developed. We propose scMoMaT, a method that is able to integrate single cell multi-omics data under the mosaic integration scenario using matrix tri-factorization. During integration, scMoMaT is also able to uncover the cluster-specific bio-markers across modalities. These multi-modal bio-markers are used to interpret and annotate the clusters to cell types. Moreover, scMoMaT can integrate cell batches with unequal cell type compositions. Applying scMoMaT to multiple real and simulated datasets demonstrated these features of scMoMaT and showed that scMoMaT has superior performance compared to existing methods. Specifically, we show that the integrated cell embedding combined with learned bio-markers leads to cell type annotations of higher quality or resolution compared to their original annotations.
-
Definition of cell classes across the tissues of living organisms is central in the analysis of growing atlases of single-cell RNA sequencing (scRNA-seq) data across biomedicine. Marker genes for cell classes are most often defined by differential expression (DE) methods that serially assess individual genes across landscapes of diverse cells. This serial approach has been extremely useful, but is limited because it ignores possible redundancy or complementarity across genes that can only be captured by analyzing multiple genes simultaneously. Interrogating binarized expression data, we aim to identify discriminating panels of genes that are specific to, not only enriched in, individual cell types. To efficiently explore the vast space of possible marker panels, leverage the large number of cells often sequenced, and overcome zero-inflation in scRNA-seq data, we propose viewing marker gene panel selection as a variation of the “minimal set-covering problem” in combinatorial optimization. Using scRNA-seq data from blood and brain tissue, we show that this new method, CellCover, performs as well as or better than DE and other methods in defining cell-type discriminating gene panels, while reducing gene redundancy and capturing cell-class-specific signals that are distinct from those defined by DE methods. Transfer learning experiments across mouse, primate, and human data demonstrate that CellCover identifies markers of conserved cell classes in neocortical neurogenesis, as well as developmental progression in both progenitors and neurons. Exploring markers of human outer radial glia (oRG, or basal RG) across mammals, we show that transcriptomic elements of this key cell type in the expansion of the human cortex likely appeared in gliogenic precursors of the rodent before the full program emerged in neurogenic cells of the primate lineage.
We have assembled the public datasets we use in this report within the NeMO Analytics multi-omic data exploration environment [1], where the expression of individual genes (NeMO: Individual genes in cortex and NeMO: Individual genes in blood) and marker gene panels (NeMO: Telley 3 CellCover Panels, NeMO: Telley 12 CellCover Panels, NeMO: Sorted Brain Cell CellCover Panels, and NeMO: Blood 34 CellCover Panels) can be freely explored without coding expertise. CellCover is available in CellCover R and CellCover Python.
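The set-covering view of marker panel selection described in this abstract can be illustrated with a minimal greedy sketch. This is not the CellCover implementation or its API: the function name `greedy_marker_panel`, the `max_off_target` specificity filter, and the toy matrix are all assumptions made for illustration. The sketch only shows the general idea of choosing genes on binarized expression so that each cell of a target class expresses at least one panel gene.

```python
import numpy as np

def greedy_marker_panel(binary_expr, in_class, max_genes=5, max_off_target=0.1):
    """Greedy set cover over binarized expression (illustrative only).

    binary_expr: (cells x genes) 0/1 array.
    in_class: boolean mask marking cells of the target class.
    Genes detected in more than `max_off_target` of the other cells are
    excluded (a hypothetical specificity filter, not taken from the paper).
    Returns the chosen gene indices and the fraction of class cells covered.
    """
    binary_expr = np.asarray(binary_expr, dtype=bool)
    in_class = np.asarray(in_class, dtype=bool)
    if not in_class.any():
        raise ValueError("in_class must mark at least one cell")

    target = binary_expr[in_class]        # cells that must be covered
    background = binary_expr[~in_class]   # cells that should not be covered
    if background.shape[0] > 0:
        off_rate = background.mean(axis=0)
    else:
        off_rate = np.zeros(binary_expr.shape[1])
    eligible = off_rate <= max_off_target

    uncovered = np.ones(target.shape[0], dtype=bool)
    panel = []
    while uncovered.any() and len(panel) < max_genes:
        # Number of still-uncovered class cells each gene would cover.
        gains = target[uncovered].sum(axis=0)
        gains[~eligible] = 0
        best = int(np.argmax(gains))
        if gains[best] == 0:
            break                          # no eligible gene adds coverage
        panel.append(best)
        uncovered &= ~target[:, best]
    return panel, float(1.0 - uncovered.mean())

# Toy data: 6 cells x 4 genes; the first four cells form the target class.
expr = np.array([
    [1, 0, 1, 0],
    [1, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 1, 1],
])
labels = np.array([True, True, True, True, False, False])

panel, coverage = greedy_marker_panel(expr, labels)
print(panel, coverage)  # [0, 1] 1.0
```

In the toy data, gene 2 is detected in every cell, so it is enriched nowhere and is filtered out, while genes 0 and 1 are each specific to half of the class and together cover it. A greedy heuristic like this gives no optimality guarantee; an exact formulation would typically be posed as an integer program.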
