skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Subject clustering by IF-PCA and several recent methods
Subject clustering (i.e., the use of measured features to cluster subjects, such as patients or cells, into multiple groups) is a problem of significant interest. In recent years, many approaches have been proposed, among which unsupervised deep learning (UDL) has received much attention. Two interesting questions are 1) how to combine the strengths of UDL and other approaches and 2) how these approaches compare to each other. We combine the variational auto-encoder (VAE), a popular UDL approach, with the recent idea of influential feature-principal component analysis (IF-PCA) and propose IF-VAE as a new method for subject clustering. We study IF-VAE and compare it with several other methods (including IF-PCA, VAE, Seurat, and SC3) on 10 gene microarray data sets and eight single-cell RNA-seq data sets. We find that IF-VAE shows significant improvement over VAE, but still underperforms compared to IF-PCA. We also find that IF-PCA is quite competitive, slightly outperforming Seurat and SC3 over the eight single-cell data sets. IF-PCA is conceptually simple and permits delicate analysis. We demonstrate that IF-PCA is capable of achieving phase transition in a rare/weak model. Comparatively, Seurat and SC3 are more complex and theoretically difficult to analyze (for these reasons, their optimality remains unclear).  more » « less
Award ID(s):
2015469 1943902
PAR ID:
10432207
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Frontiers in Genetics
Volume:
14
ISSN:
1664-8021
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract One key challenge encountered in single-cell data clustering is to combine clustering results of data sets acquired from multiple sources. We propose to represent the clustering result of each data set by a Gaussian mixture model (GMM) and produce an integrated result based on the notion of Wasserstein barycenter. However, the precise barycenter of GMMs, a distribution on the same sample space, is computationally infeasible to solve. Importantly, the barycenter of GMMs may not be a GMM containing a reasonable number of components. We thus propose to use the minimized aggregated Wasserstein (MAW) distance to approximate the Wasserstein metric and develop a new algorithm for computing the barycenter of GMMs under MAW. Recent theoretical advances further justify using the MAW distance as an approximation for the Wasserstein metric between GMMs. We also prove that the MAW barycenter of GMMs has the same expectation as the Wasserstein barycenter. Our proposed algorithm for clustering integration scales well with the data dimension and the number of mixture components, with complexity independent of data size. We demonstrate that the new method achieves better clustering results on several single-cell RNA-seq data sets than some other popular methods. 
    more » « less
  2. Abstract Principal Component Analysis (PCA) has long been a cornerstone in dimensionality reduction for high-dimensional data, including single-cell RNA sequencing (scRNA-seq). However, PCA’s performance typically degrades with increasing data size, can be sensitive to outliers, and assumes linearity. Recently, Random Projection (RP) methods have emerged as promising alternatives, addressing some of these limitations. This study systematically and comprehensively evaluates PCA and RP approaches, including Singular Value Decomposition (SVD) and randomized SVD, alongside Sparse and Gaussian Random Projection algorithms, with a focus on computational efficiency and downstream analysis effectiveness. We benchmark performance using multiple scRNA-seq datasets including labeled and unlabeled publicly available datasets. We apply Hierarchical Clustering and Spherical K-Means clustering algorithms to assess downstream clustering quality. For labeled datasets, clustering accuracy is measured using the Hungarian algorithm and Mutual Information. For unlabeled datasets, the Dunn Index and Gap Statistic capture cluster separation. Across both dataset types, the Within-Cluster Sum of Squares (WCSS) metric is used to assess variability. Additionally, locality preservation is examined, with RP outperforming PCA in several of the evaluated metrics. Our results demonstrate that RP not only surpasses PCA in computational speed but also rivals and, in some cases, exceeds PCA in preserving data variability and clustering quality. By providing a thorough benchmarking of PCA and RP methods, this work offers valuable insights into selecting optimal dimensionality reduction techniques, balancing computational performance, scalability, and the quality of downstream analyses. 
    more » « less
  3. null (Ed.)
    Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. We observe that snRNA-seq is commonly subject to contamination by high amounts of ambient RNA, which can lead to biased downstream analyses, such as identification of spurious cell types if overlooked. We present a novel approach to quantify contamination and filter droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distribution of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: (1) human differentiating preadipocytes in vitro, (2) fresh mouse brain tissue, and (3) human frozen adipose tissue (AT) from six individuals. All three data sets showed evidence of extranuclear RNA contamination, and we observed that existing methods fail to account for contaminated droplets and led to spurious cell types. When compared to filtering using these state of the art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher quality clusters. Although DIEM was designed for snRNA-seq, our clustering strategy also successfully filtered single-cell RNA-seq data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data fast and effectively, leading to cleaner downstream analysis. Our code is freely available for use at https://github.com/marcalva/diem . 
    more » « less
  4. Abstract Single-cell RNA sequencing trades read-depth for dimensionality, often leading to loss of critical signaling gene information that is typically present in bulk data sets. We introduce DURIAN (Deconvolution and mUltitask-Regression-based ImputAtioN), an integrative method for recovery of gene expression in single-cell data. Through systematic benchmarking, we demonstrate the accuracy, robustness and empirical convergence of DURIAN using both synthetic and published data sets. We show that use of DURIAN improves single-cell clustering, low-dimensional embedding, and recovery of intercellular signaling networks. Our study resolves several inconsistent results of cell–cell communication analysis using single-cell or bulk data independently. The method has broad application in biomarker discovery and cell signaling analysis using single-cell transcriptomics data sets. 
    more » « less
  5. null (Ed.)
    Single cell RNA-sequencing (scRNA-seq) technology enables comprehensive transcriptomic profiling of thousands of cells with distinct phenotypic and physiological states in a complex tissue. Substantial efforts have been made to characterize single cells of distinct identities from scRNA-seq data, including various cell clustering techniques. While existing approaches can handle single cells in terms of different cell (sub)types at a high resolution, identification of the functional variability within the same cell type remains unsolved. In addition, there is a lack of robust method to handle the inter-subject variation that often brings severe confounding effects for the functional clustering of single cells. In this study, we developed a novel data denoising and cell clustering approach, namely CIBS, to provide biologically explainable functional classification for scRNA-seq data. CIBS is based on a systems biology model of transcriptional regulation that assumes a multi-modality distribution of the cells’ activation status, and it utilizes a Boolean matrix factorization approach on the discretized expression status to robustly derive functional modules. CIBS is empowered by a novel fast Boolean Matrix Factorization method, namely PFAST, to increase the computational feasibility on large scale scRNA-seq data. Application of CIBS on two scRNA-seq datasets collected from cancer tumor micro-environment successfully identified subgroups of cancer cells with distinct expression patterns of epithelial-mesenchymal transition and extracellular matrix marker genes, which was not revealed by the existing cell clustering analysis tools. The identified cell groups were significantly associated with the clinically confirmed lymph-node invasion and metastasis events across different patients. Index Terms—Cell clustering analysis, Data denoising, Boolean matrix factorization, Cancer microenvirionment, Metastasis. 
    more » « less