Abstract Single-cell RNA sequencing (scRNA-seq) is a widely used technique for characterizing individual cells and studying gene expression at the single-cell level. Clustering plays a vital role in grouping similar cells together for various downstream analyses. However, the high sparsity and dimensionality of large scRNA-seq data pose challenges to clustering performance. Although several deep learning-based clustering algorithms have been proposed, most existing clustering methods have limitations in capturing the precise distribution types of the data or fully utilizing the relationships between cells, leaving a considerable scope for improving the clustering performance, particularly in detecting rare cell populations from large scRNA-seq data. We introduce DeepScena, a novel single-cell hierarchical clustering tool that fully incorporates nonlinear dimension reduction, negative binomial-based convolutional autoencoder for data fitting, and a self-supervision model for cell similarity enhancement. In comprehensive evaluation using multiple large-scale scRNA-seq datasets, DeepScena consistently outperformed seven popular clustering tools in terms of accuracy. Notably, DeepScena exhibits high proficiency in identifying rare cell populations within large datasets that contain large numbers of clusters. When applied to scRNA-seq data of multiple myeloma cells, DeepScena successfully identified not only previously labeled large cell types but also subpopulations in CD14 monocytes, T cells and natural killer cells, respectively.
more »
« less
CosTaL: an accurate and scalable graph-based clustering algorithm for high-dimensional single-cell data analysis
Abstract With the aim of analyzing large-sized multidimensional single-cell datasets, we are describing a method for Cosine-based Tanimoto similarity-refined graph for community detection using Leiden’s algorithm (CosTaL). As a graph-based clustering method, CosTaL transforms the cells with high-dimensional features into a weighted k-nearest-neighbor (kNN) graph. The cells are represented by the vertices of the graph, while an edge between two vertices in the graph represents the close relatedness between the two cells. Specifically, CosTaL builds an exact kNN graph using cosine similarity and uses the Tanimoto coefficient as the refining strategy to re-weight the edges in order to improve the effectiveness of clustering. We demonstrate that CosTaL generally achieves equivalent or higher effectiveness scores on seven benchmark cytometry datasets and six single-cell RNA-sequencing datasets using six different evaluation metrics, compared with other state-of-the-art graph-based clustering methods, including PhenoGraph, Scanpy and PARC. As indicated by the combined evaluation metrics, Costal has high efficiency with small datasets and acceptable scalability for large datasets, which is beneficial for large-scale analysis.
more »
« less
- Award ID(s):
- 2002321
- PAR ID:
- 10411947
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Briefings in Bioinformatics
- Volume:
- 24
- Issue:
- 3
- ISSN:
- 1467-5463
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Cardiomyocytes (CMs) are heart cells responsible for heart contraction and relaxation. CMs can be derived from human induced pluripotent stem cells (hiPSCs) with high yield and purity. Mature CMs can potentially replace dead and dysfunctional cardiac tissue and be used for screening cardiac drugs and toxins. However, hiPSCs-derived CMs (hiPSC-CMs) are immature, which limits their utilization. Therefore, it is crucial to understand how experimental variables, especially tunable ones, of hiPSC expansion and differentiation phases affect the hiPSC-CM maturity stage. This study applied clustering algorithms to day 30 cardiac differentiation data to investigate if any maturity-related cell features could be related to the experimental variables. The best models were obtained using k-means and Gaussian mixture model clustering algorithms based on the evaluation metrics. They grouped the cells based on eccentricity and elongation. The cosine similarity between the clustering results and the experimental parameters revealed that the Gaussian mixture model results have strong similarities of 0.88, 0.94, and 0.93 with axial ratio, diameter, and cell concentration.more » « less
-
We present a novel method that automatically measures quality of sentential paraphrasing. Our method balances two conflicting criteria: semantic similarity and lexical diversity. Using a diverse annotated corpus, we built learning to rank models on edit distance, BLEU, ROUGE, and cosine similarity features. Extrinsic evaluation on STS Benchmark and ParaBank Evaluation datasets resulted in a model ensemble with moderate to high quality. We applied our method on both small benchmarking and large-scale datasets as resources for the community.more » « less
-
Mathelier, Anthony (Ed.)Abstract Motivation Recent breakthroughs of single-cell RNA sequencing (scRNA-seq) technologies offer an exciting opportunity to identify heterogeneous cell types in complex tissues. However, the unavoidable biological noise and technical artifacts in scRNA-seq data as well as the high dimensionality of expression vectors make the problem highly challenging. Consequently, although numerous tools have been developed, their accuracy remains to be improved. Results Here, we introduce a novel clustering algorithm and tool RCSL (Rank Constrained Similarity Learning) to accurately identify various cell types using scRNA-seq data from a complex tissue. RCSL considers both local similarity and global similarity among the cells to discern the subtle differences among cells of the same type as well as larger differences among cells of different types. RCSL uses Spearman’s rank correlations of a cell’s expression vector with those of other cells to measure its global similarity, and adaptively learns neighbor representation of a cell as its local similarity. The overall similarity of a cell to other cells is a linear combination of its global similarity and local similarity. RCSL automatically estimates the number of cell types defined in the similarity matrix, and identifies them by constructing a block-diagonal matrix, such that its distance to the similarity matrix is minimized. Each block-diagonal submatrix is a cell cluster/type, corresponding to a connected component in the cognate similarity graph. When tested on 16 benchmark scRNA-seq datasets in which the cell types are well-annotated, RCSL substantially outperformed six state-of-the-art methods in accuracy and robustness as measured by three metrics. Availability and implementation The RCSL algorithm is implemented in R and can be freely downloaded at https://cran.r-project.org/web/packages/RCSL/index.html. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
We propose a method for certifying the fairness of the classification result of a widely used supervised learning algorithm, the k-nearest neighbors (KNN), under the assumption that the training data may have historical bias caused by systematic mislabeling of samples from a protected minority group. To the best of our knowledge, this is the first certification method for KNN based on three variants of the fairness definition: individual fairness, ϵ -fairness, and label-flipping fairness. We first define the fairness certification problem for KNN and then propose sound approximations of the complex arithmetic computations used in the state-of-the-art KNN algorithm. This is meant to lift the computation results from the concrete domain to an abstract domain, to reduce the computational cost. We show effectiveness of this abstract interpretation based technique through experimental evaluation on six datasets widely used in the fairness research literature. We also show that the method is accurate enough to obtain fairness certifications for a large number of test inputs, despite the presence of historical bias in the datasets.more » « less
An official website of the United States government
