Numerous single‐cell transcriptomic datasets from identical tissues or cell lines are generated by different laboratories or with different single‐cell RNA sequencing (scRNA‐seq) protocols. Denoising these datasets to eliminate batch effects is crucial for data integration, ensuring accurate interpretation and comprehensive analysis of biological questions. Although many scRNA‐seq data integration methods exist, most are inefficient and/or not conducive to downstream analysis. Here, DeepBID is introduced, a novel deep learning‐based method that concurrently performs batch effect correction, non‐linear dimensionality reduction, embedding, and cell clustering. DeepBID utilizes a negative binomial‐based autoencoder with dual Kullback–Leibler divergence loss functions, aligning cell points from different batches within a consistent low‐dimensional latent space and progressively mitigating batch effects through iterative clustering. Extensive validation on multiple‐batch scRNA‐seq datasets demonstrates that DeepBID surpasses existing tools in removing batch effects and achieves superior clustering accuracy. When integrating multiple scRNA‐seq datasets from patients with Alzheimer's disease, DeepBID significantly improves cell clustering, effectively annotates unidentified cells, and detects cell‐specific differentially expressed genes.
- PAR ID: 10222377
- Date Published:
- Journal Name: Proceedings of the National Academy of Sciences
- Volume: 118
- Issue: 10
- ISSN: 0027-8424
- Page Range / eLocation ID: e2024383118
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
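The abstract above mentions a negative binomial‐based autoencoder trained with Kullback–Leibler divergence losses. As a rough illustration of those two ingredients only, the sketch below computes a negative binomial negative log-likelihood for a single count and a discrete KL divergence. The mean/inverse-dispersion parameterization and the function names `nb_negative_log_likelihood` and `kl_divergence` are assumptions made for this example; they are not taken from the DeepBID implementation, whose exact loss terms are not given here.

```python
import math


def nb_negative_log_likelihood(x, mu, theta):
    """Negative log-likelihood of a count x under a negative binomial.

    Assumed parameterization (hypothetical, for illustration):
    mu > 0 is the mean and theta > 0 is the inverse dispersion, so that
    the variance is mu + mu**2 / theta.
    """
    log_pmf = (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )
    return -log_pmf


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as equal-length lists.

    Terms with p_i == 0 contribute nothing; q_i must be positive
    wherever p_i is positive.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
```

In an autoencoder setting, a decoder would typically output `mu` (and possibly `theta`) per gene, and the reconstruction loss would sum this negative log-likelihood over genes and cells; the KL term is commonly used to pull soft cluster assignments toward a target distribution. Both statements describe general practice, not confirmed details of DeepBID.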
More Like this
-
Single cell RNA-sequencing (scRNA-seq) technology enables comprehensive transcriptomic profiling of thousands of cells with distinct phenotypic and physiological states in a complex tissue. Substantial efforts have been made to characterize single cells of distinct identities from scRNA-seq data, including various cell clustering techniques. While existing approaches can resolve single cells into different cell (sub)types at a high resolution, identification of the functional variability within the same cell type remains unsolved. In addition, robust methods are lacking to handle the inter-subject variation that often severely confounds the functional clustering of single cells. In this study, we developed a novel data denoising and cell clustering approach, namely CIBS, to provide biologically explainable functional classification for scRNA-seq data. CIBS is based on a systems biology model of transcriptional regulation that assumes a multi-modality distribution of the cells’ activation status, and it utilizes a Boolean matrix factorization approach on the discretized expression status to robustly derive functional modules. CIBS is empowered by a novel fast Boolean Matrix Factorization method, namely PFAST, to increase the computational feasibility on large scale scRNA-seq data. Application of CIBS on two scRNA-seq datasets collected from cancer tumor micro-environment successfully identified subgroups of cancer cells with distinct expression patterns of epithelial-mesenchymal transition and extracellular matrix marker genes, which were not revealed by the existing cell clustering analysis tools. The identified cell groups were significantly associated with the clinically confirmed lymph-node invasion and metastasis events across different patients. Index Terms: Cell clustering analysis, Data denoising, Boolean matrix factorization, Cancer microenvironment, Metastasis.
-
With the advent of single-cell RNA sequencing (scRNA-seq) technologies, there has been a spike in studies involving scRNA-seq of several tissues across diverse species including Drosophila. Although a few databases exist for users to query genes of interest within the scRNA-seq studies, search tools that enable users to find orthologous genes and their cell type-specific expression patterns across species are limited. Here, we built a new search database, DRscDB (https://www.flyrnai.org/tools/single_cell/web/), to address this need. DRscDB serves as a comprehensive repository for published scRNA-seq datasets for Drosophila and relevant datasets from human and other model organisms. DRscDB is based on manual curation of Drosophila scRNA-seq studies of various tissue types and their corresponding analogous tissues in vertebrates including zebrafish, mouse, and human. Of note, our search database provides most of the literature-derived marker genes, thus preserving the original analysis of the published scRNA-seq datasets. Finally, DRscDB serves as a web-based user interface that allows users to mine gene expression data from scRNA-seq studies and perform cell cluster enrichment analyses pertaining to various scRNA-seq studies, both within and across species.
-
Large-scale scRNA-seq studies typically generate data in batches, which often induce nontrivial batch effects that need to be corrected. Given the global efforts for building cell atlases and the increasing number of annotated scRNA-seq datasets accumulated, we propose a supervised strategy for scRNA-seq data integration called SIDA (Supervised Integration using Domain Adaptation), which uses the cell type annotations to guide the integration of diverse batches. The supervised strategy is based on domain adaptation that was initially proposed in the computer vision field. We demonstrate that SIDA is able to generate comprehensive reference datasets that lead to improved accuracy in automated cell type mapping analyses.
-
In recent years, the integration of single‐cell multi‐omics data has provided a more comprehensive understanding of cell functions and internal regulatory mechanisms than any single omics layer alone, but it still faces many challenges, such as omics‐variance, sparsity, cell heterogeneity, and confounding factors. The cell cycle is known to act as a confounder when analyzing other factors in single‐cell RNA‐seq data, but it is not clear how it affects integrated single‐cell multi‐omics data. Here, a cell cycle‐aware network (CCAN) is developed to remove cell cycle effects from the integrated single‐cell multi‐omics data while keeping the cell type‐specific variations. This is the first computational model to study the cell‐cycle effects in the integration of single‐cell multi‐omics data. Validations on several benchmark datasets show the outstanding performance of CCAN in a variety of downstream analyses and applications, including removing cell cycle effects and batch effects of scRNA‐seq datasets from different protocols, integrating paired and unpaired scRNA‐seq and scATAC‐seq data, accurately transferring cell type labels from scRNA‐seq to scATAC‐seq data, and characterizing the differentiation process from hematopoietic stem cells to different lineages in the integration of differentiation data.