Single-cell RNA sequencing (scRNA-seq) data often contain doublets, where a doublet manifests as 1 cell barcode that corresponds to combined gene expression of two or more cells. Existence of doublets can lead to spurious biological interpretations. Here, we present s ingle- c ell MO del-driven D oublet D etection ( scMODD ), a model-driven algorithm to detect doublets in scRNA-seq data. ScMODD achieved similar performance compared to existing doublet detection algorithms which are primarily data-driven, showing the promise of model-driven approach for doublet detection. When implementing scMODD in simulated and real scRNA-seq data, we tested both the negative binomial (NB) model and the zero-inflated negative binomial (ZINB) model to serve as the underlying statistical model for scRNA-seq count data, and observed that incorporating zero inflation did not improve detection performance, suggesting that consideration of zero inflation is not necessary in the context of doublet detection in scRNA-seq.
more »
« less
doubletD: detecting doublets in single-cell DNA sequencing data
Abstract Motivation While single-cell DNA sequencing (scDNA-seq) has enabled the study of intratumor heterogeneity at an unprecedented resolution, current technologies are error-prone and often result in doublets where two or more cells are mistaken for a single cell. Not only do doublets confound downstream analyses, but the increase in doublet rate is also a major bottleneck preventing higher throughput with current single-cell technologies. Although doublet detection and removal are standard practice in scRNA-seq data analysis, options for scDNA-seq data are limited. Current methods attempt to detect doublets while also performing complex downstream analyses tasks, leading to decreased efficiency and/or performance. Results We present doubletD, the first standalone method for detecting doublets in scDNA-seq data. Underlying our method is a simple maximum likelihood approach with a closed-form solution. We demonstrate the performance of doubletD on simulated data as well as real datasets, outperforming current methods for downstream analysis of scDNA-seq data that jointly infer doublets as well as standalone approaches for doublet detection in scRNA-seq data. Incorporating doubletD in scDNA-seq analysis pipelines will reduce complexity and lead to more accurate results. Availability and implementation https://github.com/elkebir-group/doubletD. Supplementary information Supplementary data are available at Bioinformatics online.
more »
« less
- PAR ID:
- 10289269
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 37
- Issue:
- Supplement_1
- ISSN:
- 1367-4803
- Page Range / eLocation ID:
- i214 to i221
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract SummaryWith the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-bases scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
ABSTRACT: Motivation Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. Results Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. Availability and implementation The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
Abstract MotivationSingle-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. ResultsHere we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA–seq computational methods based on specific research goals. Availability and implementationWe have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
Abstract In recent years, the integration of single‐cell multi‐omics data has provided a more comprehensive understanding of cell functions and internal regulatory mechanisms from a non‐single omics perspective, but it still suffers many challenges, such as omics‐variance, sparsity, cell heterogeneity, and confounding factors. As it is known, the cell cycle is regarded as a confounder when analyzing other factors in single‐cell RNA‐seq data, but it is not clear how it will work on the integrated single‐cell multi‐omics data. Here, a cell cycle‐aware network (CCAN) is developed to remove cell cycle effects from the integrated single‐cell multi‐omics data while keeping the cell type‐specific variations. This is the first computational model to study the cell‐cycle effects in the integration of single‐cell multi‐omics data. Validations on several benchmark datasets show the outstanding performance of CCAN in a variety of downstream analyses and applications, including removing cell cycle effects and batch effects of scRNA‐seq datasets from different protocols, integrating paired and unpaired scRNA‐seq and scATAC‐seq data, accurately transferring cell type labels from scRNA‐seq to scATAC‐seq data, and characterizing the differentiation process from hematopoietic stem cells to different lineages in the integration of differentiation data.more » « less
An official website of the United States government

