skip to main content


Title: A statistical simulator scDesign for rational scRNA-seq experimental design
Abstract Motivation

Single-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information.

Results

Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA–seq computational methods based on specific research goals.

Availability and implementation

We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign.

Supplementary information

Supplementary data are available at Bioinformatics online.

 
more » « less
Award ID(s):
1846216
NSF-PAR ID:
10425981
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
35
Issue:
14
ISSN:
1367-4803
Page Range / eLocation ID:
p. i41-i50
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Summary

    With the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-bases scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  2. Abstract Motivation

    Single-cell RNA-sequencing (scRNA-seq) has brought the study of the transcriptome to higher resolution and makes it possible for scientists to provide answers with more clarity to the question of ‘differential expression’. However, most computational methods still stick with the old mentality of viewing differential expression as a simple ‘up or down’ phenomenon. We advocate that we should fully embrace the features of single cell data, which allows us to observe binary (from Off to On) as well as continuous (the amount of expression) regulations.

    Results

    We develop a method, termed SC2P, that first identifies the phase of expression a gene is in, by taking into account of both cell- and gene-specific contexts, in a model-based and data-driven fashion. We then identify two forms of transcription regulation: phase transition, and magnitude tuning. We demonstrate that compared with existing methods, SC2P provides substantial improvement in sensitivity without sacrificing the control of false discovery, as well as better robustness. Furthermore, the analysis provides better interpretation of the nature of regulation types in different genes.

    Availability and implementation

    SC2P is implemented as an open source R package publicly available at https://github.com/haowulab/SC2P.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract

    The performance of computational methods and software to identify differentially expressed features in single‐cell RNA‐sequencing (scRNA‐seq) has been shown to be influenced by several factors, including the choice of the normalization method used and the choice of the experimental platform (or library preparation protocol) to profile gene expression in individual cells. Currently, it is up to the practitioner to choose the most appropriate differential expression (DE) method out of over 100 DE tools available to date, each relying on their own assumptions to model scRNA‐seq expression features. To model the technological variability in cross‐platform scRNA‐seq data, here we propose to use Tweedie generalized linear models that can flexibly capture a large dynamic range of observed scRNA‐seq expression profiles across experimental platforms induced by platform‐ and gene‐specific statistical properties such as heavy tails, sparsity, and gene expression distributions. We also propose a zero‐inflated Tweedie model that allows zero probability mass to exceed a traditional Tweedie distribution to model zero‐inflated scRNA‐seq data with excessive zero counts. Using both synthetic and published plate‐ and droplet‐based scRNA‐seq datasets, we perform a systematic benchmark evaluation of more than 10 representative DE methods and demonstrate that our method (Tweedieverse) outperforms the state‐of‐the‐art DE approaches across experimental platforms in terms of statistical power and false discovery rate control. Our open‐source software (R/Bioconductor package) is available athttps://github.com/himelmallick/Tweedieverse.

     
    more » « less
  4. Abstract Motivation

    The biclustering of large-scale gene expression data holds promising potential for detecting condition-specific functional gene modules (i.e. biclusters). However, existing methods do not adequately address a comprehensive detection of all significant bicluster structures and have limited power when applied to expression data generated by RNA-Sequencing (RNA-Seq), especially single-cell RNA-Seq (scRNA-Seq) data, where massive zero and low expression values are observed.

    Results

    We present a new biclustering algorithm, QUalitative BIClustering algorithm Version 2 (QUBIC2), which is empowered by: (i) a novel left-truncated mixture of Gaussian model for an accurate assessment of multimodality in zero-enriched expression data, (ii) a fast and efficient dropouts-saving expansion strategy for functional gene modules optimization using information divergency and (iii) a rigorous statistical test for the significance of all the identified biclusters in any organism, including those without substantial functional annotations. QUBIC2 demonstrated considerably improved performance in detecting biclusters compared to other five widely used algorithms on various benchmark datasets from E.coli, Human and simulated data. QUBIC2 also showcased robust and superior performance on gene expression data generated by microarray, bulk RNA-Seq and scRNA-Seq.

    Availability and implementation

    The source code of QUBIC2 is freely available at https://github.com/OSU-BMBL/QUBIC2.

    Contact

    czhang87@iu.edu or qin.ma@osumc.edu

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. Abstract Motivation

    Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge.

    Results

    We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells.

    Availability and implementation

    Source code is available at https://github.com/yjzhang/uncurl_python.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less