Abstract: Recent advances in biochemistry and single-cell RNA sequencing (scRNA-seq) have allowed us to monitor biological systems at single-cell resolution. However, the low capture of mRNA material within individual cells often leads to inaccurate quantification of genetic material. Consequently, a significant number of expression values are reported as missing; these are often referred to as dropouts. To overcome this challenge, we develop a novel imputation method, named single-cell Imputation via Subspace Regression (scISR), that can reliably recover the dropout values of scRNA-seq data. The scISR method first uses a hypothesis-testing technique to identify zero-valued entries that are most likely affected by dropout events and then estimates the dropout values using a subspace regression model. Our comprehensive evaluation against five state-of-the-art methods, using 25 publicly available scRNA-seq datasets and various simulation scenarios, demonstrates that scISR outperforms these methods in recovering scRNA-seq expression profiles. scISR consistently improves the quality of cluster analysis regardless of dropout rates, normalization techniques, and quantification schemes. The source code of scISR can be found on GitHub at https://github.com/duct317/scISR.
Evaluating Model-free Directional Dependency Methods on Single-cell RNA Sequencing Data with Severe Dropout
As severe dropout in single-cell RNA sequencing (scRNA-seq) degrades data quality, current methods for network inference face increased uncertainty from such data. To examine how dropout influences directional dependency inference from scRNA-seq data, we studied four methods that operate on discrete data and are model-free, that is, they make no parametric model assumptions. They include two established methods (conditional entropy and the Kruskal-Wallis test) and two recent methods (causal inference by stochastic complexity and function index). We also included three non-directional methods for contrast. On simulated data, function index performed most favorably at varying dropout rates, sample sizes, and discrete levels. On an scRNA-seq dataset from developing mouse cerebella, function index and the Kruskal-Wallis test performed favorably over other methods in detecting expression of developmental genes as a function of time. Overall, among the four methods, function index is most resistant to dropout for both directional and dependency inference. The next best choice, the Kruskal-Wallis test, carries a directional bias towards a uniformly distributed variable. We conclude that a method robust to marginal distributions, given a sufficiently large sample size, can reap the benefits of single-cell over bulk RNA sequencing in understanding molecular mechanisms at cellular resolution.
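To make the idea of directional inference on discrete data concrete, here is a minimal sketch of one of the four methods named above, conditional entropy. It is not the paper's implementation: the function names `conditional_entropy` and `add_dropout`, the toy many-to-one relation `y = x // 2`, and the dropout model (each value independently set to zero) are all assumptions made for illustration.

```python
import numpy as np

def conditional_entropy(x, y):
    """Return H(Y | X) in bits for two discrete 1-D arrays of equal length."""
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    # Map observed levels to 0..k-1 so they can index a contingency table.
    x_levels, xi = np.unique(x, return_inverse=True)
    y_levels, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((x_levels.size, y_levels.size))
    np.add.at(joint, (xi, yi), 1)          # joint counts
    pxy = joint / x.size                   # joint probabilities p(x, y)
    px = pxy.sum(axis=1, keepdims=True)    # marginal p(x), always > 0 here
    mask = pxy > 0                         # skip empty cells (0 * log 0 = 0)
    return float(-np.sum(pxy[mask] * np.log2((pxy / px)[mask])))

def add_dropout(values, rate, rng):
    """Hypothetical dropout model: zero each entry independently with probability `rate`."""
    values = np.asarray(values).copy()
    values[rng.random(values.size) < rate] = 0
    return values

# Toy data: x has four equally frequent levels and y is a function of x.
x = np.tile([0, 1, 2, 3], 250)
y = x // 2

h_y_given_x = conditional_entropy(x, y)    # y is fully determined by x
h_x_given_y = conditional_entropy(y, x)    # x is not determined by y

# The same comparison after simulated dropout on both variables.
rng = np.random.default_rng(0)
x_drop = add_dropout(x, 0.3, rng)
y_drop = add_dropout(y, 0.3, rng)
h_y_given_x_drop = conditional_entropy(x_drop, y_drop)
h_x_given_y_drop = conditional_entropy(y_drop, x_drop)
```

A lower H(Y | X) than H(X | Y) is read as evidence for the direction X to Y. Because H(Y | X) − H(X | Y) = H(Y) − H(X), this comparison depends only on the marginal entropies, which is one way to see why sensitivity to marginal distributions matters when dropout distorts them.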
- Journal Name: Proceedings of International Conference on Bioinformatics Research and Applications (ICBRA)
- Page Range or eLocation-ID: 55 to 62
- Sponsoring Org: National Science Foundation
More Like this
SIMPLEs: a single-cell RNA sequencing imputation strategy preserving gene modules and cell clusters variation
Abstract: A main challenge in analyzing single-cell RNA sequencing (scRNA-seq) data is to reduce technical variation yet retain cell heterogeneity. Due to the low mRNA content per cell and molecule losses during the experiment (called ‘dropout’), the gene expression matrix has a substantial number of zero read counts. Existing imputation methods treat either each cell or each gene as independently and identically distributed, which oversimplifies the gene correlation and cell type structure. We propose a statistical model-based approach, called SIMPLEs (SIngle-cell RNA-seq iMPutation and celL clustErings), which iteratively identifies correlated gene modules and cell clusters and imputes dropouts customized for each individual gene module and cell type. Simultaneously, it quantifies the uncertainty of imputation and cell clustering via multiple imputations. In simulations, SIMPLEs performed significantly better than prevailing scRNA-seq imputation methods according to various metrics. By applying SIMPLEs to several real datasets, we discovered gene modules that can further classify subtypes of cells. Our imputations successfully recovered the expression trends of marker genes in stem cell differentiation and can discover putative pathways regulating biological processes.
netNMF-sc: Leveraging gene-gene interactions for imputation and dimensionality reduction in single-cell expression analysis
Single-cell RNA-sequencing (scRNA-seq) enables high throughput measurement of RNA expression in individual cells. Due to technical limitations, scRNA-seq data often contain zero counts for many transcripts in individual cells. These zero counts, or dropout events, complicate the analysis of scRNA-seq data using standard analysis methods developed for bulk RNA-seq data. Current scRNA-seq analysis methods typically overcome dropout by combining information across cells, leveraging the observation that cells generally occupy a small number of RNA expression states. We introduce netNMF-sc, an algorithm for scRNA-seq analysis that leverages information across both cells and genes. netNMF-sc combines network-regularized non-negative matrix factorization with a procedure for handling zero inflation in transcript count matrices. The matrix factorization results in a low-dimensional representation of the transcript count matrix, which imputes gene abundance for both zero and non-zero entries and can be used to cluster cells. The network regularization leverages prior knowledge of gene-gene interactions, encouraging pairs of genes with known interactions to be close in the low-dimensional representation. We show that netNMF-sc outperforms existing methods on simulated and real scRNA-seq data, with increasing advantage at higher dropout rates (e.g. above 60%). Furthermore, we show that the results from netNMF-sc -- including estimation of gene-gene covariance -- …
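The general idea of network-regularized non-negative matrix factorization can be sketched as follows. This is not the netNMF-sc algorithm: it assumes a squared-error loss with a graph Laplacian penalty on the gene factors and uses standard multiplicative updates for graph-regularized NMF, and it omits the zero-inflation handling mentioned in the abstract. The names `graph_regularized_nmf` and `objective`, the parameter `lam`, and the toy data are all hypothetical.

```python
import numpy as np

def graph_regularized_nmf(X, A, k, lam=1.0, n_iter=200, seed=0, eps=1e-10):
    """Factor X (genes x cells) as W @ H with non-negative W, H.

    Assumed objective: ||X - W H||_F^2 + lam * tr(W^T L W), where
    L = D - A is the Laplacian of the gene-gene adjacency matrix A.
    The penalty pulls rows of W for connected genes toward each other.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    W = rng.random((n_genes, k)) + eps
    H = rng.random((k, n_cells)) + eps
    D = np.diag(A.sum(axis=1))             # degree matrix of the gene network
    for _ in range(n_iter):
        # Multiplicative updates keep both factors non-negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T + lam * (A @ W)) / (W @ (H @ H.T) + lam * (D @ W) + eps)
    return W, H

def objective(X, A, W, H, lam):
    """Value of the assumed objective for given factors."""
    L = np.diag(A.sum(axis=1)) - A
    return float(np.linalg.norm(X - W @ H) ** 2 + lam * np.trace(W.T @ L @ W))

# Toy example: low-rank counts with simulated dropout and a random gene network.
rng = np.random.default_rng(1)
true_W = rng.random((30, 3)) * 5
true_H = rng.random((3, 40)) * 5
X = rng.poisson(true_W @ true_H).astype(float)
X[rng.random(X.shape) < 0.5] = 0.0         # zero out half the entries at random
A = (rng.random((30, 30)) < 0.1).astype(float)
A = np.triu(A, 1)
A = A + A.T                                # symmetric, zero diagonal

W, H = graph_regularized_nmf(X, A, k=3, lam=1.0)
X_imputed = W @ H                          # dense estimate for zero and non-zero entries
```

The product `W @ H` gives a value for every entry, including those observed as zero, and the columns of `H` give a low-dimensional representation of the cells that could be passed to a clustering step.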
Kendziorski, Christina (Ed.)
Abstract
Motivation: Allelic expression analysis aids in detection of cis-regulatory mechanisms of genetic variation which produce allelic imbalance (AI) in heterozygotes. Measuring AI in bulk data lacking time or spatial resolution has the limitation that cell-type-specific (CTS), spatial-, or time-dependent AI signals may be dampened or not detected.
Results: We introduce a statistical method airpart for identifying differential CTS AI from single-cell RNA-sequencing (scRNA-seq) data, or other spatially- or time-resolved datasets. airpart outputs discrete partitions of data, pointing to groups of genes and cells under common mechanisms of cis-genetic regulation. In order to account for low counts in single-cell data, our method uses a Generalized Fused Lasso with Binomial likelihood for partitioning groups of cells by AI signal, and a hierarchical Bayesian model for AI statistical inference. In simulation, airpart accurately detected partitions of cell types by their AI and had lower RMSE of allelic ratio estimates than existing methods. In real data, airpart identified DAI patterns across cell states and could be used to define trends of AI signal over spatial or time axes.
Availability: The airpart package is available as an R/Bioconductor package at https://bioconductor.org/packages/airpart.
Metabolic heterogeneity and metabolic interplay between cells are known to be significant contributors to disease treatment resistance. However, lacking a mature high-throughput single-cell metabolomics technology, we have yet to establish a systematic understanding of intra-tissue metabolic heterogeneity and cooperative mechanisms. To mitigate this knowledge gap, we developed a novel computational method, single-cell flux estimation analysis (scFEA), to infer the cell-wise fluxome from single-cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a systematically reconstructed human metabolic map as a factor graph, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network–based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multilayer neural networks to recapitulate the nonlinear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq data set with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this data set showed consistency between the predicted flux and the observed variation of metabolite abundance in the matched metabolomics data. We also applied scFEA on five publicly available scRNA-seq and spatial transcriptomics data sets and identified context- and cell group–specific metabolic variations.