Title: Enhancing droplet-based single-nucleus RNA-seq resolution using the semi-supervised machine learning classifier DIEM
Abstract Single-nucleus RNA sequencing (snRNA-seq) measures gene expression in individual nuclei instead of cells, allowing for unbiased cell type characterization in solid tissues. We observe that snRNA-seq is commonly subject to contamination by high amounts of ambient RNA, which, if overlooked, can bias downstream analyses, for example by producing spurious cell types. We present a novel approach to quantify contamination and filter droplets in snRNA-seq experiments, called Debris Identification using Expectation Maximization (DIEM). Our likelihood-based approach models the gene expression distributions of debris and cell types, which are estimated using EM. We evaluated DIEM using three snRNA-seq data sets: (1) human differentiating preadipocytes in vitro, (2) fresh mouse brain tissue, and (3) human frozen adipose tissue (AT) from six individuals. All three data sets showed evidence of extranuclear RNA contamination, and we observed that existing methods failed to account for contaminated droplets and produced spurious cell types. Compared with filtering using these state-of-the-art methods, DIEM better removed droplets containing high levels of extranuclear RNA and led to higher-quality clusters. Although DIEM was designed for snRNA-seq, our clustering strategy also successfully filtered single-cell RNA-seq data. To conclude, our novel method DIEM removes debris-contaminated droplets from single-cell-based data quickly and effectively, leading to cleaner downstream analysis. Our code is freely available at https://github.com/marcalva/diem.
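The abstract's core idea, fitting expression distributions for debris and nuclei with EM while treating some droplets as known debris, can be illustrated with a small sketch. This is not the DIEM implementation. It assumes a two-component multinomial mixture (one debris profile and a single nuclear profile rather than several cell types), and it assumes that droplets below a hypothetical total-count threshold `min_counts` are pinned to the debris component, which supplies the semi-supervised part. The function name `debris_posteriors`, its parameters, and the toy counts are all illustrative.

```python
import math


def debris_posteriors(counts, min_counts=100, n_iter=25, pseudocount=0.01):
    """Return P(debris) per droplet under a two-component multinomial mixture.

    counts: list of per-droplet gene-count lists (all the same length).
    Droplets with fewer than `min_counts` total counts are pinned to the
    debris component (an assumed labeling rule for this sketch).
    """
    n_genes = len(counts[0])
    pinned = [sum(row) < min_counts for row in counts]
    # Start with pinned droplets as debris and every other droplet as nuclear.
    post = [1.0 if is_pinned else 0.0 for is_pinned in pinned]

    for _ in range(n_iter):
        # M-step: weighted gene proportions for the debris and nuclear profiles.
        profiles = []
        for weights in (post, [1.0 - p for p in post]):
            totals = [
                pseudocount + sum(w * row[g] for w, row in zip(weights, counts))
                for g in range(n_genes)
            ]
            norm = sum(totals)
            profiles.append([t / norm for t in totals])
        # Mixing weight of the debris component, kept away from 0 and 1.
        pi_debris = min(max(sum(post) / len(post), 1e-6), 1.0 - 1e-6)

        # E-step: update posteriors of unpinned droplets only.
        for i, row in enumerate(counts):
            if pinned[i]:
                continue
            log_debris = math.log(pi_debris) + sum(
                c * math.log(q) for c, q in zip(row, profiles[0])
            )
            log_nuclear = math.log(1.0 - pi_debris) + sum(
                c * math.log(q) for c, q in zip(row, profiles[1])
            )
            diff = log_nuclear - log_debris
            # Logistic of the log-likelihood difference, guarded against overflow.
            post[i] = 0.0 if diff > 700 else 1.0 / (1.0 + math.exp(diff))
    return post


# Toy data: genes 0-1 dominate the ambient profile, genes 2-3 the nuclear one.
counts = [
    [10, 8, 1, 1],     # low-count droplet, pinned as debris
    [9, 9, 2, 0],      # low-count droplet, pinned as debris
    [12, 7, 0, 1],     # low-count droplet, pinned as debris
    [90, 80, 15, 15],  # high-count droplet with a debris-like profile
    [10, 12, 95, 83],  # nucleus-like droplet
    [8, 15, 88, 90],   # nucleus-like droplet
]
posteriors = debris_posteriors(counts)
```

In this toy example, the fourth droplet has as many counts as the nuclei but an ambient-like profile, so a plain count threshold would keep it while the mixture model assigns it a high debris posterior. A fuller treatment would use several cell-type components and a more careful initialization.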
Award ID(s):
1705197
PAR ID:
10283371
Journal Name:
Scientific Reports
Volume:
10
Issue:
1
ISSN:
2045-2322
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract We present Bisque, a tool for estimating cell type proportions in bulk expression. Bisque implements a regression-based approach that utilizes single-cell RNA-seq (scRNA-seq) or single-nucleus RNA-seq (snRNA-seq) data to generate a reference expression profile and learn gene-specific bulk expression transformations to robustly decompose RNA-seq data. These transformations significantly improve decomposition performance compared to existing methods when there is significant technical variation in the generation of the reference profile and observed bulk expression. Importantly, compared to existing methods, our approach is extremely efficient, making it suitable for the analysis of large genomic datasets that are becoming ubiquitous. When applied to subcutaneous adipose and dorsolateral prefrontal cortex expression datasets with both bulk RNA-seq and snRNA-seq data, Bisque replicates previously reported associations between cell type proportions and measured phenotypes across abundant and rare cell types. We further propose an additional mode of operation that merely requires a set of known marker genes. 
  2. Abstract To understand phenotypic variations and key factors which affect disease susceptibility of complex traits, it is important to decipher cell‐type tissue compositions. To study cellular compositions of bulk tissue samples, one can evaluate cellular abundances and cell‐type‐specific gene expression patterns from the tissue transcriptome profiles. We develop both fixed and mixed models to reconstruct cellular expression fractions for bulk‐profiled samples by using single‐cell (sc) RNA‐sequencing (RNA‐seq) reference data. In benchmark evaluations of estimating cellular expression fractions, the mixed‐effect models provide results similar to those of an elegant machine learning algorithm named cell‐type identification by estimating relative subsets of RNA transcripts (CIBERSORTx), which is a well‐known and reliable procedure to reconstruct cell‐type abundances and cell‐type‐specific gene expression profiles. In real data analysis, the mixed‐effect models outperform or perform similarly to CIBERSORTx. The mixed models perform better than the fixed models in both benchmark evaluations and data analysis. In simulation studies, we show that if heterogeneity exists in scRNA‐seq data, it is better to use mixed models with heterogeneous mean and variance–covariance. As a byproduct, the mixed models provide fractions of covariance between subject‐specific gene expression and cell types to measure their correlations. The proposed mixed models provide a complementary tool to dissect bulk tissues using scRNA‐seq data. 
  3. Abstract Single-cell RNA sequencing is a powerful technique that continues to expand across various biological applications. However, incomplete 3′-UTR annotations can impede single-cell analysis, resulting in genes that are partially or completely uncounted. Performing single-cell RNA sequencing with incomplete 3′-UTR annotations can hinder the identification of cell identities and gene expression patterns and lead to erroneous biological inferences. We demonstrate that performing single-cell isoform sequencing in tandem with single-cell RNA sequencing can rapidly improve 3′-UTR annotations. Using threespine stickleback fish (Gasterosteus aculeatus), we show that gene models resulting from a minimal embryonic single-cell isoform sequencing dataset retained 26.1% more single-cell RNA sequencing reads than gene models from Ensembl alone. Furthermore, pooling our single-cell sequencing isoforms with a previously published adult bulk Iso-Seq dataset from stickleback, and merging the annotation with the Ensembl gene models, resulted in a marginal improvement (+0.8%) over the single-cell isoform sequencing only dataset. In addition, isoforms identified by single-cell isoform sequencing included thousands of new splicing variants. The improved gene models obtained using single-cell isoform sequencing led to successful identification of cell types and increased the number of reads assigned to many genes in our single-cell RNA sequencing stickleback dataset. Our work illuminates single-cell isoform sequencing as a cost-effective and efficient mechanism to rapidly annotate genomes for single-cell RNA sequencing. 
  4. Gene expression is the fundamental level at which the results of various genetic and regulatory programs are observable. The measurement of transcriptome-wide gene expression has convincingly switched from microarrays to sequencing in a matter of years. RNA sequencing (RNA-seq) provides a quantitative and open system for profiling transcriptional outcomes on a large scale and therefore facilitates a large diversity of applications, including basic science studies as well as agricultural and clinical settings. In the past 10 years or so, much has been learned about the characteristics of RNA-seq data sets, as well as the performance of the myriad methods developed. In this review, we give an overview of the developments in RNA-seq data analysis, including experimental design, with an explicit focus on the quantification of gene expression and statistical approaches for differential expression. We also highlight emerging data types, such as single-cell RNA-seq and gene expression profiling using long-read technologies. 
  5. The metabolic heterogeneity and metabolic interplay between cells are known as significant contributors to disease treatment resistance. However, with the lack of a mature high-throughput single-cell metabolomics technology, we are yet to establish a systematic understanding of the intra-tissue metabolic heterogeneity and cooperative mechanisms. To mitigate this knowledge gap, we developed a novel computational method, namely, single-cell flux estimation analysis (scFEA), to infer the cell-wise fluxome from single-cell RNA-sequencing (scRNA-seq) data. scFEA is empowered by a systematically reconstructed human metabolic map as a factor graph, a novel probabilistic model to leverage the flux balance constraints on scRNA-seq data, and a novel graph neural network–based optimization solver. The intricate information cascade from transcriptome to metabolome was captured using multilayer neural networks to recapitulate the nonlinear dependency between enzymatic gene expressions and reaction rates. We experimentally validated scFEA by generating an scRNA-seq data set with matched metabolomics data on cells of perturbed oxygen and genetic conditions. Application of scFEA on this data set showed the consistency between predicted flux and the observed variation of metabolite abundance in the matched metabolomics data. We also applied scFEA on five publicly available scRNA-seq and spatial transcriptomics data sets and identified context- and cell group–specific metabolic variations. The cell-wise fluxome predicted by scFEA empowers a series of downstream analyses including identification of metabolic modules or cell groups that share common metabolic variations, sensitivity evaluation of enzymes with regard to their impact on the whole metabolic flux, and inference of cell–tissue and cell–cell metabolic communications. 