Title: CellMeSH: probabilistic cell-type identification using indexed literature
Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. Results: Here, we introduce CellMeSH, a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH can also optionally use prior knowledge about tissues or cells to further improve annotation. On a number of mouse and human datasets, CellMeSH achieves top-one and top-three accuracies that are consistently better than those of existing approaches. Availability and implementation: Web server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. Supplementary information: Supplementary data are available at Bioinformatics online.
Award ID(s):
1651236
PAR ID:
10492428
Author(s) / Creator(s):
; ; ;
Editor(s):
Birol, Inanc
Publisher / Repository:
Bioinformatics
Date Published:
Journal Name:
Bioinformatics
Volume:
38
Issue:
5
ISSN:
1367-4803
Page Range / eLocation ID:
1393 to 1402
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Motivation: Single cell RNA-seq (scRNA-seq) data contains a wealth of information which has to be inferred computationally from the observed sequencing reads. As the ability to sequence more cells improves rapidly, existing computational tools suffer from three problems. (i) The decreased reads-per-cell implies a highly sparse sample of the true cellular transcriptome. (ii) Many tools simply cannot handle the size of the resulting datasets. (iii) Prior biological knowledge such as bulk RNA-seq information of certain cell types or qualitative marker information is not taken into account. Here we present UNCURL, a preprocessing framework based on non-negative matrix factorization for scRNA-seq data, that is able to handle varying sampling distributions, scales to very large cell numbers and can incorporate prior knowledge. Results: We find that preprocessing using UNCURL consistently improves performance of commonly used scRNA-seq tools for clustering, visualization and lineage estimation, both in the absence and presence of prior knowledge. Finally we demonstrate that UNCURL is extremely scalable and parallelizable, and runs faster than other methods on a scRNA-seq dataset containing 1.3 million cells. Availability and implementation: Source code is available at https://github.com/yjzhang/uncurl_python. Supplementary information: Supplementary data are available at Bioinformatics online.
  2. Abstract. Background: Single-cell RNA-sequencing (scRNA-seq) technologies allow for the study of gene expression in individual cells. Often, it is of interest to understand how transcriptional activity is associated with cell-specific covariates, such as cell type, genotype, or measures of cell health. Traditional approaches for this type of association mapping assume independence between the outcome variables (or genes), and perform a separate regression for each. However, these methods are computationally costly and ignore the substantial correlation structure of gene expression. Furthermore, count-based scRNA-seq data pose challenges for traditional models based on Gaussian assumptions. Results: We aim to resolve these issues by developing a reduced-rank regression model that identifies low-dimensional linear associations between a large number of cell-specific covariates and high-dimensional gene expression readouts. Our probabilistic model uses a Poisson likelihood in order to account for the unique structure of scRNA-seq counts. We demonstrate the performance of our model using simulations, and we apply our model to a scRNA-seq dataset, a spatial gene expression dataset, and a bulk RNA-seq dataset to show its behavior in three distinct analyses. Conclusion: We show that our statistical modeling approach, which is based on reduced-rank regression, captures associations between gene expression and cell- and sample-specific covariates by leveraging low-dimensional representations of transcriptional states.
  3. Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) has revolutionized biological sciences by revealing genome-wide gene expression levels within individual cells. However, a critical challenge faced by researchers is how to optimize the choices of sequencing platforms, sequencing depths and cell numbers in designing scRNA-seq experiments, so as to balance the exploration of the depth and breadth of transcriptome information. Results: Here we present a flexible and robust simulator, scDesign, the first statistical framework for researchers to quantitatively assess practical scRNA-seq experimental design in the context of differential gene expression analysis. In addition to experimental design, scDesign also assists computational method development by generating high-quality synthetic scRNA-seq datasets under customized experimental settings. In an evaluation based on 17 cell types and 6 different protocols, scDesign outperformed four state-of-the-art scRNA-seq simulation methods and led to rational experimental design. In addition, scDesign demonstrates reproducibility across biological replicates and independent studies. We also discuss the performance of multiple differential expression and dimension reduction methods based on the protocol-dependent scRNA-seq data generated by scDesign. scDesign is expected to be an effective bioinformatic tool that assists rational scRNA-seq experimental design and comparison of scRNA-seq computational methods based on specific research goals. Availability and implementation: We have implemented our method in the R package scDesign, which is freely available at https://github.com/Vivianstats/scDesign. Supplementary information: Supplementary data are available at Bioinformatics online.
  4. Abstract. Motivation: Single-cell RNA sequencing (scRNA-seq) captures whole transcriptome information of individual cells. While scRNA-seq measures thousands of genes, researchers are often interested in only dozens to hundreds of genes for a closer study. Then, a question is how to select those informative genes from scRNA-seq data. Moreover, single-cell targeted gene profiling technologies are gaining popularity for their low costs, high sensitivity and extra (e.g. spatial) information; however, they typically can only measure up to a few hundred genes. Then another challenging question is how to select genes for targeted gene profiling based on existing scRNA-seq data. Results: Here, we develop the single-cell Projective Non-negative Matrix Factorization (scPNMF) method to select informative genes from scRNA-seq data in an unsupervised way. Compared with existing gene selection methods, scPNMF has two advantages. First, its selected informative genes can better distinguish cell types. Second, it enables the alignment of new targeted gene profiling data with reference data in a low-dimensional space to facilitate the prediction of cell types in the new data. Technically, scPNMF modifies the PNMF algorithm for gene selection by changing the initialization and adding a basis selection step, which selects informative bases to distinguish cell types. We demonstrate that scPNMF outperforms the state-of-the-art gene selection methods on diverse scRNA-seq datasets. Moreover, we show that scPNMF can guide the design of targeted gene profiling experiments and the cell-type annotation on targeted gene profiling data. Availability and implementation: The R package is open-access and available at https://github.com/JSB-UCLA/scPNMF. The data used in this work are available at Zenodo: https://doi.org/10.5281/zenodo.4797997. Supplementary information: Supplementary data are available at Bioinformatics online.
  5. When analyzing scRNA-seq data with clustering algorithms, annotating the clusters with cell types is an essential step toward biological interpretation of the data. Annotations can be performed manually using known cell type marker genes. Annotations can also be automated using knowledge-driven or data-driven machine learning algorithms. The majority of cell type annotation algorithms are designed to predict cell types for individual cells in a new dataset. Since biological interpretation of scRNA-seq data is often made on cell clusters rather than individual cells, several algorithms have been developed to annotate cell clusters. In this study, we compared five cell type annotation algorithms, Azimuth, SingleR, Garnett, scCATCH, and SCSA, which cover the spectrum of knowledge-driven and data-driven approaches to annotate either individual cells or cell clusters. We applied these five algorithms to two scRNA-seq datasets of peripheral blood mononuclear cell (PBMC) samples from COVID-19 patients and healthy controls, and evaluated their annotation performance. From this comparison, we observed that methods for annotating individual cells outperformed methods for annotating cell clusters. We applied the cell-based annotation algorithm Azimuth to the two scRNA-seq datasets to examine the immune response during COVID-19 infection. Both datasets showed significant depletion of plasmacytoid dendritic cells (pDCs); differential expression and pathway analysis in this cell type revealed strong activation of the type I interferon signaling pathway in response to the infection.