Abstract BackgroundCurrent methods for analyzing single-cell datasets have relied primarily on static gene expression measurements to characterize the molecular state of individual cells. However, capturing temporal changes in cell state is crucial for the interpretation of dynamic phenotypes such as the cell cycle, development, or disease progression. RNA velocity infers the direction and speed of transcriptional changes in individual cells, yet it is unclear how these temporal gene expression modalities may be leveraged for predictive modeling of cellular dynamics. ResultsHere, we present the first task-oriented benchmarking study that investigates integration of temporal sequencing modalities for dynamic cell state prediction. We benchmark ten integration approaches on ten datasets spanning different biological contexts, sequencing technologies, and species. We find that integrated data more accurately infers biological trajectories and achieves increased performance on classifying cells according to perturbation and disease states. Furthermore, we show that simple concatenation of spliced and unspliced molecules performs consistently well on classification tasks and can be used over more memory intensive and computationally expensive methods. ConclusionsThis work illustrates how integrated temporal gene expression modalities may be leveraged for predicting cellular trajectories and sample-associated perturbation and disease phenotypes. Additionally, this study provides users with practical recommendations for task-specific integration of single-cell gene expression modalities.
more »
« less
SCEMENT: scalable and memory efficient integration of large-scale single-cell RNA-sequencing data
Abstract MotivationIntegrative analysis of large-scale single-cell data collected from diverse cell populations promises an improved understanding of complex biological systems. While several algorithms have been developed for single-cell RNA-sequencing data integration, many lack the scalability to handle large numbers of datasets and/or millions of cells due to their memory and run time requirements. The few tools that can handle large data do so by reducing the computational burden through strategies such as subsampling of the data or selecting a reference dataset to improve computational efficiency and scalability. Such shortcuts, however, hamper the accuracy of downstream analyses, especially those requiring quantitative gene expression information. ResultsWe present SCEMENT, a SCalablE and Memory-Efficient iNTegration method, to overcome these limitations. Our new parallel algorithm builds upon and extends the linear regression model previously applied in ComBat to an unsupervised sparse matrix setting to enable accurate integration of diverse and large collections of single-cell RNA-sequencing data. Using tens to hundreds of real single-cell RNA-seq datasets, we show that SCEMENT outperforms ComBat as well as FastIntegration and Scanorama in runtime (upto 214× faster) and memory usage (upto 17.5× less). It not only performs batch correction and integration of millions of cells in under 25 min, but also facilitates the discovery of new rare cell types and more robust reconstruction of gene regulatory networks with full quantitative gene expression information. Availability and implementationSource code freely available for download at https://github.com/AluruLab/scement, implemented in C++ and supported on Linux.
more »
« less
- Award ID(s):
- 2233887
- PAR ID:
- 10575189
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 41
- Issue:
- 2
- ISSN:
- 1367-4811
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract BackgroundLow back pain is a leading cause of disability worldwide and is frequently attributed to intervertebral disc (IVD) degeneration. Though the contributions of the adjacent cartilage endplates (CEP) to IVD degeneration are well documented, the phenotype and functions of the resident CEP cells are critically understudied. To better characterize CEP cell phenotype and possible mechanisms of CEP degeneration, bulk and single-cell RNA sequencing of non-degenerated and degenerated CEP cells were performed. MethodsHuman lumbar CEP cells from degenerated (Thompson grade ≥ 4) and non-degenerated (Thompson grade ≤ 2) discs were expanded for bulk (N=4 non-degenerated,N=4 degenerated) and single-cell (N=1 non-degenerated,N=1 degenerated) RNA sequencing. Genes identified from bulk RNA sequencing were categorized by function and their expression in non-degenerated and degenerated CEP cells were compared. A PubMed literature review was also performed to determine which genes were previously identified and studied in the CEP, IVD, and other cartilaginous tissues. For single-cell RNA sequencing, different cell clusters were resolved using unsupervised clustering and functional annotation. Differential gene expression analysis and Gene Ontology, respectively, were used to compare gene expression and functional enrichment between cell clusters, as well as between non-degenerated and degenerated CEP samples. ResultsBulk RNA sequencing revealed 38 genes were significantly upregulated and 15 genes were significantly downregulated in degenerated CEP cells relative to non-degenerated cells (|fold change| ≥ 1.5). Of these, only 2 genes were previously studied in CEP cells, and 31 were previously studied in the IVD and other cartilaginous tissues. Single-cell RNA sequencing revealed 11 unique cell clusters, including multiple chondrocyte and progenitor subpopulations with distinct gene expression and functional profiles. Analysis of genes in the bulk RNA sequencing dataset showed that progenitor cell clusters from both samples were enriched in “non-degenerated” genes but not “degenerated” genes. For both bulk- and single-cell analyses, gene expression and pathway enrichment analyses highlighted several pathways that may regulate CEP degeneration, including transcriptional regulation, translational regulation, intracellular transport, and mitochondrial dysfunction. ConclusionsThis thorough analysis using RNA sequencing methods highlighted numerous differences between non-degenerated and degenerated CEP cells, the phenotypic heterogeneity of CEP cells, and several pathways of interest that may be relevant in CEP degeneration.more » « less
-
Birol, Inanc (Ed.)Abstract MotivationSingle-cell RNA sequencing (scRNA-seq) is widely used for analyzing gene expression in multi-cellular systems and provides unprecedented access to cellular heterogeneity. scRNA-seq experiments aim to identify and quantify all cell types present in a sample. Measured single-cell transcriptomes are grouped by similarity and the resulting clusters are mapped to cell types based on cluster-specific gene expression patterns. While the process of generating clusters has become largely automated, annotation remains a laborious ad hoc effort that requires expert biological knowledge. ResultsHere, we introduce CellMeSH—a new automated approach to identifying cell types for clusters based on prior literature. CellMeSH combines a database of gene–cell-type associations with a probabilistic method for database querying. The database is constructed by automatically linking gene and cell-type information from millions of publications using existing indexed literature resources. Compared to manually constructed databases, CellMeSH is more comprehensive and is easily updated with new data. The probabilistic query method enables reliable information retrieval even though the gene–cell-type associations extracted from the literature are noisy. CellMeSH is also able to optionally utilize prior knowledge about tissues or cells for further annotation improvement. CellMeSH achieves top-one and top-three accuracies on a number of mouse and human datasets that are consistently better than existing approaches. Availability and implementationWeb server at https://uncurl.cs.washington.edu/db_query and API at https://github.com/shunfumao/cellmesh. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
Abstract BackgroundAnalyzing single-cell RNA sequencing (scRNAseq) data plays an important role in understanding the intrinsic and extrinsic cellular processes in biological and biomedical research. One significant effort in this area is the identification of cell types. With the availability of a huge amount of single cell sequencing data and discovering more and more cell types, classifying cells into known cell types has become a priority nowadays. Several methods have been introduced to classify cells utilizing gene expression data. However, incorporating biological gene interaction networks has been proved valuable in cell classification procedures. ResultsIn this study, we propose a multimodal end-to-end deep learning model, named sigGCN, for cell classification that combines a graph convolutional network (GCN) and a neural network to exploit gene interaction networks. We used standard classification metrics to evaluate the performance of the proposed method on the within-dataset classification and the cross-dataset classification. We compared the performance of the proposed method with those of the existing cell classification tools and traditional machine learning classification methods. ConclusionsResults indicate that the proposed method outperforms other commonly used methods in terms of classification accuracy and F1 scores. This study shows that the integration of prior knowledge about gene interactions with gene expressions using GCN methodologies can extract effective features improving the performance of cell classification.more » « less
-
Abstract SummaryWith the advancements of high-throughput single-cell RNA-sequencing protocols, there has been a rapid increase in the tools available to perform an array of analyses on the gene expression data that results from such studies. For example, there exist methods for pseudo-time series analysis, differential cell usage, cell-type detection RNA-velocity in single cells, etc. Most analysis pipelines validate their results using known marker genes (which are not widely available for all types of analysis) and by using simulated data from gene-count-level simulators. Typically, the impact of using different read-alignment or unique molecular identifier (UMI) deduplication methods has not been widely explored. Assessments based on simulation tend to start at the level of assuming a simulated count matrix, ignoring the effect that different approaches for resolving UMI counts from the raw read data may produce. Here, we present minnow, a comprehensive sequence-level droplet-based single-cell RNA-sequencing (dscRNA-seq) experiment simulation framework. Minnow accounts for important sequence-level characteristics of experimental scRNA-seq datasets and models effects such as polymerase chain reaction amplification, cellular barcodes (CB) and UMI selection and sequence fragmentation and sequencing. It also closely matches the gene-level ambiguity characteristics that are observed in real scRNA-seq experiments. Using minnow, we explore the performance of some common processing pipelines to produce gene-by-cell count matrices from droplet-bases scRNA-seq data, demonstrate the effect that realistic levels of gene-level sequence ambiguity can have on accurate quantification and show a typical use-case of minnow in assessing the output generated by different quantification pipelines on the simulated experiment. Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
An official website of the United States government
