Abstract Background: Current methods for analyzing single-cell datasets have relied primarily on static gene expression measurements to characterize the molecular state of individual cells. However, capturing temporal changes in cell state is crucial for the interpretation of dynamic phenotypes such as the cell cycle, development, or disease progression. RNA velocity infers the direction and speed of transcriptional changes in individual cells, yet it is unclear how these temporal gene expression modalities may be leveraged for predictive modeling of cellular dynamics. Results: Here, we present the first task-oriented benchmarking study that investigates integration of temporal sequencing modalities for dynamic cell state prediction. We benchmark ten integration approaches on ten datasets spanning different biological contexts, sequencing technologies, and species. We find that integrated data more accurately infers biological trajectories and achieves increased performance on classifying cells according to perturbation and disease states. Furthermore, we show that simple concatenation of spliced and unspliced molecules performs consistently well on classification tasks and can be used over more memory intensive and computationally expensive methods. Conclusions: This work illustrates how integrated temporal gene expression modalities may be leveraged for predicting cellular trajectories and sample-associated perturbation and disease phenotypes. Additionally, this study provides users with practical recommendations for task-specific integration of single-cell gene expression modalities.
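The "simple concatenation" mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the spliced and unspliced counts are available as two cells-by-genes matrices with identical cell ordering; the function name `concatenate_modalities` and the toy Poisson counts are hypothetical and are not taken from the benchmarking study.

```python
import numpy as np


def concatenate_modalities(spliced, unspliced):
    """Join two cells-by-genes matrices along the gene (column) axis.

    Hypothetical helper: each cell keeps its row, and its feature vector
    becomes [spliced counts | unspliced counts].
    """
    spliced = np.asarray(spliced, dtype=float)
    unspliced = np.asarray(unspliced, dtype=float)
    if spliced.shape[0] != unspliced.shape[0]:
        raise ValueError("both matrices must describe the same cells")
    return np.concatenate([spliced, unspliced], axis=1)


# Toy data (assumption): 6 cells, 4 genes, Poisson-distributed counts.
rng = np.random.default_rng(0)
spliced = rng.poisson(5.0, size=(6, 4))
unspliced = rng.poisson(1.0, size=(6, 4))

features = concatenate_modalities(spliced, unspliced)
print(features.shape)  # (6, 8)
```

The resulting matrix could then be passed to any standard classifier; normalization and feature selection are omitted here.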
This content will become publicly available on July 1, 2026
Fluctuation structure predicts genome-wide perturbation outcomes
Pooled single-cell perturbation screens represent powerful experimental platforms for functional genomics, yet interpreting these rich datasets for meaningful biological conclusions remains challenging. Most current methods fall at one of two extremes: either opaque deep learning models that obscure biological meaning, or simplified frameworks that treat genes as isolated units. As such, these approaches overlook a crucial insight: gene co-fluctuations in unperturbed cellular states can be harnessed to model perturbation responses. Here we present CIPHER (Covariance Inference for Perturbation and High-dimensional Expression Response), a framework leveraging linear response theory from statistical physics to predict transcriptome-wide perturbation outcomes using gene co-fluctuations in unperturbed cells. We validated CIPHER on synthetic regulatory networks before applying it to 11 large-scale single-cell perturbation datasets covering 4,234 perturbations and over 1.36M cells. CIPHER robustly recapitulated genome-wide responses to single and double perturbations by exploiting baseline gene covariance structure. Importantly, eliminating gene-gene covariances, while retaining gene-intrinsic variances, reduced model performance by 11-fold, demonstrating the rich information stored within baseline fluctuation structures. Moreover, gene-gene correlations transferred successfully across independent experiments of the same cell type, revealing stereotypic fluctuation structures. Furthermore, CIPHER outperformed conventional differential expression metrics in identifying true perturbations while providing uncertainty-aware effect size estimates through Bayesian inference. Finally, most genome-wide responses propagated through the covariance matrix along approximately three independent and global gene modules. 
CIPHER underscores the importance of theoretically-grounded models in capturing complex biological responses, highlighting fundamental design principles encoded in cellular fluctuation patterns.
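The core idea of predicting a perturbation response from baseline co-fluctuations can be sketched in a few lines. This is not CIPHER's implementation; it is a minimal illustration of a generic linear-response estimate, assuming that a perturbation acts as a small field on one target gene, so that the predicted shift in gene i is cov(i, j) / var(j) times the shift imposed on target gene j. The function name `predict_response` and the toy three-gene data are hypothetical.

```python
import numpy as np


def predict_response(baseline, target, delta_target):
    """Linear-response estimate of the mean shift of every gene.

    baseline     : cells-by-genes expression matrix from unperturbed cells
    target       : column index of the perturbed gene
    delta_target : imposed change in the target gene's mean expression

    Assumption: the shift of gene i is cov(i, target) / var(target)
    times the shift of the target gene.
    """
    baseline = np.asarray(baseline, dtype=float)
    cov = np.cov(baseline, rowvar=False)  # genes-by-genes covariance
    return cov[:, target] / cov[target, target] * delta_target


# Toy unperturbed population (assumption): gene 1 co-fluctuates with
# gene 0, while gene 2 fluctuates independently.
rng = np.random.default_rng(0)
g0 = rng.normal(size=2000)
g1 = 0.8 * g0 + 0.1 * rng.normal(size=2000)
g2 = rng.normal(size=2000)
baseline = np.column_stack([g0, g1, g2])

# Predicted genome-wide shift for a knockdown that lowers gene 0 by 1 unit.
predicted = predict_response(baseline, target=0, delta_target=-1.0)
print(predicted)
```

In this toy example the prediction for gene 1 follows the target gene because the two co-fluctuate at baseline, while gene 2 is predicted to stay near zero. If the off-diagonal covariances were discarded, every non-target gene would be predicted not to respond at all, which gives some intuition for why the abstract reports a large performance drop when gene-gene covariances are removed.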
- PAR ID: 10612000
- Publisher / Repository: bioRxiv
- Date Published:
- Format(s): Medium: X
- Institution: bioRxiv
- Sponsoring Org: National Science Foundation
More Like this
- Abstract Motivation: Gene regulatory networks (GRNs) in a cell provide the tight feedback needed to synchronize cell actions. However, genes in a cell also take input from, and provide signals to, other neighboring cells. These cell–cell interactions (CCIs) and the GRNs deeply influence each other. Many computational methods have been developed for GRN inference in cells. More recently, methods were proposed to infer CCIs using single cell gene expression data with or without cell spatial location information. However, in reality, the two processes do not exist in isolation and are subject to spatial constraints. Despite this rationale, no methods currently exist to infer GRNs and CCIs using the same model. Results: We propose CLARIFY, a tool that takes GRNs as input, uses them and spatially resolved gene expression data to infer CCIs, while simultaneously outputting refined cell-specific GRNs. CLARIFY uses a novel multi-level graph autoencoder, which mimics cellular networks at a higher level and cell-specific GRNs at a deeper level. We applied CLARIFY to two real spatial transcriptomic datasets, one using seqFISH and the other using MERFISH, and also tested on simulated datasets from scMultiSim. We compared the quality of predicted GRNs and CCIs with state-of-the-art baseline methods that inferred either only GRNs or only CCIs. The results show that CLARIFY consistently outperforms the baseline in terms of commonly used evaluation metrics. Our results point to the importance of co-inference of CCIs and GRNs and to the use of layered graph neural networks as an inference tool for biological networks. Availability and implementation: The source code and data are available at https://github.com/MihirBafna/CLARIFY.
- Summary: CRISPR genome engineering and single-cell RNA sequencing have accelerated biological discovery. Single-cell CRISPR screens unite these two technologies, linking genetic perturbations in individual cells to changes in gene expression and illuminating regulatory networks underlying diseases. Despite their promise, single-cell CRISPR screens present considerable statistical challenges. We demonstrate through theoretical and real data analyses that a standard method for estimation and inference in single-cell CRISPR screens, “thresholded regression,” exhibits attenuation bias and a bias-variance tradeoff as a function of an intrinsic, challenging-to-select tuning parameter. To overcome these difficulties, we introduce GLM-EIV (“GLM-based errors-in-variables”), a new method for single-cell CRISPR screen analysis. GLM-EIV extends the classical errors-in-variables model to responses and noisy predictors that are exponential family-distributed and potentially impacted by the same set of confounding variables. We develop a computational infrastructure to deploy GLM-EIV across hundreds of processors on clouds (e.g. Microsoft Azure) and high-performance clusters. Leveraging this infrastructure, we apply GLM-EIV to analyze two recent, large-scale, single-cell CRISPR screen datasets, yielding several new insights.
- Genome-wide association studies (GWASs) have identified and replicated many genetic variants that are associated with diseases and disease-related complex traits. However, the biological mechanisms underlying these identified associations remain largely elusive. Exploring the biological mechanisms underlying these associations requires identifying trait-relevant tissues and cell types, as genetic variants likely influence complex traits in a tissue- and cell type-specific manner. Recently, several statistical methods have been developed to integrate genomic data with GWASs for identifying trait-relevant tissues and cell types. These methods often rely on different genomic information and use different statistical models for trait-tissue relevance inference. Here, we present a comprehensive technical review to summarize ten existing methods for trait-tissue relevance inference. These methods make use of different genomic information, including functional annotation information, expression quantitative trait loci information, genetically regulated gene expression information, and gene co-expression network information. These methods also use different statistical models that range from linear mixed models to covariance network models. We hope that this review can serve as a useful reference both for methodologists who develop methods and for applied analysts who apply these methods for identifying trait-relevant tissues and cell types.
- Mathelier, Anthony (Ed.) Abstract Motivation: Methods to model dynamic changes in gene expression at a genome-wide level are not currently sufficient for large (temporally rich or single-cell) datasets. Variational autoencoders offer means to characterize large datasets and have been used effectively to characterize features of single-cell datasets. Here, we extend these methods for use with gene expression time series data. Results: We present RVAgene: a recurrent variational autoencoder to model gene expression dynamics. RVAgene learns to accurately and efficiently reconstruct temporal gene profiles. It also learns a low dimensional representation of the data via a recurrent encoder network that can be used for biological feature discovery, and from which we can generate new gene expression data by sampling the latent space. We test RVAgene on simulated and real biological datasets, including embryonic stem cell differentiation and kidney injury response dynamics. In all cases, RVAgene accurately reconstructed complex gene expression temporal profiles. Via cross validation, we show that a low-error latent space representation can be learnt using only a fraction of the data. Through clustering and gene ontology term enrichment analysis on the latent space, we demonstrate the potential of RVAgene for unsupervised discovery. In particular, RVAgene identifies new programs of shared gene regulation of Lox family genes in response to kidney injury. Availability and implementation: All datasets analyzed in this manuscript are publicly available and have been published previously. RVAgene is available in Python, at GitHub: https://github.com/maclean-lab/RVAgene; Zenodo archive: http://doi.org/10.5281/zenodo.4271097. Supplementary information: Supplementary data are available at Bioinformatics online.
