Trajectory inference methods are essential for analyzing the developmental paths of cells in single-cell sequencing datasets. They provide insights into cellular differentiation, transitions, and lineage hierarchies, helping unravel the dynamic processes underlying development and disease progression. However, many existing tools lack a coherent statistical model and reliable uncertainty quantification, limiting their utility and robustness. In this paper, we introduce VITAE (Variational Inference for Trajectory by AutoEncoder), a statistical approach that integrates a latent hierarchical mixture model with variational autoencoders to infer trajectories. The statistical hierarchical model enhances the interpretability of our framework, while the posterior approximations generated by our variational autoencoder ensure computational efficiency and provide uncertainty quantification of cell projections along trajectories. Specifically, VITAE enables simultaneous trajectory inference and data integration, improving the accuracy of learning a joint trajectory structure in the presence of biological and technical heterogeneity across datasets. We show that VITAE outperforms other state-of-the-art trajectory inference methods on both real and synthetic data under various trajectory topologies. Furthermore, we apply VITAE to jointly analyze three distinct single-cell RNA sequencing datasets of the mouse neocortex, unveiling comprehensive developmental lineages of projection neurons. VITAE effectively reduces batch effects within and across datasets and uncovers finer structures that might be overlooked in individual datasets. Additionally, we showcase VITAE’s efficacy in integrative analyses of multiomic datasets with continuous cell population structures.
Analysis of Variability of Functionals of Recombinant Protein Production Trajectories Based on Limited Data
Making statistical inference on quantities defining various characteristics of a temporally measured biochemical process and analyzing its variability across different experimental conditions is a core challenge in various branches of science. This problem is particularly difficult when the amount of data that can be collected is limited in terms of both the number of replicates and the number of time points per process trajectory. We propose a method for analyzing the variability of smooth functionals of the growth or production trajectories associated with such processes across different experimental conditions. Our modeling approach is based on a spline representation of the mean trajectories. We also develop a bootstrap-based inference procedure for the parameters while accounting for possible multiple comparisons. This methodology is applied to study two types of quantities—the “time to harvest” and “maximal productivity”—in the context of an experiment on the production of recombinant proteins. We complement the findings with extensive numerical experiments comparing the effectiveness of different types of bootstrap procedures for various tests of hypotheses. These numerical experiments convincingly demonstrate that the proposed method yields reliable inference on complex characteristics of the processes even in a data-limited environment where more traditional methods for statistical inference are typically not reliable.
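As a rough illustration of the idea in this abstract, the sketch below bootstraps a functional of the mean trajectory from a handful of replicates. It is not the authors' procedure: it uses a piecewise-linear mean trajectory instead of a smooth spline fit, a plain percentile interval without any multiple-comparison adjustment, and the largest slope between consecutive time points as a stand-in for "maximal productivity". The data values and all function names are hypothetical.

```python
import random


def mean_trajectory(replicates):
    """Pointwise mean across replicates measured at the same time points."""
    n = len(replicates)
    return [sum(rep[j] for rep in replicates) / n
            for j in range(len(replicates[0]))]


def max_productivity(times, values):
    """Largest slope between consecutive time points (stand-in functional)."""
    return max((values[j + 1] - values[j]) / (times[j + 1] - times[j])
               for j in range(len(times) - 1))


def bootstrap_interval(times, replicates, functional,
                       n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval, resampling whole replicates."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        # Resample replicates (entire trajectories) with replacement.
        sample = [rng.choice(replicates) for _ in replicates]
        stats.append(functional(times, mean_trajectory(sample)))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper


# Hypothetical data: three replicates observed at five time points (hours).
times = [0.0, 24.0, 48.0, 72.0, 96.0]
replicates = [
    [0.0, 1.1, 3.0, 4.2, 4.6],
    [0.0, 0.9, 2.6, 4.0, 4.9],
    [0.0, 1.3, 3.3, 4.5, 4.7],
]

point = max_productivity(times, mean_trajectory(replicates))
lower, upper = bootstrap_interval(times, replicates, max_productivity)
print(f"max productivity: {point:.4f}, 95% interval: [{lower:.4f}, {upper:.4f}]")
```

With only three replicates the interval is coarse, since there are few distinct resamples. That small-sample regime is the one the abstract targets with its comparison of bootstrap variants.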
- PAR ID: 10349722
- Date Published:
- Journal Name: International Journal of Molecular Sciences
- Volume: 23
- Issue: 14
- ISSN: 1422-0067
- Page Range / eLocation ID: 7628
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Linear mixed models are widely used for analyzing longitudinal datasets, and the inference for variance component parameters relies on the bootstrap method. However, health systems and technology companies routinely generate massive longitudinal datasets that make the traditional bootstrap method infeasible. To solve this problem, we extend the highly scalable bag of little bootstraps method for independent data to longitudinal data and develop a highly efficient Julia package, MixedModelsBLB.jl. Simulation experiments and real data analysis demonstrate the favorable statistical performance and computational advantages of our method compared to the traditional bootstrap method. For the statistical inference of variance components, it achieves a 200-fold speedup on the scale of 1 million subjects (20 million total observations), and is the only currently available tool that can handle more than 10 million subjects (200 million total observations) using desktop computers.
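For readers unfamiliar with the bag of little bootstraps, the following is a minimal sketch of the basic method for independent data, applied to the standard error of a sample mean. It does not reproduce the longitudinal extension or the API of MixedModelsBLB.jl; the function names, the subset-size exponent `gamma`, and the simulated data are assumptions chosen for illustration.

```python
import random
import statistics
from collections import Counter


def weighted_mean(values, weights):
    """Mean of `values` where each value is counted `weights[i]` times."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def blb_standard_error(data, estimator, n_subsets=5, gamma=0.6,
                       n_resamples=50, seed=0):
    """Bag of little bootstraps estimate of an estimator's standard error."""
    rng = random.Random(seed)
    n = len(data)
    b = int(n ** gamma)  # size of each small subset
    per_subset_se = []
    for _ in range(n_subsets):
        # Draw a small subset of b distinct observations.
        subset = rng.sample(data, b)
        estimates = []
        for _ in range(n_resamples):
            # Resample n points from the subset, stored as multinomial
            # counts so only b values are ever touched by the estimator.
            counts = Counter(rng.choices(range(b), k=n))
            weights = [counts.get(i, 0) for i in range(b)]
            estimates.append(estimator(subset, weights))
        per_subset_se.append(statistics.stdev(estimates))
    # Average the per-subset quality assessments.
    return statistics.mean(per_subset_se)


# Hypothetical data: 2000 independent draws with standard deviation 2.
data_rng = random.Random(1)
data = [data_rng.gauss(10.0, 2.0) for _ in range(2000)]

se = blb_standard_error(data, weighted_mean)
print(f"BLB standard error of the mean: {se:.4f}")
```

The estimator only ever sees `b` distinct values plus integer weights, which is what makes the method cheap when the full-data estimator is expensive. For this simulated sample the result should land near the textbook value of 2 / sqrt(2000).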
-
Variability in gene expression causes genetically identical cells to exhibit different phenotypes. One probable cause of this variability is transcriptional bursting, where the synthesis of RNA molecules randomly alternates with periods of silence in the transfer of genetic information. Yet, the molecular mechanisms behind this variability remain unclear. Experiments indicate that multiple biochemical states might be involved in the production of RNA molecules. Stimulated by these observations, we developed a theoretical framework to investigate the mechanisms of transcriptional bursting. It is based on a multi-state stochastic approach that provides a full quantitative description of the dynamic properties of the system. We found that the degree of stochastic fluctuations during transcription directly correlates with the number of biochemical states. This explains experimentally observed variability and fluctuations in the quantities of the produced RNA molecules. The procedure to estimate the number of relevant biochemical states participating in the transcription is outlined and applied to the analysis of experimental results. We also developed a general dynamic phase diagram for the transcription process. The presented theoretical method clarifies physical-chemical aspects of transcriptional bursting and presents a minimal chemical-kinetic description of the process.
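To make the notion of bursting concrete, here is a small stochastic simulation of the standard two-state ("telegraph") gene model, the simplest member of the multi-state family the abstract refers to. It is an illustrative sketch rather than the authors' framework: the rate constants are hypothetical, and the Fano factor (variance divided by mean of the RNA copy number) is used as the measure of fluctuations.

```python
import random


def simulate_telegraph(k_on, k_off, k_syn, k_deg, t_end, seed=0):
    """Gillespie simulation of a gene that switches between OFF and ON.

    RNA is produced at rate k_syn only while the gene is ON, and each
    RNA molecule degrades at rate k_deg. Returns the time-averaged
    mean copy number and the Fano factor.
    """
    rng = random.Random(seed)
    t, gene_on, rna = 0.0, False, 0
    sum_n = 0.0   # time integral of the copy number
    sum_n2 = 0.0  # time integral of the squared copy number
    while True:
        rates = [
            k_off if gene_on else k_on,  # gene state switch
            k_syn if gene_on else 0.0,   # RNA synthesis
            k_deg * rna,                 # RNA degradation
        ]
        total = sum(rates)
        t_next = t + rng.expovariate(total)
        # The current state holds until the next event or the end time.
        dt = min(t_next, t_end) - t
        sum_n += rna * dt
        sum_n2 += rna * rna * dt
        if t_next >= t_end:
            break
        t = t_next
        # Pick which reaction fires, in proportion to its rate.
        u = rng.random() * total
        if u < rates[0]:
            gene_on = not gene_on
        elif u < rates[0] + rates[1]:
            rna += 1
        else:
            rna -= 1
    mean = sum_n / t_end
    variance = sum_n2 / t_end - mean * mean
    return mean, variance / mean


# Hypothetical rates: rare activation, fast synthesis while ON.
mean, fano = simulate_telegraph(k_on=0.1, k_off=0.9, k_syn=20.0,
                                k_deg=1.0, t_end=5000.0)
print(f"mean RNA copies: {mean:.2f}, Fano factor: {fano:.2f}")
```

A constitutively expressed gene would give Poisson statistics with a Fano factor of 1. Values well above 1, as in this parameter regime, are the signature of bursty production.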
-
Pressure swing adsorption (PSA) is a widely used technology to separate a gas product from impurities in a variety of fields. Due to the complexity of PSA operations, process and instrument faults can occur at different parts and/or steps of the process. Thus, effective process monitoring is critical for ensuring efficient and safe operations of PSA systems. However, multi-bed PSA processes present several major challenges to process monitoring. First, a PSA process is operated in a periodic or cyclic fashion and never reaches a steady state. Second, the duration of different operation cycles is dynamically controlled in response to various disturbances, which results in a wide range of normal operation trajectories. Third, there is limited data for process monitoring, and bed pressure is usually the only measured variable. These key characteristics of PSA operation make process monitoring, especially early fault detection, significantly more challenging than for a continuous process operated at a steady state. To address these challenges, we propose a feature-based statistical process monitoring (SPM) framework for PSA processes, namely feature space monitoring (FSM). Through feature engineering and feature selection, we show that FSM can naturally handle the key challenges in PSA process monitoring and achieve early detection of subtle faults from a wide range of normal operating conditions. The performance of FSM is compared to the conventional SPM methods using both simulated and real faults from an industrial PSA process. The results demonstrate FSM’s superior performance in fault detection and fault diagnosis compared to the traditional SPM methods. In particular, the robust monitoring performance from FSM is achieved without any of the data preprocessing, trajectory alignment, or synchronization required by the conventional SPM methods.
-
Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.

