This content will become publicly available on June 30, 2026

Title: A Comparison of Zero-Inflated Models for Modern Biomedical Data
A growing number of datasets exhibit an excess of zero values that cannot be adequately modeled using standard probability distributions. For example, microbiome data and single-cell RNA sequencing data consist of count measurements in which the proportion of zeros exceeds what standard distributions such as the Poisson or negative binomial can capture, while the nonzero counts still require appropriate modeling. Several models have been proposed for zero-inflated datasets, including the zero-inflated negative binomial model, the hurdle negative binomial model, and the truncated latent Gaussian copula model. This study aims to compare these models and determine which performs best under different conditions, using both simulation studies and real data analyses. We are particularly interested in how dependence among the variables, the level of zero-inflation or zero-deflation, and the variance of the data affect model selection. KEYWORDS: Zero-Inflated Models; Hurdle Models; Truncated Latent Gaussian Copula Model; Microbiome Data; Gene-Sequencing Data; Zero-Inflation; Negative Binomial; Zero-Deflation
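As a rough illustration of two of the model families named in the abstract, the sketch below writes out the probability mass functions of a zero-inflated negative binomial and a hurdle negative binomial using only the Python standard library. The mean/dispersion parameterization, the function names, and the parameter values are assumptions chosen for illustration; they are not taken from the paper.

```python
import math

def nb_pmf(k, mu, r):
    """Negative binomial pmf with mean mu and dispersion (size) r."""
    log_p = (math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
             + r * math.log(r / (r + mu)) + k * math.log(mu / (r + mu)))
    return math.exp(log_p)

def zinb_pmf(k, mu, r, pi):
    """Zero-inflated NB: a point mass at zero (weight pi) mixed with an NB count."""
    base = (1.0 - pi) * nb_pmf(k, mu, r)
    return pi + base if k == 0 else base

def hurdle_nb_pmf(k, mu, r, p0):
    """Hurdle NB: zeros occur with probability p0; positives follow a zero-truncated NB."""
    if k == 0:
        return p0
    return (1.0 - p0) * nb_pmf(k, mu, r) / (1.0 - nb_pmf(0, mu, r))

# Hypothetical parameter values, chosen only for illustration.
mu, r = 3.0, 1.5
nb_zero = nb_pmf(0, mu, r)                      # zero probability under a plain NB
zinb_zero = zinb_pmf(0, mu, r, pi=0.3)          # inflated relative to the plain NB
hurdle_zero = hurdle_nb_pmf(0, mu, r, p0=0.05)  # deflated relative to the plain NB

print("P(Y=0): NB", nb_zero, "ZINB", zinb_zero, "hurdle NB", hurdle_zero)
```

In this parameterization the mixing weight pi lies in [0, 1], so the zero-inflated form can only add zeros relative to the plain negative binomial, whereas the hurdle form sets the zero probability freely and can therefore also describe zero-deflation.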
Award ID(s):
2150179
PAR ID:
10611607
Author(s) / Creator(s):
; ;
Publisher / Repository:
American Journal of Undergraduate Research
Date Published:
Journal Name:
American Journal of Undergraduate Research
Volume:
22
Issue:
2
ISSN:
1536-4585
Page Range / eLocation ID:
49-68
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Background: Single-cell RNA sequencing (scRNA-seq) is a powerful profiling technique at single-cell resolution. Appropriate analysis of scRNA-seq data can characterize molecular heterogeneity and shed light on the underlying cellular processes to better understand development and disease mechanisms. The unique analytic challenge is to appropriately model highly over-dispersed scRNA-seq count data with prevalent dropouts (zero counts), which has made zero-inflated dimensionality reduction techniques popular for scRNA-seq data analyses. Employing zero-inflated distributions, however, may place extra emphasis on zero counts, leading to potential bias when identifying the latent structure of the data. Results: In this paper, we propose a fully generative hierarchical gamma-negative binomial (hGNB) model of scRNA-seq data, obviating the need for explicitly modeling zero inflation. At the same time, hGNB can naturally account for covariate effects at both the gene and cell levels to identify complex latent representations of scRNA-seq data, without the need for commonly adopted pre-processing steps such as normalization. Efficient Bayesian model inference is derived by exploiting conditional conjugacy via novel data augmentation techniques. Conclusion: Experimental results on both simulated data and several real-world scRNA-seq datasets suggest that hGNB is a powerful tool for cell cluster discovery as well as cell lineage inference.
  2. Spatiotemporal data with an abundance of zeros are widely analyzed in areas such as epidemiology and public health. We use a Bayesian framework to fit zero-inflated negative binomial models and employ a set of latent variables from Pólya-Gamma distributions to derive an efficient Gibbs sampler. The proposed model accommodates varying spatial and temporal random effects through Gaussian process (GP) priors, which offer both simplicity and flexibility in modeling nonlinear relationships through a covariance function. To overcome the computational bottleneck that GPs may suffer when the sample size is large, we adopt the nearest-neighbor GP approach, which approximates the covariance matrix using local experts. For the simulation study, we adopt multiple settings with varying numbers of spatial locations to evaluate the performance of the proposed model, including the estimation of spatial and temporal random effects, and compare the results with those of other methods. We also apply the proposed model to the COVID-19 death counts in the state of Florida, USA from 3/25/2020 through 7/29/2020 to examine relationships between social vulnerability and COVID-19 deaths.
  3. Background: Outcome measures that are count variables with excessive zeros are common in health behaviors research. Examples include the number of standard drinks consumed or alcohol‐related problems experienced over time. There is a lack of empirical data about the relative performance of prevailing statistical models for assessing the efficacy of interventions when outcomes are zero‐inflated, particularly compared with recently developed marginalized count regression approaches for such data. Methods: The current simulation study examined five commonly used approaches for analyzing count outcomes, including two linear models (with outcomes on raw and log‐transformed scales, respectively) and three prevailing count distribution‐based models (i.e., Poisson, negative binomial, and zero‐inflated Poisson (ZIP) models). We also considered the marginalized zero‐inflated Poisson (MZIP) model, a novel alternative that estimates the overall effects on the population mean while adjusting for zero‐inflation. Motivated by alcohol misuse prevention trials, extensive simulations were conducted to evaluate and compare the statistical power and Type I error rate of the statistical models and approaches across data conditions that varied in sample size ( to 500), zero rate (0.2 to 0.8), and intervention effect sizes. Results: Under zero‐inflation, the Poisson model failed to control the Type I error rate, resulting in higher than expected false positive results. When the intervention effects on the zero (vs. non‐zero) and count parts were in the same direction, the MZIP model had the highest statistical power, followed by the linear model with outcomes on the raw scale, negative binomial model, and ZIP model. The performance of the linear model with a log‐transformed outcome variable was unsatisfactory. Conclusions: The MZIP model demonstrated better statistical properties in detecting true intervention effects and controlling false positive results for zero‐inflated count outcomes. This MZIP model may serve as an appealing analytical approach to evaluating overall intervention effects in studies with count outcomes marked by excessive zeros.
  4. Summary: Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients.
  5. Single-cell RNA sequencing (scRNA-seq) data often contain doublets, where a doublet manifests as one cell barcode that corresponds to the combined gene expression of two or more cells. The existence of doublets can lead to spurious biological interpretations. Here, we present single-cell MOdel-driven Doublet Detection (scMODD), a model-driven algorithm to detect doublets in scRNA-seq data. scMODD achieved performance similar to that of existing doublet detection algorithms, which are primarily data-driven, showing the promise of a model-driven approach for doublet detection. When implementing scMODD on simulated and real scRNA-seq data, we tested both the negative binomial (NB) model and the zero-inflated negative binomial (ZINB) model as the underlying statistical model for scRNA-seq count data, and observed that incorporating zero inflation did not improve detection performance, suggesting that consideration of zero inflation is not necessary in the context of doublet detection in scRNA-seq.