Title: A Bayesian Zero-Inflated Dirichlet-Multinomial Regression Model for Multivariate Compositional Count Data
Abstract: The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to their ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle the excess zeros typically found in practice, which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.
Award ID(s):
2245492
PAR ID:
10487591
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Biometrics
Volume:
79
Issue:
4
ISSN:
0006-341X
Format(s):
Medium: X Size: p. 3239-3251
Size(s):
p. 3239-3251
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Zero-inflated and hurdle models are widely applied to count data possessing excess zeros, where they can simultaneously model the process by which the zeros were generated and potentially help mitigate the effects of overdispersion relative to the assumed count distribution. Which model to use depends on how the zeros are generated: zero-inflated models add an additional probability mass on zero, while hurdle models are two-part models composed of a degenerate distribution for the zeros and a zero-truncated distribution. Developing confidence intervals for such models is challenging since no closed-form function is available to calculate the mean. In this study, generalized fiducial inference is used to construct confidence intervals for the means of zero-inflated Poisson and Poisson hurdle models. The proposed methods are assessed by an intensive simulation study. An illustrative example demonstrates the inference methods.
  2. Summary Many clinical endpoint measures, such as the number of standard drinks consumed per week or the number of days that patients stayed in the hospital, are count data with excessive zeros. However, the zero‐inflated nature of such outcomes is sometimes ignored in analyses of clinical trials. This leads to biased estimates of study‐level intervention effect and, consequently, a biased estimate of the overall intervention effect in a meta‐analysis. The current study proposes a novel statistical approach, the Zero‐inflation Bias Correction (ZIBC) method, that can account for the bias introduced when using the Poisson regression model, despite a high rate of inflated zeros in the outcome distribution of a randomized clinical trial. This correction method only requires summary information from individual studies to correct intervention effect estimates as if they were appropriately estimated using the zero‐inflated Poisson regression model, thus it is attractive for meta‐analysis when individual participant‐level data are not available in some studies. Simulation studies and real data analyses showed that the ZIBC method performed well in correcting zero‐inflation bias in most situations. 
  3. Fan, Yanan; Nott, David; Smith, Michael S; Dortet-Bernadet, Jean-Luc. (Ed.)
    Quantile regression is widely seen as an ideal tool to understand complex predictor-response relations. Its biggest promise rests in its ability to quantify whether and how predictor effects vary across response quantile levels. But this promise has not been fully met due to a lack of statistical estimation methods that perform a rigorous, joint analysis of all quantile levels. This gap has been recently bridged by Yang and Tokdar [18]. Here we demonstrate how their joint quantile regression method, as encoded in the R package qrjoint, offers a comprehensive and model-based regression analysis framework. This chapter is an R vignette where we illustrate how to fit models, interpret coefficients, improve and compare models and obtain predictions under this framework. Our case study is an application to ecology where we analyse how the abundance of red maple trees depends on topographical and geographical features of the location. A complete absence of the species contributes excess zeros in the response data. We treat such excess zeros as left censoring in the spirit of a Tobit regression analysis. By utilising the generative nature of the joint quantile regression model, we not only adjust for censoring but also treat it as an object of independent scientific interest. 
  4. Abstract The FLIMFLAM survey is collecting spectroscopic data of field galaxies near fast radio burst (FRB) sight lines to constrain key parameters describing the distribution of matter in the Universe. In this work, we leverage the survey data to determine the source of the excess extragalactic dispersion measure (DM), compared to Macquart relation estimates of four FRBs: FRB20190714A, FRB20200906A, FRB20200430A, and FRB20210117A. By modeling the gas distribution around the foreground galaxy halos and galaxy groups of the sight lines, we estimate DM_halos, their contribution to the FRB DMs. The FRB20190714A sight line shows a clear excess of foreground halos which contribute roughly two-thirds of the observed excess DM, thus implying a sight line that is baryon dense. FRB20200906A shows a smaller but nonnegligible foreground halo contribution, and further analysis of the intergalactic medium is necessary to ascertain the true cosmic contribution to its DM. FRB20200430A and FRB20210117A show negligible foreground contributions, implying a large host galaxy excess and/or progenitor environment excess.
  5. Abstract A search is performed for dark matter (DM) produced in association with a single top quark or a pair of top quarks using the data collected with the CMS detector at the LHC from proton-proton collisions at a center-of-mass energy of 13 TeV, corresponding to 138 fb⁻¹ of integrated luminosity. An excess of events with a large imbalance of transverse momentum is searched for across 0, 1 and 2 lepton final states. Novel multivariate techniques are used to take advantage of the differences in kinematic properties between the two DM production mechanisms. No significant deviations with respect to the standard model predictions are observed. The results are interpreted considering a simplified model in which the mediator is either a scalar or pseudoscalar particle and couples to top quarks and to DM fermions. Axion-like particles that are coupled to top quarks and DM fermions are also considered. Expected exclusion limits of 410 and 380 GeV for scalar and pseudoscalar mediator masses, respectively, are set at the 95% confidence level. A DM particle mass of 1 GeV is assumed, with mediator couplings to fermions and DM particles set to unity. A small signal-like excess is observed in data, with the largest local significance observed to be 1.9 standard deviations for the 150 GeV pseudoscalar mediator hypothesis. Because of this excess, mediator masses are only excluded below 310 (320) GeV for the scalar (pseudoscalar) mediator. The results are also translated into model-independent 95% confidence level upper limits on the visible cross section of DM production in association with top quarks, ranging from 1 pb to 0.02 pb.