Background: Outcome measures that are count variables with excessive zeros are common in health behaviors research. Examples include the number of standard drinks consumed or alcohol‐related problems experienced over time. There is a lack of empirical data about the relative performance of prevailing statistical models for assessing the efficacy of interventions when outcomes are zero‐inflated, particularly compared with recently developed marginalized count regression approaches for such data. Methods: The current simulation study examined five commonly used approaches for analyzing count outcomes, including two linear models (with outcomes on raw and log‐transformed scales, respectively) and three prevailing count distribution‐based models (ie, Poisson, negative binomial, and zero‐inflated Poisson (ZIP) models). We also considered the marginalized zero‐inflated Poisson (MZIP) model, a novel alternative that estimates the overall effects on the population mean while adjusting for zero‐inflation. Motivated by alcohol misuse prevention trials, extensive simulations were conducted to evaluate and compare the statistical power and Type I error rate of the statistical models and approaches across data conditions that varied in sample size ( to 500), zero rate (0.2 to 0.8), and intervention effect sizes. Results: Under zero‐inflation, the Poisson model failed to control the Type I error rate, resulting in higher than expected false positive results. When the intervention effects on the zero (vs. non‐zero) and count parts were in the same direction, the MZIP model had the highest statistical power, followed by the linear model with outcomes on the raw scale, negative binomial model, and ZIP model. The performance of the linear model with a log‐transformed outcome variable was unsatisfactory. Conclusions: The MZIP model demonstrated better statistical properties in detecting true intervention effects and controlling false positive results for zero‐inflated count outcomes. 
This MZIP model may serve as an appealing analytical approach to evaluating overall intervention effects in studies with count outcomes marked by excessive zeros.
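To make the idea of an "overall effect on the population mean" concrete, the following minimal sketch simulates zero-inflated Poisson counts and compares the sample mean with the marginal mean (1 − ψ)·μ, which is the quantity a marginalized model targets. The parameter values (`psi`, `mu`, `n`) and the helper names are illustrative assumptions, not values or code from the study.

```python
import math
import random

def sample_poisson(mu, rng):
    """Draw one Poisson(mu) variate with Knuth's multiplication method."""
    threshold = math.exp(-mu)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count

def sample_zip(psi, mu, rng):
    """Zero-inflated Poisson draw: a structural zero with probability psi."""
    if rng.random() < psi:
        return 0
    return sample_poisson(mu, rng)

rng = random.Random(2024)
psi, mu, n = 0.5, 4.0, 50_000  # hypothetical values chosen for illustration
draws = [sample_zip(psi, mu, rng) for _ in range(n)]

sample_mean = sum(draws) / n
marginal_mean = (1 - psi) * mu  # overall (marginal) mean of a ZIP variable
zero_rate = sum(d == 0 for d in draws) / n
expected_zero_rate = psi + (1 - psi) * math.exp(-mu)

print(f"sample mean {sample_mean:.3f} vs marginal mean {marginal_mean:.3f}")
print(f"zero rate {zero_rate:.3f} vs expected {expected_zero_rate:.3f}")
```

The zero part and the count part both feed into the marginal mean, which is why an intervention can move the overall mean through either component.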
Interaction analysis under misspecification of main effects: Some common mistakes and simple solutions
The statistical practice of modeling interaction with two linear main effects and a product term is ubiquitous in the statistical and epidemiological literature. Most data modelers are aware that the misspecification of main effects can potentially cause severe type I error inflation in tests for interactions, leading to spurious detection of interactions. However, modeling practice has not changed. In this article, we focus on the specific situation where the main effects in the model are misspecified as linear terms and characterize its impact on common tests for statistical interaction. We then propose some simple alternatives that fix the issue of potential type I error inflation in testing interaction due to main effect misspecification. We show that when using the sandwich variance estimator for a linear regression model with a quantitative outcome and two independent factors, both the Wald and score tests asymptotically maintain the correct type I error rate. However, if the independence assumption does not hold or the outcome is binary, using the sandwich estimator does not fix the problem. We further demonstrate that flexibly modeling the main effect under a generalized additive model can largely reduce or often remove bias in the estimates and maintain the correct type I error rate for both quantitative and binary outcomes regardless of the independence assumption. We show, under the independence assumption and for a continuous outcome, overfitting and flexibly modeling the main effects does not lead to power loss asymptotically relative to a correctly specified main effect model. Our simulation study further demonstrates the empirical fact that using flexible models for the main effects does not result in a significant loss of power for testing interaction in general. Our results provide an improved understanding of the strengths and limitations for tests of interaction in the presence of main effect misspecification. 
Using data from a large biobank study “The Michigan Genomics Initiative”, we present two examples of interaction analysis in support of our results.
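The contrast between model-based and sandwich-based Wald tests can be illustrated with a small simulation. The sketch below is not the authors' code: it assumes two independent standard normal factors, a quantitative outcome with a quadratic main effect in the first factor, and no true interaction, then fits the usual linear-plus-product model and counts how often each Wald test rejects at the nominal 5% level. The HC0 form of the sandwich estimator, the data-generating model, and all function names are assumptions made for illustration.

```python
import math
import random

def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def interaction_z(x1, x2, y):
    """Fit y ~ 1 + x1 + x2 + x1*x2 by OLS and return the Wald z statistics
    for the product term using (model-based, HC0 sandwich) variances."""
    rows = [[1.0, u, v, u * v] for u, v in zip(x1, x2)]
    p, n = 4, len(y)
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    resid = [yi - sum(b * c for b, c in zip(beta, r)) for r, yi in zip(rows, y)]
    # Last column of (X'X)^{-1}; X'X is symmetric, so it is also the last row.
    a = solve(xtx, [0.0, 0.0, 0.0, 1.0])
    var_model = sum(e * e for e in resid) / (n - p) * a[3]
    # HC0 sandwich entry: a' (sum_i e_i^2 x_i x_i') a = sum_i (e_i * a.x_i)^2
    var_sandwich = sum((e * sum(c * d for c, d in zip(a, r))) ** 2
                       for r, e in zip(rows, resid))
    return beta[3] / math.sqrt(var_model), beta[3] / math.sqrt(var_sandwich)

def rejection_rates(reps=300, n=200, seed=1):
    """Empirical rejection rates under a null of no interaction."""
    rng = random.Random(seed)
    naive = robust = 0
    for _ in range(reps):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        x2 = [rng.gauss(0, 1) for _ in range(n)]
        # Quadratic main effect in x1, linear in x2, NO interaction term.
        y = [u * u + v + rng.gauss(0, 1) for u, v in zip(x1, x2)]
        z_model, z_sand = interaction_z(x1, x2, y)
        naive += abs(z_model) > 1.96
        robust += abs(z_sand) > 1.96
    return naive / reps, robust / reps

naive_rate, robust_rate = rejection_rates()
print(f"model-based Wald: {naive_rate:.3f}, sandwich Wald: {robust_rate:.3f}")
```

In this assumed setting the model-based test should reject far more often than the nominal level, while the sandwich-based test should stay much closer to it. As the abstract notes, that protection is not expected to carry over when the factors are dependent or the outcome is binary.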
- Award ID(s):
- 1712933
- PAR ID:
- 10455514
- Publisher / Repository:
- Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name:
- Statistics in Medicine
- Volume:
- 39
- Issue:
- 11
- ISSN:
- 0277-6715
- Page Range / eLocation ID:
- p. 1675-1694
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract While researchers commonly use the bootstrap to quantify the uncertainty of an estimator, it has been noticed that the standard bootstrap, in general, does not work for Chatterjee’s rank correlation. In this paper, we provide proof of this issue under an additional independence assumption, and complement our theory with simulation evidence for general settings. Chatterjee’s rank correlation thus falls into a category of statistics that are asymptotically normal, but bootstrap inconsistent. Valid inferential methods in this case are Chatterjee’s original proposal for testing independence and the analytic asymptotic variance estimator of Lin & Han (2022) for more general purposes. [Received on 5 April 2023. Editorial decision on 10 January 2024]
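For readers unfamiliar with the statistic itself, here is a minimal sketch of Chatterjee's rank correlation in its no-ties form, ξ_n = 1 − 3 Σ|r_{i+1} − r_i| / (n² − 1), where r_i is the rank of the Y value paired with the i-th smallest X. The function name and the example data are illustrative assumptions, and the sketch only computes the point estimate. It does not implement any of the inferential methods discussed in the abstract.

```python
import random

def chatterjee_xi(x, y):
    """Chatterjee's rank correlation xi_n(X, Y), assuming no ties in x or y."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])   # indices sorted by x
    y_sorted = [y[i] for i in order]               # y rearranged by increasing x
    rank_of = {v: r for r, v in enumerate(sorted(y_sorted), start=1)}
    ranks = [rank_of[v] for v in y_sorted]
    total = sum(abs(ranks[i + 1] - ranks[i]) for i in range(n - 1))
    return 1 - 3 * total / (n * n - 1)

# A monotone functional relationship gives a value close to 1.
xs = list(range(100))
xi_monotone = chatterjee_xi(xs, [v * v for v in xs])

# Independent draws give a value close to 0.
rng = random.Random(7)
u = [rng.random() for _ in range(2000)]
w = [rng.random() for _ in range(2000)]
xi_independent = chatterjee_xi(u, w)

print(f"monotone: {xi_monotone:.4f}, independent: {xi_independent:.4f}")
```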
-
Summary Randomized experiments have been the gold standard for drawing causal inference. The conventional model-based approach has been one of the most popular methods of analysing treatment effects from randomized experiments, which is often carried out through inference for certain model parameters. In this paper, we provide a systematic investigation of model-based analyses for treatment effects under the randomization-based inference framework. This framework does not impose any distributional assumptions on the outcomes, covariates and their dependence, and utilizes only randomization as the reasoned basis. We first derive the asymptotic theory for $Z$-estimation in completely randomized experiments, and propose sandwich-type conservative covariance estimation. We then apply the developed theory to analyse both average and individual treatment effects in randomized experiments. For the average treatment effect, we consider model-based, model-imputed and model-assisted estimation strategies, where the first two strategies can be sensitive to model misspecification or require specific methods for parameter estimation. The model-assisted approach is robust to arbitrary model misspecification and always provides consistent average treatment effect estimation. We propose optimal ways to conduct model-assisted estimation using generally nonlinear least squares for parameter estimation. For the individual treatment effects, we propose directly modelling the relationship between individual effects and covariates, and discuss the model’s identifiability, inference and interpretation allowing for model misspecification.
-
Abstract Network interference, where the outcome of an individual is affected by the treatment assignment of those in their social network, is pervasive in real-world settings. However, it poses a challenge to estimating causal effects. We consider the task of estimating the total treatment effect (TTE), or the difference between the average outcomes of the population when everyone is treated versus when no one is, under network interference. Under a Bernoulli randomized design, we provide an unbiased estimator for the TTE when network interference effects are constrained to low-order interactions among neighbors of an individual. We make no assumptions on the graph other than bounded degree, allowing for well-connected networks that may not be easily clustered. We derive a bound on the variance of our estimator and show in simulated experiments that it performs well compared with standard estimators for the TTE. We also derive a minimax lower bound on the mean squared error of our estimator, which suggests that the difficulty of estimation can be characterized by the degree of interactions in the potential outcomes model. We also prove that our estimator is asymptotically normal under boundedness conditions on the network degree and potential outcomes model. Central to our contribution is a new framework for balancing model flexibility and statistical complexity as captured by this low-order interactions structure.
-
Random forest is considered as one of the most successful machine learning algorithms, which has been widely used to construct microbiome-based predictive models. However, its use as a statistical testing method has not been explored. In this study, we propose “Random Forest Test” (RFtest), a global (community-level) test based on random forest for high-dimensional and phylogenetically structured microbiome data. RFtest is a permutation test using the generalization error of random forest as the test statistic. Our simulations demonstrate that RFtest has controlled type I error rates, that its power is superior to competing methods for phylogenetically clustered signals, and that it is robust to outliers and adaptive to interaction effects and non-linear associations. Finally, we apply RFtest to two real microbiome datasets to ascertain whether microbial communities are associated or not with the outcome variables.
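The general pattern of a permutation test that uses a learner's generalization error as its test statistic can be sketched as follows. This is not RFtest: to stay dependency-free, the sketch swaps random forest for a 1-nearest-neighbour classifier and uses its leave-one-out error as the statistic. The simulated data, the function names, and the number of permutations are all illustrative assumptions.

```python
import random

def nearest_neighbours(features):
    """Index of each sample's nearest other sample (squared Euclidean distance)."""
    n = len(features)

    def dist(i, j):
        return sum((a - b) ** 2 for a, b in zip(features[i], features[j]))

    return [min((j for j in range(n) if j != i), key=lambda j, i=i: dist(i, j))
            for i in range(n)]

def loo_error(neighbours, labels):
    """Leave-one-out misclassification rate of the 1-NN stand-in learner."""
    return sum(labels[j] != labels[i] for i, j in enumerate(neighbours)) / len(labels)

def permutation_pvalue(features, labels, n_perm=500, seed=0):
    """P-value for the null of no association between features and labels."""
    rng = random.Random(seed)
    neighbours = nearest_neighbours(features)  # depends on features only
    observed = loo_error(neighbours, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any feature-label association
        hits += loo_error(neighbours, shuffled) <= observed
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p > 0

# Hypothetical data: two groups of 30 samples whose feature means differ.
rng = random.Random(42)
labels = [0] * 30 + [1] * 30
features = [[rng.gauss(2.0 * g, 1.0) for _ in range(5)] for g in labels]
p_value = permutation_pvalue(features, labels)
print(f"permutation p-value: {p_value:.4f}")
```

A small observed error relative to the permuted errors indicates that the features carry information about the outcome. Any learner with an honest out-of-sample error estimate could take the place of the stand-in used here.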
