Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potential detrimental consequences of model misspecification, after fitting a regression model, it is of prime importance to check the adequacy of the model. However, due to the point mass at zero, standard diagnostic tools for regression models (eg, deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residuals for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
This content will become publicly available on February 14, 2025
- Award ID(s):
- 2210712
- PAR ID:
- 10498083
- Publisher / Repository:
- Taylor & Francis
- Date Published:
- Journal Name:
- Journal of Computational and Graphical Statistics
- ISSN:
- 1061-8600
- Page Range / eLocation ID:
- 1 to 17
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
ABSTRACT -
Abstract The genome‐wide association studies (GWAS) typically use linear or logistic regression models to identify associations between phenotypes (traits) and genotypes (genetic variants) of interest. However, the use of regression with the additive assumption has potential limitations. First, the normality assumption of residuals is the one that is rarely seen in practice, and deviation from normality increases the Type‐I error rate. Second, building a model based on such an assumption ignores genetic structures, like, dominant, recessive, and protective‐risk cases. Ignoring genetic variants may result in spurious conclusions about the associations between a variant and a trait. We propose an assumption‐free model built upon data‐consistent inversion (DCI), which is a recently developed measure‐theoretic framework utilized for uncertainty quantification. This proposed DCI‐derived model builds a nonparametric distribution on model inputs that propagates to the distribution of observed data without the required normality assumption of residuals in the regression model. This characteristic enables the proposed DCI‐derived model to cover all genetic variants without emphasizing on additivity of the classic‐GWAS model. Simulations and a replication GWAS with data from the COPDGene demonstrate the ability of this model to control the Type‐I error rate at least as well as the classic‐GWAS (additive linear model) approach while having similar or greater power to discover variants in different genetic modes of transmission.
-
Abstract We consider a regression analysis of longitudinal data in the presence of outcome‐dependent observation times and informative censoring. Existing approaches commonly require a correct specification of the joint distribution of longitudinal measurements, the observation time process, and informative censoring time under the joint modeling framework and can be computationally cumbersome due to the complex form of the likelihood function. In view of these issues, we propose a semiparametric joint regression model and construct a composite likelihood function based on a conditional order statistics argument. As a major feature of our proposed methods, the aforementioned joint distribution is not required to be specified, and the random effect in the proposed joint model is treated as a nuisance parameter. Consequently, the derived composite likelihood bypasses the need to integrate over the random effect and offers the advantage of easy computation. We show that the resulting estimators are consistent and asymptotically normal. We use simulation studies to evaluate the finite‐sample performance of the proposed method and apply it to a study of weight loss data that motivated our investigation.
-
Abstract With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Comparing to existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of Sepsis.
-
Summary In many observational longitudinal studies, the outcome of interest presents a skewed distribution, is subject to censoring due to detection limit or other reasons, and is observed at irregular times that may follow a outcome-dependent pattern. In this work, we consider quantile regression modeling of such longitudinal data, because quantile regression is generally robust in handling skewed and censored outcomes and is flexible to accommodate dynamic covariate-outcome relationships. Specifically, we study a longitudinal quantile regression model that specifies covariate effects on the marginal quantiles of the longitudinal outcome. Such a model is easy to interpret and can accommodate dynamic outcome profile changes over time. We propose estimation and inference procedures that can appropriately account for censoring and irregular outcome-dependent follow-up. Our proposals can be readily implemented based on existing software for quantile regression. We establish the asymptotic properties of the proposed estimator, including uniform consistency and weak convergence. Extensive simulations suggest good finite-sample performance of the new method. We also present an analysis of data from a long-term study of a population exposed to polybrominated biphenyls (PBB), which uncovers an inhomogeneous PBB elimination pattern that would not be detected by traditional longitudinal data analysis.