ABSTRACT Semicontinuous outcomes commonly arise in a wide variety of fields, such as insurance claims, healthcare expenditures, rainfall amounts, and alcohol consumption. Regression models, including Tobit, Tweedie, and two-part models, are widely employed to understand the relationship between semicontinuous outcomes and covariates. Given the potentially detrimental consequences of model misspecification, it is of prime importance to check the adequacy of a regression model after fitting it. However, due to the point mass at zero, standard diagnostic tools for regression models (e.g., deviance and Pearson residuals) are not informative for semicontinuous data. To bridge this gap, we propose a new type of residual for semicontinuous outcomes that is applicable to general regression models. Under the correctly specified model, the proposed residuals converge to being uniformly distributed, and when the model is misspecified, they significantly depart from this pattern. In addition to in-sample validation, the proposed methodology can also be employed to evaluate predictive distributions. We demonstrate the effectiveness of the proposed tool using health expenditure data from the US Medical Expenditure Panel Survey.
Double Probability Integral Transform Residuals for Regression Models with Discrete Outcomes
The assessment of regression models with discrete outcomes is challenging and has many fundamental issues. With discrete outcomes, standard regression model assessment tools such as Pearson and deviance residuals do not follow the conventional reference distribution (normal) under the true model, calling into question the legitimacy of model assessment based on these tools. To fill this gap, we construct a new type of residuals for regression models with general discrete outcomes, including ordinal and count outcomes. The proposed residuals are based on two layers of probability integral transformation. When at least one continuous covariate is available, the proposed residuals closely follow a uniform distribution (or a normal distribution after transformation) under the correctly specified model. One can construct visualizations such as QQ plots to check the overall fit of a model straightforwardly, and the shape of QQ plots can further help identify possible causes of misspecification such as overdispersion. We provide theoretical justification for the proposed residuals by establishing their asymptotic properties. Moreover, in order to assess the mean structure and identify potential covariates, we develop an ordered curve as a supplementary tool, which is based on the comparison between the partial sum of outcomes and of fitted means. Through simulation, we demonstrate empirically that the proposed tools outperform commonly used residuals for various model assessment tasks. We also illustrate the workflow of model assessment using the proposed tools in data analysis. Supplementary materials for this article are available online.
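The ordered curve described above compares the partial sum of outcomes with the partial sum of fitted means. A minimal sketch of that comparison follows, under stated assumptions: the observations are ordered by a user-chosen variable (here called `threshold`), and both partial sums are scaled by their totals so the curve ends at (1, 1). The function names, the scaling, and the toy data are illustrative choices, not the article's implementation.

```python
from itertools import accumulate


def ordered_curve(y, fitted_mean, threshold):
    """Return (scaled partial sums of fitted means, scaled partial sums of outcomes).

    Observations are visited in increasing order of `threshold`
    (an assumed ordering variable, e.g. a candidate covariate).
    """
    order = sorted(range(len(y)), key=lambda i: threshold[i])
    total_y = sum(y)
    total_mu = sum(fitted_mean)
    cum_y = list(accumulate(y[i] for i in order))
    cum_mu = list(accumulate(fitted_mean[i] for i in order))
    return [m / total_mu for m in cum_mu], [v / total_y for v in cum_y]


def max_gap(xs, ys):
    """Largest vertical distance between the curve and the 45-degree line."""
    return max(abs(a - b) for a, b in zip(xs, ys))


# Toy data (made up for illustration): counts that grow with z.
y = [0, 1, 0, 2, 3, 4]
z = [1, 2, 3, 4, 5, 6]
mu_good = [0.2, 0.8, 0.4, 1.8, 3.0, 3.8]  # fitted means that track y
mu_flat = [sum(y) / len(y)] * len(y)      # intercept-only fit that ignores z

print(max_gap(*ordered_curve(y, mu_good, z)))
print(max_gap(*ordered_curve(y, mu_flat, z)))
```

In this sketch, a curve that stays close to the diagonal suggests the mean structure is adequate along the ordering variable, while a large gap suggests that variable carries information the fitted means miss.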
- Award ID(s): 2210712
- PAR ID: 10498083
- Publisher / Repository: Taylor & Francis
- Date Published:
- Journal Name: Journal of Computational and Graphical Statistics
- ISSN: 1061-8600
- Page Range / eLocation ID: 1 to 17
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Abstract Genome‐wide association studies (GWAS) typically use linear or logistic regression models to identify associations between phenotypes (traits) and genotypes (genetic variants) of interest. However, the use of regression with the additive assumption has potential limitations. First, the normality assumption on residuals rarely holds in practice, and deviation from normality increases the Type‐I error rate. Second, building a model based on such an assumption ignores genetic structures such as dominant, recessive, and protective‐risk cases. Ignoring genetic variants may result in spurious conclusions about the associations between a variant and a trait. We propose an assumption‐free model built upon data‐consistent inversion (DCI), which is a recently developed measure‐theoretic framework utilized for uncertainty quantification. This proposed DCI‐derived model builds a nonparametric distribution on model inputs that propagates to the distribution of observed data without the required normality assumption of residuals in the regression model. This characteristic enables the proposed DCI‐derived model to cover all genetic variants without emphasizing the additivity of the classic‐GWAS model. Simulations and a replication GWAS with data from the COPDGene demonstrate the ability of this model to control the Type‐I error rate at least as well as the classic‐GWAS (additive linear model) approach while having similar or greater power to discover variants in different genetic modes of transmission.
Abstract With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Compared with existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of sepsis.
Gait speed assessment increases the predictive value for mortality and morbidity following cardiac surgery in older adults. The purpose of this study was to improve clinical assessment and prediction of mortality and morbidity among older patients undergoing cardiac surgery through the identification of the relationships between preoperative gait and postural stability characteristics, measured with a noninvasive wearable mobile phone device, and postoperative cardiac surgical outcomes. This research was a prospective study of ambulatory patients aged over 70 years undergoing non-emergent cardiac surgery. Sixteen older adults with cardiovascular disease (age 76.1 ± 3.6 years) scheduled for cardiac surgery within the next 24 h were recruited for this study. As per the Society of Thoracic Surgeons (STS) recommendation guidelines, eight of the cardiovascular disease (CVD) patients were classified as frail (prone to adverse outcomes with gait speed ≤0.833 m/s) and the remaining eight patients as non-frail (gait speed >0.833 m/s). Treating physicians and patients were blinded to gait and posture assessment results so as not to influence the decision to proceed with surgery or postoperative management. Follow-ups regarding patient outcomes were continued until patients were discharged or transferred from the hospital, at which time data regarding outcomes were extracted from the records. In the preoperative setting, patients performed the 5-m walk and stood still for 30 s in the clinic while wearing a mobile phone with a customized app, “Lockhart Monitor”, available at the iOS App Store. Systematic evaluations of different gait and posture measures identified a subset of smartphone measures most sensitive to differences between the two groups (frail versus non-frail) with adverse postoperative outcomes (morbidity/mortality). A regression model based on these smartphone measures tested positive on five CVD patients. Thus, clinical settings can readily utilize mobile technology, and the proposed regression model can predict adverse postoperative outcomes such as morbidity or mortality events.
False power consumption data injected from compromised smart meters in the Advanced Metering Infrastructure (AMI) of smart grids is a threat that negatively affects both customers and utilities. In particular, organized and stealthy adversaries can launch various types of data falsification attacks from multiple meters using smart or persistent strategies. In this paper, we propose a real-time, two-tier attack detection scheme to detect orchestrated data falsification under a sophisticated threat model in decentralized micro-grids. The first detection tier monitors whether the Harmonic to Arithmetic Mean Ratio of aggregated daily power consumption data is outside a normal range known as the safe margin. To confirm whether a discrepancy in the first detection tier is indeed an attack, the second detection tier monitors the sum of the residuals (differences) between the proposed ratio metric and the safe margin over a frame of multiple days. If the sum of residuals is beyond a standard limit range, the presence of a data falsification attack is confirmed. Both the ‘safe margins’ and the ‘standard limits’ are designed through a ‘system identification phase’, where the signatures of the proposed metrics under normal conditions are studied using real AMI micro-grid data sets from two different countries over multiple years. Subsequently, we show how the proposed metrics trigger unique signatures under various attacks, which aid in attack reconstruction and also limit the impact of persistent attacks. Unlike metrics such as CUSUM or EWMA, the stability of the proposed metrics under normal conditions allows successful real-time detection of various stealthy attacks with ultra-low false alarms.
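The two detection tiers described above can be sketched roughly as follows. This is an illustration under assumptions, not the paper's implementation: the residual is taken to be the signed distance by which a day's ratio falls outside the safe margin (zero when inside), and the safe margin, standard limit, frame length, and meter readings are made-up values. In the paper, the margins and limits come from a system identification phase on real data.

```python
from statistics import fmean, harmonic_mean


def hm_am_ratio(readings):
    """Harmonic to Arithmetic Mean Ratio of one day's (positive) meter readings."""
    return harmonic_mean(readings) / fmean(readings)


def outside_safe_margin(ratio, safe_margin):
    """Tier 1: flag a day whose ratio leaves the safe margin (low, high)."""
    low, high = safe_margin
    return not (low <= ratio <= high)


def residual(ratio, safe_margin):
    """Assumed residual: signed distance from the ratio to the safe margin."""
    low, high = safe_margin
    if ratio < low:
        return ratio - low
    if ratio > high:
        return ratio - high
    return 0.0


def attack_confirmed(ratios, safe_margin, standard_limit, frame):
    """Tier 2: sum residuals over the last `frame` days and test the limit range."""
    total = sum(residual(r, safe_margin) for r in ratios[-frame:])
    low, high = standard_limit
    return not (low <= total <= high)


# Hypothetical readings (kWh) from four meters on a normal and a falsified day.
normal_day = [10, 12, 11, 9]
attack_day = [10, 12, 11, 30]
margin = (0.95, 1.0)   # assumed safe margin
limit = (-0.2, 0.2)    # assumed standard limit range

ratios = [hm_am_ratio(d) for d in [normal_day, attack_day, attack_day, attack_day]]
print([outside_safe_margin(r, margin) for r in ratios])
print(attack_confirmed(ratios, margin, limit, frame=3))
```

Because the harmonic mean never exceeds the arithmetic mean for positive readings, the ratio lies in (0, 1], and inflating a subset of readings pulls it down, which is what the first tier picks up in this toy example.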

