

Title: Calibrated percentile double bootstrap for robust linear regression inference
We consider inference for the parameters of a linear model when the covariates are random and the relationship between response and covariates is possibly non-linear. Conventional inference methods such as z intervals perform poorly in these cases. We propose a double bootstrap-based calibrated percentile method, perc-cal, as a general-purpose CI method which performs very well relative to alternative methods in challenging situations such as these. The superior performance of perc-cal is demonstrated by a thorough, full-factorial design synthetic data study as well as a data example involving the length of criminal sentences. We also provide theoretical justification for the perc-cal method under mild conditions. The method is implemented in the R package "perccal", available through CRAN and coded primarily in C++, to make it easier for practitioners to use.
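The general idea of a calibrated percentile double bootstrap can be sketched as follows. This is a minimal illustration only, assuming a pairs bootstrap and a single regression slope; it is not the API of the "perccal" package nor the authors' implementation, and the function names, the grid of tail levels, and the resample counts are all hypothetical choices.

```python
import random


def ols_slope(pairs):
    """Least-squares slope of y on x for a list of (x, y) pairs."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    return sxy / sxx


def resample(pairs, rng):
    """Pairs bootstrap: draw n rows with replacement; redraw if x is constant."""
    while True:
        draw = [rng.choice(pairs) for _ in pairs]
        if len({x for x, _ in draw}) > 1:
            return draw


def quantile(sorted_vals, p):
    """Empirical quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    return sorted_vals[lo] + (pos - lo) * (sorted_vals[hi] - sorted_vals[lo])


def calibrated_percentile_ci(pairs, alpha=0.05, b1=200, b2=100, grid=None, seed=0):
    """Return (lower, upper, chosen_tail) for the slope (illustrative sketch)."""
    rng = random.Random(seed)
    if grid is None:
        # Hypothetical grid of candidate tail probabilities.
        grid = [0.001, 0.005, 0.01, 0.015, 0.02, 0.025, 0.03, 0.04, 0.05]
    theta_hat = ols_slope(pairs)
    first_level = []
    hits = [0] * len(grid)
    for _ in range(b1):
        sample = resample(pairs, rng)
        first_level.append(ols_slope(sample))
        # Second level: resample from the first-level sample.
        inner = sorted(ols_slope(resample(sample, rng)) for _ in range(b2))
        for i, tail in enumerate(grid):
            # Does the inner percentile interval cover the original estimate?
            if quantile(inner, tail) <= theta_hat <= quantile(inner, 1 - tail):
                hits[i] += 1
    # Pick the largest tail whose estimated coverage reaches 1 - alpha;
    # fall back to the smallest tail in the grid if none does.
    chosen = grid[0]
    for tail, h in zip(grid, hits):
        if h / b1 >= 1 - alpha:
            chosen = max(chosen, tail)
    first_level.sort()
    return quantile(first_level, chosen), quantile(first_level, 1 - chosen), chosen
```

The second-level resamples estimate how often a percentile interval at each candidate tail level covers the original estimate, and the final interval applies the calibrated level to the first-level bootstrap distribution. Pure standard-library Python is used here only to keep the sketch self-contained.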
Award ID(s):
1633212 1613112 1309619
NSF-PAR ID:
10128892
Author(s) / Creator(s):
Date Published:
Journal Name:
Statistica Sinica
Volume:
28
ISSN:
1017-0405
Page Range / eLocation ID:
2565-2589
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Compared to existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of sepsis.

     
  2. Abstract Linear regression on network-linked observations has been an essential tool in modelling the relationship between response and covariates with additional network structures. Previous methods either lack inference tools or rely on restrictive assumptions on social effects and usually assume that networks are observed without errors. This paper proposes a regression model with non-parametric network effects. The model does not assume that the relational data or network structure is exactly observed and can be provably robust to network perturbations. An asymptotic inference framework is established under a general requirement of the network observational errors, and the robustness of this method is studied in the specific setting when the errors come from random network models. We discover a phase-transition phenomenon of the inference validity concerning the network density when no prior knowledge of the network model is available while also showing a significant improvement achieved by knowing the network model. Simulation studies are conducted to verify these theoretical results and demonstrate the advantage of the proposed method over existing work in terms of accuracy and computational efficiency under different data-generating models. The method is then applied to middle school students' network data to study the effectiveness of educational workshops in reducing school conflicts.
  3. In this paper, we study several profile estimation methods for the generalized semiparametric varying-coefficient additive model for longitudinal data by utilizing the within-subject correlations. The model is flexible in allowing time-varying effects for some covariates and constant effects for others, and in having the option to choose different link functions which can be used to analyze both discrete and continuous longitudinal responses. We investigate the profile generalized estimating equation (GEE) approaches and the profile quadratic inference function (QIF) approach. The profile estimations are assisted with the local linear smoothing technique to estimate the time-varying effects. Several approaches that incorporate the within-subject correlations are investigated including the quasi-likelihood (QL), the minimum generalized variance (MGV), the quadratic inference function and the weighted least squares (WLS). The proposed estimation procedures can accommodate flexible sampling schemes. These methods provide a unified approach that works well for discrete longitudinal responses as well as for continuous longitudinal responses. Finite sample performances of these methods are examined through Monte Carlo simulations under various correlation structures for both discrete and continuous longitudinal responses. The simulation results show efficiency improvement over the working independence approach by utilizing the within-subject correlations as well as comparative performances of different approaches.
  4. Abstract

    Accurate estimations of animal populations are necessary for management, conservation, and policy decisions. However, methods for surveying animal communities disproportionately represent specific groups or guilds. For example, transect surveys can provide robust data for large arboreal species but underestimate cryptic or small‐bodied terrestrial species, whereas camera traps have the inverse tendency. The integration of information from multiple methodologies would provide the most complete inference on population size or responses to putative covariates, yet a simple, robust framework that allows integration and comparison of multiple data sources has been lacking. We use 27,813 counts of 35 species or species groups derived from concurrent visual transects, dung transects, and camera trap surveys in tropical forests and compare them within a generalized joint attribute modeling framework (GJAM) that both compares and integrates field‐collected dung, visual, and camera trap data to quantify the species‐ and trait‐specific differences in detection for each method. The effectiveness of survey method was strongly dependent on species, as well as animal traits. These differences in effectiveness contributed to meaningful differences in the reported strength of a known important covariate for animal communities (distance to nearest village). Data fusion through GJAM allows clear and unambiguous comparisons of the counts provided from each different methodology, the incorporation of trait information, and fusion of all three data streams to generate a more complete estimate of the effects of an anthropogenic disturbance covariate. Research and conservation resources are extremely limited, which often means that field campaigns attempt to maximize the amount of information gathered especially in remote, hard‐to‐access areas. Advances in these understudied areas will be accelerated by analytical methods that can fully leverage the total body of diverse biodiversity field data, even when they are collected using different methods. We demonstrate that survey methods vary in their effectiveness for counting species based on biological traits, but more importantly that generative models like GJAM can integrate data from multiple sources in one cohesive statistical framework to make improved inference in understudied environments.

     