Title: Robust estimation in the nested case‐control design under a misspecified covariate functional form

The Cox proportional hazards model is typically used to analyze time‐to‐event data. If the event of interest is rare and covariates are difficult or expensive to collect, the nested case‐control (NCC) design provides consistent estimates at reduced costs with minimal impact on precision if the model is specified correctly. If our scientific goal is to conduct inference regarding an association of interest, it is essential that we specify the model a priori to avoid multiple testing bias. We cannot, however, be certain that all assumptions will be satisfied, so it is important to consider robustness of the NCC design under model misspecification. In this manuscript, we show that in finite sample settings where the functional form of a covariate of interest is misspecified, the estimates resulting from the partial likelihood estimator under the NCC design depend on the number of controls sampled at each event time. To account for this dependency, we propose an estimator that recovers the results obtained using the full cohort, where full covariate information is available for all study participants. We demonstrate the utility of our estimator through simulation studies and establish its theoretical properties. We end by applying our estimator to motivating data from the Alzheimer's Disease Neuroimaging Initiative.
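To make the sampling scheme concrete, here is a minimal sketch of nested case‐control control selection. The function name `sample_ncc`, its arguments, and the risk‐set rule shown are our own illustrative choices (not code from the paper): at each observed event time, the case is paired with m controls drawn at random from the subjects still at risk.

```python
import random

def sample_ncc(times, events, m, seed=0):
    """Illustrative NCC sampling: at each observed event time, pair the
    case with m controls drawn without replacement from the risk set
    (subjects whose follow-up time is at least the case's event time)."""
    rng = random.Random(seed)
    sampled = []
    for i, (t, d) in enumerate(zip(times, events)):
        if not d:
            continue  # only events anchor a sampled risk set
        risk_set = [j for j, tj in enumerate(times) if tj >= t and j != i]
        controls = rng.sample(risk_set, min(m, len(risk_set)))
        sampled.append((i, controls))
    return sampled
```

The manuscript's finding is that the partial likelihood estimate computed from such sampled risk sets depends on the choice of `m` when the covariate's functional form is misspecified.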

 
NSF-PAR ID:
10453471
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Statistics in Medicine
Volume:
40
Issue:
2
ISSN:
0277-6715
Page Range / eLocation ID:
p. 299-311
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

Analysis of time‐to‐event data using Cox's proportional hazards (PH) model is ubiquitous in scientific research. A sample is taken from the population of interest and covariate information is collected on everyone. If the event of interest is rare and covariate information is difficult to collect, the nested case‐control (NCC) design reduces costs with minimal impact on inferential precision. Under PH, application of the Cox model to data from an NCC sample provides consistent estimation of the hazard ratio. However, under non‐PH, the finite‐sample estimates corresponding to the Cox estimator depend on the number of controls sampled and the censoring distribution. We propose two estimators based on a binary predictor of interest: one recovers the estimand corresponding to the Cox model under a simple random sample, while the other recovers an estimand that does not depend on the censoring distribution. We derive the asymptotic distribution and provide finite‐sample variance estimators.

     
  2. Abstract

In prevalent cohort studies where subjects are recruited at a cross‐section, the time to an event may be subject to length‐biased sampling, with the observed data being the forward recurrence time, the backward recurrence time, or their sum. In the regression setting, assume a semiparametric accelerated failure time model for the underlying event time, with the intercept parameter absorbed into the nuisance parameter. It has been shown that the model remains invariant under these observed‐data setups and can be fitted using standard methodology for accelerated failure time model estimation, ignoring the length bias. However, the efficiency of these estimators is unclear, because the observed covariate distribution, which is also length biased, may contain information about the regression parameter in the accelerated life model. We demonstrate that if the true covariate distribution is completely unspecified, then the naive estimator based on the conditional likelihood given the covariates is fully efficient for the slope.

     
  3. Abstract

We propose a communication‐efficient algorithm to estimate the average treatment effect (ATE) when the data are distributed across multiple sites and the number of covariates is possibly much larger than the sample size in each site. Our main idea is to calibrate the estimates of the propensity score and outcome models using suitable surrogate loss functions to approximately attain the desired covariate balancing property. We show that under possible model misspecification, our distributed covariate balancing propensity score estimator (disthdCBPS) can approximate the global estimator, obtained by pooling together the data from multiple sites, at a fast rate. Thus, our estimator remains consistent and asymptotically normal. In addition, when both the propensity score and the outcome models are correctly specified, the proposed estimator attains the semiparametric efficiency bound. We illustrate the empirical performance of the proposed method in both simulation and empirical studies.
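For orientation, the quantity being targeted can be written as a plain inverse‐probability‐weighted (IPW) ATE estimator. This is a generic single‐site baseline, not the disthdCBPS procedure itself; the function name `ipw_ate` and its arguments are illustrative.

```python
import numpy as np

def ipw_ate(y, a, e):
    """Generic IPW estimate of the average treatment effect:
    y = outcomes, a = binary treatment indicators,
    e = estimated propensity scores P(A=1 | X)."""
    y, a, e = (np.asarray(v, dtype=float) for v in (y, a, e))
    return float(np.mean(a * y / e - (1 - a) * y / (1 - e)))
```

The distributed method replaces the naive propensity fit with calibrated, covariate‐balancing estimates that can be computed without pooling subject‐level data across sites.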

     
  4. In the absence of data from a randomized trial, researchers may aim to use observational data to draw causal inference about the effect of a treatment on a time-to-event outcome. In this context, interest often focuses on the treatment-specific survival curves, that is, the survival curves that would be observed were the population under study assigned to receive the treatment or not. Under certain conditions, including that all confounders of the treatment-outcome relationship are observed, the treatment-specific survival curve can be identified with a covariate-adjusted survival curve. In this article, we propose a novel cross-fitted doubly-robust estimator that incorporates data-adaptive (e.g. machine learning) estimators of the conditional survival functions. We establish conditions on the nuisance estimators under which our estimator is consistent and asymptotically linear, both pointwise and uniformly in time. We also propose a novel ensemble learner for combining multiple candidate estimators of the conditional survival functions. Notably, our methods and results accommodate events occurring in discrete or continuous time, or an arbitrary mix of the two. We investigate the practical performance of our methods using numerical studies and an application to the effect of a surgical treatment to prevent metastases of parotid carcinoma on mortality.
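The cross-fitting ingredient alone can be sketched in a few lines. This shows only the sample splitting, not the doubly-robust survival estimator; `cross_fit_folds` and its interface are our own illustrative choices. Each observation's nuisance functions are fit using only out-of-fold data, which is what permits flexible machine-learning estimators without biasing the final estimate.

```python
import numpy as np

def cross_fit_folds(n, k=2, seed=0):
    """Minimal cross-fitting skeleton: split n observations into k folds;
    for each fold, return (train_idx, eval_idx) so nuisance models fit on
    train_idx are evaluated only on the held-out eval_idx."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    folds = np.array_split(idx, k)
    return [
        (np.concatenate([f for j, f in enumerate(folds) if j != i]), folds[i])
        for i in range(k)
    ]
```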
  5. Abstract

Multivariate failure time data are frequently analyzed using the marginal proportional hazards models and the frailty models. When the sample size is extraordinarily large, either approach could face computational challenges. In this paper, we focus on the marginal model approach and propose a divide‐and‐combine method to analyze large‐scale multivariate failure time data. Our method is motivated by the Myocardial Infarction Data Acquisition System (MIDAS), a New Jersey statewide database that includes 73,725,160 admissions to nonfederal hospitals and emergency rooms (ERs) from 1995 to 2017. We randomly divide the full data into multiple subsets, fit the marginal model within each subset, and combine the resulting estimators using a weighted method with three choices of weights. Under mild conditions, we show that the combined estimator is asymptotically equivalent to the estimator obtained from the full data as if the data were analyzed all at once. In addition, to screen out risk factors with weak signals, we propose to perform the regularized estimation on the combined estimator using its combined confidence distribution. Theoretical properties, such as consistency, oracle properties, and asymptotic equivalence between the divide‐and‐combine approach and the full data approach, are studied. Performance of the proposed method is investigated using simulation studies. Our method is applied to the MIDAS data to identify risk factors related to multivariate cardiovascular‐related health outcomes.
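The combining step can be illustrated with inverse‐variance weighting, one common divide‐and‐combine rule. This is an illustrative choice and not necessarily one of the paper's three weighting schemes; `combine_estimates` and its arguments are hypothetical names.

```python
import numpy as np

def combine_estimates(betas, variances):
    """Combine per-subset scalar estimates by inverse-variance weighting:
    beta_hat = sum(w_i * beta_i) / sum(w_i) with w_i = 1 / var_i,
    and combined variance 1 / sum(w_i)."""
    betas = np.asarray(betas, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    beta = float(np.sum(w * betas) / np.sum(w))
    var = float(1.0 / np.sum(w))
    return beta, var
```

Under the conditions in the abstract, a combined estimator of this form matches the full‐data estimator asymptotically while each subset fit stays computationally tractable.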

     