skip to main content

Title: Local Polynomial Regression and Simulation–Extrapolation

The paper introduces a new local polynomial estimator and develops supporting asymptotic theory for nonparametric regression in the presence of covariate measurement error. We address the measurement error with Cook and Stefanski's simulation–extrapolation (SIMEX) algorithm. Our method improves on previous local polynomial estimators for this problem by using a bandwidth selection procedure that addresses SIMEX's particular estimation method and considers higher degree local polynomial estimators. We illustrate the accuracy of our asymptotic expressions with a Monte Carlo study, compare our method with other estimators with a second set of Monte Carlo simulations and apply our method to a data set from nutritional epidemiology. SIMEX was originally developed for parametric models. Although SIMEX is, in principle, applicable to nonparametric models, a serious problem arises with SIMEX in nonparametric situations. The problem is that smoothing parameter selectors that are developed for data without measurement error are no longer appropriate and can result in considerable undersmoothing. We believe that this is the first paper to address this difficulty.

more » « less
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Page Range / eLocation ID:
p. 17-30
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    In electronic health records (EHRs) data analysis, nonparametric regression and classification using International Classification of Disease (ICD) codes as covariates remain understudied. Automated methods have been developed over the years for predicting biomedical responses using EHRs, but relatively less attention has been paid to developing patient similarity measures that use ICD codes and chronic conditions, where a chronic condition is defined as a set of ICD codes. We address this problem by first developing a string kernel function for measuring the similarity between a pair of primary chronic conditions, represented as subsets of ICD codes. Second, we extend this similarity measure to a family of covariance functions on subsets of chronic conditions. This family is used in developing Gaussian process (GP) priors for Bayesian nonparametric regression and classification using diagnoses and other demographic information as covariates. Markov chain Monte Carlo (MCMC) algorithms are used for posterior inference and predictions. The proposed methods are tuning free, so they are ideal for automated prediction of biomedical responses depending on chronic conditions. We evaluate the practical performance of our method on EHR data collected from 1660 patients at the University of Iowa Hospitals and Clinics (UIHC) with six different primary cancer sites. Our method provides better sensitivity and specificity than its competitors in classifying different primary cancer sites and estimates the marginal associations between chronic conditions and primary cancer sites.

    more » « less
  2. Abstract

    Statistical analysis of longitudinal data often involves modeling treatment effects on clinically relevant longitudinal biomarkers since an initial event (the time origin). In some studies including preventive HIV vaccine efficacy trials, some participants have biomarkers measured starting at the time origin, whereas others have biomarkers measured starting later with the time origin unknown. The semiparametric additive time-varying coefficient model is investigated where the effects of some covariates vary nonparametrically with time while the effects of others remain constant. Weighted profile least squares estimators coupled with kernel smoothing are developed. The method uses the expectation maximization approach to deal with the censored time origin. The Kaplan–Meier estimator and other failure time regression models such as the Cox model can be utilized to estimate the distribution and the conditional distribution of left censored event time related to the censored time origin. Asymptotic properties of the parametric and nonparametric estimators and consistent asymptotic variance estimators are derived. A two-stage estimation procedure for choosing weight is proposed to improve estimation efficiency. Numerical simulations are conducted to examine finite sample properties of the proposed estimators. The simulation results show that the theory and methods work well. The efficiency gain of the two-stage estimation procedure depends on the distribution of the longitudinal error processes. The method is applied to analyze data from the Merck 023/HVTN 502 Step HIV vaccine study.

    more » « less
  3. Abstract

    In causal inference problems, one is often tasked with estimating causal effects which are analytically intractable functionals of the data‐generating mechanism. Relevant settings include estimating intention‐to‐treat effects in longitudinal problems with missing data or computing direct and indirect effects in mediation analysis. One approach to computing these effects is to use theg‐formula implemented via Monte Carlo integration; when simulation‐based methods such as the nonparametric bootstrap or Markov chain Monte Carlo are used for inference, Monte Carlo integration must be nested within an already computationally intensive algorithm. We develop a widely‐applicable approach to accelerating this Monte Carlo integration step which greatly reduces the computational burden of existingg‐computation algorithms. We refer to our method as acceleratedg‐computation (AGC). The algorithms we present are similar in spirit to multiple imputation, but require removing within‐imputation variance from the standard error rather than adding it. We illustrate the use of AGC on a mediation analysis problem using a beta regression model and in a longitudinal clinical trial subject to nonignorable missingness using a Bayesian additive regression trees model.

    more » « less
  4. Biomedical researchers are often interested in estimating the effect of an environmental exposure in relation to a chronic disease endpoint. However, the exposure variable of interest may be measured with errors. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies an additive measurement error model, but it may not have repeated measurements. The subset in which the surrogate variables are available is called acalibration sample. In addition to the surrogate variables that are available among the subjects in the calibration sample, we consider the situation when there is an instrumental variable available for all study subjects. An instrumental variable is correlated with the unobserved true exposure variable, and hence can be useful in the estimation of the regression coefficients. In this paper, we propose a nonparametric method for Cox regression using the observed data from the whole cohort. The nonparametric estimator is the best linear combination of a nonparametric correction estimator from the calibration sample and the difference of the naive estimators from the calibration sample and the whole cohort. The asymptotic distribution is derived, and the finite sample performance of the proposed estimator is examined via intensive simulation studies. The methods are applied to the Nutritional Biomarkers Study of the Women's Health Initiative.

    more » « less
  5. Recent research has developed several Monte Carlo methods for estimating the normalization constant (partition function) based on the idea of annealing. This means sampling successively from a path of distributions that interpolate between a tractable "proposal" distribution and the unnormalized "target" distribution. Prominent estimators in this family include annealed importance sampling and annealed noise-contrastive estimation (NCE). Such methods hinge on a number of design choices: which estimator to use, which path of distributions to use and whether to use a path at all; so far, there is no definitive theory on which choices are efficient. Here, we evaluate each design choice by the asymptotic estimation error it produces. First, we show that using NCE is more efficient than the importance sampling estimator, but in the limit of infinitesimal path steps, the difference vanishes. Second, we find that using the geometric path brings down the estimation error from an exponential to a polynomial function of the parameter distance between the target and proposal distributions. Third, we find that the arithmetic path, while rarely used, can offer optimality properties over the universally-used geometric path. In fact, in a particular limit, the optimal path is arithmetic. Based on this theory, we finally propose a two-step estimator to approximate the optimal path in an efficient way. 
    more » « less