Primary analysis of case–control studies focuses on the relationship between disease D and a set of covariates of interest (Y, X). A secondary application of the case–control study, which is often invoked in modern genetic epidemiologic association studies, is to investigate the interrelationship between the covariates themselves. The task is complicated owing to the case–control sampling, where the regression of Y on X is different from what it is in the population. Previous work has assumed a parametric distribution for Y given X and derived semiparametric efficient estimation and inference without any distributional assumptions about X. We take up the issue of estimation of a regression function when Y given X follows a homoscedastic regression model, but otherwise the distribution of Y is unspecified. The semiparametric efficient approaches can be used to construct semiparametric efficient estimates, but they suffer from a lack of robustness to the assumed model for Y given X. We take an entirely different approach. We show how to estimate the regression parameters consistently even if the assumed model for Y given X is incorrect, and thus the estimates are model robust. For this we make the assumption that the disease rate is known or well estimated. The assumption can be dropped when the disease is rare, which is typically so for most case–control studies, and the estimation algorithm simplifies. Simulations and empirical examples are used to illustrate the approach.
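The sampling bias described in this abstract, and the role of a known disease rate, can be illustrated with a small simulation. The sketch below is not the estimator proposed in either paper: it uses standard inverse probability weighting as a stand-in, and the data-generating model (true intercept 0, slope 1, logistic disease model with coefficients -5 and 1), the population size, and the function names `weighted_ols` and `simulate` are all illustrative assumptions.

```python
import math
import random


def weighted_ols(points, weights):
    """Weighted least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    sw = sum(weights)
    mx = sum(w * x for w, (x, _) in zip(weights, points)) / sw
    my = sum(w * y for w, (_, y) in zip(weights, points)) / sw
    sxy = sum(w * (x - mx) * (y - my) for w, (x, y) in zip(weights, points))
    sxx = sum(w * (x - mx) ** 2 for w, (x, _) in zip(weights, points))
    slope = sxy / sxx
    return my - slope * mx, slope


def simulate(n_pop=200_000, seed=1):
    """Return (naive_fit, ipw_fit) from one simulated case-control sample."""
    rng = random.Random(seed)
    cases, controls = [], []
    for _ in range(n_pop):
        x = rng.gauss(0.0, 1.0)
        y = x + rng.gauss(0.0, 1.0)          # assumed truth: intercept 0, slope 1
        p = 1.0 / (1.0 + math.exp(5.0 - y))  # assumed P(D = 1 | Y), a fairly rare disease
        (cases if rng.random() < p else controls).append((x, y))

    pi = len(cases) / n_pop                  # disease rate, treated as known
    sampled_controls = rng.sample(controls, len(cases))  # one control per case
    sample = cases + sampled_controls

    # Naive analysis: ignore the sampling design entirely.
    naive = weighted_ols(sample, [1.0] * len(sample))

    # IPW: reweight each group to its population share
    # (cases and controls each make up half of the sample).
    weights = [pi / 0.5] * len(cases) + [(1.0 - pi) / 0.5] * len(sampled_controls)
    ipw = weighted_ols(sample, weights)
    return naive, ipw


naive, ipw = simulate()
print("naive (intercept, slope):", naive)
print("IPW   (intercept, slope):", ipw)
```

Under these assumptions the naive fit overstates both the intercept and the slope, because cases are oversampled and have systematically larger Y, while the weighted fit stays close to the population values. The weighting step is only possible because the disease rate `pi` is available, which is the assumption the abstract describes.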
We study the regression relationship between covariates in case–control data: an area known as the secondary analysis of case–control studies. The context is such that only the form of the regression mean is specified, so that we allow an arbitrary regression error distribution, which can depend on the covariates and thus can be heteroscedastic. Under mild regularity conditions we establish the theoretical identifiability of such models. Previous work in this context has specified a fully parametric distribution for the regression errors, specified a homoscedastic distribution for the regression errors, specified the rate of disease in the population (we refer to this as the true population) or made a rare disease approximation. We construct a class of semiparametric estimation procedures that rely on none of these. The estimators differ from the usual semiparametric estimators in that they draw conclusions about the true population, while technically operating in a hypothetical superpopulation. We also construct estimators with a unique feature, in that they are robust against the misspecification of the regression error distribution in terms of variance structure, whereas all other non-parametric effects are estimated despite the biased samples. We establish the asymptotic properties of the estimators and illustrate their finite sample performance through simulation studies, as well as through an empirical example on the relationship between red meat consumption and heterocyclic amines. Our analysis verified the positive relationship between red meat consumption and two forms of heterocyclic amines, indicating that increased red meat consumption leads to increased levels of MeIQx and PhIP, both being risk factors for colorectal cancer. Computer software as well as data to illustrate the methodology are available from http://www.stat.tamu.edu/~carroll/matlab__programs/software.php .
- PAR ID:
- 10397455
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Journal of the Royal Statistical Society Series B: Statistical Methodology
- Volume:
- 78
- Issue:
- 1
- ISSN:
- 1369-7412
- Page Range / eLocation ID:
- p. 127-151
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Summary Analysing secondary outcomes is a common practice for case–control studies. Traditional secondary analysis employs either completely parametric models or conditional mean regression models to link the secondary outcome to covariates. In many situations, quantile regression models complement mean-based analyses and provide alternative new insights on the associations of interest. For example, biomedical outcomes are often highly asymmetric, and median regression is more useful in describing the ‘central’ behaviour than mean regressions. There are also cases where the research interest is to study the high or low quantiles of a population, as they are more likely to be at risk. We approach the secondary quantile regression problem from a semiparametric perspective, allowing the covariate distribution to be completely unspecified. We derive a class of consistent semiparametric estimators and identify the efficient member. The asymptotic properties of the resulting estimators are established. Simulation results and a real data analysis are provided to demonstrate the superior performance of our approach in comparison with the only existing approach so far in the literature.
-
Abstract With advances in biomedical research, biomarkers are becoming increasingly important prognostic factors for predicting overall survival, while the measurement of biomarkers is often censored due to instruments' lower limits of detection. This leads to two types of censoring: random censoring in overall survival outcomes and fixed censoring in biomarker covariates, posing new challenges in statistical modeling and inference. Existing methods for analyzing such data focus primarily on linear regression ignoring censored responses or semiparametric accelerated failure time models with covariates under detection limits (DL). In this paper, we propose a quantile regression for survival data with covariates subject to DL. Compared with existing methods, the proposed approach provides a more versatile tool for modeling the distribution of survival outcomes by allowing covariate effects to vary across conditional quantiles of the survival time and requiring no parametric distribution assumptions for outcome data. To estimate the quantile process of regression coefficients, we develop a novel multiple imputation approach based on another quantile regression for covariates under DL, avoiding stringent parametric restrictions on censored covariates as often assumed in the literature. Under regularity conditions, we show that the estimation procedure yields uniformly consistent and asymptotically normal estimators. Simulation results demonstrate the satisfactory finite‐sample performance of the method. We also apply our method to the motivating data from a study of genetic and inflammatory markers of sepsis.
-
Abstract This article presents generalized semiparametric regression models for conditional cumulative incidence functions with competing risks data when covariates are missing by sampling design or happenstance. A doubly robust augmented inverse probability weighted (AIPW) complete‐case approach to estimation and inference is investigated. This approach modifies IPW complete‐case estimating equations by exploiting the key features in the relationship between the missing covariates and the phase‐one data to improve efficiency. An iterative numerical procedure is derived to solve the nonlinear estimating equations. The asymptotic properties of the proposed estimators are established. A simulation study examining the finite‐sample performances of the proposed estimators shows that the AIPW estimators are more efficient than the IPW estimators. The developed method is applied to the RV144 HIV‐1 vaccine efficacy trial to investigate vaccine‐induced IgG binding antibodies to HIV‐1 as correlates of acquisition of HIV‐1 infection while taking account of whether the HIV‐1 sequences are near or far from the HIV‐1 sequences represented in the vaccine construct.
-
Abstract Disease registries, surveillance data, and other datasets with extremely large sample sizes are becoming increasingly available, providing population‐based information on disease incidence, survival probability, or other important public health characteristics. Such information can be leveraged in studies that collect detailed measurements but with smaller sample sizes. In contrast to recent proposals that formulate additional information as constraints in optimization problems, we develop a general framework to construct simple estimators that update the usual regression estimators with some functionals of data that incorporate the additional information. We consider general settings that incorporate nuisance parameters in the auxiliary information, non‐i.i.d. data such as those from case‐control studies, and semiparametric models with infinite‐dimensional parameters common in survival analysis. Details of several important data and sampling settings are provided with numerical examples.