
Title: Additive partially linear models for ultra‐high‐dimensional regression

We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.
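
The abstract does not state the model or the first-step criterion explicitly. A minimal sketch of the usual form of an APLM and of a doubly penalized spline criterion is given below. The notation (responses $Y_i$, linear covariates $Z_{ik}$, non-linear covariates $X_{il}$, spline basis $B_{lj}$, tuning parameters $\lambda_1, \lambda_2$, and adaptive weights $w_k, v_l$) is assumed for illustration and may differ from the paper's own.

```latex
% Additive partially linear model (generic form; notation assumed)
Y_i = \mu + \sum_{k=1}^{p_1} \alpha_k Z_{ik} + \sum_{l=1}^{p_2} \phi_l(X_{il}) + \varepsilon_i,
\qquad i = 1, \dots, n .

% Step 1: each non-linear component is approximated by a polynomial spline
\phi_l(x) \approx \sum_{j=1}^{J_n} \gamma_{lj} B_{lj}(x) = B_l(x)^{\top} \gamma_l .

% Doubly penalized least squares with adaptive-lasso-type weights:
% an L1 penalty on the linear coefficients and a group penalty on the spline coefficients
\min_{\mu, \alpha, \gamma}\;
\frac{1}{2} \sum_{i=1}^{n}
\Bigl( Y_i - \mu - \sum_{k=1}^{p_1} \alpha_k Z_{ik} - \sum_{l=1}^{p_2} B_l(X_{il})^{\top} \gamma_l \Bigr)^{2}
+ n \lambda_1 \sum_{k=1}^{p_1} w_k \lvert \alpha_k \rvert
+ n \lambda_2 \sum_{l=1}^{p_2} v_l \lVert \gamma_l \rVert_2 .
```

In a criterion of this kind, the weights $w_k$ and $v_l$ would typically be computed from an initial estimator, so that components with small initial estimates are penalized more heavily. The second step then refits the selected non-linear components by local linear smoothing to obtain their asymptotic distributions.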

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    The paper studies estimation of partially linear hazard regression models with varying coefficients for multivariate survival data. A profile pseudo-partial-likelihood estimation method is proposed. The parameters of the linear part are estimated by maximizing the profile pseudo-partial-likelihood, whereas the varying-coefficient functions are treated as nuisance parameters and profiled out of the likelihood. It is shown that the estimators of the parameters are root-n consistent and the estimators of the non-parametric coefficient functions achieve optimal convergence rates. Asymptotic normality is obtained for the estimators of the finite parameters and varying-coefficient functions. Consistent estimators of the asymptotic variances are derived and empirically tested, which facilitate inference for the model. We prove that the varying-coefficient functions can be estimated as well as if the parametric components were known and the failure times within each subject were independent. Simulations are conducted to demonstrate the performance of the estimators proposed. A real data set is analysed to illustrate the methodology proposed.

  2. Summary

    We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^3)$ computational time when the number $d$ of predictors is two or more. Such a sizeable computational cost hinders the broad applicability of smoothing splines. In practice, the full-sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^2)$. It is known that these two estimators converge at the same rate when $q$ is of order $O\{n^{2/(pr+1)}\}$, where $p\in [1,2]$ depends on the true function and $r > 1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to approximately equally spaced observations, the proposed method chooses a set of basis functions with great diversity. The asymptotic analysis shows that the proposed smoothing spline estimator can decrease $q$ to around $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.
  3. Abstract

    Statistical analysis of longitudinal data often involves modeling treatment effects on clinically relevant longitudinal biomarkers since an initial event (the time origin). In some studies including preventive HIV vaccine efficacy trials, some participants have biomarkers measured starting at the time origin, whereas others have biomarkers measured starting later with the time origin unknown. The semiparametric additive time-varying coefficient model is investigated where the effects of some covariates vary nonparametrically with time while the effects of others remain constant. Weighted profile least squares estimators coupled with kernel smoothing are developed. The method uses the expectation maximization approach to deal with the censored time origin. The Kaplan–Meier estimator and other failure time regression models such as the Cox model can be utilized to estimate the distribution and the conditional distribution of left censored event time related to the censored time origin. Asymptotic properties of the parametric and nonparametric estimators and consistent asymptotic variance estimators are derived. A two-stage estimation procedure for choosing weight is proposed to improve estimation efficiency. Numerical simulations are conducted to examine finite sample properties of the proposed estimators. The simulation results show that the theory and methods work well. The efficiency gain of the two-stage estimation procedure depends on the distribution of the longitudinal error processes. The method is applied to analyze data from the Merck 023/HVTN 502 Step HIV vaccine study.

  4. Biomedical researchers are often interested in estimating the effect of an environmental exposure in relation to a chronic disease endpoint. However, the exposure variable of interest may be measured with error. In a subset of the whole cohort, a surrogate variable is available for the true unobserved exposure variable. The surrogate variable satisfies an additive measurement error model, but it may not have repeated measurements. The subset in which the surrogate variables are available is called a calibration sample. In addition to the surrogate variables that are available among the subjects in the calibration sample, we consider the situation when there is an instrumental variable available for all study subjects. An instrumental variable is correlated with the unobserved true exposure variable, and hence can be useful in the estimation of the regression coefficients. In this paper, we propose a nonparametric method for Cox regression using the observed data from the whole cohort. The nonparametric estimator is the best linear combination of a nonparametric correction estimator from the calibration sample and the difference of the naive estimators from the calibration sample and the whole cohort. The asymptotic distribution is derived, and the finite sample performance of the proposed estimator is examined via intensive simulation studies. The methods are applied to the Nutritional Biomarkers Study of the Women's Health Initiative.

  5. Abstract

    In this article, we introduce a functional structural equation model for estimating directional relations from multivariate functional data. We decouple the estimation into two major steps: directional order determination and selection through sparse functional regression. We first propose a score function at the linear operator level, and show that its minimization can recover the true directional order when the relation between each function and its parental functions is nonlinear. We then develop a sparse functional additive regression, where both the response and the multivariate predictors are functions and the regression relation is additive and nonlinear. We also propose strategies to speed up the computation and scale up our method. In theory, we establish the consistencies of order determination, sparse functional additive regression, and directed acyclic graph estimation, while allowing both the dimension of the Karhunen–Loève expansion coefficients and the number of random functions to diverge with the sample size. We illustrate the efficacy of our method through simulations, and an application to brain effective connectivity analysis.
