Title: Additive partially linear models for ultra‐high‐dimensional regression

We consider a semiparametric additive partially linear regression model (APLM) for analysing ultra‐high‐dimensional data where both the number of linear components and the number of non‐linear components can be much larger than the sample size. We propose a two‐step approach for estimation, selection, and simultaneous inference of the components in the APLM. In the first step, the non‐linear additive components are approximated using polynomial spline basis functions, and a doubly penalized procedure is proposed to select nonzero linear and non‐linear components based on adaptive lasso. In the second step, local linear smoothing is then applied to the data with the selected variables to obtain the asymptotic distribution of the estimators of the nonparametric functions of interest. The proposed method selects the correct model with probability approaching one under regularity conditions. The estimators of both the linear part and the non‐linear part are consistent and asymptotically normal, which enables us to construct confidence intervals and make inferences about the regression coefficients and the component functions. The performance of the method is evaluated by simulation studies. The proposed method is also applied to a dataset on the shoot apical meristem of maize genotypes.
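For orientation, the model and the first (selection) step can be written out generically as below. This is a sketch under assumed notation, not the paper's exact formulation: the symbols $Z_{ik}$, $X_{il}$, $\alpha_k$, $\phi_l$, $B_{lj}$, $\gamma_{lj}$, the weights $w_k$, $v_l$, and the tuning parameters $\lambda_1$, $\lambda_2$ are all introduced here for illustration, and the authors' penalty may be structured differently.

```latex
% Additive partially linear model: p_1 linear and p_2 non-linear components
Y_i = \mu + \sum_{k=1}^{p_1} \alpha_k Z_{ik} + \sum_{l=1}^{p_2} \phi_l(X_{il}) + \varepsilon_i,
\qquad i = 1, \dots, n.

% Step 1a: each non-linear component is approximated in a polynomial spline basis
\phi_l(x) \approx \sum_{j=1}^{J_n} \gamma_{lj} B_{lj}(x) = B_l(x)^{\top} \gamma_l .

% Step 1b: doubly penalized least squares with adaptive-lasso-type weights
\min_{\mu,\, \alpha,\, \gamma}\;
\frac{1}{2} \sum_{i=1}^{n} \Bigl( Y_i - \mu - \sum_{k=1}^{p_1} \alpha_k Z_{ik}
  - \sum_{l=1}^{p_2} B_l(X_{il})^{\top} \gamma_l \Bigr)^{2}
+ n \lambda_1 \sum_{k=1}^{p_1} w_k \lvert \alpha_k \rvert
+ n \lambda_2 \sum_{l=1}^{p_2} v_l \lVert \gamma_l \rVert_2 .
```

In this reading, the first penalty shrinks individual linear coefficients to zero and the second shrinks whole coefficient groups $\gamma_l$, so that an entire non-linear component is either kept or dropped. The second step described in the abstract then refits the selected components by local linear smoothing.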

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    The paper studies estimation of partially linear hazard regression models with varying coefficients for multivariate survival data. A profile pseudo-partial-likelihood estimation method is proposed. The estimation of the parameters of the linear part is accomplished via maximization of the profile pseudo-partial-likelihood, whereas the varying-coefficient functions are considered as nuisance parameters that are profiled out of the likelihood. It is shown that the estimators of the parameters are root-n consistent and the estimators of the non-parametric coefficient functions achieve optimal convergence rates. Asymptotic normality is obtained for the estimators of the finite parameters and varying-coefficient functions. Consistent estimators of the asymptotic variances are derived and empirically tested, which facilitate inference for the model. We prove that the varying-coefficient functions can be estimated as well as if the parametric components were known and the failure times within each subject were independent. Simulations are conducted to demonstrate the performance of the estimators proposed. A real data set is analysed to illustrate the methodology proposed.

  2. Independent component analysis (ICA) decomposes multivariate data into mutually independent components (ICs). The ICA model is subject to a constraint that at most one of these components is Gaussian, which is required for model identifiability. Linear non‐Gaussian component analysis (LNGCA) generalizes the ICA model to a linear latent factor model with any number of both non‐Gaussian components (signals) and Gaussian components (noise), where observations are linear combinations of independent components. Although the individual Gaussian components are not identifiable, the Gaussian subspace is identifiable. We introduce an estimator along with its optimization approach in which non‐Gaussian and Gaussian components are estimated simultaneously, maximizing the discrepancy of each non‐Gaussian component from Gaussianity while minimizing the discrepancy of each Gaussian component from Gaussianity. When the number of non‐Gaussian components is unknown, we develop a statistical test to determine it based on resampling and the discrepancy of estimated components. Through a variety of simulation studies, we demonstrate the improvements of our estimator over competing estimators, and we illustrate the effectiveness of our test to determine the number of non‐Gaussian components. Further, we apply our method to real data examples and show its practical value.

  3. Abstract

    In this article, we introduce a functional structural equation model for estimating directional relations from multivariate functional data. We decouple the estimation into two major steps: directional order determination and selection through sparse functional regression. We first propose a score function at the linear operator level, and show that its minimization can recover the true directional order when the relation between each function and its parental functions is nonlinear. We then develop a sparse functional additive regression, where both the response and the multivariate predictors are functions and the regression relation is additive and nonlinear. We also propose strategies to speed up the computation and scale up our method. In theory, we establish the consistencies of order determination, sparse functional additive regression, and directed acyclic graph estimation, while allowing both the dimension of the Karhunen–Loève expansion coefficients and the number of random functions to diverge with the sample size. We illustrate the efficacy of our method through simulations, and an application to brain effective connectivity analysis.

  4. In this work, we propose a deep neural networks‐based method to perform non‐parametric regression for functional data. The proposed estimators are based on sparsely connected deep neural networks with rectified linear unit (ReLU) activation function. We provide the convergence rate of the proposed deep neural networks estimator in terms of the empirical norm. Through Monte Carlo simulation studies, we examine the finite sample performance of the proposed method. Finally, the proposed method is applied to analyse positron emission tomography images of patients with Alzheimer's disease obtained from the Alzheimer's Disease Neuroimaging Initiative database.

  5. Summary

    We consider the problem of approximating smoothing spline estimators in a nonparametric regression model. When applied to a sample of size $n$, the smoothing spline estimator can be expressed as a linear combination of $n$ basis functions, requiring $O(n^3)$ computational time when the number $d$ of predictors is two or more. Such a sizeable computational cost hinders the broad applicability of smoothing splines. In practice, the full-sample smoothing spline estimator can be approximated by an estimator based on $q$ randomly selected basis functions, resulting in a computational cost of $O(nq^2)$. It is known that these two estimators converge at the same rate when $q$ is of order $O\{n^{2/(pr+1)}\}$, where $p\in [1,2]$ depends on the true function and $r > 1$ depends on the type of spline. Such a $q$ is called the essential number of basis functions. In this article, we develop a more efficient basis selection method. By selecting basis functions corresponding to approximately equally spaced observations, the proposed method chooses a set of basis functions with great diversity. The asymptotic analysis shows that the proposed smoothing spline estimator can decrease $q$ to around $O\{n^{1/(pr+1)}\}$ when $d\leq pr+1$. Applications to synthetic and real-world datasets show that the proposed method leads to a smaller prediction error than other basis selection methods.
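The basis selection idea in the last abstract above (choosing basis functions at approximately equally spaced observations rather than at random) can be illustrated in one dimension. The sketch below is only an assumption about how such a selection might look: the function name `select_equally_spaced_basis`, the nearest-observation-to-a-uniform-grid rule, and the simulated data are all hypothetical and are not taken from the cited article, which treats the multivariate case.

```python
import numpy as np


def select_equally_spaced_basis(x, q):
    """Return indices of at most q observations closest to an equally spaced grid.

    Hypothetical 1-D illustration: lay q grid points evenly over the range of x
    and keep, for each grid point, the nearest observed value.
    """
    x = np.asarray(x, dtype=float)
    grid = np.linspace(x.min(), x.max(), q)
    # For each grid point (column), find the index of the closest observation.
    nearest = np.abs(x[:, None] - grid[None, :]).argmin(axis=0)
    # Sparse regions can map several grid points to one observation; drop repeats.
    return np.unique(nearest)


# Skewed sample: uniform random selection would over-represent the dense region.
rng = np.random.default_rng(0)
x = rng.beta(2.0, 5.0, size=500)
idx = select_equally_spaced_basis(x, q=10)
selected = np.sort(x[idx])
```

Compared with drawing `q` indices uniformly at random, this rule spreads the selected observations across the whole range of `x`, which is one simple way to obtain the "great diversity" of basis functions the abstract mentions.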