skip to main content


Title: Sparse Partial Least Squares Regression for Simultaneous Dimension Reduction and Variable Selection
Summary

Partial least squares regression has been an alternative to ordinary least squares for handling multicollinearity in several areas of scientific research since the 1960s. It has recently gained much attention in the analysis of high dimensional genomic data. We show that known asymptotic consistency of the partial least squares estimator for a univariate response does not hold with the very large p and small n paradigm. We derive a similar result for a multivariate response regression with partial least squares. We then propose a sparse partial least squares formulation which aims simultaneously to achieve good predictive performance and variable selection by producing sparse linear combinations of the original predictors. We provide an efficient implementation of sparse partial least squares regression and compare it with well-known variable selection and dimension reduction approaches via simulation experiments. We illustrate the practical utility of sparse partial least squares regression in a joint analysis of gene expression and genomewide binding data.

 
more » « less
PAR ID:
10401379
Author(s) / Creator(s):
;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
72
Issue:
1
ISSN:
1369-7412
Format(s):
Medium: X Size: p. 3-25
Size(s):
p. 3-25
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    The envelope model was first introduced as a parsimonious version of multivariate linear regression. It uses dimension reduction techniques to remove immaterial variation in the data and has the potential to gain efficiency in estimation and improve prediction. Many advances have taken place since its introduction, and the envelope model has been applied to many contexts in multivariate analysis, including partial least squares, generalised linear models, Bayesian analysis, variable selection and quantile regression, among others. This article serves as a review of the envelope model and its developments for those who are new to the area.

     
    more » « less
  2. Motivated by mobile devices that record data at a high frequency, we propose a new methodological framework for analyzing a semi-parametric regression model that allow us to study a nonlinear relationship between a scalar response and multiple functional predictors in the presence of scalar covariates. Utilizing functional principal component analysis (FPCA) and the least-squares kernel machine method (LSKM), we are able to substantially extend the framework of semi-parametric regression models of scalar responses on scalar predictors by allowing multiple functional predictors to enter the nonlinear model. Regularization is established for feature selection in the setting of reproducing kernel Hilbert spaces. Our method performs simultaneously model fitting and variable selection on functional features. For the implementation, we propose an effective algorithm to solve related optimization problems in that iterations take place between both linear mixed-effects models and a variable selection method (e.g., sparse group lasso). We show algorithmic convergence results and theoretical guarantees for the proposed methodology. We illustrate its performance through simulation experiments and an analysis of accelerometer data. 
    more » « less
  3. Summary

    Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. This problem becomes even more difficult due to complications in modeling unknown interaction terms among high-dimensional variables. There is currently no variable selection method to overcome these limitations. Hence, in this article we propose a variable selection approach that is developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover the sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches including linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of a nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as the function of the sparse scale parameters of kernel machine to recover sparsity of input variables whose relevances to the response are measured by the scale parameters. We also provide the asymptotic properties of our approach. We show that sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed. A resampling procedure for our variable selection methodology is also proposed to improve the power.

     
    more » « less
  4. Inference for high-dimensional models is challenging as regular asymptotic the- ories are not applicable. This paper proposes a new framework of simultaneous estimation and inference for high-dimensional linear models. By smoothing over par- tial regression estimates based on a given variable selection scheme, we reduce the problem to a low-dimensional least squares estimation. The procedure, termed as Selection-assisted Partial Regression and Smoothing (SPARES), utilizes data split- ting along with variable selection and partial regression. We show that the SPARES estimator is asymptotically unbiased and normal, and derive its variance via a non- parametric delta method. The utility of the procedure is evaluated under various simulation scenarios and via comparisons with the de-biased LASSO estimators, a major competitor. We apply the method to analyze two genomic datasets and obtain biologically meaningful results. 
    more » « less
  5. Summary We consider scenarios in which the likelihood function for a semiparametric regression model factors into separate components, with an efficient estimator of the regression parameter available for each component. An optimal weighted combination of the component estimators, named an ensemble estimator, may be employed as an overall estimate of the regression parameter, and may be fully efficient under uncorrelatedness conditions. This approach is useful when the full likelihood function may be difficult to maximize, but the components are easy to maximize. It covers settings where the nuisance parameter may be estimated at different rates in the component likelihoods. As a motivating example we consider proportional hazards regression with prospective doubly censored data, in which the likelihood factors into a current status data likelihood and a left-truncated right-censored data likelihood. Variable selection is important in such regression modelling, but the applicability of existing techniques is unclear in the ensemble approach. We propose ensemble variable selection using the least squares approximation technique on the unpenalized ensemble estimator, followed by ensemble re-estimation under the selected model. The resulting estimator has the oracle property such that the set of nonzero parameters is successfully recovered and the semiparametric efficiency bound is achieved for this parameter set. Simulations show that the proposed method performs well relative to alternative approaches. Analysis of an AIDS cohort study illustrates the practical utility of the method. 
    more » « less