

Title: Correlation Pursuit: Forward Stepwise Variable Selection for Index Models
Summary

A stepwise procedure, correlation pursuit (COP), is developed for variable selection under the sufficient dimension reduction framework, in which the response variable Y is influenced by the predictors X1, X2, …, Xp through an unknown function of a few linear combinations of them. Unlike linear stepwise regression, COP does not impose a special form of relationship (such as linear) between the response variable and the predictor variables. The COP procedure selects variables that attain the maximum correlation between the transformed response and the linear combination of the variables. Various asymptotic properties of the COP procedure are established and, in particular, its variable selection performance when both the number of predictors and the sample size diverge is investigated. The excellent empirical performance of the COP procedure in comparison with existing methods is demonstrated by both extensive simulation studies and a real example in functional genomics.
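To make the selection mechanism concrete, below is a minimal illustrative sketch of a COP-style forward search. It is not the authors' implementation (COP works with several sliced-inverse-regression directions and an F-test-type criterion for adding and deleting predictors); the single-direction simplification, the function names, and the stopping tolerance here are all ours. Each step adds the predictor that most increases the leading SIR eigenvalue, i.e., the squared profile correlation between the (slice-)transformed response and a linear combination of the selected predictors.

```python
# Minimal COP-style forward search (illustrative sketch only).
import numpy as np

def sir_top_eigenvalue(X, y, n_slices=10):
    """Leading SIR eigenvalue: squared profile correlation between the
    slice-transformed response and a linear combination of columns of X."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    Sigma = Xc.T @ Xc / n + 1e-8 * np.eye(p)         # ridge for stability
    slices = np.array_split(np.argsort(y), n_slices)
    M = np.zeros((p, p))
    for idx in slices:                               # between-slice covariance
        m = Xc[idx].mean(axis=0)
        M += (len(idx) / n) * np.outer(m, m)
    vals = np.linalg.eigvals(np.linalg.solve(Sigma, M))
    return float(np.max(vals.real))

def cop_forward(X, y, max_vars=5, tol=1e-3):
    """Greedily add the predictor that most increases the SIR eigenvalue."""
    selected, score = [], 0.0
    candidates = list(range(X.shape[1]))
    while candidates and len(selected) < max_vars:
        gain, j = max((sir_top_eigenvalue(X[:, selected + [k]], y) - score, k)
                      for k in candidates)
        if gain < tol:                               # no worthwhile addition
            break
        selected.append(j)
        candidates.remove(j)
        score += gain
    return selected
```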

 
NSF-PAR ID: 10401214
Publisher / Repository: Oxford University Press
Journal Name: Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume: 74
Issue: 5
ISSN: 1369-7412
Page Range / eLocation ID: p. 849-870
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Summary

    Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression, where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noise is actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the noise level. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows the mean regression function in advance. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared, and their performance can be improved by the proposed refitted cross-validation method.
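    The data-splitting idea is simple enough to state in a few lines. The following is a minimal sketch of refitted cross-validation under stated assumptions (not the paper's exact recipe): screening is done here with a cross-validated lasso, the selected set is assumed much smaller than half the sample size, and sigma-squared comes from an ordinary least-squares refit on the half of the data that was not used for selection, with the two halves' roles swapped and the results averaged.

```python
# Minimal refitted cross-validation (RCV) variance-estimation sketch.
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression

def rcv_sigma2(X, y, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    halves = (idx[: len(y) // 2], idx[len(y) // 2 :])
    estimates = []
    for a, b in (halves, halves[::-1]):              # swap roles and average
        support = np.flatnonzero(LassoCV(cv=5).fit(X[a], y[a]).coef_)
        if support.size == 0:                        # nothing selected
            resid = y[b] - y[b].mean()
            dof = len(b) - 1
        else:                                        # refit OLS on fresh half
            fit = LinearRegression().fit(X[b][:, support], y[b])
            resid = y[b] - fit.predict(X[b][:, support])
            dof = max(len(b) - support.size - 1, 1)
        estimates.append(resid @ resid / dof)
    return float(np.mean(estimates))
```

    Refitting on data that played no part in selection is what removes the spurious-correlation bias: the irrelevant variables that happened to track the noise in the first half carry no such advantage in the second.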

     
  2. Predictive models play a central role in decision making. Penalized regression approaches, such as the least absolute shrinkage and selection operator (LASSO), have been widely used to construct predictive models and explain the impacts of the selected predictors, but the estimates are typically biased. Moreover, when data are ultrahigh-dimensional, penalized regression is usable only after variable screening methods have been applied to reduce the number of variables. We propose a stepwise procedure for fitting generalized linear models with ultrahigh dimensional predictors. Our procedure can provide a final model, control both false negatives and false positives, and yield consistent estimates, which are useful for gauging the actual effect sizes of risk factors. Simulations and applications to two clinical studies verify the utility of the method.
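    As a point of reference, here is a minimal sketch of forward stepwise selection for a logistic GLM. The stopping rule shown (an EBIC-type penalty with parameter gamma) is our stand-in, not necessarily the paper's exact criterion, and no screening step is included, so the brute-force scan below is only practical for modest numbers of predictors.

```python
# Forward stepwise logistic GLM with an EBIC-style stopping rule (sketch).
import numpy as np
import statsmodels.api as sm

def stepwise_glm(X, y, gamma=0.5, max_vars=20):
    n, p = X.shape
    selected, best_crit = [], np.inf
    while len(selected) < max_vars:
        best = None
        for j in (k for k in range(p) if k not in selected):
            design = sm.add_constant(X[:, selected + [j]])
            dev = sm.GLM(y, design, family=sm.families.Binomial()).fit().deviance
            k = len(selected) + 1
            crit = dev + k * np.log(n) + 2 * gamma * k * np.log(p)  # EBIC
            if crit < best_crit:
                best, best_crit = j, crit
        if best is None:                 # no candidate improves the criterion
            break
        selected.append(best)
    return selected
```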
  3. Summary

    Variable selection for recovering sparsity in nonadditive and nonparametric models with high-dimensional variables has been challenging. The problem becomes even more difficult because of complications in modeling unknown interaction terms among high-dimensional variables. No existing variable selection method overcomes these limitations. Hence, in this article we propose a variable selection approach developed by connecting a kernel machine with the nonparametric regression model. The advantages of our approach are that it can: (i) recover the sparsity; (ii) automatically model unknown and complicated interactions; (iii) connect with several existing approaches, including the linear nonnegative garrote and multiple kernel learning; and (iv) provide flexibility for both additive and nonadditive nonparametric models. Our approach can be viewed as a nonlinear version of the nonnegative garrote method. We model the smoothing function by a Least Squares Kernel Machine (LSKM) and construct the nonnegative garrote objective function as a function of the kernel machine's sparse scale parameters, recovering the sparsity of input variables whose relevance to the response is measured by those scale parameters; a schematic form of this objective is given below. We also provide the asymptotic properties of our approach: sparsistency is satisfied with consistent initial kernel function coefficients under certain conditions. An efficient coordinate descent/backfitting algorithm is developed, and a resampling procedure for our variable selection methodology is proposed to improve the power.
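    For orientation, the criterion being minimized can be written schematically as follows. The notation is ours and need not match the paper's: the kernel K, the initial LSKM coefficients, and the tuning parameter lambda are generic placeholders.

```latex
% Schematic kernel nonnegative-garrote criterion (our notation, not
% necessarily the paper's): \hat{c}_l are the initial LSKM coefficients,
% \theta_j \ge 0 is the scale parameter attached to input variable j, and
% \theta \odot x denotes coordinatewise scaling of an input vector.
\min_{\theta_1,\dots,\theta_p \ge 0}\;
  \sum_{i=1}^{n}\Bigl\{\, y_i - \sum_{l=1}^{n} \hat{c}_l\,
      K\!\bigl(\theta \odot x_i,\; \theta \odot x_l\bigr) \Bigr\}^{2}
  \;+\; \lambda \sum_{j=1}^{p} \theta_j ,
\qquad \hat{\theta}_j = 0 \;\Longrightarrow\; \text{variable } j \text{ drops out.}
```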

     
  4. In many applications, scientists are particularly interested in detecting which of the predictors are truly associated with a multivariate response. It is more accurate to model the multiple responses as one vector rather than to model each component separately; this is particularly true for complex traits having multiple correlated components. A Bayesian multivariate variable selection (BMVS) approach is proposed to select important predictors influencing the multivariate response from a candidate pool of ultrahigh dimension. By applying sample‐size‐dependent spike and slab priors, the BMVS approach satisfies the strong selection consistency property under certain conditions, an advantage of BMVS over other existing Bayesian multivariate regression‐based approaches. The proposed approach considers the covariance structure of the multiple responses without assuming independence and integrates the estimation of covariance‐related parameters together with all regression parameters into one framework through a fast‐updating Markov chain Monte Carlo (MCMC) procedure. It is demonstrated through simulations that the BMVS approach outperforms several relevant frequentist and Bayesian approaches. The proposed approach is flexible and widely applicable, including to genome‐wide association studies with multiple correlated phenotypes and a large number of genetic variants and/or environmental variables, as demonstrated in the real data analyses section. The computer code and test data of the proposed method are available as an R package.
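    A deliberately generic illustration of the prior family the abstract refers to is written below; the exact hyperparameter choices that make the spike and slab depend on the sample size are the paper's, and the notation here is only schematic.

```latex
% Generic sample-size-dependent spike-and-slab prior (schematic only; the
% paper's exact hyperparameter choices are not reproduced here). B_j is
% the 1 x q row of regression coefficients for predictor j and \gamma_j
% its inclusion indicator.
B_j \mid \gamma_j \;\sim\;
  (1-\gamma_j)\,\mathcal{N}_q\bigl(0,\ \tau_{0,n}^{2} I_q\bigr)
  \;+\; \gamma_j\,\mathcal{N}_q\bigl(0,\ \tau_{1,n}^{2} I_q\bigr),
\qquad \gamma_j \sim \operatorname{Bernoulli}(\pi_n).
```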

     
  5. Summary

    We propose a multivariate sparse group lasso variable selection and estimation method for data with high-dimensional predictors as well as high-dimensional response variables. The method is carried out through a penalized multivariate multiple linear regression model with an arbitrary group structure for the regression coefficient matrix. It suits many biological studies well in detecting associations between multiple traits and multiple predictors, with each trait and each predictor embedded in some biological functional group such as a gene, a pathway or a brain region. The method is able to effectively remove unimportant groups as well as unimportant individual coefficients within important groups, particularly for large-p, small-n problems, and is flexible in handling various complex group structures such as overlapping, nested or multilevel hierarchical structures. The method is evaluated through extensive simulations with comparisons to the conventional lasso and group lasso methods, and is applied to an eQTL association study.
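    The penalized criterion behind such a method has the following standard form, shown here for orientation; the group weights and the two tuning parameters are our generic notation, and the paper's penalty may differ in detail.

```latex
% Standard multivariate sparse group lasso criterion (for orientation).
% B_{(g)} is the block of coefficients belonging to group g; the group
% penalty removes whole groups, the elementwise penalty removes
% individual coefficients within retained groups.
\min_{B \in \mathbb{R}^{p \times q}}\;
  \tfrac{1}{2}\, \lVert Y - XB \rVert_F^{2}
  \;+\; \lambda_1 \sum_{g=1}^{G} w_g \,\lVert B_{(g)} \rVert_F
  \;+\; \lambda_2 \,\lVert B \rVert_1 .
```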

     