Title: Sure Independence Screening for Ultrahigh Dimensional Feature Space
Summary

Variable selection plays an important role in high dimensional statistical modelling, which nowadays appears in many areas and is key to various scientific discoveries. For problems of large scale or dimensionality p, accuracy of estimation and computational cost are two top concerns. Recently, Candes and Tao have proposed the Dantzig selector using L1-regularization and showed that it achieves the ideal risk up to a logarithmic factor log(p). Their innovative procedure and remarkable result are challenged when the dimensionality is ultrahigh, as the factor log(p) can be large and their uniform uncertainty principle can fail. Motivated by these concerns, we introduce the concept of sure screening and propose a sure screening method that is based on correlation learning, called sure independence screening, to reduce dimensionality from high to a moderate scale that is below the sample size. In a fairly general asymptotic framework, correlation learning is shown to have the sure screening property for even exponentially growing dimensionality. As a methodological extension, iterative sure independence screening is also proposed to enhance its finite sample performance. With dimension reduced accurately from high to below sample size, variable selection can be improved in both speed and accuracy, and can then be accomplished by a well-developed method such as smoothly clipped absolute deviation, the Dantzig selector, lasso or adaptive lasso. The connections between these penalized least squares methods are also elucidated.
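The screening step itself is simple enough to sketch in a few lines. The minimal Python illustration below (not the authors' code; the function name is made up, and the choice d = n - 1 follows the paper's suggestion of reducing dimensionality below the sample size) ranks features by absolute marginal correlation with the response and keeps the top d; a penalized method such as SCAD or the lasso would then be run on the retained columns.

```python
import numpy as np

def sis(X, y, d):
    """Sure independence screening sketch: rank features by absolute
    marginal correlation with the response and keep the top d indices."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize each feature
    yc = (y - y.mean()) / y.std()
    omega = np.abs(Xc.T @ yc) / len(y)          # componentwise correlations
    return np.argsort(omega)[::-1][:d]          # top d by |correlation|

# Toy data with p >> n: only the first two features matter.
rng = np.random.default_rng(0)
n, p = 100, 5000
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 1.5 * X[:, 1] + rng.standard_normal(n)
kept = sis(X, y, d=n - 1)   # reduce below the sample size
```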

 
NSF-PAR ID:
10405649
Author(s) / Creator(s):
Fan, Jianqing; Lv, Jinchao
Publisher / Repository:
Oxford University Press
Date Published:
2008
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Volume:
70
Issue:
5
ISSN:
1369-7412
Page Range / eLocation ID:
p. 849-911
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    We propose a Bayesian variable selection approach for ultrahigh dimensional linear regression based on the strategy of split and merge. The approach proposed consists of two stages: first, split the ultrahigh dimensional data set into a number of lower dimensional subsets and select relevant variables from each of the subsets; second, aggregate the variables selected from each subset and then select relevant variables from the aggregated data set. Since the approach proposed has an embarrassingly parallel structure, it can easily be implemented in a parallel architecture and applied to big data problems with millions or more explanatory variables. Under mild conditions, we show that the approach proposed is consistent, i.e. the true explanatory variables can be correctly identified as the sample size becomes large. Extensive comparisons of the approach proposed have been made with penalized likelihood approaches, such as the lasso, elastic net, sure independence screening and iterative sure independence screening. The numerical results show that the approach proposed generally outperforms penalized likelihood approaches: the models it selects tend to be sparser and closer to the true model.
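A minimal sketch of the two-stage split-and-merge structure follows, with a cross-validated lasso standing in for the paper's Bayesian selection step (the function name and the choice of selector are illustrative assumptions, not the authors' method).

```python
import numpy as np
from sklearn.linear_model import LassoCV

def split_and_merge(X, y, n_splits=10):
    """Two-stage split-and-merge sketch: select within each feature
    subset, then re-select from the aggregated candidate set."""
    blocks = np.array_split(np.arange(X.shape[1]), n_splits)
    candidates = []
    for idx in blocks:                      # stage 1: parallel over subsets
        coef = LassoCV(cv=5).fit(X[:, idx], y).coef_
        candidates.extend(idx[np.flatnonzero(coef)])
    candidates = np.unique(candidates)
    coef = LassoCV(cv=5).fit(X[:, candidates], y).coef_   # stage 2: merge
    return candidates[np.flatnonzero(coef)]
```

Because stage 1 treats the feature blocks independently, the loop parallelizes trivially across workers or machines.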

     
  2. Big data is ubiquitous in various fields of science, engineering, medicine, the social sciences, and the humanities. It is often accompanied by a large number of variables and features. While an enriched feature space adds much greater flexibility to modeling, ultra-high dimensional data analysis poses fundamental challenges to scalable learning and inference with good statistical efficiency. Sure independence screening is a simple and effective method for this task. This two-scale statistical learning framework, introduced in Fan and Lv (2008) and consisting of large-scale screening followed by moderate-scale variable selection, has been extensively investigated and extended to various model settings, ranging from parametric to semiparametric and nonparametric, for regression, classification, and survival analysis. This article provides an overview of the developments of sure independence screening over the past decade. These developments demonstrate the wide applicability of sure independence screening based learning and inference for big data analysis with the desired scalability and theoretical guarantees.
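The two-scale framework reduces to a short pipeline. The sketch below is illustrative only (it assumes numpy and scikit-learn, and a lasso as the moderate-scale selector): screen by marginal correlation at the large scale, then run penalized selection on the survivors.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def two_scale_fit(X, y, d):
    """Large-scale screening (keep the top-d marginal correlations),
    followed by moderate-scale penalized selection on the survivors."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    keep = np.argsort(np.abs(Xc.T @ (y - y.mean())))[::-1][:d]
    lasso = LassoCV(cv=5).fit(X[:, keep], y)   # moderate-scale selection
    return keep[np.flatnonzero(lasso.coef_)]
```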
  3. Marginal screening is a widely applied technique for reducing the dimensionality of the data when the number of potential features overwhelms the sample size. By their nature, marginal screening procedures have difficulty identifying the so-called hidden variables that are jointly important but have weak marginal associations with the response variable. Failing to include a hidden variable in the screening stage has two undesirable consequences: (1) important features are missed in model selection, and (2) biased inference is likely to occur in the subsequent analysis. Motivated by some recent work in conditional screening, we propose a data-driven conditional screening algorithm which is computationally efficient, enjoys the sure screening property under weaker assumptions on the model, and works robustly in a variety of settings to reduce false negatives of hidden variables. Numerical comparison with alternative screening procedures is also made to shed light on the relative merit of the proposed method. We illustrate the proposed methodology using a leukaemia microarray data example.
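One common way to realize conditional screening, sketched below under the assumption that a conditioning set is already given, is to project out the conditioning variables and rank the remaining features by residual correlation. The paper's data-driven algorithm additionally chooses the conditioning set itself, which this illustration does not attempt; the function name and statistic are illustrative, not necessarily the authors' exact proposal.

```python
import numpy as np

def conditional_screen(X, y, cond, d):
    """Screen features by correlation with the response after
    projecting out a given conditioning set of variables."""
    n, p = X.shape
    C = np.column_stack([np.ones(n), X[:, cond]])
    H = C @ np.linalg.pinv(C)               # projector onto span{1, X_cond}
    ry = y - H @ y                          # residual response
    rest = np.setdiff1d(np.arange(p), cond)
    RX = X[:, rest] - H @ X[:, rest]        # residual features
    norms = np.linalg.norm(RX, axis=0)
    norms[norms == 0] = np.inf              # guard: skip constant residuals
    score = np.abs(RX.T @ ry) / norms
    return rest[np.argsort(score)[::-1][:d]]
```

A hidden variable with weak marginal but strong conditional association can score highly here once the conditioning set absorbs the masking variables.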

     
  4. Summary

    Variance estimation is a fundamental problem in statistical modelling. In ultrahigh dimensional linear regression where the dimensionality is much larger than the sample size, traditional variance estimation techniques are not applicable. Recent advances in variable selection in ultrahigh dimensional linear regression make this problem accessible. One of the major problems in ultrahigh dimensional regression is the high spurious correlation between the unobserved realized noise and some of the predictors. As a result, the realized noises are actually predicted when extra irrelevant variables are selected, leading to a serious underestimate of the level of noise. We propose a two-stage refitted procedure via a data splitting technique, called refitted cross-validation, to attenuate the influence of irrelevant variables with high spurious correlations. Our asymptotic results show that the resulting procedure performs as well as the oracle estimator, which knows in advance the mean regression function. The simulation studies lend further support to our theoretical claims. The naive two-stage estimator and the plug-in one-stage estimators using the lasso and smoothly clipped absolute deviation are also studied and compared. Their performances can be improved by the refitted cross-validation method proposed.
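The refitted cross-validation recipe is concrete enough to sketch: split the sample in half, select variables on one half, refit those variables by ordinary least squares on the other half, and average the two resulting variance estimates. In the illustration below, a cross-validated lasso is an assumed stand-in for the selection step.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def rcv_variance(X, y, seed=0):
    """Refitted cross-validation sketch: select on one half, refit by
    least squares on the other half, and average the two estimates."""
    n = len(y)
    idx = np.random.default_rng(seed).permutation(n)
    halves = (idx[: n // 2], idx[n // 2:])
    estimates = []
    for sel, fit in (halves, halves[::-1]):      # swap roles and average
        chosen = np.flatnonzero(LassoCV(cv=5).fit(X[sel], y[sel]).coef_)
        Z = np.column_stack([np.ones(len(fit)), X[np.ix_(fit, chosen)]])
        beta, *_ = np.linalg.lstsq(Z, y[fit], rcond=None)
        rss = np.sum((y[fit] - Z @ beta) ** 2)
        estimates.append(rss / (len(fit) - len(chosen) - 1))
    return float(np.mean(estimates))
```

Refitting on fresh data is what attenuates the spurious-correlation bias: variables selected only because they happen to track the realized noise on the first half contribute no systematic fit on the second.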

     
  5. Abstract

    Singlet fission (SF), the conversion of one singlet exciton into two triplet excitons, could significantly enhance solar cell efficiency. Molecular crystals that undergo SF are scarce. Computational exploration may accelerate the discovery of SF materials. However, many-body perturbation theory (MBPT) calculations of the excitonic properties of molecular crystals are impractical for large-scale materials screening. We use the sure-independence-screening-and-sparsifying-operator (SISSO) machine-learning algorithm to generate computationally efficient models that can predict the MBPT thermodynamic driving force for SF for a dataset of 101 polycyclic aromatic hydrocarbons (PAH101). SISSO generates models by iteratively combining physical primary features. The best models are selected by linear regression with cross-validation. The SISSO models successfully predict the SF driving force with errors below 0.2 eV. Based on the cost, accuracy, and classification performance of SISSO models, we propose a hierarchical materials screening workflow. Three potential SF candidates are found in the PAH101 set.
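A toy illustration of the SISSO idea follows (not the actual SISSO code; the operators, names, and pool sizes are assumptions): candidate descriptors are built by combining primary features with simple operators, screened by correlation with the target, and the best low-dimensional linear model is then found by exhaustive least squares over the screened pool.

```python
import numpy as np
from itertools import combinations

def sisso_sketch(X, y, names, n_screen=20, dim=2):
    """Toy SISSO-style search: expand primary features with simple
    operators, screen by correlation (SIS), then pick the best
    dim-term linear model over the screened pool (sparsifying step)."""
    feats = [X[:, j] for j in range(X.shape[1])]
    labels = list(names)
    for i, j in combinations(range(X.shape[1]), 2):  # operator expansion
        feats.append(X[:, i] * X[:, j]); labels.append(f"({names[i]}*{names[j]})")
        feats.append(X[:, i] - X[:, j]); labels.append(f"({names[i]}-{names[j]})")
    F = np.column_stack(feats)
    std = F.std(axis=0); std[std == 0] = 1.0         # guard constant features
    scores = np.abs(((F - F.mean(axis=0)) / std).T @ (y - y.mean()))
    pool = np.argsort(scores)[::-1][:n_screen]       # SIS screening step
    best = (np.inf, None)
    for combo in combinations(pool, dim):            # exhaustive L0 search
        A = np.column_stack([np.ones(len(y)), F[:, list(combo)]])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        rss = float(np.sum((y - A @ beta) ** 2))
        if rss < best[0]:
            best = (rss, [labels[k] for k in combo])
    return best   # (residual sum of squares, selected descriptor labels)
```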

     