Title: A Computational Perspective on Projection Pursuit in High Dimensions: Feasible or Infeasible Feature Extraction
Summary

Finding a suitable representation of multivariate data is fundamental in many scientific disciplines. Projection pursuit (PP) aims to extract interesting ‘non‐Gaussian’ features from multivariate data, and tends to be computationally intensive even when applied to data of low dimension. In high‐dimensional settings, a recent work (Bickel et al., 2018) on PP addresses asymptotic characterization and conjectures of the feasible projections as the dimension grows with sample size. To gain practical utility of PP and learn theoretical insights into PP in an integral way, data analytic tools needed to evaluate the behaviour of PP in high dimensions become increasingly desirable but are less explored in the literature. This paper focuses on developing computationally fast and effective approaches central to finite sample studies for (i) visualizing the feasibility of PP in extracting features from high‐dimensional data, as compared with alternative methods like PCA and ICA, and (ii) assessing the plausibility of PP in cases where asymptotic studies are lacking or unavailable, with the goal of better understanding the practicality, limitation and challenge of PP in the analysis of large data sets.
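For a rough feel for what PP does in practice: search for a unit direction along which the projected data look least Gaussian. The minimal sketch below (Python) uses a kurtosis-based projection index and plain random search, both our own illustrative choices rather than the algorithms studied in the paper, and contrasts the result with the leading PCA direction.

```python
# A minimal, illustrative PP sketch (not the paper's algorithm): search for
# a unit direction maximizing a kurtosis-based non-Gaussianity index via
# random restarts, and compare with the leading PCA direction.
import numpy as np

def kurtosis_index(z):
    """Absolute excess kurtosis of a 1-D projection (zero for Gaussian)."""
    z = (z - z.mean()) / z.std()
    return abs(np.mean(z**4) - 3.0)

def projection_pursuit(X, n_candidates=2000, seed=0):
    """Crude PP by random search over unit directions (illustration only)."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    best_dir, best_val = None, -np.inf
    for _ in range(n_candidates):
        a = rng.standard_normal(p)
        a /= np.linalg.norm(a)
        val = kurtosis_index(X @ a)
        if val > best_val:
            best_dir, best_val = a, val
    return best_dir, best_val

rng = np.random.default_rng(1)
n, p = 200, 50                       # dimension non-trivial relative to n
X = rng.standard_normal((n, p))
X[:, 0] = rng.exponential(size=n)    # one genuinely non-Gaussian coordinate

pp_dir, pp_val = projection_pursuit(X)
pca_dir = np.linalg.eigh(np.cov(X, rowvar=False))[1][:, -1]
print("index along PP direction :", pp_val)
print("index along PCA direction:", kurtosis_index(X @ pca_dir))
```

Note that on pure Gaussian noise such a search can still report a seemingly large index when the dimension is comparable to the sample size, which is precisely the feasible-versus-infeasible question the paper sets out to visualize.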

 
Award ID(s):
2013486 1712418
NSF-PAR ID:
10405192
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Wiley-Blackwell
Date Published:
Journal Name:
International Statistical Review
Volume:
91
Issue:
1
ISSN:
0306-7734
Page Range / eLocation ID:
p. 140-161
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Summary

    Motivated by the increasing use of kernel-based metrics for high-dimensional and large-scale data, we study the asymptotic behaviour of kernel two-sample tests when the dimension and sample sizes both diverge to infinity. We focus on the maximum mean discrepancy using an isotropic kernel, which includes maximum mean discrepancy with the Gaussian kernel and the Laplace kernel, and the energy distance as special cases. We derive asymptotic expansions of the kernel two-sample statistics, based on which we establish a central limit theorem under both the null hypothesis and the local and fixed alternatives. The new nonnull central limit theorem results allow us to perform asymptotic exact power analysis, which reveals a delicate interplay between the moment discrepancy that can be detected by the kernel two-sample tests and the dimension-and-sample orders. The asymptotic theory is further corroborated through numerical studies.
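    For concreteness, here is a minimal sketch of the biased two-sample MMD² statistic with a Gaussian kernel; the median-heuristic bandwidth is our assumption, not something prescribed by the abstract.

```python
# A minimal sketch of the biased two-sample MMD^2 statistic with a Gaussian
# kernel; the median-heuristic bandwidth choice is an assumption made here
# for illustration.
import numpy as np

def gaussian_gram(A, B, bw):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dist.
    return np.exp(-d2 / (2 * bw**2))

def mmd2(X, Y):
    Z = np.vstack([X, Y])
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    bw = np.sqrt(np.median(d2[d2 > 0]) / 2)              # median heuristic
    return (gaussian_gram(X, X, bw).mean()
            + gaussian_gram(Y, Y, bw).mean()
            - 2 * gaussian_gram(X, Y, bw).mean())

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 100))        # sample size and dimension large
Y = rng.standard_normal((100, 100)) + 0.1  # small mean shift per coordinate
print("MMD^2 estimate:", mmd2(X, Y))
```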

     
  2. In the context of principal components analysis (PCA), the bootstrap is commonly applied to solve a variety of inference problems, such as constructing confidence intervals for the eigenvalues of the population covariance matrix Σ. However, when the data are high-dimensional, there are relatively few theoretical guarantees that quantify the performance of the bootstrap. Our aim in this paper is to analyze how well the bootstrap can approximate the joint distribution of the leading eigenvalues of the sample covariance matrix Σ̂, and we establish non-asymptotic rates of approximation with respect to the multivariate Kolmogorov metric. Under certain assumptions, we show that the bootstrap can achieve a dimension-free rate of r(Σ)/√n up to logarithmic factors, where r(Σ) is the effective rank of Σ and n is the sample size. From a methodological standpoint, we show that applying a transformation to the eigenvalues of Σ̂ before bootstrapping is an important consideration in high-dimensional settings.
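    As a sketch of the idea: resample rows, recompute the leading eigenvalues of the sample covariance matrix, and build intervals on a transformed scale. The log map used below is an illustrative choice echoing the paper's advice to transform eigenvalues before bootstrapping, not its specific transformation.

```python
# A sketch of bootstrapping the leading sample-covariance eigenvalues, with
# a log transformation (illustrative choice) applied before bootstrapping.
import numpy as np

def top_eigs(X, k):
    S = np.cov(X, rowvar=False)
    return np.linalg.eigvalsh(S)[::-1][:k]   # eigvalsh sorts ascending

rng = np.random.default_rng(0)
n, p, k, B = 300, 100, 3, 500
X = rng.standard_normal((n, p)) @ np.diag(np.linspace(2.0, 0.5, p)) ** 0.5

boot = np.empty((B, k))
for b in range(B):
    idx = rng.integers(0, n, size=n)         # resample rows with replacement
    boot[b] = np.log(top_eigs(X[idx], k))

# Percentile intervals on the log scale, mapped back to the eigenvalue scale.
lo, hi = np.percentile(boot, [2.5, 97.5], axis=0)
print("95% CIs, lower endpoints:", np.exp(lo))
print("95% CIs, upper endpoints:", np.exp(hi))
```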
  3. Summary

    We introduce an L2-type test for testing mutual independence and banded dependence structure for high dimensional data. The test is constructed on the basis of the pairwise distance covariance, and it accounts for the non-linear and non-monotone dependences among the data, which cannot be fully captured by the existing tests based on either Pearson correlation or rank correlation. Our test can be conveniently implemented in practice, as the limiting null distribution of the test statistic is shown to be standard normal. It exhibits excellent finite sample performance in our simulation studies even when the sample size is small and the dimension is high, and it successfully identifies non-linear dependence in empirical data analysis. On the theory side, asymptotic normality of our test statistic is shown under quite mild moment assumptions and with little restriction on the growth rate of the dimension as a function of sample size. As a demonstration of the good power properties of our distance-covariance-based test, we further show that an infeasible version of our test statistic achieves rate optimality in the class of Gaussian distributions with equal correlation.
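    A minimal sketch of the building block of such a test follows: the squared sample distance covariance for a pair of coordinates, summed over all pairs. The studentization that yields the standard normal null limit is omitted here.

```python
# A sketch of the building block: squared sample distance covariance between
# two univariate coordinates, summed over all coordinate pairs.
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance of two univariate samples."""
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()   # double centering
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    return (A * B).mean()

rng = np.random.default_rng(0)
n, p = 100, 20
X = rng.standard_normal((n, p))
# Non-linear, non-monotone dependence with near-zero Pearson correlation:
X[:, 1] = X[:, 0] ** 2 + 0.1 * rng.standard_normal(n)

stat = sum(dcov2(X[:, j], X[:, k]) for j in range(p) for k in range(j + 1, p))
print("sum of pairwise squared distance covariances:", stat)
```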

     
  4. Abstract

    Data with a huge size present great challenges in modeling, inferences, and computation. In handling big data, much attention has been directed to settings with “large p small n”, and relatively less work has been done to address problems with p and n both large, though data with such a feature have now become more accessible than before, where p represents the number of variables and n stands for the sample size. The big volume of data does not automatically ensure good quality of inferences, because a large number of unimportant variables may be collected in the process of gathering informative variables. To carry out valid statistical analysis, it is imperative to screen out noisy variables that have no predictive value for explaining the outcome variable. In this paper, we develop a screening method for handling large‐sized survival data, where the sample size n is large and the dimension p of covariates is of non‐polynomial order of the sample size n, or the so‐called NP‐dimension. We rigorously establish theoretical results for the proposed method and conduct numerical studies to assess its performance. Our research offers multiple extensions of existing work and enlarges the scope of high‐dimensional data analysis. The proposed method capitalizes on the connections among useful regression settings and offers a computationally efficient screening procedure. Our method can be applied to different situations with large‐scale data, including genomic data.
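    A sketch of marginal screening for survival data, not the paper's exact procedure: rank each covariate by its standardized univariate Cox score statistic at beta = 0 and retain the top-ranked ones. The ranking rule and the n/log(n) retained-set size are illustrative assumptions.

```python
# A hedged sketch of marginal screening for survival data: rank covariates
# by the standardized univariate Cox score statistic at beta = 0.
import numpy as np

def cox_score(x, time, event):
    """Standardized Cox score statistic for one covariate at beta = 0."""
    order = np.argsort(time)
    x, event = x[order], event[order]
    U = V = 0.0
    for i in range(len(x)):
        if event[i]:
            risk = x[i:]                  # subjects still at risk at event i
            m = risk.mean()
            U += x[i] - m
            V += ((risk - m) ** 2).mean()
    return abs(U) / np.sqrt(V) if V > 0 else 0.0

rng = np.random.default_rng(0)
n, p = 200, 500                           # p large relative to n
X = rng.standard_normal((n, p))
hazard = np.exp(X[:, 0] - X[:, 1])        # only covariates 0 and 1 matter
t_event = rng.exponential(1 / hazard)
t_censor = rng.exponential(2.0, size=n)
event = (t_event <= t_censor).astype(int)
time = np.minimum(t_event, t_censor)

scores = np.array([cox_score(X[:, j], time, event) for j in range(p)])
keep = np.argsort(scores)[::-1][: int(n / np.log(n))]
print("top 5 ranked covariates:", keep[:5])
```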

     
  5. Abstract

    Change point detection for high‐dimensional data is an important yet challenging problem for many applications. In this article, we consider multiple change point detection in the context of high‐dimensional generalized linear models, allowing the covariate dimension to grow exponentially with the sample size. The model considered is general and flexible in the sense that it covers various specific models as special cases. It can automatically account for the underlying data generation mechanism without specifying any prior knowledge about the number of change points. Based on dynamic programming and binary segmentation techniques, two algorithms are proposed to detect multiple change points, allowing the number of change points to grow with the sample size. To further improve the computational efficiency, a more efficient algorithm designed for the case of a single change point is proposed. We present theoretical properties of our proposed algorithms, including estimation consistency for the number and locations of change points as well as consistency and asymptotic distributions for the underlying regression coefficients. Finally, extensive simulation studies and application to the Alzheimer's Disease Neuroimaging Initiative data further demonstrate the competitive performance of our proposed methods.
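    The recursive segmentation idea can be sketched in a few lines. The example below uses a simple CUSUM statistic for mean changes in a univariate series, our own simplification; the paper's algorithms operate on high-dimensional generalized linear models, which this does not attempt.

```python
# A sketch of binary segmentation for multiple change points, illustrated
# with a standardized CUSUM statistic for mean changes in a 1-D series.
import numpy as np

def max_cusum(y):
    """Return (location, value) of the maximal standardized CUSUM on y."""
    n = len(y)
    csum = np.cumsum(y)
    best_t, best_val = 1, -np.inf
    for t in range(1, n):
        gap = csum[t - 1] / t - (csum[-1] - csum[t - 1]) / (n - t)
        val = np.sqrt(t * (n - t) / n) * abs(gap)
        if val > best_val:
            best_t, best_val = t, val
    return best_t, best_val

def binary_segmentation(y, lo=0, threshold=4.0, found=None):
    """Recursively split the series while the CUSUM exceeds the threshold."""
    if found is None:
        found = []
    if len(y) >= 4:
        t, val = max_cusum(y)
        if val > threshold:
            found.append(lo + t)
            binary_segmentation(y[:t], lo, threshold, found)
            binary_segmentation(y[t:], lo + t, threshold, found)
    return sorted(found)

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(0, 1, 100),
                    rng.normal(2, 1, 100),
                    rng.normal(-1, 1, 100)])
print("detected change points:", binary_segmentation(y))  # near 100 and 200
```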

     