
Title: Projection Test with Sparse Optimal Direction for High-Dimensional One Sample Mean Problem
Testing whether the mean vector of a population is zero is a fundamental problem in statistics. In the high-dimensional regime, where the dimension p of the data exceeds the sample size n, traditional methods such as Hotelling's T² test cannot be applied directly. One can instead project the high-dimensional data onto a low-dimensional space, after which traditional methods apply. In this paper, we propose a projection test based on a new estimator of the optimal projection direction Σ^{−1}μ. Under the assumption that the optimal direction Σ^{−1}μ is sparse, we estimate it via a regularized quadratic program with a nonconvex penalty and a linear constraint. Simulation studies and a real data analysis examine the finite-sample performance of the competing tests in terms of type I error and power.
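
A minimal sketch of the two-step procedure described above, assuming a data matrix X whose rows are observations. The paper estimates the sparse direction Σ^{−1}μ with a regularized quadratic program under a nonconvex penalty; the ridge-regularized plug-in direction below (and the tuning parameter lam) is a simplifying stand-in for that estimator, so this illustrates the sample-splitting structure of a projection test rather than the paper's method.

    # Sketch of a sample-splitting projection test for H0: mu = 0.
    # NOTE: the direction estimate is a ridge-regularized plug-in, a
    # stand-in for the paper's nonconvex-penalized quadratic program.
    import numpy as np
    from scipy import stats

    def projection_test(X, lam=1.0, seed=0):
        """X: (n, p) data matrix; returns the p-value of the projected t-test."""
        rng = np.random.default_rng(seed)
        n, p = X.shape
        idx = rng.permutation(n)
        X1, X2 = X[idx[: n // 2]], X[idx[n // 2 :]]
        # Step 1: estimate the direction w ~ Sigma^{-1} mu on the first half
        # (the ridge term keeps the p x p solve well-posed when p > n).
        mu_hat = X1.mean(axis=0)
        Sigma_hat = np.cov(X1, rowvar=False)
        w = np.linalg.solve(Sigma_hat + lam * np.eye(p), mu_hat)
        # Step 2: project the held-out half onto w; the problem is now
        # one-dimensional, so an ordinary one-sample t-test applies.
        z = X2 @ w
        return stats.ttest_1samp(z, 0.0).pvalue

With the exact direction w = Σ^{−1}μ, the projected mean w'μ = μ'Σ^{−1}μ is strictly positive whenever μ ≠ 0 (Σ^{−1} being positive definite), which is why projecting onto an estimate of this direction preserves power in one dimension.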
Authors:
Editors:
Fan, J.; Pan, J.
Award ID(s):
1820702
Publication Date:
NSF-PAR ID:
10217341
Journal Name:
Contemporary Experimental Design, Multivariate Analysis and Data Mining
Page Range or eLocation-ID:
295-309
Sponsoring Org:
National Science Foundation
More Like this
  1. We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for d-dimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ*, Σ) for some fixed μ* or from some N(μ, Σ) with total variation distance at least α from N(μ*, Σ), with (ε, 0)-differential privacy, using only Õ(d^{1/2}/α² + d^{1/3}/(α^{4/3}·ε^{2/3}) + 1/(α·ε)) samples if the algorithm is allowed to be computationally inefficient, and only Õ(d^{1/2}/α² + d^{1/4}/(α·ε)) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1, 1}^d, and tolerant testing. Our results improve over the previous best work of Canonne et al. [CanonneKMUZ20] for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal non-private sample complexity of O(√d/α²) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of d-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d [AcharyaSZ18], which refutes a conjectured lower bound of [CanonneKMUZ20]. [A toy differentially private tester illustrating this setting is sketched after this list.]
  2. We study the fundamental problem of high-dimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial-time algorithms for this problem with dimension-independent error guarantees for several families of structured distributions. In this work, we give the first nearly-linear-time algorithms for high-dimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and sub-gaussian tails, and (ii) unknown bounded covariance. Given N samples on R^d, an ε-fraction of which may be arbitrarily corrupted, our algorithms run in time Õ(Nd)/poly(ε) and approximate the true mean within the information-theoretically optimal error, up to constant factors. Previous robust algorithms with comparable error guarantees have running times Ω(Nd²), for ε = O(1). Our algorithms rely on a natural family of SDPs parameterized by our current guess ν for the unknown mean μ. We give a win-win analysis establishing the following: either a near-optimal solution to the primal SDP yields a good candidate for μ, independent of our current guess ν, or a near-optimal solution to the dual SDP yields a new guess ν′ whose distance from μ is smaller by a constant factor. We exploit the special structure of the corresponding SDPs to show that they are approximately solvable in nearly-linear time. Our approach is quite general, and we believe it can also be applied to obtain nearly-linear-time algorithms for other high-dimensional robust learning problems. [A naive trimming baseline, for contrast, is sketched after this list.]
  3. We analyze the performance of the Tukey median estimator under total variation (TV) distance corruptions. Previous results show that under Huber's additive corruption model, the breakdown point is 1/3 for high-dimensional halfspace-symmetric distributions. We show that under TV corruptions, the breakdown point reduces to 1/4 for the same set of distributions. We also show that a certain projection algorithm can attain the optimal breakdown point of 1/2. Both the Tukey median estimator and the projection algorithm achieve sample complexity linear in dimension.
  4. Context. As primary anchors of the distance scale, Cepheid stars play a crucial role in our understanding of the distance scale of the Universe because of their period-luminosity relation. Determining precise and consistent parameters (radius, temperature, color excess, and projection factor) of Cepheid pulsating stars is therefore very important. Aims. With the high-precision parallaxes delivered by the early third Gaia data release (EDR3), we aim to derive various parameters of Cepheid stars in order to calibrate the period-luminosity and period-radius relations and to investigate the relation of period to p-factor. Methods. We applied an implementation of the parallax-of-pulsation method through the algorithm called spectro-photo-interferometry of pulsating stars (SPIPS), which combines all types of available data for a variable star (multiband and multicolor photometry, radial velocity, effective temperature, and interferometry measurements) in a global modeling of its pulsation. Results. We present the SPIPS modeling of a sample of 63 Galactic Cepheids. Adopting Gaia EDR3 parallaxes as an input associated with the best available dataset, we derive consistent values of parameters for these stars such as the radius, multiband apparent magnitudes, effective temperatures, color excesses, period changes, Fourier parameters, and the projection factor. Conclusions. Using the best set of data and the most precise distances for Milky Way Cepheids, we derive new calibrations of the period-luminosity and period-radius relations: M_Ks = −5.529 (±0.015) − 3.141 (±0.050) (log P − 0.9) and log R = 1.763 (±0.003) + 0.653 (±0.012) (log P − 0.9). After investigating the dependences of the projection factor on the parameters of the stars, we find a high dispersion of its values and no evidence of its correlation with the period or with any other parameters such as radial velocity, temperature, or metallicity. Statistically, the p-factor has an average value of p = 1.26 ± 0.07, but with an unsatisfactory agreement (σ = 0.15). In the absence of any clear correlation between the p-factor and other quantities, the best agreement is obtained under the assumption that the p-factor can take any value in a band with a width of 0.15. This result highlights the need for a further examination of the physics behind the p-factor. [A worked example using these relations follows after this list.]
  5. A fundamental question in many data analysis settings is the problem of discerning the "natural" dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of Secant-Avoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, one may infer the dimension of the manifold of origination. That is, one may learn the dimension at which it is possible to construct a diffeomorphic copy of the data in a lower-dimensional Euclidean space. Using Whitney's embedding theorem, we can relate this information to the natural dimension of the data. A drawback of the SAP algorithm is that a data set with n points has n(n−1)/2 secants, making the computation and storage of all secants infeasible for very large data sets. In this paper, we propose a novel algorithm that generalizes the SAP algorithm with an emphasis on addressing this issue. That is, we propose a hierarchical secant-based dimensionality-reduction method, which can be employed for data sets where explicitly calculating all secants is not feasible. [A brute-force secant-preservation score is sketched after this list.]
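
For item 1, a toy (ε, 0)-DP tester built from the Laplace mechanism, shown only to illustrate how privacy noise enters a test statistic. The clipping radius R and threshold tau are assumptions for the sketch, and this baseline does not attain the paper's sample complexities.

    # Toy (eps, 0)-DP test of H0: samples ~ N(mu_star, I_d), via the
    # Laplace mechanism -- NOT the near-optimal algorithm of the paper.
    import numpy as np

    def dp_identity_test(X, mu_star, eps, R=10.0, tau=0.5, seed=0):
        """X: (n, d) samples. Returns True if H0 is rejected."""
        rng = np.random.default_rng(seed)
        n = X.shape[0]
        # Clip each centered sample to l2-radius R, so one sample changes
        # the scalar statistic below by at most 2R/n (its sensitivity).
        Y = X - mu_star
        norms = np.linalg.norm(Y, axis=1, keepdims=True)
        Y = Y * np.minimum(1.0, R / np.maximum(norms, 1e-12))
        # Distance of the clipped empirical mean from the null mean.
        T = np.linalg.norm(Y.mean(axis=0))
        # Laplace noise at scale (sensitivity / eps) gives (eps, 0)-DP;
        # the final accept/reject bit is DP by post-processing.
        T_noisy = T + rng.laplace(scale=2 * R / (n * eps))
        return T_noisy > tau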
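
For item 2, a naive robust-mean baseline: start from the coordinate-wise median and iteratively trim the farthest points. It runs in O(Nd) time per iteration but does not achieve the dimension-independent error of the paper's SDP-based algorithms; it is included purely as a contrast point, and eps and iters are illustrative parameters.

    # Naive robust mean by iterative distance trimming -- a baseline,
    # not the paper's nearly-linear-time SDP method.
    import numpy as np

    def trimmed_mean(X, eps, iters=10):
        """X: (N, d) samples, eps: corruption fraction. Returns a mean estimate."""
        mu = np.median(X, axis=0)   # robust but dimension-dependent start
        for _ in range(iters):
            d2 = np.sum((X - mu) ** 2, axis=1)
            keep = d2 <= np.quantile(d2, 1.0 - eps)  # drop eps farthest points
            mu = X[keep].mean(axis=0)
        return mu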
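
For item 4, a worked example applying the calibrated period-luminosity and period-radius relations quoted in the abstract to a hypothetical 10-day Cepheid (the period is an arbitrary choice for illustration).

    # Evaluate the abstract's calibrations for a Cepheid with P = 10 days.
    import math

    log_P = math.log10(10.0)                  # log10 of the period in days
    M_Ks = -5.529 - 3.141 * (log_P - 0.9)     # absolute K_s magnitude
    log_R = 1.763 + 0.653 * (log_P - 0.9)     # log10 of radius (solar units)
    print(f"M_Ks = {M_Ks:.3f} mag, R = {10**log_R:.1f} R_sun")
    # -> M_Ks = -5.843 mag, R ~ 67 R_sun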
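
For item 5, a brute-force version of the secant-preservation score that SAP optimizes: form all n(n−1)/2 unit secants and report the worst shrinkage under a candidate orthonormal projection. This is exactly the O(n²) computation the paper's hierarchical method is designed to avoid, so it is usable only for small n; the function name is an assumption.

    # Score a projection by how well it preserves all pairwise secants.
    import numpy as np

    def worst_secant_shrinkage(X, P):
        """X: (n, d) data; P: (k, d) projection with orthonormal rows.
        Returns min over unit secants s of ||P s||; 1.0 = lengths preserved."""
        n = X.shape[0]
        i, j = np.triu_indices(n, k=1)     # all n(n-1)/2 point pairs
        S = X[i] - X[j]                    # secant vectors
        norms = np.linalg.norm(S, axis=1, keepdims=True)
        S = S / np.maximum(norms, 1e-12)   # unit secants (guards duplicates)
        return np.linalg.norm(S @ P.T, axis=1).min()

Scanning k = 1, 2, ... and taking the smallest k whose best projection keeps this score near 1 is one way to read off the "natural" dimension, per the Whitney-embedding argument in the abstract.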