We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for d-dimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)-differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including meanmore »
Projection Test with Sparse Optimal Direction for High-Dimensional One Sample Mean Problem
Testing whether the mean vector from some population is zero or not is a fundamental problem in statistics. In the high-dimensional regime, where the
dimension of data p is greater than the sample size n, traditional methods such as
Hotelling’s T2 test cannot be directly applied. One can project the high-dimensional
vector onto a space of low dimension and then traditional methods can be applied.
In this paper, we propose a projection test based on a new estimation of the optimal
projection direction Σ^{−1}μ. Under the assumption that the optimal projection Σ^{−1}μ is sparse, we use a regularized quadratic programming with nonconvex penalty and linear constraint to estimate it. Simulation studies and real data analysis are conducted to examine the finite sample performance of different tests in terms of type I error and power.
- Award ID(s):
- 1820702
- Publication Date:
- NSF-PAR ID:
- 10217341
- Journal Name:
- Contemporary Experimental Design, Multivariate Analysis and Data Mining
- Page Range or eLocation-ID:
- 295-309
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We study the fundamental problem of high-dimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimension-independent error guarantees for several families of structured distributions. In this work, we give the first nearly-linear time algorithms for high-dimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and sub-gaussian tails, and (ii) unknown bounded covariance. Given N samples on R^d, an \eps-fraction of which may be arbitrarily corrupted, our algorithms run in time eO(Nd)/poly(\eps) and approximate the true meanmore »
-
We analyze the performance of the Tukey median estimator under total variation (TV) distance corruptions. Previous results show that under Huber's additive corruption model, the breakdown point is 1/3 for high-dimensional halfspace-symmetric distributions. We show that under TV corruptions, the breakdown point reduces to 1/4 for the same set of distributions. We also show that a certain projection algorithm can attain the optimal breakdown point of 1/2. Both the Tukey median estimator and the projection algorithm achieve sample complexity linear in dimension.
-
Context. As primary anchors of the distance scale, Cepheid stars play a crucial role in our understanding of the distance scale of the Universe because of their period-luminosity relation. Determining precise and consistent parameters (radius, temperature, color excess, and projection factor) of Cepheid pulsating stars is therefore very important. Aims. With the high-precision parallaxes delivered by the early third Gaia data release (EDR3), we aim to derive various parameters of Cepheid stars in order to calibrate the period-luminosity and period-radius relations and to investigate the relation of period to p -factor. Methods. We applied an implementation of the parallax-of-pulsation methodmore »
-
A fundamental question in many data analysis settings is the problem of discerning the ``natural'' dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of Secant-Avoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, onemore »