We provide improved differentially private algorithms for identity testing of highdimensional distributions. Specifically, for ddimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including meanmore »
Projection Test with Sparse Optimal Direction for HighDimensional One Sample Mean Problem
Testing whether the mean vector from some population is zero or not is a fundamental problem in statistics. In the highdimensional regime, where the
dimension of data p is greater than the sample size n, traditional methods such as
Hotelling’s T2 test cannot be directly applied. One can project the highdimensional
vector onto a space of low dimension and then traditional methods can be applied.
In this paper, we propose a projection test based on a new estimation of the optimal
projection direction Σ^{−1}μ. Under the assumption that the optimal projection Σ^{−1}μ is sparse, we use a regularized quadratic programming with nonconvex penalty and linear constraint to estimate it. Simulation studies and real data analysis are conducted to examine the finite sample performance of different tests in terms of type I error and power.
 Award ID(s):
 1820702
 Publication Date:
 NSFPAR ID:
 10217341
 Journal Name:
 Contemporary Experimental Design, Multivariate Analysis and Data Mining
 Page Range or eLocationID:
 295309
 Sponsoring Org:
 National Science Foundation
More Like this


We study the fundamental problem of highdimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimensionindependent error guarantees for several families of structured distributions. In this work, we give the first nearlylinear time algorithms for highdimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and subgaussian tails, and (ii) unknown bounded covariance. Given N samples on R^d, an \epsfraction of which may be arbitrarily corrupted, our algorithms run in time eO(Nd)/poly(\eps) and approximate the true meanmore »

We analyze the performance of the Tukey median estimator under total variation (TV) distance corruptions. Previous results show that under Huber's additive corruption model, the breakdown point is 1/3 for highdimensional halfspacesymmetric distributions. We show that under TV corruptions, the breakdown point reduces to 1/4 for the same set of distributions. We also show that a certain projection algorithm can attain the optimal breakdown point of 1/2. Both the Tukey median estimator and the projection algorithm achieve sample complexity linear in dimension.

Context. As primary anchors of the distance scale, Cepheid stars play a crucial role in our understanding of the distance scale of the Universe because of their periodluminosity relation. Determining precise and consistent parameters (radius, temperature, color excess, and projection factor) of Cepheid pulsating stars is therefore very important. Aims. With the highprecision parallaxes delivered by the early third Gaia data release (EDR3), we aim to derive various parameters of Cepheid stars in order to calibrate the periodluminosity and periodradius relations and to investigate the relation of period to p factor. Methods. We applied an implementation of the parallaxofpulsation methodmore »

A fundamental question in many data analysis settings is the problem of discerning the ``natural'' dimension of a data set. That is, when a data set is drawn from a manifold (possibly with noise), a meaningful aspect of the data is the dimension of that manifold. Various approaches exist for estimating this dimension, such as the method of SecantAvoidance Projection (SAP). Intuitively, the SAP algorithm seeks to determine a projection which best preserves the lengths of all secants between points in a data set; by applying the algorithm to find the best projections to vector spaces of various dimensions, onemore »