We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0 < epsilon, delta < 1, we wish to distinguish, with probability at least 1 - delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV}(p, U_n) >= epsilon, we simply threshold d_{TV}(\hat{p}, U_n), where \hat{p} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions, including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.
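The plug-in uniformity tester described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the abstract does not say how the threshold is chosen, so the Monte Carlo calibration below (taking the empirical (1 - delta)-quantile of the statistic under U_n) is an assumption made only so the example is complete.

```python
import random
from collections import Counter


def empirical_tv_to_uniform(samples, n):
    """Return d_TV(p_hat, U_n) for samples drawn from {0, ..., n-1}."""
    m = len(samples)
    counts = Counter(samples)
    # Each element that was never observed contributes |0 - 1/n| to the l1 distance.
    unseen = n - len(counts)
    l1 = sum(abs(c / m - 1.0 / n) for c in counts.values()) + unseen / n
    return 0.5 * l1


def calibrate_threshold(n, m, delta, trials=2000, seed=0):
    """Assumed calibration (not from the abstract): Monte Carlo estimate of the
    (1 - delta)-quantile of the statistic when the samples are truly uniform."""
    rng = random.Random(seed)
    stats = sorted(
        empirical_tv_to_uniform([rng.randrange(n) for _ in range(m)], n)
        for _ in range(trials)
    )
    idx = min(trials - 1, int((1 - delta) * trials))
    return stats[idx]


def plugin_uniformity_test(samples, n, threshold):
    """Accept (return True) when the plug-in statistic is at most the threshold."""
    return empirical_tv_to_uniform(samples, n) <= threshold
```

Note that with far fewer samples than domain elements the statistic is large even under U_n, because most elements are unseen; the tester relies on the gap between the statistic's typical values in the uniform and far cases, which is why the threshold must depend on n and the number of samples m.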
Which Distribution Distances are Sublinearly Testable?
Given samples from an unknown distribution p and a description of a distribution q, are p and q close or far? This question of “identity testing” has received significant attention in the case of testing whether p and q are equal or far in total variation distance. However, in recent work [VV11a, ADK15, DP17], the following questions have been critical to solving problems at the frontiers of distribution testing:
Alternative Distances: Can we test whether p and q are far in other distances, say Hellinger?
Tolerance: Can we test when p and q are close, rather than equal? And if so, close in which distances?
Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, ℓ_2, Hellinger, Kullback-Leibler, and χ^2. For each pair of distances d_1 and d_2, we study the complexity of testing if p and q are close in d_1 versus far in d_2, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity O(n^{1−γ}) for some γ > 0, where n is the size of the support of the distributions p and q). We provide matching upper and lower bounds for each case. We also study these questions in the case where we only have samples from q (equivalence testing), showing qualitative differences from identity testing in terms of when tolerance can be achieved. Our algorithms fall into the classical paradigm of χ^2-statistics, but require crucial changes to handle the challenges introduced by each distance we consider. Finally, we survey other recent results in an attempt to serve as a reference for the complexity of various distribution testing problems.
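To make the χ^2-statistic paradigm concrete, here is a minimal Python sketch of one commonly used χ^2-style statistic for identity testing. The abstract does not specify the exact statistics or the per-distance modifications, so the form below, Z = sum_i ((N_i − m q_i)^2 − N_i) / (m q_i), the function name, and the requirement that q has full support are all assumptions for illustration. Under Poissonized sampling this Z has expectation m · χ^2(p, q), so it concentrates near zero when p = q and grows when p is far from q.

```python
from collections import Counter


def chi_square_statistic(samples, q):
    """Compute Z = sum_i ((N_i - m*q_i)^2 - N_i) / (m*q_i).

    samples: observations in {0, ..., len(q)-1}
    q: explicit reference distribution (assumed here to have q_i > 0 for all i)
    """
    m = len(samples)
    counts = Counter(samples)
    z = 0.0
    for i, q_i in enumerate(q):
        if q_i <= 0:
            # Assumption of this sketch: the reference distribution has full support.
            raise ValueError("q must assign positive mass to every element")
        n_i = counts.get(i, 0)
        # Subtracting n_i removes the variance term, leaving the squared bias.
        z += ((n_i - m * q_i) ** 2 - n_i) / (m * q_i)
    return z
```

A tester built on this statistic would compare Z against a threshold that depends on m and the distance parameters; that threshold is not given in the abstract and is left out here.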
 Award ID(s):
 1741137
 NSFPAR ID:
 10079725
 Date Published:
 Journal Name:
 Proceedings of the annual ACM-SIAM Symposium on Discrete Algorithms
 ISSN:
 1071-9040
 Page Range / eLocation ID:
 2747-2764
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for d-dimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)-differential privacy, using only Õ(d^{1/2}/α^2 + d^{1/3}/(α^{4/3}·ε^{2/3}) + 1/(α·ε)) samples if the algorithm is allowed to be computationally inefficient, and only Õ(d^{1/2}/α^2 + d^{1/4}/(α·ε)) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}^d, and tolerant testing. Our results improve over the previous best work of Canonne et al. [CanonneKMUZ20] for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal non-private sample complexity of O(√d/α^2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of d-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d [AcharyaSZ18], which refutes a conjectured lower bound of [CanonneKMUZ20].

There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size n, distinguishing the uniform distribution from distributions that are far from uniform in ℓ_1-distance uses only O(√n) samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is ϵ-close to uniform from the case where the distribution is (1−ϵ)-far from uniform. The latter task requires nearly linear in n samples (Valiant, 2008; Valiant and Valiant, 2017a). In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known a priori. Focusing on the identity and closeness testing problems leads to the following mixture testing question: Given samples of distributions p, q_1, q_2, can we test if p is a mixture of q_1 and q_2? We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable. Our results show that the sample complexity of our testers is exactly the same as for the classical non-mixture case.

We propose a new setting for testing properties of distributions while receiving samples from several distributions, but few samples per distribution. Given samples from s distributions, p_1, p_2, …, p_s, we design testers for the following problems: (1) Uniformity Testing: Testing whether all the p_i’s are uniform or ε-far from being uniform in ℓ_1-distance, (2) Identity Testing: Testing whether all the p_i’s are equal to an explicitly given distribution q or ε-far from q in ℓ_1-distance, and (3) Closeness Testing: Testing whether all the p_i’s are equal to a distribution q which we have sample access to, or ε-far from q in ℓ_1-distance. By assuming an additional natural condition about the source distributions, we provide sample-optimal testers for all of these problems.