We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for $d$-dimensional Gaussian distributions with known covariance $\Sigma$, we can test whether the distribution comes from $\mathcal{N}(\mu^*, \Sigma)$ for some fixed $\mu^*$ or from some $\mathcal{N}(\mu, \Sigma)$ with total variation distance at least $\alpha$ from $\mathcal{N}(\mu^*, \Sigma)$ with $(\varepsilon, 0)$-differential privacy, using only $\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/3}}{\alpha^{4/3} \cdot \varepsilon^{2/3}} + \frac{1}{\alpha \cdot \varepsilon}\right)$ samples if the algorithm is allowed to be computationally inefficient, and only $\tilde{O}\left(\frac{d^{1/2}}{\alpha^2} + \frac{d^{1/4}}{\alpha \cdot \varepsilon}\right)$ samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over $\{-1,1\}^d$, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of $O\left(\frac{\sqrt{d}}{\alpha^2}\right)$ in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of $d$-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size $d$ \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}.
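To get a feel for how the three sample-complexity rates in the abstract above compare, the following sketch evaluates them numerically. It drops all constants and logarithmic factors, which is an assumption made purely for illustration; the function names are hypothetical, and the outputs are orders of magnitude rather than sample counts guaranteed by the paper.

```python
import math


def nonprivate_bound(d, alpha):
    """sqrt(d) / alpha^2: the non-private rate quoted in the abstract."""
    return math.sqrt(d) / alpha ** 2


def private_inefficient_bound(d, alpha, eps):
    """Three-term rate stated for the computationally inefficient tester."""
    return (
        math.sqrt(d) / alpha ** 2
        + d ** (1 / 3) / (alpha ** (4 / 3) * eps ** (2 / 3))
        + 1 / (alpha * eps)
    )


def private_efficient_bound(d, alpha, eps):
    """Two-term rate stated for the computationally efficient tester."""
    return math.sqrt(d) / alpha ** 2 + d ** 0.25 / (alpha * eps)


if __name__ == "__main__":
    d, alpha = 10_000, 0.1  # illustrative values, not taken from the paper
    for eps in (1.0, 0.1, 0.01):
        print(
            f"eps={eps}: "
            f"non-private={nonprivate_bound(d, alpha):.0f}, "
            f"inefficient={private_inefficient_bound(d, alpha, eps):.0f}, "
            f"efficient={private_efficient_bound(d, alpha, eps):.0f}"
        )
```

With $d$ and $\alpha$ held fixed, shrinking $\varepsilon$ makes the privacy-dependent terms grow while the first term stays the same, which is the sense in which the private rates can coincide with the non-private one when $\varepsilon$ is not too small.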
Differentially Private Identity and Equivalence Testing of Discrete Distributions
We study the fundamental problems of identity and equivalence testing over a discrete population from random samples. Our goal is to develop efficient testers while guaranteeing differential privacy to the individuals of the population. We provide sample-efficient differentially private testers for these problems. Our theoretical results significantly improve over the best known algorithms for identity testing, and are the first results for private equivalence testing. The conceptual message of our work is that there exist private hypothesis testers that are nearly as sample-efficient as their non-private counterparts. We perform an experimental evaluation of our algorithms on synthetic data. Our experiments illustrate that our private testers achieve small type I and type II errors with sample size sublinear in the domain size of the underlying distributions.
 Publication Date:
 NSF-PAR ID:
 10065946
 Journal Name:
 Proceedings of the 35th International Conference on Machine Learning (ICML 2018)
 Page Range or eLocation-ID:
 169-178
 Sponsoring Org:
 National Science Foundation
More Like this


Given samples from an unknown distribution $p$ and a description of a distribution $q$, are $p$ and $q$ close or far? This question of “identity testing” has received significant attention in the case of testing whether $p$ and $q$ are equal or far in total variation distance. However, in recent work [VV11a, ADK15, DP17], the following questions have been critical to solving problems at the frontiers of distribution testing. Alternative distances: Can we test whether $p$ and $q$ are far in other distances, say Hellinger? Tolerance: Can we test when $p$ and $q$ are close, rather than equal? And if so, close in which distances? Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, $\ell_2$, Hellinger, Kullback-Leibler, and $\chi^2$. For each pair of distances $d_1$ and $d_2$, we study the complexity of testing if $p$ and $q$ are close in $d_1$ versus far in $d_2$, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity $O(n^{1-\gamma})$ for some $\gamma > 0$, where $n$ is the size of the support of the distributions $p$ and $q$). We provide matching upper and lower bounds for each …
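For readers less familiar with the distances named in this abstract, here is a minimal sketch of how each can be computed for two explicitly given discrete distributions. It uses the common textbook definitions; normalization conventions (for example the factor in the Hellinger distance, or natural versus base-2 logarithm in Kullback-Leibler) vary between papers, so the choices below are assumptions and not necessarily those of the authors. The function names are hypothetical.

```python
import math


def total_variation(p, q):
    """Half the l1 distance between the two probability vectors."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def l2_distance(p, q):
    """Euclidean distance between the two probability vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def hellinger(p, q):
    """Hellinger distance, normalized so that it lies in [0, 1]."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(0.5 * s)


def kl_divergence(p, q):
    """KL(p || q) in nats; assumes q[i] > 0 wherever p[i] > 0."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def chi_squared(p, q):
    """Chi-squared divergence of p from q; assumes q[i] > 0 for all i."""
    return sum((a - b) ** 2 / b for a, b in zip(p, q))


if __name__ == "__main__":
    p = [0.5, 0.5]
    q = [0.25, 0.75]
    for name, fn in [
        ("TV", total_variation),
        ("l2", l2_distance),
        ("Hellinger", hellinger),
        ("KL", kl_divergence),
        ("chi^2", chi_squared),
    ]:
        print(name, fn(p, q))
```

Under these conventions the distances satisfy the standard chain $H^2 \le d_{TV} \le \sqrt{2}\,H$ and $2\,d_{TV}^2 \le \mathrm{KL} \le \chi^2$, which is one reason "close in $d_1$ versus far in $d_2$" can be easier or harder depending on the pair chosen.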

There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size $n$, distinguishing the uniform distribution from distributions that are far from uniform in $\ell_1$-distance uses only $O(\sqrt{n})$ samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is $\epsilon$-close to uniform from the case where the distribution is $(1-\epsilon)$-far from uniform. The latter task requires nearly linear in $n$ samples (Valiant, 2008; Valiant and Valiant, 2017a). In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known \emph{a priori}. Focusing on the identity and closeness …
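The mixture noise model described in this abstract can be illustrated with a small sketch. The mixing weight `eta`, the function names, and the example distributions are all hypothetical and chosen only for illustration; the abstract does not specify how the mixture is parameterized, so this shows one natural reading and not the authors' construction.

```python
import random


def mix(p, noise, eta):
    """Probability vector of the mixture (1 - eta) * p + eta * noise."""
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return [(1 - eta) * a + eta * b for a, b in zip(p, noise)]


def sample_noisy(p, noise, eta, m, rng):
    """Draw m samples: each comes from `noise` with probability eta, else from p."""
    domain = range(len(p))
    samples = []
    for _ in range(m):
        source = noise if rng.random() < eta else p
        samples.append(rng.choices(domain, weights=source)[0])
    return samples


if __name__ == "__main__":
    rng = random.Random(0)        # fixed seed so the run is repeatable
    p = [0.7, 0.1, 0.1, 0.1]      # hypothetical original distribution
    noise = [0.25] * 4            # hypothetical noise, known to the tester
    print(mix(p, noise, eta=0.2))
    print(sample_noisy(p, noise, eta=0.2, m=10, rng=rng))
```

Sampling in two stages (first pick the source, then draw from it) yields exactly the distribution returned by `mix`, so a tester with explicit knowledge of `noise` can reason about the mixture's probability vector directly.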

Statistical tests are at the heart of many scientific tasks. To validate their hypotheses, researchers in medical and social sciences use individuals' data. The sensitivity of participants' data requires the design of statistical tests that ensure the privacy of the individuals in the most efficient way. In this paper, we use the framework of property testing to design differentially private algorithms that test properties of the distribution from which the data is drawn. In particular, we investigate testing two fundamental properties of distributions: (1) testing the equivalence of two distributions when we have unequal numbers of samples from the two distributions, and (2) testing independence of two random variables. In both cases, we show that our testers achieve near-optimal sample complexity (up to logarithmic factors). Moreover, our dependence on the privacy parameter is an additive term, which indicates that differential privacy can be obtained for free in most parameter regimes.