skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 11:00 PM ET on Friday, December 13 until 2:00 AM ET on Saturday, December 14 due to maintenance. We apologize for the inconvenience.


Title: Complexity of High-Dimensional Identity Testing with Coordinate Conditional Sampling
We study the identity testing problem for high-dimensional distributions. Given as input an explicit distribution μ, an ε>0, and access to sampling oracle(s) for a hidden distribution π, the goal in identity testing is to distinguish whether the two distributions μ and π are identical or are at least ε-far apart. When there is only access to full samples from the hidden distribution π, it is known that exponentially many samples (in the dimension) may be needed for identity testing, and hence previous works have studied identity testing with additional access to various “conditional” sampling oracles. We consider a significantly weaker conditional sampling oracle, which we call the Coordinate Oracle, and provide a computational and statistical characterization of the identity testing problem in this new model. We prove that if an analytic property known as approximate tensorization of entropy holds for an n-dimensional visible distribution μ, then there is an efficient identity testing algorithm for any hidden distribution π using O˜(n/ε) queries to the Coordinate Oracle. Approximate tensorization of entropy is a pertinent condition as recent works have established it for a large class of high-dimensional distributions. We also prove a computational phase transition: for a well-studied class of n-dimensional distributions, specifically sparse antiferromagnetic Ising models over {+1,−1}^n, we show that in the regime where approximate tensorization of entropy fails, there is no efficient identity testing algorithm unless RP=NP. We complement our results with a matching Ω(n/ε) statistical lower bound for the sample complexity of identity testing in the model.  more » « less
Award ID(s):
2007287
PAR ID:
10548827
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
COLT 2023
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for d-dimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)-differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}d, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of O(d√α2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of d-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}. 
    more » « less
  2. A conditional sampling oracle for a probability distribution D returns samples from the conditional distribution of D restricted to a specified subset of the domain. A recent line of work (Chakraborty et al. 2013 and Cannone et al. 2014) has shown that having access to such a conditional sampling oracle requires only polylogarithmic or even constant number of samples to solve distribution testing problems like identity and uniformity. This significantly improves over the standard sampling model where polynomially many samples are necessary. Inspired by these results, we introduce a computational model based on conditional sampling to develop sublinear algorithms with exponentially faster runtimes compared to standard sublinear algorithms. We focus on geometric optimization problems over points in high dimensional Euclidean space. Access to these points is provided via a conditional sampling oracle that takes as input a succinct representation of a subset of the domain and outputs a uniformly random point in that subset. We study two well studied problems: k-means clustering and estimating the weight of the minimum spanning tree. In contrast to prior algorithms for the classic model, our algorithms have time, space and sample complexity that is polynomial in the dimension and polylogarithmic in the number of points. Finally, we comment on the applicability of the model and compare with existing ones like streaming, parallel and distributed computational models. 
    more » « less
  3. We study the identity testing problem in the context of spin systems or undirected graphical models, where it takes the following form: given the parameter specification of the model M and a sampling oracle for the distribution \mu_{M^*} of an unknown model M^*, can we efficiently determine if the two models M and M^* are the same? We consider identity testing for both soft-constraint and hard-constraint systems. In particular, we prove hardness results in two prototypical cases, the Ising model and proper colorings, and explore whether identity testing is any easier than structure learning. For the ferromagnetic (attractive) Ising model, Daskalasis et al. (2018) presented a polynomial time algorithm for identity testing. We prove hardness results in the antiferromagnetic (repulsive) setting in the same regime of parameters where structure learning is known to require a super-polynomial number of samples. In particular, for n-vertex graphs of maximum degree d, we prove that if |\beta| d = \omega(\log n) (where \beta is the inverse temperature parameter), then there is no identity testing algorithm for the antiferromagnetic Ising model that runs in polynomial time unless RP = NP. We also establish computational lower bounds for a broader set of parameters under the (randomized) exponential time hypothesis. In our proofs, we use random graphs as gadgets; this is inspired by similar constructions in seminal works on the hardness of approximate counting. In the hard-constraint setting, we present hardness results for identity testing for proper colorings. Our results are based on the presumed hardness of #BIS, the problem of (approximately) counting independent sets in bipartite graphs. In particular, we prove that identity testing for colorings is hard in the same range of parameters where structure learning is known to be hard, which in turn matches the parameter regime for NP-hardness of the corresponding decision problem. 
    more » « less
  4. null (Ed.)
    We propose a new setting for testing properties of distributions while receiving samples from several distributions, but few samples per distribution. Given samples from s distributions, p_1, p_2, …, p_s, we design testers for the following problems: (1) Uniformity Testing: Testing whether all the p_i’s are uniform or ε-far from being uniform in ℓ_1-distance (2) Identity Testing: Testing whether all the p_i’s are equal to an explicitly given distribution q or ε-far from q in ℓ_1-distance, and (3) Closeness Testing: Testing whether all the p_i’s are equal to a distribution q which we have sample access to, or ε-far from q in ℓ_1-distance. By assuming an additional natural condition about the source distributions, we provide sample optimal testers for all of these problems. 
    more » « less
  5. We propose a new setting for testing properties of distributions while receiving samples from several distributions, but few samples per distribution. Given samples from s distributions, p_1, p_2, …, p_s, we design testers for the following problems: (1) Uniformity Testing: Testing whether all the p_i’s are uniform or ε-far from being uniform in ℓ_1-distance (2) Identity Testing: Testing whether all the p_i’s are equal to an explicitly given distribution q or ε-far from q in ℓ_1-distance, and (3) Closeness Testing: Testing whether all the p_i’s are equal to a distribution q which we have sample access to, or ε-far from q in ℓ_1-distance. By assuming an additional natural condition about the source distributions, we provide sample optimal testers for all of these problems. 
    more » « less