Title: Testing Ising Models
Given samples from an unknown multivariate distribution p, is it possible to distinguish whether p is the product of its marginals versus p being far from every product distribution? Similarly, is it possible to distinguish whether p equals a given distribution q versus p and q being far from each other? These problems of testing independence and goodness-of-fit have received enormous attention in statistics, information theory, and theoretical computer science, with sample-optimal algorithms known in several interesting regimes of parameters [BFF+01, Pan08, VV17, ADK15, DK16]. Unfortunately, it has also been understood that these problems become intractable in large dimensions, necessitating exponential sample complexity. Motivated by the exponential lower bounds for general distributions as well as the ubiquity of Markov Random Fields (MRFs) in the modeling of high-dimensional distributions, we initiate the study of distribution testing on structured multivariate distributions, and in particular the prototypical example of MRFs: the Ising Model. We demonstrate that, in this structured setting, we can avoid the curse of dimensionality, obtaining sample- and time-efficient testers for independence and goodness-of-fit. One of the key technical challenges we face along the way is bounding the variance of functions of the Ising model.
Award ID(s):
1741137 1650733
NSF-PAR ID:
10079726
Journal Name:
Proceedings of the annual ACM-SIAM Symposium on Discrete Algorithms
ISSN:
1557-9468
Page Range / eLocation ID:
1989-2007
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0 < epsilon, delta < 1, we wish to distinguish, with probability at least 1-delta, whether the distributions are identical versus epsilon-far in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via black-box amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that black-box amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV}(p, U_n) >= epsilon, we simply threshold d_{TV}(\hat{p}, U_n), where \hat{p} is the empirical probability distribution. The fact that this simple "plug-in" estimator is sample-optimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worst-case failure probability for a broad class of uniformity testing statistics over all possible input distributions - including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.
  2. We propose a new setting for testing properties of distributions while receiving samples from several distributions, but few samples per distribution. Given samples from s distributions, p_1, p_2, …, p_s, we design testers for the following problems: (1) Uniformity Testing: Testing whether all the p_i’s are uniform or ε-far from being uniform in ℓ_1-distance; (2) Identity Testing: Testing whether all the p_i’s are equal to an explicitly given distribution q or ε-far from q in ℓ_1-distance; and (3) Closeness Testing: Testing whether all the p_i’s are equal to a distribution q which we have sample access to, or ε-far from q in ℓ_1-distance. By assuming an additional natural condition about the source distributions, we provide sample optimal testers for all of these problems.
  3. We propose a new setting for testing properties of distributions while receiving samples from several distributions, but few samples per distribution. Given samples from s distributions, p_1, p_2, …, p_s, we design testers for the following problems: (1) Uniformity Testing: Testing whether all the p_i’s are uniform or ε-far from being uniform in ℓ_1-distance; (2) Identity Testing: Testing whether all the p_i’s are equal to an explicitly given distribution q or ε-far from q in ℓ_1-distance; and (3) Closeness Testing: Testing whether all the p_i’s are equal to a distribution q which we have sample access to, or ε-far from q in ℓ_1-distance. By assuming an additional natural condition about the source distributions, we provide sample optimal testers for all of these problems.
  4. Suppose that we are given sample access to an unknown distribution p over n elements and an explicit distribution q over the same n elements. We would like to reject the null hypothesis “p = q” after seeing as few samples as possible when p ≠ q, while we never want to reject the null when p = q. Well-known results show that Θ(√n/ε^2) samples are necessary and sufficient for distinguishing whether p equals q versus p is ε-far from q in total variation distance. However, this requires the distinguishing radius ε to be fixed prior to deciding how many samples to request. Our goal is instead to design sequential hypothesis testers, i.e. online algorithms that request i.i.d. samples from p and stop as soon as they can confidently reject the hypothesis p = q, without being given a lower bound on the distance between p and q, when p ≠ q. In particular, we want to minimize the number of samples requested by our tests as a function of the distance between p and q, and if p = q we want the algorithm, with high probability, to never reject the null. Our work is motivated by and addresses the practical challenge of sequential A/B testing in Statistics. We show that, when n = 2, any sequential hypothesis test must see Ω((1/d_tv(p,q)^2) log log(1/d_tv(p,q))) samples, with high (constant) probability, before it rejects p = q, where d_tv(p,q) is the total variation distance between p and q (unknown to the tester). We match the dependence of this lower bound on d_tv(p,q) by proposing a sequential tester that rejects p = q from at most O((√n/d_tv(p,q)^2) log log(1/d_tv(p,q))) samples with high (constant) probability. The Ω(√n) dependence on the support size n is also known to be necessary. We similarly provide two-sample sequential hypothesis testers, when sample access is given to both p and q, and discuss applications to sequential A/B testing.
  5. Given samples from an unknown distribution p and a description of a distribution q, are p and q close or far? This question of “identity testing” has received significant attention in the case of testing whether p and q are equal or far in total variation distance. However, in recent work [VV11a, ADK15, DP17], the following questions have been critical to solving problems at the frontiers of distribution testing: Alternative Distances: Can we test whether p and q are far in other distances, say Hellinger? Tolerance: Can we test when p and q are close, rather than equal? And if so, close in which distances? Motivated by these questions, we characterize the complexity of distribution testing under a variety of distances, including total variation, ℓ_2, Hellinger, Kullback-Leibler, and χ^2. For each pair of distances d_1 and d_2, we study the complexity of testing if p and q are close in d_1 versus far in d_2, with a focus on identifying which problems allow strongly sublinear testers (i.e., those with complexity O(n^{1-γ}) for some γ > 0 where n is the size of the support of the distributions p and q). We provide matching upper and lower bounds for each case. We also study these questions in the case where we only have samples from q (equivalence testing), showing qualitative differences from identity testing in terms of when tolerance can be achieved. Our algorithms fall into the classical paradigm of χ^2-statistics, but require crucial changes to handle the challenges introduced by each distance we consider. Finally, we survey other recent results in an attempt to serve as a reference for the complexity of various distribution testing problems.
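
As an illustration of the "plug-in" uniformity tester mentioned in the first related abstract above (threshold the total variation distance between the empirical distribution and U_n), here is a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the threshold is calibrated here by Monte Carlo simulation under the uniform distribution, which is an assumption made for illustration rather than the analytic threshold from that work.

```python
import random
from collections import Counter


def empirical_tv_to_uniform(samples, n):
    """Total variation distance between the empirical distribution of
    `samples` (values in range(n)) and the uniform distribution U_n."""
    m = len(samples)
    counts = Counter(samples)
    # Contribution of elements that appear at least once.
    observed = sum(abs(c / m - 1.0 / n) for c in counts.values())
    # Each element that never appears contributes |0 - 1/n|.
    unobserved = (n - len(counts)) / n
    return 0.5 * (observed + unobserved)


def calibrate_threshold(n, m, trials=200, quantile=0.95, seed=0):
    """Assumed calibration step (not from the source): simulate the statistic
    under U_n and return an empirical quantile to use as the threshold."""
    rng = random.Random(seed)
    stats = sorted(
        empirical_tv_to_uniform([rng.randrange(n) for _ in range(m)], n)
        for _ in range(trials)
    )
    return stats[min(trials - 1, int(quantile * trials))]


def plug_in_uniformity_test(samples, n, threshold):
    """Return True to accept 'p = U_n', False to reject."""
    return empirical_tv_to_uniform(samples, n) <= threshold
```

A caller would typically compute `threshold = calibrate_threshold(n, m)` once for the chosen domain size and sample count, then pass it to `plug_in_uniformity_test`. The simulated quantile only controls the false-rejection rate under uniformity; the sample-complexity guarantees stated in the abstract depend on its own analysis, which this sketch does not reproduce.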