Uniformity testing is one of the most wellstudied problems in property testing, with many known test statistics, including ones based on counting collisions, singletons, and the empirical TV distance. It is known that the optimal sample complexity to distinguish the uniform distribution on m elements from any ϵfar distribution with 1−δ probability is n=Θ(mlog(1/δ)√ϵ2+log(1/δ)ϵ2), which is achieved by the empirical TV tester. Yet in simulation, these theoretical analyses are misleading: in many cases, they do not correctly rank order the performance of existing testers, even in an asymptotic regime of all parameters tending to 0 or ∞.
We explain this discrepancy by studying the \emph{constant factors} required by the algorithms. We show that the collisions tester achieves a sharp maximal constant in the number of standard deviations of separation between uniform and nonuniform inputs. We then introduce a new tester based on the Huber loss, and show that it not only matches this separation, but also has tails corresponding to a Gaussian with this separation. This leads to a sample complexity of (1+o(1))mlog(1/δ)√ϵ2 in the regime where this term is dominant, unlike all other existing testers.
more »
« less
SampleOptimal Identity Testing with High Probability
We study the problem of testing identity against a given distribution with a focus on the high confidence regime. More precisely, given samples from an unknown distribution p over n elements, an explicitly given distribution q, and parameters 0< epsilon, delta < 1, we wish to distinguish, with probability at least 1delta, whether the distributions are identical versus epsilonfar in total variation distance. Most prior work focused on the case that delta = Omega(1), for which the sample complexity of identity testing is known to be Theta(sqrt{n}/epsilon^2). Given such an algorithm, one can achieve arbitrarily small values of delta via blackbox amplification, which multiplies the required number of samples by Theta(log(1/delta)). We show that blackbox amplification is suboptimal for any delta = o(1), and give a new identity tester that achieves the optimal sample complexity. Our new upper and lower bounds show that the optimal sample complexity of identity testing is Theta((1/epsilon^2) (sqrt{n log(1/delta)} + log(1/delta))) for any n, epsilon, and delta. For the special case of uniformity testing, where the given distribution is the uniform distribution U_n over the domain, our new tester is surprisingly simple: to test whether p = U_n versus d_{TV} (p, U_n) >= epsilon, we simply threshold d_{TV}({p^}, U_n), where {p^} is the empirical probability distribution. The fact that this simple "plugin" estimator is sampleoptimal is surprising, even in the constant delta case. Indeed, it was believed that such a tester would not attain sublinear sample complexity even for constant values of epsilon and delta. An important contribution of this work lies in the analysis techniques that we introduce in this context. First, we exploit an underlying strong convexity property to bound from below the expectation gap in the completeness and soundness cases. Second, we give a new, fast method for obtaining provably correct empirical estimates of the true worstcase failure probability for a broad class of uniformity testing statistics over all possible input distributions  including all previously studied statistics for this problem. We believe that our novel analysis techniques will be useful for other distribution testing problems as well.
more »
« less
 NSFPAR ID:
 10075636
 Date Published:
 Journal Name:
 45th International Colloquium on Automata, Languages, and Automata
 ISSN:
 01701495
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


The noise sensitivity of a Boolean function f: {0,1}^n  > {0,1} is one of its fundamental properties. For noise parameter delta, the noise sensitivity is denoted as NS_{delta}[f]. This quantity is defined as follows: First, pick x = (x_1,...,x_n) uniformly at random from {0,1}^n, then pick z by flipping each x_i independently with probability delta. NS_{delta}[f] is defined to equal Pr [f(x) != f(z)]. Much of the existing literature on noise sensitivity explores the following two directions: (1) Showing that functions with low noisesensitivity are structured in certain ways. (2) Mathematically showing that certain classes of functions have low noise sensitivity. Combined, these two research directions show that certain classes of functions have low noise sensitivity and therefore have useful structure. The fundamental importance of noise sensitivity, together with this wealth of structural results, motivates the algorithmic question of approximating NS_{delta}[f] given an oracle access to the function f. We show that the standard sampling approach is essentially optimal for general Boolean functions. Therefore, we focus on estimating the noise sensitivity of monotone functions, which form an important subclass of Boolean functions, since many functions of interest are either monotone or can be simply transformed into a monotone function (for example the class of unate functions consists of all the functions that can be made monotone by reorienting some of their coordinates [O'Donnell, 2014]). Specifically, we study the algorithmic problem of approximating NS_{delta}[f] for monotone f, given the promise that NS_{delta}[f] >= 1/n^{C} for constant C, and for delta in the range 1/n <= delta <= 1/2. For such f and delta, we give a randomized algorithm performing O((min(1,sqrt{n} delta log^{1.5} n))/(NS_{delta}[f]) poly (1/epsilon)) queries and approximating NS_{delta}[f] to within a multiplicative factor of (1 +/ epsilon). Given the same constraints on f and delta, we also prove a lower bound of Omega((min(1,sqrt{n} delta))/(NS_{delta}[f] * n^{xi})) on the query complexity of any algorithm that approximates NS_{delta}[f] to within any constant factor, where xi can be any positive constant. Thus, our algorithm's query complexity is close to optimal in terms of its dependence on n. We introduce a novel descendingascending view of noise sensitivity, and use it as a central tool for the analysis of our algorithm. To prove lower bounds on query complexity, we develop a technique that reduces computational questions about query complexity to combinatorial questions about the existence of "thin" functions with certain properties. The existence of such "thin" functions is proved using the probabilistic method. These techniques also yield new lower bounds on the query complexity of approximating other fundamental properties of Boolean functions: the total influence and the bias.more » « less

We study local symmetry breaking problems in the Congest model, focusing on ruling set problems, which generalize the fundamental Maximal Independent Set (MIS) problem. The time (round) complexity of MIS (and ruling sets) have attracted much attention in the Local model. Indeed, recent results (Barenboim et al., FOCS 2012, Ghaffari SODA 2016) for the MIS problem have tried to break the longstanding O(log n)round "barrier" achieved by Luby's algorithm, but these yield o(log n)round complexity only when the maximum degree Delta is somewhat small relative to n. More importantly, these results apply only in the Local model. In fact, the best known time bound in the Congest model is still O(log n) (via Luby's algorithm) even for moderately small Delta (i.e., for Delta = Omega(log n) and Delta = o(n)). Furthermore, message complexity has been largely ignored in the context of local symmetry breaking. Luby's algorithm takes O(m) messages on medge graphs and this is the best known bound with respect to messages. Our work is motivated by the following central question: can we break the Theta(log n) time complexity barrier and the Theta(m) message complexity barrier in the Congest model for MIS or closelyrelated symmetry breaking problems? This paper presents progress towards this question for the distributed ruling set problem in the Congest model. A betaruling set is an independent set such that every node in the graph is at most beta hops from a node in the independent set. We present the following results:  Time Complexity: We show that we can break the O(log n) "barrier" for 2 and 3ruling sets. We compute 3ruling sets in O(log n/log log n) rounds with high probability (whp). More generally we show that 2ruling sets can be computed in O(log Delta (log n)^(1/2 + epsilon) + log n/log log n) rounds for any epsilon > 0, which is o(log n) for a wide range of Delta values (e.g., Delta = 2^(log n)^(1/2epsilon)). These are the first 2 and 3ruling set algorithms to improve over the O(log n)round complexity of Luby's algorithm in the Congest model.  Message Complexity: We show an Omega(n^2) lower bound on the message complexity of computing an MIS (i.e., 1ruling set) which holds also for randomized algorithms and present a contrast to this by showing a randomized algorithm for 2ruling sets that, whp, uses only O(n log^2 n) messages and runs in O(Delta log n) rounds. This is the first messageefficient algorithm known for ruling sets, which has message complexity nearly linear in n (which is optimal up to a polylogarithmic factor).more » « less

This paper focuses on showing timemessage tradeoffs in distributed algorithms for fundamental problems such as leader election, broadcast, spanning tree (ST), minimum spanning tree (MST), minimum cut, and many graph verification problems. We consider the synchronous CONGEST distributed computing model and assume that each node has initial knowledge of itself and the identifiers of its neighbors  the socalled KT_1 model  a wellstudied model that also naturally arises in many applications. Recently, it has been established that one can obtain (almost) singularly optimal algorithms, i.e., algorithms that have simultaneously optimal time and message complexity (up to polylogarithmic factors), for many fundamental problems in the standard KT_0 model (where nodes have only local knowledge of themselves and not their neighbors). The situation is less clear in the KT_1 model. In this paper, we present several new distributed algorithms in the KT_1 model that trade off between time and message complexity. Our distributed algorithms are based on a uniform and general approach which involves constructing a sparsified spanning subgraph of the original graph  called a danner  that trades off the number of edges with the diameter of the sparsifier. In particular, a key ingredient of our approach is a distributed randomized algorithm that, given a graph G and any delta in [0,1], with high probability constructs a danner that has diameter O~(D + n^{1delta}) and O~(min{m,n^{1+delta}}) edges in O~(n^{1delta}) rounds while using O~(min{m,n^{1+delta}}) messages, where n, m, and D are the number of nodes, edges, and the diameter of G, respectively. Using our danner construction, we present a family of distributed randomized algorithms for various fundamental problems that exhibit a tradeoff between message and time complexity and that improve over previous results. Specifically, we show the following results (all hold with high probability) in the KT_1 model, which subsume and improve over prior bounds in the KT_1 model (King et al., PODC 2014 and Awerbuch et al., JACM 1990) and the KT_0 model (Kutten et al., JACM 2015, Pandurangan et al., STOC 2017 and Elkin, PODC 2017): 1) Leader Election, Broadcast, and ST. These problems can be solved in O~(D+n^{1delta}) rounds using O~(min{m,n^{1+delta}}) messages for any delta in [0,1]. 2) MST and Connectivity. These problems can be solved in O~(D+n^{1delta}) rounds using O~(min{m,n^{1+delta}}) messages for any delta in [0,0.5]. In particular, for delta = 0.5 we obtain a distributed MST algorithm that runs in optimal O~(D+sqrt{n}) rounds and uses O~(min{m,n^{3/2}}) messages. We note that this improves over the singularly optimal algorithm in the KT_0 model that uses O~(D+sqrt{n}) rounds and O~(m) messages. 3) Minimum Cut. O(log n)approximate minimum cut can be solved in O~(D+n^{1delta}) rounds using O~(min{m,n^{1+delta}}) messages for any delta in [0,0.5]. 4) Graph Verification Problems such as Bipartiteness, Spanning Subgraph etc. These can be solved in O~(D+n^{1delta}) rounds using O~(min{m,n^{1+delta}}) messages for any delta in [0,0.5].more » « less

There has been significant study on the sample complexity of testing properties of distributions over large domains. For many properties, it is known that the sample complexity can be substantially smaller than the domain size. For example, over a domain of size n, distinguishing the uniform distribution from distributions that are far from uniform in ℓ1distance uses only O(n−−√) samples. However, the picture is very different in the presence of arbitrary noise, even when the amount of noise is quite small. In this case, one must distinguish if samples are coming from a distribution that is ϵclose to uniform from the case where the distribution is (1−ϵ)far from uniform. The latter task requires nearly linear in n samples (Valiant, 2008; Valiant and Valiant, 2017a). In this work, we present a noise model that on one hand is more tractable for the testing problem, and on the other hand represents a rich class of noise families. In our model, the noisy distribution is a mixture of the original distribution and noise, where the latter is known to the tester either explicitly or via sample access; the form of the noise is also known \emph{a priori}. Focusing on the identity and closeness testing problems leads to the following mixture testing question: Given samples of distributions p,q1,q2, can we test if p is a mixture of q1 and q2? We consider this general question in various scenarios that differ in terms of how the tester can access the distributions, and show that indeed this problem is more tractable. Our results show that the sample complexity of our testers are exactly the same as for the classical nonmixture case.more » « less