skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: The Full Landscape of Robust Mean Testing: Sharp Separations between Oblivious and Adaptive Contamination
We consider the question of Gaussian mean testing, a fundamental task in high-dimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the information-theoretic and computational landscapes for robust mean testing. In the exponential-time setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an ε-fraction of adversarial corruptions, to be Θ~(max(d−−√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))), while the complexity against adaptive adversaries is Θ~(max(d−−√α2,dε2α4)), which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomial-time setting, we close a gap in the literature by providing a polynomial-time algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a low-degree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the oblivious-adversary setting.  more » « less
Award ID(s):
2238080
PAR ID:
10487307
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
64th Annual IEEE Symposium on Foundations of Computer Science (FOCS)
Format(s):
Medium: X
Location:
Santa Cruz, California
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the question of Gaussian mean testing, a fundamental task in high-dimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the information-theoretic and computational landscapes for robust mean testing. In the exponential-time setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an ε-fraction of adversarial corruptions, to be Θ~(max(d√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))) while the complexity against adaptive adversaries is Θ~(max(d√α2,dε2α4)) which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomial-time setting, we close a gap in the literature by providing a polynomial-time algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a low-degree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the oblivious-adversary setting. 
    more » « less
  2. We initiate the study of a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution D. The definitions of these adversaries specify the {\sl type} of allowable corruptions (noise model) as well as {\sl when} these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution D and adaptive adversaries that can have their corruptions depend on the specific sample S that is drawn from D. We investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature, under a unifying framework that we introduce. Specifically, can the behavior of an algorithm A in the presence of oblivious adversaries always be well-approximated by that of an algorithm A′ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of {\sl statistical query} algorithms, under all reasonable noise models. We then show that in the specific case of {\sl additive noise}, this equivalence holds for {\sl all} algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models. 
    more » « less
  3. We provide improved differentially private algorithms for identity testing of high-dimensional distributions. Specifically, for d-dimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)-differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}d, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{non-private} sample complexity of O(d√α2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of d-dimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}. 
    more » « less
  4. Many problems on data streams have been studied at two extremes of difficulty: either allowing randomized algorithms, in the static setting (where they should err with bounded probability on the worst case stream); or when only deterministic and infallible algorithms are required. Some recent works have considered the adversarial setting, in which a randomized streaming algorithm must succeed even on data streams provided by an adaptive adversary that can see the intermediate outputs of the algorithm. In order to better understand the differences between these models, we study a streaming task called “Missing Item Finding”. In this problem, for r < n, one is given a data stream a1 , . . . , ar of elements in [n], (possibly with repetitions), and must output some x ∈ [n] which does not equal any of the ai. We prove that, for r = nΘ(1) and δ = 1/poly(n), the space required for randomized algorithms that solve this problem in the static setting with error δ is Θ(polylog(n)); for algorithms in the adversarial setting with error δ, Θ((1 + r2/n)polylog(n)); and for deterministic algorithms, Θ(r/polylog(n)). Because our adversarially robust algorithm relies on free access to a string of O(r log n) random bits, we investigate a “random start” model of streaming algorithms where all random bits used are included in the space cost. Here we find a conditional lower bound on the space usage, which depends on the space that would be needed for a pseudo-deterministic algorithm to solve the problem. We also prove an Ω(r/polylog(n)) lower bound for the space needed by a streaming algorithm with < 1/2polylog(n) error against “white-box” adversaries that can see the internal state of the algorithm, but not predict its future random decisions. 
    more » « less
  5. We study the fundamental problem of learning the parameters of a high-dimensional Gaussian in the presence of noise — where an ε-fraction of our samples were chosen by an adversary. We give robust estimators that achieve estimation error O(ε) in the total variation distance, which is optimal up to a universal constant that is independent of the dimension. In the case where just the mean is unknown, our robustness guarantee is optimal up to a factor of and the running time is polynomial in d and 1/ε. When both the mean and covariance are unknown, the running time is polynomial in d and quasipolynomial in 1/ε. Moreover all of our algorithms require only a polynomial number of samples. Our work shows that the same sorts of error guarantees that were established over fifty years ago in the one-dimensional setting can also be achieved by efficient algorithms in high-dimensional settings. 
    more » « less