We consider the question of Gaussian mean testing, a fundamental task in highdimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the informationtheoretic and computational landscapes for robust mean testing. In the exponentialtime setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an εfraction of adversarial corruptions, to be Θ~(max(d√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))) while the complexity against adaptive adversaries is Θ~(max(d√α2,dε2α4)) which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomialtime setting, we close a gap in the literature by providing a polynomialtime algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a lowdegree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the obliviousadversary setting.
more »
« less
This content will become publicly available on November 9, 2024
The Full Landscape of Robust Mean Testing: Sharp Separations between Oblivious and Adaptive Contamination
We consider the question of Gaussian mean testing, a fundamental task in highdimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the informationtheoretic and computational landscapes for robust mean testing. In the exponentialtime setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an εfraction of adversarial corruptions, to be
Θ~(max(d−−√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))),
while the complexity against adaptive adversaries is
Θ~(max(d−−√α2,dε2α4)),
which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models.
In the polynomialtime setting, we close a gap in the literature by providing a polynomialtime algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a lowdegree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the obliviousadversary setting.
more »
« less
 Award ID(s):
 2238080
 NSFPAR ID:
 10487307
 Publisher / Repository:
 IEEE
 Date Published:
 Journal Name:
 64th Annual IEEE Symposium on Foundations of Computer Science (FOCS)
 Format(s):
 Medium: X
 Location:
 Santa Cruz, California
 Sponsoring Org:
 National Science Foundation
More Like this


We initiate the study of a fundamental question concerning adversarial noise models in statistical problems where the algorithm receives i.i.d. draws from a distribution D. The definitions of these adversaries specify the {\sl type} of allowable corruptions (noise model) as well as {\sl when} these corruptions can be made (adaptivity); the latter differentiates between oblivious adversaries that can only corrupt the distribution D and adaptive adversaries that can have their corruptions depend on the specific sample S that is drawn from D. We investigate whether oblivious adversaries are effectively equivalent to adaptive adversaries, across all noise models studied in the literature, under a unifying framework that we introduce. Specifically, can the behavior of an algorithm A in the presence of oblivious adversaries always be wellapproximated by that of an algorithm A′ in the presence of adaptive adversaries? Our first result shows that this is indeed the case for the broad class of {\sl statistical query} algorithms, under all reasonable noise models. We then show that in the specific case of {\sl additive noise}, this equivalence holds for {\sl all} algorithms. Finally, we map out an approach towards proving this statement in its fullest generality, for all algorithms and under all reasonable noise models.more » « less

We provide improved differentially private algorithms for identity testing of highdimensional distributions. Specifically, for ddimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}d, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{nonprivate} sample complexity of O(d√α2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of ddimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}.more » « less

Many problems on data streams have been studied at two extremes of difficulty: either allowing randomized algorithms, in the static setting (where they should err with bounded probability on the worst case stream); or when only deterministic and infallible algorithms are required. Some recent works have considered the adversarial setting, in which a randomized streaming algorithm must succeed even on data streams provided by an adaptive adversary that can see the intermediate outputs of the algorithm. In order to better understand the differences between these models, we study a streaming task called “Missing Item Finding”. In this problem, for r < n, one is given a data stream a1 , . . . , ar of elements in [n], (possibly with repetitions), and must output some x ∈ [n] which does not equal any of the ai. We prove that, for r = nΘ(1) and δ = 1/poly(n), the space required for randomized algorithms that solve this problem in the static setting with error δ is Θ(polylog(n)); for algorithms in the adversarial setting with error δ, Θ((1 + r2/n)polylog(n)); and for deterministic algorithms, Θ(r/polylog(n)). Because our adversarially robust algorithm relies on free access to a string of O(r log n) random bits, we investigate a “random start” model of streaming algorithms where all random bits used are included in the space cost. Here we find a conditional lower bound on the space usage, which depends on the space that would be needed for a pseudodeterministic algorithm to solve the problem. We also prove an Ω(r/polylog(n)) lower bound for the space needed by a streaming algorithm with < 1/2polylog(n) error against “whitebox” adversaries that can see the internal state of the algorithm, but not predict its future random decisions.more » « less

Finding an approximate secondorder stationary point (SOSP) is a wellstudied and fundamental problem in stochastic nonconvex optimization with many applications in machine learning. However, this problem is poorly understood in the presence of outliers, limiting the use of existing nonconvex algorithms in adversarial settings. In this paper, we study the problem of finding SOSPs in the strong contamination model, where a constant fraction of datapoints are arbitrarily corrupted. We introduce a general framework for efficiently finding an approximate SOSP with dimensionindependent accuracy guarantees, using $\widetilde{O}({D^2}/{\epsilon})$ samples where $D$ is the ambient dimension and $\epsilon$ is the fraction of corrupted datapoints. As a concrete application of our framework, we apply it to the problem of low rank matrix sensing, developing efficient and provably robust algorithms that can tolerate corruptions in both the sensing matrices and the measurements. In addition, we establish a Statistical Query lower bound providing evidence that the quadratic dependence on $D$ in the sample complexity is necessary for computationally efficient algorithms.more » « less