We present a fast, differentially private algorithm for highdimensional covarianceaware mean estimation with nearly optimal sample complexity. Only exponentialtime estimators were previously known to achieve this guarantee. Given n samples from a (sub)Gaussian distribution with unknown mean μ and covariance Σ, our (ϵ,δ)differentially private estimator produces μ~ such that ∥μ−μ~∥Σ≤α as long as n≳dα2+dlog1/δ√αϵ+dlog1/δϵ. The Mahalanobis error metric ∥μ−μ^∥Σ measures the distance between μ^ and μ relative to Σ; it characterizes the error of the sample mean. Our algorithm runs in time O~(ndω−1+nd/\eps), where ω<2.38 is the matrix multiplication exponent.We adapt an exponentialtime approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above.Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With n≳d3/2 samples, our estimate is accurate in spectral norm. This is the first such algorithm using n=o(d2) samples, answering an open question posed by Alabi et al. (2022). With n≳d2 samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance.Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.
more »
« less
This content will become publicly available on July 15, 2024
Fast, SampleEfficient, AffineInvariant Private Mean and Covariance Estimation for Subgaussian Distributions
We present a fast, differentially private algorithm for highdimensional covarianceaware mean estimation with nearly optimal sample complexity. Only exponentialtime estimators were previously known to achieve this guarantee. Given n samples from a (sub)Gaussian distribution with unknown mean μ and covariance Σ, our (ε,δ)differentially private estimator produces μ~ such that ∥μ−μ~∥Σ≤α as long as n≳dα2+dlog1/δ√αε+dlog1/δε. The Mahalanobis error metric ∥μ−μ^∥Σ measures the distance between μ^ and μ relative to Σ; it characterizes the error of the sample mean. Our algorithm runs in time O~(ndω−1+nd/ε), where ω<2.38 is the matrix multiplication exponent.
We adapt an exponentialtime approach of Brown, Gaboardi, Smith, Ullman, and Zakynthinou (2021), giving efficient variants of stable mean and covariance estimation subroutines that also improve the sample complexity to the nearly optimal bound above.
Our stable covariance estimator can be turned to private covariance estimation for unrestricted subgaussian distributions. With n≳d3/2 samples, our estimate is accurate in spectral norm. This is the first such algorithm using n=o(d2) samples, answering an open question posed by Alabi et al. (2022). With n≳d2 samples, our estimate is accurate in Frobenius norm. This leads to a fast, nearly optimal algorithm for private learning of unrestricted Gaussian distributions in TV distance.
Duchi, Haque, and Kuditipudi (2023) obtained similar results independently and concurrently.
more »
« less
 Award ID(s):
 2238080
 NSFPAR ID:
 10487320
 Publisher / Repository:
 Proceedings of Machine Learning Research
 Date Published:
 Journal Name:
 36th Annual Conference on Learning Theory (COLT 2023)
 Format(s):
 Medium: X
 Location:
 Bangalore, India
 Sponsoring Org:
 National Science Foundation
More Like this


We provide improved differentially private algorithms for identity testing of highdimensional distributions. Specifically, for ddimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}d, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{nonprivate} sample complexity of O(d√α2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of ddimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}.more » « less

We study the fundamental problem of highdimensional mean estimation in a robust model where a constant fraction of the samples are adversarially corrupted. Recent work gave the first polynomial time algorithms for this problem with dimensionindependent error guarantees for several families of structured distributions. In this work, we give the first nearlylinear time algorithms for highdimensional robust mean estimation. Specifically, we focus on distributions with (i) known covariance and subgaussian tails, and (ii) unknown bounded covariance. Given N samples on R^d, an \epsfraction of which may be arbitrarily corrupted, our algorithms run in time eO(Nd)/poly(\eps) and approximate the true mean within the informationtheoretically optimal error, up to constant factors. Previous robust algorithms with comparable error guarantees have running times \Omega(Nd^2), for \eps= O(1) Our algorithms rely on a natural family of SDPs parameterized by our current guess ν for the unknown mean μ. We give a winwin analysis establishing the following: either a nearoptimal solution to the primal SDP yields a good candidate for μ — independent of our current guess ν — or a nearoptimal solution to the dual SDP yields a new guess ν0 whose distance from μ is smaller by a constant factor. We exploit the special structure of the corresponding SDPs to show that they are approximately solvable in nearlylinear time. Our approach is quite general, and we believe it can also be applied to obtain nearlylinear time algorithms for other highdimensional robust learning problems.more » « less

We consider the question of Gaussian mean testing, a fundamental task in highdimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the informationtheoretic and computational landscapes for robust mean testing. In the exponentialtime setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an εfraction of adversarial corruptions, to be Θ~(max(d√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))) while the complexity against adaptive adversaries is Θ~(max(d√α2,dε2α4)) which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomialtime setting, we close a gap in the literature by providing a polynomialtime algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a lowdegree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the obliviousadversary setting.more » « less

We consider the question of Gaussian mean testing, a fundamental task in highdimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the informationtheoretic and computational landscapes for robust mean testing. In the exponentialtime setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an εfraction of adversarial corruptions, to be Θ~(max(d−−√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))), while the complexity against adaptive adversaries is Θ~(max(d−−√α2,dε2α4)), which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomialtime setting, we close a gap in the literature by providing a polynomialtime algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a lowdegree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the obliviousadversary setting.more » « less