We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problemboth in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sidesremain open. Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M by M matrix using O(M) samples} (where we assume the rank is a constant). This linear sample complexity is optimal, up to constant factors, in an extremely strong sense: even testing basic properties of the underlying matrix (such as whether it has rank 1 or 2) requires Omega(M) samples. Additionally, we provide an even stronger lower bound showing that distinguishing whether a sequence of observations were drawn from the uniform distribution over M observations versus being generated by a wellconditioned Hidden Markov Model with two hidden states requires Omega(M) observations, while our positive results for recovering B immediately imply that Omega(M) observations suffice to learn such an HMM. This lower bound precludes sublinearsample hypothesis tests for basic properties, such as identity or uniformity, as well as sublinear sample estimators for quantities such as the entropy rate of HMMs.
more »
« less
On Learning Ising Models under Huber's Contamination Model
We study the problem of learning Ising models in a setting where some of the
samples from the underlying distribution can be arbitrarily corrupted. In such
a setup, we aim to design statistically optimal estimators in a highdimensional
scaling in which the number of nodes p, the number of edges k and the maximal
node degree d are allowed to increase to infinity as a function of the sample size n.
Our analysis is based on exploiting moments of the underlying distribution, coupled
with novel reductions to univariate estimation. Our proposed estimators achieve an
optimal dimension independent dependence on the fraction of corrupted data in the
contaminated setting, while also simultaneously achieving highprobability error
guarantees with optimal samplecomplexity. We corroborate our theoretical results
by simulations.
more »
« less
 Award ID(s):
 1955532
 NSFPAR ID:
 10222690
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 Volume:
 33
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study highdimensional sparse estimation tasks in a robust setting where a constant fraction of the dataset is adversarially corrupted. Specifically, we focus on the fundamental problems of robust sparse mean estimation and robust sparse PCA. We give the first practically viable robust estimators for these problems. In more detail, our algorithms are sample and computationally efficient and achieve nearoptimal robustness guarantees. In contrast to prior provable algorithms which relied on the ellipsoid method, our algorithms use spectral techniques to iteratively remove outliers from the dataset. Our experimental evaluation on synthetic data shows that our algorithms are scalable and significantly outperform a range of previous approaches, nearly matching the best error rate without corruptions.more » « less

null (Ed.)This paper is devoted to the estimators of the mean that provide strong nonasymptotic guarantees under minimal assumptions on the underlying distribution. The main ideas behind proposed techniques are based on bridging the notions of symmetry and robustness. We show that existing methods, such as medianofmeans and Catoni’s estimators, can often be viewed as special cases of our construction. The main contribution of the paper is the proof of uniform bounds for the deviations of the stochastic process defined by proposed estimators. Moreover, we extend our results to the case of adversarial contamination where a constant fraction of the observations is arbitrarily corrupted. Finally, we apply our methods to the problem of robust multivariate mean estimation and show that obtained inequalities achieve optimal dependence on the proportion of corrupted samples.more » « less

We propose and analyze a new estimator of the covariance matrix that admits strong theoretical guarantees under weak assumptions on the underlying distribution, such as existence of moments of only low order. While estimation of covariance matrices corresponding to subGaussian distributions is wellunderstood, much less in known in the case of heavytailed data. As K. Balasubramanian and M. Yuan write, "data from realworld experiments oftentimes tend to be corrupted with outliers and/or exhibit heavy tails. In such cases, it is not clear that those covariance matrix estimators .. remain optimal" and "what are the other possible strategies to deal with heavy tailed distributions warrant further studies." We make a step towards answering this question and prove tight deviation inequalities for the proposed estimator that depend only on the parameters controlling the intrinsic dimension'' associated to the covariance matrix (as opposed to the dimension of the ambient space); in particular, our results are applicable in the case of highdimensional observations.more » « less

We study the relationship between adversarial robustness and differential privacy in highdimensional algorithmic statistics. We give the first blackbox reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental highdimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearlyoptimal polynomialtime robust estimators for the mean and covariance of highdimensional Gaussians which are based on the SumofSquares method, we design the first polynomialtime private estimators for these problems with nearlyoptimal samplesaccuracyprivacy tradeoffs. Our algorithms are also robust to a nearly optimal fraction of adversariallycorrupted samples.more » « less