This paper studies Mestimators with gradientLipschitz loss function regularized with convex penalty in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust Mestimator constructed with the Huber loss and the ElasticNet penalty and the noise distribution has heavytails. Our main contributions are threefold. (i) We provide general formulae for the derivatives of regularized Mestimators $\hat\beta(y,X)$ where differentiation is taken with respect to both X and y; this reveals a simple differentiability structure shared by all convex regularized Mestimators. (ii) Using these derivatives, we characterize the distribution of the residuals in the intermediate highdimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized Mestimators. The criterion approximates the outofsample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the outofsample error. The proposed adaptive criterion does not require the knowledge of the noise distribution or of the covariance of the design. Simulated data confirms the theoretical findings, regarding both the distribution of the residuals and the success of the criterion asmore »
Robust Estimation of Covariance Matrices: Adversarial Contamination and Beyond
Abstract: We consider the problem of estimating the covariance structure of a random vector $Y\in \mathbb R^d$ from a sample $Y_1,\ldots,Y_n$. We are interested in the situation when d is large compared to n but the covariance matrix $\Sigma$ of interest has (exactly or approximately) low rank. We assume that the given sample is (a) $\epsilon$adversarially corrupted, meaning that $\epsilon$ fraction of the observations could have been replaced by arbitrary vectors, or that (b) the sample is i.i.d. but the underlying distribution is heavytailed, meaning that the norm of Y possesses only 4 finite moments. We propose an estimator that is adaptive to the potential lowrank structure of the covariance matrix as well as to the proportion of contaminated data, and admits tight deviation guarantees despite rather weak assumptions on the underlying distribution. Finally, we discuss the algorithms that allow to approximate the proposed estimator in a numerically efficient way.
 Award ID(s):
 1908905
 Publication Date:
 NSFPAR ID:
 10293464
 Journal Name:
 Technical report
 ISSN:
 01091344
 Sponsoring Org:
 National Science Foundation
More Like this


Let Y be a ddimensional random vector with unknown mean μ and covariance matrix Σ. This paper is motivated by the problem of designing an estimator of Σ that admits exponential deviation bounds in the operator norm under minimal assumptions on the underlying distribution, such as existence of only 4th moments of the coordinates of Y. To address this problem, we propose robust modifications of the operatorvalued Ustatistics, obtain nonasymptotic guarantees for their performance, and demonstrate the implications of these results to the covariance estimation problem under various structural assumptions.

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is possible to accurately estimate this “learnability” even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a ddimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with O(sqrt(d)) samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. For comparison, even if the labels are noiseless linear functions of the data, a sample size linear in the dimension, d, is required to learn any function correlated with the underlying model. Our estimation approach also applies to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. In this setting we give a consistent estimator of the fractionmore »

We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problemboth in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sidesremain open. Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M by M matrix using O(M) samples} (where we assume the rank is a constant).more »

Abstract We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are nonconvex. We study the loss landscape of these robust estimation problems, and identify the existence of ’generalized quasigradients’. Whenever these quasigradients exist, a large family of noregret algorithms are guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any firstorder stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $\epsilon < 1/3$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With carefully designed initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasigradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $O(\sqrt{\epsilon })$ to the optimal rate of $O(\epsilon )$ formore »