This paper studies Mestimators with gradientLipschitz loss function regularized with convex penalty in linear models with Gaussian design matrix and arbitrary noise distribution. A practical example is the robust Mestimator constructed with the Huber loss and the ElasticNet penalty and the noise distribution has heavytails. Our main contributions are threefold. (i) We provide general formulae for the derivatives of regularized Mestimators $\hat\beta(y,X)$ where differentiation is taken with respect to both X and y; this reveals a simple differentiability structure shared by all convex regularized Mestimators. (ii) Using these derivatives, we characterize the distribution of the residuals in the intermediate highdimensional regime where dimension and sample size are of the same order. (iii) Motivated by the distribution of the residuals, we propose a novel adaptive criterion to select tuning parameters of regularized Mestimators. The criterion approximates the outofsample error up to an additive constant independent of the estimator, so that minimizing the criterion provides a proxy for minimizing the outofsample error. The proposed adaptive criterion does not require the knowledge of the noise distribution or of the covariance of the design. Simulated data confirms the theoretical findings, regarding both the distribution of the residuals and the success of the criterion asmore »
Robust Estimation of Covariance Matrices: Adversarial Contamination and Beyond
Abstract: We consider the problem of estimating the covariance structure of a random vector $Y\in \mathbb R^d$ from a sample $Y_1,\ldots,Y_n$. We are interested in the situation when d is large compared to n but the covariance matrix $\Sigma$ of interest has (exactly or approximately) low rank. We assume that the given sample is (a) $\epsilon$adversarially corrupted, meaning that $\epsilon$ fraction of the observations could have been replaced by arbitrary vectors, or that (b) the sample is i.i.d. but the underlying distribution is heavytailed, meaning that the norm of Y possesses only 4 finite moments. We propose an estimator that is adaptive to the potential lowrank structure of the covariance matrix as well as to the proportion of contaminated data, and admits tight deviation guarantees despite rather weak assumptions on the underlying distribution. Finally, we discuss the algorithms that allow to approximate the proposed estimator in a numerically efficient way.
 Award ID(s):
 1908905
 Publication Date:
 NSFPAR ID:
 10293464
 Journal Name:
 Technical report
 ISSN:
 01091344
 Sponsoring Org:
 National Science Foundation
More Like this


Let Y be a ddimensional random vector with unknown mean μ and covariance matrix Σ. This paper is motivated by the problem of designing an estimator of Σ that admits exponential deviation bounds in the operator norm under minimal assumptions on the underlying distribution, such as existence of only 4th moments of the coordinates of Y. To address this problem, we propose robust modifications of the operatorvalued Ustatistics, obtain nonasymptotic guarantees for their performance, and demonstrate the implications of these results to the covariance estimation problem under various structural assumptions.

We consider the problem of estimating how well a model class is capable of fitting a distribution of labeled data. We show that it is possible to accurately estimate this “learnability” even when given an amount of data that is too small to reliably learn any accurate model. Our first result applies to the setting where the data is drawn from a ddimensional distribution with isotropic covariance, and the label of each datapoint is an arbitrary noisy function of the datapoint. In this setting, we show that with O(sqrt(d)) samples, one can accurately estimate the fraction of the variance of the label that can be explained via the best linear function of the data. For comparison, even if the labels are noiseless linear functions of the data, a sample size linear in the dimension, d, is required to learn any function correlated with the underlying model. Our estimation approach also applies to the setting where the data distribution has an (unknown) arbitrary covariance matrix, allowing these techniques to be applied to settings where the model class consists of a linear function applied to a nonlinear embedding of the data. In this setting we give a consistent estimator of the fractionmore »

The matrix completion problem seeks to recover a $d\times d$ ground truth matrix of low rank $r\ll d$ from observations of its individual elements. Realworld matrix completion is often a hugescale optimization problem, with $d$ so large that even the simplest fulldimension vector operations with $O(d)$ time complexity become prohibitively expensive. Stochastic gradient descent (SGD) is one of the few algorithms capable of solving matrix completion on a huge scale, and can also naturally handle streaming data over an evolving ground truth. Unfortunately, SGD experiences a dramatic slowdown when the underlying ground truth is illconditioned; it requires at least $O(\kappa\log(1/\epsilon))$ iterations to get $\epsilon$close to ground truth matrix with condition number $\kappa$. In this paper, we propose a preconditioned version of SGD that preserves all the favorable practical qualities of SGD for hugescale online optimization while also making it agnostic to $\kappa$. For a symmetric ground truth and the Root Mean Square Error (RMSE) loss, we prove that the preconditioned SGD converges to $\epsilon$accuracy in $O(\log(1/\epsilon))$ iterations, with a rapid linear convergence rate as if the ground truth were perfectly conditioned with $\kappa=1$. In our numerical experiments, we observe a similar acceleration for illconditioned matrix completion under the 1bit crossentropymore »

We consider the problem of accurately recovering a matrix B of size M by M, which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including building recommendation systems and collaborative filtering in the sparse data regime, community detection in sparse random graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problemboth in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sidesremain open. Our results apply to the setting where B has a low rank structure. For this setting, we propose an efficient (and practically viable) algorithm that accurately recovers the underlying M by M matrix using O(M) samples} (where we assume the rank is a constant).more »