We propose and analyze a new estimator of the covariance matrix that admits strong theoretical guarantees under weak assumptions on the underlying distribution, such as existence of moments of only low order. While estimation of covariance matrices corresponding to subGaussian distributions is wellunderstood, much less in known in the case of heavytailed data. As K. Balasubramanian and M. Yuan write, "data from realworld experiments oftentimes tend to be corrupted with outliers and/or exhibit heavy tails. In such cases, it is not clear that those covariance matrix estimators .. remain optimal" and "what are the other possible strategies to deal with heavy tailed distributions warrant further studies." We make a step towards answering this question and prove tight deviation inequalities for the proposed estimator that depend only on the parameters controlling the intrinsic dimension'' associated to the covariance matrix (as opposed to the dimension of the ambient space); in particular, our results are applicable in the case of highdimensional observations.
Robust compressed sensing using generative models
The goal of compressed sensing is to estimate a high dimensional vector from an underdetermined system of noisy linear equations. In analogy to classical compressed sensing, here we assume a generative model as a prior, that is, we assume the vector is represented by a deep generative model G:Rk→Rn.
Classical recovery approaches such as empirical risk minimization (ERM) are guaranteed to succeed when the measurement matrix is subGaussian. However, when the measurement matrix and measurements are heavytailed or have outliers, recovery may fail dramatically. In this paper we propose an algorithm inspired by the MedianofMeans (MOM). Our algorithm guarantees recovery for heavytailed data, even in the presence of outliers. Theoretically, our results show our novel MOMbased algorithm enjoys the same sample complexity guarantees as ERM under subGaussian assumptions. Our experiments validate both aspects of our claims: other algorithms are indeed fragile and fail under heavytailed and/or corrupted data, while our approach exhibits the predicted robustness.
 Award ID(s):
 1763702
 Publication Date:
 NSFPAR ID:
 10285502
 Journal Name:
 Advances in neural information processing systems
 Volume:
 33
 ISSN:
 10495258
 Sponsoring Org:
 National Science Foundation
More Like this


Generative neural networks have been empirically found very promising in providing effective structural priors for compressed sensing, since they can be trained to span lowdimensional data manifolds in highdimensional signal spaces. Despite the nonconvexity of the resulting optimization problem, it has also been shown theoretically that, for neural networks with random Gaussian weights, a signal in the range of the network can be efficiently, approximately recovered from a few noisy measurements. However, a major bottleneck of these theoretical guarantees is a network expansivity condition: that each layer of the neural network must be larger than the previous by a logarithmic factor. Our main contribution is to break this strong expansivity assumption, showing that constant expansivity suffices to get efficient recovery algorithms, besides it also being informationtheoretically necessary. To overcome the theoretical bottleneck in existing approaches we prove a novel uniform concentration theorem for random functions that might not be Lipschitz but satisfy a relaxed notion which we call "pseudoLipschitzness." Using this theorem we can show that a matrix concentration inequality known as the Weight Distribution Condition (WDC), which was previously only known to hold for Gaussian matrices with logarithmic aspect ratio, in fact holds for constant aspect ratios too. Sincemore »

We consider the highdimensional linear regression model and assume that a fraction of the responses are contaminated by an adversary with complete knowledge of the data and the underlying distribution. We are interested in the situation when the dense additive noise can be heavytailed but the predictors have subGaussian distribution. We establish minimax lower bounds that depend on the the fraction of the contaminated data and the tails of the additive noise. Moreover, we design a modification of the square root Slope estimator with several desirable features: (a) it is provably robust to adversarial contamination, with the performance guarantees that take the form of subGaussian deviation inequalities and match the lower error bounds up to logfactors; (b) it is fully adaptive with respect to the unknown sparsity level and the variance of the noise, and (c) it is computationally tractable as a solution of a convex optimization problem. To analyze the performance of the proposed estimator, we prove several properties of matrices with subGaussian rows that could be of independent interest.

Abstract: We consider the problem of estimating the covariance structure of a random vector $Y\in \mathbb R^d$ from a sample $Y_1,\ldots,Y_n$. We are interested in the situation when d is large compared to n but the covariance matrix $\Sigma$ of interest has (exactly or approximately) low rank. We assume that the given sample is (a) $\epsilon$adversarially corrupted, meaning that $\epsilon$ fraction of the observations could have been replaced by arbitrary vectors, or that (b) the sample is i.i.d. but the underlying distribution is heavytailed, meaning that the norm of Y possesses only 4 finite moments. We propose an estimator that is adaptive to the potential lowrank structure of the covariance matrix as well as to the proportion of contaminated data, and admits tight deviation guarantees despite rather weak assumptions on the underlying distribution. Finally, we discuss the algorithms that allow to approximate the proposed estimator in a numerically efficient way.

Largescale panel data is ubiquitous in many modern data science applications. Conventional panel data analysis methods fail to address the new challenges, like individual impacts of covariates, endogeneity, embedded lowdimensional structure, and heavytailed errors, arising from the innovation of data collection platforms on which applications operate. In response to these challenges, this paper studies largescale panel data with an interactive effects model. This model takes into account the individual impacts of covariates on each spatial node and removes the exogenous condition by allowing latent factors to affect both covariates and errors. Besides, we waive the subGaussian assumption and allow the errors to be heavytailed. Further, we propose a datadriven procedure to learn a parsimonious yet flexible homogeneity structure embedded in highdimensional individual impacts of covariates. The homogeneity structure assumes that there exists a partition of regression coeffcients where the coeffcients are the same within each group but different between the groups. The homogeneity structure is flexible as it contains many widely assumed low dimensional structures (sparsity, global impact, etc.) as its special cases. Nonasymptotic properties are established to justify the proposed learning procedure. Extensive numerical experiments demonstrate the advantage of the proposed learning procedure over conventional methods especially when themore »