This paper investigates robust versions of the general empirical risk minimization algorithm, one of the core techniques underlying modern statistical methods. Success of the empirical risk minimization is based on the fact that for a ‘wellbehaved’ stochastic process $\left \{ f(X), \ f\in \mathscr F\right \}$ indexed by a class of functions $f\in \mathscr F$, averages $\frac{1}{N}\sum _{j=1}^N f(X_j)$ evaluated over a sample $X_1,\ldots ,X_N$ of i.i.d. copies of $X$ provide good approximation to the expectations $\mathbb E f(X)$, uniformly over large classes $f\in \mathscr F$. However, this might no longer be true if the marginal distributions of the process are heavy tailed or if the sample contains outliers. We propose a version of empirical risk minimization based on the idea of replacing sample averages by robust proxies of the expectations and obtain highconfidence bounds for the excess risk of resulting estimators. In particular, we show that the excess risk of robust estimators can converge to $0$ at fast rates with respect to the sample size $N$, referring to the rates faster than $N^{1/2}$. We discuss implications of the main results to the linear and logistic regression problems and evaluate the numerical performance of proposed methods on simulated and real data.
 Award ID(s):
 1934584
 NSFPAR ID:
 10352884
 Date Published:
 Journal Name:
 International Conference on Artificial Intelligence and Statistics (AISTATS)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Abstract 
null (Ed.)Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is nonconvex and the gradient noise might exhibit a heavytailed behavior, as empirically observed in recent studies. In this study, we consider a \emph{continuoustime} variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavytailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavytails of such perturbations can result in a bias even when the stepsize is small, in the sense that \emph{the optima of stationary distribution} of the dynamics might not match \emph{the optima of the cost function to be optimized}. As a remedy, we develop a novel framework, which we coin as \emph{fractional} ULD (FULD), and prove that FULD targets the socalled Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with \emph{natural gradient} methods and \emph{gradient clipping}, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.more » « less

Abstract We explore why many recently proposed robust estimation problems are efficiently solvable, even though the underlying optimization problems are nonconvex. We study the loss landscape of these robust estimation problems, and identify the existence of ’generalized quasigradients’. Whenever these quasigradients exist, a large family of noregret algorithms are guaranteed to approximate the global minimum; this includes the commonly used filtering algorithm. For robust mean estimation of distributions under bounded covariance, we show that any firstorder stationary point of the associated optimization problem is an approximate global minimum if and only if the corruption level $\epsilon < 1/3$. Consequently, any optimization algorithm that approaches a stationary point yields an efficient robust estimator with breakdown point $1/3$. With carefully designed initialization and step size, we improve this to $1/2$, which is optimal. For other tasks, including linear regression and joint mean and covariance estimation, the loss landscape is more rugged: there are stationary points arbitrarily far from the global minimum. Nevertheless, we show that generalized quasigradients exist and construct efficient algorithms. These algorithms are simpler than previous ones in the literature, and for linear regression we improve the estimation error from $O(\sqrt{\epsilon })$ to the optimal rate of $O(\epsilon )$ for small $\epsilon $ assuming certified hypercontractivity. For mean estimation with nearidentity covariance, we show that a simple gradient descent algorithm achieves breakdown point $1/3$ and iteration complexity $\tilde{O}(d/\epsilon ^2)$.more » « less

In this work, we revisit the generalization error of stochastic mirror descent for quadratically bounded losses studied in Telgarsky (2022). Quadratically bounded losses is a broad class of loss functions, capturing both Lipschitz and smooth functions, for both regression and classification problems. We study the high probability generalization for this class of losses on linear predictors in both realizable and nonrealizable cases when the data are sampled IID or from a Markov chain. The prior work relies on an intricate coupling argument between the iterates of the original problem and those projected onto a bounded domain. This approach enables blackbox application of concentration inequalities, but also leads to suboptimal guarantees due in part to the use of a union bound across all iterations. In this work, we depart significantly from the prior work of Telgarsky (2022), and introduce a novel approach for establishing high probability generalization guarantees. In contrast to the prior work, our work directly analyzes the moment generating function of a novel supermartingale sequence and leverages the structure of stochastic mirror descent. As a result, we obtain improved bounds in all aforementioned settings. Specifically, in the realizable case and nonrealizable case with lighttailed subGaussian data, we improve the bounds by a $\log T$ factor, matching the correct rates of $1/T$ and $1/\sqrt{T}$, respectively. In the more challenging case of heavytailed polynomial data, we improve the existing bound by a $\mathrm{poly}\ T$ factor.more » « less

Abstract In recent years, the choice of prior distributions for Bayesian logistic regression has received considerable interest. It is widely acknowledged that noninformative, improper priors have to be used with caution because posterior propriety may not always hold. As an alternative, heavy‐tailed priors such as Cauchy prior distributions have been proposed by Gelman et al. (2008). The motivation for using Cauchy prior distributions is that they are proper, and thus, unlike noninformative priors, they are guaranteed to yield proper posterior distributions. The heavy tails of the Cauchy distribution allow the posterior distribution to adapt to the data. Thus Gelman et al. (2008) suggested the use of these prior distributions as a default weakly informative choice, in the absence of prior knowledge about the regression coefficients. While these prior distributions are guaranteed to have proper posterior distributions, Ghosh, Li, and Mitra (2018), showed that the posterior means may not exist, or can be unreasonably large, when they exist, for datasets with separation. In this paper, we provide a short review of the concept of separation, and we discuss how common prior distributions like the Cauchy and normal prior perform in data with separation. Theoretical and empirical results suggest that lighter tailed prior distributions such as normal prior distributions can be a good default choice when there is separation.
This article is categorized under:
Statistical Learning and Exploratory Methods of the Data Sciences > Modeling Methods Statistical Models