In this work, we revisit the generalization error of stochastic mirror descent for quadratically bounded losses studied in Telgarsky (2022). Quadratically bounded losses is a broad class of loss functions, capturing
both Lipschitz and smooth functions, for both regression and classification problems. We study the high probability generalization for this class of losses on linear predictors in both realizable and nonrealizable cases when the data are sampled IID or from a Markov chain. The prior work relies on an intricate coupling argument between the iterates of the original problem and those projected onto a bounded domain. This approach enables blackbox application of concentration inequalities, but also leads to suboptimal guarantees due in part to the use of a union bound across all iterations. In this work, we depart significantly from the prior work of Telgarsky (2022), and introduce a novel approach for establishing high probability generalization guarantees. In contrast to the prior work, our work directly analyzes the moment generating function of a novel supermartingale sequence and leverages the structure of stochastic mirror descent. As a result, we obtain improved bounds in all aforementioned settings. Specifically, in the realizable case and nonrealizable case with lighttailed subGaussian data, we improve
the bounds by a $\log T$ factor, matching the correct rates of $1/T$ and $1/\sqrt{T}$, respectively. In the more challenging case of heavytailed polynomial data, we improve the existing bound by a $\mathrm{poly}\ T$
factor.
more »
« less
This content will become publicly available on December 10, 2024
On the Generalization Error of Stochastic Mirror Descent for QuadraticallyBounded Losses: an Improved Analysis
In this work, we revisit the generalization error of stochastic mirror
descent for quadratically bounded losses studied in Telgarsky (2022).
Quadratically bounded losses is a broad class of loss functions, capturing
both Lipschitz and smooth functions, for both regression and classification
problems. We study the high probability generalization for this class
of losses on linear predictors in both realizable and nonrealizable
cases when the data are sampled IID or from a Markov chain. The prior
work relies on an intricate coupling argument between the iterates
of the original problem and those projected onto a bounded domain.
This approach enables blackbox application of concentration inequalities,
but also leads to suboptimal guarantees due in part to the use of
a union bound across all iterations. In this work, we depart significantly
from the prior work of Telgarsky (2022), and introduce a novel approach
for establishing high probability generalization guarantees. In contrast
to the prior work, our work directly analyzes the moment generating
function of a novel supermartingale sequence and leverages the structure
of stochastic mirror descent. As a result, we obtain improved bounds
in all aforementioned settings. Specifically, in the realizable case
and nonrealizable case with lighttailed subGaussian data, we improve
the bounds by a $\log T$ factor, matching the correct rates of $1/T$
and $1/\sqrt{T}$, respectively. In the more challenging case of heavytailed
polynomial data, we improve the existing bound by a $\mathrm{poly}\ T$
factor.
more »
« less
 NSFPAR ID:
 10488804
 Publisher / Repository:
 Neural Information Processing Systems
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this



In this work, we study the convergence in high probability of clipped gradient methods when the noise distribution has heavy tails, i.e., with bounded $p$th moments, for some $1< p \leq 2$. Prior works in this setting follow the same recipe of using concentration inequalities and an inductive argument with union bound to bound the iterates across all iterations. This method results in an increase in the failure probability by a factor of $T$, where $T$ is the number of iterations. We instead propose a new analysis approach based on bounding the moment generating function of a well chosen supermartingale sequence. We improve the dependency on $T$ in the convergence guarantee for a wide range of algorithms with clipped gradients, including stochastic (accelerated) mirror descent for convex objectives and stochastic gradient descent for nonconvex objectives. Our high probability bounds achieve the optimal convergence rates and match the best currently known inexpectation bounds. Our approach naturally allows the algorithms to use timevarying step sizes and clipping parameters when the time horizon is unknown, which appears difficult or even impossible using existing techniques from prior works. Furthermore, we show that in the case of clipped stochastic mirror descent, several problem constants, including the initial distance to the optimum, are not required when setting step sizes and clipping parameters.more » « less

In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and nonconvex optimization with subGaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the nonconvex case. We demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when $T$ is unknown for SGD, where $1\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGradNorm \cite{ward2019adagrad} and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard percoordinate AdaGrad.more » « less

In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and nonconvex optimization with subGaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the nonconvex case. We demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when $T$ is unknown for SGD, where $1\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGradNorm (Ward et al., 2019) and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard percoordinate AdaGrad.more » « less