In this paper, we study the stability of stochastic gradient descent (SGD) algorithms and its tradeoff with optimization error in the pairwise learning setting. Pairwise learning refers to learning tasks whose loss function depends on pairs of instances; notable examples include bipartite ranking, metric learning, area under the ROC curve (AUC) maximization, and the minimum error entropy (MEE) principle. Our contribution is twofold. Firstly, we establish stability results for SGD for pairwise learning in the convex, strongly convex, and nonconvex settings, from which generalization errors can be naturally derived. Secondly, we establish the tradeoff between the stability and the optimization error of SGD algorithms for pairwise learning. This is achieved by lower-bounding the sum of stability and optimization error by the minimax statistical error over a prescribed class of pairwise loss functions. From this fundamental tradeoff, we obtain lower bounds for the optimization error of SGD algorithms and the excess expected risk over a class of pairwise losses. In addition, we illustrate our stability results with specific examples of AUC maximization, metric learning, and MEE.
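To make the setting concrete, here is a minimal sketch of online SGD for a pairwise loss, where each new example is paired with a previously seen one. The function names and the squared-hinge ranking loss are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def pairwise_sgd(grad, data, eta=0.1, seed=0):
    """One pass of SGD for pairwise learning: at step t, pair the new
    example with a uniformly chosen earlier one (a common online scheme),
    using a decaying stepsize eta / sqrt(t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(data.shape[1])
    for t in range(1, len(data)):
        j = rng.integers(0, t)  # index of an earlier example
        w -= eta / np.sqrt(t) * grad(w, data[t], data[j])
    return w

# Hypothetical pairwise loss for AUC-style ranking: gradient of
# max(0, 1 - w.(z - z'))^2 with respect to w.
def sq_hinge_grad(w, z, zp):
    d = z - zp
    margin = 1.0 - w @ d
    return -2.0 * max(margin, 0.0) * d
```

A convex, strongly convex, or nonconvex choice of the pairwise loss slots in through `grad`, which is where the different stability regimes in the abstract would come into play.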
This content will become publicly available on December 31, 2024
Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent
Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied to different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and obtain time-uniform stability bounds (i.e., bounds that do not increase with the number of iterations) for strongly convex losses and nonconvex losses with additive noise, where we recover results similar to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or nonconvex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and nonconvex losses (without additional additive noise), which, to our knowledge, is novel.
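The "additive noise" variant of SGD discussed above can be sketched as follows; the update rule (a Gaussian perturbation of each iterate, as in Langevin-style dynamics) is a standard form of noise injection, though the exact scaling used in the paper is not reproduced here.

```python
import numpy as np

def noisy_sgd(grad, w0, eta=0.05, sigma=0.1, steps=200, seed=0):
    """SGD with additive Gaussian noise injected into each iterate.
    The noise (scaled by sigma * sqrt(eta)) is the kind of perturbation
    that can make the iterates ergodic, which the abstract identifies as
    key to time-uniform stability bounds."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = grad(w, rng)                      # stochastic gradient oracle
        w = w - eta * g + sigma * np.sqrt(eta) * rng.normal(size=w.shape)
    return w
```

On a strongly convex loss, the resulting Markov chain contracts toward a stationary distribution instead of drifting, which is the intuition behind bounds that do not grow with the iteration count.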
 Award ID(s):
 2053485
 NSF-PAR ID:
 10515310
 Publisher / Repository:
 Proceedings of Machine Learning Research
 Date Published:
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 1049-5258
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We study the problem of machine unlearning and identify a notion of algorithmic stability, Total Variation (TV) stability, which we argue is suitable for the goal of exact unlearning. For convex risk minimization problems, we design TV-stable algorithms based on noisy Stochastic Gradient Descent (SGD). Our key contribution is the design of corresponding efficient unlearning algorithms, which are based on constructing a near-maximal coupling of Markov chains for the noisy SGD procedure. To understand the tradeoffs between accuracy and unlearning efficiency, we give upper and lower bounds on the excess empirical and population risk of TV-stable algorithms for convex risk minimization. Our techniques generalize to arbitrary nonconvex functions, and our algorithms are differentially private as well.
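For context, here is a sketch of the noisy-SGD training primitive together with the naive exact-unlearning baseline (retraining from scratch on the remaining data). The coupling-based algorithms in the abstract exist precisely to avoid this full recompute; the baseline below is only meant to make "exact unlearning" concrete, and all names are illustrative.

```python
import numpy as np

def noisy_sgd_erm(data, grad, eta=0.05, sigma=0.05, epochs=3, seed=0):
    """Noisy SGD over an empirical risk: per-example updates with
    additive Gaussian noise of scale sigma * sqrt(eta)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(data.shape[1])
    for _ in range(epochs):
        for z in data:
            w = w - eta * grad(w, z) + sigma * np.sqrt(eta) * rng.normal(size=w.shape)
    return w

def unlearn_by_retraining(data, idx, grad, **kw):
    """Baseline exact unlearning: drop the deleted point and retrain.
    Output is distributed exactly as if the point were never seen,
    but at the full cost of retraining."""
    kept = np.delete(data, idx, axis=0)
    return noisy_sgd_erm(kept, grad, **kw)
```

TV stability quantifies how close the output distributions with and without the deleted point are, which is what lets the coupling construction replay most of the original run instead of retraining.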

Stochastic Gradient Descent (SGD) has played a central role in machine learning. However, it requires a carefully hand-picked stepsize for fast convergence, which is notoriously tedious and time-consuming to tune. Over the last several years, a plethora of adaptive gradient-based algorithms have emerged to ameliorate this problem. In this paper, we propose new surrogate losses to cast the problem of learning the optimal stepsizes for the stochastic optimization of a nonconvex smooth objective function onto an online convex optimization problem. This allows the use of no-regret online algorithms to compute optimal stepsizes on the fly. In turn, this results in an SGD algorithm with self-tuned stepsizes that guarantees convergence rates that are automatically adaptive to the level of noise.
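A simpler cousin of this idea is hypergradient descent, which also adapts the stepsize online from gradient information (here via the inner product of consecutive gradients). This is not the paper's surrogate-loss construction, just a minimal illustration of self-tuned stepsizes.

```python
import numpy as np

def sgd_hypergradient(grad, w0, eta0=0.01, beta=1e-4, steps=200, seed=0):
    """SGD whose stepsize is itself updated online: eta grows when
    consecutive gradients agree (g_t . g_{t-1} > 0) and shrinks when
    they disagree, clipped to stay positive."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    eta = eta0
    g_prev = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w, rng)
        eta = max(eta + beta * (g @ g_prev), 1e-6)  # online stepsize update
        w -= eta * g
        g_prev = g
    return w
```

The paper's approach replaces this heuristic with a principled reduction: each candidate stepsize incurs a convex surrogate loss, so a no-regret online learner provably tracks the best stepsize sequence.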

In this work, we describe a generic approach to show convergence with high probability for both stochastic convex and nonconvex optimization with sub-Gaussian noise. In previous works for convex optimization, either the convergence is only in expectation or the bound depends on the diameter of the domain. Instead, we show high probability convergence with bounds depending on the initial distance to the optimal solution. The algorithms use step sizes analogous to the standard settings and are universal to Lipschitz functions, smooth functions, and their linear combinations. The method can be applied to the nonconvex case. We demonstrate an $O((1+\sigma^{2}\log(1/\delta))/T+\sigma/\sqrt{T})$ convergence rate when the number of iterations $T$ is known and an $O((1+\sigma^{2}\log(T/\delta))/\sqrt{T})$ convergence rate when $T$ is unknown for SGD, where $1-\delta$ is the desired success probability. These bounds improve over existing bounds in the literature. We also revisit AdaGrad-Norm (Ward et al., 2019) and show a new analysis to obtain a high probability bound that does not require the bounded gradient assumption made in previous works. The full version of our paper contains results for the standard per-coordinate AdaGrad.
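The AdaGrad-Norm algorithm of Ward et al. (2019), revisited above, maintains a single scalar stepsize shared by all coordinates; a minimal sketch follows (parameter names are ours).

```python
import numpy as np

def adagrad_norm(grad, w0, eta=1.0, b0=1.0, steps=300, seed=0):
    """AdaGrad-Norm: scalar adaptive stepsize eta / sqrt(b0^2 + sum_t ||g_t||^2),
    in contrast to per-coordinate AdaGrad, which keeps one accumulator
    per coordinate."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    acc = b0 ** 2  # running sum of squared gradient norms
    for _ in range(steps):
        g = grad(w, rng)
        acc += g @ g
        w -= eta / np.sqrt(acc) * g
    return w
```

Because the accumulator only grows, the stepsize decays automatically at a rate governed by the observed gradient norms, which is what allows an analysis without a bounded-gradient assumption.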
