We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as \emph{Recursive One-Over-T SGD} (\ROOTSGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, non-asymptotic sense and an asymptotic sense. On the non-asymptotic side, we prove risk bounds on the last iterate of \ROOTSGD with leading-order terms that match the optimal statistical risk with a unity prefactor, along with a higher-order term that scales at the sharp rate of $O(n^{-3/2})$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) \ROOTSGD converges asymptotically to a Gaussian limit with the Cram\'{e}r-Rao optimal asymptotic covariance, for a broad range of step-size choices.
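The recursive averaging of past stochastic gradients described above can be sketched on a one-dimensional quadratic. This is a hypothetical reconstruction from the abstract, not the paper's implementation: the estimator `v` reuses each fresh noise sample at both the current and the previous iterate before blending it into a running average with weight `1 - 1/t`.

```python
import random

THETA = 3.0  # minimizer of the toy objective f(x) = 0.5 * (x - THETA)**2

def stoch_grad(x, xi):
    """Stochastic gradient of the toy objective with additive noise xi."""
    return (x - THETA) + xi

def root_sgd(steps=20000, eta=0.05, seed=0):
    """Minimal 1-D sketch of a recursive gradient estimator in the spirit
    of ROOT-SGD (assumed form, based on the abstract's description):
        v_t = g(x_{t-1}; xi_t) + (1 - 1/t) * (v_{t-1} - g(x_{t-2}; xi_t))
        x_t = x_{t-1} - eta * v_t
    """
    rng = random.Random(seed)
    x_prev = x = 0.0
    v = stoch_grad(x, rng.gauss(0.0, 1.0))  # t = 1: plain stochastic gradient
    x_prev, x = x, x - eta * v
    for t in range(2, steps + 1):
        xi = rng.gauss(0.0, 1.0)            # one sample, two evaluations
        v = stoch_grad(x, xi) + (1.0 - 1.0 / t) * (v - stoch_grad(x_prev, xi))
        x_prev, x = x, x - eta * v
    return x
```

Note how the shared sample `xi` cancels much of the noise between the two gradient evaluations, which is what lets the last iterate (rather than an averaged one) attain low variance.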
On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration
We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system $\bar{A} \theta = \bar{b}$. When the matrix $\bar{A}$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and number of iterations going to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in the CLT in any direction, up to universal constants. When the matrix $\bar{A}$ is not Hurwitz but only has eigenvalues with nonnegative real parts, we prove that the averaged LSA procedure actually achieves an $O(1/T)$ rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also show various applications of the main results, including the study of momentum-based stochastic gradient methods as well as temporal-difference algorithms in reinforcement learning.
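The averaged procedure above can be illustrated with a scalar stand-in for $\bar{A} \theta = \bar{b}$: each step sees noisy observations $(a_t, b_t)$ with the stated fixed step size, and the Polyak-Ruppert average of the iterates is the quantity the CLT describes. All constants here are illustrative choices, not values from the paper.

```python
import random

def lsa_polyak_ruppert(a_bar=1.0, b_bar=2.0, eta=0.05, steps=50000, seed=1):
    """1-D sketch of linear stochastic approximation with Polyak-Ruppert
    averaging for a_bar * theta = b_bar (solution theta* = b_bar / a_bar).
    """
    rng = random.Random(seed)
    theta, theta_bar = 0.0, 0.0
    for t in range(1, steps + 1):
        a_t = a_bar + rng.gauss(0.0, 0.1)          # noisy "matrix" observation
        b_t = b_bar + rng.gauss(0.0, 0.1)          # noisy "vector" observation
        theta = theta - eta * (a_t * theta - b_t)  # LSA step, fixed step size
        theta_bar += (theta - theta_bar) / t       # running Polyak-Ruppert average
    return theta_bar
```

With a fixed step size the raw iterate `theta` keeps fluctuating around the solution; it is the running average `theta_bar` whose fluctuations shrink, consistent with the covariance decomposition described in the abstract.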
Award ID(s): 1909365
NSF-PAR ID: 10250956
Journal Name: Proceedings of Thirty Third Conference on Learning Theory
Volume: 125
Page Range or eLocation-ID: 2947-2997
Sponsoring Org: National Science Foundation
More Like this



This work characterizes the benefits of averaging techniques widely used in conjunction with stochastic gradient descent (SGD). In particular, this work presents a sharp analysis of: (1) mini-batching, a method of averaging many samples of a stochastic gradient to both reduce the variance of a stochastic gradient estimate and parallelize SGD, and (2) tail-averaging, a method involving averaging the final few iterates of SGD in order to decrease the variance in SGD's final iterate. This work presents sharp finite-sample generalization error bounds for these schemes for the stochastic approximation problem of least squares regression. Furthermore, this work establishes a precise problem-dependent extent to which mini-batching can be used to yield provable near-linear parallelization speedups over SGD with batch size one. This characterization is used to understand the relationship between learning rate and batch size when considering the excess risk of the final iterate of an SGD procedure. Next, this mini-batching characterization is utilized in providing a highly parallelizable SGD method that achieves the minimax risk with nearly the same number of serial updates as batch gradient descent, improving significantly over existing SGD-style methods. Following this, a non-asymptotic excess risk bound for model averaging (which is a […]
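The two averaging schemes analyzed above can be sketched together on a one-dimensional least-squares problem. This is an illustrative instance (model `y = theta_star * x + noise`, with made-up constants), not the paper's setup: mini-batching averages several gradient samples per step, and tail-averaging averages the final half of the iterates.

```python
import random

def tail_averaged_minibatch_sgd(theta_star=1.5, batch=8, steps=4000,
                                eta=0.2, seed=2):
    """1-D least-squares sketch combining mini-batching (variance reduction
    within a step) with tail-averaging (variance reduction across the final
    iterates)."""
    rng = random.Random(seed)
    theta = 0.0
    tail_sum, tail_count = 0.0, 0
    for t in range(steps):
        g = 0.0
        for _ in range(batch):               # mini-batch: average gradients
            x = rng.gauss(0.0, 1.0)
            y = theta_star * x + rng.gauss(0.0, 0.5)
            g += (theta * x - y) * x         # squared-loss gradient sample
        theta -= eta * g / batch
        if t >= steps // 2:                  # tail-averaging: last half only
            tail_sum += theta
            tail_count += 1
    return tail_sum / tail_count
```

Averaging only the tail, rather than all iterates, discards the early transient phase where the iterate is still far from the optimum, which is the intuition behind the sharper bounds the abstract describes.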

Bach, Francis ; Blei, David ; Scholkopf, Bernhard (Ed.)This paper investigates the asymptotic behaviors of gradient descent algorithms (particularly accelerated gradient descent and stochastic gradient descent) in the context of stochastic optimization arising in statistics and machine learning, where objective functions are estimated from available data. We show that these algorithms can be computationally modeled by continuous-time ordinary or stochastic differential equations. We establish gradient flow central limit theorems to describe the limiting dynamic behaviors of these computational algorithms and the large-sample performances of the related statistical procedures, as the number of algorithm iterations and the data size both go to infinity, where the gradient flow central limit theorems are governed by some linear ordinary or stochastic differential equations, like time-dependent Ornstein-Uhlenbeck processes. We illustrate that our study can provide a novel unified framework for a joint computational and statistical asymptotic analysis, where the computational asymptotic analysis studies the dynamic behaviors of these algorithms with time (or the number of iterations in the algorithms), while the statistical asymptotic analysis investigates the large-sample behaviors of the statistical procedures (like estimators and classifiers) that are computed by applying the algorithms; in fact, the statistical procedures are equal to the limits of the random sequences generated from these iterative algorithms, as the number […]
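The continuous-time modeling described above can be seen in miniature: gradient descent on a toy objective is the Euler discretization of the gradient flow ODE, so for small step sizes the iterates track the flow's exact solution. This is a generic illustration of the ODE viewpoint, not the paper's construction.

```python
import math

def gradient_flow_vs_gd(steps=1000, eta=0.001):
    """Compare gradient descent on f(x) = 0.5 * x**2 (starting at x = 1)
    with the exact solution of the gradient flow x'(t) = -x(t), which is
    x(0) * exp(-t). One GD step with step size eta is one Euler step of
    the flow, so after `steps` iterations t = eta * steps.
    """
    x = 1.0
    for _ in range(steps):
        x -= eta * x                     # GD step = Euler step of the flow
    flow = math.exp(-eta * steps)        # exact gradient-flow value at t = eta * steps
    return x, flow
```

As `eta` shrinks (with `eta * steps` held fixed), the discrepancy between the two returned values vanishes, which is the sense in which the algorithms are "computationally modeled" by differential equations.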
