skip to main content


This content will become publicly available on April 14, 2025

Title: AUTOSGM: A Unified Lowpass Regularization Framework for Accelerated Learning
This paper unifies commonly used accelerated stochastic gradient methods (Polyak’s Heavy Ball, Nesterov’s Accelerated Gradient and Adaptive Moment Estimation (Adam)) as specific cases of a general lowpass regularized learning framework, the Automatic Stochastic Gradient Method (AutoSGM). For AutoSGM, we derive an optimal iteration-dependent learning rate function and realize an approximation. Adam is also an approximation of this optimal approach that replaces the iteration-dependent learning-rate with a constant. Empirical results on deep neural networks comparing the learning behavior of AutoSGM equipped with this iteration-dependent learning-rate algorithm demonstrate fast learning behavior, robustness to the initial choice of the learning rate, and can tune an initial constant learning-rate in applications where a good constant learning rate approximation is unknown.  more » « less
Award ID(s):
1901492
NSF-PAR ID:
10509404
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Proceedings of IEEE International Conference on Acoustics, SPeech and Signal Processing
ISBN:
979-8-3503-4485-1
Page Range / eLocation ID:
7285 to 7289
Subject(s) / Keyword(s):
stochastic gradient method accelerated learning learning algorithms optimization deep learning
Format(s):
Medium: X
Location:
Seoul, Korea, Republic of
Sponsoring Org:
National Science Foundation
More Like this
  1. Minimax optimal convergence rates for classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, SGD's final iterate behavior has received much less attention despite their widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations SGD is run for) is known in advance, SGD's final iterate behavior with any polynomially decaying learning rate scheme is highly sub-optimal compared to the minimax rate (by a condition number factor in the strongly convex case and a factor of T‾‾√ in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offers significant improvements over any polynomially decaying step sizes. In particular, the final iterate behavior with a step decay schedule is off the minimax rate by only log factors (in the condition number for strongly convex case, and in T for the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD's final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsizes employed. These results demonstrate the subtlety in establishing optimal learning rate schemes (for the final iterate) for stochastic gradient procedures in fixed time horizon settings. 
    more » « less
  2. Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD’s final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGD’s final iterate with any polynomially decaying learning rate scheme is highly sub-optimal compared to the statistical minimax rate (by a condition number factor in the strongly convex √ case and a factor of shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGD’s final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the stepsize scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings. 
    more » « less
  3. Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGDs final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least quares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e. the number of iterations that SGD is run for) is known in advance, the behavior of SGDs final iterate with any polynomially decaying learning rate scheme is highly suboptimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of \sqrt{T} in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically) offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only log factors (in the condition number for the strongly convex case, and in T in the non-strongly convex case). Finally, in stark contrast to the known horizon case, this paper shows that the anytime (i.e. the limiting) behavior of SGDs final iterate is poor (in that it queries iterates with highly sub-optimal function value infinitely often, i.e. in a limsup sense) irrespective of the step size scheme employed. These results demonstrate the subtlety in establishing optimal learning rate schedules (for the final iterate) for stochastic gradient procedures in fixed time horizon settings. 
    more » « less
  4. null (Ed.)
    We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximation-based algorithms. 
    more » « less
  5. This paper presents an iterative learning approach for optimizing course geometry in repetitive path following applications. In particular, we focus on airborne wind energy (AWE) systems. Our proposed algorithm consists of two key features: First, a recursive least squares fit is used to construct an estimate of the behavior of the performance index. Second, an iteration-to-iteration path adaptation law is used to adjust the path shape in the direction of optimal performance. We propose two candidate update laws, both of which parallel the mathematical structure of common iterative learning control (ILC) update laws but replace the tracking-dependent terms with terms based on the performance index.We apply our formulation to the iterative crosswind path optimization of an AWE system, where the goal is to maximize the average power output over a figure-8 path. Using a physics based AWE system model, we demonstrate that the proposed adaptation strategy successfully achieves convergence to near-optimal figure-8 paths for a variety of initial conditions under both constant and real wind profiles. 
    more » « less