skip to main content


This content will become publicly available on July 23, 2024

Title: DoG is SGD’s Best Friend: A Parameter-Free Dynamic Step Size Schedule
We propose a tuning-free dynamic SGD step size formula, which we call Distance over Gradients (DoG). The DoG step sizes depend on simple empirical quantities (distance from the initial point and norms of gradients) and have no “learning rate” parameter. Theoretically, we show that, for stochastic convex optimization, a slight variation of the DoG formula enjoys strong, high-probability parameter-free convergence guarantees and iterate movement bounds. Empirically, we consider a broad range of vision and language transfer learning tasks, and show that DoG’s performance is close to that of SGD with tuned learning rate. We also propose a per-layer variant of DoG that generally outperforms tuned SGD, approaching the performance of tuned Adam. A PyTorch implementation of our algorithms is available at https://github.com/formll/dog.  more » « less
Award ID(s):
2239527
NSF-PAR ID:
10490603
Author(s) / Creator(s):
Publisher / Repository:
Proceedings of machine learning research
Date Published:
Journal Name:
Proceedings of the 40th International Conference on Machine Learning
ISSN:
2640-3498
Format(s):
Medium: X
Location:
Hawaii
Sponsoring Org:
National Science Foundation
More Like this
  1. Meila, Marina ; Zhang, Tong (Ed.)
    Stochastic Gradient Descent (SGD) is a popular tool in training large-scale machine learning models. Its performance, however, is highly variable, depending crucially on the choice of the step sizes. Accordingly, a variety of strategies for tuning the step sizes have been proposed, ranging from coordinate-wise approaches (a.k.a. “adaptive” step sizes) to sophisticated heuristics to change the step size in each iteration. In this paper, we study two step size schedules whose power has been repeatedly confirmed in practice: the exponential and the cosine step sizes. For the first time, we provide theoretical support for them proving convergence rates for smooth non-convex functions, with and without the Polyak-Łojasiewicz (PL) condition. Moreover, we show the surprising property that these two strategies are adaptive to the noise level in the stochastic gradients of PL functions. That is, contrary to polynomial step sizes, they achieve almost optimal performance without needing to know the noise level nor tuning their hyperparameters based on it. Finally, we conduct a fair and comprehensive empirical evaluation of real-world datasets with deep learning architectures. Results show that, even if only requiring at most two hyperparameters to tune, these two strategies best or match the performance of various finely-tuned state-of-the-art strategies. 
    more » « less
  2. Parameter-free stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameter-free algorithm based on continuous-time Coin-Betting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameter-free algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune. 
    more » « less
  3. null (Ed.)
    We consider the problem of minimizing a convex risk with stochastic subgradients guaranteeing $\epsilon$-locally differentially private ($\epsilon$-LDP). While it has been shown that stochastic optimization is possible with $\epsilon$-LDP via the standard SGD, its convergence rate largely depends on the learning rate, which must be tuned via repeated runs. Further, tuning is detrimental to privacy loss since it significantly increases the number of gradient requests. In this work, we propose BANCO (Betting Algorithm for Noisy COins), the first $\epsilon$-LDP SGD algorithm that essentially matches the convergence rate of the tuned SGD without any learning rate parameter, reducing privacy loss and saving privacy budget. 
    more » « less
  4. In this paper, we propose a new algorithm for solving convex-concave saddle-point problems using regret minimization in the repeated game framework. To do so, we introduce the Conic Blackwell Algorithm + ([Formula: see text]), a new parameter- and scale-free regret minimizer for general convex compact sets. [Formula: see text] is based on Blackwell approachability and attains [Formula: see text] regret. We show how to efficiently instantiate [Formula: see text] for many decision sets of interest, including the simplex, [Formula: see text] norm balls, and ellipsoidal confidence regions in the simplex. Based on [Formula: see text], we introduce [Formula: see text], a new parameter-free algorithm for solving convex-concave saddle-point problems achieving a [Formula: see text] ergodic convergence rate. In our simulations, we demonstrate the wide applicability of [Formula: see text] on several standard saddle-point problems from the optimization and operations research literature, including matrix games, extensive-form games, distributionally robust logistic regression, and Markov decision processes. In each setting, [Formula: see text] achieves state-of-the-art numerical performance and outperforms classical methods, without the need for any choice of step sizes or other algorithmic parameters. Funding: J. Grand-Clément is supported by the Agence Nationale de la Recherche [Grant 11-LABX-0047] and by Hi! Paris. C. Kroer is supported by the Office of Naval Research [Grant N00014-22-1-2530] and by the National Science Foundation [Grant IIS-2147361]. 
    more » « less
  5. We present LBW-Net, an efficient optimization based method for quantization and training of the low bit-width convolutional neural networks (CNNs). Specifically, we quantize the weights to zero or powers of 2 by minimizing the Euclidean distance between full-precision weights and quantized weights during back-propagation (weight learning). We characterize the combinatorial nature of the low bit-width quantization problem. For 2-bit (ternary) CNNs, the quantization of N weights can be done by an exact formula in O(N log N) complexity. When the bit-width is 3 and above, we further propose a semi-analytical thresholding scheme with a single free parameter for quantization that is computationally inexpensive. The free parameter is further determined by network retraining and object detection tests. The LBW-Net has several desirable advantages over full-precision CNNs, including considerable memory savings, energy efficiency, and faster deployment. Our experiments on PASCAL VOC dataset show that compared with its 32-bit floating-point counterpart, the performance of the 6-bit LBW-Net is nearly lossless in the object detection tasks, and can even do better in real world visual scenes, while empirically enjoying more than 4× faster deployment. 
    more » « less