
Title: A Second look at Exponential and Cosine Step Sizes: Simplicity, Adaptivity, and Performance
Stochastic Gradient Descent (SGD) is a popular tool for training large-scale machine learning models. Its performance, however, is highly variable, depending crucially on the choice of the step sizes. Accordingly, a variety of strategies for tuning the step sizes have been proposed, ranging from coordinate-wise approaches (a.k.a. “adaptive” step sizes) to sophisticated heuristics that change the step size at each iteration. In this paper, we study two step size schedules whose effectiveness has been repeatedly confirmed in practice: the exponential and the cosine step sizes. For the first time, we provide theoretical support for them, proving convergence rates for smooth non-convex functions, with and without the Polyak-Łojasiewicz (PL) condition. Moreover, we show the surprising property that these two strategies are adaptive to the noise level in the stochastic gradients of PL functions. That is, contrary to polynomial step sizes, they achieve almost optimal performance without needing to know the noise level or to tune their hyperparameters based on it. Finally, we conduct a fair and comprehensive empirical evaluation on real-world datasets with deep learning architectures. The results show that, despite requiring at most two hyperparameters to tune, these two strategies match or outperform various finely tuned state-of-the-art strategies.
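For concreteness, here is a minimal Python sketch of the two schedules studied in the paper. The function names and the exact decay constant for the exponential schedule are illustrative choices (the analysis ties the decay factor to the horizon T; a common parameterization shrinks the step size by roughly a factor 1/T over T iterations), so treat this as a sketch rather than the paper's exact recipe.

    import math

    def exponential_step_size(t, T, eta0=1.0, alpha=None):
        """Exponentially decaying step size: eta_t = eta0 * alpha**t.

        If alpha is not given, use a decay that shrinks the step size by
        roughly a factor 1/T over the horizon T (one common parameterization;
        the paper's exact constant may differ).
        """
        if alpha is None:
            alpha = (1.0 / T) ** (1.0 / T)
        return eta0 * alpha ** t

    def cosine_step_size(t, T, eta0=1.0):
        """Cosine annealing: eta_t = eta0/2 * (1 + cos(pi * t / T))."""
        return 0.5 * eta0 * (1.0 + math.cos(math.pi * t / T))

    # Example: print the two schedules over a short horizon.
    if __name__ == "__main__":
        T = 10
        for t in range(T):
            print(t, round(exponential_step_size(t, T), 4),
                     round(cosine_step_size(t, T), 4))

In a training loop, eta_t would multiply the stochastic gradient at iteration t, with T set to the total number of iterations or epochs; eta0 and (for the exponential schedule) the decay factor are the only hyperparameters.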
Authors:
Editors:
Meila, Marina; Zhang, Tong
Award ID(s):
1925930 2022446 1908111
Publication Date:
NSF-PAR ID:
10310894
Journal Name:
Proceedings of Machine Learning Research
Volume:
139
ISSN:
2640-3498
Sponsoring Org:
National Science Foundation
More Like this
  1. The performance of private gradient-based optimization algorithms is highly dependent on the choice of step size (or learning rate), which often requires a non-trivial amount of tuning. In this paper, we introduce a stochastic variant of the classic backtracking line search algorithm that satisfies Renyi differential privacy. Specifically, the proposed algorithm adaptively chooses the step size satisfying the Armijo condition (with high probability) using noisy gradient and function estimates. Furthermore, to improve the probability with which the chosen step size satisfies the condition, it adjusts the per-iteration privacy budget at runtime according to the reliability of the noisy gradient. A naive implementation of the backtracking search algorithm may end up using an unacceptably large privacy budget, as the ability to adaptively select the step size comes at the cost of extra function evaluations. The proposed algorithm avoids this problem by using the sparse vector technique combined with a recent privacy amplification lemma. We also introduce a privacy budget adaptation strategy in which the algorithm adaptively increases the budget when it detects that the directions of consecutive gradients differ drastically. Extensive experiments on both convex and non-convex problems show that the adaptively chosen step sizes allow the proposed algorithm to use the privacy budget efficiently and to achieve competitive performance against existing private optimizers. (A simplified, non-private sketch of the noisy Armijo backtracking step appears after this list.)
  2. The emphasis on conceptual learning and the development of adaptive instructional design are both emerging areas in science and engineering education. Instructors are writing their own conceptual questions to promote active learning during class and utilizing pools of these questions in assessments. For adaptive assessment strategies, these questions need to be rated based on difficulty level (DL). Historically, DL has been determined from the performance of a suitable number of students. The research study reported here investigates whether instructors can save time by predicting the DL of newly written conceptual questions without the need for student data. In this paper, we report on the development of one component in an adaptive learning module for materials science, specifically on the topic of crystallography. The summative assessment element consists of five DL scales and 15 conceptual questions. This adaptive assessment directs students based on their previous performance and the DL of the questions. Our five expert participants are faculty members who have taught the introductory Materials Science course multiple times. They provided predictions for how many students would answer each question correctly during a two-step process. First, predictions were made individually without an answer key. Second, experts had the opportunity to revise their predictions after being provided an answer key in a group discussion. We compared expert predictions with actual student performance using results from over 400 students spanning multiple courses and terms. We found no clear correlation between the expert predictions of the DL and the DL measured from student data. Some evidence shows that discussion during the second step brought expert predictions closer to student performance. We suggest that, in determining the DL of conceptual questions, relying on predictions by experts who have taught the course is not a valid route. The findings in this paper can be applied to assessments in in-person, hybrid, and online settings and are applicable to subject matter beyond materials science.
  3. Efficient simulation of stochastic differential equations (SDEs) is essential in many applications, particularly for ergodic systems that demand efficient simulation of both short-time dynamics and large-time statistics. However, locally Lipschitz SDEs often require special treatment, such as implicit schemes with small time-steps, to accurately simulate the ergodic measures. We introduce a framework to construct inference-based schemes adaptive to large time-steps (ISALT) from data, achieving a reduction in computation time of several orders of magnitude. The key is the statistical learning of an approximation to the infinite-dimensional discrete-time flow map. We explore the use of numerical schemes (such as Euler-Maruyama, a hybrid RK4, and an implicit scheme) to derive informed basis functions, leading to a parameter inference problem. We introduce a scalable algorithm to estimate the parameters by least squares, and we prove the convergence of the estimators as the data size increases.

    We test ISALT on three non-globally Lipschitz SDEs: the 1D double-well potential, a 2D multiscale gradient system, and the 3D stochastic Lorenz equation with degenerate noise. Numerical results show that ISALT can tolerate time-steps far larger than plain numerical schemes can, and it reaches optimal accuracy in reproducing the invariant measure when the time-step is medium-large. (A toy least-squares illustration of the coarse-time flow-map fit appears after this list.)

  4. Application-layer transfer configurations play a crucial role in achieving desirable performance in high-speed networks. However, finding the optimal configuration for a given transfer task is a difficult problem, as it depends on various factors including dataset characteristics, network settings, and background traffic. State-of-the-art transfer tuning solutions rely on real-time sample transfers to evaluate various configurations and estimate the optimal one. However, existing approaches to running sample transfers incur high delay and measurement errors, thus significantly limiting the efficiency of transfer tuning algorithms. In this paper, we introduce an adaptive feed-forward deep neural network (DNN) to minimize the error rate of sample transfers without increasing their execution time. We ran 115K file transfers in four different high-speed networks and used their logs to train an adaptive DNN that can quickly and accurately predict the throughput of sample transfers by analyzing instantaneous throughput values. The results, gathered in various networks with a rich set of transfer configurations, indicate that the proposed model reduces the error rate by up to 50% compared to state-of-the-art solutions while keeping the execution time low. We also show that one can further reduce delay or error rate by tuning the hyperparameters of the model to meet specific needs of the user or application. Finally, a transfer learning analysis reveals that a model developed in one network yields accurate results in other networks with similar transfer convergence characteristics, alleviating the need to run extensive data collection and model derivation efforts for each network. (A minimal feed-forward regressor sketch appears after this list.)
  5. Deep learning methods achieve state-of-the-art performance in many application scenarios. Yet, these methods require a significant amount of hyperparameter tuning in order to achieve the best results. In particular, tuning the learning rates in the stochastic optimization process is still one of the main bottlenecks. In this paper, we propose a new stochastic gradient descent procedure for deep networks that does not require any learning rate setting. Contrary to previous methods, we neither adapt the learning rates nor make use of the assumed curvature of the objective function. Instead, we reduce the optimization process to a game of betting on a coin and propose a learning-rate-free optimal algorithm for this scenario. Theoretical convergence is proven for convex and quasi-convex functions, and empirical evidence shows the advantage of our algorithm over popular stochastic gradient algorithms. (A simplified per-coordinate coin-betting sketch appears below.)
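The backtracking step summarized in item 1 can be illustrated without any of the privacy machinery. The sketch below checks the Armijo (sufficient decrease) condition on noisy function and gradient estimates and shrinks the step size until the condition holds; the Gaussian noise is only a stand-in for stochastic estimates, and the Renyi-DP accounting, sparse vector technique, and budget adaptation described in the abstract are deliberately omitted. All names and parameters are illustrative.

    import numpy as np

    def noisy_armijo_step(f, grad, x, eta0=1.0, beta=0.5, c=1e-4,
                          noise_scale=0.0, max_backtracks=20, rng=None):
        """Backtracking line search on noisy function/gradient estimates.

        Classical Armijo backtracking; the privacy mechanism from the
        abstract is omitted -- noise_scale is just an illustrative
        stand-in for the stochasticity of the estimates.
        """
        rng = rng or np.random.default_rng()
        g = grad(x) + noise_scale * rng.standard_normal(x.shape)
        fx = f(x) + noise_scale * rng.standard_normal()
        eta = eta0
        for _ in range(max_backtracks):
            f_new = f(x - eta * g) + noise_scale * rng.standard_normal()
            # Armijo (sufficient decrease) condition, checked on noisy estimates.
            if f_new <= fx - c * eta * np.dot(g, g):
                break
            eta *= beta  # shrink the step size and try again
        return x - eta * g, eta

    # Example: one step on a simple quadratic.
    f = lambda x: 0.5 * np.dot(x, x)
    grad = lambda x: x
    x_new, eta = noisy_armijo_step(f, grad, np.array([3.0, -4.0]), noise_scale=0.01)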
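The core idea of item 3, fitting a coarse-time flow map whose basis functions come from a numerical scheme and estimating its parameters by least squares, can be pictured on a toy problem. The drift, noise level, and step sizes below are arbitrary illustrative choices, not the ISALT setup of the paper; the single Euler-Maruyama-style basis function stands in for the richer parameterizations the abstract mentions.

    import numpy as np

    rng = np.random.default_rng(0)

    # 1D double-well SDE: dX = (X - X**3) dt + sigma dW  (illustrative parameters)
    drift = lambda x: x - x**3
    sigma, dt_fine, stride = 0.5, 1e-3, 100   # coarse step = stride * dt_fine

    # Generate a fine-scale reference trajectory with Euler-Maruyama.
    n_fine = 200_000
    x = np.empty(n_fine)
    x[0] = 0.1
    for k in range(n_fine - 1):
        x[k + 1] = (x[k] + drift(x[k]) * dt_fine
                    + sigma * np.sqrt(dt_fine) * rng.standard_normal())

    # Subsample at the large time-step and fit a one-step map
    #   X_{n+1} ~= X_n + theta * drift(X_n) * Dt + noise
    # by least squares; drift(X_n) * Dt is the "informed basis" from Euler-Maruyama.
    coarse = x[::stride]
    Dt = stride * dt_fine
    A = (drift(coarse[:-1]) * Dt)[:, None]   # design matrix (one basis function)
    b = coarse[1:] - coarse[:-1]             # observed coarse-time increments
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("fitted coefficient:", theta[0])   # close to 1 when the basis is adequate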
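The predictor in item 4 can be pictured as a small feed-forward regressor over transfer features. The architecture, feature count, and training loop below (written with PyTorch, on random stand-in data) are placeholders for illustration only and do not reproduce the model or logs from the paper.

    import torch
    import torch.nn as nn

    class ThroughputDNN(nn.Module):
        """Small feed-forward regressor mapping instantaneous throughput samples
        (plus transfer-configuration features) to a predicted throughput."""
        def __init__(self, n_features: int, hidden: int = 64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_features, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):
            return self.net(x).squeeze(-1)

    # Illustrative training loop on random data standing in for transfer logs.
    model = ThroughputDNN(n_features=8)
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    X, y = torch.randn(256, 8), torch.rand(256)
    for _ in range(100):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(X), y)
        loss.backward()
        opt.step()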
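The coin-betting reduction in item 5 can be sketched with a per-coordinate Krichevsky-Trofimov bettor, where the negative stochastic gradient plays the role of the coin outcome. The sketch assumes each gradient coordinate lies in [-1, 1] and returns the average iterate; it conveys the idea of a learning-rate-free update but is not the exact algorithm of the cited paper.

    import numpy as np

    def coin_betting_sgd(grad, x0, T, eps=1.0):
        """Parameter-free SGD via a per-coordinate Krichevsky-Trofimov bettor.

        Assumes each stochastic gradient coordinate lies in [-1, 1]; the
        initial wealth eps replaces the (absent) learning rate. Simplified
        illustration, not the cited paper's exact algorithm.
        """
        x0 = np.asarray(x0, dtype=float)
        grad_sum = np.zeros_like(x0)   # sum of past coin outcomes c_i = -g_i
        reward = np.zeros_like(x0)     # sum of past c_i * w_i
        avg = np.zeros_like(x0)
        for t in range(1, T + 1):
            w = grad_sum / t * (eps + reward)   # KT bet (offset from x0)
            g = np.asarray(grad(x0 + w), dtype=float)
            grad_sum -= g
            reward -= g * w
            avg += (x0 + w - avg) / t           # running average of the iterates
        return avg

    # Example: minimize (x - 2)^2 / 2 with gradients clipped to [-1, 1].
    grad = lambda x: np.clip(x - 2.0, -1.0, 1.0)
    print(coin_betting_sgd(grad, x0=[0.0], T=1000))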