Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Oh, Alice H. ; Agarwal, Alekh ; Belgrave, Danielle ; Cho, Kyunghyun (Ed.)Traditional analyses in nonconvex optimization typically rely on the smoothness assumption, namely requiring the gradients to be Lipschitz. However, recent evidence shows that this smoothness condition does not capture the properties of some deep learning objective functions, including the ones involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this relaxed assumption, it has been theoretically and empirically shown that the gradientclipped SGD has an advantage over the vanilla one. In this paper, we show that clipping is not indispensable for Adamtype algorithms in tackling such scenarios: we theoretically prove that a generalized SignSGD algorithm can obtain similar convergence rates as SGD with clipping but does not need explicit clipping at all. This family of algorithms on one end recovers SignSGD and on the other end closely resembles the popular Adam algorithm. Our analysis underlines the critical role that momentum plays in analyzing SignSGDtype and Adamtype algorithms: it not only reduces the effects of noise, thus removing the need for large minibatch in previous analyses of SignSGDtype algorithms, but it also substantially reduces the effects of unbounded smoothness and gradient norms. To the best of our knowledge, this work is the first one showing the benefit of Adamtype algorithms compared with nonadaptive gradient algorithms such as gradient descent in the unbounded smoothness setting. We also compare these algorithms with popular optimizers on a set of deep learning tasks, observing that we can match the performance of Adam while beating others.more » « lessFree, publiclyaccessible full text available November 29, 2023

Dasgupta, Sanjoy ; Haghtalab, Nika (Ed.)Convexconcave minmax problems are ubiquitous in machine learning, and people usually utilize firstorder methods (e.g., gradient descent ascent) to find the optimal solution. One feature which separates convexconcave minmax problems from convex minimization problems is that the best known convergence rates for minmax problems have an explicit dependence on the size of the domain, rather than on the distance between initial point and the optimal solution. This means that the convergence speed does not have any improvement even if the algorithm starts from the optimal solution, and hence, is oblivious to the initialization. Here, we show that strictconvexitystrictconcavity is sufficient to get the convergence rate to depend on the initialization. We also show how different algorithms can asymptotically achieve initializationdependent convergence rates on this class of functions. Furthermore, we show that the socalled “parameterfree” algorithms allow to achieve improved initializationdependent asymptotic rates without any learning rate to tune. In addition, we utilize this particular parameterfree algorithm as a subroutine to design a new algorithm, which achieves a novel nonasymptotic fast rate for strictlyconvexstrictlyconcave minmax problems with a growth condition and Hölder continuous solution mapping. Experiments are conducted to verify our theoretical findings and demonstrate the effectiveness of the proposed algorithms.more » « less

SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.more » « less

Parameterfree stochastic gradient descent (PFSGD) algorithms do not require setting learning rates while achieving optimal theoretical performance. In practical applications, however, there remains an empirical gap between tuned stochastic gradient descent (SGD) and PFSGD. In this paper, we close the empirical gap with a new parameterfree algorithm based on continuoustime CoinBetting on truncated models. The new update is derived through the solution of an Ordinary Differential Equation (ODE) and solved in a closed form. We show empirically that this new parameterfree algorithm outperforms algorithms with the "best default" learning rates and almost matches the performance of finely tuned baselines without anything to tune.more » « less

SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.more » « less

Dasgupta, Sanjoy ; Haghtalab, Nika (Ed.)Parameterfree algorithms are online learning algorithms that do not require setting learning rates. They achieve optimal regret with respect to the distance between the initial point and any competitor. Yet, parameterfree algorithms do not take into account the geometry of the losses. Recently, in the stochastic optimization literature, it has been proposed to instead use truncated linear lower bounds, which produce better performance by more closely modeling the losses. In particular, truncated linear models greatly reduce the problem of overshooting the minimum of the loss function. Unfortunately, truncated linear models cannot be used with parameterfree algorithms because the updates become very expensive to compute. In this paper, we propose new parameterfree algorithms that can take advantage of truncated linear models through a new update that has an “implicit” flavor. Based on a novel decomposition of the regret, the new update is efficient, requires only one gradient at each step, never overshoots the minimum of the truncated model, and retains the favorable parameterfree properties. We also conduct an empirical study demonstrating the practical utility of our algorithms.more » « less

Meila, Marina ; Zhang, Tong (Ed.)Stochastic Gradient Descent (SGD) is a popular tool in training largescale machine learning models. Its performance, however, is highly variable, depending crucially on the choice of the step sizes. Accordingly, a variety of strategies for tuning the step sizes have been proposed, ranging from coordinatewise approaches (a.k.a. “adaptive” step sizes) to sophisticated heuristics to change the step size in each iteration. In this paper, we study two step size schedules whose power has been repeatedly confirmed in practice: the exponential and the cosine step sizes. For the first time, we provide theoretical support for them proving convergence rates for smooth nonconvex functions, with and without the PolyakŁojasiewicz (PL) condition. Moreover, we show the surprising property that these two strategies are adaptive to the noise level in the stochastic gradients of PL functions. That is, contrary to polynomial step sizes, they achieve almost optimal performance without needing to know the noise level nor tuning their hyperparameters based on it. Finally, we conduct a fair and comprehensive empirical evaluation of realworld datasets with deep learning architectures. Results show that, even if only requiring at most two hyperparameters to tune, these two strategies best or match the performance of various finelytuned stateoftheart strategies.more » « less

Meila, Marina ; Zhang, Tong (Ed.)Inspired by the demands of realtime climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms—DORM, DORM+, and AdaHedgeD—arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delayasoptimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new metaalgorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to stateoftheart forecasting models.more » « less

null (Ed.)Online learning is a framework for the design and analysis of algorithms that build predictive models by processing data one at the time. Besides being computationally efficient, online algorithms enjoy theoretical performance guarantees that do not rely on statistical assumptions on the data source. In this survey, we describe some of the most important algorithmic ideas behind online learning and explain the main mathematical tools for their analysis. Our reference framework is online convex optimization, a sequential version of convex optimization within which most online algorithms are formulated. More specifically, we provide an indepth description of Online Mirror Descent and Follow the Regularized Leader, two of the most fundamental algorithms in online learning. As the tuning of parameters is a typically difficult task in sequential data analysis, in the last part of the survey we focus on coinbetting, an informationtheoretic approach to the design of parameterfree online algorithms with good theoretical guarantees.more » « less