Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

Oh, Alice H. ; Agarwal, Alekh ; Belgrave, Danielle ; Cho, Kyunghyun (Ed.)Traditional analyses in nonconvex optimization typically rely on the smoothness assumption, namely requiring the gradients to be Lipschitz. However, recent evidence shows that this smoothness condition does not capture the properties of some deep learning objective functions, including the ones involving Recurrent Neural Networks and LSTMs. Instead, they satisfy a much more relaxed condition, with potentially unbounded smoothness. Under this relaxed assumption, it has been theoretically and empirically shown that the gradientclipped SGD has an advantage over the vanilla one. In this paper, we show that clipping is not indispensable for Adamtype algorithms in tackling such scenarios: we theoretically prove that a generalized SignSGD algorithm can obtain similar convergence rates as SGD with clipping but does not need explicit clipping at all. This family of algorithms on one end recovers SignSGD and on the other end closely resembles the popular Adam algorithm. Our analysis underlines the critical role that momentum plays in analyzing SignSGDtype and Adamtype algorithms: it not only reduces the effects of noise, thus removing the need for large minibatch in previous analyses of SignSGDtype algorithms, but it also substantially reduces the effects of unbounded smoothness and gradient norms. To the best of our knowledge, this work ismore »Free, publiclyaccessible full text available November 29, 2023

Free, publiclyaccessible full text available April 7, 2024

Dasgupta, Sanjoy ; Haghtalab, Nika (Ed.)Convexconcave minmax problems are ubiquitous in machine learning, and people usually utilize firstorder methods (e.g., gradient descent ascent) to find the optimal solution. One feature which separates convexconcave minmax problems from convex minimization problems is that the best known convergence rates for minmax problems have an explicit dependence on the size of the domain, rather than on the distance between initial point and the optimal solution. This means that the convergence speed does not have any improvement even if the algorithm starts from the optimal solution, and hence, is oblivious to the initialization. Here, we show that strictconvexitystrictconcavity is sufficient to get the convergence rate to depend on the initialization. We also show how different algorithms can asymptotically achieve initializationdependent convergence rates on this class of functions. Furthermore, we show that the socalled “parameterfree” algorithms allow to achieve improved initializationdependent asymptotic rates without any learning rate to tune. In addition, we utilize this particular parameterfree algorithm as a subroutine to design a new algorithm, which achieves a novel nonasymptotic fast rate for strictlyconvexstrictlyconcave minmax problems with a growth condition and Hölder continuous solution mapping. Experiments are conducted to verify our theoretical findings and demonstrate the effectiveness of the proposed algorithms.

SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with \emph{increasing momentum} and \emph{shrinking updates}. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.

SGD with Momentum (SGDM) is a widely used family of algorithms for largescale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the first time, we prove that for any constant momentum factor, there exists a Lipschitz and convex function for which the last iterate of SGDM suffers from a suboptimal convergence rate of $\Omega(\frac{\ln T}{\sqrt{T}})$ after $T$ iterations. Based on this fact, we study a class of (both adaptive and nonadaptive) FollowTheRegularizedLeaderbased SGDM algorithms with increasing momentum and shrinking updates. For these algorithms, we show that the last iterate has optimal convergence $O(\frac{1}{\sqrt{T}})$ for unconstrained convex stochastic optimization problems without projections onto bounded domains nor knowledge of $T$. Further, we show a variety of results for FTRLbased SGDM when used with adaptive stepsizes. Empirical results are shown as well.

ABSTRACT As a novel approach for tracing interstellar magnetic fields, the velocity gradient technique (VGT) has been proven to be effective for probing magnetic fields in the diffuse interstellar medium (ISM). In this work, we verify the VGT in a broader context by applying the technique to a molecular cloud interacting with the supernova remnant (SNR) W44. We probe the magnetic fields with the VGT using CO, $\rm HCO^+$ and H i emission lines and make a comparison with the Planck 353GHZ dust polarization. We show that the VGT gives an accurate measurement that coheres with the Planck polarization especially in intense molecular gas emission regions. We further study the foreground’s contribution on the polarization that results in misalignment between the VGT and the Planck measurements in lowintensity molecular gas areas. We advance the VGT to achieve magnetic field tomography by decomposing the SNR W44 into various velocity components. We show that W44’s velocity component at v ∼ 45 km s−1 exhibits the largest coverage and gives best agreement with Planck polarization in terms of magnetic field orientation.