NSF PAR Search | NSF Public Access Repository


Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Umut Simsekli, Lingjiong Zhu (January 2021, Proceedings of Machine Learning Research)
Full Text Available
Fractional moment-preserving initialization schemes for training deep neural networks

Gurbuzbalaban, M; Hu, Yuanhan (January 2021, Proceedings of Machine Learning Research)

Full Text Available
Why random reshuffling beats stochastic gradient descent

https://doi.org/10.1007/s10107-019-01440-w

Gürbüzbalaban, M.; Ozdaglar, A.; Parrilo, P. A. (October 2020, Mathematical Programming)

Full Text Available
A Universally Optimal Multistage Accelerated Stochastic Gradient Method

Aybat, N; Fallah, A; Gurbuzbalaban, M; Ozdaglar, A. (May 2020, Advances in neural information processing systems)

Full Text Available
ASYNC: A Cloud Engine with Asynchrony and History for Distributed Machine Learning

https://doi.org/10.1109/IPDPS47924.2020.00052

Soori, Saeed; Can, Bugra; Gurbuzbalaban, Mert; Dehnavi, Maryam Mehri (May 2020, 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS))
ASYNC is a framework that supports the implementation of asynchrony and history for optimization methods on distributed computing platforms. The popularity of asynchronous optimization methods has increased in distributed machine learning. However, their applicability and practical experimentation on distributed systems are limited because current bulk-processing cloud engines do not provide robust support for asynchrony and history. By introducing three main modules and bookkeeping system-specific and application parameters, ASYNC provides practitioners with a framework to implement asynchronous machine learning methods. To demonstrate ease of implementation in ASYNC, the synchronous and asynchronous variants of two well-known optimization methods, stochastic gradient descent and SAGA, are implemented in ASYNC.
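The role of history in a method such as SAGA can be pictured with a small example. The following is a minimal single-process sketch, assuming a scalar decision variable and a toy least-squares objective; the names `saga` and `grad_i` are hypothetical, and nothing here reflects ASYNC's actual API or its distributed, asynchronous execution.

```python
import random

def saga(grad_i, n, x0, step, iters, seed=0):
    """Minimize (1/n) * sum_i f_i(x) over a scalar x with the SAGA update."""
    rng = random.Random(seed)
    x = x0
    # History: the most recent gradient evaluated for each component f_i.
    table = [grad_i(i, x) for i in range(n)]
    avg = sum(table) / n
    for _ in range(iters):
        j = rng.randrange(n)
        g_new = grad_i(j, x)
        # Fresh gradient, minus the stored one, plus the average of the history.
        x -= step * (g_new - table[j] + avg)
        # Refresh the running average and the stored gradient for component j.
        avg += (g_new - table[j]) / n
        table[j] = g_new
    return x

# Toy problem (an assumption for illustration): f_i(x) = 0.5 * (x - a_i)^2,
# whose minimizer is the mean of the a_i, here 3.0.
data = [1.0, 2.0, 3.0, 6.0]
x_star = saga(lambda i, x: x - data[i], len(data), x0=0.0, step=1.0 / 3.0, iters=2000)
```

Plain stochastic gradient descent would use `g_new` alone as the direction; the stored table is what distinguishes SAGA and is the kind of state an engine has to keep between iterations.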
Full Text Available
IDEAL: Inexact DEcentralized Accelerated Augmented Lagrangian Method

Yossi Arjevani, Joan Bruna (January 2020, Proceedings of Machine Learning Research)
We introduce a framework for designing primal methods under the decentralized optimization setting where local functions are smooth and strongly convex. Our approach consists of approximately solving a sequence of sub-problems induced by the accelerated augmented Lagrangian method, thereby providing a systematic way for deriving several well-known decentralized algorithms including EXTRA and SSDA. When coupled with accelerated gradient descent, our framework yields a novel primal algorithm whose convergence rate is optimal and matched by recently derived lower bounds. We provide experimental results that demonstrate the effectiveness of the proposed algorithm on highly ill-conditioned problems.
Full Text Available
Breaking Reversibility Accelerates Langevin Dynamics for Non-Convex Optimization

Xuefeng Gao, Mert Gurbuzbalaban (January 2020, Proceedings of Machine Learning Research)
Langevin dynamics (LD) has been proven to be a powerful technique for optimizing a non-convex objective as an efficient algorithm to find local minima while eventually visiting a global minimum on longer time-scales. LD is based on the first-order Langevin diffusion which is reversible in time. We study two variants that are based on non-reversible Langevin diffusions: the underdamped Langevin dynamics (ULD) and the Langevin dynamics with a non-symmetric drift (NLD). Adapting the techniques of Tzen et al. (2018) for LD to non-reversible diffusions, we show that for a given local minimum that is within an arbitrary distance from the initialization, with high probability, either the ULD trajectory ends up somewhere outside a small neighborhood of this local minimum within a recurrence time which depends on the smallest eigenvalue of the Hessian at the local minimum, or it enters this neighborhood by the recurrence time and stays there for a potentially exponentially long escape time. The ULD algorithm improves upon the recurrence time obtained for LD in Tzen et al. (2018) with respect to the dependency on the smallest eigenvalue of the Hessian at the local minimum. Similar results and improvements are obtained for the NLD algorithm. We also show that non-reversible variants can exit the basin of attraction of a local minimum faster in discrete time when the objective has two local minima separated by a saddle point and quantify the amount of improvement. Our analysis suggests that non-reversible Langevin algorithms are more efficient at locating a local minimum as well as at exploring the state space.
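For readers unfamiliar with the two diffusions, the difference between LD and ULD can be sketched in one dimension. This is a minimal illustration, assuming a double-well objective f(x) = (x^2 - 1)^2 / 4, an Euler-type discretization, and hypothetical names and parameter values (`beta` for inverse temperature, `friction` for the damping coefficient); it is not the discretization or the setting analyzed in the paper.

```python
import math
import random

def grad_f(x):
    # Gradient of the assumed double-well objective f(x) = (x^2 - 1)^2 / 4,
    # which has local minima at x = -1 and x = +1 and a saddle at x = 0.
    return x * (x * x - 1.0)

def overdamped_step(x, step, beta, rng):
    # First-order (reversible) Langevin: gradient step plus Gaussian noise.
    noise = math.sqrt(2.0 * step / beta) * rng.gauss(0.0, 1.0)
    return x - step * grad_f(x) + noise

def underdamped_step(x, v, step, beta, friction, rng):
    # Underdamped Langevin: noise and friction act on a velocity variable,
    # and the position then moves along the updated velocity.
    noise = math.sqrt(2.0 * friction * step / beta) * rng.gauss(0.0, 1.0)
    v = v - step * (friction * v + grad_f(x)) + noise
    x = x + step * v
    return x, v

def run(n_steps=5000, step=0.01, beta=1e4, friction=1.0, seed=0):
    rng = random.Random(seed)
    x_ld = 0.5
    x_uld, v_uld = 0.5, 0.0
    for _ in range(n_steps):
        x_ld = overdamped_step(x_ld, step, beta, rng)
        x_uld, v_uld = underdamped_step(x_uld, v_uld, step, beta, friction, rng)
    return x_ld, x_uld

x_ld, x_uld = run()
```

At the low temperature assumed here (large `beta`), both iterates settle near the local minimum at x = 1; lowering `beta` makes escapes over the saddle at x = 0 more frequent, which is the regime the recurrence and escape times above describe.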
Full Text Available
Fractional Underdamped Langevin Dynamics: Retargeting SGD with Momentum under Heavy-Tailed Gradient Noise

Umut Simsekli, Lingjiong Zhu (January 2020, Proceedings of Machine Learning Research)
Stochastic gradient descent with momentum (SGDm) is one of the most popular optimization algorithms in deep learning. While there is a rich theory of SGDm for convex problems, the theory is considerably less developed in the context of deep learning where the problem is non-convex and the gradient noise might exhibit a heavy-tailed behavior, as empirically observed in recent studies. In this study, we consider a continuous-time variant of SGDm, known as the underdamped Langevin dynamics (ULD), and investigate its asymptotic properties under heavy-tailed perturbations. Supported by recent studies from statistical physics, we argue both theoretically and empirically that the heavy tails of such perturbations can result in a bias even when the step-size is small, in the sense that the optima of the stationary distribution of the dynamics might not match the optima of the cost function to be optimized. As a remedy, we develop a novel framework, which we call fractional ULD (FULD), and prove that FULD targets the so-called Gibbs distribution, whose optima exactly match the optima of the original cost. We observe that the Euler discretization of FULD has noteworthy algorithmic similarities with natural gradient methods and gradient clipping, bringing a new perspective on understanding their role in deep learning. We support our theory with experiments conducted on a synthetic model and neural networks.
Full Text Available
Robust Accelerated Gradient Methods for Smooth Strongly Convex Functions

https://doi.org/10.1137/19M1244925

Aybat, Necdet Serhat; Fallah, Alireza; Gürbüzbalaban, Mert; Ozdaglar, Asuman (January 2020, SIAM Journal on Optimization)

Full Text Available
DAve-QN: A Distributed Averaged Quasi-Newton Method with Local Superlinear Convergence Rate

Saeed Soori, Konstantin Mishchenko (January 2020, Proceedings of Machine Learning Research)
In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficient in the sense that at every iteration the master node and workers communicate vectors of size 𝑂(𝑝), where 𝑝 is the dimension of the decision variable. The proposed method is based on a distributed asynchronous averaging scheme of decision vectors and gradients in a way that effectively captures the local Hessian information of the objective function. Our convergence theory supports asynchronous computations subject to both bounded delays and unbounded delays with a bounded time-average. Unlike the majority of the asynchronous optimization literature, we do not require choosing a smaller stepsize when delays are large. We provide numerical experiments that match our theoretical results and showcase significant improvement compared to state-of-the-art distributed algorithms.
Full Text Available
