We consider the development of practical stochastic quasi-Newton methods, in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient, n, is often of the order of tens of millions, and the Hessian has n^2 elements. Consequently, computing and storing a full n × n BFGS approximation, or storing a modest number of (step, change-in-gradient) vector pairs for use in an L-BFGS implementation, is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate each block, which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural-gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the BFGS and L-BFGS approximations bounded above as well as below. In tests on autoencoder feed-forward network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
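As a point of reference for why the Kronecker factorization matters computationally, the NumPy sketch below (with hypothetical factor names A and G and sizes chosen only for illustration; it is not code from the paper) checks the identity (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) for symmetric factors. Applying the inverse of a layer's block then reduces to two small solves instead of one enormous one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical layer sizes: 4 inputs, 3 outputs (real layers are far larger).
n_in, n_out = 4, 3

# Small symmetric positive definite Kronecker factors, standing in for the
# input-side and output-side factors a Kronecker-factored method maintains.
A = rng.standard_normal((n_in, n_in));   A = A @ A.T + n_in * np.eye(n_in)
G = rng.standard_normal((n_out, n_out)); G = G @ G.T + n_out * np.eye(n_out)

# Gradient of this layer's weight matrix (n_out x n_in).
V = rng.standard_normal((n_out, n_in))

# Direct route: build the full (n_in*n_out) x (n_in*n_out) block and solve.
# Infeasible at DNN scale; shown only to verify the identity numerically.
H_block = np.kron(A, G)
step_full = np.linalg.solve(H_block, V.flatten(order="F"))

# Kronecker route: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) for symmetric A, G.
step_kron = np.linalg.solve(G, V) @ np.linalg.inv(A)

assert np.allclose(step_full, step_kron.flatten(order="F"))
```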
mL-BFGS: A Momentum-based L-BFGS for Distributed Large-Scale Neural Network Optimization
Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute cost of Hessian-related computations and instability issues in stochastic training. A well-known method, L-BFGS, which efficiently approximates the Hessian using the history of parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update and greatly reduces stochastic noise in the Hessian, thereby stabilizing convergence during stochastic optimization. For model training at large scale, mL-BFGS approximates a block-wise Hessian, enabling compute and memory costs to be distributed across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS’s potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.
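A minimal sketch of the general idea, smoothing the stochastic gradients with a momentum average before they enter the standard L-BFGS two-loop recursion, is given below. The smoothing rule, hyperparameters, and function names are illustrative assumptions, not the mL-BFGS algorithm as published.

```python
import numpy as np
from collections import deque

def two_loop(grad, pairs):
    """Standard L-BFGS two-loop recursion: approximates H^{-1} @ grad."""
    q = grad.copy()
    alphas = []
    for s, y in reversed(pairs):
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q -= a * y
    if pairs:
        s, y = pairs[-1]
        q *= (s @ y) / (y @ y)                  # common initial scaling
    for (s, y), a in zip(pairs, reversed(alphas)):
        rho = 1.0 / (y @ s)
        q += (a - rho * (y @ q)) * s
    return q

def smoothed_lbfgs_step(x, grad, state, lr=0.1, beta=0.9, history=10):
    """Illustrative step: exponentially average the stochastic gradient, then
    form the (s, y) curvature pair from the smoothed gradients so that noise
    is damped before it reaches the Hessian approximation."""
    pairs = state.setdefault("pairs", deque(maxlen=history))
    g_prev = state.get("g_avg")
    g_avg = grad.copy() if g_prev is None else beta * g_prev + (1 - beta) * grad
    if "x_prev" in state:
        s = x - state["x_prev"]
        y = g_avg - g_prev
        if y @ s > 1e-10:                       # keep rho = 1/(y^T s) well defined
            pairs.append((s, y))
    state["x_prev"], state["g_avg"] = x.copy(), g_avg
    return x - lr * two_loop(grad, list(pairs))
```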
- PAR ID:
- 10483609
- Publisher / Repository:
- Transactions on Machine Learning Research
- Date Published:
- Journal Name:
- Transactions on Machine Learning Research
- ISSN:
- 2835-8856
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Newton's method is usually preferred when solving optimization problems due to its superior convergence properties compared to gradient-based or derivative-free optimization algorithms. However, deriving and computing the second-order derivatives needed by Newton's method is often not trivial and, in some cases, not possible. In such cases, quasi-Newton algorithms are a great alternative. In this paper, we provide a new derivation of well-known quasi-Newton formulas in an infinite-dimensional Hilbert space setting. It is known that quasi-Newton update formulas are solutions to certain variational problems over the space of symmetric matrices. In this paper, we formulate similar variational problems over the space of bounded symmetric operators in Hilbert spaces. By changing the constraints of the variational problem, we obtain updates (for the Hessian and Hessian inverse) not only for the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method but also for Davidon-Fletcher-Powell (DFP), Symmetric Rank One (SR1), and Powell-Symmetric-Broyden (PSB). In addition, for an inverse problem governed by a partial differential equation (PDE), we derive DFP and BFGS "structured" secant formulas that explicitly use the derivative of the regularization and only approximate the second derivative of the misfit term. We show numerical results that demonstrate the desired mesh-independence property and superior performance of the resulting quasi-Newton methods.
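For orientation, the finite-dimensional matrix forms of three of the inverse-Hessian updates mentioned above are sketched below. These are standard textbook formulas, not the operator-valued derivations of the paper; each satisfies the secant equation H+ y = s, with s the step and y the change in gradient.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    """BFGS update of the inverse-Hessian approximation H."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def dfp_inverse_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

def sr1_inverse_update(H, s, y):
    """Symmetric Rank One update of the inverse-Hessian approximation H."""
    r = s - H @ y
    return H + np.outer(r, r) / (r @ y)
```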
-
In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.
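To make "exact line search" concrete: on a strongly convex quadratic, the exact step size along a direction p has a closed form, which the sketch below plugs into a standard BFGS loop. This is a toy illustration under those assumptions, not the analysis or code of the paper.

```python
import numpy as np

def bfgs_exact_linesearch_quadratic(Q, b, x0, tol=1e-10, max_iter=100):
    """BFGS with exact line search on f(x) = 0.5 x'Qx - b'x, Q symmetric
    positive definite.  For quadratics the exact minimizer along p is
    alpha = -(g'p)/(p'Qp); for general f it would be argmin_a f(x + a p)."""
    x = x0.astype(float)
    H = np.eye(len(x0))                      # any SPD initial H_0 is allowed
    g = Q @ x - b
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        p = -H @ g                           # quasi-Newton search direction
        alpha = -(g @ p) / (p @ Q @ p)       # exact line search (closed form)
        s = alpha * p
        x_new = x + s
        g_new = Q @ x_new - b
        y = g_new - g
        rho = 1.0 / (y @ s)
        V = np.eye(len(x)) - rho * np.outer(s, y)
        H = V @ H @ V.T + rho * np.outer(s, s)
        x, g = x_new, g_new
    return x

# The minimizer solves Qx = b, so the result should match a direct solve.
rng = np.random.default_rng(1)
M = rng.standard_normal((5, 5)); Q = M @ M.T + 5 * np.eye(5)
b = rng.standard_normal(5)
assert np.allclose(bfgs_exact_linesearch_quadratic(Q, b, np.zeros(5)),
                   np.linalg.solve(Q, b))
```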
-
Non-asymptotic analysis of quasi-Newton methods has gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of O((1/sqrt{t})^t) for the (classic) BFGS method by exploiting the fact that its error in approximating the Newton direction approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates convergence by directly approximating the Hessian, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of Greedy-BFGS requires considerably more updates than the number of iterations BFGS needs for a local superlinear rate. This is because in Greedy-BFGS the Hessian is approximated directly, so the resulting Newton-direction approximation may not be as accurate as the one in BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds: it leverages the approximation ideas of both BFGS and Greedy-BFGS to properly approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method outperforms both BFGS and Greedy-BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps than Greedy-BFGS. Numerical experiments on various datasets also confirm our theoretical findings.
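The contrast between the two update rules can be illustrated with the BFGS Hessian update written as a function of an update direction u: classic BFGS takes u to be the step s_k (so A u is the observed gradient difference y_k), while greedy variants pick u among the coordinate directions where the current approximation most overestimates the curvature. The sketch below is an illustrative rendering of that idea, not the combined method proposed in the paper.

```python
import numpy as np

def bfgs_hessian_update(B, A, u):
    """BFGS update of the Hessian approximation B along direction u, with the
    true Hessian A supplying the curvature A @ u (for classic BFGS this
    product is simply the gradient difference y along the step)."""
    Bu, Au = B @ u, A @ u
    return B - np.outer(Bu, Bu) / (u @ Bu) + np.outer(Au, Au) / (u @ Au)

def greedy_direction(B, A):
    """Greedy choice among coordinate vectors: pick the basis direction where
    B overestimates the curvature of A the most (an assumption-laden sketch
    of the greedy selection rule, not the paper's algorithm)."""
    ratios = np.diag(B) / np.diag(A)
    e = np.zeros(B.shape[0])
    e[np.argmax(ratios)] = 1.0
    return e
```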
-
In this paper, we present the first explicit and non-asymptotic global convergence rates of the BFGS method when implemented with an inexact line search scheme satisfying the Armijo-Wolfe conditions. We show that BFGS achieves a global linear convergence rate of (1 − 1/κ)^t for μ-strongly convex functions with L-Lipschitz gradients, where κ = L/μ is the condition number. Additionally, if the objective function's Hessian is Lipschitz, BFGS with the Armijo-Wolfe line search achieves a linear convergence rate that depends solely on the line search parameters, independent of the condition number. We also establish a global superlinear convergence rate of O((1/t)^t). These global bounds are all valid for any starting point x_0 and any symmetric positive definite initial Hessian approximation matrix B_0, though the choice of B_0 impacts the number of iterations needed to achieve these rates. By synthesizing these results, we outline the first global complexity characterization of BFGS with the Armijo-Wolfe line search. Additionally, we clearly define a mechanism for selecting the step size to satisfy the Armijo-Wolfe conditions and characterize its overall complexity.
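For reference, the (weak) Armijo-Wolfe conditions require f(x + αp) ≤ f(x) + c1·α·∇f(x)ᵀp (sufficient decrease) and ∇f(x + αp)ᵀp ≥ c2·∇f(x)ᵀp (curvature), with 0 < c1 < c2 < 1. A common bracketing/bisection scheme for finding such a step is sketched below as a generic textbook routine, not the step-size mechanism defined in the paper.

```python
import numpy as np

def armijo_wolfe_search(f, grad, x, p, c1=1e-4, c2=0.9, max_iter=50):
    """Bracketing/bisection search for a step size satisfying the weak
    Armijo-Wolfe conditions.  Assumes p is a descent direction: grad(x) @ p < 0."""
    fx, gp = f(x), grad(x) @ p
    lo, hi, alpha = 0.0, np.inf, 1.0
    for _ in range(max_iter):
        x_new = x + alpha * p
        if f(x_new) > fx + c1 * alpha * gp:      # Armijo fails: step too long
            hi = alpha
        elif grad(x_new) @ p < c2 * gp:          # curvature fails: step too short
            lo = alpha
        else:
            return alpha                         # both conditions hold
        alpha = 2.0 * lo if hi == np.inf else 0.5 * (lo + hi)
    return alpha
```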