Quasi-Newton methods still face significant challenges in training large-scale neural networks due to additional compute costs in the Hessian related computations and instability issues in stochastic training. A well-known method, L-BFGS that efficiently approximates the Hessian using history parameter and gradient changes, suffers convergence instability in stochastic training. So far, attempts that adapt L-BFGS to large-scale stochastic training incur considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into L-BFGS update and greatly reduces stochastic noise in the Hessian, therefore stabilizing convergence during stochastic optimization. For model training at a large scale, mL-BFGS approximates a block-wise Hessian, thus enabling distributing compute and memory costs across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS’s potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves both noticeable iteration-wise and wall-clock speedup.
more »
« less
Practical Quasi-Newton Methods for Training Deep Neural Networks
We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient n is often of the order of tens of millions and the Hessian has n^ 2 elements. Consequently, computing and storing a full n times n BFGS approximation or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
more »
« less
- Award ID(s):
- 1838061
- PAR ID:
- 10330079
- Date Published:
- Journal Name:
- Advances in neural information processing systems
- Volume:
- 33
- ISSN:
- 1049-5258
- Page Range / eLocation ID:
- 2386-2396
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Non-asymptotic analysis of quasi-Newton methods have gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of O((1/sqrt{t})^t) for the (classic) BFGS method by exploiting the fact that its error of Newton direction approximation approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates its convergence by directly approximating the Hessian, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of Greedy-BFGS requires way more updates compared to the number of iterations that BFGS requires for a local superlinear rate. This is due to the fact that in Greedy-BFGS the Hessian is directly approximated and the Newton direction approximation may not be as accurate as the one for BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds in that it leverages the approximation ideas of both BFGS and GreedyBFGS to properly approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method outperforms both BFGS and Greedy-BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps compared to Greedy-BFGS. Numerical experiments on various datasets also confirm our theoretical findings.more » « less
-
Abstract In this paper, we explore the non-asymptotic global convergence rates of the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method implemented with exact line search. Notably, due to Dixon’s equivalence result, our findings are also applicable to other quasi-Newton methods in the convex Broyden class employing exact line search, such as the Davidon-Fletcher-Powell (DFP) method. Specifically, we focus on problems where the objective function is strongly convex with Lipschitz continuous gradient and Hessian. Our results hold for any initial point and any symmetric positive definite initial Hessian approximation matrix. The analysis unveils a detailed three-phase convergence process, characterized by distinct linear and superlinear rates, contingent on the iteration progress. Additionally, our theoretical findings demonstrate the trade-offs between linear and superlinear convergence rates for BFGS when we modify the initial Hessian approximation matrix, a phenomenon further corroborated by our numerical experiments.more » « less
-
In this paper, we present the first explicit and non-asymptotic global convergence rates of the BFGS method when implemented with an inexact line search scheme satisfying the Armijo-Wolfe conditions. We show that BFGS achieves a global linear convergence rate of (1−1κ)t for μ-strongly convex functions with L-Lipschitz gradients, where κ=Lμ represents the condition number. Additionally, if the objective function's Hessian is Lipschitz, BFGS with the Armijo-Wolfe line search achieves a linear convergence rate that depends solely on the line search parameters, independent of the condition number. We also establish a global superlinear convergence rate of ((1t)t). These global bounds are all valid for any starting point x0 and any symmetric positive definite initial Hessian approximation matrix B0, though the choice of B0 impacts the number of iterations needed to achieve these rates. By synthesizing these results, we outline the first global complexity characterization of BFGS with the Armijo-Wolfe line search. Additionally, we clearly define a mechanism for selecting the step size to satisfy the Armijo-Wolfe conditions and characterize its overall complexity.more » « less
-
This paper presents an algorithm to optimize the parameters of power systems equivalents to enhance the accuracy of the DC power flow approximation in reduced networks. Based on a zonal division of the network, the algorithm produces a reduced power system equivalent that captures inter-zonal flows with aggregated buses and equivalent transmission lines. The algorithm refines coefficient and bias parameters for the DC power flow model of the reduced network, aiming to minimize discrepancies between inter-zonal flows in DC and AC power flow results. Using optimization methods like Broyden-Fletcher-Goldfarb-Shanno (BFGS), Limited-memory BFGS (L-BFGS), and Truncated Newton Conjugate-Gradient (TNC) in an offline training phase, these parameters boost the accuracy of online DC power flow computations. In contrast to existing network equivalencing methods, the proposed algorithm optimizes accuracy over a specified range of operation as opposed to only considering a single nominal point. Numerical tests demonstrate substantial accuracy improvements over traditional equivalencing and approximation methods.more » « less
An official website of the United States government

