Nonasymptotic analysis of quasi-Newton methods has gained traction recently. In particular, several works have established a nonasymptotic superlinear rate of O((1/√t)^t) for the classic BFGS method by exploiting the fact that the error of its Newton direction approximation approaches zero. Moreover, a greedy variant of BFGS was recently proposed that accelerates convergence by directly approximating the Hessian, rather than the Newton direction, and achieves a fast local quadratic convergence rate. However, the local quadratic convergence of greedy BFGS requires considerably more updates than BFGS needs to attain its local superlinear rate. This is because greedy BFGS approximates the Hessian directly, so its Newton direction approximation may be less accurate than that of BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds: it leverages the approximation ideas of both BFGS and greedy BFGS to approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method outperforms both BFGS and greedy BFGS in terms of convergence rate, while reaching its quadratic rate in fewer steps than greedy BFGS. Numerical experiments on various datasets also confirm our theoretical findings.
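For readers unfamiliar with the baseline being improved upon, a minimal sketch of the classic BFGS inverse-Hessian update (not the hybrid method proposed in this paper) on a toy quadratic illustrates the Newton-direction approximation the abstract refers to; all names and constants here are illustrative.

```python
import numpy as np

def bfgs_update(H, s, y):
    """One BFGS update of the inverse-Hessian approximation H,
    enforcing the secant equation H_new @ y == s."""
    rho = 1.0 / (y @ s)
    I = np.eye(len(s))
    V = I - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

# Minimize f(x) = 0.5 x^T A x: the true Newton direction is -A^{-1} @ grad,
# and the quasi-Newton direction -H @ grad approaches it as H is updated.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
x = np.array([1.0, 1.0])
H = np.eye(2)
for _ in range(20):
    g = A @ x
    s = -H @ g               # quasi-Newton step
    x_new = x + s
    y = A @ x_new - g        # change in gradient along the step
    if np.linalg.norm(s) < 1e-12:
        break
    H = bfgs_update(H, s, y)
    x = x_new

print(np.linalg.norm(x))     # near zero: converged to the minimizer
```

Greedy BFGS replaces the step-based pair (s, y) with greedily chosen directions that make H track A itself, which is the distinction the paper builds on.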
Practical Quasi-Newton Methods for Training Deep Neural Networks
We consider the development of practical stochastic quasi-Newton methods, in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient n is often of the order of tens of millions, and the Hessian has n^2 elements. Consequently, computing and storing a full n × n BFGS approximation, or even storing a modest number of (step, change-in-gradient) vector pairs for an L-BFGS implementation, is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
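The memory argument behind the Kronecker factorization can be made concrete with a small sketch. Under the standard simplifying assumption (as in KFAC) that a linear layer's curvature block over its weights, E[(a aᵀ) ⊗ (g gᵀ)] for layer inputs a and back-propagated output gradients g, is approximated by E[a aᵀ] ⊗ E[g gᵀ], only two small factor matrices ever need to be stored; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
m, p, n = 4, 3, 500               # layer: p inputs, m outputs, n samples
a = rng.standard_normal((n, p))   # layer inputs
g = rng.standard_normal((n, m))   # back-propagated output gradients

# Exact empirical block: average of kron(a_i a_i^T, g_i g_i^T), size (p*m) x (p*m)
exact = np.mean([np.kron(np.outer(ai, ai), np.outer(gi, gi))
                 for ai, gi in zip(a, g)], axis=0)

# Kronecker-factored approximation: Kronecker product of two averaged factors
A = a.T @ a / n                   # p x p input second-moment matrix
G = g.T @ g / n                   # m x m gradient second-moment matrix
approx = np.kron(A, G)            # (p*m) x (p*m), but only A and G are stored

# Storage: p^2 + m^2 numbers instead of (p*m)^2
print(A.size + G.size, exact.size)
```

For a real layer with thousands of inputs and outputs, the gap between p^2 + m^2 and (pm)^2 is what makes the block-diagonal quasi-Newton approach feasible at all.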
Award ID(s): 1838061
NSF-PAR ID: 10330079
Journal Name: Advances in Neural Information Processing Systems
Volume: 33
ISSN: 1049-5258
Page Range: 2386-2396
Sponsoring Org: National Science Foundation
More Like this


Gradient descent-based optimization methods underpin the parameter training of neural networks, and hence comprise a significant component in the impressive test results found in a number of applications. Introducing stochasticity is key to their success in practical problems, and there is some understanding of the role of stochastic gradient descent in this context. Momentum modifications of gradient descent, such as Polyak's Heavy Ball method (HB) and Nesterov's method of accelerated gradients (NAG), are also widely adopted. In this work our focus is on understanding the role of momentum in the training of neural networks, concentrating on the common situation in which the momentum contribution is fixed at each step of the algorithm. To expose the ideas simply, we work in the deterministic setting. Our approach is to derive continuous-time approximations of the discrete algorithms; these continuous-time approximations provide insight into the mechanisms at play within the discrete algorithms. We prove three such approximations. First, we show that standard implementations of fixed-momentum methods approximate a time-rescaled gradient descent flow, asymptotically as the learning rate shrinks to zero; this result does not distinguish momentum methods from pure gradient descent in the limit of vanishing learning rate. We then prove two results aimed at understanding the observed practical advantages of fixed-momentum methods over gradient descent when implemented in the nonasymptotic regime with a fixed small, but nonzero, learning rate. We achieve this by proving approximations to continuous-time limits in which the small but fixed learning rate appears as a parameter; this is known as the method of modified equations in the numerical analysis literature, recently rediscovered as the high-resolution ODE approximation in the machine learning context.
In our second result we show that the momentum method is approximated by a continuous-time gradient flow with an additional momentum-dependent second-order time-derivative correction, proportional to the learning rate; this may be used to explain the stabilizing effect of momentum algorithms in their transient phase. Furthermore, in a third result we show that momentum methods admit an exponentially attractive invariant manifold on which the dynamics reduce, approximately, to a gradient flow with respect to a modified loss function, equal to the original loss function plus a small perturbation proportional to the learning rate; this small correction provides convexification of the loss function and encodes the additional robustness present in momentum methods beyond the transient phase.
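The fixed-momentum iteration analyzed above can be sketched in a few lines. This is the standard Heavy Ball update on a toy quadratic loss (constants are illustrative); for small learning rate h and momentum beta, it tracks gradient flow with time rescaled by 1/(1 - beta), which is the content of the first approximation result.

```python
import numpy as np

def grad(x):
    """Gradient of the toy loss f(x) = 5*x[0]^2 + 0.5*x[1]^2."""
    return np.array([10.0, 1.0]) * x

h, beta = 0.01, 0.9          # fixed learning rate and fixed momentum
x = np.array([1.0, 1.0])
v = np.zeros(2)
for _ in range(2000):
    v = beta * v - h * grad(x)   # momentum accumulation
    x = x + v                    # Heavy Ball step

print(np.linalg.norm(x))         # near zero: converged to the minimizer
```

Note that the effective step h/(1 - beta) = 0.1 is ten times the raw learning rate, consistent with the time-rescaling interpretation.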

Many machine learning problems optimize an objective that must be measured with noise. The primary method is first-order stochastic gradient descent using one or more Monte Carlo (MC) samples at each step. There are settings where ill-conditioning makes second-order methods such as limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS) more effective. We study the use of randomized quasi-Monte Carlo (RQMC) sampling for such problems. When MC sampling has a root mean squared error (RMSE) of O(n^{-1/2}), RQMC has an RMSE of o(n^{-1/2}) that can be close to O(n^{-3/2}) in favorable settings. We prove that improved sampling accuracy translates directly into improved optimization. In our empirical investigations for variational Bayes, using RQMC with a stochastic quasi-Newton method greatly speeds up the optimization, and sometimes finds a better parameter value than MC does.
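The MC-versus-RQMC accuracy gap is easy to demonstrate on a smooth one-dimensional integrand. The sketch below uses scrambled Sobol' points (one standard RQMC construction, available in SciPy's `scipy.stats.qmc` module) to estimate the integral of u^2 over [0, 1], whose true value is 1/3; the sample size and seed are arbitrary.

```python
import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(0)
n = 1024                                   # power of two, as Sobol' sequences prefer
f = lambda u: u ** 2

# Plain Monte Carlo: RMSE ~ n^{-1/2}
mc_est = f(rng.random(n)).mean()

# Randomized QMC via scrambled Sobol' points: RMSE close to n^{-3/2} here
sobol = qmc.Sobol(d=1, scramble=True, seed=0)
rqmc_est = f(sobol.random(n).ravel()).mean()

print(abs(mc_est - 1/3), abs(rqmc_est - 1/3))
```

With n = 1024, the MC error is typically on the order of 1e-2 while the RQMC error is several orders of magnitude smaller, which is the effect the paper shows carries over to the optimization itself.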

Modern deep neural networks (DNNs) often require high memory consumption and impose large computational loads. In order to deploy DNN algorithms efficiently on edge or mobile devices, a series of DNN compression algorithms have been explored, including factorization methods. Factorization methods approximate the weight matrix of a DNN layer with the product of two or more low-rank matrices. However, it is hard to measure the ranks of DNN layers during the training process. Previous works mainly induce low rank through implicit approximations or via a costly singular value decomposition (SVD) on every training step. The former approach usually incurs a high accuracy loss, while the latter has low efficiency. In this work, we propose SVD training, the first method to explicitly achieve low-rank DNNs during training without applying SVD on every step. SVD training first decomposes each layer into the form of its full-rank SVD, then performs training directly on the decomposed weights. We add orthogonality regularization to the singular vectors, which ensures the valid form of the SVD and avoids gradient vanishing/exploding. Low rank is encouraged by applying sparsity-inducing regularizers on the singular values of each layer. Singular value pruning is applied at the end to explicitly reach a low-rank model. We empirically show that SVD training can significantly reduce the rank of DNN layers and achieve a higher reduction in computational load under the same accuracy, compared not only to previous factorization methods but also to state-of-the-art filter pruning methods.
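The factored parameterization and the two regularizers described above can be sketched as follows. A layer's weight W is kept as U diag(s) Vᵀ, an orthogonality penalty keeps U and V valid singular-vector matrices, and an L1 penalty on s encourages low rank; the penalty weight and dimensions are illustrative, not the paper's exact hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 6, 4
W = rng.standard_normal((m, n))      # stand-in for a layer's weight matrix

# Initialize the trainable factors from the full-rank SVD of W
U, s, Vt = np.linalg.svd(W, full_matrices=False)
V = Vt.T

def orth_penalty(M):
    """||M^T M - I||_F^2: zero iff M has orthonormal columns."""
    k = M.shape[1]
    return np.sum((M.T @ M - np.eye(k)) ** 2)

sparsity = np.sum(np.abs(s))         # L1 on singular values encourages low rank
reg = orth_penalty(U) + orth_penalty(V) + 0.01 * sparsity  # added to the task loss

# At initialization the factors are exactly orthonormal and reconstruct W
print(orth_penalty(U) + orth_penalty(V))
print(np.linalg.norm(U @ np.diag(s) @ V.T - W))
```

During training, gradients flow into U, s, and V directly; at the end, singular values below a threshold are pruned to obtain the explicit low-rank model.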

Beattie, C.A.; Benner, P.; Embree, M.; Gugercin, S.; Lefteriu, S. (Eds.) This paper introduces reduced-order model (ROM) based Hessian approximations for use in inexact Newton methods for the solution of optimization problems implicitly constrained by a large-scale system, typically a discretization of a partial differential equation (PDE). The direct application of an inexact Newton method to this problem requires the solution of many PDEs per optimization iteration. To reduce the computational complexity, a ROM Hessian approximation is proposed. Since only the Hessian is approximated, while the original objective function and its gradient are used, the resulting inexact Newton method maintains the first-order global convergence property, under suitable assumptions. Thus even computationally inexpensive lower-fidelity ROMs can be used, which is different from ROM approaches that replace the original optimization problem by a sequence of ROM optimization problems and typically need to accurately approximate function and gradient information of the original problem. In the proposed approach, the quality of the ROM Hessian approximation determines the rate of convergence, but not whether the method converges. The projection-based ROM is constructed from state and adjoint snapshots, and is relatively inexpensive to compute. Numerical examples on semilinear parabolic optimal control problems demonstrate that the proposed approach can lead to substantial savings in terms of the overall number of PDE solves required.
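The key point — that Hessian quality affects the rate but not whether the method converges, as long as the exact gradient is used — can be illustrated on a toy problem. The sketch below stands in for the PDE-constrained setting: an "exact" Hessian is replaced by a deliberately perturbed surrogate (playing the role of the ROM Hessian), and the Newton-like iteration still drives the exact gradient to zero; all matrices are illustrative.

```python
import numpy as np

A = np.diag([10.0, 5.0, 1.0])        # true Hessian of f(x) = 0.5 x^T A x
A_rom = np.diag([9.0, 6.0, 1.5])     # cheap, inexact "ROM" Hessian surrogate

x = np.array([1.0, 1.0, 1.0])
for _ in range(50):
    g = A @ x                            # exact gradient of the true objective
    x = x - np.linalg.solve(A_rom, g)    # step uses only the approximate Hessian

print(np.linalg.norm(x))             # converges despite the inexact Hessian
```

Here the iteration contracts each coordinate by |1 - a_rom^{-1} a|, so a better ROM Hessian gives faster linear convergence, exactly mirroring the rate-versus-convergence separation the abstract describes.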