This content will become publicly available on July 16, 2024.

NSF-PAR ID: 10483609
Publisher / Repository: Transactions on Machine Learning Research
Journal Name: Transactions on Machine Learning Research
ISSN: 2835-8856
Sponsoring Org: National Science Foundation
More Like This

We consider the development of practical stochastic quasi-Newton methods, in particular Kronecker-factored block-diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). In DNN training, the number of variables and components of the gradient n is often of the order of tens of millions, and the Hessian has n^2 elements. Consequently, computing and storing a full n × n BFGS approximation, or storing a modest number of (step, change in gradient) vector pairs for use in an L-BFGS implementation, is out of the question. In our proposed methods, we approximate the Hessian by a block-diagonal matrix and use the structure of the gradient and Hessian to further approximate these blocks, each of which corresponds to a layer, as the Kronecker product of two much smaller matrices. This is analogous to the approach in KFAC, which computes a Kronecker-factored block-diagonal approximation to the Fisher matrix in a stochastic natural-gradient method. Because of the indefinite and highly variable nature of the Hessian in a DNN, we also propose a new damping approach to keep the upper as well as the lower bounds of the BFGS and L-BFGS approximations bounded. In tests on autoencoder feed-forward network models with either nine or thirteen layers applied to three datasets, our methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.
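The storage saving behind the Kronecker factorization can be sketched numerically: a layer with p inputs and m outputs has an (mp) × (mp) Hessian block, while the two Kronecker factors need only m^2 + p^2 entries, and products with the block reduce to small matrix multiplies. The layer sizes and factor values below are hypothetical, chosen only to illustrate the counting, not taken from the paper.

```python
import numpy as np

# Hypothetical layer sizes: p inputs, m outputs, so n = m*p weights
# and an (m*p) x (m*p) Hessian block for this layer.
m, p = 8, 16
A = np.eye(m) + 0.1 * np.ones((m, m))  # m x m Kronecker factor (placeholder values)
B = np.eye(p) + 0.1 * np.ones((p, p))  # p x p Kronecker factor (placeholder values)

H_block = np.kron(A, B)                # full block; never formed in practice
print((m * p) ** 2, m * m + p * p)     # 16384 entries vs. 320

# Products with A ⊗ B avoid the full block via the standard identity
# (A ⊗ B) vec(X) = vec(B X A^T), with column-major (Fortran-order) vec.
X = np.random.default_rng(1).standard_normal((p, m))
v = X.reshape(-1, order="F")
fast = (B @ X @ A.T).reshape(-1, order="F")
print(np.allclose(H_block @ v, fast))  # True
```

The identity in the last step is what makes the factored representation usable inside an optimizer: multiplying by the block costs O(mp(m + p)) instead of O((mp)^2).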

Newton's method is usually preferred when solving optimization problems due to its superior convergence properties compared to gradient-based or derivative-free optimization algorithms. However, deriving and computing the second-order derivatives needed by Newton's method is often not trivial and, in some cases, not possible. In such cases quasi-Newton algorithms are a great alternative. In this paper, we provide a new derivation of well-known quasi-Newton formulas in an infinite-dimensional Hilbert space setting. It is known that quasi-Newton update formulas are solutions to certain variational problems over the space of symmetric matrices. In this paper, we formulate similar variational problems over the space of bounded symmetric operators in Hilbert spaces. By changing the constraints of the variational problem we obtain updates (for the Hessian and Hessian inverse) not only for the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method but also for Davidon-Fletcher-Powell (DFP), Symmetric Rank One (SR1), and Powell-Symmetric-Broyden (PSB). In addition, for an inverse problem governed by a partial differential equation (PDE), we derive DFP and BFGS "structured" secant formulas that explicitly use the derivative of the regularization and only approximate the second derivative of the misfit term. We show numerical results that demonstrate the desired mesh-independence property and the superior performance of the resulting quasi-Newton methods.
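For reference, the finite-dimensional BFGS and DFP inverse-Hessian updates that the paper generalizes to Hilbert spaces can be sketched as follows. Both are rank-two updates built from the step s and gradient change y, and both satisfy the secant condition H⁺y = s. The quadratic test problem is illustrative only.

```python
import numpy as np

def bfgs_inverse_update(H, s, y):
    # BFGS update of the inverse Hessian approximation:
    # H+ = (I - rho s y^T) H (I - rho y s^T) + rho s s^T,  rho = 1 / (y^T s)
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

def dfp_inverse_update(H, s, y):
    # DFP update: H+ = H - (H y)(H y)^T / (y^T H y) + s s^T / (y^T s)
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (y @ s)

# Toy quadratic f(x) = 0.5 x^T A x with SPD Hessian A, so y = A s exactly.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
A = A @ A.T + 4 * np.eye(4)
s = rng.standard_normal(4)
y = A @ s
H = np.eye(4)
print(np.allclose(bfgs_inverse_update(H, s, y) @ y, s))  # True: secant condition
print(np.allclose(dfp_inverse_update(H, s, y) @ y, s))   # True: secant condition
```

Each update is the closest matrix to H, in a weighted norm, that satisfies H⁺y = s; which norm is minimized is exactly the constraint choice the paper varies to recover BFGS, DFP, SR1, and PSB.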

State-of-the-art seismic imaging techniques treat inversion tasks such as full-waveform inversion (FWI) and least-squares reverse time migration (LSRTM) as partial-differential-equation-constrained optimization problems. Due to their large-scale nature, gradient-based optimization algorithms are preferred in practice to update the model iteratively. Higher-order methods converge in fewer iterations but often require higher computational costs, more line-search steps, and bigger memory storage, so a balance among these aspects has to be struck. We have conducted an evaluation using Anderson acceleration (AA), a popular strategy to speed up the convergence of fixed-point iterations, to accelerate the steepest-descent algorithm, which we innovatively treat as a fixed-point iteration. Independent of the unknown parameter's dimensionality, the computational cost of implementing the method can be reduced to an extremely low-dimensional least-squares problem, and the cost can be further reduced by a low-rank update. We determine the theoretical connections and the differences between AA and other well-known optimization methods such as L-BFGS and the restarted generalized minimal residual method, and compare their computational cost and memory requirements. Numerical examples of FWI and LSRTM applied to the Marmousi benchmark demonstrate the acceleration effects of AA. Compared with the steepest-descent method, AA can achieve faster convergence and can provide competitive results with some quasi-Newton methods, making it an attractive optimization strategy for seismic inversion.
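The idea of treating steepest descent as a fixed-point iteration and accelerating it can be sketched as follows. This is a generic Anderson-acceleration loop applied to a toy quadratic, not the paper's implementation (which adds a further low-rank cost reduction); the window size, step length, and test problem are arbitrary assumptions. Note that the least-squares problem for the mixing weights has only m columns regardless of the dimension of x, which is the dimension-independence the abstract refers to.

```python
import numpy as np

def anderson(g, x0, m=5, iters=30):
    """Anderson acceleration of the fixed-point iteration x <- g(x)."""
    xs = [np.asarray(x0, dtype=float)]
    gs = [np.asarray(g(xs[0]), dtype=float)]
    fs = [gs[0] - xs[0]]                      # residuals f = g(x) - x
    for _ in range(iters):
        mk = min(m, len(fs) - 1)
        if mk == 0:
            x_new = gs[-1]                    # plain fixed-point step
        else:
            # Tiny least-squares in the mk most recent residual differences;
            # its size is independent of the dimension of x.
            rec_f, rec_g = fs[-(mk + 1):], gs[-(mk + 1):]
            dF = np.column_stack([rec_f[i + 1] - rec_f[i] for i in range(mk)])
            dG = np.column_stack([rec_g[i + 1] - rec_g[i] for i in range(mk)])
            theta, *_ = np.linalg.lstsq(dF, fs[-1], rcond=None)
            x_new = gs[-1] - dG @ theta       # mixed step from the history
        xs.append(x_new)
        gs.append(np.asarray(g(x_new), dtype=float))
        fs.append(gs[-1] - xs[-1])
    return xs[-1]

# Steepest descent on f(x) = 0.5 x^T A x - b^T x, viewed as the fixed point
# of g(x) = x - alpha * (A x - b), mirroring the paper's reformulation.
A = np.diag([1.0, 10.0, 100.0])
b = np.ones(3)
g = lambda x: x - (1.0 / 100.0) * (A @ x - b)
x = anderson(g, np.zeros(3), m=3, iters=30)
print(np.linalg.norm(A @ x - b))  # far smaller than plain steepest descent achieves
```

On linear problems like this one, AA with a long enough window behaves like GMRES, which is one of the connections the paper analyzes.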

Non-asymptotic analysis of quasi-Newton methods has gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of O((1/√t)^t) for the (classic) BFGS method by exploiting the fact that its error in approximating the Newton direction approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates its convergence by directly approximating the Hessian, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of greedy BFGS requires many more updates than the number of iterations BFGS requires for a local superlinear rate. This is due to the fact that in greedy BFGS the Hessian is approximated directly, and the resulting Newton-direction approximation may not be as accurate as the one for BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds, in that it leverages the approximation ideas of both BFGS and greedy BFGS to properly approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method outperforms both BFGS and greedy BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps than greedy BFGS. Numerical experiments on various datasets also confirm our theoretical findings.
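The contrast drawn above can be made concrete with the direct (Hessian-approximation) BFGS update: classic BFGS applies it along the step s, while the greedy variant instead picks the coordinate direction with the largest ratio e_iᵀBe_i / e_iᵀAe_i, driving B toward the true Hessian A itself. The diagonal test Hessian below is a toy assumption, not from the paper.

```python
import numpy as np

def bfgs_hessian_update(B, u, y):
    # Direct BFGS update of a Hessian approximation B along direction u,
    # where y = A u is the true Hessian's action on u:
    # B+ = B - (B u)(B u)^T / (u^T B u) + y y^T / (y^T u)
    Bu = B @ u
    return B - np.outer(Bu, Bu) / (u @ Bu) + np.outer(y, y) / (y @ u)

A = np.diag([1.0, 4.0, 9.0])   # toy quadratic's Hessian
B = 10.0 * np.eye(3)           # initial overestimate of A
for _ in range(10):
    # Greedy choice: coordinate direction where B most overestimates A.
    i = int(np.argmax(np.diag(B) / np.diag(A)))
    u = np.eye(3)[i]
    B = bfgs_hessian_update(B, u, A @ u)   # uses exact Hessian action A u
print(np.linalg.norm(B - A))   # -> 0 once every coordinate has been updated
```

Because each greedy step improves B rather than the current Newton direction, early search directions can be poorer than classic BFGS's, which is the trade-off the paper's hybrid method is designed to remove.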

In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton algorithms, which includes the Davidon-Fletcher-Powell (DFP) method and the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied in the literature, but their explicit finite-time local convergence rate has not been fully investigated. In this paper, we provide a finite-time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of $$(1/k)^{k/2}$$, where $$k$$ is the number of iterations. We also prove that a similar local superlinear convergence result holds in the case that the objective function is self-concordant. Numerical experiments on several datasets confirm our explicit convergence rate bounds. Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for quasi-Newton methods.
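To see why a rate of $$(1/k)^{k/2}$$ is superlinear, compare it numerically with a fixed linear rate such as $$0.5^k$$: the per-step contraction factor $$(1/k)^{1/2}$$ itself shrinks as $$k$$ grows, so the bound eventually falls below any geometric rate. The comparison below is illustrative only.

```python
# Superlinear bound (1/k)^(k/2) versus a fixed linear rate 0.5^k.
for k in [2, 5, 10, 20]:
    print(k, (1 / k) ** (k / 2), 0.5 ** k)
# By k = 10 the superlinear bound is (1/10)^5 = 1e-5,
# already far below the linear bound 0.5^10 ~ 1e-3.
```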