In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton algorithms, which includes the Davidon–Fletcher–Powell (DFP) method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied in the literature, but their explicit finite-time local convergence rate has not been fully investigated. In this paper, we provide a finite-time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$, where $k$ is the number of iterations.
Sharpened Quasi-Newton Methods: Faster Superlinear Rate and Larger Local Convergence Neighborhood
Non-asymptotic analysis of quasi-Newton methods has gained traction recently. In particular, several works have established a non-asymptotic superlinear rate of $\mathcal{O}((1/\sqrt{t})^t)$ for the (classic) BFGS method by exploiting the fact that its error in approximating the Newton direction approaches zero. Moreover, a greedy variant of BFGS was recently proposed which accelerates convergence by directly approximating the Hessian, instead of the Newton direction, and achieves a fast local quadratic convergence rate. Alas, the local quadratic convergence of Greedy-BFGS requires considerably more updates than the number of iterations BFGS needs for a local superlinear rate. This is because Greedy-BFGS approximates the Hessian directly, so its Newton-direction approximation may not be as accurate as that of BFGS. In this paper, we close this gap and present a novel BFGS method that has the best of both worlds: it leverages the approximation ideas of both BFGS and Greedy-BFGS to properly approximate the Newton direction and the Hessian matrix simultaneously. Our theoretical results show that our method outperforms both BFGS and Greedy-BFGS in terms of convergence rate, while it reaches its quadratic convergence rate with fewer steps than Greedy-BFGS. Numerical experiments on various datasets also confirm our theoretical findings.
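To make the classical BFGS update discussed in this abstract concrete, here is a minimal, hypothetical sketch (not the paper's sharpened method): the standard BFGS Hessian-approximation update applied to a toy strongly convex quadratic with exact line searches, under which BFGS is known to terminate in at most $n$ steps.

```python
import numpy as np

def bfgs_update(B, s, y):
    """One BFGS update of the Hessian approximation B, using the
    step s = x_{k+1} - x_k and gradient change y = g_{k+1} - g_k."""
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / (y @ s)

# Toy strongly convex quadratic f(x) = 0.5 x^T A x (minimizer x* = 0).
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5.0 * np.eye(5)   # symmetric positive definite Hessian
x = rng.standard_normal(5)
B = np.eye(5)                    # initial Hessian approximation

for _ in range(20):
    g = A @ x
    if np.linalg.norm(g) < 1e-9:
        break                            # converged; avoid dividing by tiny s, y
    d = -np.linalg.solve(B, g)           # quasi-Newton direction
    alpha = -(g @ d) / (d @ (A @ d))     # exact line search on the quadratic
    s = alpha * d
    x_next = x + s
    y = A @ x_next - g
    B = bfgs_update(B, s, y)
    x = x_next

# With exact line searches, BFGS terminates on an n-dimensional
# quadratic in at most n steps (here n = 5).
assert np.linalg.norm(A @ x) < 1e-6
```

The curvature condition $s^\top y > 0$ holds automatically here because the objective is strongly convex, which keeps $B$ positive definite throughout.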
Award ID(s): 2007668
NSF-PAR ID: 10426579
Date Published:
Journal Name: International Conference on Machine Learning
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this

Abstract: The iterates converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$, where $k$ is the number of iterations. We also prove that a similar local superlinear convergence result holds in the case that the objective function is self-concordant. Numerical experiments on several datasets confirm our explicit convergence rate bounds. Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for quasi-Newton methods.
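Alongside BFGS, the Broyden class analyzed here includes DFP. As an illustrative sketch (not taken from the paper), the DFP inverse-Hessian update can be written in a few lines, and its defining secant condition $H_{k+1} y_k = s_k$ checked directly; the variable names here are hypothetical.

```python
import numpy as np

def dfp_inverse_update(H, s, y):
    """DFP update of the inverse-Hessian approximation H, from the
    step s and gradient change y. The updated matrix satisfies the
    secant condition H_new @ y = s and stays symmetric."""
    Hy = H @ y
    return H - np.outer(Hy, Hy) / (y @ Hy) + np.outer(s, s) / (s @ y)

rng = np.random.default_rng(1)
H = np.eye(4)                           # initial inverse-Hessian approximation
s = rng.standard_normal(4)
y = s + 0.1 * rng.standard_normal(4)    # ensures s @ y > 0 (curvature condition)
H_new = dfp_inverse_update(H, s, y)

assert np.allclose(H_new @ y, s)        # secant condition holds
assert np.allclose(H_new, H_new.T)      # symmetry is preserved
```

The curvature condition $s^\top y > 0$ is what guarantees the update keeps $H$ positive definite, matching the strong-convexity assumption in the analysis above.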
Quasi-Newton methods still face significant challenges in training large-scale neural networks due to the additional compute cost of Hessian-related computations and instability issues in stochastic training. A well-known method, L-BFGS, which efficiently approximates the Hessian using a history of parameter and gradient changes, suffers from convergence instability in stochastic training. So far, attempts to adapt L-BFGS to large-scale stochastic training have incurred considerable extra overhead, which offsets its convergence benefits in wall-clock time. In this paper, we propose mL-BFGS, a lightweight momentum-based L-BFGS algorithm that paves the way for quasi-Newton (QN) methods in large-scale distributed deep neural network (DNN) optimization. mL-BFGS introduces a nearly cost-free momentum scheme into the L-BFGS update that greatly reduces stochastic noise in the Hessian approximation, thereby stabilizing convergence during stochastic optimization. For model training at large scale, mL-BFGS approximates a block-wise Hessian, enabling compute and memory costs to be distributed across all computing nodes. We provide a supporting convergence analysis for mL-BFGS in stochastic settings. To investigate mL-BFGS's potential in large-scale DNN training, we train benchmark neural models using mL-BFGS and compare performance with baselines (SGD, Adam, and other quasi-Newton methods). Results show that mL-BFGS achieves noticeable speedups both per iteration and in wall-clock time.
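The L-BFGS machinery this abstract builds on is the two-loop recursion, which applies the inverse-Hessian approximation to a gradient without ever forming a matrix. The sketch below is a generic illustration (not the paper's momentum variant): it checks that the recursion matches the explicit BFGS inverse update built from the same stored $(s, y)$ pairs.

```python
import numpy as np

def lbfgs_direction(g, s_list, y_list, gamma=1.0):
    """L-BFGS two-loop recursion: returns H_k @ g, where H_k is the
    implicit inverse-Hessian built from the stored (s, y) pairs,
    with initial scaling H_0 = gamma * I."""
    q = g.copy()
    alphas = []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        q -= a * y
        alphas.append(a)
    r = gamma * q
    for (s, y), a in zip(zip(s_list, y_list), reversed(alphas)):  # oldest first
        rho = 1.0 / (y @ s)
        b = rho * (y @ r)
        r += (a - b) * s
    return r

def bfgs_inverse_update(H, s, y):
    """Explicit BFGS inverse update (what the recursion applies implicitly)."""
    rho = 1.0 / (y @ s)
    V = np.eye(len(s)) - rho * np.outer(s, y)
    return V @ H @ V.T + rho * np.outer(s, s)

rng = np.random.default_rng(2)
n = 6
s_list = [rng.standard_normal(n) for _ in range(3)]
y_list = [s + 0.2 * rng.standard_normal(n) for s in s_list]
g = rng.standard_normal(n)

H = np.eye(n)
for s, y in zip(s_list, y_list):
    H = bfgs_inverse_update(H, s, y)

# The matrix-free recursion agrees with the explicit dense update.
assert np.allclose(lbfgs_direction(g, s_list, y_list), H @ g)
```

A momentum scheme of the kind the paper describes would smooth the stored curvature information across stochastic minibatches; the details above are only the standard deterministic building block.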

Beattie, C.A.; Benner, P.; Embree, M.; Gugercin, S.; Lefteriu, S. (Eds.) This paper introduces reduced-order model (ROM) based Hessian approximations for use in inexact Newton methods for the solution of optimization problems implicitly constrained by a large-scale system, typically a discretization of a partial differential equation (PDE). The direct application of an inexact Newton method to this problem requires the solution of many PDEs per optimization iteration. To reduce the computational complexity, a ROM Hessian approximation is proposed. Since only the Hessian is approximated, while the original objective function and its gradient are used, the resulting inexact Newton method maintains the first-order global convergence property under suitable assumptions. Thus even computationally inexpensive, lower-fidelity ROMs can be used, which differs from ROM approaches that replace the original optimization problem by a sequence of ROM optimization problems and typically need to accurately approximate function and gradient information of the original problem. In the proposed approach, the quality of the ROM Hessian approximation determines the rate of convergence, but not whether the method converges. The projection-based ROM is constructed from state and adjoint snapshots and is relatively inexpensive to compute. Numerical examples on semilinear parabolic optimal control problems demonstrate that the proposed approach can lead to substantial savings in terms of overall PDE solves required.
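The key point of this abstract, that an approximate Hessian combined with the exact gradient changes only the convergence rate and not the limit point, can be illustrated on a toy quadratic (a hypothetical stand-in for the PDE-constrained problem, not the paper's ROM construction): the "surrogate Hessian" below is deliberately wrong, yet the iteration still reaches the true minimizer.

```python
import numpy as np

# f(x) = 0.5 x^T A x - b^T x, with exact gradient but a crude
# surrogate Hessian B != A standing in for a lower-fidelity ROM.
rng = np.random.default_rng(3)
n = 8
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)           # true (full-model) Hessian
b = rng.standard_normal(n)
x_star = np.linalg.solve(A, b)        # true minimizer of the full model

B = A + 0.3 * np.eye(n)               # inexpensive surrogate Hessian
x = np.zeros(n)
for _ in range(100):
    g = A @ x - b                     # EXACT gradient of the full model
    x = x - np.linalg.solve(B, g)     # Newton step with approximate Hessian

# Fixed points of the iteration are exactly the points where g = 0,
# so the method converges to the full-model solution, just more slowly
# than with the exact Hessian.
assert np.linalg.norm(x - x_star) < 1e-8
```

A better surrogate (smaller perturbation of $A$) would shrink the contraction factor $\|I - B^{-1}A\|$ and speed up convergence, mirroring the paper's statement that the ROM quality determines the rate.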

In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficient in the sense that at every iteration the master node and workers communicate vectors of size $O(p)$, where $p$ is the dimension of the decision variable. The proposed method is based on a distributed asynchronous averaging scheme for decision vectors and gradients, designed to effectively capture the local Hessian information of the objective function. Our convergence theory supports asynchronous computation subject to both bounded delays and unbounded delays with a bounded time-average. Unlike much of the asynchronous optimization literature, we do not require choosing a smaller stepsize when delays are large. We provide numerical experiments that match our theoretical results and showcase significant improvement over state-of-the-art distributed algorithms.
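A heavily simplified, synchronous skeleton of the master/worker communication pattern is sketched below. It is not the paper's asynchronous quasi-Newton method: workers hold hypothetical local quadratics and send only $O(p)$-sized gradient vectors to a master, which averages them and takes a plain gradient step.

```python
import numpy as np

# Each worker holds a local quadratic f_i(x) = 0.5 x^T A_i x - b_i^T x;
# the global objective is their average.
rng = np.random.default_rng(4)
p, n_workers = 5, 4
A_list, b_list = [], []
for _ in range(n_workers):
    M = rng.standard_normal((p, p))
    A_list.append(M @ M.T + p * np.eye(p))   # strongly convex local Hessian
    b_list.append(rng.standard_normal(p))

A_avg = sum(A_list) / n_workers
b_avg = sum(b_list) / n_workers
x_star = np.linalg.solve(A_avg, b_avg)       # global minimizer

x = np.zeros(p)
lr = 1.0 / np.linalg.norm(A_avg, 2)          # step size from smoothness constant
for _ in range(2000):
    grads = [A @ x - b for A, b in zip(A_list, b_list)]  # worker side: O(p) each
    x = x - lr * sum(grads) / n_workers                  # master side: average, step

assert np.linalg.norm(x - x_star) < 1e-6
```

The paper's contribution replaces this first-order step with an averaging scheme that captures local Hessian information and tolerates asynchronous, delayed worker updates; the skeleton only shows the communication shape.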

In this paper, we study the application of quasi-Newton methods to empirical risk minimization (ERM) problems defined over a large dataset. Traditional deterministic and stochastic quasi-Newton methods can be executed to solve such problems; however, it is known that their global convergence rate may not be better than that of first-order methods, and their local superlinear convergence appears only towards the end of the learning process. In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally and throughout the entire learning process. The main idea of the proposed adaptive sample size algorithms is to start with a small subset of data points, solve the corresponding ERM problem to within its statistical accuracy, then enlarge the sample size geometrically and use the optimal solution of the problem with the smaller sample as the initial point for solving the subsequent ERM problem with more samples. We show that if the initial sample size is sufficiently large and we use quasi-Newton methods to solve each subproblem, the subproblems can be solved superlinearly fast (in at most three iterations), as we guarantee that the iterates always stay within a neighborhood in which quasi-Newton methods converge superlinearly. Numerical experiments on various datasets confirm our theoretical results and demonstrate the computational advantages of our method.
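The adaptive sample size loop described above can be sketched for least-squares ERM (a hypothetical example, with a few Newton steps standing in for the quasi-Newton subproblem solver): solve on a small subsample, double the sample geometrically, and warm-start each subproblem from the previous solution.

```python
import numpy as np

# Synthetic least-squares ERM: y = X w_true + noise.
rng = np.random.default_rng(5)
N, p = 4096, 5
X = rng.standard_normal((N, p))
w_true = rng.standard_normal(p)
y = X @ w_true + 0.01 * rng.standard_normal(N)

w = np.zeros(p)
n = 64                                  # initial (small) sample size
while n <= N:
    Xn, yn = X[:n], y[:n]
    H = Xn.T @ Xn / n                   # Hessian of the subsample ERM
    for _ in range(3):                  # a few second-order steps per subproblem
        g = Xn.T @ (Xn @ w - yn) / n
        w = w - np.linalg.solve(H, g)   # warm-started second-order step
    n *= 2                              # geometric sample-size growth

# The final pass uses the full dataset, so w is near the full-data
# least-squares solution, which is close to w_true for small noise.
assert np.linalg.norm(w - w_true) < 0.05
```

Because each subproblem is warm-started from a point that is already within the statistical accuracy of the previous (smaller) sample, only a constant number of steps per stage is needed, which is the mechanism the paper exploits.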