skip to main content


Title: Exploiting Local Convergence of Quasi-Newton Methods Globally: Adaptive Sample Size Approach
In this paper, we study the application of quasi-Newton methods for solving empirical risk minimization (ERM) problems defined over a large dataset. Traditional deterministic and stochastic quasi-Newton methods can be executed to solve such problems; however, it is known that their global convergence rate may not be better than first-order methods, and their local superlinear convergence only appears towards the end of the learning process. In this paper, we use an adaptive sample size scheme that exploits the superlinear convergence of quasi-Newton methods globally and throughout the entire learning process. The main idea of the proposed adaptive sample size algorithms is to start with a small subset of data points and solve their corresponding ERM problem within its statistical accuracy, and then enlarge the sample size geometrically and use the optimal solution of the problem corresponding to the smaller set as an initial point for solving the subsequent ERM problem with more samples. We show that if the initial sample size is sufficiently large and we use quasi-Newton methods to solve each subproblem, the subproblems can be solved superlinearly fast (after at most three iterations), as we guarantee that the iterates always stay within a neighborhood that quasi-Newton methods converge superlinearly. Numerical experiments on various datasets confirm our theoretical results and demonstrate the computational advantages of our method.  more » « less
Award ID(s):
2007668
NSF-PAR ID:
10337825
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Advances in Neural Information Processing Systems (NeurIPS 2021)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    In this paper, we consider distributed algorithms for solving the empirical risk minimization problem under the master/worker communication model. We develop a distributed asynchronous quasi-Newton algorithm that can achieve superlinear convergence. To our knowledge, this is the first distributed asynchronous algorithm with superlinear convergence guarantees. Our algorithm is communication-efficient in the sense that at every iteration the master node and workers communicate vectors of size đť‘‚(đť‘ť), where đť‘ť is the dimension of the decision variable. The proposed method is based on a distributed asynchronous averaging scheme of decision vectors and gradients in a way to effectively capture the local Hessian information of the objective function. Our convergence theory supports asynchronous computations subject to both bounded delays and unbounded delays with a bounded time-average. Unlike in the majority of asynchronous optimization literature, we do not require choosing smaller stepsize when delays are huge. We provide numerical experiments that match our theoretical results and showcase significant improvement comparing to state-of-the-art distributed algorithms. 
    more » « less
  2. In recent years, there is a growing need to train machine learning models on a huge volume of data. Therefore, designing efficient distributed optimization algorithms for empirical risk minimization (ERM) has become an active and challenging research topic. In this paper, we propose a flexible framework for distributed ERM training through solving the dual problem, which provides a unified description and comparison of existing methods. Our approach requires only approximate solutions of the sub-problems involved in the optimization process, and is versatile to be applied on many large-scale machine learning problems including classification, regression, and structured prediction. We show that our framework enjoys global linear convergence for a broad class of non-strongly-convex problems, and some specific choices of the sub-problems can even achieve much faster convergence than existing approaches by a refined analysis. This improved convergence rate is also reflected in the superior empirical performance of our method. 
    more » « less
  3. Newton's method is usually preferred when solving optimization problems due to its superior convergence properties compared to gradient-based or derivative-free optimization algorithms. However, deriving and computing second-order derivatives needed by Newton's method often is not trivial and, in some cases, not possible. In such cases quasi-Newton algorithms are a great alternative. In this paper, we provide a new derivation of well-known quasi-Newton formulas in an infinite-dimensional Hilbert space setting. It is known that quasi-Newton update formulas are solutions to certain variational problems over the space of symmetric matrices. In this paper, we formulate similar variational problems over the space of bounded symmetric operators in Hilbert spaces. By changing the constraints of the variational problem we obtain updates (for the Hessian and Hessian inverse) not only for the Broyden-Fletcher-Goldfarb-Shanno (BFGS) quasi-Newton method but also for Davidon--Fletcher--Powell (DFP), Symmetric Rank One (SR1), and Powell-Symmetric-Broyden (PSB). In addition, for an inverse problem governed by a partial differential equation (PDE), we derive DFP and BFGS ``structured" secant formulas that explicitly use the derivative of the regularization and only approximates the second derivative of the misfit term. We show numerical results that demonstrate the desired mesh-independence property and superior performance of the resulting quasi-Newton methods. 
    more » « less
  4. Abstract

    In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton algorithms which includes the Davidon–Fletcher–Powell (DFP) method and the Broyden–Fletcher–Goldfarb–Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied in the literature, but their explicit finite–time local convergence rate is not fully investigated. In this paper, we provide a finite–time (non-asymptotic) convergence analysis for Broyden quasi-Newton algorithms under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous at the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of$$(1/k)^{k/2}$$(1/k)k/2, wherekis the number of iterations. We also prove a similar local superlinear convergence result holds for the case that the objective function is self-concordant. Numerical experiments on several datasets confirm our explicit convergence rate bounds. Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for quasi-Newton methods.

     
    more » « less
  5. This work concerns the local convergence theory of Newton and quasi-Newton methods for convex-composite optimization: where one minimizes an objective that can be written as the composition of a convex function with one that is continuiously differentiable. We focus on the case in which the convex function is a potentially infinite-valued piecewise linear-quadratic function. Such problems include nonlinear programming, mini-max optimization, and estimation of nonlinear dynamics with non-Gaussian noise as well as many modern approaches to large-scale data analysis and machine learning. Our approach embeds the optimality conditions for convex-composite optimization problems into a generalized equation. We establish conditions for strong metric subregularity and strong metric regularity of the corresponding set-valued mappings. This allows us to extend classical convergence of Newton and quasi-Newton methods to the broader class of nonfinite valued piecewise linear-quadratic convex-composite optimization problems. In particular, we establish local quadratic convergence of the Newton method under conditions that parallel those in nonlinear programming. 
    more » « less