The limited-memory BFGS (Broyden–Fletcher–Goldfarb–Shanno) method is widely used for large-scale unconstrained optimization, but its behavior on nonsmooth problems has received little attention. L-BFGS can be used with or without ‘scaling’; the use of scaling is normally recommended. A simple special case, in which just one BFGS update is stored and used at every iteration, is sometimes known as memoryless BFGS. We analyze memoryless BFGS with scaling, using any Armijo–Wolfe line search, on the function $f(x) = a|x^{(1)}| + \sum_{i=2}^{n} x^{(i)}$, initiated at any point $x_0$ with $x_0^{(1)} \neq 0$. We show that if $a \ge 2\sqrt{n-1}$, the absolute value of the normalized search direction generated by this method converges to a constant vector, and if, in addition, $a$ is larger than a quantity that depends on the Armijo parameter, then the iterates converge to a nonoptimal point $\bar x$ with $\bar x^{(1)} = 0$, although $f$ is unbounded below. As we showed in previous work, the gradient method with any Armijo–Wolfe line search also fails on the same function if $a \ge \sqrt{n-1}$ and $a$ is larger than another quantity depending on the Armijo parameter, but the condition relating $a$ to the Armijo parameter under which scaled memoryless BFGS fails is weaker than the one implying failure of the gradient method. Furthermore, in sharp contrast to the gradient method, if a specific standard Armijo–Wolfe bracketing line search is used, scaled memoryless BFGS fails when $a \ge 2\sqrt{n-1}$ regardless of the Armijo parameter. Finally, numerical experiments indicate that the results may extend to scaled L-BFGS with any fixed number of updates $m$, and to more general piecewise linear functions.
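To make the setting concrete, here is a minimal numerical sketch (not the authors' code) of scaled memoryless BFGS, i.e. L-BFGS with one stored update and the usual scaling $\gamma_k = s_k^\top y_k / y_k^\top y_k$, combined with a standard Armijo–Wolfe bracketing line search and applied to $f(x) = a|x^{(1)}| + \sum_{i=2}^{n} x^{(i)}$. The dimension, the choice $a = 2\sqrt{n-1}$, the line-search parameters, and the starting point are illustrative assumptions; on such runs one can watch $x^{(1)}$ being driven toward $0$ while $f$ stagnates, even though $f$ is unbounded below.

```python
import numpy as np

n = 4
a = 2.0 * np.sqrt(n - 1)           # the threshold value from the analysis; assumed here

def f(x):
    return a * abs(x[0]) + np.sum(x[1:])

def grad(x):
    g = np.ones(n)
    g[0] = a * np.sign(x[0])       # f is nonsmooth at x[0] = 0; iterates keep x[0] != 0
    return g

def armijo_wolfe(x, d, c1=0.01, c2=0.5, max_iter=60):
    """Standard bracketing line search for the (weak) Armijo-Wolfe conditions."""
    lo, hi, t = 0.0, np.inf, 1.0
    f0, g0 = f(x), grad(x) @ d
    for _ in range(max_iter):
        if f(x + t * d) > f0 + c1 * t * g0:     # Armijo condition fails: step too long
            hi = t
        elif grad(x + t * d) @ d < c2 * g0:     # curvature condition fails: step too short
            lo = t
        else:
            return t
        t = 2.0 * lo if np.isinf(hi) else 0.5 * (lo + hi)
    return t

rng = np.random.default_rng(0)
x = rng.standard_normal(n)
x[0] = 1.0                         # any starting point with x[0] != 0
g = grad(x)
s_prev = y_prev = None

for k in range(40):
    ys = 0.0 if s_prev is None else y_prev @ s_prev
    if ys <= 0.0:
        d = -g                     # first iteration (or safeguard): steepest descent
    else:
        # memoryless BFGS direction: one stored pair applied to gamma * I (scaling)
        rho = 1.0 / ys
        gamma = ys / (y_prev @ y_prev)
        q = g.copy()
        alpha = rho * (s_prev @ q)
        q -= alpha * y_prev
        r = gamma * q
        r += s_prev * (alpha - rho * (y_prev @ r))
        d = -r
    t = armijo_wolfe(x, d)
    x_new = x + t * d
    g_new = grad(x_new)
    s_prev, y_prev = x_new - x, g_new - g
    x, g = x_new, g_new
    print(f"iter {k:2d}  x1 = {x[0]: .3e}  f(x) = {f(x): .4f}")
```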
Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models
The gradient descent (GD) method has been widely used to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models in which the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings, as it is $\mathcal{O}(d^3)$ per iteration, where $d$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method for parameter estimation in GLMs, which has a per-iteration cost of $\mathcal{O}(d^2)$. When the SNR is low, for GLMs with a polynomial link function of degree $p$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and that the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and depends only on the degree of the link function $p$. Also, for the empirical loss with $n$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $p$, the iterates of BFGS reach a final statistical radius of $\mathcal{O}((d/n)^{\frac{1}{2p+2}})$ after at most $\log(n/d)$ iterations. This complexity is significantly less than the number of iterations required for GD, which scales polynomially with $(n/d)$.
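As a rough illustration of this setup (a sketch under assumed parameters, not the paper's experiments), the following runs classical BFGS, whose dense inverse-Hessian update costs $\mathcal{O}(d^2)$ per iteration, on the empirical least-squares loss of a GLM with polynomial link $g(z) = z^p$. The degree $p = 2$, the low-SNR data generation with the true parameter close to zero, the backtracking Armijo line search, and all constants are assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, p = 2000, 20, 2                       # samples, dimension, link degree (assumed values)
beta_star = 0.01 * rng.standard_normal(d)   # low-SNR regime: true parameter close to zero
X = rng.standard_normal((n, d))
y = (X @ beta_star) ** p + 0.1 * rng.standard_normal(n)

def loss(beta):
    return np.mean((y - (X @ beta) ** p) ** 2)

def grad(beta):
    u = X @ beta
    return -(2.0 * p / n) * (X.T @ ((y - u ** p) * u ** (p - 1)))

def backtracking(beta, g, dvec, t=1.0, c1=1e-4, shrink=0.5, max_iter=60):
    """Simple Armijo backtracking line search (used here in place of a Wolfe search)."""
    f0, slope = loss(beta), g @ dvec
    for _ in range(max_iter):
        if loss(beta + t * dvec) <= f0 + c1 * t * slope:
            break
        t *= shrink
    return t

beta = rng.standard_normal(d)
H = np.eye(d)                               # inverse-Hessian approximation
g = grad(beta)
for k in range(51):
    dvec = -H @ g                           # O(d^2) work per iteration
    t = backtracking(beta, g, dvec)
    beta_new = beta + t * dvec
    g_new = grad(beta_new)
    s, yk = beta_new - beta, g_new - g
    sy = s @ yk
    if sy > 1e-12:                          # standard BFGS inverse update; skipped if curvature fails
        rho = 1.0 / sy
        V = np.eye(d) - rho * np.outer(s, yk)
        H = V @ H @ V.T + rho * np.outer(s, s)
    beta, g = beta_new, g_new
    if k % 10 == 0:
        print(f"iter {k:3d}  loss {loss(beta):.6f}  ||grad|| {np.linalg.norm(g):.2e}")
```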
- Award ID(s): 2007668
- PAR ID: 10526934
- Publisher / Repository: Transactions on Machine Learning Research
- Date Published:
- Journal Name: Transactions on Machine Learning Research
- ISSN: 2835-8856
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
We develop discrete $W^2_p$-norm error estimates for the Oliker--Prussner method applied to the Monge--Ampère equation. This is obtained by extending discrete Alexandroff estimates and showing that the contact set of a nodal function contains information on its second-order difference. In addition, we show that the size of the complement of the contact set is controlled by the consistency of the method. Combining both observations, we show that the error $\| u - u_h \|_{W^2_{f,p}(\mathcal{N}^I_h)}$ converges with order $O(h^{1/p})$ if $p > d$ and with order $O(h^{1/d} \ln (\frac{1}{h})^{1/d})$ if $p \leq d$, where $\|\cdot\|_{W^2_{f,p}(\mathcal{N}^I_h)}$ is a weighted $W^2_p$-type norm and the hidden constant $C>0$ depends on $\|u\|_{C^{3,1}(\bar\Omega)}$, the dimension $d$, and the constant $p$. Numerical examples are given in two space dimensions and confirm that the estimate is sharp in several cases.
In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of $\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$, where $d$ is the problem dimension and $k$ is the number of iterations. In particular, in the regime where $k = \mathcal{O}(d)$, our method matches the optimal rate of $\mathcal{O}(\frac{1}{k^2})$ achieved by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $k = \Omega(d \log d)$, it outperforms NAG and converges at a faster rate of $\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that the logistic loss can be minimized by GD with any constant stepsize over a long time scale. Furthermore, we prove that with any constant stepsize, the GD iterates tend to infinity when projected onto the max-margin direction (the hard-margin SVM direction) and converge to a fixed vector that minimizes a strongly convex potential when projected onto the orthogonal complement of the max-margin direction. In contrast, we also show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings are in line with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regression.
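A small simulation in the spirit of this setting (an illustrative sketch, not the paper's experiments) is given below: constant-stepsize GD on the logistic loss over linearly separable data constructed symmetrically so that the max-margin direction is $e_1$. The stepsize, the sample size, and the data construction are assumptions; qualitatively, the component of the iterate along the margin direction keeps growing while the orthogonal component settles, and with a large stepsize the loss need not decrease monotonically in the early iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 20
c = rng.uniform(-2.0, 2.0, size=m)
# Symmetric construction: positives at (+1, c_j), negatives at (-1, c_j), so the
# hard-margin (max-margin) direction is e1 = (1, 0).  This is an assumption of the sketch.
X = np.vstack([np.column_stack([np.ones(m), c]),
               np.column_stack([-np.ones(m), c])])
y = np.concatenate([np.ones(m), -np.ones(m)])
n = len(y)

def loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))   # stable log(1 + exp(-y x^T w))

def grad(w):
    s = np.exp(-np.logaddexp(0.0, y * (X @ w)))       # sigmoid(-y x^T w), computed stably
    return -(X.T @ (y * s)) / n

w = np.array([0.0, 1.0])      # nonzero start so the orthogonal component has nontrivial dynamics
eta = 10.0                    # a deliberately large constant stepsize (assumed value)
for k in range(5001):
    if k % 1000 == 0:
        print(f"iter {k:5d}  loss {loss(w):.5f}  margin comp {w[0]:9.3f}  orth comp {w[1]: .5f}")
    w = w - eta * grad(w)
```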
We consider the problem of estimating a $p$-dimensional vector $\beta$ from $n$ observations $Y = X\beta + W$, where $\beta_j \overset{\mathrm{i.i.d.}}{\sim} \pi$ for a real-valued distribution $\pi$ with zero mean and unit variance, $X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,1)$, and $W_i \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,\sigma^2)$. In the asymptotic regime where $n/p \rightarrow \delta$ and $p/\sigma^2 \rightarrow \mathsf{snr}$ for two fixed constants $\delta, \mathsf{snr} \in (0,\infty)$ as $p \rightarrow \infty$, the limiting (normalized) minimum mean-squared error (MMSE) has been characterized by a single-letter (additive Gaussian scalar) channel. In this paper, we show that if the MMSE function of the single-letter channel converges to a step function, then the limiting MMSE of estimating $\beta$ converges to a step function which jumps from 1 to 0 at a critical threshold. Moreover, we establish that the limiting mean-squared error of the (MSE-optimal) approximate message passing algorithm also converges to a step function with a larger threshold, providing evidence for the presence of a computational-statistical gap between the two thresholds.
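For reference, a minimal sketch of the approximate message passing (AMP) recursion for this linear model is given below. It specializes to a Gaussian prior $\pi = \mathcal{N}(0,1)$, for which the posterior-mean denoiser is linear (so no computational-statistical gap arises); the normalization $A = X/\sqrt{n}$, the noise-level estimate $\tau_t^2 = \|z^t\|^2/n$, and all parameter values are assumptions made only to illustrate the structure of the algorithm, including its Onsager correction term.

```python
import numpy as np

rng = np.random.default_rng(1)
p, delta, snr = 2000, 0.5, 4.0
n = int(delta * p)
sigma2 = p / snr                       # noise variance, from p / sigma^2 -> snr

beta = rng.standard_normal(p)          # beta_j drawn i.i.d. from pi = N(0, 1) (assumed prior)
X = rng.standard_normal((n, p))
Y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

# Rescale so the design has N(0, 1/n) entries, the standard AMP normalization (assumption).
A, y = X / np.sqrt(n), Y / np.sqrt(n)

x = np.zeros(p)                        # AMP estimate of beta
z = y.copy()                           # AMP residual
for t in range(30):
    tau2 = z @ z / n                   # empirical estimate of the effective noise variance
    v = x + A.T @ z                    # pseudo-data, approximately beta + N(0, tau2) noise
    eta_prime = 1.0 / (1.0 + tau2)     # derivative of the linear Gaussian-prior denoiser
    x_new = v * eta_prime              # posterior-mean denoiser E[beta_j | beta_j + N(0, tau2)]
    z = y - A @ x_new + (1.0 / delta) * eta_prime * z   # residual with Onsager correction
    x = x_new

print("normalized MSE:", np.mean((x - beta) ** 2))
```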