Title: Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models
The gradient descent (GD) method has been widely used for parameter estimation in generalized linear models (GLMs), a generalization of linear models in which the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss function in the low SNR case, its computational cost is prohibitive in high-dimensional settings, as it is $$\mathcal{O}(d^3)$$, where $$d$$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method for parameter estimation in GLMs, which has a per-iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-squares loss function, and that the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of the problem parameters and depends only on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly lower than the number of iterations required for GD, which scales polynomially with $$n/d$$.
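As a concrete illustration of the setup in the abstract, the following is a minimal sketch and not the authors' implementation: it draws data from a GLM with a polynomial link of degree $$p = 2$$ in a low-SNR regime (small signal norm) and minimizes the empirical least-squares loss with an off-the-shelf BFGS routine; the problem sizes, noise level, and initialization are assumed values.

```python
# Minimal sketch (not the paper's exact setup): fit a GLM with a polynomial
# link y ~ (x^T theta*)^p + noise by minimizing the empirical least-squares
# loss with BFGS. The low-SNR regime corresponds to ||theta*|| close to zero.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
n, d, p = 2000, 10, 2                         # samples, dimension, link degree (assumed)
theta_star = 0.01 * rng.standard_normal(d)    # small signal: low-SNR regime
X = rng.standard_normal((n, d))
y = (X @ theta_star) ** p + 0.1 * rng.standard_normal(n)

def loss(theta):
    r = (X @ theta) ** p - y
    return np.mean(r ** 2)

def grad(theta):
    u = X @ theta
    r = u ** p - y
    return (2.0 / n) * X.T @ (r * p * u ** (p - 1))

# BFGS builds an inverse-Hessian approximation from gradient differences,
# which is the source of the O(d^2) per-iteration cost, versus O(d^3) for Newton.
res = minimize(loss, x0=np.ones(d), jac=grad, method="BFGS")
print(res.nit, loss(res.x))
```

The per-iteration work here is dominated by the rank-two update of the inverse-Hessian approximation, which is quadratic in $$d$$.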
Award ID(s):
2007668
PAR ID:
10526934
Author(s) / Creator(s):
Publisher / Repository:
Transactions on Machine Learning Research
Date Published:
Journal Name:
Transactions on Machine Learning Research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We develop discrete $$W^2_p$$-norm error estimates for the Oliker-Prussner method applied to the Monge-Ampère equation. This is obtained by extending discrete Alexandroff estimates and showing that the contact set of a nodal function contains information on its second-order difference. In addition, we show that the size of the complement of the contact set is controlled by the consistency of the method. Combining both observations, we show that the error $$\| u - u_h \|_{W^2_{f,p}(\mathcal{N}^I_h)}$$ converges in order $$O(h^{1/p})$$ if $$p > d$$ and in order $$O(h^{1/d} \ln (\frac{1}{h})^{1/d})$$ if $$p \leq d$$, where $$\|\cdot\|_{W^2_{f,p}(\mathcal{N}^I_h)}$$ is a weighted $$W^2_p$$-type norm and the constant $$C > 0$$ in these estimates depends on $$\|u\|_{C^{3,1}(\bar\Omega)}$$, the dimension $$d$$, and the constant $$p$$. Numerical examples are given in two space dimensions and confirm that the estimate is sharp in several cases. (A comparison of the two convergence orders for sample values appears after this list.)
  2. In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of $$\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$$, where $$d$$ is the problem dimension and $$k$$ is the number of iterations. In particular, in the regime where $$k = \mathcal{O}(d)$$, our method matches the optimal rate of $$\mathcal{O}(\frac{1}{k^2})$$ achieved by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $$k = \Omega(d \log d)$$, it outperforms NAG and converges at a faster rate of $$\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices. (A numerical comparison of the two rate bounds appears after this list.)
  3. Posterior sampling with the spike-and-slab prior [MB88], a popular multimodal distribution used to model uncertainty in variable selection, is considered the theoretical gold standard for Bayesian sparse linear regression [CPS09, Roc18]. However, designing provable algorithms for this sampling task is notoriously challenging. Existing posterior samplers for Bayesian sparse variable selection either require strong assumptions about the signal-to-noise ratio (SNR) [YWJ16], only work when the measurement count grows at least linearly in the dimension [MW24], or rely on heuristic approximations to the posterior. We give the first provable algorithms for spike-and-slab posterior sampling that apply for any SNR and use a measurement count sublinear in the problem dimension. Concretely, assume we are given a measurement matrix $$X \in \mathbb{R}^{n \times d}$$ and noisy observations $$y = X\theta^\star + \xi$$ of a signal $$\theta^\star$$ drawn from a spike-and-slab prior $$\pi$$ with a Gaussian diffuse density and expected sparsity $$k$$, where $$\xi \sim \mathcal{N}(\mathbf{0}_n, \sigma^2 I_n)$$. We give a polynomial-time high-accuracy sampler for the posterior $$\pi(\cdot \mid X, y)$$, for any SNR $$\sigma^{-1} > 0$$, as long as $$n \geq k^3 \cdot \mathrm{polylog}(d)$$ and $$X$$ is drawn from a matrix ensemble satisfying the restricted isometry property. We further give a sampler that runs in near-linear time $$\approx nd$$ in the same setting, as long as $$n \geq k^5 \cdot \mathrm{polylog}(d)$$. To demonstrate the flexibility of our framework, we extend our result to spike-and-slab posterior sampling with Laplace diffuse densities, achieving similar guarantees when $$\sigma = O(1/k)$$ is bounded. (A minimal sketch of this observation model appears after this list.)
  4. Recent research has observed that in machine learning optimization, gradient descent (GD) often operates at the edge of stability (EoS) [Cohen et al., 2021], where the stepsizes are set to be large, resulting in non-monotonic losses induced by the GD iterates. This paper studies the convergence and implicit bias of constant-stepsize GD for logistic regression on linearly separable data in the EoS regime. Despite the presence of local oscillations, we prove that the logistic loss can be minimized by GD with any constant stepsize over a long time scale. Furthermore, we prove that with any constant stepsize, the GD iterates tend to infinity when projected onto the max-margin direction (the hard-margin SVM direction) and converge to a fixed vector that minimizes a strongly convex potential when projected onto the orthogonal complement of the max-margin direction. In contrast, we also show that in the EoS regime, GD iterates may diverge catastrophically under the exponential loss, highlighting the superiority of the logistic loss. These theoretical findings are in line with numerical simulations and complement existing theories on the convergence and implicit bias of GD for logistic regression. (A small runnable sketch of this setting appears after this list.)
  5. We consider the problem of estimating a $$p$$-dimensional vector $$\beta$$ from $$n$$ observations $$Y = X\beta + W$$, where $$\beta_{j} \overset{\mathrm{i.i.d.}}{\sim} \pi$$ for a real-valued distribution $$\pi$$ with zero mean and unit variance, $$X_{ij} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0,1)$$, and $$W_{i} \overset{\mathrm{i.i.d.}}{\sim} \mathcal{N}(0, \sigma^{2})$$. In the asymptotic regime where $$n/p \rightarrow \delta$$ and $$p/\sigma^{2} \rightarrow \mathsf{snr}$$ for two fixed constants $$\delta, \mathsf{snr} \in (0, \infty)$$ as $$p \rightarrow \infty$$, the limiting (normalized) minimum mean-squared error (MMSE) has been characterized by a single-letter (additive Gaussian scalar) channel. In this paper, we show that if the MMSE function of the single-letter channel converges to a step function, then the limiting MMSE of estimating $$\beta$$ converges to a step function which jumps from 1 to 0 at a critical threshold. Moreover, we establish that the limiting mean-squared error of the (MSE-optimal) approximate message passing algorithm also converges to a step function with a larger threshold, providing evidence for the presence of a computational-statistical gap between the two thresholds. (A sketch of this observation model and its scalings appears after this list.)
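For item 1 above, the snippet below is arithmetic on the stated bounds only, not an implementation of the Oliker-Prussner method; the values of $$p$$, $$d$$, and $$h$$ are assumptions chosen for illustration.

```python
# Evaluate the two asserted convergence orders from item 1 as h shrinks:
# h^(1/p) for the p > d case, and h^(1/d) ln(1/h)^(1/d) for the p <= d case.
import numpy as np

p, d = 3, 2                                   # assumed values, with p > d in the first expression
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    order_p_gt_d = h ** (1 / p)
    order_p_le_d = h ** (1 / d) * np.log(1 / h) ** (1 / d)
    print(f"h={h:.0e}   h^(1/p)={order_p_gt_d:.3e}   h^(1/d) ln(1/h)^(1/d)={order_p_le_d:.3e}")
```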
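For item 2 above, the snippet only evaluates the two rate expressions from that abstract for an assumed dimension $$d$$, to show where the quasi-Newton bound overtakes the $$1/k^2$$ NAG bound; it does not implement the proposed method.

```python
# Compare the NAG rate 1/k^2 with the bound sqrt(d log k)/k^{2.5};
# the crossover occurs once k grows past roughly d log d.
import numpy as np

d = 100                                       # assumed problem dimension
for k in [10, 100, 1_000, 10_000]:
    nag_bound = 1 / k**2
    qn_bound = np.sqrt(d * np.log(k)) / k**2.5
    print(f"k={k:>6}   NAG: {nag_bound:.2e}   quasi-Newton: {qn_bound:.2e}")
```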
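For item 3 above, this is a minimal sketch of the stated observation model only (not the proposed samplers); the problem sizes, the coordinate-wise Bernoulli-Gaussian form of the prior, and the choice of a scaled Gaussian design as the RIP-satisfying ensemble are assumptions.

```python
# Spike-and-slab observation model: theta* has expected sparsity k with a
# Gaussian slab, X is an n x d design, and y = X theta* + xi with Gaussian noise.
import numpy as np

rng = np.random.default_rng(0)
n, d, k, sigma = 500, 2000, 5, 1.0            # n sublinear in d is the regime of interest

spike = rng.random(d) < k / d                 # each coordinate is nonzero with probability k/d
theta_star = np.where(spike, rng.standard_normal(d), 0.0)   # Gaussian "slab" values on the spikes
X = rng.standard_normal((n, d)) / np.sqrt(n)  # scaled Gaussian ensemble, typically RIP for n >> k polylog(d)
xi = sigma * rng.standard_normal(n)
y = X @ theta_star + xi                       # the posterior of interest is pi(theta | X, y)
print(spike.sum(), y.shape)
```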
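For item 4 above, a small sketch of the setting (data, stepsize, and horizon are assumed values): constant-stepsize GD on the logistic loss for synthetically separable data, with a stepsize large enough that the loss need not decrease monotonically.

```python
# Constant-stepsize GD on the logistic loss with linearly separable data:
# the iterate norm keeps growing while the loss is driven toward zero over a long horizon.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
w_sep = rng.standard_normal(d)
X = rng.standard_normal((n, d))
y = np.sign(X @ w_sep)                        # labels from a separating direction -> separable data

def logistic_loss(w):
    return np.mean(np.logaddexp(0.0, -y * (X @ w)))   # stable log(1 + exp(-margin))

def grad(w):
    m = y * (X @ w)
    s = np.exp(-np.logaddexp(0.0, m))         # sigmoid(-margin), computed stably
    return -(X * (y * s)[:, None]).mean(axis=0)

eta, w = 10.0, np.zeros(d)                    # deliberately large constant stepsize
for t in range(2000):
    w -= eta * grad(w)
print(logistic_loss(w), np.linalg.norm(w))    # small loss, large (and still growing) iterate norm
```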
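For item 5 above, a minimal sketch of the observation model and its asymptotic scalings; the Rademacher choice for $$\pi$$ and the particular values of $$\delta$$ and $$\mathsf{snr}$$ are assumptions.

```python
# Observation model Y = X beta + W with n/p = delta and p/sigma^2 = snr held fixed.
import numpy as np

rng = np.random.default_rng(0)
p, delta, snr = 1000, 0.5, 4.0                # assumed values of the fixed constants
n = int(delta * p)
sigma2 = p / snr

beta = rng.choice([-1.0, 1.0], size=p)        # zero mean, unit variance (example choice of pi)
X = rng.standard_normal((n, p))
W = np.sqrt(sigma2) * rng.standard_normal(n)
Y = X @ beta + W
print(n, sigma2, Y.shape)
```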