In this paper, we study the finite-sum convex optimization problem, focusing on the general convex case. Recently, the study of variance reduced (VR) methods and their accelerated variants has made exciting progress. However, the step size used in the existing VR algorithms typically depends on the smoothness parameter, which is often unknown and requires tuning in practice. To address this problem, we propose two novel adaptive VR algorithms: Adaptive Variance Reduced Accelerated Extra-Gradient (AdaVRAE) and Adaptive Variance Reduced Accelerated Gradient (AdaVRAG). Our algorithms do not require knowledge of the smoothness parameter. AdaVRAE uses $$\mathcal{O}\left(n\log\log n+\sqrt{\frac{n\beta}{\epsilon}}\right)$$ and AdaVRAG uses $$\mathcal{O}\left(n\log\log n+\sqrt{\frac{n\beta\log\beta}{\epsilon}}\right)$$ gradient evaluations to attain an $$\mathcal{O}(\epsilon)$$-suboptimal solution, where $$n$$ is the number of functions in the finite sum and $$\beta$$ is the smoothness parameter. This result matches the best-known convergence rate of non-adaptive VR methods and improves upon the convergence rate of the state-of-the-art adaptive VR method, AdaSVRG. We demonstrate the superior performance of our algorithms compared with previous methods in experiments on real-world datasets.
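To make the adaptivity idea concrete, below is a minimal sketch of a finite-sum loop that pairs the standard SVRG variance-reduced gradient estimator with a hypothetical AdaGrad-style step size built only from observed gradients, so no smoothness constant $$\beta$$ is needed. This is a generic illustration under those assumptions, not the AdaVRAE or AdaVRAG update rules, which additionally rely on acceleration and extra-gradient steps.

```python
import numpy as np

def adaptive_svrg(grad_i, w0, n, epochs=20, inner_steps=None, eps=1e-12):
    """SVRG with an AdaGrad-style step size (generic illustration only).

    grad_i(w, i) -- gradient of the i-th component function f_i at w.
    n            -- number of component functions in the finite sum.
    """
    w = np.asarray(w0, dtype=float).copy()
    m = inner_steps or n
    accum = np.zeros_like(w)                      # running sum of squared gradients
    rng = np.random.default_rng(0)
    for _ in range(epochs):
        w_snap = w.copy()
        full_grad = sum(grad_i(w_snap, i) for i in range(n)) / n
        for _ in range(m):
            i = rng.integers(n)
            # SVRG variance-reduced gradient estimate at the current iterate
            g = grad_i(w, i) - grad_i(w_snap, i) + full_grad
            accum += g * g
            w -= g / (np.sqrt(accum) + eps)       # step size set from data, not from beta
    return w
```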
Accelerated Quasi-Newton Proximal Extragradient: Faster Rate for Smooth Convex Optimization
In this paper, we present an accelerated quasi-Newton proximal extragradient method for solving unconstrained smooth convex optimization problems. With access only to the gradients of the objective function, we prove that our method can achieve a convergence rate of $$\mathcal{O}\bigl(\min\{\frac{1}{k^2}, \frac{\sqrt{d\log k}}{k^{2.5}}\}\bigr)$$, where $$d$$ is the problem dimension and $$k$$ is the number of iterations. In particular, in the regime where $$k = \mathcal{O}(d)$$, our method matches the optimal rate of $$\mathcal{O}(\frac{1}{k^2})$$ achieved by Nesterov's accelerated gradient (NAG). Moreover, in the regime where $$k = \Omega(d \log d)$$, it outperforms NAG and converges at a faster rate of $$\mathcal{O}\bigl(\frac{\sqrt{d\log k}}{k^{2.5}}\bigr)$$. To the best of our knowledge, this result is the first to demonstrate a provable gain for a quasi-Newton-type method over NAG in the convex setting. To achieve such results, we build our method on a recent variant of the Monteiro-Svaiter acceleration framework and adopt an online learning perspective to update the Hessian approximation matrices, in which we relate the convergence rate of our method to the dynamic regret of a specific online convex optimization problem in the space of matrices.
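For reference, the $$\mathcal{O}(\frac{1}{k^2})$$ baseline the method is compared against, Nesterov's accelerated gradient, can be sketched in a few lines. The sketch assumes the smoothness constant $$L$$ is known; it does not reproduce the paper's quasi-Newton proximal extragradient updates or its online-learning Hessian approximation.

```python
import numpy as np

def nag(grad, x0, L, iters=1000):
    """Nesterov's accelerated gradient for an L-smooth convex objective."""
    x = np.asarray(x0, dtype=float).copy()
    y = x.copy()
    t = 1.0
    for _ in range(iters):
        x_next = y - grad(y) / L                          # gradient step at the extrapolated point
        t_next = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)  # momentum extrapolation
        x, t = x_next, t_next
    return x
```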
- Award ID(s): 2007668
- PAR ID: 10526930
- Publisher / Repository: NeurIPS proceedings
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate. Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to subspace second-order methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $$d$$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $$\mathcal{O}\left(\frac{1}{mk}+\frac{1}{k^2}\right)$$ for solving convex optimization problems. Here, $$m$$ represents the subspace dimension, which can be significantly smaller than $$d$$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the Krylov subspace associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems. (An illustrative sketch of building such a Krylov-subspace basis from Hessian-vector products appears after this list.)
- Quasi-Newton algorithms are among the most popular iterative methods for solving unconstrained minimization problems, largely due to their favorable superlinear convergence property. However, existing results for these algorithms are limited as they provide either (i) a global convergence guarantee with an asymptotic superlinear convergence rate, or (ii) a local non-asymptotic superlinear rate for the case that the initial point and the initial Hessian approximation are chosen properly. In particular, no current analysis for quasi-Newton methods guarantees global convergence with an explicit superlinear convergence rate. In this paper, we close this gap and present the first globally convergent quasi-Newton method with an explicit non-asymptotic superlinear convergence rate. Unlike classical quasi-Newton methods, we build our algorithm upon the hybrid proximal extragradient method and propose a novel online learning framework for updating the Hessian approximation matrices. Specifically, guided by the convergence analysis, we formulate the Hessian approximation update as an online convex optimization problem in the space of matrices, and we relate the bounded regret of the online problem to the superlinear convergence of our method. (For contrast with this online-learning update, a sketch of the classical BFGS approximation update appears after this list.)
- We show how to construct a $$(1+\varepsilon)$$-spanner over a set $$P$$ of $$n$$ points in $$\mathbb{R}^d$$ that is resilient to a catastrophic failure of nodes. Specifically, for prescribed parameters $$\vartheta, \varepsilon \in (0,1)$$, the computed spanner $$G$$ has $$\mathcal{O}\bigl(\varepsilon^{-O(d)} \vartheta^{-6} n (\log\log n)^6 \log n\bigr)$$ edges. Furthermore, for any $$k$$, and any deleted set $$B \subseteq P$$ of $$k$$ points, the residual graph $$G \setminus B$$ is a $$(1+\varepsilon)$$-spanner for all the points of $$P$$ except for $$(1+\vartheta)k$$ of them. No previous constructions, beyond the trivial clique with $$\mathcal{O}(n^2)$$ edges, were known with this resilience property (i.e., only a tiny additional fraction of vertices, $$\vartheta|B|$$, lose their distance preserving connectivity). Our construction works by first solving the exact problem in one dimension, and then showing a surprisingly simple and elegant construction in higher dimensions that uses the one-dimensional construction in a black-box fashion.
- Statistical and Computational Complexities of BFGS Quasi-Newton Method for Generalized Linear Models. The gradient descent (GD) method has been used widely to solve parameter estimation in generalized linear models (GLMs), a generalization of linear models when the link function can be non-linear. In GLMs with a polynomial link function, it has been shown that in the high signal-to-noise ratio (SNR) regime, due to the problem's strong convexity and smoothness, GD converges linearly and reaches the final desired accuracy in a logarithmic number of iterations. In contrast, in the low SNR setting, where the problem becomes locally convex, GD converges at a slower rate and requires a polynomial number of iterations to reach the desired accuracy. Even though Newton's method can be used to resolve the flat curvature of the loss functions in the low SNR case, its computational cost is prohibitive in high-dimensional settings as it is $$\mathcal{O}(d^3)$$, where $$d$$ is the problem dimension. To address the shortcomings of GD and Newton's method, we propose the use of the BFGS quasi-Newton method to solve parameter estimation of the GLMs, which has a per-iteration cost of $$\mathcal{O}(d^2)$$. When the SNR is low, for GLMs with a polynomial link function of degree $$p$$, we demonstrate that the iterates of BFGS converge linearly to the optimal solution of the population least-square loss function, and the contraction coefficient of the BFGS algorithm is comparable to that of Newton's method. Moreover, the contraction factor of the linear rate is independent of problem parameters and only depends on the degree of the link function $$p$$. Also, for the empirical loss with $$n$$ samples, we prove that in the low SNR setting of GLMs with a polynomial link function of degree $$p$$, the iterates of BFGS reach a final statistical radius of $$\mathcal{O}((d/n)^{\frac{1}{2p+2}})$$ after at most $$\log(n/d)$$ iterations. This complexity is significantly less than the number required for GD, which scales polynomially with $$(n/d)$$. (A small end-to-end example of fitting such a polynomial-link GLM with an off-the-shelf BFGS implementation appears after this list.)
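Regarding the Krylov-subspace method listed above: the sketch below shows one standard way to build an orthonormal Krylov basis from the gradient and Hessian-vector products. The function name, arguments, and tolerance are illustrative assumptions; the cubic-regularized Newton subproblem would then be solved within this subspace (e.g., using the projected Hessian $$Q^\top H Q$$), which is not shown here.

```python
import numpy as np

def krylov_basis(hvp, g, m, tol=1e-12):
    """Orthonormal basis of the Krylov subspace span{g, Hg, ..., H^(m-1) g}.

    hvp(v) -- returns the Hessian-vector product H @ v (no explicit Hessian needed).
    g      -- current gradient vector.
    """
    Q = [g / np.linalg.norm(g)]
    for _ in range(m - 1):
        v = hvp(Q[-1])
        for u in Q:                         # Gram-Schmidt against previous basis vectors
            v = v - (u @ v) * u
        nv = np.linalg.norm(v)
        if nv < tol:                        # subspace became invariant; stop early
            break
        Q.append(v / nv)
    return np.stack(Q, axis=1)              # d x (at most m) basis matrix
```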
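For the globally convergent quasi-Newton result above: as a point of contrast, here is the textbook BFGS rule for updating a Hessian approximation $$B$$ from an iterate difference and a gradient difference. This is the classical update only, not the online-learning update proposed in that paper.

```python
import numpy as np

def bfgs_update(B, s, y, tol=1e-12):
    """Classical BFGS update of a Hessian approximation B.

    s -- iterate difference x_new - x_old.
    y -- gradient difference grad_new - grad_old.
    """
    sy = s @ y
    if sy <= tol:                           # curvature condition fails: keep B unchanged
        return B
    Bs = B @ s
    return B - np.outer(Bs, Bs) / (s @ Bs) + np.outer(y, y) / sy
```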
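For the BFGS-for-GLMs result above: a small, self-contained toy example of fitting a GLM with a polynomial link by least squares using SciPy's generic BFGS implementation is given below. The data-generating setup, link degree, noise level, and starting point are illustrative assumptions and do not follow that paper's exact experimental setting or its low-SNR analysis.

```python
import numpy as np
from scipy.optimize import minimize

# Toy setup (assumed for illustration): GLM with polynomial link g(t) = t**p,
# fitted by least squares with SciPy's off-the-shelf BFGS.
rng = np.random.default_rng(0)
n, d, p = 2000, 10, 2
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d) / np.sqrt(d)
y = (X @ w_star) ** p + 0.1 * rng.standard_normal(n)

def loss(w):
    r = (X @ w) ** p - y
    return 0.5 * np.mean(r ** 2)

def grad(w):
    z = X @ w
    return X.T @ (((z ** p) - y) * p * z ** (p - 1)) / n

res = minimize(loss, x0=0.1 * np.ones(d), jac=grad, method="BFGS")
print(res.nit, "iterations, final loss:", loss(res.x))
```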