We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an $$(\epsilon,\sqrt{\epsilon})$$-approximate local minimum within $$\tilde{O}(n^{4/5}/\epsilon^{3/2})$$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms, including subsampled cubic regularization. Our work also sheds light on the application of variance reduction techniques to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.
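The semi-stochastic construction can be illustrated on a toy finite-sum least-squares problem. The sketch below is illustrative only (all names, constants, and the inner gradient-descent solver for the cubic subproblem are assumptions, not the paper's implementation): the minibatch gradient and Hessian are corrected by snapshot quantities, and the cubic-regularized model is then minimized approximately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum problem: f(x) = (1/2n) * sum_i (a_i . x - b_i)^2
n, d = 50, 3
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def batch_grad(x, idx):
    """Gradient of f averaged over the samples in idx."""
    return A[idx].T @ (A[idx] @ x - b[idx]) / len(idx)

def batch_hess(x, idx):
    """Hessian averaged over idx (constant in x for least squares)."""
    return A[idx].T @ A[idx] / len(idx)

def svrc_step(x, snap, M=5.0, batch=10, inner_iters=300, lr=0.02):
    """One SVRC-style step: semi-stochastic gradient/Hessian + cubic model."""
    idx = rng.choice(n, size=batch, replace=False)
    full = np.arange(n)
    # Semi-stochastic gradient: minibatch estimate corrected by the snapshot.
    g = batch_grad(x, idx) - batch_grad(snap, idx) + batch_grad(snap, full)
    # Semi-stochastic Hessian, built the same way (the correction is exact
    # here because the least-squares Hessian does not depend on x).
    H = batch_hess(x, idx) - batch_hess(snap, idx) + batch_hess(snap, full)
    # Minimize the cubic model m(h) = g.h + h'Hh/2 + (M/6)||h||^3 by
    # gradient descent; note grad m(h) = g + Hh + (M/2)||h|| h.
    h = np.zeros(d)
    for _ in range(inner_iters):
        h -= lr * (g + H @ h + 0.5 * M * np.linalg.norm(h) * h)
    return x + h
```

Because the model upper-bounds the true decrease for a sufficiently large penalty $$M$$, each accepted step reduces the objective on this toy problem.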
Escaping Saddle-Point Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $$\epsilon$$-local-minimizer matches the corresponding deterministic rate of $$\tilde{O}(1/\epsilon^2)$$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an $$\epsilon$$-local-minimizer under such conditions is $$\tilde{O}(1/\epsilon^{2.5})$$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $$\tilde{O}(1/\epsilon^{1.5})$$ achieved by the deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
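The role of the perturbation in PSGD can be seen on a toy strict-saddle function. The sketch below is a simplification (the function, step sizes, perturbation radius, and budget are all illustrative assumptions, not the analyzed algorithm): whenever the gradient is small, an isotropic perturbation is injected so the iterate can fall off the saddle.

```python
import numpy as np

rng = np.random.default_rng(1)

def psgd(grad, x0, steps=3000, lr=0.05, radius=1e-3, noise=0.1,
         max_perturbs=5):
    """Perturbed gradient descent: near any stationary point (where the
    gradient norm is below `radius`), inject an isotropic perturbation so
    a strict saddle cannot trap the iterate."""
    x = np.array(x0, dtype=float)
    budget = max_perturbs
    for _ in range(steps):
        g = grad(x)
        if np.linalg.norm(g) < radius and budget > 0:
            x = x + noise * rng.normal(size=x.shape)
            budget -= 1
            continue
        x = x - lr * g
    return x

# Test function with a strict saddle at the origin and minima at (+-1, 0):
#   f(x, y) = x^4/4 - x^2/2 + y^2/2
def grad_f(v):
    return np.array([v[0] ** 3 - v[0], v[1]])
```

Started exactly at the saddle `(0, 0)`, plain gradient descent never moves; with the perturbation, the iterate slides into one of the two minima at `(+-1, 0)`.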
- Award ID(s):
- 1934568
- PAR ID:
- 10282296
- Journal Name:
- 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to \textit{subspace second-order} methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates depending on the problem's dimension $$d$$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $$O\left(\frac{1}{mk}+\frac{1}{k^2}\right)$$ for solving convex optimization problems. Here, $$m$$ represents the subspace dimension, which can be significantly smaller than $$d$$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the \emph{Krylov subspace} associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.
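The subspace construction needs only Hessian-vector products, never the full $$d \times d$$ Hessian. The sketch below is an assumption-laden toy (illustrative names; plain gradient descent stands in for an exact cubic-subproblem solver): it builds an orthonormal basis of the Krylov subspace spanned by the gradient and its Hessian powers, then takes the cubic-regularized step in the reduced coordinates.

```python
import numpy as np

def krylov_basis(hvp, g, m):
    """Orthonormal basis of K_m = span{g, Hg, ..., H^{m-1} g}, built from
    Hessian-vector products only."""
    d = g.shape[0]
    Q = np.zeros((d, m))
    Q[:, 0] = g / np.linalg.norm(g)
    for j in range(1, m):
        w = hvp(Q[:, j - 1])
        w -= Q[:, :j] @ (Q[:, :j].T @ w)   # Gram-Schmidt orthogonalization
        nrm = np.linalg.norm(w)
        if nrm < 1e-12:                    # invariant subspace: stop early
            return Q[:, :j]
        Q[:, j] = w / nrm
    return Q

def krylov_cubic_step(x, grad_fn, hvp, m=5, M=10.0, iters=400, lr=0.02):
    """Cubic-regularized Newton step restricted to the Krylov subspace.
    With orthonormal Q, ||Q z|| = ||z||, so the reduced model is
    m(z) = (Q'g).z + z'(Q'HQ)z/2 + (M/6)||z||^3, minimized here by GD."""
    g = grad_fn(x)
    Q = krylov_basis(hvp, g, m)
    gm = Q.T @ g
    Hm = Q.T @ np.column_stack([hvp(Q[:, j]) for j in range(Q.shape[1])])
    z = np.zeros(Q.shape[1])
    for _ in range(iters):
        z -= lr * (gm + Hm @ z + 0.5 * M * np.linalg.norm(z) * z)
    return x + Q @ z
```

On a convex quadratic, a single such step with $$m \ll d$$ already produces a strict decrease in the objective, since the reduced model agrees with the restriction of the full cubic model to the subspace.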
-
Stochastic second-order methods are known to achieve fast local convergence in strongly convex optimization by relying on noisy Hessian estimates to precondition the gradient. Yet, most of these methods achieve superlinear convergence only when the stochastic Hessian noise diminishes, requiring an increase in the per-iteration cost as time progresses. Recent work in \cite{na2022hessian} addressed this issue via a Hessian averaging scheme that achieves a superlinear convergence rate without increasing the per-iteration cost. However, the considered method exhibits a slow global convergence rate, requiring up to $$\tilde{O}(\kappa^2)$$ iterations to reach the superlinear rate of $$\tilde{O}((1/t)^{t/2})$$, where $$\kappa$$ is the problem's condition number. In this paper, we propose a novel stochastic Newton proximal extragradient method that significantly improves these bounds, achieving a faster global linear rate and reaching the same fast superlinear rate in $$\tilde{O}(\kappa)$$ iterations. We achieve this by developing a novel extension of the Hybrid Proximal Extragradient (HPE) framework, which simultaneously achieves fast global and local convergence rates for strongly convex functions with access to a noisy Hessian oracle.
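The Hessian-averaging idea underlying this line of work is compact enough to sketch. The toy below is illustrative (names and constants are assumptions, and it shows only the averaging scheme itself, not the proximal-extragradient method): it preconditions exact gradients with a uniform running average of noisy Hessian samples, so the averaged noise shrinks over time while each iteration still draws only one sample.

```python
import numpy as np

rng = np.random.default_rng(3)

def averaged_newton(grad, noisy_hess, x0, steps=100):
    """Newton-type iteration preconditioned by a running (uniform) average
    of noisy Hessian samples: the averaged noise standard deviation decays
    like ~1/sqrt(t) while the per-iteration cost stays at one sample."""
    x = np.array(x0, dtype=float)
    H_bar = None
    for t in range(1, steps + 1):
        H_t = noisy_hess(x)                      # one noisy sample per step
        H_bar = H_t if H_bar is None else H_bar + (H_t - H_bar) / t
        x = x - np.linalg.solve(H_bar, grad(x))  # averaged Newton step
    return x
```

On a strongly convex quadratic with mean-zero symmetric Hessian noise, the iteration converges to the exact minimizer: the fixed point is unchanged by the preconditioner, and the contraction factor improves as the averaged noise shrinks.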
-
We consider the problem of finding stationary points in bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lower-level solutions $$y^*(x)$$ in response to changes in the upper-level variables $$x$$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lower-level solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $$y^*$$-aware, that returns an $$O(\epsilon)$$-estimate of the lower-level solution, in addition to first-order gradient estimators that are {\it locally unbiased} within the $$\Theta(\epsilon)$$-ball around $$y^*(x)$$. We study the complexity of finding stationary points with such a $$y^*$$-aware oracle: we propose a simple first-order method that converges to an $$\epsilon$$-stationary point using $$O(\epsilon^{-6})$$ and $$O(\epsilon^{-4})$$ accesses to first-order $$y^*$$-aware oracles. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by $$O(\epsilon)$$ with minimal assumptions. We then provide the matching lower bounds $$\Omega(\epsilon^{-6})$$ and $$\Omega(\epsilon^{-4})$$, without and with an additional smoothness assumption on $$y^*$$-aware oracles, respectively. Our results imply that any approach that simulates an algorithm with a $$y^*$$-aware oracle must suffer the same lower bounds.
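The bilevel setup can be made concrete on a toy quadratic instance. The sketch below is a standard implicit-differentiation baseline under strong assumptions, not the paper's $$y^*$$-aware oracle construction (all names and the specific quadratic forms are illustrative): inner gradient descent approximates the lower-level solution, and the hypergradient follows from the implicit function theorem.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy bilevel instance with a strongly convex lower level:
#   lower: g(x, y) = 0.5 * ||y - B x||^2   =>  y*(x) = B x
#   upper: F(x)    = 0.5 * ||y*(x) - t||^2
d = 3
B = np.eye(d) + 0.1 * rng.normal(size=(d, d))  # well-conditioned by design
t = rng.normal(size=d)

def inner_solve(x, steps=60, lr=0.5):
    """Approximate y*(x) by gradient descent on the lower-level objective."""
    y = np.zeros(d)
    for _ in range(steps):
        y -= lr * (y - B @ x)                  # grad_y g(x, y)
    return y

def hypergrad(x):
    """Hypergradient via implicit differentiation:
        grad F = grad_x f - (d2g/dxdy) (d2g/dy2)^{-1} grad_y f.
    Here grad_x f = 0, d2g/dy2 = I, and d2g/dxdy = -B', so the whole
    expression collapses to B'(y*(x) - t)."""
    y = inner_solve(x)
    return B.T @ (y - t)
```

Running plain gradient descent on $$x$$ with these hypergradients drives the upper-level residual $$\|Bx - t\|$$ to zero, which matches the closed-form minimizer $$x = B^{-1}t$$ of this toy instance.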