We propose a stochastic variance-reduced cubic-regularized Newton method (SVRC) for nonconvex optimization. At the core of our algorithm is a novel semi-stochastic gradient, along with a semi-stochastic Hessian, specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls, which outperforms state-of-the-art cubic regularization algorithms, including subsampled cubic regularization. Our work also sheds light on applying variance reduction techniques to high-order nonconvex optimization methods. Thorough experiments on various nonconvex optimization problems support our theory.
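As a concrete illustration of the kind of update the abstract describes, here is a minimal sketch of a cubic-regularized Newton step on a toy two-dimensional nonconvex function. This is not the paper's SVRC: SVRC would replace the exact gradient and Hessian below with its semi-stochastic, variance-reduced estimates, and the toy objective, the subproblem solver, and all names here are our own choices.

```python
import numpy as np

def f(x):
    # toy nonconvex objective with a saddle at the origin
    return x[0] ** 2 - x[1] ** 2 + 0.25 * x[1] ** 4

def grad(x):
    return np.array([2 * x[0], -2 * x[1] + x[1] ** 3])

def hess(x):
    return np.array([[2.0, 0.0], [0.0, -2.0 + 3 * x[1] ** 2]])

def cubic_step(g, H, M, iters=200, lr=0.05):
    # Approximately minimize the cubic model
    #   m(h) = g.h + 0.5 h'Hh + (M/6)||h||^3
    # by gradient descent; SVRC would plug in estimated g and H here.
    h = np.zeros_like(g)
    for _ in range(iters):
        h -= lr * (g + H @ h + 0.5 * M * np.linalg.norm(h) * h)
    return h

x = np.array([0.5, 0.1])  # start near the saddle
for _ in range(50):
    x = x + cubic_step(grad(x), hess(x), M=5.0)
# x now sits near the local minimum at (0, sqrt(2))
```

Because the cubic model captures negative curvature, the iterates move off the saddle, whereas plain Newton steps would be attracted to it.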
Escaping Saddle-Points Faster under Interpolation-like Conditions
In this paper, we show that under overparametrization several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. One of the fundamental aspects of overparametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an overparametrization setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local minimizer matches the corresponding deterministic rate of $\tilde{O}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions and show that its oracle complexity to reach an $\epsilon$-local minimizer is $\tilde{O}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{O}(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method. It seems that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
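As a rough illustration of the escape mechanism behind perturbed gradient methods (our toy sketch, not the paper's exact PSGD or its analysis), the snippet below adds a small Gaussian perturbation to every gradient step. Started exactly at a saddle point, where the gradient vanishes and unperturbed gradient descent would stall, the noise pushes the iterate onto the negative-curvature direction and the dynamics carry it to a local minimizer.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x[0] ** 2 - x[1] ** 2 + 0.25 * x[1] ** 4  # saddle at the origin

def grad(x):
    return np.array([2 * x[0], -2 * x[1] + x[1] ** 3])

x = np.zeros(2)          # start exactly at the saddle: grad(x) == 0
eta, r = 0.05, 0.01      # step size and perturbation scale
for _ in range(2000):
    noise = r * rng.standard_normal(2)   # isotropic perturbation
    x = x - eta * (grad(x) + noise)
# the iterate escapes to one of the local minima at (0, ±sqrt(2))
```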
 Award ID(s):
 1934568
 NSF-PAR ID:
 10282296
 Date Published:
 Journal Name:
 34th Conference on Neural Information Processing Systems (NeurIPS 2020)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Krylov Cubic Regularized Newton: A Subspace Second-Order Method with Dimension-Free Convergence Rate
Second-order optimization methods, such as cubic regularized Newton methods, are known for their rapid convergence rates; nevertheless, they become impractical in high-dimensional problems due to their substantial memory requirements and computational costs. One promising approach is to execute second-order updates within a lower-dimensional subspace, giving rise to \textit{subspace second-order} methods. However, the majority of existing subspace second-order methods randomly select subspaces, consequently resulting in slower convergence rates that depend on the problem's dimension $d$. In this paper, we introduce a novel subspace cubic regularized Newton method that achieves a dimension-independent global convergence rate of $O\left(\frac{1}{mk}+\frac{1}{k^2}\right)$ for solving convex optimization problems. Here, $m$ represents the subspace dimension, which can be significantly smaller than $d$. Instead of adopting a random subspace, our primary innovation involves performing the cubic regularized Newton update within the \emph{Krylov subspace} associated with the Hessian and the gradient of the objective function. This result marks the first instance of a dimension-independent convergence rate for a subspace second-order method. Furthermore, when specific spectral conditions of the Hessian are met, our method recovers the convergence rate of a full-dimensional cubic regularized Newton method. Numerical experiments show our method converges faster than existing random subspace methods, especially for high-dimensional problems.
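A minimal sketch of the core idea, under our own assumptions about the details: build an $m$-dimensional Krylov basis from the Hessian $H$ and gradient $g$, project the cubic model into that subspace, and solve the small $m$-dimensional subproblem. The basis construction (plain QR on the Krylov vectors) and the subproblem solver are illustrative choices, not the paper's implementation.

```python
import numpy as np

def krylov_basis(H, g, m):
    # orthonormal basis of span{g, Hg, ..., H^(m-1) g}
    K = np.empty((g.size, m))
    v = g.copy()
    for j in range(m):
        K[:, j] = v
        v = H @ v
    Q, _ = np.linalg.qr(K)
    return Q

def cubic_subspace_step(H, g, M, m):
    Q = krylov_basis(H, g, m)
    Hs, gs = Q.T @ H @ Q, Q.T @ g   # m x m projected cubic model
    z = np.zeros(m)
    for _ in range(1000):           # gradient descent on the small model
        z -= 0.02 * (gs + Hs @ z + 0.5 * M * np.linalg.norm(z) * z)
    return Q @ z                    # lift the step back to R^d

rng = np.random.default_rng(1)
d = 50
A = rng.standard_normal((d, d))
H = A @ A.T / d + np.eye(d)         # well-conditioned PSD Hessian
g = rng.standard_normal(d)
step = cubic_subspace_step(H, g, M=1.0, m=5)
```

Note that only Hessian-vector products are needed to form the basis, which is what makes such approaches attractive in high dimension: the $d \times d$ Hessian never has to be factored, and the cubic subproblem lives in dimension $m \ll d$.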

We consider the problem of finding stationary points in bilevel optimization when the lower-level problem is unconstrained and strongly convex. The problem has been extensively studied in recent years; the main technical challenge is to keep track of lower-level solutions $y^*(x)$ in response to changes in the upper-level variables $x$. Subsequently, all existing approaches tie their analyses to a genie algorithm that knows lower-level solutions and, therefore, need not query any points far from them. We consider a dual question to such approaches: suppose we have an oracle, which we call $y^*$-aware, that returns an $O(\epsilon)$-estimate of the lower-level solution, in addition to first-order gradient estimators locally unbiased within the $\Theta(\epsilon)$-ball around $y^*(x)$. We study the complexity of finding stationary points with such a $y^*$-aware oracle: we propose a simple first-order method that converges to an $\epsilon$-stationary point using $O(\epsilon^{-6})$ and $O(\epsilon^{-4})$ accesses to first-order $y^*$-aware oracles. Our upper bounds also apply to standard unbiased first-order oracles, improving the best-known complexity of first-order methods by $O(\epsilon)$ under minimal assumptions. We then provide the matching $\Omega(\epsilon^{-6})$ and $\Omega(\epsilon^{-4})$ lower bounds without and with an additional smoothness assumption on $y^*$-aware oracles, respectively. Our results imply that any approach that simulates an algorithm with a $y^*$-aware oracle must suffer the same lower bounds.
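To make the "keep track of $y^*(x)$" challenge concrete, here is a hypothetical toy instance of our own (not from the paper): the lower-level objective $g(x,y)=\tfrac12\|y-Ax\|^2$ is strongly convex in $y$ with closed-form solution $y^*(x)=Ax$, the upper level is $f(x,y)=\tfrac12\|y\|^2$, and the overall gradient works out to $\nabla F(x)=A^\top A x$, i.e. $A^\top y^*(x)$. The sketch warm-starts a few inner gradient steps to track $y^*(x)$ and plugs the estimate into that hypergradient formula.

```python
import numpy as np

rng = np.random.default_rng(2)
A = np.eye(3) + 0.3 * rng.standard_normal((3, 3))  # hypothetical problem data

def inner_gd(x, y, steps=20, lr=0.2):
    # track the lower-level solution y*(x) = A x by gradient steps on
    # g(x, y) = 0.5 ||y - A x||^2 (its gradient in y is y - A x)
    for _ in range(steps):
        y = y - lr * (y - A @ x)
    return y

x = rng.standard_normal(3)
y = np.zeros(3)
for _ in range(200):
    y = inner_gd(x, y)        # warm-started estimate of y*(x)
    x = x - 0.05 * (A.T @ y)  # hypergradient step using the estimate
```

Warm-starting keeps the tracking error geometrically small, which is the property genie-algorithm analyses formalize; a $y^*$-aware oracle abstracts away this inner loop by handing back an $O(\epsilon)$-accurate estimate directly.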


We propose a fast stochastic Hamilton Monte Carlo (HMC) method for sampling from a smooth and strongly log-concave distribution. At the core of our proposed method is a variance reduction technique inspired by recent advances in stochastic optimization. We show that, to achieve $\epsilon$ accuracy in 2-Wasserstein distance, our algorithm achieves $\tilde{O}\big(n+\kappa^{2}d^{1/2}/\epsilon+\kappa^{4/3}d^{1/3}n^{2/3}/\epsilon^{2/3}\big)$ gradient complexity (i.e., number of component gradient evaluations), which outperforms the state-of-the-art HMC and stochastic gradient HMC methods in a wide regime. We also extend our algorithm to sampling from smooth and general log-concave distributions and prove the corresponding gradient complexity as well. Experiments on both synthetic and real data demonstrate the superior performance of our algorithm.
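A minimal, illustrative sketch of variance-reduced stochastic-gradient HMC on a one-dimensional toy finite sum (our construction, not the paper's algorithm): the potential is $U(x)=\sum_i (x-a_i)^2/(2n)$, so the target is $\mathcal{N}(\bar a, 1)$. The SVRG-style estimator combines a random component difference with a full gradient at a snapshot point; for this quadratic toy the component differences are identical across $i$, so the estimator happens to have zero variance, the ideal case that variance reduction aims for.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100
a = rng.standard_normal(n) + 2.0    # component centers; target mean = a.mean()

def grad_u(i, x):
    return (x - a[i]) / n            # gradient of the i-th component potential

snapshot = 0.0
full_grad = snapshot - a.mean()      # exact gradient of U at the snapshot

def vr_grad(x):
    # semi-stochastic estimate: n * (component difference) + snapshot gradient
    i = rng.integers(n)
    return n * (grad_u(i, x) - grad_u(i, snapshot)) + full_grad

samples, x = [], 0.0
for _ in range(3000):
    p = rng.standard_normal()        # refresh momentum for each trajectory
    for _ in range(10):              # leapfrog integration, step size 0.1
        p -= 0.05 * vr_grad(x)
        x += 0.1 * p
        p -= 0.05 * vr_grad(x)
    samples.append(x)
samples = np.array(samples)
```

After burn-in the samples should match the target's mean and unit variance up to leapfrog discretization error; the paper's method additionally refreshes snapshots and handles general strongly log-concave targets.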