Escaping Saddle-Point Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization, several standard stochastic optimization algorithms escape saddle points and converge to local minimizers much faster. A fundamental aspect of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an over-parametrized setting, the first-order oracle complexity of the Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an $\epsilon$-local-minimizer matches the corresponding deterministic rate of $\tilde{O}(1/\epsilon^{2})$. We next analyze the Stochastic Cubic-Regularized Newton (SCRN) algorithm and show that its oracle complexity to reach an $\epsilon$-local-minimizer under interpolation-like conditions is $\tilde{O}(1/\epsilon^{2.5})$. While this complexity is better than the corresponding complexity of either PSGD or SCRN without interpolation-like assumptions, it does not match the rate of $\tilde{O}(1/\epsilon^{1.5})$ of the deterministic Cubic-Regularized Newton method. It seems that further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.
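As a rough illustration of the perturbation idea behind PSGD, the following is a minimal sketch and not the algorithm analyzed in the paper. The test function, the multiplicative noise model (noise that vanishes where the true gradient vanishes, used here as an assumed stand-in for an interpolation-like condition), and all names and parameter values (`perturbed_sgd`, `rho`, `radius`, `grad_tol`) are assumptions made for this example.

```python
import numpy as np


def grad(w):
    """Gradient of the toy function f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2.

    f has a strict saddle at (0, 0) and local minimizers at (0, 1) and (0, -1).
    """
    x, y = w
    return np.array([x, y**3 - y])


def stochastic_grad(w, rng, rho=0.5):
    """Noisy gradient whose noise scales with the true gradient (assumed model).

    The noise vanishes wherever the true gradient is zero, which loosely mimics
    an interpolation-like setting.
    """
    return grad(w) * (1.0 + rho * rng.standard_normal())


def perturbed_sgd(w0, step=0.1, radius=1e-2, grad_tol=1e-3, n_iters=2000, seed=0):
    """Simplified perturbed SGD: take SGD steps, and add a small random
    perturbation whenever the observed gradient is nearly zero."""
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(n_iters):
        g = stochastic_grad(w, rng)
        if np.linalg.norm(g) <= grad_tol:
            # Near a stationary point: jump to a random point on a small sphere.
            xi = rng.standard_normal(w.shape)
            w = w + radius * xi / np.linalg.norm(xi)
        else:
            w = w - step * g
    return w


if __name__ == "__main__":
    # Start exactly at the saddle point, where plain SGD would never move.
    print(perturbed_sgd([0.0, 0.0]))
```

Under this noise model, the stochastic gradient at the saddle $(0, 0)$ is exactly zero, so unperturbed SGD started there stays put; the random perturbation is what lets the iterate leave the saddle and drift toward one of the two minimizers.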
NSF-PAR ID:
10282296
Journal Name:
34th Conference on Neural Information Processing Systems (NeurIPS 2020)
1. We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for the cubic regularization method. We show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction techniques to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory.
3. We consider the problem of computing the partition function $\sum_x e^{f(x)}$, where $f: \{-1, 1\}^n \longrightarrow {\mathbb R}$ is a quadratic or cubic polynomial on the Boolean cube $\{-1, 1\}^n$. In the case of a quadratic polynomial $f$, we show that the partition function can be approximated within relative error $0 < \epsilon < 1$ in quasi-polynomial $n^{O(\ln n - \ln \epsilon)}$ time if the Lipschitz constant of the non-linear part of $f$ with respect to the $\ell^1$ metric on the Boolean cube does not exceed $1-\delta$, for any $\delta>0$, …