skip to main content


Title: Escaping Saddle-Point Faster under Interpolation-like Conditions
In this paper, we show that under over-parametrization several standard stochastic optimization algorithms escape saddle-points and converge to local-minimizers much faster. One of the fundamental aspects of over-parametrized models is that they are capable of interpolating the training data. We show that, under interpolation-like assumptions satisfied by the stochastic gradients in an overparametrization setting, the first-order oracle complexity of Perturbed Stochastic Gradient Descent (PSGD) algorithm to reach an \epsilon-local-minimizer, matches the corresponding deterministic rate of ˜O(1/\epsilon^2). We next analyze Stochastic Cubic-Regularized Newton (SCRN) algorithm under interpolation-like conditions, and show that the oracle complexity to reach an \epsilon-local-minimizer under interpolation-like conditions, is ˜O(1/\epsilon^2.5). While this obtained complexity is better than the corresponding complexity of either PSGD, or SCRN without interpolation-like assumptions, it does not match the rate of ˜O(1/\epsilon^1.5) corresponding to deterministic Cubic-Regularized Newton method. It seems further Hessian-based interpolation-like assumptions are necessary to bridge this gap. We also discuss the corresponding improved complexities in the zeroth-order settings.  more » « less
Award ID(s):
1934568
NSF-PAR ID:
10282296
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
34th Conference on Neural Information Processing Systems (NeurIPS 2020)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a stochastic variance-reduced cubic regularized Newton method (SVRC) for non-convex optimization. At the core of our algorithm is a novel semi-stochastic gradient along with a semi-stochastic Hessian, which are specifically designed for cubic regularization method. We show that our algorithm is guaranteed to converge to an $(\epsilon,\sqrt{\epsilon})$-approximate local minimum within $\tilde{O}(n^{4/5}/\epsilon^{3/2})$ second-order oracle calls, which outperforms the state-of-the-art cubic regularization algorithms including subsampled cubic regularization. Our work also sheds light on the application of variance reduction technique to high-order non-convex optimization methods. Thorough experiments on various non-convex optimization problems support our theory. 
    more » « less
  2. null (Ed.)
    We investigate the approximability of the following optimization problem. The input is an n× n matrix A=(Aij) with real entries and an origin-symmetric convex body K⊂ ℝn that is given by a membership oracle. The task is to compute (or approximate) the maximum of the quadratic form ∑i=1n∑j=1n Aij xixj=⟨ x,Ax⟩ as x ranges over K. This is a rich and expressive family of optimization problems; for different choices of matrices A and convex bodies K it includes a diverse range of optimization problems like max-cut, Grothendieck/non-commutative Grothendieck inequalities, small set expansion and more. While the literature studied these special cases using case-specific reasoning, here we develop a general methodology for treatment of the approximability and inapproximability aspects of these questions. The underlying geometry of K plays a critical role; we show under commonly used complexity assumptions that polytime constant-approximability necessitates that K has type-2 constant that grows slowly with n. However, we show that even when the type-2 constant is bounded, this problem sometimes exhibits strong hardness of approximation. Thus, even within the realm of type-2 bodies, the approximability landscape is nuanced and subtle. However, the link that we establish between optimization and geometry of Banach spaces allows us to devise a generic algorithmic approach to the above problem. We associate to each convex body a new (higher dimensional) auxiliary set that is not convex, but is approximately convex when K has a bounded type-2 constant. If our auxiliary set has an approximate separation oracle, then we design an approximation algorithm for the original quadratic optimization problem, using an approximate version of the ellipsoid method. Even though our hardness result implies that such an oracle does not exist in general, this new question can be solved in specific cases of interest by implementing a range of classical tools from functional analysis, most notably the deep factorization theory of linear operators. Beyond encompassing the scenarios in the literature for which constant-factor approximation algorithms were found, our generic framework implies that that for convex sets with bounded type-2 constant, constant factor approximability is preserved under the following basic operations: (a) Subspaces, (b) Quotients, (c) Minkowski Sums, (d) Complex Interpolation. This yields a rich family of new examples where constant factor approximations are possible, which were beyond the reach of previous methods. We also show (under commonly used complexity assumptions) that for symmetric norms and unitarily invariant matrix norms the type-2 constant nearly characterizes the approximability of quadratic maximization. 
    more » « less
  3. Abstract

    We study the distribution over measurement outcomes of noisy random quantum circuits in the regime of low fidelity, which corresponds to the setting where the computation experiences at least one gate-level error with probability close to one. We model noise by adding a pair of weak, unital, single-qubit noise channels after each two-qubit gate, and we show that for typical random circuit instances, correlations between the noisy output distribution$$p_{\text {noisy}}$$pnoisyand the corresponding noiseless output distribution$$p_{\text {ideal}}$$pidealshrink exponentially with the expected number of gate-level errors. Specifically, the linear cross-entropy benchmarkFthat measures this correlation behaves as$$F=\text {exp}(-2s\epsilon \pm O(s\epsilon ^2))$$F=exp(-2sϵ±O(sϵ2)), where$$\epsilon $$ϵis the probability of error per circuit location andsis the number of two-qubit gates. Furthermore, if the noise is incoherent—for example, depolarizing or dephasing noise—the total variation distance between the noisy output distribution$$p_{\text {noisy}}$$pnoisyand the uniform distribution$$p_{\text {unif}}$$punifdecays at precisely the same rate. Consequently, the noisy output distribution can be approximated as$$p_{\text {noisy}}\approx Fp_{\text {ideal}}+ (1-F)p_{\text {unif}}$$pnoisyFpideal+(1-F)punif. In other words, although at least one local error occurs with probability$$1-F$$1-F, the errors are scrambled by the random quantum circuit and can be treated as global white noise, contributing completely uniform output. Importantly, we upper bound the average total variation error in this approximation by$$O(F\epsilon \sqrt{s})$$O(Fϵs). Thus, the “white-noise approximation” is meaningful when$$\epsilon \sqrt{s} \ll 1$$ϵs1, a quadratically weaker condition than the$$\epsilon s\ll 1$$ϵs1requirement to maintain high fidelity. The bound applies if the circuit size satisfies$$s \ge \Omega (n\log (n))$$sΩ(nlog(n)), which corresponds to onlylogarithmic depthcircuits, and if, additionally, the inverse error rate satisfies$$\epsilon ^{-1} \ge {\tilde{\Omega }}(n)$$ϵ-1Ω~(n), which is needed to ensure errors are scrambled faster thanFdecays. The white-noise approximation is useful for salvaging the signal from a noisy quantum computation; for example, it was an underlying assumption in complexity-theoretic arguments that noisy random quantum circuits cannot be efficiently sampled classically, even when the fidelity is low. Our method is based on a map from second-moment quantities in random quantum circuits to expectation values of certain stochastic processes for which we compute upper and lower bounds.

     
    more » « less
  4. Abstract We consider the problem of computing the partition function $\sum _x e^{f(x)}$ , where $f: \{-1, 1\}^n \longrightarrow {\mathbb R}$ is a quadratic or cubic polynomial on the Boolean cube $\{-1, 1\}^n$ . In the case of a quadratic polynomial f , we show that the partition function can be approximated within relative error $0 < \epsilon < 1$ in quasi-polynomial $n^{O(\ln n - \ln \epsilon )}$ time if the Lipschitz constant of the non-linear part of f with respect to the $\ell ^1$ metric on the Boolean cube does not exceed $1-\delta $ , for any $\delta>0$ , fixed in advance. For a cubic polynomial f , we get the same result under a somewhat stronger condition. We apply the method of polynomial interpolation, for which we prove that $\sum _x e^{\tilde {f}(x)} \ne 0$ for complex-valued polynomials $\tilde {f}$ in a neighborhood of a real-valued f satisfying the above mentioned conditions. The bounds are asymptotically optimal. Results on the zero-free region are interpreted as the absence of a phase transition in the Lee–Yang sense in the corresponding Ising model. The novel feature of the bounds is that they control the total interaction of each vertex but not every single interaction of sets of vertices. 
    more » « less
  5. In order to meet the requirements for safety and latency in many IoT applications, intelligent decisions must be made right here right now at the network edge, calling for edge intelligence. To facilitate fast edge learning, this work advocates a platform-aided federated meta-learning architecture, where a set of edge nodes joint force to learn a meta-model (i.e., model initialization for adaptation in a new learning task) by exploiting the similarity among edge nodes as well as the cloud knowledge transfer. The federated meta-learning problem is cast as a regularized stochastic optimization problem, using Bregman Divergence between the edge model and the cloud pre-trained model as the regularization. We then devise an alternating direction method of multiplier (ADMM) based Hessian-free federated meta-learning algorithm, called ADMM-FedMeta, with inexact Hessian estimation. Further, we analyze the convergence properties and the rapid adaptation performance of ADMM-FedMeta for the general non-convex case. The theoretical results show that under mild conditions, ADMM-FedMeta converges to an $\epsilon$-approximate first-order stationary point after at most $\mathcal{O}(1/\epsilon^2)$ communication rounds. Extensive experimental studies on benchmark datasets demonstrate the effectiveness and efficiency of ADMM-FedMeta, and showcase that ADMM-FedMeta outperforms the existing baselines. 
    more » « less