

Title: The Complexity of Making the Gradient Small in Stochastic Convex Optimization
We give nearly matching upper and lower bounds on the oracle complexity of finding ϵ-stationary points (∥∇F(x)∥ ≤ ϵ) in stochastic convex optimization. We jointly analyze the oracle complexity in both the local stochastic oracle model and the global oracle (or, statistical learning) model. This allows us to decompose the complexity of finding near-stationary points into optimization complexity and sample complexity, and reveals some surprising differences between the complexity of stochastic optimization versus learning. Notably, we show that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model. In other words, the separation in complexity between the two models can be exponential, and the folklore understanding that smoothness is required to find stationary points is only weakly true for statistical learning. Our upper bounds are based on extensions of a recent "recursive regularization" technique proposed by Allen-Zhu (2018). We show how to extend the technique to achieve near-optimal rates, and in particular show how to leverage the extra information available in the global oracle model. Our algorithm for the global model can be implemented efficiently through finite sum methods, and suggests an interesting new computational-statistical tradeoff.
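To make the "recursive regularization" template concrete, here is a minimal Python sketch (illustrative only: `grad_oracle`, the round count, the step size, and the halving schedule are placeholder assumptions, not the paper's tuned parameters). Each phase approximately minimizes a proximally regularized copy of the objective centered at the previous phase's solution, then halves the regularization weight.

```python
import numpy as np

def recursive_regularization(grad_oracle, x0, rounds=10, lam0=1.0,
                             inner_steps=1000, lr=0.01):
    """Hypothetical sketch of the recursive-regularization template:
    solve a sequence of subproblems F(x) + lam * ||x - center||^2,
    each centered at the previous round's approximate solution,
    halving the regularization weight lam between rounds."""
    x, center, lam = x0.copy(), x0.copy(), lam0
    for _ in range(rounds):
        # Approximately minimize the regularized subproblem with SGD.
        for _ in range(inner_steps):
            g = grad_oracle(x) + 2.0 * lam * (x - center)
            x -= lr * g
        center = x.copy()   # re-center the next subproblem here
        lam /= 2.0          # ...and halve the regularization
    return x
```

Intuitively, each regularizer is strong enough to keep successive centers close together, yet the final one is small enough that a near-stationary point of the last subproblem is also near-stationary for the original objective.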
Award ID(s):
1718970
NSF-PAR ID:
10089724
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
99
ISSN:
2640-3498
Page Range / eLocation ID:
1319-1345
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present new algorithms for optimizing non-smooth, non-convex stochastic objectives based on a novel analysis technique. This improves the current best-known complexity for finding a (δ,ϵ)-stationary point from O(ϵ^(-4) δ^(-1)) stochastic gradient queries to O(ϵ^(-3) δ^(-1)), which we also show to be optimal. Our primary technique is a reduction from non-smooth non-convex optimization to online learning, after which our results follow from standard regret bounds in online learning. For deterministic and second-order smooth objectives, applying more advanced optimistic online learning techniques enables a new complexity of O(ϵ^(-1.5) δ^(-0.5)). Our techniques also recover all optimal or best-known results for finding ϵ-stationary points of smooth or second-order smooth objectives in both stochastic and deterministic settings.
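A minimal Python sketch of the online-to-non-convex reduction described above (the learner, step size, ball radius, and the uniform sampling of the evaluation point are assumptions for illustration): the online learner, here projected online gradient descent over a small ball, proposes each increment Δ and is charged the linear loss ⟨g, Δ⟩ built from a stochastic gradient taken at a random point along the step just made.

```python
import numpy as np

def online_to_nonconvex(grad_oracle, x0, T=1000, radius=0.01, lr=0.001):
    """Hypothetical sketch of the online-to-non-convex reduction:
    an online linear learner picks each increment delta; its regret
    against linear losses <g_t, delta> controls the gradient norms
    encountered along the trajectory."""
    x = x0.copy()
    delta = np.zeros_like(x0)
    for _ in range(T):
        s = np.random.rand()                # random point on the segment
        g = grad_oracle(x + s * delta)      # stochastic gradient there
        x = x + delta                       # take the step
        delta = delta - lr * g              # OGD update on the linear loss
        norm = np.linalg.norm(delta)
        if norm > radius:                   # project back onto the ball
            delta *= radius / norm
    return x
```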
  2. Guruswami, Venkatesan (Ed.)
    A fundamental shortcoming of the concept of Nash equilibrium is its computational intractability: approximating Nash equilibria in normal-form games is PPAD-hard. In this paper, inspired by the ideas of smoothed analysis, we introduce a relaxed variant of Nash equilibrium called σ-smooth Nash equilibrium, for a smoothness parameter σ. In a σ-smooth Nash equilibrium, players only need to achieve utility at least as high as their best deviation to a σ-smooth strategy, which is a distribution that does not put too much mass (as parametrized by σ) on any fixed action. We distinguish two variants of σ-smooth Nash equilibria: strong σ-smooth Nash equilibria, in which players are required to play σ-smooth strategies under equilibrium play, and weak σ-smooth Nash equilibria, where there is no such requirement. We show that both weak and strong σ-smooth Nash equilibria have superior computational properties to Nash equilibria: when σ as well as an approximation parameter ϵ and the number of players are all constants, there is a constant-time randomized algorithm to find a weak ϵ-approximate σ-smooth Nash equilibrium in normal-form games. In the same parameter regime, there is a polynomial-time deterministic algorithm to find a strong ϵ-approximate σ-smooth Nash equilibrium in a normal-form game. These results stand in contrast to the optimal algorithm for computing ϵ-approximate Nash equilibria, which cannot run faster than quasipolynomial time, subject to complexity-theoretic assumptions. We complement our upper bounds by showing that when either σ or ϵ is an inverse polynomial, finding a weak ϵ-approximate σ-smooth Nash equilibrium becomes computationally intractable. Our results are the first to propose a variant of Nash equilibrium which is computationally tractable, allows players to act independently, and which, as we discuss, is justified by an extensive line of work on individual choice behavior in the economics literature.
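Under the common parametrization in which a σ-smooth strategy over n actions puts mass at most 1/(σn) on any single action (an assumption of this sketch, since the abstract leaves the definition informal), the best σ-smooth deviation can be computed greedily, which is part of what makes smooth equilibria algorithmically friendly. A hypothetical Python sketch:

```python
import numpy as np

def best_smooth_response_value(utilities, sigma):
    """Hypothetical sketch: value of the best sigma-smooth deviation,
    assuming a sigma-smooth strategy puts mass at most 1/(sigma*n) on
    any single action.  The optimum piles the per-action mass cap onto
    the highest-utility actions greedily until the budget is spent."""
    u = np.sort(np.asarray(utilities, dtype=float))[::-1]  # best first
    n = len(u)
    cap = 1.0 / (sigma * n)          # per-action mass ceiling
    value, mass_left = 0.0, 1.0
    for ui in u:
        m = min(cap, mass_left)
        value += m * ui
        mass_left -= m
        if mass_left <= 0:
            break
    return value

# A player has a profitable sigma-smooth deviation iff this value
# exceeds their current utility by more than epsilon.
```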
    We consider stochastic zeroth-order optimization over Riemannian submanifolds embedded in Euclidean space, where the task is to solve Riemannian optimization problems with only noisy objective function evaluations. Toward this, our main contribution is to propose estimators of the Riemannian gradient and Hessian from noisy objective function evaluations, based on a Riemannian version of the Gaussian smoothing technique. The proposed estimators overcome the difficulty of nonlinearity of the manifold constraint and issues that arise in using Euclidean Gaussian smoothing techniques when the function is defined only over the manifold. We use the proposed estimators to solve Riemannian optimization problems in the following settings for the objective function: (i) stochastic and gradient-Lipschitz (in both nonconvex and geodesically convex settings), (ii) sum of gradient-Lipschitz and nonsmooth functions, and (iii) Hessian-Lipschitz. For these settings, we analyze the oracle complexity of our algorithms to obtain appropriately defined notions of ϵ-stationary point or ϵ-approximate local minimizer. Notably, our complexities are independent of the dimension of the ambient Euclidean space and depend only on the intrinsic dimension of the manifold under consideration. We demonstrate the applicability of our algorithms by simulation results and real-world applications on black-box stiffness control for robotics and black-box attacks on neural networks.
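A minimal Python sketch of the kind of zeroth-order Riemannian gradient estimator the abstract describes, specialized to the unit sphere for concreteness (the manifold choice, batch size, and smoothing radius μ are illustrative assumptions): sample an ambient Gaussian, project it onto the tangent space at x, and form a finite-difference quotient along the retraction, using only function evaluations.

```python
import numpy as np

def riemannian_zo_grad(f, x, mu=1e-3, batch=20):
    """Hypothetical sketch of a Riemannian Gaussian-smoothing gradient
    estimator on the unit sphere (x is assumed to have unit norm):
    project ambient Gaussian directions onto the tangent space at x,
    then difference f along the retraction."""
    d = x.shape[0]
    retract = lambda y: y / np.linalg.norm(y)   # sphere retraction
    g = np.zeros(d)
    fx = f(x)
    for _ in range(batch):
        u = np.random.randn(d)
        u -= np.dot(u, x) * x                   # tangent-space projection
        g += (f(retract(x + mu * u)) - fx) / mu * u
    return g / batch
```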
  4. We study the complexity of optimizing nonsmooth nonconvex Lipschitz functions by producing (δ,ϵ)-Goldstein stationary points. Several recent works have presented randomized algorithms that produce such points using Õ(δ^(-1) ϵ^(-3)) first-order oracle calls, independent of the dimension d. It has been an open problem as to whether a similar result can be obtained via a deterministic algorithm. We resolve this open problem, showing that randomization is necessary to obtain a dimension-free rate. In particular, we prove a lower bound of Ω(d) for any deterministic algorithm. Moreover, we show that unlike smooth or convex optimization, access to function values is required for any deterministic algorithm to halt within any finite time horizon. On the other hand, we prove that if the function is even slightly smooth, then the dimension-free rate of Õ(δ^(-1) ϵ^(-3)) can be obtained by a deterministic algorithm with merely a logarithmic dependence on the smoothness parameter. Motivated by these findings, we turn to study the complexity of deterministically smoothing Lipschitz functions. Though there are well-known efficient black-box randomized smoothings, we start by showing that no such deterministic procedure can smooth functions in a meaningful manner (suitably defined), resolving an open question in the literature. We then bypass this impossibility result for the structured case of ReLU neural networks. To that end, in a practical "white-box" setting in which the optimizer is granted access to the network's architecture, we propose a simple, dimension-free, deterministic smoothing of ReLU networks that provably preserves (δ,ϵ)-Goldstein stationary points. Our method applies to a variety of architectures of arbitrary depth, including ResNets and ConvNets. Combined with our algorithm for slightly-smooth functions, this yields the first deterministic, dimension-free algorithm for optimizing ReLU networks, circumventing our lower bound.
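For contrast with the impossibility result for deterministic smoothing, here is a minimal Python sketch of the standard black-box randomized smoothing the abstract alludes to (the oracle name, smoothing radius, and batch size are illustrative assumptions): f_δ(x) = E[f(x + δu)] for u uniform in the unit ball, whose gradient is an average of gradients over the δ-ball, so a small estimated gradient certifies approximate (δ,ϵ)-Goldstein stationarity.

```python
import numpy as np

def smoothed_grad(grad_oracle, x, delta=0.1, batch=20):
    """Hypothetical sketch of black-box *randomized* smoothing:
    averaging gradients at points sampled uniformly from the
    delta-ball around x gives an unbiased estimate of the gradient
    of the uniformly smoothed function f_delta."""
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(batch):
        u = np.random.randn(d)
        u /= np.linalg.norm(u)             # uniform random direction
        r = np.random.rand() ** (1.0 / d)  # radius for a uniform ball sample
        g += grad_oracle(x + delta * r * u)
    return g / batch
```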
  5.
    We consider the communication complexity of a number of distributed optimization problems. We start with the problem of solving a linear system. Suppose there is a coordinator together with s servers P_1, …, P_s, the i-th of which holds a subset A^(i) x = b^(i) of n_i constraints of a linear system in d variables, and the coordinator would like to output an x ∈ ℝ^d for which A^(i) x = b^(i) for i = 1, …, s. We assume each coefficient of each constraint is specified using L bits. We first resolve the randomized and deterministic communication complexity in the point-to-point model of communication, showing it is Θ̃(d^2 L + sd) and Θ̃(sd^2 L), respectively. We obtain similar results for the blackboard communication model. As a result of independent interest, we show the probability a random matrix with integer entries in {−2^L, …, 2^L} is invertible is 1 − 2^(−Θ(dL)), whereas previously only 1 − 2^(−Θ(d)) was known. When there is no solution to the linear system, a natural alternative is to find the solution minimizing the ℓp loss, which is the ℓp regression problem. While this problem has been studied, we give improved upper or lower bounds for every value of p ≥ 1. One takeaway message is that sampling and sketching techniques, which are commonly used in earlier work on distributed optimization, are neither optimal in the dependence on d nor on the dependence on the approximation ε, thus motivating new techniques from optimization to solve these problems. Towards this end, we consider the communication complexity of optimization tasks which generalize linear systems, such as linear, semi-definite, and convex programming. For linear programming, we first resolve the communication complexity when d is constant, showing it is Θ̃(sL) in the point-to-point model. For general d and in the point-to-point model, we show an Õ(sd^3 L) upper bound and an Ω(d^2 L + sd) lower bound. In fact, we show if one perturbs the coefficients randomly by numbers as small as 2^(−Θ(L)), then the upper bound is Õ(sd^2 L) + poly(dL), and so this bound holds for almost all linear programs. Our study motivates understanding the bit complexity of linear programming, which is related to the running time in the unit cost RAM model with words of O(log(nd)) bits, and we give the fastest known algorithms for linear programming in this model. (https://epubs.siam.org/doi/10.1137/1.9781611975994.106)
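As a point of reference for the "sampling and sketching techniques" the abstract argues are suboptimal in d and ε, here is a hypothetical Python sketch of the sketch-and-solve baseline for distributed ℓ2 regression (the sketch dimension k and the Gaussian sketches are illustrative choices, not the paper's protocol): each server ships a compressed k × d summary of its block to the coordinator, which solves the small stacked system.

```python
import numpy as np

def sketch_and_solve(blocks, k=200, seed=0):
    """Hypothetical sketch-and-solve baseline: each server i compresses
    its block (A_i, b_i) with a k x n_i Gaussian sketch and sends only
    the k x d summary; the coordinator solves the stacked small system."""
    rng = np.random.default_rng(seed)
    SA_all, Sb_all = None, None
    for A_i, b_i in blocks:                   # one (A_i, b_i) per server
        S = rng.standard_normal((k, A_i.shape[0])) / np.sqrt(k)
        SA, Sb = S @ A_i, S @ b_i             # k x d and k-dim messages
        SA_all = SA if SA_all is None else np.vstack([SA_all, SA])
        Sb_all = Sb if Sb_all is None else np.concatenate([Sb_all, Sb])
    # Coordinator: least squares on the stacked sketched system.
    x, *_ = np.linalg.lstsq(SA_all, Sb_all, rcond=None)
    return x
```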