

Title: On the Local Minima of the Empirical Risk
Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex and nonsmooth losses (such as modern deep networks), the population risk is generally significantly more well-behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework that aims to optimize a smooth nonconvex function F (population risk) given only access to an approximation f (empirical risk) that is pointwise close to F (i.e., ||F − f||_∞ ≤ ν). Our objective is to find the ε-approximate local minima of the underlying function F while avoiding the shallow local minima (arising because of the tolerance ν) which exist only in f. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f that is guaranteed to achieve our goal as long as ν ≤ O(ε^{1.5}/d). We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance ν among all algorithms making a polynomial number of queries to f. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit.
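Below is a minimal, hypothetical sketch of the core idea in the abstract above: run SGD on a Gaussian-smoothed version of the accessible function f, using a zeroth-order (query-only) gradient estimator so that shallow wiggles in f are averaged away. It is not the paper's exact algorithm; the function names, step sizes, smoothing radius, and the toy choices of f and F are illustrative assumptions.

```python
import numpy as np

def smoothed_sgd(f, x0, sigma=0.1, lr=0.01, n_iters=2000, batch=16, seed=0):
    """Hypothetical sketch: SGD on the Gaussian smoothing
    f_sigma(x) = E_{z ~ N(0, sigma^2 I)}[f(x + z)], using the zeroth-order
    estimator g = mean_i z_i (f(x + z_i) - f(x)) / sigma^2, which is an
    unbiased estimate of grad f_sigma(x)."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_iters):
        z = rng.normal(scale=sigma, size=(batch, x.size))
        fx = f(x)
        fz = np.apply_along_axis(f, 1, x + z)        # query-only access to f
        g = (z * (fz - fx)[:, None]).mean(axis=0) / sigma**2
        x -= lr * g
    return x

if __name__ == "__main__":
    # Toy setting (illustrative only): a smooth "population risk" F plus
    # small high-frequency wiggles, standing in for an "empirical risk" f
    # that is pointwise close to F but has many shallow local minima.
    F = lambda x: np.sum(x ** 2)
    f = lambda x: F(x) + 0.01 * np.sum(np.sin(50 * x))
    x_hat = smoothed_sgd(f, x0=np.ones(5))
    print("final point:", x_hat)
    print("F(x_hat):", F(x_hat))
```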
Award ID(s): 1704656
PAR ID: 10090514
Journal Name: Advances in Neural Information Processing Systems
ISSN: 1049-5258
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Tucker decomposition is a popular technique for many data analysis and machine learning applications. Finding a Tucker decomposition is a nonconvex optimization problem. As the scale of the problems increases, local search algorithms such as stochastic gradient descent have become popular in practice. In this paper, we characterize the optimization landscape of the Tucker decomposition problem. In particular, we show that if the tensor has an exact Tucker decomposition, for a standard nonconvex objective of Tucker decomposition, all local minima are also globally optimal. We also give a local search algorithm that can find an approximate local (and global) optimal solution in polynomial time.
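As a rough companion to entry 1, the sketch below runs one standard local-search procedure (higher-order orthogonal iteration, i.e. alternating updates) on the nonconvex Tucker least-squares objective. This is a common baseline, not the paper's algorithm or its SGD analysis; the tensor sizes, ranks, and helper names are illustrative assumptions.

```python
import numpy as np

def unfold(T, mode):
    """Matricize T so that `mode` indexes the rows."""
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def project(T, factors, skip=None):
    """Apply U_m^T along every mode m (optionally skipping one mode)."""
    for m, U in enumerate(factors):
        if m == skip:
            continue
        T = np.moveaxis(np.tensordot(T, U, axes=(m, 0)), -1, m)
    return T

def tucker_hooi(T, ranks, n_iters=50):
    """Alternating local search (HOOI) for the nonconvex Tucker
    least-squares objective || T - core x1 U1 x2 U2 x3 U3 ||_F^2."""
    # Initialize each factor from a truncated SVD of the unfolding (HOSVD).
    factors = []
    for mode, r in enumerate(ranks):
        u, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(u[:, :r])
    for _ in range(n_iters):
        for mode, r in enumerate(ranks):
            # Fix the other factors, then refit this one via a truncated SVD.
            Y = project(T, factors, skip=mode)
            u, _, _ = np.linalg.svd(unfold(Y, mode), full_matrices=False)
            factors[mode] = u[:, :r]
    core = project(T, factors)
    return core, factors

if __name__ == "__main__":
    # Hypothetical toy: a 6x6x6 tensor with an exact rank-(2, 2, 2) Tucker decomposition.
    rng = np.random.default_rng(1)
    G = rng.normal(size=(2, 2, 2))
    A, B, C = (rng.normal(size=(6, 2)) for _ in range(3))
    T = np.einsum('abc,ia,jb,kc->ijk', G, A, B, C)
    core, (U1, U2, U3) = tucker_hooi(T, ranks=(2, 2, 2))
    recon = np.einsum('abc,ia,jb,kc->ijk', core, U1, U2, U3)
    print("relative error:", np.linalg.norm(recon - T) / np.linalg.norm(T))
```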
  2. Estimating the normalizing constant of an unnormalized probability distribution has important applications in computer science, statistical physics, machine learning, and statistics. In this work, we consider the problem of estimating the normalizing constant to within a multiplicative factor of 1 ± ε for a μ-strongly convex and L-smooth function f, given query access to f(x) and ∇f(x). We give both algorithms and lower bounds for this problem. Using an annealing algorithm combined with a multilevel Monte Carlo method based on underdamped Langevin dynamics, we show that O(d^{4/3}/ε^2) queries to ∇f are sufficient. Moreover, we provide an information-theoretic lower bound showing that at least d^{1-o(1)}/ε^{2-o(1)} queries are necessary. This provides the first nontrivial lower bound for the problem.
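For entry 2, the following toy sketch illustrates only the estimation problem itself: it estimates the normalizing constant of exp(-f) for a strongly convex f by plain importance sampling with a wide Gaussian proposal. It is not the paper's annealing / multilevel Monte Carlo / Langevin algorithm, and the proposal width and sample count are arbitrary assumptions that only behave well in low dimension.

```python
import numpy as np

def normalizing_constant_is(f, d, proposal_std=2.0, n_samples=100_000, seed=0):
    """Toy estimator for the normalizing constant Z (the integral of
    exp(-f(x)) over R^d) via importance sampling with a wide Gaussian
    proposal q = N(0, proposal_std^2 I): Z = E_{x ~ q}[exp(-f(x)) / q(x)]."""
    rng = np.random.default_rng(seed)
    x = rng.normal(scale=proposal_std, size=(n_samples, d))
    log_q = (-0.5 * np.sum((x / proposal_std) ** 2, axis=1)
             - d * np.log(proposal_std * np.sqrt(2.0 * np.pi)))
    log_w = -np.apply_along_axis(f, 1, x) - log_q    # log importance weights
    return np.exp(log_w).mean()

if __name__ == "__main__":
    # f(x) = ||x||^2 / 2 is 1-strongly convex and 1-smooth;
    # its true normalizing constant is (2*pi)^{d/2}.
    d = 3
    f = lambda x: 0.5 * np.dot(x, x)
    print("estimate:", normalizing_constant_is(f, d))
    print("truth:   ", (2.0 * np.pi) ** (d / 2))
```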
  3. The densest subgraph problem in a graph (DSG), in the simplest form, is the following. Given an undirected graph $G=(V,E)$, find a subset $S \subseteq V$ of vertices that maximizes the ratio $|E(S)|/|S|$, where $E(S)$ is the set of edges with both endpoints in $S$. DSG and several of its variants are well-studied in theory and practice and have many applications in data mining and network analysis. In this paper we study fast algorithms and structural aspects of DSG via the lens of \emph{supermodularity}. For this we consider the densest supermodular subset problem (DSSP): given a non-negative supermodular function $f: 2^V \rightarrow \mathbb{R}_+$, maximize $f(S)/|S|$. For DSG we describe a simple flow-based algorithm that outputs a $(1-\epsilon)$-approximation in deterministic $\tilde{O}(m/\epsilon)$ time, where $m$ is the number of edges. Our algorithm is the first to have a near-linear dependence on $m$ and $1/\epsilon$ and improves previous methods based on an LP relaxation. It generalizes to hypergraphs, and also yields a faster algorithm for directed DSG. Greedy peeling algorithms have been very popular for DSG and several variants due to their efficiency, empirical performance, and worst-case approximation guarantees. We describe a simple peeling algorithm for DSSP and analyze its approximation guarantee in a fashion that unifies several existing results. Boob et al.\ \cite{bgpstww-20} developed an \emph{iterative} peeling algorithm for DSG which appears to work very well in practice, and made a conjecture about its convergence to optimality. We affirmatively answer their conjecture, and in fact prove that a natural generalization of their algorithm converges to a $(1-\epsilon)$-approximation for \emph{any} supermodular function $f$; the key to our proof is to consider an LP formulation that is derived via the Lovász extension of a supermodular function. For DSG the bound on the number of iterations we prove is $O(\frac{\Delta \ln |V|}{\lambda^*}\cdot \frac{1}{\epsilon^2})$, where $\Delta$ is the maximum degree and $\lambda^*$ is the optimum value. Our work suggests that iterative peeling can be an effective heuristic for several objectives considered in the literature. Finally, we show that the $2$-approximation for densest-at-least-$k$ subgraph \cite{ks-09} extends to the supermodular setting. We also give a unified analysis of the peeling algorithm for this problem, and via this analysis derive an approximation guarantee for a generalization of DSSP to maximize $f(S)/g(|S|)$ for a concave function $g$.
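For entry 3, the sketch below implements the classic greedy peeling heuristic for densest subgraph (repeatedly delete a minimum-degree vertex and keep the densest intermediate subgraph), the well-known 2-approximation the abstract alludes to. It is not the paper's flow-based $(1-\epsilon)$-approximation or its iterative-peeling generalization; the example graph is an arbitrary assumption.

```python
import heapq
from collections import defaultdict

def greedy_peel_densest(edges):
    """Greedy peeling for densest subgraph: repeatedly delete a vertex of
    minimum degree and return the densest of the intermediate subgraphs
    (the classic 2-approximation)."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    remaining = set(adj)
    m = sum(len(nbrs) for nbrs in adj.values()) // 2
    heap = [(len(adj[v]), v) for v in remaining]
    heapq.heapify(heap)
    removal_order = []
    best_density, best_k = 0.0, 0
    while remaining:
        density = m / len(remaining)
        if density > best_density:
            best_density, best_k = density, len(removal_order)
        # Pop a vertex of minimum current degree, skipping stale heap entries.
        while True:
            deg, v = heapq.heappop(heap)
            if v in remaining and deg == len(adj[v]):
                break
        remaining.discard(v)
        removal_order.append(v)
        for u in adj[v]:
            adj[u].discard(v)
            m -= 1
            heapq.heappush(heap, (len(adj[u]), u))
        adj[v].clear()
    best_set = set(removal_order[best_k:])   # vertices still present at the densest step
    return best_set, best_density

if __name__ == "__main__":
    # A 4-clique (density 6/4 = 1.5) with a pendant path attached.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
    print(greedy_peel_densest(edges))   # expect ({0, 1, 2, 3}, 1.5)
```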
  4. We study the problem of certification: given queries to a function f : {0,1}^n → {0,1} with certificate complexity ≤ k and an input x⋆, output a size-k certificate for f’s value on x⋆. For monotone functions, a classic local search algorithm of Angluin accomplishes this task with n queries, which we show is optimal for local search algorithms. Our main result is a new algorithm for certifying monotone functions with O(k^8 log n) queries, which comes close to matching the information-theoretic lower bound of Ω(k log n). The design and analysis of our algorithm are based on a new connection to threshold phenomena in monotone functions. We further prove exponential-in-k lower bounds when f is non-monotone, and when f is monotone but the algorithm is only given random examples of f. These lower bounds show that assumptions on the structure of f and query access to it are both necessary for the polynomial dependence on k that we achieve.
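For entry 4, the sketch below shows the n-query local-search baseline for certifying monotone functions that the abstract attributes to Angluin, not the paper's O(k^8 log n)-query algorithm: greedily relax coordinates of the input while the function value is unchanged, and output the coordinates that could not be relaxed. The example function is an arbitrary assumption.

```python
def local_search_certificate(f, x, n):
    """Local-search certification of a monotone f: {0,1}^n -> {0,1} at x,
    using about n queries. If f(x) = 1, greedily flip 1-bits down to 0 while
    the value stays 1; by monotonicity the surviving 1-coordinates (pinned
    to 1) certify f(x) = 1. The symmetric procedure handles f(x) = 0."""
    x = list(x)
    value = f(x)
    pinned = 1 if value == 1 else 0
    for i in range(n):
        if x[i] == pinned:
            x[i] = 1 - pinned            # try relaxing coordinate i
            if f(x) != value:
                x[i] = pinned            # coordinate i is needed; restore it
    # The coordinates still pinned form the certificate.
    return {i: pinned for i in range(n) if x[i] == pinned}

if __name__ == "__main__":
    # Hypothetical monotone function: OR of two ANDs, certificate complexity 2.
    f = lambda x: int((x[0] and x[1]) or (x[2] and x[3]))
    print(local_search_certificate(f, [1, 1, 0, 0, 1], n=5))   # e.g. {0: 1, 1: 1}
```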
  5. Variance reduction techniques like SVRG provide simple and fast algorithms for optimizing a convex finite-sum objective. For nonconvex objectives, these techniques can also find a first-order stationary point (with a small gradient). However, in nonconvex optimization it is often crucial to find a second-order stationary point (with a small gradient and an almost PSD Hessian). In this paper, we show that Stabilized SVRG (a simple variant of SVRG) can find an ε-second-order stationary point using only O(n^{2/3}/ε^2 + n/ε^{1.5}) stochastic gradients. To the best of our knowledge, this is the first second-order guarantee for a simple variant of SVRG. The running time almost matches the known guarantees for finding ε-first-order stationary points.
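For entry 5, the sketch below shows plain SVRG on a finite-sum objective, to make the variance-reduced gradient update concrete. Stabilized SVRG, as described in the abstract, adds further steps to obtain second-order guarantees; those are not reproduced here, and the step size, epoch counts, and toy least-squares problem are illustrative assumptions.

```python
import numpy as np

def svrg(grad_i, x0, n, lr=0.05, n_epochs=20, inner_steps=None, seed=0):
    """Plain SVRG for minimizing (1/n) * sum_i f_i(x), where grad_i(x, i)
    returns the gradient of f_i at x. Each epoch recomputes a full gradient
    at a snapshot and uses it to reduce the variance of the inner updates."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    inner_steps = inner_steps or n
    for _ in range(n_epochs):
        snapshot = x.copy()
        # Full gradient at the snapshot point.
        full_grad = np.mean([grad_i(snapshot, i) for i in range(n)], axis=0)
        for _ in range(inner_steps):
            i = rng.integers(n)
            # Variance-reduced stochastic gradient.
            g = grad_i(x, i) - grad_i(snapshot, i) + full_grad
            x -= lr * g
    return x

if __name__ == "__main__":
    # Toy least-squares finite sum: f_i(x) = 0.5 * (a_i . x - b_i)^2.
    rng = np.random.default_rng(1)
    n, d = 100, 5
    A, x_true = rng.normal(size=(n, d)), rng.normal(size=d)
    b = A @ x_true
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]
    x_hat = svrg(grad_i, np.zeros(d), n)
    print("recovery error:", np.linalg.norm(x_hat - x_true))
```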