On the Local Minima of the Empirical Risk
Population risk is always of primary interest in machine learning; however, learning algorithms only have access to the empirical risk. Even for applications with nonconvex nonsmooth losses (such as modern deep networks), the population risk is generally significantly more well-behaved from an optimization point of view than the empirical risk. In particular, sampling can create many spurious local minima. We consider a general framework which aims to optimize a smooth nonconvex function F (population risk) given only access to an approximation f (empirical risk) that is pointwise close to F (i.e., ||F − f||_\infty ≤ ν). Our objective is to find the \eps-approximate local minima of the underlying function F while avoiding the shallow local minima—arising because of the tolerance ν—which exist only in f. We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of f that is guaranteed to achieve our goal as long as ν ≤ O(\eps^{1.5}/d). We also provide an almost matching lower bound showing that our algorithm achieves optimal error tolerance ν among all algorithms making a polynomial number of queries of f. As a concrete example, we show that our results can be directly used to give sample complexities for more »
Authors:
; ; ;
Award ID(s):
Publication Date:
NSF-PAR ID:
10090514
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
1. The densest subgraph problem in a graph (\dsg), in the simplest form, is the following. Given an undirected graph $G=(V,E)$ find a subset $S \subseteq V$ of vertices that maximizes the ratio $|E(S)|/|S|$ where $E(S)$ is the set of edges with both endpoints in $S$. \dsg and several of its variants are well-studied in theory and practice and have many applications in data mining and network analysis. In this paper we study fast algorithms and structural aspects of \dsg via the lens of \emph{supermodularity}. For this we consider the densest supermodular subset problem (\dssp): given a non-negative supermodular function $f: 2^V \rightarrow \mathbb{R}_+$, maximize $f(S)/|S|$. For \dsg we describe a simple flow-based algorithm that outputs a $(1-\eps)$-approximation in deterministic $\tilde{O}(m/\eps)$ time where $m$ is the number of edges. Our algorithm is the first to have a near-linear dependence on $m$ and $1/\eps$ and improves previous methods based on an LP relaxation. It generalizes to hypergraphs, and also yields a faster algorithm for directed \dsg. Greedy peeling algorithms have been very popular for \dsg and several variants due to their efficiency, empirical performance, and worst-case approximation guarantees. We describe a simple peeling algorithm for \dssp and analyze its approximation guarantee inmore »