An Analysis of Constant Step Size SGD in the Non-convex Regime: Asymptotic Normality and Bias
Structured non-convex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are well understood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not well studied in the non-convex setting. To address this shortcoming, in this work we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm, which is widely used in practice. Specifically, based on the relationship between SGD and Markov chains [1], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the non-convex and non-smooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, these two results can be leveraged to construct confidence intervals for non-convex problems that are trained using the SGD algorithm.
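As an illustration, here is a minimal Python sketch of the pipeline the abstract describes: run constant step size SGD on a dissipative non-convex objective, average the iterates, and build a normality-based confidence interval around the average. The double-well objective f(x) = x^4/4 - x^2/2, the step size, and the batch-means variance estimator are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def stoch_grad(x, noise_std=1.0):
    """Noisy gradient of the double-well objective f(x) = x^4/4 - x^2/2."""
    return x**3 - x + noise_std * rng.standard_normal()

eta = 0.05          # constant step size
T = 200_000         # number of iterations
x = 2.0             # initialization
iterates = np.empty(T)
for t in range(T):
    x -= eta * stoch_grad(x)
    iterates[t] = x

burn_in = T // 10                  # discard the transient of the Markov chain
chain = iterates[burn_in:]
avg = chain.mean()                 # averaged iterate: estimates the mean of
                                   # the invariant distribution of the chain

# Batch-means estimate of the standard error of the average (treats the
# batch means as approximately independent).
n_batches = 50
batches = chain[: (len(chain) // n_batches) * n_batches].reshape(n_batches, -1)
batch_means = batches.mean(axis=1)
se = batch_means.std(ddof=1) / np.sqrt(n_batches)

print(f"averaged iterate: {avg:.4f}")
print(f"approx. 95% CI:  [{avg - 1.96 * se:.4f}, {avg + 1.96 * se:.4f}]")
```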
- Award ID(s): 1934568
- Publication Date:
- NSF-PAR ID: 10349410
- Journal Name: Advances in Neural Information Processing Systems
- Volume: 34
- ISSN: 1049-5258
- Sponsoring Org: National Science Foundation
More Like this
Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this paper, we show that a non-convex generalization of the well-known OMF algorithm for i.i.d. streams of data in (Mairal et al., 2010) converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. This allows one to extract features more efficiently from dependent data streams, as there is no need to subsample the data sequence to approximately satisfy the independence assumption. As the main application, by combining online non-negative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts "network dictionary patches" from a given network in an online manner that encodes main features of the network. We demonstrate this technique and its application to network denoising problems on real-world network data.
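A minimal sketch of Mairal-style online nonnegative matrix factorization run on a dependent stream, where consecutive samples come from a toy two-state Markov chain rather than an i.i.d. source; the dimensions, transition matrix, and projected-gradient code solver are illustrative assumptions, not the paper's algorithm verbatim.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r = 20, 5          # data dimension, number of dictionary atoms

def nnls_code(W, x, iters=50):
    """Nonnegative code h minimizing ||x - W h||^2 via projected gradient."""
    lr = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-8)
    h = np.zeros(r)
    for _ in range(iters):
        h = np.maximum(0.0, h - lr * (W.T @ (W @ h - x)))
    return h

W = np.abs(rng.standard_normal((d, r)))   # nonnegative dictionary
A = np.zeros((r, r))                      # sufficient statistic: sum of h h^T
B = np.zeros((d, r))                      # sufficient statistic: sum of x h^T

# Toy Markovian stream: the mixing state follows a 2-state Markov chain,
# so consecutive samples are dependent.
state = 0
P = np.array([[0.9, 0.1], [0.1, 0.9]])
ground_truth = np.abs(rng.standard_normal((d, 2)))

for _ in range(2000):
    state = rng.choice(2, p=P[state])
    x = np.maximum(0.0, ground_truth[:, state] + 0.1 * rng.standard_normal(d))
    h = nnls_code(W, x)
    A += np.outer(h, h)
    B += np.outer(x, h)
    # Block-coordinate dictionary update (one pass over atoms), projected
    # to stay nonnegative, in the style of Mairal et al. (2010).
    for j in range(r):
        if A[j, j] > 1e-10:
            W[:, j] = np.maximum(0.0, W[:, j] + (B[:, j] - W @ A[:, j]) / A[j, j])
```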
Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD's final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e., the number of iterations that SGD is run for) is known in advance, the behavior of SGD's final iterate with any polynomially decaying learning rate scheme is highly suboptimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of √T in the non-strongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the statistical minimax rate by only logarithmic factors.
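A minimal sketch of the Step Decay idea for the final iterate of SGD on streaming least squares: the learning rate is cut by a constant factor every fixed number of steps, and no iterate averaging is used. The initial step size, decay factor, and epoch length below are illustrative choices, not the tuned schedule from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)
d, T = 10, 100_000
w_star = rng.standard_normal(d)   # unknown regression target

eta = 0.1                 # initial step size (illustrative)
decay_every = T // 10     # "epoch" length: cut the rate every T/10 steps
decay_factor = 0.5        # geometric (step) decay

w = np.zeros(d)
for t in range(T):
    a = rng.standard_normal(d)                    # streaming covariate
    y = a @ w_star + 0.5 * rng.standard_normal()  # noisy response
    if t > 0 and t % decay_every == 0:
        eta *= decay_factor                       # step decay schedule
    w -= eta * (a @ w - y) * a                    # SGD on the squared loss

# Excess risk of the *final* iterate (identity covariance, so it is
# proportional to the squared parameter error).
print("excess risk of final iterate:", 0.5 * np.sum((w - w_star) ** 2))
```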
In this paper, we study the stability and its trade-off with optimization error for stochastic gradient descent (SGD) algorithms in the pairwise learning setting. Pairwise learning refers to a learning task that involves a loss function depending on pairs of instances, among which notable examples are bipartite ranking, metric learning, area under ROC curve (AUC) maximization and minimum error entropy (MEE) principle. Our contribution is twofold. Firstly, we establish the stability results for SGD for pairwise learning in the convex, strongly convex and non-convex settings, from which generalization errors can be naturally derived. Secondly, we establish the trade-off between stability and optimization error of SGD algorithms for pairwise learning. This is achieved by lower-bounding the sum of stability and optimization error by the minimax statistical error over a prescribed class of pairwise loss functions. From this fundamental trade-off, we obtain lower bounds for the optimization error of SGD algorithms and the excess expected risk over a class of pairwise losses. In addition, we illustrate our stability results by giving some specific examples of AUC maximization, metric learning and MEE.
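A minimal sketch of SGD in the pairwise learning setting, using AUC maximization with a pairwise squared loss as the example: each incoming point is paired with a uniformly sampled past point, and an update is made only on opposite-label pairs. The data model, buffer-pairing scheme, loss, and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 5, 20_000

def sample_example():
    """Toy binary-labelled data: class mean +1 or -1 per coordinate, plus noise."""
    y = 1 if rng.random() < 0.5 else -1
    return y * np.ones(d) + rng.standard_normal(d), y

w = np.zeros(d)
buffer = [sample_example()]
for t in range(1, T + 1):
    x_t, y_t = sample_example()
    x_s, y_s = buffer[rng.integers(len(buffer))]   # pair with a past example
    buffer.append((x_t, y_t))
    if y_t != y_s:   # only opposite-label pairs contribute to the AUC loss
        pos, neg = (x_t, x_s) if y_t == 1 else (x_s, x_t)
        margin = w @ (pos - neg)
        grad = -2.0 * (1.0 - margin) * (pos - neg)  # pairwise loss (1 - margin)^2
        w -= (1.0 / np.sqrt(t)) * grad              # decaying step size
```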
This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the non-convex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be O(1/√k); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than that obtained using the popular score-function approach. Experiments on the OpenAI Gym pendulum environment illustrate the superior performance of the proposed algorithm.
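A minimal one-dimensional sketch contrasting the two gradient estimators the abstract compares, the score function versus the measure-valued (weak) derivative, for differentiating E[f(X)] with X ~ N(θ, σ²). The Rayleigh decomposition used below is the standard weak derivative of a Gaussian with respect to its mean; the toy reward f is an illustrative assumption rather than the paper's full RL setup.

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n = 0.5, 1.0, 100_000

def f(x):
    """Toy 'reward' whose expectation under N(theta, sigma^2) we differentiate."""
    return np.tanh(x)

# Score-function estimator: grad = f(x) * score = f(x) * (x - theta) / sigma^2.
x = theta + sigma * rng.standard_normal(n)
g_score = f(x) * (x - theta) / sigma**2

# Measure-valued (weak) derivative w.r.t. the Gaussian mean: the derivative
# measure splits into two shifted Rayleigh(1) distributions,
#   d/dtheta E[f(X)] = (E[f(theta + sigma*R)] - E[f(theta - sigma*R)])
#                      / (sigma * sqrt(2*pi)),  R ~ Rayleigh(1).
r = rng.rayleigh(scale=1.0, size=n)
g_weak = (f(theta + sigma * r) - f(theta - sigma * r)) / (sigma * np.sqrt(2 * np.pi))

# Both estimators are unbiased for the same gradient; the weak-derivative
# estimate typically has lower variance, consistent with result (iv).
print("score function : mean %.4f  var %.4f" % (g_score.mean(), g_score.var()))
print("weak derivative: mean %.4f  var %.4f" % (g_weak.mean(), g_weak.var()))
```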