An Analysis of Constant Step Size SGD in the Nonconvex Regime: Asymptotic Normality and Bias
Structured nonconvex learning problems, for which critical points have favorable statistical properties, arise frequently in statistical machine learning. Algorithmic convergence and statistical estimation rates are well understood for such problems. However, quantifying the uncertainty associated with the underlying training algorithm is not well studied in the nonconvex setting. In order to address this shortcoming, in this work we establish an asymptotic normality result for the constant step size stochastic gradient descent (SGD) algorithm, a widely used algorithm in practice. Specifically, based on the relationship between SGD and Markov chains [1], we show that the average of SGD iterates is asymptotically normally distributed around the expected value of their unique invariant distribution, as long as the nonconvex and nonsmooth objective function satisfies a dissipativity property. We also characterize the bias between this expected value and the critical points of the objective function under various local regularity conditions. Together, these two results can be leveraged to construct confidence intervals for nonconvex problems that are trained using the SGD algorithm.
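As a toy illustration of the recipe the abstract suggests (run constant step size SGD, average the iterates, then build a CLT-style interval around the invariant-distribution mean), here is a minimal sketch on a made-up one-dimensional dissipative objective; the objective, the constants, and the batch-means variance estimate are all illustrative choices, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_grad(x):
    # Noisy gradient of f(x) = x**4 / 4 - x**2 / 2, a simple dissipative
    # nonconvex objective with critical points at x = -1, 0, 1.
    return x**3 - x + 0.5 * rng.standard_normal()

eta, T = 0.05, 200_000              # constant step size, number of iterations
x = 2.0
iterates = np.empty(T)
for t in range(T):
    x -= eta * stochastic_grad(x)
    iterates[t] = x

tail = iterates[T // 10:]           # drop a transient before averaging
mean = tail.mean()
# CLT-style 95% interval around the invariant-distribution mean; batch means
# account for the correlation of the SGD iterates viewed as a Markov chain.
batch_means = tail.reshape(100, -1).mean(axis=1)
se = batch_means.std(ddof=1) / np.sqrt(len(batch_means))
ci = (mean - 1.96 * se, mean + 1.96 * se)
```

Note that `mean` concentrates near the critical point x = 1 but carries an O(eta) bias away from it, which is the gap the paper's second result quantifies.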
 Award ID(s):
 1934568
 Publication Date:
 NSF-PAR ID:
 10349410
 Journal Name:
 Advances in neural information processing systems
 Volume:
 34
 ISSN:
 1049-5258
 Sponsoring Org:
 National Science Foundation
More Like this


Online Matrix Factorization (OMF) is a fundamental tool for dictionary learning problems, giving an approximate representation of complex data sets in terms of a reduced number of extracted features. Convergence guarantees for most of the OMF algorithms in the literature assume independence between data matrices, and the case of dependent data streams remains largely unexplored. In this paper, we show that a nonconvex generalization of the well-known OMF algorithm for i.i.d. streams of data in (Mairal et al., 2010) converges almost surely to the set of critical points of the expected loss function, even when the data matrices are functions of some underlying Markov chain satisfying a mild mixing condition. This allows one to extract features more efficiently from dependent data streams, as there is no need to subsample the data sequence to approximately satisfy the independence assumption. As the main application, by combining online nonnegative matrix factorization and a recent MCMC algorithm for sampling motifs from networks, we propose a novel framework of Network Dictionary Learning, which extracts “network dictionary patches” from a given network in an online manner that encodes main features of the network. We demonstrate this technique and its application to network denoising problems on real-world network data.
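As a rough illustration of the kind of online update involved, here is a simplified nonnegative variant in the spirit of Mairal et al. (2010) — with an i.i.d. rather than Markovian stream, and with all dimensions and constants invented for the demo:

```python
import numpy as np

rng = np.random.default_rng(1)

d, r = 20, 5                       # data dimension, number of dictionary atoms
W_true = rng.random((d, r))        # hypothetical ground-truth dictionary
W = rng.random((d, r))             # learned nonnegative dictionary

def sparse_code(W, x, n_iter=50):
    # Nonnegative least squares for the code h via projected gradient descent.
    lr = 1.0 / (np.linalg.norm(W, 2) ** 2 + 1e-8)
    h = np.zeros(W.shape[1])
    for _ in range(n_iter):
        h = np.maximum(0.0, h - lr * (W.T @ (W @ h - x)))
    return h

A = np.zeros((r, r))               # aggregated sufficient statistics
B = np.zeros((d, r))
for t in range(2000):
    x = W_true @ rng.random(r)     # one sample from the (here i.i.d.) stream
    h = sparse_code(W, x)
    A += np.outer(h, h)
    B += np.outer(x, h)
    # Block-coordinate dictionary update with a nonnegativity projection.
    for j in range(r):
        if A[j, j] > 1e-10:
            W[:, j] = np.maximum(0.0, W[:, j] + (B[:, j] - W @ A[:, j]) / A[j, j])
```

The paper's point is that essentially this scheme still converges when the samples `x` come from a Markov chain with mild mixing, so no subsampling of the stream is needed.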

Minimax optimal convergence rates for numerous classes of stochastic convex optimization problems are well characterized, where the majority of results utilize iterate averaged stochastic gradient descent (SGD) with polynomially decaying step sizes. In contrast, the behavior of SGD’s final iterate has received much less attention despite the widespread use in practice. Motivated by this observation, this work provides a detailed study of the following question: what rate is achievable using the final iterate of SGD for the streaming least squares regression problem with and without strong convexity? First, this work shows that even if the time horizon T (i.e., the number of iterations that SGD is run for) is known in advance, the behavior of SGD’s final iterate with any polynomially decaying learning rate scheme is highly suboptimal compared to the statistical minimax rate (by a condition number factor in the strongly convex case and a factor of √T in the nonstrongly convex case). In contrast, this paper shows that Step Decay schedules, which cut the learning rate by a constant factor every constant number of epochs (i.e., the learning rate decays geometrically), offer significant improvements over any polynomially decaying step size schedule. In particular, the behavior of the final iterate with step decay schedules is off from the …
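A Step Decay schedule of the kind described, contrasted with a polynomial one, might be sketched as follows (the phase count, decay factor, and exponent are illustrative choices, not the paper's exact parameters):

```python
import numpy as np

def step_decay_lr(eta0, t, T, n_phases=None):
    # Step Decay: halve the learning rate after each of ~log2(T) equal-length
    # phases, so the step size decays geometrically from eta0 to about eta0/T.
    if n_phases is None:
        n_phases = max(1, int(np.log2(T)))
    phase_len = max(1, T // n_phases)
    return eta0 * 0.5 ** (t // phase_len)

def poly_decay_lr(eta0, t, alpha=0.5):
    # Polynomial decay eta_t = eta0 / (t + 1)**alpha, for comparison.
    return eta0 / (t + 1) ** alpha
```

With a horizon of T = 1024, `step_decay_lr` stays piecewise constant within each phase and ends near eta0/T, whereas `poly_decay_lr` shrinks smoothly from the very first step.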

In this paper, we study the stability and its tradeoff with optimization error for stochastic gradient descent (SGD) algorithms in the pairwise learning setting. Pairwise learning refers to a learning task which involves a loss function depending on pairs of instances, notable examples of which are bipartite ranking, metric learning, area under ROC curve (AUC) maximization and the minimum error entropy (MEE) principle. Our contribution is twofold. Firstly, we establish stability results for SGD for pairwise learning in the convex, strongly convex and nonconvex settings, from which generalization errors can be naturally derived. Secondly, we establish the tradeoff between stability and optimization error of SGD algorithms for pairwise learning. This is achieved by lower-bounding the sum of stability and optimization error by the minimax statistical error over a prescribed class of pairwise loss functions. From this fundamental tradeoff, we obtain lower bounds for the optimization error of SGD algorithms and the excess expected risk over a class of pairwise losses. In addition, we illustrate our stability results by giving some specific examples of AUC maximization, metric learning and MEE.
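For concreteness, pairwise SGD differs from pointwise SGD in that each update consumes a pair of instances rather than one. A minimal sketch of linear AUC maximization with a pairwise hinge loss on synthetic data (the data distribution, step sizes, and sample counts are all invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic two-class data with class means at +mu and -mu.
mu = np.array([1.0, 1.0])
X_pos = rng.normal(mu, 1.0, size=(500, 2))
X_neg = rng.normal(-mu, 1.0, size=(500, 2))

w = np.zeros(2)
for t in range(1, 5001):
    # Each SGD step consumes a PAIR of instances (one positive, one negative).
    xp = X_pos[rng.integers(500)]
    xn = X_neg[rng.integers(500)]
    # Subgradient step on the pairwise hinge loss max(0, 1 - w . (xp - xn)).
    if w @ (xp - xn) < 1.0:
        w += (1.0 / np.sqrt(t)) * (xp - xn)

# Empirical AUC: fraction of positive/negative pairs ranked correctly.
auc = (X_pos @ w > (X_neg @ w)[:, None]).mean()
```

Because the loss couples two samples, the stability analysis must track how replacing one training instance perturbs every pair it participates in, which is what distinguishes this setting from standard pointwise SGD.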

This paper considers policy search in continuous state-action reinforcement learning problems. Typically, one computes search directions using a classic expression for the policy gradient called the Policy Gradient Theorem, which decomposes the gradient of the value function into two factors: the score function and the Q-function. This paper presents four results: (i) an alternative policy gradient theorem using weak (measure-valued) derivatives instead of the score function is established; (ii) the stochastic gradient estimates thus derived are shown to be unbiased and to yield algorithms that converge almost surely to stationary points of the nonconvex value function of the reinforcement learning problem; (iii) the sample complexity of the algorithm is derived and is shown to be O(1/√k); (iv) finally, the expected variance of the gradient estimates obtained using weak derivatives is shown to be lower than those obtained using the popular score-function approach. Experiments on the OpenAI gym pendulum environment illustrate the superior performance of the proposed algorithm.
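For context, the classic score-function estimator that the paper compares against can be sketched on a toy one-step problem. This illustrates the baseline approach, not the paper's weak-derivative estimator; the reward function, constants, running baseline, and gradient clipping are all illustrative stability tweaks for the demo:

```python
import numpy as np

rng = np.random.default_rng(3)

# One-step problem: action a ~ N(theta, sigma**2), reward r(a) = -(a - 2)**2,
# so the optimal policy mean is theta* = 2.
sigma, theta, baseline = 1.0, 0.0, 0.0
for k in range(1, 20001):
    a = theta + sigma * rng.standard_normal()
    r = -(a - 2.0) ** 2
    # Score-function gradient estimate: grad log pi(a; theta) * (r - baseline),
    # with grad log pi(a; theta) = (a - theta) / sigma**2. The baseline and
    # clipping reduce the estimator's notoriously high variance in this demo.
    g = np.clip((a - theta) / sigma**2 * (r - baseline), -20.0, 20.0)
    theta += (0.05 / np.sqrt(k)) * g
    baseline += 0.05 * (r - baseline)   # running-average reward baseline
```

The high variance that forces the baseline and clipping here is exactly the weakness result (iv) targets: weak-derivative estimates are shown to have lower variance than this score-function approach.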