Title: On Linear Stochastic Approximation: Fine-grained Polyak-Ruppert and Non-Asymptotic Concentration
We undertake a precise study of the asymptotic and non-asymptotic properties of stochastic approximation procedures with Polyak-Ruppert averaging for solving a linear system $$\bar{A} \theta = \bar{b}$$. When the matrix $$\bar{A}$$ is Hurwitz, we prove a central limit theorem (CLT) for the averaged iterates with fixed step size and the number of iterations tending to infinity. The CLT characterizes the exact asymptotic covariance matrix, which is the sum of the classical Polyak-Ruppert covariance and a correction term that scales with the step size. Under assumptions on the tail of the noise distribution, we prove a non-asymptotic concentration inequality whose main term matches the covariance in the CLT in any direction, up to universal constants. When the matrix $$\bar{A}$$ is not Hurwitz but its eigenvalues all have non-negative real parts, we prove that the averaged LSA procedure actually achieves an $$O(1/T)$$ rate in mean-squared error. Our results provide a more refined understanding of linear stochastic approximation in both the asymptotic and non-asymptotic settings. We also show various applications of the main results, including the study of momentum-based stochastic gradient methods as well as temporal difference algorithms in reinforcement learning.
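As a concrete illustration of the averaged LSA procedure studied in the abstract, here is a minimal Python sketch, not taken from the paper: the Gaussian noise model, step size, and sign convention (the iteration $$\theta_{t+1} = \theta_t + \eta (b_t - A_t \theta_t)$$, stable when the eigenvalues of $$\bar{A}$$ have positive real parts) are illustrative assumptions.

```python
import numpy as np

def lsa_polyak_ruppert(A_bar, b_bar, step, T, noise_scale=0.1, seed=0):
    """Constant step-size LSA for A_bar @ theta = b_bar from noisy
    observations (A_t, b_t), with a running Polyak-Ruppert average."""
    rng = np.random.default_rng(seed)
    d = b_bar.shape[0]
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    for t in range(1, T + 1):
        # Noisy observations of the system (illustrative i.i.d. Gaussian noise).
        A_t = A_bar + noise_scale * rng.standard_normal((d, d))
        b_t = b_bar + noise_scale * rng.standard_normal(d)
        theta = theta + step * (b_t - A_t @ theta)  # LSA update
        theta_avg += (theta - theta_avg) / t        # Polyak-Ruppert average
    return theta, theta_avg
```

The averaged iterate concentrates around the solution $$\bar{A}^{-1}\bar{b}$$ far more tightly than the last iterate, whose stationary fluctuation scales with the step size.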
Award ID(s):
1909365
PAR ID:
10250956
Journal Name:
Proceedings of Thirty Third Conference on Learning Theory
Volume:
125
Page Range / eLocation ID:
2947-2997
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Ramanan, Kavita (Ed.)
The paper concerns the stochastic approximation recursion, \[ \theta_{n+1} = \theta_n + \alpha_{n+1} f(\theta_n, \Phi_{n+1}) \,, \quad n \ge 0, \] where the estimates $$\{ \theta_n \}$$ evolve on $$\mathbb{R}^d$$, and $$\boldsymbol{\Phi} := \{ \Phi_n \}$$ is a stochastic process on a general state space, satisfying a conditional Markov property that allows for parameter-dependent noise. In addition to standard Lipschitz assumptions and conditions on the vanishing step-size sequence, it is assumed that the associated mean flow $$\tfrac{d}{dt} \vartheta_t = \bar{f}(\vartheta_t)$$ is globally asymptotically stable, with stationary point denoted $$\theta^*$$. The main results are established under additional conditions on the mean flow and an extension of the Donsker-Varadhan Lyapunov drift condition known as (DV3): (i) A Lyapunov function is constructed for the joint process $$\{\theta_n, \Phi_n\}$$ that implies convergence of the estimates in $$L_4$$. (ii) A functional central limit theorem (CLT) is established, as well as the usual one-dimensional CLT for the normalized error. Moment bounds combined with the CLT imply convergence of the normalized covariance $$\mathbb{E}[z_n z_n^\top]$$ to the asymptotic covariance $$\Sigma_\theta$$ in the CLT, where $$z_n := (\theta_n - \theta^*)/\sqrt{\alpha_n}$$. (iii) The CLT holds for the normalized averaged parameters $$z^{\mathrm{PR}}_n := \sqrt{n} (\theta^{\mathrm{PR}}_n - \theta^*)$$, with $$\theta^{\mathrm{PR}}_n := n^{-1} \sum_{k=1}^n \theta_k$$, subject to standard assumptions on the step-size. Moreover, the covariance of $$z^{\mathrm{PR}}_n$$ converges to $$\Sigma^{\mathrm{PR}}$$, the minimal covariance of Polyak and Ruppert. (iv) An example is given where $$f$$ and $$\bar{f}$$ are linear in $$\theta$$, and $$\boldsymbol{\Phi}$$ is a geometrically ergodic Markov chain but does not satisfy (DV3). While the algorithm is convergent, the second moment of $$\theta_n$$ is unbounded and in fact diverges.
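A minimal Python sketch of the recursion above, with illustrative choices not drawn from the paper: a two-state Markov chain $$\boldsymbol{\Phi}$$, vanishing steps $$\alpha_n = 1/n$$, and $$f(\theta, \Phi) = \Phi - \theta$$, so the mean flow is $$\tfrac{d}{dt}\vartheta_t = \mathbb{E}_\pi[\Phi] - \vartheta_t$$ with stationary point $$\theta^* = \mathbb{E}_\pi[\Phi]$$.

```python
import numpy as np

def sa_markov(n_steps, seed=0):
    """SA recursion theta_{n+1} = theta_n + alpha_{n+1} f(theta_n, Phi_{n+1})
    driven by a two-state Markov chain; alpha_n = 1/n."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.9, 0.1],
                  [0.2, 0.8]])         # transition matrix; pi = (2/3, 1/3)
    states = np.array([0.0, 1.0])      # Phi takes values 0 or 1
    phi = 0
    theta = 0.0
    for n in range(1, n_steps + 1):
        phi = rng.choice(2, p=P[phi])                  # Markovian noise
        theta += (1.0 / n) * (states[phi] - theta)     # f(theta, Phi) = Phi - theta
    return theta
```

With this particular $$f$$ and $$\alpha_n = 1/n$$, the iterate reduces to the running average of the chain, so it converges to $$\theta^* = \mathbb{E}_\pi[\Phi] = 1/3$$.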
Theory and application of stochastic approximation (SA) has grown within the control systems community since the earliest days of adaptive control. This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant step-size $$\alpha > 0$$. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d. coefficients. This paper obtains very different conclusions in the more common case of geometrically ergodic Markovian disturbance: (i) The target bias is identified, even in the case of non-linear SA, and is in general non-zero. The remaining results are established for linear SA recursions: (ii) the bivariate parameter-disturbance process is geometrically ergodic in a topological sense; (iii) the representation for bias has a simpler form in this case, and cannot be expected to be zero if there is multiplicative noise; (iv) the asymptotic covariance of the averaged parameters is within $$O(\alpha)$$ of optimal. The error term is identified, and may be massive if mean dynamics are not well conditioned. The theory is illustrated with application to TD-learning.
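The constant step-size, Markovian multiplicative-noise setting can be sketched as follows. This is a toy instance, not the paper's construction: a slowly mixing two-state chain modulates the gain $$A(\Phi)$$, the mean dynamics have fixed point $$\theta^* = 1$$, and the second half of the iterates is averaged.

```python
import numpy as np

def constant_step_sa(alpha, T, seed=0):
    """Linear SA with constant step alpha and Markov-modulated
    multiplicative noise; returns the average of the second half
    of the iterates (a simple burn-in-then-average estimate)."""
    rng = np.random.default_rng(seed)
    P = np.array([[0.95, 0.05],
                  [0.05, 0.95]])       # slowly mixing chain; pi uniform
    A = np.array([0.5, 1.5])           # state-dependent gain, mean 1
    b = 1.0                            # constant drive, so theta* = 1
    phi, theta = 0, 0.0
    avg, burn = 0.0, T // 2
    for n in range(T):
        phi = rng.choice(2, p=P[phi])
        theta += alpha * (b - A[phi] * theta)          # multiplicative noise
        if n >= burn:
            avg += (theta - avg) / (n - burn + 1)      # running average
    return avg
```

Per the abstract, with Markovian multiplicative noise the averaged estimate carries a generally non-zero bias of order $$O(\alpha)$$, so the average lands near, but not exactly at, $$\theta^* = 1$$.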
This paper considers the recursive estimation of quantiles using the stochastic gradient descent (SGD) algorithm with Polyak-Ruppert averaging. The algorithm offers a computationally and memory efficient alternative to the usual empirical estimator. Our focus is on studying the non-asymptotic behavior by providing exponentially decreasing tail probability bounds under mild assumptions on the smoothness of the density functions. This novel non-asymptotic result is based on a bound of the moment generating function of the SGD estimate. We apply our result to the problem of best arm identification in a multi-armed stochastic bandit setting under quantile preferences.
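The recursion in question can be sketched in a few lines, with the step-size schedule and starting point as illustrative assumptions rather than the paper's choices: the update $$\theta_{n+1} = \theta_n + \alpha_n (\tau - \mathbf{1}\{X_n \le \theta_n\})$$ is SGD on the pinball (quantile) loss, and the Polyak-Ruppert average of the iterates is returned.

```python
import numpy as np

def sgd_quantile(samples, tau, c=1.0, power=0.75):
    """Recursive tau-quantile estimation: SGD on the pinball loss
    with steps c / n**power and Polyak-Ruppert averaging."""
    theta, avg = 0.0, 0.0
    for n, x in enumerate(samples, start=1):
        step = c / n ** power
        theta += step * (tau - (x <= theta))  # subgradient of pinball loss
        avg += (theta - avg) / n              # running PR average
    return avg
```

Unlike the empirical quantile, this pass processes each observation once and stores only two scalars, which is the computational and memory advantage the abstract refers to.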
  4. We study stochastic approximation procedures for approximately solving a $$d$$-dimensional linear fixed point equation based on observing a trajectory of length $$n$$ from an ergodic Markov chain. We first exhibit a non-asymptotic bound of the order $$t_{\mathrm{mix}} \tfrac{d}{n}$$ on the squared error of the last iterate of a standard scheme, where $$t_{\mathrm{mix}}$$ is a mixing time. We then prove a non-asymptotic instance-dependent bound on a suitably averaged sequence of iterates, with a leading term that matches the local asymptotic minimax limit, including sharp dependence on the parameters $$(d, t_{\mathrm{mix}})$$ in the higher order terms. We complement these upper bounds with a non-asymptotic minimax lower bound that establishes the instance-optimality of the averaged SA estimator. We derive corollaries of these results for policy evaluation with Markov noise—covering the TD($$\lambda$$) family of algorithms for all $$\lambda \in [0, 1)$$—and linear autoregressive models. Our instance-dependent characterizations open the door to the design of fine-grained model selection procedures for hyperparameter tuning (e.g., choosing the value of $$\lambda$$ when running the TD($$\lambda$$) algorithm). 
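For the policy-evaluation corollary, a standard TD($$\lambda$$) recursion with linear function approximation on a small Markov reward process looks as follows. This is a generic sketch of the algorithm family the bounds cover, with a constant step size and a toy chain as illustrative assumptions.

```python
import numpy as np

def td_lambda(P, r, feats, lam, gamma, step, T, seed=0):
    """TD(lambda) with linear function approximation on a Markov reward
    process: transition matrix P, per-state reward r, feature matrix feats.
    Returns the Polyak-Ruppert average of the parameter iterates."""
    rng = np.random.default_rng(seed)
    d = feats.shape[1]
    theta = np.zeros(d)
    theta_avg = np.zeros(d)
    z = np.zeros(d)                  # eligibility trace
    s = 0
    for t in range(1, T + 1):
        s_next = rng.choice(len(r), p=P[s])
        # TD error for V(s) = feats[s] @ theta
        delta = r[s] + gamma * feats[s_next] @ theta - feats[s] @ theta
        z = gamma * lam * z + feats[s]
        theta = theta + step * delta * z
        theta_avg += (theta - theta_avg) / t
        s = s_next
    return theta_avg
```

With tabular features (identity matrix), the averaged iterate approaches the solution of the Bellman equation $$V = r + \gamma P V$$.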
We study the problem of solving strongly convex and smooth unconstrained optimization problems using stochastic first-order algorithms. We devise a novel algorithm, referred to as Recursive One-Over-T SGD (ROOT-SGD), based on an easily implementable, recursive averaging of past stochastic gradients. We prove that it simultaneously achieves state-of-the-art performance in both a finite-sample, nonasymptotic sense and an asymptotic sense. On the nonasymptotic side, we prove risk bounds on the last iterate of ROOT-SGD with leading-order terms that match the optimal statistical risk with a unity pre-factor, along with a higher-order term that scales at the sharp rate of $$O(n^{-3/2})$$ under the Lipschitz condition on the Hessian matrix. On the asymptotic side, we show that when a mild, one-point Hessian continuity condition is imposed, the rescaled last iterate of (multi-epoch) ROOT-SGD converges asymptotically to a Gaussian limit with the Cramér-Rao optimal asymptotic covariance, for a broad range of step-size choices.
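One common form of such a recursive gradient average (our paraphrase; the paper's exact recursion and constants may differ) maintains $$v_t = g_t(\theta_{t-1}) + (1 - 1/t)\,(v_{t-1} - g_t(\theta_{t-2}))$$, where $$g_t$$ denotes the stochastic gradient evaluated on the same sample at two consecutive iterates, and then steps $$\theta_t = \theta_{t-1} - \eta v_t$$. The sketch below applies this to a noisy scalar quadratic; the objective and noise model are illustrative assumptions.

```python
import numpy as np

def root_sgd(theta0, eta, T, noise_std=1.0, seed=0):
    """Sketch of a recursive one-over-t gradient average on the quadratic
    f(x) = 0.5 * (x - 1)**2 with additive Gaussian gradient noise."""
    rng = np.random.default_rng(seed)
    grad = lambda x, xi: (x - 1.0) + xi   # noisy gradient of f
    theta_prev = theta = theta0
    v = 0.0
    for t in range(1, T + 1):
        xi = noise_std * rng.standard_normal()
        # Same noise sample xi evaluated at both iterates: the difference
        # cancels most of it, leaving an O(1/t) noise contribution in v.
        v = grad(theta, xi) + (1 - 1 / t) * (v - grad(theta_prev, xi))
        theta_prev, theta = theta, theta - eta * v
    return theta
```

On this quadratic the recursion makes $$v_t$$ equal the true gradient plus the running mean of the noise, which is the variance-reduction mechanism behind the $$1/\sqrt{n}$$-rate last iterate.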