skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.

This content will become publicly available on December 13, 2024

Title: The Curse of Memory in Stochastic Approximation
Theory and application of stochastic approximation (SA) has grown within the control systems community since the earliest days of adaptive control. This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant step-size \alpha>0. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d.\ coefficients. This paper obtains very different conclusions in the more common case of geometrically ergodic Markovian disturbance: (i) The target bias is identified, even in the case of non-linear SA, and is in general non-zero. The remaining results are established for linear SA recursions: (ii) the bivariate parameter-disturbance process is geometrically ergodic in a topological sense; (iii) the representation for bias has a simpler form in this case, and cannot be expected to be zero if there is multiplicative noise; (iv) the asymptotic covariance of the averaged parameters is within O(\alpha) of optimal. The error term is identified, and may be massive if mean dynamics are not well conditioned. The theory is illustrated with application to TD-learning.  more » « less
Award ID(s):
Author(s) / Creator(s):
Publisher / Repository:
Date Published:
Journal Name:
Proceedings of the IEEE Conference on Decision Control
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The theory and application of mean field games has grown significantly since its origins less than two decades ago. This paper considers a special class in which the game is cooperative, and the cost includes a control penalty defined by Kullback-Leibler divergence, as commonly used in reinforcement learning and other fields. Its use as a control cost or regularizer is often preferred because this leads to an attractive solution. This paper considers a particular control paradigm called Kullback-Leibler Quadratic (KLQ) optimal control, and arrives at the following conclusions: 1. in application to distributed control of electric loads, a new modeling technique is introduced to obtain a simple Markov model for each load (the `agent' in mean field theory). 2. It is argued that the optimality equations may be solved using Monte-Carlo techniques---a specialized version of stochastic gradient descent (SGD). 3. The use of averaging minimizes the asymptotic covariance in the SGD algorithm; the form of the optimal covariance is identified for the first time. 
    more » « less
  2. This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This is motivation for the focus on mean square error bounds for parameter estimates. It is shown that mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the step-size sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm design 
    more » « less
  3. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of state-action space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{-2} + S^2Asp(h^*)\epsilon^{-1}).$ This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$-discounted MDP as an approximation, and leveraging reference-advantage decomposition for variance in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value-difference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity. 
    more » « less
  4. An influential paper by Kleibergen (2005, Econometrica 73, 1103–1123) introduces Lagrange multiplier (LM) and conditional likelihood ratio-like (CLR) tests for nonlinear moment condition models. These procedures aim to have good size performance even when the parameters are unidentified or poorly identified. However, the asymptotic size and similarity (in a uniform sense) of these procedures have not been determined in the literature. This paper does so. This paper shows that the LM test has correct asymptotic size and is asymptotically similar for a suitably chosen parameter space of null distributions. It shows that the CLR tests also have these properties when the dimension p of the unknown parameter θ equals 1. When p ≥ 2, however, the asymptotic size properties are found to depend on how the conditioning statistic, upon which the CLR tests depend, is weighted. Two weighting methods have been suggested in the literature. The paper shows that the CLR tests are guaranteed to have correct asymptotic size when p ≥ 2 when the weighting is based on an estimator of the variance of the sample moments, i.e., moment-variance weighting, combined with the Robin and Smith (2000, Econometric Theory 16, 151–175) rank statistic. The paper also determines a formula for the asymptotic size of the CLR test when the weighting is based on an estimator of the variance of the sample Jacobian. However, the results of the paper do not guarantee correct asymptotic size when p ≥ 2 with the Jacobian-variance weighting, combined with the Robin and Smith (2000, Econometric Theory 16, 151–175) rank statistic, because two key sample quantities are not necessarily asymptotically independent under some identification scenarios. Analogous results for confidence sets are provided. Even for the special case of a linear instrumental variable regression model with two or more right-hand side endogenous variables, the results of the paper are new to the literature. 
    more » « less
  5. Abstract (WSN) using encrypted non-binary quantized data is studied. In a WSN, sensors transmit their observations to a fusion center through a wireless medium where the observations are susceptible to unauthorized eavesdropping. Encryption approaches for WSNs with fixed threshold binary quantization were previously explored. However, fixed threshold binary quantization limits parameter estimation to scalar parameters. In this paper, we propose a stochastic encryption approach for WSNs that can operate on non-binary quantized observations and has the capability for vector parameter estimation. We extend a binary stochastic encryption approach proposed previously, to a nonbinary generalized case. Sensor outputs are quantized using a quantizer with R + 1 levels, where R in {1.2. 3 ...}, encrypted by flipping them with certain flipping probabilities, and then transmitted. Optimal estimators using maximum-likelihood estimation are derived for both a legitimate fusion center (LFC) and a third party fusion center (TPFC) perspectives. We assume the TPFC is unaware of the encryption. Asymptotic analysis of the estimators is performed by deriving the Cramer-Rao lower bound for LFC estimation, and the asymptotic bias and variance for TPFC estimation. Numerical results validating the asymptotic analysis are presented. 
    more » « less