The theory and application of mean field games has grown significantly since its origins less than two decades ago. This paper considers a special class in which the game is cooperative, and the cost includes a control penalty defined by KullbackLeibler divergence, as commonly used in reinforcement learning and other fields. Its use as a control cost or regularizer is often preferred because this leads to an attractive solution. This paper considers a particular control paradigm called KullbackLeibler Quadratic (KLQ) optimal control, and arrives at the following conclusions: 1. in application to distributed control of electric loads, a new modeling technique is introduced to obtain a simple Markov model for each load (the `agent' in mean field theory). 2. It is argued that the optimality equations may be solved using MonteCarlo techniquesa specialized version of stochastic gradient descent (SGD). 3. The use of averaging minimizes the asymptotic covariance in the SGD algorithm; the form of the optimal covariance is identified for the first time.
more »
« less
This content will become publicly available on December 13, 2024
The Curse of Memory in Stochastic Approximation
Theory and application of stochastic approximation (SA) has grown within the control systems community since the earliest days of adaptive control.
This paper takes a new look at the topic, motivated by recent results establishing remarkable performance of SA with (sufficiently small) constant stepsize \alpha>0. If averaging is implemented to obtain the final parameter estimate, then the estimates are asymptotically unbiased with nearly optimal asymptotic covariance. These results have been obtained for random linear SA recursions with i.i.d.\ coefficients.
This paper obtains very different conclusions in the more common case of geometrically ergodic Markovian disturbance: (i) The target bias is identified, even in the case of nonlinear SA, and is in general nonzero. The remaining results are established for linear SA recursions:
(ii) the bivariate parameterdisturbance process is geometrically ergodic in a topological sense;
(iii) the representation for bias has a simpler form in this case, and cannot be expected to be zero if there is multiplicative noise;
(iv) the asymptotic covariance of the averaged parameters is within O(\alpha) of optimal. The error term is identified, and may be massive if mean dynamics are not well conditioned. The theory is illustrated with application to TDlearning.
more »
« less
 Award ID(s):
 1935389
 NSFPAR ID:
 10483719
 Publisher / Repository:
 IEEE
 Date Published:
 Journal Name:
 Proceedings of the IEEE Conference on Decision Control
 ISSN:
 07431546
 Format(s):
 Medium: X
 Location:
 Singapore
 Sponsoring Org:
 National Science Foundation
More Like this


This paper concerns error bounds for recursive equations subject to Markovian disturbances. Motivating examples abound within the fields of Markov chain Monte Carlo (MCMC) and Reinforcement Learning (RL), and many of these algorithms can be interpreted as special cases of stochastic approximation (SA). It is argued that it is not possible in general to obtain a Hoeffding bound on the error sequence, even when the underlying Markov chain is reversible and geometrically ergodic, such as the M/M/1 queue. This is motivation for the focus on mean square error bounds for parameter estimates. It is shown that mean square error achieves the optimal rate of $O(1/n)$, subject to conditions on the stepsize sequence. Moreover, the exact constants in the rate are obtained, which is of great value in algorithm designmore » « less

We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{2} + S^2Asp(h^*)\epsilon^{1}).$ This sample complexity is nearoptimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worstcase mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$discounted MDP as an approximation, and leveraging referenceadvantage decomposition for variance in optimistic Qlearning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of valuedifference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a goodquality reference value function with $O(SA)$ space complexity.more » « less

An influential paper by Kleibergen (2005, Econometrica 73, 1103–1123) introduces Lagrange multiplier (LM) and conditional likelihood ratiolike (CLR) tests for nonlinear moment condition models. These procedures aim to have good size performance even when the parameters are unidentified or poorly identified. However, the asymptotic size and similarity (in a uniform sense) of these procedures have not been determined in the literature. This paper does so. This paper shows that the LM test has correct asymptotic size and is asymptotically similar for a suitably chosen parameter space of null distributions. It shows that the CLR tests also have these properties when the dimension p of the unknown parameter θ equals 1. When p ≥ 2, however, the asymptotic size properties are found to depend on how the conditioning statistic, upon which the CLR tests depend, is weighted. Two weighting methods have been suggested in the literature. The paper shows that the CLR tests are guaranteed to have correct asymptotic size when p ≥ 2 when the weighting is based on an estimator of the variance of the sample moments, i.e., momentvariance weighting, combined with the Robin and Smith (2000, Econometric Theory 16, 151–175) rank statistic. The paper also determines a formula for the asymptotic size of the CLR test when the weighting is based on an estimator of the variance of the sample Jacobian. However, the results of the paper do not guarantee correct asymptotic size when p ≥ 2 with the Jacobianvariance weighting, combined with the Robin and Smith (2000, Econometric Theory 16, 151–175) rank statistic, because two key sample quantities are not necessarily asymptotically independent under some identification scenarios. Analogous results for confidence sets are provided. Even for the special case of a linear instrumental variable regression model with two or more righthand side endogenous variables, the results of the paper are new to the literature.more » « less

Abstract (WSN) using encrypted nonbinary quantized data is studied. In a WSN, sensors transmit their observations to a fusion center through a wireless medium where the observations are susceptible to unauthorized eavesdropping. Encryption approaches for WSNs with fixed threshold binary quantization were previously explored. However, fixed threshold binary quantization limits parameter estimation to scalar parameters. In this paper, we propose a stochastic encryption approach for WSNs that can operate on nonbinary quantized observations and has the capability for vector parameter estimation. We extend a binary stochastic encryption approach proposed previously, to a nonbinary generalized case. Sensor outputs are quantized using a quantizer with R + 1 levels, where R in {1.2. 3 ...}, encrypted by flipping them with certain flipping probabilities, and then transmitted. Optimal estimators using maximumlikelihood estimation are derived for both a legitimate fusion center (LFC) and a third party fusion center (TPFC) perspectives. We assume the TPFC is unaware of the encryption. Asymptotic analysis of the estimators is performed by deriving the CramerRao lower bound for LFC estimation, and the asymptotic bias and variance for TPFC estimation. Numerical results validating the asymptotic analysis are presented.more » « less