Stochastic approximation (SA) is a widely used algorithmic approach in various fields, including optimization and reinforcement learning (RL). Among RL algorithms, Q-learning is particularly popular due to its empirical success. In this paper, we study asynchronous Q-learning with a constant stepsize, which is commonly used in practice for its fast convergence. By connecting constant-stepsize Q-learning to a time-homogeneous Markov chain, we show the distributional convergence of the iterates in Wasserstein distance and establish its exponential convergence rate. We also establish a Central Limit Theorem for the Q-learning iterates, demonstrating the asymptotic normality of the averaged iterates. Moreover, we provide an explicit expansion of the asymptotic bias of the averaged iterate in the stepsize: the bias is proportional to the stepsize up to higher-order terms, and we give an explicit expression for the linear coefficient. This precise characterization of the bias allows the application of the Richardson-Romberg (RR) extrapolation technique to construct a new estimate that is provably closer to the optimal Q-function. Numerical results corroborate our theoretical findings on the improvement achieved by RR extrapolation.
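To make the extrapolation step concrete, here is a minimal Python sketch of combining two constant-stepsize runs. The tabular environment interface (`num_states`, `num_actions`, `reset()`, `step(a)`), the uniform behavior policy, and the reward-maximization convention are illustrative assumptions rather than the paper's exact setup; the point is only that, when the averaged iterate has bias proportional to the stepsize, combining the runs as 2*Q_bar(alpha) - Q_bar(2*alpha) cancels the linear bias term.

```python
import numpy as np

def q_learning_avg(env, alpha, num_steps, gamma=0.99, seed=0):
    """Tabular asynchronous Q-learning with a constant stepsize; returns the
    running average of the iterates.  The environment interface is assumed."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((env.num_states, env.num_actions))
    Q_bar = np.zeros_like(Q)
    s = env.reset()
    for t in range(num_steps):
        a = int(rng.integers(env.num_actions))      # random behavior policy
        s_next, r, done = env.step(a)
        target = r + gamma * Q[s_next].max()
        Q[s, a] += alpha * (target - Q[s, a])       # constant-stepsize update
        Q_bar += (Q - Q_bar) / (t + 1)              # averaged (tail-mean) iterate
        s = env.reset() if done else s_next
    return Q_bar

def rr_extrapolate(env, alpha, num_steps):
    """Richardson-Romberg extrapolation: if bias(alpha) = c*alpha + o(alpha),
    then 2*Q_bar(alpha) - Q_bar(2*alpha) removes the linear bias term."""
    return 2 * q_learning_avg(env, alpha, num_steps) - q_learning_avg(env, 2 * alpha, num_steps)
```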
Empirical Q-Value Iteration
We propose a new, simple, and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not rely on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence, in the form of a nonasymptotic sample complexity bound, and show that an asynchronous (or online) version of the algorithm also works. Preliminary experimental results suggest that our algorithm reaches a ballpark estimate faster than stochastic approximation-based algorithms.
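A minimal sketch of the empirical Q-value iteration idea is given below, assuming access to a simulator `sample_next_state(s, a, rng)` and a known cost table `cost[s, a]`; the fixed batch size and iteration count are illustrative and need not match the paper's sampling schedule. Each iteration replaces the expectation in the Bellman optimality operator with an empirical average over simulated next states, with no stochastic-approximation stepsize involved.

```python
import numpy as np

def empirical_q_iteration(cost, sample_next_state, num_states, num_actions,
                          gamma=0.9, num_iters=200, batch=32, seed=0):
    """Empirical Q-value iteration (sketch): at every iteration, the expectation
    over next states is estimated by an empirical average of simulator samples."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_iters):
        Q_new = np.empty_like(Q)
        for s in range(num_states):
            for a in range(num_actions):
                samples = [sample_next_state(s, a, rng) for _ in range(batch)]
                # discounted-cost convention: minimize over actions at the next state
                Q_new[s, a] = cost[s, a] + gamma * np.mean([Q[sp].min() for sp in samples])
        Q = Q_new
    return Q
```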
- PAR ID: 10214001
- Date Published:
- Journal Name: Stochastic Systems
- ISSN: 1946-5238
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
This paper develops a unified Lyapunov framework for finite-sample analysis of a Markovian stochastic approximation (SA) algorithm under a contraction operator with respect to an arbitrary norm. The main novelty lies in the construction of a valid Lyapunov function called the generalized Moreau envelope. The smoothness and an approximation property of the generalized Moreau envelope enable us to derive a one-step Lyapunov drift inequality, which is the key to establishing the finite-sample bounds. Our SA result has wide applications, especially in the context of reinforcement learning (RL). Specifically, we show that a large class of value-based RL algorithms can be modeled in the exact form of our Markovian SA algorithm. Therefore, our SA results immediately imply finite-sample guarantees for popular RL algorithms such as n-step temporal difference (TD) learning, TD(𝜆), off-policy V-trace, and Q-learning. As byproducts, by analyzing the convergence bounds of n-step TD and TD(𝜆), we provide theoretical insight into the efficiency of bootstrapping in RL. Moreover, our finite-sample bounds for off-policy V-trace explicitly capture the tradeoff between the variance of the stochastic iterates and the bias in the limit.
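For concreteness, the sketch below writes tabular TD(0), one of the instances listed above, in a contractive Markovian SA form: the noise sequence is the underlying Markov chain of states, and the expected update operator is the Bellman policy-evaluation operator, a contraction (with factor gamma) in the sup norm. The transition matrix `P`, the state-dependent reward `R`, and the constant stepsize are illustrative assumptions, not the paper's general setting.

```python
import numpy as np

def td0_markovian_sa(P, R, gamma=0.95, alpha=0.05, num_steps=50_000, seed=0):
    """Tabular TD(0) as Markovian stochastic approximation:
    V <- V + alpha * (F(V, Y_k) - V) on the visited coordinate, where
    Y_k = (s_k, s_{k+1}) is generated by the Markov chain with kernel P."""
    rng = np.random.default_rng(seed)
    n = P.shape[0]
    V = np.zeros(n)
    s = int(rng.integers(n))
    for _ in range(num_steps):
        s_next = int(rng.choice(n, p=P[s]))                 # Markovian noise
        V[s] += alpha * (R[s] + gamma * V[s_next] - V[s])   # asynchronous update
        s = s_next
    return V
```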
-
Stochastic approximation (SA) algorithms are recursive techniques used to obtain the roots of functions that can be expressed as expectations of a noisy parameterized family of functions. In this paper two new SA algorithms are introduced: 1) PolSA, an extension of Polyak's momentum technique with a specially designed matrix momentum, and 2) NeSA, which can either be regarded as a variant of Nesterov's acceleration method or a simplification of PolSA. The rates of convergence of SA algorithms are well understood. Under special conditions, the mean square error of the parameter estimates is bounded by σ²/n + o(1/n), where σ² ≥ 0 is an identifiable constant. If these conditions fail, the rate is typically sub-linear. There are two well-known SA algorithms that ensure a linear rate with the minimal variance σ²: the Ruppert-Polyak averaging technique and the stochastic Newton-Raphson (SNR) algorithm. It is demonstrated here that, under mild technical assumptions, the PolSA algorithm also achieves this optimality criterion. This result is established via novel coupling arguments: it is shown that the parameter estimates obtained from the PolSA algorithm couple with those of the optimal-variance (but computationally more expensive) SNR algorithm at rate O(1/n²). The newly proposed algorithms are extended to a reinforcement learning setting to obtain new Q-learning algorithms, and numerical results confirm the coupling of PolSA and SNR.
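The Ruppert-Polyak averaging technique mentioned above is simple to sketch. The toy example below applies it to a generic stochastic-gradient recursion with a hypothetical noisy oracle `grad_sample`; it is not the PolSA or NeSA construction (the matrix-momentum design is not reproduced here), only an illustration of the averaging that, under standard conditions, attains the optimal asymptotic variance σ²/n.

```python
import numpy as np

def sa_polyak_ruppert(grad_sample, theta0, num_steps=100_000, seed=0):
    """Stochastic approximation with Ruppert-Polyak averaging: a plain SA
    recursion with slowly decaying stepsizes, returning the running average
    of the iterates.  `grad_sample(theta, rng)` is an assumed noisy oracle."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    theta_bar = theta.copy()
    for n in range(1, num_steps + 1):
        step = n ** -0.75                      # alpha_n = n^{-rho}, rho in (1/2, 1)
        theta = theta - step * grad_sample(theta, rng)
        theta_bar += (theta - theta_bar) / n   # averaged estimate
    return theta_bar
```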
-
Q-learning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. In the original tabular formulation, the goal is to compute exactly a solution to the discounted-cost optimality equation and thereby obtain the optimal policy for a Markov decision process. The goal today is more modest: obtain an approximate solution within a prescribed function class. The standard algorithms are based on the same architecture as formulated in the 1980s, with the goal of finding a value function approximation that solves the so-called projected Bellman equation. While reinforcement learning has been an active research area for over four decades, there is little theory providing conditions for convergence of these Q-learning algorithms, or even for the existence of a solution to this equation. The purpose of this paper is to show that a solution to the projected Bellman equation does exist, provided the function class is linear and the input used for training is a form of epsilon-greedy policy with sufficiently small epsilon. Moreover, under these conditions it is shown that the Q-learning algorithm is stable, in the sense of bounded parameter estimates. Convergence remains one of many open topics for research.
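A minimal sketch of this setting is given below: Watkins-style Q-learning over a linear function class Q_theta(x, u) = phi(x, u)^T theta, trained with an epsilon-greedy input; a fixed point of the expected update, when it exists, solves the projected Bellman equation discussed above. The feature map `phi`, the environment interface, and the cost-minimization convention are illustrative assumptions.

```python
import numpy as np

def linear_q_learning(env, phi, d, epsilon=0.1, alpha=0.01, gamma=0.95,
                      num_steps=200_000, seed=0):
    """Q-learning with linear function approximation and an epsilon-greedy
    input (assumed interfaces: phi(x, u) -> R^d, env.reset(), env.step(u))."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(d)
    x = env.reset()
    for _ in range(num_steps):
        q_vals = np.array([phi(x, u) @ theta for u in range(env.num_actions)])
        # epsilon-greedy with respect to the current estimate (cost minimization)
        u = int(rng.integers(env.num_actions)) if rng.random() < epsilon else int(q_vals.argmin())
        x_next, c, done = env.step(u)
        q_next = min(phi(x_next, v) @ theta for v in range(env.num_actions))
        theta += alpha * (c + gamma * q_next - phi(x, u) @ theta) * phi(x, u)
        x = env.reset() if done else x_next
    return theta
```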
-
We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $$\widetilde{O}(S^5 A^2 sp(h^*)\sqrt{T})$$ after $$T$$ steps, where $$S \times A$$ is the size of the state-action space and $$sp(h^*)$$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $$T$$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $$T$$ or require strong assumptions such as ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $$\epsilon$$-optimal policy with sample complexity $$\widetilde{O}(SA\, sp^2(h^*)\epsilon^{-2} + S^2 A\, sp(h^*)\epsilon^{-1})$$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $$\Omega(SA\, sp(h^*)\epsilon^{-2})$$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $$t_{mix}$$, the worst-case mixing time induced by a policy. We remark that the diameter $$D$$ and the mixing time $$t_{mix}$$ are both lower bounded by $$sp(h^*)$$, and $$t_{mix}$$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $$\gamma$$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value differences between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $$\gamma$$ to balance the approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $$O(SA)$$ space complexity.
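The sketch below illustrates only the discounted-approximation idea mentioned in the abstract: solve a gamma-discounted proxy of the average-reward MDP with gamma close to 1, then use its greedy policy. It is emphatically not the UCB-AVG algorithm; the choice of gamma (balancing a discounting error on the order of (1 - gamma) * sp(h*) against epsilon), the assumption of known P and R, and the plain value-iteration planner are heuristic, illustrative assumptions.

```python
import numpy as np

def discounted_proxy_policy(P, R, epsilon, sp_hstar, tol=1e-8, max_iters=100_000):
    """Illustrative reduction: approximate an average-reward MDP by a
    gamma-discounted one and return the greedy policy of the proxy.
    P has shape (S, A, S); R has shape (S, A); the gamma choice is heuristic."""
    gamma = 1.0 - epsilon / max(sp_hstar, epsilon)   # assumed balancing choice
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(max_iters):                       # plain value iteration
        Q_new = R + gamma * (P @ Q.max(axis=1))      # Bellman optimality backup
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    return Q.argmax(axis=1)                          # greedy policy of the proxy
```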