This paper develops a unified Lyapunov framework for finitesample analysis of a Markovian stochastic approximation (SA) algorithm under a contraction operator with respect to an arbitrary norm. The main novelty lies in the construction of a valid Lyapunov function called the generalized Moreau envelope. The smoothness and an approximation property of the generalized Moreau envelope enable us to derive a onestep Lyapunov drift inequality, which is the key to establishing the finitesample bounds. Our SA result has wide applications, especially in the context of reinforcement learning (RL). Specifically, we show that a large class of valuebased RL algorithms can be modeled in the exact form of our Markovian SA algorithm. Therefore, our SA results immediately imply finitesample guarantees for popular RL algorithms such as nstep temporal difference (TD) learning, TD(𝜆), offpolicy Vtrace, and Qlearning. As byproducts, by analyzing the convergence bounds of nstep TD and TD(𝜆), we provide theoretical insight into the problem about the efficiency of bootstrapping. Moreover, our finitesample bounds of offpolicy Vtrace explicitly capture the tradeoff between the variance of the stochastic iterates and the bias in the limit.
more »
« less
Empirical QValue Iteration
We propose a new simple and natural algorithm for learning the optimal Qvalue function of a discountedcost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Qlearning and actorcritic algorithms, this algorithm does not depend on a stochastic approximationbased method. We show that our algorithm, which we call the empirical Qvalue iteration algorithm, converges to the optimal Qvalue function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximationbased algorithms.
more »
« less
 NSFPAR ID:
 10214001
 Date Published:
 Journal Name:
 Stochastic Systems
 ISSN:
 19465238
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Astolfi, Alessandro (Ed.)Qlearning has become an important part of the reinforcement learning toolkit since its introduction in the dissertation of Chris Watkins in the 1980s. In the original tabular formulation, the goal is to compute exactly a solution to the discountedcost optimality equation, and thereby obtain the optimal policy for a Markov Decision Process. The goal today is more modest: obtain an approximate solution within a prescribed function class. The standard algorithms are based on the same architecture as formulated in the 1980s, with the goal of finding a value function approximation that solves the socalled projected Bellman equation. While reinforcement learning has been an active research area for over four decades, there is little theory providing conditions for convergence of these Qlearning algorithms, or even existence of a solution to this equation. The purpose of this paper is to show that a solution to the projected Bellman equation does exist, provided the function class is linear and the input used for training is a form of epsilongreedy policy with sufficiently small epsilon. Moreover, under these conditions it is shown that the Qlearning algorithm is stable, in terms of bounded parameter estimates. Convergence remains one of many open topics for research.more » « less

null (Ed.)Stochastic approximation (SA) algorithms are recursive techniques used to obtain the roots of functions that can be expressed as expectations of a noisy parameterized family of functions. In this paper two new SA algorithms are introduced: 1) PolSA, an extension of Polyak's momentum technique with a specially designed matrix momentum, and 2) NeSA, which can either be regarded as a variant of Nesterov's acceleration method, or a simplification of PolSA. The rates of convergence of SA algorithms is well understood. Under special conditions, the mean square error of the parameter estimates is bounded by σ 2 /n+o(1/n), where σ 2 ≥ 0 is an identifiable constant. If these conditions fail, the rate is typically sublinear. There are two well known SA algorithms that ensure a linear rate, with minimal value of variance, σ 2 : the RuppertPolyak averaging technique, and the stochastic NewtonRaphson (SNR) algorithm. It is demonstrated here that under mild technical assumptions, the PolSA algorithm also achieves this optimality criteria. This result is established via novel coupling arguments: It is shown that the parameter estimates obtained from the PolSA algorithm couple with those of the optimal variance (but computationally more expensive) SNR algorithm, at a rate O(1/n 2 ). The newly proposed algorithms are extended to a reinforcement learning setting to obtain new Qlearning algorithms, and numerical results confirm the coupling of PolSA and SNR.more » « less

The paper introduces the first formulation of convex Qlearning for Markov decision processes with function approximation. The algorithms and theory rest on a relaxation of a dual of Manne's celebrated linear programming characterization of optimal control. The main contributions firstly concern properties of the relaxation, described as a deterministic convex program: we identify conditions for a bounded solution, a significant connection between the solution to the new convex program, and the solution to standard Qlearning with linear function approximation. The second set of contributions concern algorithm design and analysis: (i) A direct modelfree method for approximating the convex program for Qlearning shares properties with its ideal. In particular, a bounded solution is ensured subject to a simple property of the basis functions; (ii) The proposed algorithms are convergent and new techniques are introduced to obtain the rate of convergence in a meansquare sense; (iii) The approach can be generalized to a range of performance criteria, and it is found that variance can be reduced by considering ``relative'' dynamic programming equations; (iv) The theory is illustrated with an application to a classical inventory control problem.more » « less

We study modelfree reinforcement learning (RL) algorithms for infinitehorizon averagereward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of modelfree RL algorithms is relatively inadequate for the averagereward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient modelfree algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCBAVG, based on an optimistic variant of variancereduced Qlearning. We show that UCBAVG achieves a regret bound $\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$ after $T$ steps, where $S\times A$ is the size of stateaction space, and $sp(h^*)$ the span of the optimal bias function. Our result provides the first computationally efficient modelfree algorithm that achieves the optimal dependence in $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $T$ or require strong assumptions of ergodicity or uniformly mixing of MDPs. In the simulator setting, we adapt the idea of UCBAVG to develop a modelfree algorithm that finds an $\epsilon$optimal policy with sample complexity $\widetilde{O}(SAsp^2(h^*)\epsilon^{2} + S^2Asp(h^*)\epsilon^{1}).$ This sample complexity is nearoptimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SAsp(^*)\epsilon^{2})$. Existing work mainly focuses on ergodic MDPs and the results typically depend on $t_{mix},$ the worstcase mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning an $\gamma$discounted MDP as an approximation, and leveraging referenceadvantage decomposition for variance in optimistic Qlearning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of valuedifference between state pairs to provide a sharper bound on the variance of reference advantage. We also crucially use a careful choice of the discounted factor $\gamma$ to balance approximation error due to discounting and the statistical learning error, and we are able to maintain a goodquality reference value function with $O(SA)$ space complexity.more » « less