Title: Feasible Q-Learning for Average Reward Reinforcement Learning
Average reward reinforcement learning (RL) provides a suitable framework for capturing the objective (i.e., the long-run average reward) of continuing tasks, where there is often no natural way to identify a discount factor. However, existing average reward RL algorithms with sample complexity guarantees are not feasible, as they take as input the (unknown) mixing time of the Markov decision process (MDP). In this paper, we make initial progress towards addressing this open problem. We design a feasible average-reward $$Q$$-learning framework that requires no knowledge of any problem parameter as input. Our framework is based on discounted $$Q$$-learning, and we dynamically adapt the discount factor (and hence the effective horizon) to progressively approximate the average reward. In the synchronous setting, we solve three tasks: (i) learn a policy that is $$\epsilon$$-close to optimal, (ii) estimate the optimal average reward with $$\epsilon$$-accuracy, and (iii) estimate the bias function (the analogue of the $$Q$$-function in the discounted case) with $$\epsilon$$-accuracy. We show that with carefully designed adaptation schemes, (i) can be achieved with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^{8}}{\epsilon^{8}})$$ samples, (ii) with $$\tilde{O}(\frac{SA t_{\mathrm{mix}}^5}{\epsilon^5})$$ samples, and (iii) with $$\tilde{O}(\frac{SA B}{\epsilon^9})$$ samples, where $$t_\mathrm{mix}$$ is the mixing time and $$B > 0$$ is an MDP-dependent constant. To our knowledge, we provide the first finite-sample guarantees that are polynomial in $$S, A, t_{\mathrm{mix}}, \epsilon$$ for a feasible variant of $$Q$$-learning. That said, the sample complexity bounds have tremendous room for improvement, which we leave for the community's best minds. Preliminary simulations verify that our framework is effective without prior knowledge of parameters as input.
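The abstract describes the framework only at a high level; the following is a minimal, illustrative Python sketch of the core mechanism it names: synchronous discounted Q-learning whose discount factor is pushed toward 1 over successive rounds, so that $$(1-\gamma)\max_{s,a} Q_\gamma^*(s,a)$$ approaches the optimal average reward (a standard fact for weakly communicating MDPs). The discount schedule, step size, and iteration budget below are placeholder assumptions, not the paper's analyzed adaptation scheme, and `sample_next_state` stands in for a generative model.

```python
import numpy as np

def feasible_q_learning(sample_next_state, reward, S, A, rounds=6, iters_per_round=20_000):
    """Illustrative sketch (not the paper's exact algorithm): synchronous
    discounted Q-learning whose discount factor gamma is increased round by
    round, so that the effective horizon 1/(1 - gamma) grows and
    (1 - gamma) * max Q approaches the optimal long-run average reward.

    Assumptions: `sample_next_state(s, a)` draws s' ~ P(.|s, a) from a
    generative model, `reward` is an (S, A) array of one-step rewards, and
    the geometric gamma schedule / step size below are placeholders.
    """
    Q = np.zeros((S, A))
    rho_hat = 0.0
    for k in range(rounds):
        gamma = 1.0 - 2.0 ** (-(k + 1))                  # effective horizon doubles each round
        for t in range(1, iters_per_round + 1):
            alpha = 1.0 / (1.0 + (1.0 - gamma) * t)      # rescaled linear step size
            for s in range(S):
                for a in range(A):
                    s_next = sample_next_state(s, a)     # one synchronous sample per (s, a)
                    target = reward[s, a] + gamma * Q[s_next].max()
                    Q[s, a] += alpha * (target - Q[s, a])
        rho_hat = (1.0 - gamma) * Q.max()                # current average-reward estimate
    return Q.argmax(axis=1), rho_hat                     # greedy policy and reward estimate
```

The final rescaling by $$(1-\gamma)$$ is what converts the discounted value scale back into an average-reward estimate; the paper's contribution is the adaptation scheme that decides how fast $$\gamma$$ may grow without a priori knowledge of $$t_{\mathrm{mix}}$$, which this sketch does not attempt to reproduce.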
Award ID(s):
2312204
PAR ID:
10529087
Author(s) / Creator(s):
Corporate Creator(s):
Editor(s):
Dasgupta, Sanjoy; Mandt, Stephan; Li, Yingzhen
Publisher / Repository:
Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, PMLR
Date Published:
Volume:
238
Page Range / eLocation ID:
1630--1638
Subject(s) / Keyword(s):
Reinforcement Learning, Distributionally Robust Reinforcement Learning, Q-Learning
Format(s):
Medium: X
Location:
https://proceedings.mlr.press/v238/jin24b/jin24b.pdf
Sponsoring Org:
National Science Foundation
More Like this
  1. We study model-free reinforcement learning (RL) algorithms for the infinite-horizon average-reward Markov decision process (MDP), which is more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, the theoretical understanding of model-free RL algorithms is relatively inadequate in the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $$\widetilde{O}(S^5A^2\,sp(h^*)\sqrt{T})$$ after $$T$$ steps, where $$S\times A$$ is the size of the state-action space and $$sp(h^*)$$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence in $$T$$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $$T$$ or require strong assumptions of ergodicity or uniform mixing of MDPs. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $$\epsilon$$-optimal policy with sample complexity $$\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $$\Omega(SA\,sp(h^*)\epsilon^{-2})$$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $$t_{\mathrm{mix}}$$, the worst-case mixing time induced by a policy. We remark that the diameter $$D$$ and mixing time $$t_{\mathrm{mix}}$$ are both lower bounded by $$sp(h^*)$$, and $$t_{\mathrm{mix}}$$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $$\gamma$$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of value differences between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $$\gamma$$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $$O(SA)$$ space complexity.
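As a concrete illustration of the reference-advantage decomposition this entry mentions, here is a hedged Python sketch of a single variance-reduced Q-learning step: the stochastic target is split into a frozen reference part, whose conditional expectation is pre-estimated from a batch, plus a low-variance advantage correction. The names (`Q_ref`, `ref_mean`) and the batching/optimism details are assumptions for illustration, not UCB-AVG itself.

```python
import numpy as np

def reference_advantage_step(Q, Q_ref, ref_mean, s, a, r, s_next, gamma, alpha):
    """Hedged sketch of one variance-reduced (reference-advantage) Q-learning
    update, in the spirit of the decomposition described above; optimism
    bonuses and the rule for refreshing the reference are omitted.

    Assumed inputs: Q and Q_ref are (S, A) arrays; `ref_mean[s, a]` is a
    batch estimate of E[max_a' Q_ref(s', a') | s, a], computed once while
    Q_ref is frozen.
    """
    v = Q[s_next].max()              # noisy value estimate at the sampled next state
    v_ref = Q_ref[s_next].max()      # frozen reference value at the same state
    advantage = v - v_ref            # low-variance correction: shrinks as Q approaches Q_ref
    target = r + gamma * (ref_mean[s, a] + advantage)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```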
  2. Sample complexity bounds are a common performance metric in the reinforcement learning literature. In the discounted-cost, infinite-horizon setting, all of the known bounds can be arbitrarily large as the discount factor approaches unity. These results seem to imply that a very large number of samples is required to achieve an $$\epsilon$$-optimal policy. The objective of the present work is to introduce a new class of algorithms whose sample complexity is uniformly bounded over all discount factors. One may argue that this is impossible, due to a recent min-max lower bound; the explanation is that these prior bounds concern value function approximation and not policy approximation. We show that the asymptotic covariance of the tabular Q-learning algorithm with an optimized step-size sequence is a quadratic function of a factor that goes to infinity as the discount factor approaches 1; this is an essentially known result. The new relative Q-learning algorithm proposed here is shown to have asymptotic covariance that is uniformly bounded over all discount factors.
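The relative Q-learning idea summarized above admits a compact tabular form; the sketch below shows one update in which the temporal-difference target subtracts a fixed scalar functional of Q, which recenters the iterates and is the mechanism behind the uniformly bounded asymptotic covariance. Using Q at a single reference state-action pair as that functional, and the constant `kappa`, are illustrative assumptions rather than the paper's exact construction.

```python
import numpy as np

def relative_q_step(Q, s, a, r, s_next, gamma, alpha, ref_sa=(0, 0), kappa=1.0):
    """Hedged sketch of one tabular *relative* Q-learning update.  The only
    change from standard Q-learning is the subtraction of a fixed scalar
    functional of Q from the TD target, which recenters the iterates; the
    specific functional (kappa times Q at the reference pair `ref_sa`) and
    the step-size choice are illustrative placeholders.
    """
    f_of_Q = kappa * Q[ref_sa]                          # fixed linear functional of Q
    td_target = r + gamma * Q[s_next].max() - f_of_Q    # relative TD target
    Q[s, a] += alpha * (td_target - Q[s, a])
    return Q
```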
  3. We study Markov potential games under the infinite-horizon average-reward criterion. Most previous studies have been for discounted rewards. We prove that algorithms based on both independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average-reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of the policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $$\epsilon$$-Nash equilibrium with time complexity $$O(\frac{1}{\epsilon^2})$$, given a gradient/differential Q-function oracle. When policy gradients have to be estimated, we propose an algorithm with $$\tilde{O}\big(\frac{1}{\min_{s,a}\pi(a|s)\,\delta}\big)$$ sample complexity to achieve approximation error $$\delta$$ with respect to the $$\ell_2$$ norm. Equipped with this estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $$\tilde{O}(1/\epsilon^5)$$. Simulation studies are presented.
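For the independent natural policy gradient method analyzed in this last entry, the softmax parameterization gives a particularly simple update: each agent adds a scaled differential advantage to its own logits, with no coordination between agents. The sketch below assumes an oracle `diff_q_oracle(i, policies)` returning agent i's differential Q-values, mirroring the oracle model in the entry; the function name, step size, and iteration count are illustrative, and the scaling of the update is only correct up to constants.

```python
import numpy as np

def softmax(theta):
    """Row-wise softmax over action logits, shape (S, A)."""
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def independent_npg(logits, diff_q_oracle, eta=0.05, iters=500):
    """Hedged sketch of independent natural policy gradient for an
    average-reward Markov potential game.  Each agent i keeps tabular softmax
    logits of shape (S, A_i) and, at every round, shifts them by a scaled
    differential advantage computed from an oracle-supplied differential
    Q-function (the closed form of the NPG update under softmax
    parameterization, up to scaling).  `diff_q_oracle`, eta, and iters are
    assumptions, not the paper's exact construction.
    """
    for _ in range(iters):
        policies = [softmax(th) for th in logits]
        updates = []
        for i, pi in enumerate(policies):
            q_i = diff_q_oracle(i, policies)                    # shape (S, A_i)
            adv = q_i - (pi * q_i).sum(axis=1, keepdims=True)   # differential advantage
            updates.append(adv)
        for th, adv in zip(logits, updates):
            th += eta * adv          # simultaneous, uncoordinated ascent
    return [softmax(th) for th in logits]
```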