Title: Optimal Sample Complexity for Average Reward Markov Decision Processes
We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $$\widetilde{O}(|S||A|t_{\mathrm{mix}}^{2}\epsilon^{-2})$$ and a lower bound of $$\Omega(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$$. In these expressions, $$|S|$$ and $$|A|$$ denote the cardinalities of the state and action spaces respectively, $$t_{\mathrm{mix}}$$ serves as a uniform upper limit for the total variation mixing times, and $$\epsilon$$ signifies the error tolerance. Therefore, a notable gap of $$t_{\mathrm{mix}}$$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $$\widetilde{O}(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
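Laid out side by side (notation as in the abstract, with $$\widetilde{O}(\cdot)$$ hiding logarithmic factors):

Prior upper bound: $$\widetilde{O}(|S||A|t_{\mathrm{mix}}^{2}\epsilon^{-2})$$
Minimax lower bound: $$\Omega(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$$
This work: $$\widetilde{O}(|S||A|t_{\mathrm{mix}}\epsilon^{-2})$$, matching the lower bound up to logarithmic factors and removing the extra factor of $$t_{\mathrm{mix}}$$.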
Award ID(s):
2229011
PAR ID:
10533336
Author(s) / Creator(s):
; ;
Publisher / Repository:
The Twelfth International Conference on Learning Representations (ICLR)
Date Published:
Subject(s) / Keyword(s):
Optimal sample complexity, Average Reward, Reinforcement Learning, Markov Decision Processes
Format(s):
Medium: X
Location:
https://openreview.net/forum?id=jOm5p3q7c7
Sponsoring Org:
National Science Foundation
More Like this
  1. We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), which are more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $$\widetilde{O}(S^5A^2sp(h^*)\sqrt{T})$$ after $$T$$ steps, where $$S\times A$$ is the size of the state-action space and $$sp(h^*)$$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $$T$$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results either are suboptimal in $$T$$ or require strong assumptions of ergodicity or uniform mixing of the MDP. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $$\epsilon$$-optimal policy with sample complexity $$\widetilde{O}(SA\,sp^2(h^*)\epsilon^{-2} + S^2A\,sp(h^*)\epsilon^{-1})$$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $$\Omega(SA\,sp(h^*)\epsilon^{-2})$$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $$t_{mix}$$, the worst-case mixing time induced by a policy. We remark that the diameter $$D$$ and mixing time $$t_{mix}$$ are both lower bounded by $$sp(h^*)$$, and $$t_{mix}$$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $$\gamma$$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition for variance reduction in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value difference between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $$\gamma$$ to balance the approximation error due to discounting and the statistical learning error, and we are able to maintain a good-quality reference value function with $$O(SA)$$ space complexity.
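As a purely illustrative sketch of the reference-advantage idea described in this abstract (the function name, inputs, and bonus term below are our own assumptions, not the paper's UCB-AVG pseudocode), the target for one state-action pair can be split into a low-variance reference part estimated from a large reusable batch and a small-magnitude advantage part estimated from fresh samples:

import numpy as np

def variance_reduced_target(ref_V, V, reward, next_states_ref, next_states_new, gamma, bonus):
    """Illustrative variance-reduced optimistic target for one (s, a) pair.

    ref_V : reference value estimates built once from a large batch
            (low variance, possibly slightly stale).
    V     : current value estimates.
    The advantage term V(s') - ref_V(s') is small in magnitude once ref_V is
    accurate, so averaging it over a modest number of fresh samples already
    has low variance; this mirrors the reference-advantage decomposition the
    abstract refers to, combined with an optimism bonus.
    """
    reference_part = ref_V[next_states_ref].mean()                           # heavy reuse of old samples
    advantage_part = (V[next_states_new] - ref_V[next_states_new]).mean()    # fresh, small-magnitude correction
    return reward + gamma * (reference_part + advantage_part) + bonus

# Example with random placeholder data:
rng = np.random.default_rng(0)
S = 10
ref_V = rng.random(S)
V = ref_V + 0.01 * rng.standard_normal(S)
print(variance_reduced_target(ref_V, V, reward=1.0,
                              next_states_ref=rng.integers(0, S, 1000),
                              next_states_new=rng.integers(0, S, 20),
                              gamma=0.99, bonus=0.05))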
  2. We present an algorithm based on posterior sampling (a.k.a. Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high-probability regret upper bound of [Formula: see text] for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy in time horizon T. This result closely matches the known lower bound of [Formula: see text]. Our techniques involve proving some novel results about the anti-concentration of the Dirichlet distribution, which may be of independent interest.
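A minimal sketch of the posterior sampling step, assuming independent Dirichlet posteriors over the transition rows P(· | s, a) (the function name and the symmetric prior are our own illustrative choices; the average-reward planning step in the sampled model is omitted):

import numpy as np

rng = np.random.default_rng(0)

def sample_transition_model(counts, prior=1.0):
    """Draw one plausible MDP from the posterior (illustration only).

    counts[s, a, s2] holds observed transition counts.  With a symmetric
    Dirichlet(prior) prior, the posterior of row (s, a) is
    Dirichlet(prior + counts[s, a, :]); posterior sampling plans in the
    sampled model and follows the resulting policy for an epoch before
    resampling.
    """
    S, A, _ = counts.shape
    P = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            P[s, a] = rng.dirichlet(prior + counts[s, a])
    return P

# Example: 3 states, 2 actions, a handful of observed transitions.
counts = rng.integers(0, 5, size=(3, 2, 3))
P_sampled = sample_transition_model(counts)
assert np.allclose(P_sampled.sum(axis=-1), 1.0)   # each sampled row is a distribution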
  3. We study Markov potential games under the infinite-horizon average reward criterion. Most previous studies have been for discounted rewards. We prove that both algorithms based on independent policy gradient and independent natural policy gradient converge globally to a Nash equilibrium for the average reward criterion. To set the stage for gradient-based methods, we first establish that the average reward is a smooth function of policies and provide sensitivity bounds for the differential value functions, under certain conditions on ergodicity and the second largest eigenvalue of the underlying Markov decision process (MDP). We prove that three algorithms, policy gradient, proximal-Q, and natural policy gradient (NPG), converge to an $$\epsilon$$-Nash equilibrium with time complexity $$O(1/\epsilon^{2})$$, given a gradient/differential Q function oracle. When policy gradients have to be estimated, we propose an algorithm with $$\widetilde{O}\big(\tfrac{1}{\min_{s,a}\pi(a|s)\,\delta}\big)$$ sample complexity to achieve a $$\delta$$ approximation error w.r.t. the $$\ell_2$$ norm. Equipped with the estimator, we derive the first sample complexity analysis for a policy gradient ascent algorithm, featuring a sample complexity of $$\widetilde{O}(1/\epsilon^{5})$$. Simulation studies are presented.
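A generic sketch of one round of independent policy gradient ascent under a direct tabular parameterization, assuming a gradient oracle as in the abstract (the function name, step size handling, and the renormalization used in place of an exact simplex projection are our own simplifications, not the paper's algorithm):

import numpy as np

def independent_pg_round(policies, grad_oracle, eta):
    """One synchronous round of independent policy gradient ascent.

    policies[i] is agent i's policy as an (S, A_i) array of action
    probabilities; grad_oracle(i, policies) is assumed to return the gradient
    of the long-run average reward with respect to policies[i].  Each agent
    updates using only its own gradient, with no coordination, which is the
    'independent' scheme analyzed in the abstract.
    """
    updated = []
    for i, pi in enumerate(policies):
        step = pi + eta * grad_oracle(i, policies)   # gradient ascent on the average reward
        step = np.clip(step, 1e-8, None)             # crude stand-in for a simplex projection
        updated.append(step / step.sum(axis=1, keepdims=True))
    return updated

Repeating such rounds is the kind of iteration whose convergence to an approximate Nash equilibrium the abstract analyzes for Markov potential games.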
  4. We study the problem of multi-agent reinforcement learning (MARL) with adaptivity constraints -- a new problem motivated by real-world applications where deployments of new policies are costly and the number of policy updates must be minimized. For two-player zero-sum Markov Games, we design a (policy) elimination based algorithm that achieves a regret of $$\widetilde{O}(\sqrt{H^{3}S^{2}ABK})$$, while the batch complexity is only $$O(H+\log\log K)$$. In the above, S denotes the number of states, A and B are the numbers of actions for the two players respectively, H is the horizon, and K is the number of episodes. Furthermore, we prove a batch complexity lower bound $$\Omega\big(\tfrac{H}{\log_{A}K}+\log\log K\big)$$ for all algorithms with an $$\widetilde{O}(\sqrt{K})$$ regret bound, which matches our upper bound up to logarithmic factors. As a byproduct, our techniques naturally extend to learning bandit games and reward-free MARL within near-optimal batch complexity. To the best of our knowledge, this is the first line of results towards understanding MARL with low adaptivity.
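For intuition about the $$O(\log\log K)$$ term in the batch complexity, the snippet below computes the standard schedule whose batch endpoints sit near $$K^{1-2^{-i}}$$; this is a textbook construction shown only for illustration and is not claimed to be the paper's schedule:

import math

def batch_endpoints(K):
    """Batch schedule with O(log log K) batches (illustrative).

    Batch i ends at roughly K ** (1 - 2 ** -i), so the exponent's gap to 1
    halves each batch and about log2(log2(K)) batches suffice to reach K.
    Schedules of this shape are the usual route to sqrt(K)-type regret with
    very few policy switches.
    """
    num_batches = max(1, math.ceil(math.log2(math.log2(K)))) + 1
    ends = [int(K ** (1.0 - 2.0 ** (-i))) for i in range(1, num_batches)]
    ends.append(K)                       # final batch runs to the horizon
    return ends

print(batch_endpoints(10**6))            # 6 endpoints, the last equal to 1000000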
  5. We revisit the much-studied problem of space-efficiently estimating the number of triangles in a graph stream, and extensions of this problem to counting fixed-sized cliques and cycles, obtaining a number of new upper and lower bounds. For the important special case of counting triangles, we give a $$4$$-pass, $$(1\pm\varepsilon)$$-approximate, randomized algorithm that needs at most $$\widetilde{O}(\varepsilon^{-2}\cdot m^{3/2}/T)$$ space, where $$m$$ is the number of edges and $$T$$ is a promised lower bound on the number of triangles. This matches the space bound of a very recent algorithm (McGregor et al., PODS 2016), with an arguably simpler and more general technique. We give an improved multi-pass lower bound of $$\Omega(\min\{m^{3/2}/T, m/\sqrt{T}\})$$, applicable at essentially all densities $$\Omega(n) \le m \le O(n^2)$$. We also prove other multi-pass lower bounds in terms of various structural parameters of the input graph. Together, our results resolve a couple of open questions raised in recent work (Braverman et al., ICALP 2013). Our presentation emphasizes more general frameworks, for both upper and lower bounds. We give a sampling algorithm for counting arbitrary subgraphs and then improve it via combinatorial means in the special cases of counting odd cliques and odd cycles. Our results show that these problems are considerably easier in the cash-register streaming model than in the turnstile model, where previous work had focused (Manjunath et al., ESA 2011; Kane et al., ICALP 2012). We use Turán graphs and related gadgets to derive lower bounds for counting cliques and cycles, with triangle-counting lower bounds following as a corollary.
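As a toy companion to the edge-sampling framework discussed above, the following two-pass estimator samples edges uniformly and rescales; it is our own illustration of the scaling argument, not the paper's 4-pass algorithm (the second pass stores the full adjacency for brevity, which a genuinely space-efficient streaming algorithm would avoid):

import random

def estimate_triangles(edge_stream, m, k, seed=0):
    """Toy two-pass triangle estimator via uniform edge sampling.

    edge_stream() yields the stream of edges (u, v); m is the number of
    edges and k the number of sampled edges.  Each triangle contains three
    edges, so if t_e is the number of triangles through a uniformly random
    edge e, then E[t_e] = 3T / m and the scaling below is unbiased for T.
    """
    rng = random.Random(seed)

    # Pass 1: reservoir-sample k edges uniformly at random.
    sample = []
    for i, (u, v) in enumerate(edge_stream()):
        if len(sample) < k:
            sample.append((u, v))
        elif rng.random() < k / (i + 1):
            sample[rng.randrange(k)] = (u, v)

    # Pass 2: count triangles through each sampled edge (full adjacency kept
    # only to keep this toy version short).
    adj = {}
    for u, v in edge_stream():
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    hits = sum(len(adj[u] & adj[v]) for (u, v) in sample)
    return m * hits / (3 * k)

# Example on a 4-clique, which contains exactly four triangles:
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(estimate_triangles(lambda: iter(edges), m=len(edges), k=3))   # -> 4.0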