Creators/Authors contains: "Wang, Shengbo"

  1. We resolve the open question regarding the sample complexity of policy learning for maximizing the long-run average reward associated with a uniformly ergodic Markov decision process (MDP), assuming a generative model. In this context, the existing literature provides a sample complexity upper bound of $O(|S||A| t_{\mathrm{mix}}^2 \epsilon^{-2})$ and a lower bound of $\Omega(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. In these expressions, $|S|$ and $|A|$ denote the cardinalities of the state and action spaces respectively, $t_{\mathrm{mix}}$ serves as a uniform upper bound on the total-variation mixing times, and $\epsilon$ denotes the error tolerance. Therefore, a notable gap of $t_{\mathrm{mix}}$ still remains to be bridged. Our primary contribution is the development of an estimator for the optimal policy of average reward MDPs with a sample complexity of $\widetilde{O}(|S||A| t_{\mathrm{mix}} \epsilon^{-2})$. This marks the first algorithm and analysis to reach the literature's lower bound. Our new algorithm draws inspiration from ideas in Li et al. (2020), Jin & Sidford (2021), and Wang et al. (2023). Additionally, we conduct numerical experiments to validate our theoretical findings.
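For context, "assuming a generative model" means the learner can query a simulator for i.i.d. next-state samples from $P(\cdot \mid s, a)$ at any state-action pair, rather than being restricted to a single trajectory. The following is a minimal Python sketch of that access model together with a naive plug-in estimate of the transition kernel built from it; it is illustrative only, is not the paper's estimator, and the class/function names and toy problem sizes are invented for the example.

```python
import numpy as np

class GenerativeModel:
    """Hypothetical simulator access to a tabular MDP with |S| states and |A| actions."""

    def __init__(self, P, r, rng=None):
        # P: transition kernel of shape (|S|, |A|, |S|); r: reward table of shape (|S|, |A|)
        self.P, self.r = P, r
        self.rng = rng or np.random.default_rng(0)

    def sample_next_state(self, s, a, n=1):
        # Draw n i.i.d. next states from P(. | s, a) -- the "generative model" query.
        return self.rng.choice(self.P.shape[2], size=n, p=self.P[s, a])


def empirical_kernel(model, n_per_pair):
    """Plug-in estimate of the transition kernel from n samples per (s, a) pair."""
    S, A, _ = model.P.shape
    P_hat = np.zeros_like(model.P)
    for s in range(S):
        for a in range(A):
            draws = model.sample_next_state(s, a, n=n_per_pair)
            P_hat[s, a] = np.bincount(draws, minlength=S) / n_per_pair
    return P_hat


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    S, A = 4, 2                                   # toy sizes, chosen for the example
    P = rng.dirichlet(np.ones(S), size=(S, A))    # random toy transition kernel
    r = rng.uniform(size=(S, A))
    model = GenerativeModel(P, r, rng)
    P_hat = empirical_kernel(model, n_per_pair=1000)
    print("max entrywise kernel estimation error:", np.abs(P_hat - P).max())
```

The abstract's result concerns how many such queries in total, as a function of $|S|$, $|A|$, $t_{\mathrm{mix}}$, and $\epsilon$, suffice to output a near-optimal policy for the average-reward objective; the sketch above only illustrates the sampling interface, not the policy-learning procedure itself.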