Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

Agrawal, Shipra; Jia, Randy

doi:10.1287/moor.2022.1266

Citation Details

Optimistic Posterior Sampling for Reinforcement Learning: Worst-Case Regret Bounds

We present an algorithm based on posterior sampling (aka Thompson sampling) that achieves near-optimal worst-case regret bounds when the underlying Markov decision process (MDP) is communicating with a finite, although unknown, diameter. Our main result is a high probability regret upper bound of [Formula: see text] for any communicating MDP with S states, A actions, and diameter D. Here, regret compares the total reward achieved by the algorithm to the total expected reward of an optimal infinite-horizon undiscounted average reward policy in time horizon T. This result closely matches the known lower bound of [Formula: see text]. Our techniques involve proving some novel results about the anti-concentration of Dirichlet distribution, which may be of independent interest. more »

Award ID(s):: 1846792

PAR ID:: 10374182

Author(s) / Creator(s):: Agrawal, Shipra; Jia, Randy

Date Published:: 2022-05-01

Journal Name:: Mathematics of Operations Research

ISSN:: 0364-765X

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
https://doi.org/10.1287/moor.2022.1266

More Like this