Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning

Zhou, Zhengqing and


While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness, or the lack thereof, remains an important issue that has been inadequately addressed. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to a future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e., how robust the policy is in a perturbed environment) of any given policy and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $$O_P\left(1/\sqrt{n}\right)$$ regret, meaning that with high probability, the policy learned from $$n$$ training data points will be $$O\left(1/\sqrt{n}\right)$$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms.
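To make the notion of a "robust value" concrete, the following is a minimal sketch of robust policy evaluation in a tabular MDP. It is not the paper's algorithm. It assumes a KL-divergence ball of radius `delta` around each nominal transition row (the abstract does not specify the uncertainty set), known deterministic rewards, and a deterministic policy. The function names `kl_robust_mean` and `robust_policy_evaluation` are hypothetical. The worst-case expectation uses the standard dual form $$\sup_{\lambda>0}\{-\lambda\log E_p[e^{-v/\lambda}]-\lambda\delta\}$$, approximated here on a grid, so the result is a slightly conservative lower bound.

```python
import numpy as np


def kl_robust_mean(p, v, delta, n_grid=200):
    """Lower bound on inf E_q[v] over {q : KL(q || p) <= delta}.

    Assumption: KL uncertainty set, dual maximised on a log-spaced grid.
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    support = p > 0                      # q must be absolutely continuous w.r.t. p
    p_s, v_s = p[support], v[support]
    nominal = float(p_s @ v_s)
    v_min = float(v_s.min())
    spread = float(v_s.max()) - v_min
    if delta <= 0 or spread == 0:
        return nominal                   # no perturbation, or v is constant
    lambdas = spread * np.logspace(-3, 3, n_grid)
    # Shift by v_min so every exponent is <= 0 (numerically stable).
    z = -(v_s[None, :] - v_min) / lambdas[:, None]
    log_mgf = np.log(np.exp(z) @ p_s)
    dual = v_min - lambdas * log_mgf - lambdas * delta
    # The worst case can never fall below the smallest supported value.
    return float(max(dual.max(), v_min))


def robust_policy_evaluation(P, r, policy, gamma, delta, tol=1e-8, max_iter=10_000):
    """Fixed-point iteration V(s) = r(s, a) + gamma * worst-case E[V(s')].

    P has shape (S, A, S), r has shape (S, A), policy[s] is an action index.
    """
    n_states = r.shape[0]
    V = np.zeros(n_states)
    for _ in range(max_iter):
        V_new = np.array([
            r[s, policy[s]] + gamma * kl_robust_mean(P[s, policy[s]], V, delta)
            for s in range(n_states)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V


# Hypothetical two-state, two-action MDP used only for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
policy = [0, 1]
gamma = 0.9

V_nominal = robust_policy_evaluation(P, r, policy, gamma, delta=0.0)
V_robust = robust_policy_evaluation(P, r, policy, gamma, delta=0.1)
```

With `delta=0.0` the routine reduces to ordinary policy evaluation. A positive `delta` can only lower the value, which reflects the pessimism built into a distributionally robust objective.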

Award ID(s): 1915967

PAR ID: 10344976

Author(s) / Creator(s): Zhou, Zhengqing and

Editor(s): Banerjee, Arindam and

Date Published: 2021-01-01

Journal Name: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics

Volume: 130

Issue: 2021

Page Range / eLocation ID: 3331--3339

Format(s): Medium: X

Sponsoring Org: National Science Foundation

The DOI is not currently available.
