NSF PAR Search | NSF Public Access Repository

Unbiased Optimal Stopping via the MUSE

https://doi.org/10.1016/j.spa.2022.12.007

Zhou, Zhengqing; Wang, Guanyang; Blanchet, Jose H.; Glynn, Peter W. (December 2022, Stochastic Processes and their Applications)

Full Text Available
Distributionally Robust Q-Learning

Liu, Zijian; Bai, Qinxun; Blanchet, Jose H.; Dong, Perry; Xu, Wei; Zhou, Zhengqing; Zhou, Zhengyuan (July 2022, Proceedings of Machine Learning Research)
Chaudhuri, Kamalika; Jegelka, Stefanie; Song, Le; Szepesvari, Csaba; Niu, Gang; Sabato, Sivan (Ed.)
Reinforcement learning (RL) has demonstrated remarkable achievements in simulated environments. However, carrying this success to real environments requires the important attribute of robustness, which existing RL algorithms often lack because they assume that the future deployment environment is the same as the training environment (i.e., the simulator) in which the policy is learned. This assumption often fails because of the discrepancy between the simulator and the real environment and, as a result, renders the learned policy fragile when deployed. In this paper, we propose a novel distributionally robust Q-learning algorithm that learns the best policy under the worst distributional perturbation of the environment. Our algorithm first transforms the infinite-dimensional learning problem (since the environment MDP perturbation lies in an infinite-dimensional space) into a finite-dimensional dual problem and subsequently uses a multi-level Monte-Carlo scheme to approximate the dual value using samples from the simulator. Despite the complexity, we show that the resulting distributionally robust Q-learning algorithm asymptotically converges to the optimal worst-case policy, thus making it robust to future environment changes. Simulation results further demonstrate its strong empirical robustness.
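The "finite-dimensional dual" step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the perturbation set is a KL-divergence ball of radius `delta` around the simulator distribution (the abstract does not state the divergence), in which case the worst-case expectation of a value `V` has the standard scalar dual `sup_{alpha > 0} { -alpha * log E[exp(-V / alpha)] - alpha * delta }`. The function name `kl_robust_mean` and the grid over `alpha` are hypothetical choices for illustration. This is a plain plug-in estimate from samples, not the multi-level Monte-Carlo scheme of the paper.

```python
import numpy as np

def kl_robust_mean(values, delta, alphas=None):
    """Plug-in estimate of inf_{Q: KL(Q||P) <= delta} E_Q[V] from samples of V.

    Uses the scalar dual (assumption: KL ball, not confirmed by the abstract):
        sup_{alpha > 0} { -alpha * log E_P[exp(-V / alpha)] - alpha * delta }
    The supremum is approximated by a grid search over alpha.
    """
    v = np.asarray(values, dtype=float)
    if alphas is None:
        # Hypothetical grid; a 1-D concave maximizer could replace it.
        alphas = np.logspace(-3, 3, 2001)
    vmin = v.min()
    # Numerically stable log-mean-exp: shift by the minimum sample so that
    # every exponent is <= 0 and the mean is at least 1/len(v).
    shifted = -(v[None, :] - vmin) / alphas[:, None]
    log_mean_exp = -vmin / alphas + np.log(np.mean(np.exp(shifted), axis=1))
    dual = -alphas * log_mean_exp - alphas * delta
    return float(dual.max())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    samples = rng.normal(loc=1.0, scale=0.5, size=1000)
    print("nominal mean:", samples.mean())
    print("robust mean :", kl_robust_mean(samples, delta=0.1))
```

The robust estimate lies between the smallest sample and the sample mean, and it decreases as `delta` grows, which matches the intuition that a larger perturbation set gives a more pessimistic value.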
Full Text Available
Finite-Sample Regret Bound for Distributionally Robust Offline Tabular Reinforcement Learning

Zhou, Zhengqing and (January 2021, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics)
Banerjee, Arindam and (Ed.)
While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness, or the lack thereof, remains an important issue that is inadequately addressed. In this paper, we provide a distributionally robust formulation of offline policy learning in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to the future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e., how robust it is in a perturbed environment) of any given policy and establish its finite-sample estimation error. Building on this, we then develop a novel and minimax-optimal distributionally robust learning algorithm that achieves $$O_P\left(1/\sqrt{n}\right)$$ regret, meaning that with high probability, the policy learned using $$n$$ training data points will be $$O\left(1/\sqrt{n}\right)$$ close to the optimal distributionally robust policy. Finally, our simulation results demonstrate the superiority of our distributionally robust approach compared to non-robust RL algorithms.
Full Text Available