Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

Bennett, Andrew; Kallus, Nathan

doi:10.1287/opre.2021.0781

Citation Details

Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes

In applications of offline reinforcement learning to observational data, such as in healthcare or education, a general concern is that observed actions might be affected by unobserved factors, inducing confounding and biasing estimates derived assuming a perfect Markov decision process (MDP) model. In “Proximal Reinforcement Learning: Efficient Off-Policy Evaluation in Partially Observed Markov Decision Processes,” A. Bennett and N. Kallus tackle this by considering off-policy evaluation in a partially observed MDP (POMDP). Specifically, they consider estimating the value of a given target policy in an unknown POMDP, given observations of trajectories generated by a different and unknown policy, which may depend on the unobserved states. They consider both when the target policy value can be identified the observed data and, given identification, how best to estimate it. Both these problems are addressed by extending the framework of proximal causal inference to POMDP settings, using sequences of so-called bridge functions. This results in a novel framework for off-policy evaluation in POMDPs that they term proximal reinforcement learning, which they validate in various empirical settings. more »

Award ID(s):: 1939704

PAR ID:: 10466992

Author(s) / Creator(s):: Bennett, Andrew; Kallus, Nathan

Publisher / Repository:: INFORMS

Date Published:: 2023-09-26

Journal Name:: Operations Research

ISSN:: 0030-364X

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Journal Article:
https://doi.org/10.1287/opre.2021.0781

More Like this