Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes

Kallus, Nathan; Uehara, Masatoshi

Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate novel decision policies without needing to conduct exploration, which is often costly or otherwise infeasible. We consider for the first time the semiparametric efficiency limits of OPE in Markov decision processes (MDPs), where actions, rewards, and states are memoryless. We show existing OPE estimators may fail to be efficient in this setting. We develop a new estimator based on cross-fold estimation of q-functions and marginalized density ratios, which we term double reinforcement learning (DRL). We show that DRL is efficient when both components are estimated at fourth-root rates and is also doubly robust when only one component is consistent. We investigate these properties empirically and demonstrate the performance benefits due to harnessing memorylessness.
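As a rough illustration of how the two components named in the abstract might be combined, the sketch below averages a per-step term built from a marginalized density ratio mu_t and a q-function q_t. The exact form used here, mu_t * (r_t - q_t) + mu_{t-1} * v_t with the convention mu_{-1} = 1, is an assumption for illustration and is not stated in the abstract. The function names `drl_estimate` and `cross_fit_folds`, the array layout, and the fold-splitting scheme are all hypothetical. The nuisance values are taken as given; in a cross-fold scheme they would be predicted by models fit on the other folds.

```python
import numpy as np


def cross_fit_folds(n, k=2, seed=0):
    """Split indices 0..n-1 into k folds; return (fit_idx, eval_idx) pairs.

    Hypothetical helper: nuisances (q-function, density ratio) would be fit
    on fit_idx and evaluated only on the held-out eval_idx.
    """
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), k)
    pairs = []
    for j in range(k):
        fit_idx = np.concatenate([folds[m] for m in range(k) if m != j])
        pairs.append((fit_idx, folds[j]))
    return pairs


def drl_estimate(rewards, mu, q, v, gamma=1.0):
    """Average the assumed per-step term over n trajectories.

    All inputs have shape (n, T + 1), one row per trajectory:
      rewards[i, t] : observed reward r_t
      mu[i, t]      : estimated marginal density ratio at (s_t, a_t)
      q[i, t]       : estimated q-function at (s_t, a_t)
      v[i, t]       : estimated value at s_t under the evaluation policy
    """
    rewards = np.asarray(rewards, dtype=float)
    mu = np.asarray(mu, dtype=float)
    q = np.asarray(q, dtype=float)
    v = np.asarray(v, dtype=float)
    n, steps = rewards.shape
    discounts = gamma ** np.arange(steps)
    # Shift mu one step to the right, using the assumed convention mu_{-1} = 1.
    mu_prev = np.concatenate([np.ones((n, 1)), mu[:, :-1]], axis=1)
    per_step = mu * (rewards - q) + mu_prev * v
    return float(np.mean(per_step @ discounts))
```

With all density ratios equal to one and the q-function and value estimates set to zero, the sketch reduces to the average discounted return of the logged trajectories, which is a quick sanity check on the array handling.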

Award ID(s): 1846210

PAR ID: 10320788

Author(s) / Creator(s): Kallus, Nathan; Uehara, Masatoshi

Date Published: 2020-01-01

Journal Name: Journal of Machine Learning Research

Volume: 21

Issue: 167

ISSN: 1532-4435

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article: The DOI is not currently available.
