Data-Efficient Policy Evaluation Through Behavior Policy Search

Hanna, Josiah P; Chandak, Yash; Thomas, Philip S; White, Martha; Stone, Peter; Niekum, Scott


We consider the task of evaluating a policy for a Markov decision process (MDP). The standard unbiased technique for evaluating a policy is to deploy the policy and observe its performance. We show that the data collected from deploying a different policy, commonly called the behavior policy, can be used to produce unbiased estimates with lower mean squared error than this standard technique. We derive an analytic expression for a minimal variance behavior policy -- a behavior policy that minimizes the mean squared error of the resulting estimates. Because this expression depends on terms that are unknown in practice, we propose a novel policy evaluation sub-problem, behavior policy search: searching for a behavior policy that reduces mean squared error. We present two behavior policy search algorithms and empirically demonstrate their effectiveness in lowering the mean squared error of policy performance estimates.
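The claim that a different behavior policy can give lower mean squared error than deploying the evaluated policy can be illustrated with a small sketch. The abstract does not name the estimator, so this example assumes ordinary importance sampling, and it uses a hypothetical one-step MDP with fixed positive returns. The names `RETURNS`, `EVAL_POLICY`, `min_variance_behavior`, `is_estimate`, and `mse` are illustrative and not taken from the paper. The behavior policy here samples each action in proportion to its probability under the evaluated policy times the magnitude of its return, which is the standard minimum-variance choice for importance sampling.

```python
import random
import statistics

# Hypothetical one-step MDP: three actions with fixed, positive returns.
RETURNS = [1.0, 2.0, 10.0]
# Policy being evaluated (illustrative probabilities, not from the paper).
EVAL_POLICY = [0.6, 0.3, 0.1]


def true_value(policy):
    """Expected return of `policy` in this toy problem."""
    return sum(p * g for p, g in zip(policy, RETURNS))


def min_variance_behavior(policy):
    """Behavior policy proportional to policy(a) * |return(a)|.

    This requires knowing the returns, which are unknown in practice.
    """
    weights = [p * abs(g) for p, g in zip(policy, RETURNS)]
    total = sum(weights)
    return [w / total for w in weights]


def is_estimate(behavior, n, rng):
    """Importance-sampling estimate of EVAL_POLICY's value from n samples."""
    actions = rng.choices(range(len(RETURNS)), weights=behavior, k=n)
    weighted = (EVAL_POLICY[a] / behavior[a] * RETURNS[a] for a in actions)
    return sum(weighted) / n


def mse(behavior, n=20, trials=2000, seed=0):
    """Empirical mean squared error of the estimator over repeated trials."""
    rng = random.Random(seed)
    target = true_value(EVAL_POLICY)
    errors = ((is_estimate(behavior, n, rng) - target) ** 2 for _ in range(trials))
    return statistics.fmean(errors)


# Deploying the evaluated policy itself (all importance weights equal 1).
on_policy_mse = mse(EVAL_POLICY)
# Sampling from the minimum-variance behavior policy instead.
off_policy_mse = mse(min_variance_behavior(EVAL_POLICY))
print(on_policy_mse, off_policy_mse)
```

In this toy setting every importance-weighted return equals the true value, so the off-policy error is essentially zero while the on-policy error is not. The construction depends on the returns being known, which matches the abstract's point that the analytic expression depends on terms unknown in practice and motivates searching for a good behavior policy instead.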

Award ID(s): 2410981

PAR ID: 10631573

Author(s) / Creator(s): Hanna, Josiah P; Chandak, Yash; Thomas, Philip S; White, Martha; Stone, Peter; Niekum, Scott

Editor(s): Ravikumar, Pradeep

Publisher / Repository: JMLR

Date Published: 2024-10-01

Journal Name: Journal of Machine Learning Research

ISSN: 1533-7928

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Journal Article:
The DOI is not currently available.
