Universal Off-Policy Evaluation

Chandak, Yash; Niekum, Scott; Castro da Silva, Bruno; Learned-Miller, Erik; Brunskill, Emma; Thomas, Philip

Citation Details

When faced with sequential decision-making problems, it is often useful to be able to predict what would happen if decisions were made using a new policy. Those predictions must often be based on data collected under some previously used decision-making rule. Many previous methods enable such off-policy (or counterfactual) estimation of the expected value of a performance measure called the return. In this paper, we take the first steps towards a universal off-policy estimator (UnO), one that provides off-policy estimates and high-confidence bounds for any parameter of the return distribution. We use UnO for estimating and simultaneously bounding the mean, variance, quantiles/median, inter-quantile range, CVaR, and the entire cumulative distribution of returns. Finally, we also discuss UnO’s applicability in various settings, including fully observable, partially observable (i.e., with unobserved confounders), Markovian, non-Markovian, stationary, smoothly non-stationary, and discrete distribution shifts.
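As an illustration of the idea in the abstract, the sketch below estimates the cumulative distribution of returns under a new policy from data collected under an old one, then reads a mean and a quantile off that one estimate. It assumes the common importance-weighting form, in which each observed return is reweighted by the ratio of its probability under the evaluation policy to its probability under the behavior policy. This record does not spell out the estimator, so the function names (`weighted_cdf`, `quantile`, `cdf_mean`), the clipping to [0, 1], and the toy data are all assumptions made for illustration, and the high-confidence bounds mentioned in the abstract are not covered.

```python
def weighted_cdf(returns, weights):
    """Importance-weighted empirical CDF of returns (illustrative sketch).

    returns[i]: return observed under the behavior policy.
    weights[i]: importance ratio for that trajectory
                (evaluation-policy probability / behavior-policy probability).
    Produces (value, cdf) pairs sorted by value, with the CDF clipped to 1.
    """
    n = len(returns)
    total = 0.0
    cdf = []
    for g, w in sorted(zip(returns, weights)):
        total += w / n                    # each sample contributes w/n of mass
        cdf.append((g, min(1.0, total)))  # clipping is an assumption of this sketch
    return cdf


def quantile(cdf, alpha):
    """Smallest observed return whose estimated CDF value reaches alpha."""
    for g, f in cdf:
        if f >= alpha:
            return g
    return cdf[-1][0]


def cdf_mean(cdf):
    """Mean return computed from the jumps of the estimated CDF."""
    mean, prev = 0.0, 0.0
    for g, f in cdf:
        mean += g * (f - prev)
        prev = f
    return mean


# Hypothetical one-step problem: the behavior policy picks each of two actions
# with probability 0.5; the evaluation policy picks action 1 with probability 0.9.
# Action 0 yields return 0.0 and action 1 yields return 1.0.
returns = [0.0] * 5 + [1.0] * 5
weights = [0.1 / 0.5] * 5 + [0.9 / 0.5] * 5

cdf = weighted_cdf(returns, weights)
print("estimated mean:", cdf_mean(cdf))
print("estimated median:", quantile(cdf, 0.5))
```

The point of the design is that the mean and the quantile both come from the same estimated distribution, which matches the abstract's description of estimating several parameters of the return distribution at once.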

Award ID(s): 2018372

PAR ID: 10358096

Author(s) / Creator(s): Chandak, Yash; Niekum, Scott; Castro da Silva, Bruno; Learned-Miller, Erik; Brunskill, Emma; Thomas, Philip

Date Published: 2021-12-06

Journal Name: Advances in Neural Information Processing Systems

ISSN: 1049-5258

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
