SPEED: Experimental Design for Policy Evaluation in Linear Heteroscedastic Bandits

Mukherjee, Subhojyoti; Xie, Qiaomin; Hanna, Josiah P; Nowak, Robert

In this paper, we study the problem of optimal data collection for policy evaluation in linear bandits. In policy evaluation, we are given a target policy and asked to estimate the expected reward it will obtain when executed in a multi-armed bandit environment. Our work is the first to focus on an optimal data collection strategy for policy evaluation with heteroscedastic reward noise in the linear bandit setting. We first formulate an optimal design for weighted least squares estimates in the heteroscedastic linear bandit setting when the noise variances are known. This design minimizes the mean squared error (MSE) of the estimated value of the target policy and is termed the oracle design. Since the noise variance is typically unknown, we then introduce a novel algorithm, SPEED (Structured Policy Evaluation Experimental Design), that tracks the oracle design, and we derive its regret with respect to the oracle design. We show that the regret scales as $\widetilde{O}(d^3 n^{-3/2})$ and prove a matching lower bound of $\Omega(d^2 n^{-3/2})$. Finally, we evaluate SPEED on a set of policy evaluation tasks and demonstrate that it achieves MSE comparable to an optimal oracle and much lower than simply running the target policy.
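To make the oracle-design idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a finite set of arms with feature vectors `X`, a known noise variance `sigma2[a]` for each arm, and sampling proportions `b` on the simplex; all of these names are hypothetical. Under those assumptions the variance of the weighted least squares value estimate is proportional to $w^\top A_b^{-1} w$, where $w = \sum_a \pi(a) x_a$ and $A_b = \sum_a b(a) x_a x_a^\top / \sigma_a^2$. The example compares running the target policy itself against an allocation proportional to $\pi(a)\sigma_a$, which minimizes this quantity only in the special case of orthonormal arms used here.

```python
import numpy as np

def design_objective(X, pi, sigma2, b):
    """Scaled variance of the WLS estimate of the target policy's value.

    X:      (K, d) arm feature vectors (hypothetical toy input)
    pi:     (K,)   target policy probabilities
    sigma2: (K,)   per-arm noise variances, assumed known here
    b:      (K,)   sampling proportions, nonnegative and summing to 1
    """
    w = X.T @ pi                    # value of pi is w^T theta
    A = (X.T * (b / sigma2)) @ X    # sum_a b(a) x_a x_a^T / sigma_a^2
    return float(w @ np.linalg.solve(A, w))

# Toy instance: two orthonormal arms, the second one four times noisier.
X = np.eye(2)
pi = np.array([0.5, 0.5])
sigma2 = np.array([1.0, 4.0])

# Design 1: collect data by running the target policy itself.
on_policy = design_objective(X, pi, sigma2, pi)

# Design 2: sample each arm in proportion to pi(a) * sigma_a
# (variance-minimizing only for orthonormal arms, as assumed here).
b_alt = pi * np.sqrt(sigma2)
b_alt = b_alt / b_alt.sum()
variance_aware = design_objective(X, pi, sigma2, b_alt)

print(f"on-policy design:      {on_policy:.2f}")       # 2.50
print(f"variance-aware design: {variance_aware:.2f}")  # 2.25
```

In this toy instance the variance-aware allocation samples the noisier arm more often and attains a lower objective than on-policy sampling, which is the qualitative effect the abstract describes. With general feature vectors and unknown variances the optimal proportions have no such closed form, and that is the setting SPEED addresses.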

Award ID(s): 2023239

PAR ID: 10533138

Author(s) / Creator(s): Mukherjee, Subhojyoti; Xie, Qiaomin; Hanna, Josiah P; Nowak, Robert

Publisher / Repository: International Conference on Artificial Intelligence and Statistics

Date Published: 2024-04-18

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Proceeding:
The DOI is not currently available.
