Koyejo, S.
; Mohamed, S.
; Agarwal, A.
; Belgrave, D.
; Cho, K.
; Oh, A.
(Ed.)
In the stochastic contextual bandit setting, regret-minimizing algorithms have been extensively researched, but their instance-minimizing best-arm identification counterparts remain
seldom studied. In this work, we focus on the stochastic bandit problem in the (ǫ, δ)-PAC setting: given a policy class Π the goal of the learner is to return a policy π ∈ Π whose expected
reward is within ǫ of the optimal policy with probability greater than 1 − δ. We characterize
the first instance-dependent PAC sample complexity of contextual bandits through a quantity
ρΠ, and provide matching upper and lower bounds in terms of ρΠ for the agnostic and linear
contextual best-arm identification settings. We show that no algorithm can be simultaneously
minimax-optimal for regret minimization and instance-dependent PAC for best-arm identification. Our main result is a new instance-optimal and computationally efficient algorithm that
relies on a polynomial number of calls to an argmax oracle.
more »
« less