EXPLORATION–EXPLOITATION POLICIES WITH ALMOST SURE, ARBITRARILY SLOW GROWING ASYMPTOTIC REGRET

Cowan, Wesley; Katehakis, Michael N

doi:10.1017/S0269964818000529

The purpose of this paper is to provide further understanding into the structure of the sequential allocation (“stochastic multi-armed bandit”) problem by establishing probability one finite horizon bounds and convergence rates for the sample regret associated with two simple classes of allocation policies. For any slowly increasing function g, subject to mild regularity constraints, we construct two policies (the g-Forcing, and the g-Inflated Sample Mean) that achieve a measure of regret of order O(g(n)) almost surely as n → ∞, bound from above and below. Additionally, almost sure upper and lower bounds on the remainder term are established. In the constructions herein, the function g effectively controls the “exploration” of the classical “exploration/exploitation” tradeoff.
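
This record does not give the definitions of the two policies, so the following is only a rough illustration of how a function g might control exploration in a forcing-style rule. It assumes, hypothetically, that an arm is "forced" whenever it has been pulled fewer than g(n) times at round n, and that the arm with the highest sample mean is pulled otherwise. The function name `g_forcing_sketch`, the Gaussian rewards, and the default g(n) = log(n + 1) are all illustrative choices, not taken from the paper.

```python
import math
import random


def g_forcing_sketch(means, horizon, g=lambda n: math.log(n + 1), seed=0):
    """Illustrative forcing-style allocation (an assumed rule, not the paper's).

    At round n, any arm pulled fewer than max(g(n), 1) times is forced;
    otherwise the arm with the highest sample mean is pulled.
    Returns (pull counts per arm, pseudo-regret against the best mean).
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k   # number of pulls of each arm
    sums = [0.0] * k   # running reward total of each arm
    for n in range(1, horizon + 1):
        # The max(..., 1) guard ensures every arm is sampled at least once
        # before any sample mean is computed.
        threshold = max(g(n), 1)
        under = [i for i in range(k) if counts[i] < threshold]
        if under:
            # Exploration: pull the least-sampled under-explored arm.
            arm = min(under, key=lambda i: counts[i])
        else:
            # Exploitation: pull the arm with the highest sample mean.
            arm = max(range(k), key=lambda i: sums[i] / counts[i])
        reward = rng.gauss(means[arm], 1.0)  # hypothetical unit-variance rewards
        counts[arm] += 1
        sums[arm] += reward
    best = max(means)
    regret = sum((best - m) * c for m, c in zip(means, counts))
    return counts, regret


counts, regret = g_forcing_sketch([0.2, 0.8], horizon=2000)
```

In this sketch a faster-growing g forces more pulls of every arm (more exploration), while a slower-growing g leaves more rounds to the empirically best arm, which mirrors the role the abstract assigns to g.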

Award ID(s): 1662629

PAR ID: 10669475

Author(s) / Creator(s): Cowan, Wesley; Katehakis, Michael N

Publisher / Repository: Cambridge University Press

Date Published: 2020-07-01

Journal Name: Probability in the Engineering and Informational Sciences

Volume: 34

Issue: 3

ISSN: 0269-9648

Page Range / eLocation ID: 406 to 428

Subject(s) / Keyword(s): AMS Subject Classification: Primary 62G05; Secondary 62G20. Keywords: bandits, forcing actions, inflated sample means, multi-armed, online learning, sequential allocation, upper confidence bounds

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Journal Article: https://doi.org/10.1017/S0269964818000529
