Stochastic Bandits with Linear Constraints

Pacchiano, Aldo; Ghavamzadeh, Mohammad; Bartlett, Peter L.; Jiang, Heinrich

Citation Details

We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of multiple rounds is maximum, and each one of them has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting and prove a a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown. more »

Award ID(s):: 2023505

PAR ID:: 10273286

Author(s) / Creator(s):: Pacchiano, Aldo; Ghavamzadeh, Mohammad; Bartlett, Peter L.; Jiang, Heinrich

Editor(s):: Banerjee, Arindam; Fukumizu, Kenji

Date Published:: 2021-04-01

Journal Name:: Proceedings of The 24th International Conference on Artificial Intelligence and Statistics

Volume:: 130

Page Range / eLocation ID:: 2827-2835

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
The DOI is not currently available.

More Like this