We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a nonlinear function of the rewards of the selected individual arms. The direct use of a multiarmed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for topK subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of crossselling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of stateoftheart algorithms. We also show that DART significantly outperforms existing methods for both linear and nonlinear joint reward environments.
more »
« less
Stochastic Bandits with Linear Constraints
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of multiple rounds is maximum, and each one of them has an expected cost below a certain threshold. We propose an upperconfidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multiarmed bandits and propose a computationally efficient algorithm for this setting and prove a a regret bound that is better than simply casting multiarmed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lowerbound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown.
more »
« less
 Award ID(s):
 2023505
 NSFPAR ID:
 10273286
 Editor(s):
 Banerjee, Arindam; Fukumizu, Kenji
 Date Published:
 Journal Name:
 Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
 Volume:
 130
 Page Range / eLocation ID:
 28272835
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We investigate the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and fullbandit feedback, where no extra information other than the reward of selected action at each time step is observed. We propose a simple algorithm, ExploreThenCommit Greedy (ETCG) and prove that it achieves a regret upper bound of for a horizon , number of base elements , and cardinality constraint . We also show in experiments with synthetic and realworld data that the ETCG empirically outperforms other fullbandit methods.more » « less

Cussens, James ; Zhang, Kun (Ed.)We investigate the problem of combinatorial multiarmed bandits with stochastic submodular (in expectation) rewards and fullbandit feedback, where no extra information other than the reward of selected action at each time step $t$ is observed. We propose a simple algorithm, ExploreThenCommit Greedy (ETCG) and prove that it achieves a $(11/e)$regret upper bound of $\mathcal{O}(n^\frac{1}{3}k^\frac{4}{3}T^\frac{2}{3}\log(T)^\frac{1}{2})$ for a horizon $T$, number of base elements $n$, and cardinality constraint $k$. We also show in experiments with synthetic and realworld data that the ETCG empirically outperforms other fullbandit methods.more » « less

Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get shown to users. This raises questions of fairness to the items — and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winnertakesall allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees meritbased fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward regret in this setting and present algorithms for both stochastic multiarmed bandits and stochastic linear bandits. We prove that the algorithms achieve sublinear fairness regret and reward regret. Beyond the theoretical analysis, we also provide empirical evidence that these algorithms can allocate exposure to different arms effectively.more » « less

null (Ed.)In this paper, we study Federated Bandit, a decentralized MultiArmed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to suboptimal actions while converging to a noregret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of his/her own information with their neighbors. We first propose a decentralized bandit algorithm \textttGossip\_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that \textttGossip\_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of O(\max\ \textttpoly (N,M) łog T, \textttpoly (N,M)łog_łambda_2^1 N\ ) for all N agents, where łambda_2\in(0,1) is the second largest eigenvalue of the expected gossip matrix, which is a function of G. We then propose \textttFed\_UCB, a differentially private version of \textttGossip\_UCB, in which the agents preserve εdifferential privacy of their local data while achieving O(\max \\frac\textttpoly (N,M) ε łog^2.5 T, \textttpoly (N,M) (łog_łambda_2^1 N + łog T) \ ) regret.more » « less