We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a nonlinear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using …
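To see why directly applying a standard multi-armed bandit algorithm is impractical here, note that the action space is the set of all K-subsets of N arms, which grows combinatorially. A small illustration (not the paper's algorithm):

```python
from math import comb

# Number of joint actions a naive bandit algorithm would have to explore
# when every K-subset of the N arms is treated as one "super-arm".
N, K = 20, 5
n_actions = comb(N, K)
print(n_actions)  # 15504 super-arms for just 20 base arms
```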
Stochastic Bandits with Linear Constraints
We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies whose expected cumulative reward over the course of multiple rounds is maximum, and each of which has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits, propose a computationally efficient algorithm for this setting, and prove a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of …
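The optimistic-pessimistic idea can be sketched in the multi-armed special case: be optimistic about rewards, pessimistic about costs, and inflate the cost confidence radius by a separate factor. The helper below is a hedged illustration of this principle, with assumed names (`rho`, `tau`) and a textbook UCB radius, not the paper's exact OPLB.

```python
import numpy as np

def oplb_style_select(means_r, means_c, counts, t, tau, rho=2.0):
    """Sketch of the optimistic-pessimistic idea (not the exact OPLB):
    optimism for reward, pessimism for cost with a scaled radius rho,
    restricted to arms whose pessimistic cost stays below threshold tau."""
    radius = np.sqrt(2 * np.log(t) / np.maximum(counts, 1))
    ucb_reward = means_r + radius             # optimistic reward estimate
    pessimistic_cost = means_c + rho * radius # inflated cost estimate
    feasible = pessimistic_cost <= tau
    if not feasible.any():
        return None  # no arm certified feasible yet
    idx = np.where(feasible)[0]
    return idx[np.argmax(ucb_reward[idx])]
```

With well-estimated arms, the rule picks the highest-reward arm among those certified feasible; early on, when radii are wide, it may certify no arm at all.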
 Editors:
 Banerjee, Arindam; Fukumizu, Kenji
 Award ID(s):
 2023505
 Publication Date:
 NSFPAR ID:
 10273286
 Journal Name:
 Proceedings of The 24th International Conference on Artificial Intelligence and Statistics
 Volume:
 130
 Page Range or eLocationID:
 2827-2835
 Sponsoring Org:
 National Science Foundation
More Like this


In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to suboptimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with …
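A common building block in such decentralized settings is gossip averaging: agents repeatedly mix their local estimates with neighbors' through a doubly stochastic matrix built from the communication graph, so that biased local views converge to the network average. The matrix and update below are an assumed illustration of this mechanism, not the paper's exact algorithm.

```python
import numpy as np

def gossip_step(estimates, W):
    """One round of mixing per-agent estimates over the graph:
    row i of the result is agent i's new estimate."""
    return W @ estimates

# Doubly stochastic mixing matrix for a 3-agent path graph (assumed example).
W = np.array([[0.50, 0.50, 0.00],
              [0.50, 0.25, 0.25],
              [0.00, 0.25, 0.75]])

# Agent 0 starts with a biased local estimate of some arm's mean.
est = np.array([[1.0], [0.0], [0.0]])
for _ in range(50):
    est = gossip_step(est, W)
# All agents' estimates approach the network average, 1/3.
```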

Contextual bandit algorithms have become widely used for recommendation in online systems (e.g. marketplaces, music streaming, news), where they now wield substantial influence on which items get shown to users. This raises questions of fairness to the items — and to the sellers, artists, and writers that benefit from this exposure. We argue that the conventional bandit formulation can lead to an undesirable and unfair winner-takes-all allocation of exposure. To remedy this problem, we propose a new bandit objective that guarantees merit-based fairness of exposure to the items while optimizing utility to the users. We formulate fairness regret and reward …
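One simple way to contrast the two allocation styles: winner-takes-all gives all exposure to the highest-merit item, while a merit-based scheme splits exposure in proportion to estimated merit. The proportional rule below is an assumption about the fairness notion, used only for illustration, not the paper's exact definition.

```python
import numpy as np

def exposure_shares(merits):
    """Assumed merit-based allocation: each item's share of exposure is
    proportional to its (nonnegative) estimated merit."""
    merits = np.maximum(merits, 0.0)
    return merits / merits.sum()

exposure_shares(np.array([3.0, 1.0]))  # -> [0.75, 0.25], not [1.0, 0.0]
```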

We investigate robust data aggregation in a multi-agent online learning setting. In reality, multiple online learning agents are often deployed to perform similar tasks and receive similar feedback. We study how agents can improve their collective performance by sharing information among each other. In this paper, we formulate the ε-multi-player multi-armed bandit problem, in which a set of M players that have similar reward distributions for each arm play concurrently. We develop an upper confidence bound-based algorithm that adaptively aggregates rewards collected by different players. To our best knowledge, we are the first to develop such a scheme in a …
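The core idea of adaptive aggregation can be sketched as follows: a player pools a peer's per-arm samples only when the peer's estimate lies within ε of its own, then computes a standard UCB index on the pooled counts. The names and the simple ε-filter below are assumptions for illustration, not the paper's scheme.

```python
import numpy as np

def aggregated_ucb_index(own_mean, own_n, peer_means, peer_ns, t, eps=0.1):
    """UCB index for one arm after pooling samples from peers whose
    estimates are within eps of the player's own estimate."""
    close = np.abs(peer_means - own_mean) <= eps  # drop dissimilar peers
    n_total = own_n + peer_ns[close].sum()
    pooled_mean = (own_mean * own_n
                   + (peer_means[close] * peer_ns[close]).sum()) / n_total
    return pooled_mean + np.sqrt(2 * np.log(t) / n_total)
```

Pooling shrinks the confidence radius (more effective samples) while the ε-filter guards against peers whose reward distributions differ too much.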

Contextual bandit is a classic multi-armed bandit setting, where side information (i.e., context) is available before arm selection. A standard assumption is that exact contexts are perfectly known prior to arm selection and only single feedback is returned. In this work, we focus on multi-feedback bandit learning with probabilistic contexts, where a bundle of contexts are revealed to the agent along with their corresponding probabilities at the beginning of each round. This models such scenarios as where contexts are drawn from the probability output of a neural network and the reward function is jointly determined by multiple feedback signals. We …
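With a bundle of candidate contexts and their probabilities revealed each round, a natural arm score is the reward expectation over the context distribution. The linear reward model below is an assumption for illustration (the abstract does not specify the model), not the paper's method.

```python
import numpy as np

def expected_arm_score(contexts, probs, theta_hat):
    """contexts: (B, d) candidate contexts in the bundle;
    probs: (B,) their probabilities (summing to 1);
    theta_hat: (d,) estimated parameter for this arm.
    Returns E_c[<c, theta_hat>] under the revealed distribution."""
    return probs @ (contexts @ theta_hat)

contexts = np.array([[1.0, 0.0], [0.0, 1.0]])
probs = np.array([0.7, 0.3])
theta = np.array([0.5, 1.0])
expected_arm_score(contexts, probs, theta)  # 0.7*0.5 + 0.3*1.0 = 0.65
```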