skip to main content


Title: Stochastic Multi-Player Bandit Learning from Player-Dependent Feedback
We investigate robust data aggregation in a multi-agent online learning setting. In reality, multiple online learning agents are often deployed to perform similar tasks and receive similar feedback. We study how agents can improve their collective performance by sharing information among each other. In this paper, we formulate the ε-multi-player multi-armed bandit problem, in which a set of M players that have similar reward distributions for each arm play concurrently. We develop an upper confidence bound-based algorithm that adaptively aggregates rewards collected by different players. To our best knowledge, we are the first to develop such a scheme in a multi-player bandit learning setting. We show that under the assumption that pairwise distances between the means of the player-dependent distributions for each arm are small, we improve the (collective) regret bound by nearly a factor of M , in comparison with a baseline algorithm in which the players learn individually using the UCB-1 algorithm (Auer et al., 2002). Our algorithm also exhibits a fallback guarantee, namely, if our task similarity assumption fails to hold, our algorithm still has a performance guarantee that cannot be worse than the baseline by a constant factor. Empirically, we validate our algorithm on synthetic data.  more » « less
Award ID(s):
1915734 1935500
NSF-PAR ID:
10173290
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
ICML Workshop on Real World Experiment Design and Active Learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In many real-world applications, multiple agents seek to learn how to perform highly related yet slightly different tasks in an online bandit learning protocol. We formulate this problem as the ϵ-multi-player multi-armed bandit problem, in which a set of players concurrently interact with a set of arms, and for each arm, the reward distributions for all players are similar but not necessarily identical. We develop an upper confidence bound-based algorithm, RobustAgg(ϵ), that adaptively aggregates rewards collected by different players. In the setting where an upper bound on the pairwise dissimilarities of reward distributions between players is known, we achieve instance-dependent regret guarantees that depend on the amenability of information sharing across players. We complement these upper bounds with nearly matching lower bounds. In the setting where pairwise dissimilarities are unknown, we provide a lower bound, as well as an algorithm that trades off minimax regret guarantees for adaptivity to unknown similarity structure. 
    more » « less
  2. arXiv:2402.05300v2 (Ed.)
    This paper considers a multi-player resource-sharing game with a fair reward allocation model. Multiple players choose from a collection of resources. Each resource brings a random reward equally divided among the players who choose it. We consider two settings. The first setting is a one-slot game where the mean rewards of the resources are known to all the players, and the objective of player 1 is to maximize their worst-case expected utility. Certain special cases of this setting have explicit solutions. These cases provide interesting yet non-intuitive insights into the problem. The second setting is an online setting, where the game is played over a finite time horizon, where the mean rewards are unknown to the first player. Instead, the first player receives, as feedback, the rewards of the resources they chose after the action. We develop a novel Upper Confidence Bound (UCB) algorithm that minimizes the worst-case regret of the first player using the feedback received. 
    more » « less
  3. We propose and evaluate a learning-based framework to address multi-agent resource allocation in coupled wireless systems. In particular we consider, multiple agents (e.g., base stations, access points, etc.) that choose amongst a set of resource allocation options towards achieving their own performance objective /requirements, and where the performance observed at each agent is further coupled with the actions chosen by the other agents, e.g., through interference, channel leakage, etc. The challenge is to find the best collective action. To that end we propose a Multi-Armed Bandit (MAB) framework wherein the best actions (aka arms) are adaptively learned through online reward feedback. Our focus is on systems which are "weakly-coupled" wherein the best arm of each agent is invariant to others' arm selection the majority of the time - this majority structure enables one to develop light weight efficient algorithms. This structure is commonly found in many wireless settings such as channel selection and power control. We develop a bandit algorithm based on the Track-and-Stop strategy, which shows a logarithmic regret with respect to a genie. Finally through simulation, we exhibit the potential use of our model and algorithm in several wireless application scenarios. 
    more » « less
  4. We study the problem of online multi-task learning where the tasks are performed within similar but not necessarily identical multi-armed bandit environments. In particular, we study how a learner can improve its overall performance across multiple related tasks through robust transfer of knowledge. While an upper confidence bound (UCB)-based algorithm has recently been shown to achieve nearly-optimal performance guarantees in a setting where all tasks are solved concurrently, it remains unclear whether Thompson sampling (TS) algorithms, which have superior empirical performance in general, share similar theoretical properties. In this work, we present a TS-type algorithm for a more general online multi-task learning protocol, which extends the concurrent setting. We provide its frequentist analysis and prove that it is also nearly-optimal using a novel concentration inequality for multi-task data aggregation at random stopping times. Finally, we evaluate the algorithm on synthetic data and show that the TS-type algorithm enjoys superior empirical performance in comparison with the UCB-based algorithm and a baseline algorithm that performs TS for each individual task without transfer. 
    more » « less
  5. Contextual bandit is a classic multi-armed bandit setting, where side information (i.e., context) is available before arm selection. A standard assumption is that exact contexts are perfectly known prior to arm selection and only single feedback is returned. In this work, we focus on multi-feedback bandit learning with probabilistic contexts, where a bundle of contexts are revealed to the agent along with their corresponding probabilities at the beginning of each round. This models such scenarios as where contexts are drawn from the probability output of a neural network and the reward function is jointly determined by multiple feedback signals. We propose a kernelized learning algorithm based on upper confidence bound to choose the optimal arm in reproducing kernel Hilbert space for each context bundle. Moreover, we theoretically establish an upper bound on the cumulative regret with respect to an oracle that knows the optimal arm given probabilistic contexts, and show that the bound grows sublinearly with time. Our simula- tion on machine learning model recommendation further validates the sub-linearity of our cumulative regret and demonstrates that our algorithm outper- forms the approach that selects arms based on the most probable context. 
    more » « less