skip to main content


Title: An Efficient Algorithm for Fair Multi-Agent Multi-Armed Bandit with Low Regret

Recently a multi-agent variant of the classical multi-armed bandit was proposed to tackle fairness issues in online learning. Inspired by a long line of work in social choice and economics, the goal is to optimize the Nash social welfare instead of the total utility. Unfortunately previous algorithms either are not efficient or achieve sub-optimal regret in terms of the number of rounds. We propose a new efficient algorithm with lower regret than even previous inefficient ones. We also complement our efficient algorithm with an inefficient approach with regret that matches the lower bound for one agent. The experimental findings confirm the effectiveness of our efficient algorithm compared to the previous approaches.

 
more » « less
Award ID(s):
1750716
NSF-PAR ID:
10493897
Author(s) / Creator(s):
; ;
Publisher / Repository:
Association for the Advancement of Artificial Intelligence
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
37
Issue:
7
ISSN:
2159-5399
Page Range / eLocation ID:
8159 to 8167
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    https://arxiv.org/abs/2010.13724 We study the question of obtaining last-iterate convergence rates for no-regret learning algorithms in multi-player games. We show that the optimistic gradient (OG) algorithm with a constant step-size, which is no-regret, achieves a last-iterate rate of O(1/T‾‾√) with respect to the gap function in smooth monotone games. This result addresses a question of Mertikopoulos & Zhou (2018), who asked whether extra-gradient approaches (such as OG) can be applied to achieve improved guarantees in the multi-agent learning setting. The proof of our upper bound uses a new technique centered around an adaptive choice of potential function at each iteration. We also show that the O(1/T‾‾√) rate is tight for all p-SCLI algorithms, which includes OG as a special case. As a byproduct of our lower bound analysis we additionally present a proof of a conjecture of Arjevani et al. (2015) which is more direct than previous approaches. 
    more » « less
  2. We consider a multi-agent multi-armed bandit setting in which n honest agents collaborate over a network to minimize regret but m malicious agents can disrupt learning arbitrarily. Assuming the network is the complete graph, existing algorithms incur O((m + K/n) łog (T) / Δ ) regret in this setting, where K is the number of arms and Δ is the arm gap. For m łl K, this improves over the single-agent baseline regret of O(Kłog(T)/Δ). In this work, we show the situation is murkier beyond the case of a complete graph. In particular, we prove that if the state-of-the-art algorithm is used on the undirected line graph, honest agents can suffer (nearly) linear regret until time is doubly exponential in K and n . In light of this negative result, we propose a new algorithm for which the i -th agent has regret O(( dmal (i) + K/n) łog(T)/Δ) on any connected and undirected graph, where dmal(i) is the number of i 's neighbors who are malicious. Thus, we generalize existing regret bounds beyond the complete graph (where dmal(i) = m), and show the effect of malicious agents is entirely local (in the sense that only the dmal (i) malicious agents directly connected to i affect its long-term regret). 
    more » « less
  3. We propose and evaluate a learning-based framework to address multi-agent resource allocation in coupled wireless systems. In particular we consider, multiple agents (e.g., base stations, access points, etc.) that choose amongst a set of resource allocation options towards achieving their own performance objective /requirements, and where the performance observed at each agent is further coupled with the actions chosen by the other agents, e.g., through interference, channel leakage, etc. The challenge is to find the best collective action. To that end we propose a Multi-Armed Bandit (MAB) framework wherein the best actions (aka arms) are adaptively learned through online reward feedback. Our focus is on systems which are "weakly-coupled" wherein the best arm of each agent is invariant to others' arm selection the majority of the time - this majority structure enables one to develop light weight efficient algorithms. This structure is commonly found in many wireless settings such as channel selection and power control. We develop a bandit algorithm based on the Track-and-Stop strategy, which shows a logarithmic regret with respect to a genie. Finally through simulation, we exhibit the potential use of our model and algorithm in several wireless application scenarios. 
    more » « less
  4. Banerjee, Arindam ; Fukumizu, Kenji (Ed.)
    We study a constrained contextual linear bandit setting, where the goal of the agent is to produce a sequence of policies, whose expected cumulative reward over the course of multiple rounds is maximum, and each one of them has an expected cost below a certain threshold. We propose an upper-confidence bound algorithm for this problem, called optimistic pessimistic linear bandit (OPLB), and prove a sublinear bound on its regret that is inversely proportional to the difference between the constraint threshold and the cost of a known feasible action. Our algorithm balances exploration and constraint satisfaction using a novel idea that scales the radii of the reward and cost confidence sets with different scaling factors. We further specialize our results to multi-armed bandits and propose a computationally efficient algorithm for this setting and prove a a regret bound that is better than simply casting multi-armed bandits as an instance of linear bandits and using the regret bound of OPLB. We also prove a lower-bound for the problem studied in the paper and provide simulations to validate our theoretical results. Finally, we show how our algorithm and analysis can be extended to multiple constraints and to the case when the cost of the feasible action is unknown. 
    more » « less
  5. Avidan, S. ; Brostow, G. ; Cissé, M. ; Farinella, G.M. ; Hassner, T. (Ed.)
    Wen, S., Wang, H., Metaxas, D. (2022). Social ODE: Multi-agent Trajectory Forecasting with Neural Ordinary Differential Equations. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13682. Springer, Cham. https://doi.org/10.1007/978-3-031-20047-2_13 Multi-agent trajectory forecasting has recently attracted a lot of attention due to its widespread applications including autonomous driving. Most previous methods use RNNs or Transformers to model agent dynamics in the temporal dimension and social pooling or GNNs to model interactions with other agents; these approaches usually fail to learn the underlying continuous temporal dynamics and agent interactions explicitly. To address these problems, we propose Social ODE which explicitly models temporal agent dynamics and agent interactions. Our approach leverages Neural ODEs to model continuous temporal dynamics, and incorporates distance, interaction intensity, and aggressiveness estimation into agent interaction modeling in latent space. We show in extensive experiments that our Social ODE approach compares favorably with state-of-the-art, and more importantly, can successfully avoid sudden obstacles and effectively control the motion of the agent, while previous methods often fail in such cases. 
    more » « less