skip to main content


This content will become publicly available on December 13, 2024

Title: Privacy-Engineered Value Decomposition Networks for Cooperative Multi-Agent Reinforcement Learning
In cooperative multi-agent reinforcement learning (Co-MARL), a team of agents must jointly optimize the team's longterm rewards to learn a designated task. Optimizing rewards as a team often requires inter-agent communication and data sharing, leading to potential privacy implications. We assume privacy considerations prohibit the agents from sharing their environment interaction data. Accordingly, we propose Privacy-Engineered Value Decomposition Networks (PE-VDN), a Co-MARL algorithm that models multi-agent coordination while provably safeguarding the confidentiality of the agents' environment interaction data. We integrate three privacy-engineering techniques to redesign the data flows of the VDN algorithm-an existing Co-MARL algorithm that consolidates the agents' environment interaction data to train a central controller that models multi-agent coordination-and develop PE-VDN. In the first technique, we design a distributed computation scheme that eliminates Vanilla VDN's dependency on sharing environment interaction data. Then, we utilize a privacy-preserving multi-party computation protocol to guar-antee that the data flows of the distributed computation scheme do not pose new privacy risks. Finally, we enforce differential privacy to preempt inference threats against the agents' training data-past environment interactions-when they take actions based on their neural network predictions. We implement PE-VDN in StarCraft Multi-Agent Competition (SMAC) and show that it achieves 80% of Vanilla VDN's win rate while maintaining differential privacy levels that provide meaningful privacy guarantees. The results demonstrate that PE-VDN can safeguard the confidentiality of agents' environment interaction data without sacrificing multi-agent coordination.  more » « less
Award ID(s):
1943275
NSF-PAR ID:
10494460
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Proceedings of the 62nd IEEE Conference on Decision and Control (CDC)
Page Range / eLocation ID:
8038 to 8044
Format(s):
Medium: X
Location:
Singapore, Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Intelligent utilization of resources and improved mission performance in an autonomous agent require consideration of cyber and physical resources. The allocation of these resources becomes more complex when the system expands from one agent to multiple agents, and the control shifts from centralized to decentralized. Consensus is a distributed algorithm that lets multiple agents agree on a shared value, but typically does not leverage mobility. We propose a coupled consensus control strategy that co-regulates computation, communication frequency, and connectivity of the agents to achieve faster convergence times at lower communication rates and computational costs. In this strategy, agents move towards a common location to increase connectivity. Simultaneously, the communication frequency is increased when the shared state error between an agent and its connected neighbors is high. When the shared state converges (i.e., consensus is reached), the agents withdraw to the initial positions and the communication frequency is decreased. Convergence properties of our algorithm are demonstrated under the proposed co-regulated control algorithm. We evaluated the proposed approach through a new set of cyber-physical, multi-agent metrics and demonstrated our approach in a simulation of unmanned aircraft systems measuring temperatures at multiple sites. The results demonstrate that, compared with fixed-rate and event-triggered consensus algorithms, our co-regulation scheme can achieve improved performance with fewer resources, while maintaining high reactivity to changes in the environment and system. 
    more » « less
  2. null (Ed.)
    In this paper, we study Federated Bandit, a decentralized Multi-Armed Bandit problem with a set of N agents, who can only communicate their local data with neighbors described by a connected graph G. Each agent makes a sequence of decisions on selecting an arm from M candidates, yet they only have access to local and potentially biased feedback/evaluation of the true reward for each action taken. Learning only locally will lead agents to sub-optimal actions while converging to a no-regret strategy requires a collection of distributed data. Motivated by the proposal of federated learning, we aim for a solution with which agents will never share their local observations with a central entity, and will be allowed to only share a private copy of his/her own information with their neighbors. We first propose a decentralized bandit algorithm \textttGossip\_UCB, which is a coupling of variants of both the classical gossiping algorithm and the celebrated Upper Confidence Bound (UCB) bandit algorithm. We show that \textttGossip\_UCB successfully adapts local bandit learning into a global gossiping process for sharing information among connected agents, and achieves guaranteed regret at the order of O(\max\ \textttpoly (N,M) łog T, \textttpoly (N,M)łog_łambda_2^-1 N\ ) for all N agents, where łambda_2\in(0,1) is the second largest eigenvalue of the expected gossip matrix, which is a function of G. We then propose \textttFed\_UCB, a differentially private version of \textttGossip\_UCB, in which the agents preserve ε-differential privacy of their local data while achieving O(\max \\frac\textttpoly (N,M) ε łog^2.5 T, \textttpoly (N,M) (łog_łambda_2^-1 N + łog T) \ ) regret. 
    more » « less
  3. We consider the problem of spectrum sharing by multiple cellular operators. We propose a novel deep Reinforcement Learning (DRL)-based distributed power allocation scheme which utilizes the multi-agent Deep Deterministic Policy Gradient (MA-DDPG) algorithm. In particular, we model the base stations (BSs) that belong to the multiple operators sharing the same band, as DRL agents that simultaneously determine the transmit powers to their scheduled user equipment (UE) in a synchronized manner. The power decision of each BS is based on its own observation of the radio environment (RF) environment, which consists of interference measurements reported from the UEs it serves, and a limited amount of information obtained from other BSs. One advantage of the proposed scheme is that it addresses the single-agent non-stationarity problem of RL in the multi-agent scenario by incorporating the actions and observations of other BSs into each BS's own critic which helps it to gain a more accurate perception of the overall RF environment. A centralized-training-distributed-execution framework is used to train the policies where the critics are trained over the joint actions and observations of all BSs while the actor of each BS only takes the local observation as input in order to produce the transmit power. Simulation with the 6 GHz Unlicensed National Information Infrastructure (U-NII)-5 band shows that the proposed power allocation scheme can achieve better throughput performance than several state-of-the-art approaches. 
    more » « less
  4. Information sharing among agents to jointly solve problems is challenging for multi-agent reinforcement learning algorithms (MARL) in smart environments. In this paper, we present a novel information sharing approach for MARL, which introduces a Team Information Matrix (TIM) that integrates scenario-independent spatial and environmental information combined with the agent's local observations, augmenting both individual agent's performance and global awareness during the MARL learning. To evaluate this approach, we conducted experiments on three multi-agent scenarios of varying difficulty levels implemented in Unity ML-Agents Toolkit. Experimental results show that the agents utilizing our TIM-Shared variation outperformed those using decentralized MARL and achieved comparable performance to agents employing centralized MARL. 
    more » « less
  5. We investigate robust data aggregation in a multi-agent online learning setting. In reality, multiple online learning agents are often deployed to perform similar tasks and receive similar feedback. We study how agents can improve their collective performance by sharing information among each other. In this paper, we formulate the ε-multi-player multi-armed bandit problem, in which a set of M players that have similar reward distributions for each arm play concurrently. We develop an upper confidence bound-based algorithm that adaptively aggregates rewards collected by different players. To our best knowledge, we are the first to develop such a scheme in a multi-player bandit learning setting. We show that under the assumption that pairwise distances between the means of the player-dependent distributions for each arm are small, we improve the (collective) regret bound by nearly a factor of M , in comparison with a baseline algorithm in which the players learn individually using the UCB-1 algorithm (Auer et al., 2002). Our algorithm also exhibits a fallback guarantee, namely, if our task similarity assumption fails to hold, our algorithm still has a performance guarantee that cannot be worse than the baseline by a constant factor. Empirically, we validate our algorithm on synthetic data. 
    more » « less