One of the challenges for multiagent reinforcement learning (MARL) is designing efficient learning algorithms for a large system in which each agent has only limited or partial information of the entire system. Whereas exciting progress has been made to analyze decentralized MARL with the network of agents for social networks and team video games, little is known theoretically for decentralized MARL with the network of states for modeling self-driving vehicles, ride-sharing, and data and traffic routing. This paper proposes a framework of localized training and decentralized execution to study MARL with the network of states. Localized training means that agents only need to collect local information in their neighboring states during the training phase; decentralized execution implies that agents can execute afterward the learned decentralized policies, which depend only on agents’ current states. The theoretical analysis consists of three key components: the first is the reformulation of the MARL system as a networked Markov decision process with teams of agents, enabling updating the associated team Q-function in a localized fashion; the second is the Bellman equation for the value function and the appropriate Q-function on the probability measure space; and the third is the exponential decay property of the team Q-function, facilitating its approximation with efficient sample efficiency and controllable error. The theoretical analysis paves the way for a new algorithm LTDE-Neural-AC, in which the actor–critic approach with overparameterized neural networks is proposed. The convergence and sample complexity are established and shown to be scalable with respect to the sizes of both agents and states. To the best of our knowledge, this is the first neural network–based MARL algorithm with network structure and provable convergence guarantee.
more »
« less
This content will become publicly available on May 3, 2026
Scalable spectral representations for multiagent reinforcement learning in network MDPs
Network Markov Decision Processes (MDPs), which are the de-facto model for multi-agent control, pose a significant challenge to efficient learning caused by the exponential growth of the global state-action space with the number of agents. In this work, utilizing the exponential decay property of network dynamics, we first derive scalable spectral local representations for multiagent reinforcement learning in network MDPs, which induces a network linear subspace for the local $$Q$$-function of each agent. Building on these local spectral representations, we design a scalable algorithmic framework for multiagent reinforcement learning in continuous state-action network MDPs, and provide end-to-end guarantees for the convergence of our algorithm. Empirically, we validate the effectiveness of our scalable representation-based approach on two benchmark problems, and demonstrate the advantages of our approach over generic function approximation approaches to representing the local $$Q$$-functions.
more »
« less
- Award ID(s):
- 2328241
- PAR ID:
- 10630949
- Editor(s):
- Li, Yingzhen; Mandt, Stephan; Agrawal, Shipra; Khan, Emtiyaz
- Publisher / Repository:
- Proceedings of Machine Learning Research
- Date Published:
- Volume:
- 258
- Page Range / eLocation ID:
- 550-558
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
We study multi-agent reinforcement learning (MARL) in a stochastic network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and stochastic, and provide a finite-time error bound that shows how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of MARL in networked systems.more » « less
-
The practicality of reinforcement learning algorithms has been limited due to poor scaling with respect to the problem size, as the sample complexity of learning an ε-optimal policy is Ω(|S||A|H/ ε2) over worst case instances of an MDP with state space S, action space A, and horizon H. We consider a class of MDPs for which the associated optimal Q* function is low rank, where the latent features are unknown. While one would hope to achieve linear sample complexity in |S| and |A| due to the low rank structure, we show that without imposing further assumptions beyond low rank of Q*, if one is constrained to estimate the Q function using only observations from a subset of entries, there is a worst case instance in which one must incur a sample complexity exponential in the horizon H to learn a near optimal policy. We subsequently show that under stronger low rank structural assumptions, given access to a generative model, Low Rank Monte Carlo Policy Iteration (LR-MCPI) and Low Rank Empirical Value Iteration (LR-EVI) achieve the desired sample complexity of Õ((|S|+|A|)poly (d,H)/ε2) for a rank d setting, which is minimax optimal with respect to the scaling of |S|, |A|, and ε. In contrast to literature on linear and low-rank MDPs, we do not require a known feature mapping, our algorithm is computationally simple, and our results hold for long time horizons. Our results provide insights on the minimal low-rank structural assumptions required on the MDP with respect to the transition kernel versus the optimal action-value function.more » « less
-
Optimal sensor placement is critical for enhancing the effectiveness of monitoring dynamical systems. Deterministic solutions do not reflect the effects of input and parameter uncertainty on the sensor placement. Using a Markov decision process (MDP) and a sensor placement agent, this study proposes a stochastic approach to maximize the gain from placing a fixed number of sensors within the system. Utilizing Deep Reinforcement Learning (DRL), the agent is trained by collecting interactive samples within the environment, which uses an information-theoretic reward function that is a measure, based on Shannon entropy, of the identifiability of the model parameters. The goal of the agent is to maximize its expected future reward by selecting, at each step, the action (placing a sensor) that provides the most information. This framework is validated using a synthetic model of a base-isolated structure. To consider the existing uncertainty in the parameters, a prior probability distribution is chosen (e.g., based on expert judgement or preliminary study) for each model parameter. Further, a probabilistic model for the input is used to reflect input variability. In a Deep Q-network, a type of DRL algorithm, the agent learns a mapping from states (i.e., sensor configurations) to the "quality" of each action at that state, called "Q-values". This network is trained using samples of state, action, and reward by interacting with the environment. The modular property of the framework and the function approximation used in this study makes it scalable to complex real-world applications of sensor placement problems in the presence of uncertainties.more » « less
-
Federated reinforcement learning (FRL) has emerged as a promising paradigm for reducing the sample complexity of reinforcement learning tasks by exploiting information from different agents. However, when each agent interacts with a po- tentially different environment, little to nothing is known theoretically about the non-asymptotic performance of FRL algorithms. The lack of such results can be attributed to various technical challenges and their intricate interplay: Markovian sampling, linear function approximation, multiple local updates to save communi- cation, heterogeneity in the reward functions and transition kernels of the agents’ MDPs, and continuous state-action spaces. Moreover, in the on-policy setting, the behavior policies vary with time, further complicating the analysis. In response, we introduce FedSARSA, a novel federated on-policy reinforcement learning scheme, equipped with linear function approximation, to address these challenges and provide a comprehensive finite-time error analysis. Notably, we establish that FedSARSA converges to a policy that is near-optimal for all agents, with the ex- tent of near-optimality proportional to the level of heterogeneity. Furthermore, we prove that FedSARSA leverages agent collaboration to enable linear speedups as the number of agents increases, which holds for both fixed and adaptive step-size configurations.more » « less
An official website of the United States government
