skip to main content

Title: Scalable Reinforcement Learning for Multiagent Networked Systems
We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact in a local manner where the objective is to find localized policies such that the (discounted) global reward is maximized. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a scalable actor critic (SAC) framework that exploits the network structure and finds a localized policy that is an [Formula: see text]-approximation of a stationary point of the objective for some [Formula: see text], with complexity that scales with the local state-action space size of the largest [Formula: see text]-hop neighborhood of the network. We illustrate our model and approach using examples from wireless communication, epidemics, and traffic.  more » « less
Award ID(s):
2105648 2146814 2136197 2106403 1637598 2038603
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Operations Research
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study multi-agent reinforcement learning (MARL) in a stochastic network of agents. The objective is to find localized policies that maximize the (discounted) global reward. In general, scalability is a challenge in this setting because the size of the global state/action space can be exponential in the number of agents. Scalable algorithms are only known in cases where dependencies are static, fixed and local, e.g., between neighbors in a fixed, time-invariant underlying graph. In this work, we propose a Scalable Actor Critic framework that applies in settings where the dependencies can be non-local and stochastic, and provide a finite-time error bound that shows how the convergence rate depends on the speed of information spread in the network. Additionally, as a byproduct of our analysis, we obtain novel finite-time convergence results for a general stochastic approximation scheme and for temporal difference learning with state aggregation, which apply beyond the setting of MARL in networked systems. 
    more » « less
  2. We propose subsampling as a unified algorithmic technique for submodular maximization in centralized and online settings. The idea is simple: independently sample elements from the ground set and use simple combinatorial techniques (such as greedy or local search) on these sampled elements. We show that this approach leads to optimal/state-of-the-art results despite being much simpler than existing methods. In the usual off-line setting, we present SampleGreedy, which obtains a [Formula: see text]-approximation for maximizing a submodular function subject to a p-extendible system using [Formula: see text] evaluation and feasibility queries, where k is the size of the largest feasible set. The approximation ratio improves to p + 1 and p for monotone submodular and linear objectives, respectively. In the streaming setting, we present Sample-Streaming, which obtains a [Formula: see text]-approximation for maximizing a submodular function subject to a p-matchoid using O(k) memory and [Formula: see text] evaluation and feasibility queries per element, and m is the number of matroids defining the p-matchoid. The approximation ratio improves to 4p for monotone submodular objectives. We empirically demonstrate the effectiveness of our algorithms on video summarization, location summarization, and movie recommendation tasks. 
    more » « less
  3. In this paper, we consider federated Q-learning, which aims to learn an optimal Q-function by periodically aggregating local Q-estimates trained on local data alone. Focusing on infinite-horizon tabular Markov decision processes, we provide sample complexity guarantees for both the synchronous and asynchronous variants of federated Q-learning. In both cases, our bounds exhibit a linear speedup with respect to the number of agents and sharper dependencies on other salient problem parameters. Moreover, existing approaches to federated Q-learning adopt an equally-weighted averaging of local Q-estimates, which can be highly sub-optimal in the asynchronous setting since the local trajectories can be highly heterogeneous due to different local behavior policies. Existing sample complexity scales inverse proportionally to the minimum entry of the stationary state-action occupancy distributions over all agents, requiring that every agent covers the entire state-action space. Instead, we propose a novel importance averaging algorithm, giving larger weights to more frequently visited state-action pairs. The improved sample complexity scales inverse proportionally to the minimum entry of the average stationary state-action occupancy distribution of all agents, thus only requiring the agents collectively cover the entire state-action space, unveiling the blessing of heterogeneity. 
    more » « less
  4. null (Ed.)
    This work concerns the asymptotic behavior of solutions to a (strictly) subcritical fluid model for a data communication network, where file sizes are generally distributed and the network operates under a fair bandwidth-sharing policy. Here we consider fair bandwidth-sharing policies that are a slight generalization of the [Formula: see text]-fair policies introduced by Mo and Walrand [Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE/ACM Trans. Networks 8(5):556–567.]. Since the year 2000, it has been a standing problem to prove stability of the data communications network model of Massoulié and Roberts [Massoulié L, Roberts J (2000) Bandwidth sharing and admission control for elastic traffic. Telecommunication Systems 15(1):185–201.], with general file sizes and operating under fair bandwidth sharing policies, when the offered load is less than capacity (subcritical conditions). A crucial step in an approach to this problem is to prove stability of subcritical fluid model solutions. In 2012, Paganini et al. [Paganini F, Tang A, Ferragut A, Andrew LLH (2012) Network stability under alpha fair bandwidth allocation with general file size distribution. IEEE Trans. Automatic Control 57(3):579–591.] introduced a Lyapunov function for this purpose and gave an argument, assuming that fluid model solutions are sufficiently smooth in time and space that they are strong solutions of a partial differential equation and assuming that no fluid level on any route touches zero before all route levels reach zero. The aim of the current paper is to prove stability of the subcritical fluid model without these strong assumptions. Starting with a slight generalization of the Lyapunov function proposed by Paganini et al., assuming that each component of the initial state of a measure-valued fluid model solution, as well as the file size distributions, have no atoms and have finite first moments, we prove absolute continuity in time of the composition of the Lyapunov function with any subcritical fluid model solution and describe the associated density. We use this to prove that the Lyapunov function composed with such a subcritical fluid model solution converges to zero as time goes to infinity. This implies that each component of the measure-valued fluid model solution converges vaguely on [Formula: see text] to the zero measure as time goes to infinity. Under the further assumption that the file size distributions have finite pth moments for some p > 1 and that each component of the initial state of the fluid model solution has finite pth moment, it is proved that the fluid model solution reaches the measure with all components equal to the zero measure in finite time and that the time to reach this zero state has a uniform bound for all fluid model solutions having a uniform bound on the initial total mass and the pth moment of each component of the initial state. In contrast to the analysis of Paganini et al., we do not need their strong smoothness assumptions on fluid model solutions and we rigorously treat the realistic, but singular situation, where the fluid level on some routes becomes zero, whereas other route levels remain positive. 
    more » « less
  5. This work aims to prove a Hardy-type inequality and a trace theorem for a class of function spaces on smooth domains with a nonlocal character. Functions in these spaces are allowed to be as rough as an [Formula: see text]-function inside the domain of definition but as smooth as a [Formula: see text]-function near the boundary. This feature is captured by a norm that is characterized by a nonlocal interaction kernel defined heterogeneously with a special localization feature on the boundary. Thus, the trace theorem we obtain here can be viewed as an improvement and refinement of the classical trace theorem for fractional Sobolev spaces [Formula: see text]. Similarly, the Hardy-type inequalities we establish for functions that vanish on the boundary show that functions in this generalized space have the same decay rate to the boundary as functions in the smaller space [Formula: see text]. The results we prove extend existing results shown in the Hilbert space setting with p = 2. A Poincaré-type inequality we establish for the function space under consideration together with the new trace theorem allows formulating and proving well-posedness of a nonlinear nonlocal variational problem with conventional local boundary condition. 
    more » « less