skip to main content


This content will become publicly available on January 16, 2025

Title: A Black-box Approach for Non-stationary Multi-agent Reinforcement Learning
We investigate learning the equilibria in non-stationary multi-agent systems and address the challenges that differentiate multi-agent learning from single-agent learning. Specifically, we focus on games with bandit feedback, where testing an equilibrium can result in substantial regret even when the gap to be tested is small, and the existence of multiple optimal solutions (equilibria) in stationary games poses extra challenges. To overcome these obstacles, we propose a versatile black-box approach applicable to a broad spectrum of problems, such as general-sum games, potential games, and Markov games, when equipped with appropriate learning and testing oracles for stationary environments. Our algorithms can achieve O(∆^1/4 T^3/4) regret when the degree of nonstationarity, as measured by total variation ∆, is known, and O(∆^1/5 T^4/5) regret when ∆ is unknown, where T is the number of rounds. Meanwhile, our algorithm inherits the favorable dependence on number of agents from the oracles. As a side contribution that may be independent of interest, we show how to test for various types of equilibria by a black-box reduction to single-agent learning, which includes Nash equilibria, correlated equilibria, and coarse correlated equilibria.  more » « less
Award ID(s):
2023166 2212261
PAR ID:
10534275
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Proceedings of the International Conference on Learning Representations
Date Published:
ISSN:
2640-3498
Format(s):
Medium: X
Location:
Vienna, Austria
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Regret minimization has proved to be a versatile tool for tree- form sequential decision making and extensive-form games. In large two-player zero-sum imperfect-information games, mod- ern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regret-minimization algorithms for tree-form sequential decision making, including CFR, require (i) an exact model of the player’s decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs—even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest towards relaxing some of those restric- tions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used—for example, those in which only black-box access to the environment is available. We give the first, to our knowl- edge, regret-minimization algorithm that guarantees sublinear regret with high probability even when requirement (i)—and thus also (ii)—is dropped. We formalize an online learning setting in which the strategy space is not known to the agent and gets revealed incrementally whenever the agent encoun- ters new decision points. We give an efficient algorithm that achieves O(T 3/4) regret with high probability for that setting, even when the agent faces an adversarial environment. Our experiments show it significantly outperforms the prior algo- rithms for the problem, which do not have such guarantees. It can be used in any application for which regret minimization is useful: approximating Nash equilibrium or quantal response equilibrium, approximating coarse correlated equilibrium in multi-player games, learning a best response, learning safe opponent exploitation, and online play against an unknown opponent/environment. 
    more » « less
  2. Characterizing the performance of no-regret dynamics in multi-player games is a foundational problem at the interface of online learning and game theory. Recent results have revealed that when all players adopt specific learning algorithms, it is possible to improve exponentially over what is predicted by the overly pessimistic no-regret framework in the traditional adversarial regime, thereby leading to faster convergence to the set of coarse correlated equilibria (CCE) – a standard game-theoretic equilibrium concept. Yet, despite considerable recent progress, the fundamental complexity barriers for learning in normal- and extensive-form games are poorly understood. In this paper, we make a step towards closing this gap by first showing that – barring major complexity breakthroughs – any polynomial-time learning algorithms in extensive-form games need at least 2log1/2−o(1) |T | iterations for the average regret to reach below even an absolute constant, where |T | is the number of nodes in the game. This establishes a superpolynomial separation between no-regret learning in normal- and extensive-form games, as in the former class a logarithmic number of iterations suffices to achieve constant average regret. Furthermore, our results imply that algorithms such as multiplicative weights update, as well as its optimistic counterpart, require at least 2(log logm)1/2−o(1) iterations to attain an O(1)-CCE in m-action normal-form games under any parameterization. These are the first non-trivial – and dimension-dependent – lower bounds in that setting for the most well-studied algorithms in the literature. From a technical standpoint, we follow a beautiful connection recently made by Foster, Golowich, and Kakade (ICML ’23) between sparse CCE and Nash equilibria in the context of Markov games. Consequently, our lower bounds rule out polynomial-time algorithms well beyond the traditional online learning framework, capturing techniques commonly used for accelerating centralized equilibrium computation. 
    more » « less
  3. null (Ed.)
    The existence of simple uncoupled no-regret learning dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form games generalize normal-form games by modeling both sequential and simultaneous moves, as well as imperfect information. Because of the sequential nature and presence of private information in the game, correlation in extensive-form games possesses significantly different properties than its counterpart in normal-form games, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to the classical notion of correlated equilibrium in normal-form games. Compared to the latter, the constraints that define the set of EFCEs are significantly more complex, as the correlation device must keep into account the evolution of beliefs of each player as they make observations throughout the game. Due to that significant added complexity, the existence of uncoupled learning dynamics leading to an EFCE has remained a challenging open research question for a long time. In this article, we settle that question by giving the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. We show that each iterate can be computed in time polynomial in the size of the game tree, and that, when all players play repeatedly according to our learning dynamics, the empirical frequency of play is proven to be a O(T^-0.5)-approximate EFCE with high probability after T game repetitions, and an EFCE almost surely in the limit. 
    more » « less
  4. Policy gradient methods enjoy strong practical performance in numerous tasks in reinforcement learning. Their theoretical understanding in multiagent settings, however, remains limited, especially beyond two-player competitive and potential Markov games. In this paper, we develop a new framework to characterize optimistic policy gradient methods in multi-player Markov games with a single controller. Specifically, under the further assumption that the game exhibits an equilibrium collapse, in that the marginals of coarse correlated equilibria (CCE) induce Nash equilibria (NE), we show convergence to stationary ϵ-NE in O(1/ϵ2) iterations, where O(⋅) suppresses polynomial factors in the natural parameters of the game. Such an equilibrium collapse is well-known to manifest itself in two-player zero-sum Markov games, but also occurs even in a class of multi-player Markov games with separable interactions, as established by recent work. As a result, we bypass known complexity barriers for computing stationary NE when either of our assumptions fails. Our approach relies on a natural generalization of the classical Minty property that we introduce, which we anticipate to have further applications beyond Markov games. 
    more » « less
  5. In black-box optimization problems, we aim to maximize an unknown objective function, where the function is only accessible through feedbacks of an evaluation or simulation oracle. In real-life, the feedbacks of such oracles are often noisy and available after some unknown delay that may depend on the computation time of the oracle. Additionally, if the exact evaluations are expensive but coarse approximations are available at a lower cost, the feedbacks can have multi-fidelity. In order to address this problem, we propose a generic extension of hierarchical optimistic tree search (HOO), called ProCrastinated Tree Search (PCTS), that flexibly accommodates a delay and noise-tolerant bandit algorithm. We provide a generic proof technique to quantify regret of PCTS under delayed, noisy, and multi-fidelity feedbacks. Specifically, we derive regret bounds of PCTS enabled with delayed-UCB1 (DUCB1) and delayed-UCB-V (DUCBV) algorithms. Given a horizon T, PCTS retains the regret bound of non-delayed HOO for expected delay of O(log T), and worsens by T^((1-α)/(d+2)) for expected delays of O(T^(1-α)) for α ∈ (0,1]. We experimentally validate on multiple synthetic functions and hyperparameter tuning problems that PCTS outperforms the state-of-the-art black-box optimization methods for feedbacks with different noise levels, delays, and fidelity. 
    more » « less