We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon, and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
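The tractability of the CCE step rests on a standard fact: unlike a Nash equilibrium, a coarse correlated equilibrium of a general-sum bimatrix game can be found by linear programming. A minimal sketch of that feasibility program, assuming `scipy` is available; the payoff matrices below are illustrative and are not the paper's confidence-bound construction:

```python
import numpy as np
from scipy.optimize import linprog

def coarse_correlated_equilibrium(A, B):
    """Find a CCE of the bimatrix game (A, B) by a feasibility LP.

    A[i, j] is the row player's payoff and B[i, j] the column player's
    payoff for joint action (i, j). A CCE is a distribution pi over joint
    actions such that no player gains by committing in advance to a
    single fixed action while the opponent follows pi's marginal.
    """
    n, m = A.shape
    # One inequality per unilateral deviation: switching to fixed row i'
    # (resp. column j') must not increase the deviator's expected payoff.
    dev_rows = [(A[ip, :][None, :] - A).ravel() for ip in range(n)]
    dev_cols = [(B[:, jp][:, None] - B).ravel() for jp in range(m)]
    A_ub = np.array(dev_rows + dev_cols)
    A_eq = np.ones((1, n * m))                       # pi sums to one
    res = linprog(c=np.zeros(n * m), A_ub=A_ub, b_ub=np.zeros(n + m),
                  A_eq=A_eq, b_eq=[1.0], bounds=(0, 1), method="highs")
    return res.x.reshape(n, m)

# Prisoner's dilemma: (defect, defect) is a CCE, so the LP is feasible.
A = np.array([[3.0, 0.0], [5.0, 1.0]])
pi = coarse_correlated_equilibrium(A, A.T)
```

Because a CCE always exists, the LP is always feasible; any linear objective can replace the zero vector in `c` to select among the CCEs.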
XDO: A Double Oracle Algorithm for Extensive-Form Games
Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algorithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guaranteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RL method that can find an approximate Nash equilibrium in high-dimensional continuous-action sequential games. Experiment code is available at https://github.com/indylab/nxdo.
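The root-level mixing that XDO departs from is easiest to see in the tabular double-oracle loop for one-shot zero-sum matrix games, which PSRO generalizes. A minimal sketch, assuming `scipy` and a fully known payoff matrix in place of an RL best-response oracle:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """Maximin LP: the row player's optimal mixed strategy and game value."""
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = -1.0                  # maximize v
    A_ub = np.hstack([-M.T, np.ones((m, 1))])          # v <= x^T M[:, j]
    A_eq = np.zeros((1, n + 1)); A_eq[0, :n] = 1.0     # x sums to one
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)], method="highs")
    return res.x[:n], res.x[-1]

def double_oracle(M, tol=1e-8):
    """Grow restricted action sets with full-game best responses until
    neither player can improve on the restricted game's equilibrium."""
    R, C = [0], [0]                                    # restricted supports
    while True:
        sub = M[np.ix_(R, C)]
        x, v = solve_zero_sum(sub)                     # row equilibrium mix
        y, _ = solve_zero_sum(-sub.T)                  # column equilibrium mix
        row_payoffs = M[:, C] @ y                      # full-game best responses
        col_payoffs = x @ M[R, :]
        br_row, br_col = int(np.argmax(row_payoffs)), int(np.argmin(col_payoffs))
        if row_payoffs[br_row] <= v + tol and col_payoffs[br_col] >= v - tol:
            return R, C, x, y, v
        if br_row not in R: R.append(br_row)
        if br_col not in C: C.append(br_col)

# Rock-paper-scissors: the loop expands to the full support, value 0.
rps = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])
R, C, x, y, v = double_oracle(rps)
```

PSRO replaces `solve_zero_sum` on the restricted game with a meta-solver over policies and the argmax/argmin lines with RL-trained best responses; XDO's change is to mix the accumulated best responses per infostate rather than only at the root.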
Award ID(s): 1839429
NSF-PAR ID: 10313270
Journal Name: Advances in Neural Information Processing Systems
ISSN: 1049-5258
Sponsoring Org: National Science Foundation
More Like this


Francisco Ruiz, Jennifer Dy (Eds.) We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum n × 2 matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices, capturing the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions i and j, respectively, they both observe each other's played actions and a stochastic observation Xij with E[Xij] = Aij. To our knowledge, our work gives the first instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium, in the sense that the number of rounds depends on the specific properties of the game matrix A as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.
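The observation model above can be simulated directly: resample every entry each round, solve the empirical game, and count the rounds until the resulting profile is an ε-equilibrium of the true game. A minimal instance-independent baseline sketch (the kind of uniform-sampling scheme such instance-dependent bounds improve upon), assuming `scipy`; the matrix, noise level, and batch size are illustrative, not from the paper:

```python
import numpy as np
from scipy.optimize import linprog

def maximin(M):
    """Row player's maximin mixed strategy for the zero-sum game M."""
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((m, 1))])
    A_eq = np.zeros((1, n + 1)); A_eq[0, :n] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * n + [(None, None)], method="highs")
    return res.x[:n]

def duality_gap(A, x, y):
    """Distance of the profile (x, y) from equilibrium in the true game A."""
    return (A @ y).max() - (x @ A).min()

def rounds_until_eps(A, eps, noise=0.2, batch=25, max_rounds=20_000, seed=0):
    """Each round observe a noisy copy X with E[X] = A; stop once the
    equilibrium of the running mean is an eps-equilibrium of A."""
    rng = np.random.default_rng(seed)
    sums, t = np.zeros_like(A), 0
    while t < max_rounds:
        sums += batch * A + rng.normal(0.0, noise, (batch,) + A.shape).sum(0)
        t += batch
        A_hat = sums / t
        x, y = maximin(A_hat), maximin(-A_hat.T)
        if duality_gap(A, x, y) <= eps:
            break
    return t, duality_gap(A, x, y)

# An n x 2 game as in the abstract; value 0.5 at the uniform mix on rows 0, 1.
A = np.array([[0.0, 1.0], [1.0, 0.0], [0.4, 0.4]])
t, gap = rounds_until_eps(A, eps=0.15)
```

Varying the gap between row 2 and the optimal mixture changes how quickly the empirical equilibrium stabilizes, which is the kind of instance dependence the bounds formalize.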

Regret minimization has proved to be a versatile tool for tree-form sequential decision making and extensive-form games. In large two-player zero-sum imperfect-information games, modern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regret-minimization algorithms for tree-form sequential decision making, including CFR, require (i) an exact model of the player's decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs, even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest in relaxing some of those restrictions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used, for example, those in which only black-box access to the environment is available. We give the first regret-minimization algorithm, to our knowledge, that guarantees sublinear regret with high probability even when requirement (i), and thus also (ii), is dropped. We formalize an online learning setting in which the strategy space is not known to the agent and is revealed incrementally whenever the agent encounters new decision points. We give an efficient algorithm that achieves O(T^{3/4}) regret with high probability for that setting, even when the agent faces an adversarial environment. Our experiments show it significantly outperforms the prior algorithms for the problem, which do not have such guarantees. It can be used in any application for which regret minimization is useful: approximating Nash equilibrium or quantal response equilibrium, approximating coarse correlated equilibrium in multiplayer games, learning a best response, learning safe opponent exploitation, and online play against an unknown opponent/environment.

The existence of simple uncoupled no-regret learning dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multiagent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form games generalize normal-form games by modeling both sequential and simultaneous moves, as well as imperfect information. Because of the sequential nature and presence of private information in the game, correlation in extensive-form games possesses significantly different properties than its counterpart in normal-form games, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to the classical notion of correlated equilibrium in normal-form games. Compared to the latter, the constraints that define the set of EFCEs are significantly more complex, as the correlation device must take into account the evolution of each player's beliefs as they make observations throughout the game. Due to that significant added complexity, the existence of uncoupled learning dynamics leading to an EFCE has long remained a challenging open research question. In this article, we settle that question by giving the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. We show that each iterate can be computed in time polynomial in the size of the game tree, and that, when all players play repeatedly according to our learning dynamics, the empirical frequency of play is proven to be an O(T^{-1/2})-approximate EFCE with high probability after T game repetitions, and an EFCE almost surely in the limit.
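The 20-year-old normal-form result referenced above is constructive: Hart and Mas-Colell's regret-matching procedure drives internal regret to zero, so the empirical joint play approaches the set of correlated equilibria. A minimal two-player sketch of those dynamics; the constant `mu` and the game of chicken below are illustrative choices, not from the article:

```python
import numpy as np

class InternalRegretMatcher:
    """One player of Hart & Mas-Colell's (2000) adaptive procedure.

    Tracks cumulative regrets D[i, k] ("every time I played i, how much
    better would k have been?") and switches away from the previous
    action with probability proportional to positive average regret.
    """
    def __init__(self, n_actions, mu, rng):
        self.n, self.mu, self.rng = n_actions, mu, rng
        self.D = np.zeros((n_actions, n_actions))
        self.last = 0

    def act(self, t):
        j = self.last
        p = np.maximum(self.D[j], 0.0) / (self.mu * t)
        p[j] = 0.0
        p[j] = max(1.0 - p.sum(), 0.0)                 # otherwise stay put
        self.last = self.rng.choice(self.n, p=p / p.sum())
        return self.last

    def update(self, counterfactuals):
        # counterfactuals[k] = payoff action k would have earned this round
        self.D[self.last] += counterfactuals - counterfactuals[self.last]

def max_ce_violation(pi, A, B):
    """Largest gain from any internal deviation i -> k under pi."""
    row = max(pi[i] @ (A[k] - A[i])
              for i in range(A.shape[0]) for k in range(A.shape[0]))
    col = max(pi[:, j] @ (B[:, k] - B[:, j])
              for j in range(A.shape[1]) for k in range(A.shape[1]))
    return max(row, col)

# Game of chicken; its CE set strictly contains its Nash equilibria.
A = np.array([[6.0, 2.0], [7.0, 0.0]]); B = A.T
rng = np.random.default_rng(0)
mu = 2 * A.max() * (A.shape[0] - 1) + 1                # above the H&M threshold
p1 = InternalRegretMatcher(2, mu, rng)
p2 = InternalRegretMatcher(2, mu, rng)
T, counts = 50_000, np.zeros_like(A)
for t in range(1, T + 1):
    i, j = p1.act(t), p2.act(t)
    counts[i, j] += 1
    p1.update(A[:, j])                                 # row counterfactuals
    p2.update(B[i, :])                                 # column counterfactuals
pi = counts / T
```

The difficulty the article addresses is precisely that no analogue of this simple per-action regret bookkeeping was known for EFCE, where deviations are sequential and belief-dependent.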

We consider computational games, sequences of games G = (G1, G2, ...) where, for all n, Gn has the same set of players. Computational games arise in electronic money systems such as Bitcoin, in cryptographic protocols, and in the study of generative adversarial networks in machine learning. Assuming that one-way functions exist, we prove that there is a two-player zero-sum computational game G such that, for all n, the size of the action space in Gn is polynomial in n and the utility function in Gn is computable in time polynomial in n, and yet there is no ε-Nash equilibrium if players are restricted to using strategies computable by polynomial-time Turing machines, where we use a notion of Nash equilibrium that is tailored to computational games. We also show that an ε-Nash equilibrium may not exist if players are constrained to perform at most T computational steps in each of the games in the sequence. On the other hand, we show that if players can use arbitrary Turing machines to compute their strategies, then every computational game has an ε-Nash equilibrium. These results may shed light on competitive settings where the availability of more running time or faster algorithms can lead to a "computational arms race", precluding the existence of equilibrium. They also point to inherent limitations of concepts such as "best response" and Nash equilibrium in games with resource-bounded players.