skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 10:00 PM ET on Friday, December 8 until 2:00 AM ET on Saturday, December 9 due to maintenance. We apologize for the inconvenience.


This content will become publicly available on January 1, 2024

Title: Instance-dependent Sample Complexity Bounds for Zero-sum Matrix Games
We study the sample complexity of identifying an approximate equilibrium for two-player zero-sum n × 2 matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instance-dependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions i and j, respectively, they both observe each other’s played actions and a stochastic observation Xij such that E [Xij ] = Aij . To our knowledge, our work is the first case of instance-dependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium in the sense that the number of rounds depends on the specific properties of the game matrix A as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.  more » « less
Award ID(s):
1907907 2023166
NSF-PAR ID:
10415280
Author(s) / Creator(s):
; ;
Editor(s):
Francisco Ruiz, Jennifer Dy
Date Published:
Journal Name:
Proceedings of Machine Learning Research
Volume:
206
ISSN:
2640-3498
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We study the following problem, which to our knowledge has been addressed only partially in the literature and not in full generality. An agent observes two players play a zero-sum game that is known to the players but not the agent. The agent observes the actions and state transitions of their game play, but not rewards. The players may play either op-timally (according to some Nash equilibrium) or according to any other solution concept, such as a quantal response equilibrium. Following these observations, the agent must recommend a policy for one player, say Player 1. The goal is to recommend a policy that is minimally exploitable un-der the true, but unknown, game. We take a Bayesian ap-proach. We establish a likelihood function based on obser-vations and the specified solution concept. We then propose an approach based on Markov chain Monte Carlo (MCMC), which allows us to approximately sample games from the agent’s posterior belief distribution. Once we have a batch of independent samples from the posterior, we use linear pro-gramming and backward induction to compute a policy for Player 1 that minimizes the sum of exploitabilities over these games. This approximates the policy that minimizes the ex-pected exploitability under the full distribution. Our approach is also capable of handling counterfactuals, where known modifications are applied to the unknown game. We show that our Bayesian MCMC-based technique outperforms two other techniques—one based on the equilibrium policy of the maximum-probability game and the other based on imitation of observed behavior—on all the tested stochastic game envi-ronments. 
    more » « less
  2. We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To our best knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right. 
    more » « less
  3. null (Ed.)
    The existence of simple uncoupled no-regret learning dynamics that converge to correlated equilibria in normal-form games is a celebrated result in the theory of multi-agent systems. Specifically, it has been known for more than 20 years that when all players seek to minimize their internal regret in a repeated normal-form game, the empirical frequency of play converges to a normal-form correlated equilibrium. Extensive-form games generalize normal-form games by modeling both sequential and simultaneous moves, as well as imperfect information. Because of the sequential nature and presence of private information in the game, correlation in extensive-form games possesses significantly different properties than its counterpart in normal-form games, many of which are still open research directions. Extensive-form correlated equilibrium (EFCE) has been proposed as the natural extensive-form counterpart to the classical notion of correlated equilibrium in normal-form games. Compared to the latter, the constraints that define the set of EFCEs are significantly more complex, as the correlation device must keep into account the evolution of beliefs of each player as they make observations throughout the game. Due to that significant added complexity, the existence of uncoupled learning dynamics leading to an EFCE has remained a challenging open research question for a long time. In this article, we settle that question by giving the first uncoupled no-regret dynamics that converge to the set of EFCEs in n-player general-sum extensive-form games with perfect recall. We show that each iterate can be computed in time polynomial in the size of the game tree, and that, when all players play repeatedly according to our learning dynamics, the empirical frequency of play is proven to be a O(T^-0.5)-approximate EFCE with high probability after T game repetitions, and an EFCE almost surely in the limit. 
    more » « less
  4. Jin, Shi (Ed.)
    In this paper, we apply the idea of fictitious play to design deep neural networks (DNNs), and develop deep learning theory and algorithms for computing the Nash equilibrium of asymmetric N-player non-zero-sum stochastic differential games, for which we refer as deep fictitious play, a multi-stage learning process. Specifically at each stage, we propose the strategy of letting individual player optimize her own payoff subject to the other players’ previous actions, equivalent to solving N decoupled stochastic control optimization problems, which are approximated by DNNs. Therefore, the fictitious play strategy leads to a structure consisting of N DNNs, which only communicate at the end of each stage. The resulting deep learning algorithm based on fictitious play is scalable, parallel and model-free, i.e., using GPU parallelization, it can be applied to any N-player stochastic differential game with different symmetries and heterogeneities (e.g., existence of major players). We illustrate the performance of the deep learning algorithm by comparing to the closed-form solution of the linear quadratic game. Moreover, we prove the convergence of fictitious play under appropriate assumptions, and verify that the convergent limit forms an open-loop Nash equilibrium. We also discuss the extensions to other strategies designed upon fictitious play and closed-loop Nash equilibrium in the end. 
    more » « less
  5. null (Ed.)
    A two-player finite game is represented by two payoff matrices (A, B), one for each player. Imitation games are a subclass of two-player games in which B is the identity matrix, implying that the second player gets a positive payoff only if she "imitates" the first. Given that the problem of computing a Nash equilibrium (NE) is known to be provably hard, even to approximate, we ask if it is any easier for imitation games. We show that much like the general case, for any c > 0, computing a 1 over n^c -approximate NE of imitation games remains PPAD-hard, where n is the number of moves available to the players. On the other hand, we design a polynomial-time algorithm to find ε-approximate NE for any given constant ε > 0 (PTAS). The former result also rules out the smooth complexity being in P, unless PPAD ⊂ RP. 
    more » « less