Deep Fictitious Play for Finding Markovian Nash Equilibrium in Multi-Agent Games
We propose a deep neural network-based algorithm to identify the Markovian Nash equilibrium of general large 𝑁-player stochastic differential games. Following the idea of fictitious play, we recast the 𝑁-player game into 𝑁 decoupled decision problems (one for each player) and solve them iteratively. The individual decision problem is characterized by a semilinear Hamilton-Jacobi-Bellman equation, to solve which we employ the recently developed deep BSDE method. The resulted algorithm can solve large 𝑁-player games for which conventional numerical methods would suffer from the curse of dimensionality. Multiple numerical examples involving identical or heterogeneous agents, with risk-neutral or risk-sensitive objectives, are tested to validate the accuracy of the proposed algorithm in large group games. Even for a fifty-player game with the presence of common noise, the proposed algorithm still finds the approximate Nash equilibrium accurately, which, to our best knowledge, is difficult to achieve by other numerical algorithms.
Authors:
;
Award ID(s):
Publication Date:
NSF-PAR ID:
10253642
Journal Name:
Proceedings of Machine Learning Research
Volume:
107
Page Range or eLocation-ID:
221-245
ISSN:
2640-3498
National Science Foundation
More Like this
1. Stochastic differential games have been used extensively to model agents' competitions in finance, for instance, in P2P lending platforms from the Fintech industry, the banking system for systemic risk, and insurance markets. The recently proposed machine learning algorithm, deep fictitious play, provides a novel and efficient tool for finding Markovian Nash equilibrium of large \begin{document}$N$\end{document}-player asymmetric stochastic differential games [J. Han and R. Hu, Mathematical and Scientific Machine Learning Conference, pages 221-245, PMLR, 2020]. By incorporating the idea of fictitious play, the algorithm decouples the game into \begin{document}$N$\end{document} sub-optimization problems, and identifies each player's optimal strategy with the deep backward stochastic differential equation (BSDE) method parallelly and repeatedly. In this paper, we prove the convergence of deep fictitious play (DFP) to the true Nash equilibrium. We can also show that the strategy based on DFP forms an \begin{document}$\epsilon$\end{document}-Nash equilibrium. We generalize the algorithm by proposing a new approach to decouple the games, and present numerical results of large population games showing the empirical convergence of the algorithm beyond the technical assumptions in the theorems.

2. Policy Space Response Oracles (PSRO) is a reinforcement learning (RL) algo- rithm for two-player zero-sum games that has been empirically shown to find approximate Nash equilibria in large games. Although PSRO is guaranteed to converge to an approximate Nash equilibrium and can handle continuous actions, it may take an exponential number of iterations as the number of information states (infostates) grows. We propose Extensive-Form Double Oracle (XDO), an extensive-form double oracle algorithm for two-player zero-sum games that is guar- anteed to converge to an approximate Nash equilibrium linearly in the number of infostates. Unlike PSRO, which mixes best responses at the root of the game, XDO mixes best responses at every infostate. We also introduce Neural XDO (NXDO), where the best response is learned through deep RL. In tabular experiments on Leduc poker, we find that XDO achieves an approximate Nash equilibrium in a number of iterations an order of magnitude smaller than PSRO. Experiments on a modified Leduc poker game and Oshi-Zumo show that tabular XDO achieves a lower exploitability than CFR with the same amount of computation. We also find that NXDO outperforms PSRO and NFSP on a sequential multidimensional continuous-action game. NXDO is the first deep RLmore »
3. Unmanned Aerial Vehicle (UAV)-assisted Multi-access Edge Computing (MEC) systems have emerged recently as a flexible and dynamic computing environment, providing task offloading service to the users. In order for such a paradigm to be viable, the operator of a UAV-mounted MEC server should enjoy some form of profit by offering its computing capabilities to the end users. To deal with this issue in this paper, we apply a usage-based pricing policy for allowing the exploitation of the servers’ computing resources. The proposed pricing mechanism implicitly introduces a more social behavior to the users with respect to competing for the UAV-mounted MEC servers’ computation resources. In order to properly model the users’ risk-aware behavior within the overall data offloading decision-making process the principles of Prospect Theory are adopted, while the exploitation of the available computation resources is considered based on the theory of the Tragedy of the Commons. Initially, the user’s prospect-theoretic utility function is formulated by quantifying the user’s risk seeking and loss aversion behavior, while taking into account the pricing mechanism. Accordingly, the users’ pricing and risk-aware data offloading problem is formulated as a distributed maximization problem of each user’s expected prospect-theoretic utility function and addressed as a non-cooperativemore »
4. We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solvingmore »
5. (Ed.)
In this paper, we apply the idea of fictitious play to design deep neural networks (DNNs), and develop deep learning theory and algorithms for computing the Nash equilibrium of asymmetric N-player non-zero-sum stochastic differential games, for which we refer as deep fictitious play, a multi-stage learning process. Specifically at each stage, we propose the strategy of letting individual player optimize her own payoff subject to the other players’ previous actions, equivalent to solving N decoupled stochastic control optimization problems, which are approximated by DNNs. Therefore, the fictitious play strategy leads to a structure consisting of N DNNs, which only communicate at the end of each stage. The resulting deep learning algorithm based on fictitious play is scalable, parallel and model-free, i.e., using GPU parallelization, it can be applied to any N-player stochastic differential game with different symmetries and heterogeneities (e.g., existence of major players). We illustrate the performance of the deep learning algorithm by comparing to the closed-form solution of the linear quadratic game. Moreover, we prove the convergence of fictitious play under appropriate assumptions, and verify that the convergent limit forms an open-loop Nash equilibrium. We also discuss the extensions to other strategies designed upon fictitious play and closed-loopmore »