In this paper, zero-sum mean-field type games (ZSMFTG) with linear dynamics and quadratic cost are studied under an infinite-horizon discounted utility function. ZSMFTG are a class of games in which two decision makers, whose utilities sum to zero, compete to influence a large population of indistinguishable agents. In particular, the case in which the transition and utility functions depend on the state, the actions of the controllers, and the means of the state and of the actions is investigated. The optimality conditions of the game are analyzed for both open-loop and closed-loop controls, and explicit expressions for the Nash equilibrium strategies are derived. Moreover, two policy-optimization methods that rely on policy gradient are proposed, for both model-based and sample-based frameworks. In the model-based case, the gradients are computed exactly using the model, whereas in the sample-based case they are estimated using Monte Carlo simulations. Numerical experiments demonstrate the convergence of the utility function as well as of the two players' controls.
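The sample-based framework described above estimates policy gradients from simulations rather than from the model. A minimal sketch of that idea, on a hypothetical scalar linear-quadratic problem with a single linear-feedback controller (all dynamics and cost constants below are illustrative, not taken from the paper), is a zeroth-order Monte Carlo gradient estimate:

```python
import numpy as np

# Hypothetical 1-D linear dynamics x' = a*x + b*u and quadratic cost q*x^2 + r*u^2;
# the controller is a linear feedback u = -k*x. All constants are illustrative.
a, b, q, r, gamma, horizon = 0.9, 0.5, 1.0, 0.1, 0.95, 50

def discounted_cost(k, x0=1.0):
    """Roll out the closed-loop system and accumulate discounted quadratic cost."""
    x, total = x0, 0.0
    for t in range(horizon):
        u = -k * x
        total += (gamma ** t) * (q * x * x + r * u * u)
        x = a * x + b * u
    return total

def mc_gradient(k, sigma=0.05, n_samples=200, seed=0):
    """Zeroth-order Monte Carlo estimate of d(cost)/dk via random perturbations."""
    rng = np.random.default_rng(seed)
    grads = []
    for _ in range(n_samples):
        eps = rng.normal()
        # symmetric smoothing estimator: eps * (f(k + s*eps) - f(k - s*eps)) / (2s)
        grads.append((discounted_cost(k + sigma * eps) -
                      discounted_cost(k - sigma * eps)) * eps / (2 * sigma))
    return float(np.mean(grads))

# One gradient-descent step on the feedback gain k
k = 0.2
k -= 1e-3 * mc_gradient(k)
```

In the model-based variant, `mc_gradient` would be replaced by the exact gradient computed from the known dynamics.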
Award ID(s):
 1716673
 NSF-PAR ID:
 10299691
 Date Published:
 Journal Name:
 Journal of Dynamics & Games
 Volume:
 8
 Issue:
 4
 ISSN:
 2164-6066
 Page Range / eLocation ID:
 403
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Nonzero-sum games typically have multiple Nash equilibria (or no equilibrium), and unlike the zero-sum case, they may have different values at different equilibria. Instead of focusing on the existence of individual equilibria, we study the set of values over all equilibria, which we call the set value of the game. The set value is unique by nature and always exists (with possible value [Formula: see text]). Similar to the standard value function in the control literature, it enjoys many nice properties, such as regularity, stability, and, more importantly, the dynamic programming principle. Two main features are required in order to obtain the dynamic programming principle: (i) we must use closed-loop controls (instead of open-loop controls); and (ii) we must allow for path-dependent controls, even if the problem is in a state-dependent (Markovian) setting. We consider both discrete- and continuous-time models with finite time horizon. For the latter, we also provide a duality approach through certain standard PDEs (or path-dependent PDEs), which is quite efficient for numerically computing the set value of the game.
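The phenomenon motivating the set value (distinct equilibria carrying distinct value pairs) already appears in the simplest static games. A small illustration, not taken from the paper, enumerates the pure Nash equilibria of a 2x2 "Battle of the Sexes" bimatrix game and collects their values:

```python
# Illustrative 2x2 nonzero-sum game ("Battle of the Sexes"): two pure Nash
# equilibria with different value pairs, so the "set value" is not a singleton.
A = [[3, 0], [0, 2]]   # row player's payoffs
B = [[2, 0], [0, 3]]   # column player's payoffs

def pure_nash(A, B):
    """Return all pure-strategy Nash equilibria with their payoff pairs."""
    eqs = []
    for i in range(2):
        for j in range(2):
            row_best = all(A[i][j] >= A[k][j] for k in range(2))
            col_best = all(B[i][j] >= B[i][k] for k in range(2))
            if row_best and col_best:
                eqs.append(((i, j), (A[i][j], B[i][j])))
    return eqs

# Two pure equilibria, with distinct value pairs (3, 2) and (2, 3)
print(pure_nash(A, B))
```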

We study the following problem, which to our knowledge has been addressed only partially in the literature and not in full generality. An agent observes two players play a zero-sum game that is known to the players but not to the agent. The agent observes the actions and state transitions of their game play, but not the rewards. The players may play either optimally (according to some Nash equilibrium) or according to any other solution concept, such as a quantal response equilibrium. Following these observations, the agent must recommend a policy for one player, say Player 1. The goal is to recommend a policy that is minimally exploitable under the true, but unknown, game. We take a Bayesian approach. We establish a likelihood function based on the observations and the specified solution concept. We then propose an approach based on Markov chain Monte Carlo (MCMC), which allows us to approximately sample games from the agent's posterior belief distribution. Once we have a batch of independent samples from the posterior, we use linear programming and backward induction to compute a policy for Player 1 that minimizes the sum of exploitabilities over these games. This approximates the policy that minimizes the expected exploitability under the full distribution. Our approach is also capable of handling counterfactuals, in which known modifications are applied to the unknown game. We show that our Bayesian MCMC-based technique outperforms two other techniques (one based on the equilibrium policy of the maximum-probability game and the other based on imitation of observed behavior) on all the tested stochastic game environments.
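The core inference step described above (a likelihood built from a solution concept, sampled via MCMC) can be sketched on a toy problem. In this hedged example, everything is an assumption for illustration: a one-shot 2-action game whose unknown payoff parameter `theta` is inferred with Metropolis-Hastings from actions of a player following a quantal (softmax) response:

```python
import math, random

# Toy sketch, not the paper's model: infer an unknown payoff parameter
# theta in [0, 1] from observed actions of a quantal-response player.
random.seed(0)

def action_probs(theta, beta=2.0):
    """Softmax (quantal) response over illustrative payoffs [theta, 1 - theta]."""
    e0, e1 = math.exp(beta * theta), math.exp(beta * (1 - theta))
    z = e0 + e1
    return e0 / z, e1 / z

def log_likelihood(theta, observations):
    p0, p1 = action_probs(theta)
    return sum(math.log(p0 if a == 0 else p1) for a in observations)

def mh_samples(observations, n=2000, step=0.1):
    """Metropolis-Hastings with a flat prior on [0, 1]."""
    theta, samples = 0.5, []
    ll = log_likelihood(theta, observations)
    for _ in range(n):
        prop = min(1.0, max(0.0, theta + random.gauss(0, step)))
        ll_prop = log_likelihood(prop, observations)
        if math.log(random.random()) < ll_prop - ll:
            theta, ll = prop, ll_prop
        samples.append(theta)
    return samples

obs = [0] * 70 + [1] * 30                         # mostly action 0 observed
post = mh_samples(obs)
mean_theta = sum(post[500:]) / len(post[500:])    # posterior mean after burn-in
```

In the paper's setting, each posterior sample is a full stochastic game, and a policy is then computed against the sampled batch; here the sample is just a scalar to keep the sketch short.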

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has received comparatively little investigation. In this paper, we aim to address this fundamental open question. We study arguably the most basic MARL setting: two-player discounted zero-sum Markov games, given only access to a generative model of state transitions. We show that model-based MARL achieves a near-optimal sample complexity for finding the Nash equilibrium (NE) value up to some additive error. We also show that this method is near-minimax optimal, with a tight dependence on the horizon and the number of states. Our results justify the efficiency of this simple model-based approach in the multi-agent RL setting.
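The first step of the model-based approach described above is to build an empirical transition model by querying the generative model. A hedged sketch on a tiny hypothetical 2-state, 2x2-action Markov game (the transition probabilities below are made up for illustration):

```python
import random

random.seed(1)

# Hypothetical generative model of a 2-state Markov game: given a state and a
# joint action, it samples the next state. Probabilities are illustrative.
TRUE_P = {  # (state, a1, a2) -> probability of landing in state 0
    (0, 0, 0): 0.9, (0, 0, 1): 0.4, (0, 1, 0): 0.6, (0, 1, 1): 0.1,
    (1, 0, 0): 0.7, (1, 0, 1): 0.2, (1, 1, 0): 0.5, (1, 1, 1): 0.3,
}

def generative_model(s, a1, a2):
    return 0 if random.random() < TRUE_P[(s, a1, a2)] else 1

def empirical_model(n_per_pair):
    """Query the simulator n times per (state, joint-action) pair and count."""
    p_hat = {}
    for key in TRUE_P:
        hits = sum(generative_model(*key) == 0 for _ in range(n_per_pair))
        p_hat[key] = hits / n_per_pair
    return p_hat

p_hat = empirical_model(5000)
max_err = max(abs(p_hat[k] - TRUE_P[k]) for k in TRUE_P)
```

Planning (e.g., value iteration over matrix games) is then run on `p_hat` as if it were the true model; the paper's contribution is bounding how many queries per pair suffice for the resulting NE value to be accurate.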

We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games in which the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon, and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds on the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
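The efficiency of the CCE step comes from the fact that a coarse correlated equilibrium of a general-sum matrix game is the solution of a linear feasibility problem over joint action distributions. A hedged sketch for a 2x2 game (payoffs illustrative, not from the paper), maximizing total payoff among CCEs via `scipy.optimize.linprog`:

```python
import numpy as np
from scipy.optimize import linprog

# Illustrative 2x2 general-sum matrix game.
U1 = np.array([[3, 0], [0, 2]], dtype=float)  # player 1's payoffs
U2 = np.array([[2, 0], [0, 3]], dtype=float)  # player 2's payoffs

# Decision variable: joint distribution mu over the 4 action profiles,
# flattened row-major so mu[(a1, a2)] = x[2*a1 + a2].
# CCE constraints: no player gains in expectation by unilaterally switching
# to a fixed action before the joint profile is drawn.
A_ub, b_ub = [], []
for dev in range(2):  # player 1 deviating to action `dev`
    A_ub.append([U1[dev, a2] - U1[a1, a2] for a1 in range(2) for a2 in range(2)])
    b_ub.append(0.0)
for dev in range(2):  # player 2 deviating to action `dev`
    A_ub.append([U2[a1, dev] - U2[a1, a2] for a1 in range(2) for a2 in range(2)])
    b_ub.append(0.0)

res = linprog(
    c=-(U1 + U2).flatten(),            # maximize total payoff among CCEs
    A_ub=A_ub, b_ub=b_ub,
    A_eq=[[1.0] * 4], b_eq=[1.0],      # mu is a probability distribution
    bounds=[(0, 1)] * 4, method="highs",
)
mu = res.x.reshape(2, 2)
```

Unlike a Nash equilibrium of a general-sum game, this linear program is solvable in polynomial time, which is what makes the optimistic scheme computationally tractable.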

This paper is concerned with two-person mean-field linear-quadratic nonzero-sum stochastic differential games in an infinite horizon. Both open-loop and closed-loop Nash equilibria are introduced. The existence of an open-loop Nash equilibrium is characterized by the solvability of a system of mean-field forward-backward stochastic differential equations in an infinite horizon and the convexity of the cost functionals, and the closed-loop representation of an open-loop Nash equilibrium is given through the solution to a system of two coupled nonsymmetric algebraic Riccati equations. The existence of a closed-loop Nash equilibrium is characterized by the solvability of a system of two coupled symmetric algebraic Riccati equations. Two-person mean-field linear-quadratic zero-sum stochastic differential games in an infinite horizon are also considered. The existence of both open-loop and closed-loop saddle points is characterized by the solvability of a system of two coupled generalized algebraic Riccati equations with static stabilizing solutions. Mean-field linear-quadratic stochastic optimal control problems in an infinite horizon are discussed as well, for which it is proved that open-loop solvability and closed-loop solvability are equivalent.
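The building block behind the characterizations above is the algebraic Riccati equation. As a hedged numerical illustration (a single standard continuous-time ARE for a made-up 2x2 system, not the paper's coupled mean-field system), the stabilizing solution can be computed with `scipy.linalg.solve_continuous_are`:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

# Illustrative system: solve A'P + PA - P B R^{-1} B' P + Q = 0 for the
# stabilizing solution P. All matrices below are made up for the example.
A = np.array([[0.0, 1.0], [0.0, -0.5]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)

# Sanity checks: the Riccati residual should vanish, and the closed-loop
# matrix A - B R^{-1} B' P should be stable (eigenvalues in the left half-plane),
# mirroring the "static stabilizing solutions" required in the paper.
residual = A.T @ P + P @ A - P @ B @ np.linalg.inv(R) @ B.T @ P + Q
K = np.linalg.inv(R) @ B.T @ P
closed_loop_eigs = np.linalg.eigvals(A - B @ K)
```

The coupled (and mean-field) systems studied in the paper do not reduce to a single ARE like this; this only shows the kind of object whose solvability the results hinge on.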