This content will become publicly available on December 13, 2024
Award ID(s): 2211548
NSF-PAR ID: 10511436
Publisher / Repository: IEEE
Date Published:
Journal Name: Conference on Decision and Control
ISBN: 9798350301243
Page Range / eLocation ID: 2202 to 2207
Format(s): Medium: X
Location: Singapore, Singapore
Sponsoring Org: National Science Foundation

More Like this

We study the following problem, which to our knowledge has been addressed only partially in the literature and not in full generality. An agent observes two players play a zero-sum game that is known to the players but not to the agent. The agent observes the actions and state transitions of their game play, but not the rewards. The players may play either optimally (according to some Nash equilibrium) or according to any other solution concept, such as a quantal response equilibrium. Following these observations, the agent must recommend a policy for one player, say Player 1. The goal is to recommend a policy that is minimally exploitable under the true, but unknown, game. We take a Bayesian approach. We establish a likelihood function based on the observations and the specified solution concept. We then propose an approach based on Markov chain Monte Carlo (MCMC), which allows us to approximately sample games from the agent's posterior belief distribution. Once we have a batch of independent samples from the posterior, we use linear programming and backward induction to compute a policy for Player 1 that minimizes the sum of exploitabilities over these games. This approximates the policy that minimizes the expected exploitability under the full distribution. Our approach is also capable of handling counterfactuals, where known modifications are applied to the unknown game. We show that our Bayesian MCMC-based technique outperforms two other techniques, one based on the equilibrium policy of the maximum-probability game and the other based on imitation of observed behavior, on all the tested stochastic game environments.
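The core of the approach above is sampling candidate games from a posterior whose likelihood scores observed actions under the assumed solution concept. The following is a minimal sketch of that idea, not the paper's implementation: a Metropolis-Hastings sampler over the entries of an unknown 2x2 payoff matrix, with a deliberately simplified quantal-response likelihood (logit response against a uniform opponent rather than a full quantal response equilibrium). All function names and constants here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def qre_probs(payoff, lam=1.0):
    # Simplified stand-in for a quantal response equilibrium: Player 1's
    # logit (softmax) response assuming a uniform opponent.
    u = payoff.mean(axis=1)              # expected payoff of each row action
    z = np.exp(lam * (u - u.max()))      # shift for numerical stability
    return z / z.sum()

def log_likelihood(payoff, actions, lam=1.0):
    # Log-probability of the observed Player 1 actions under the model.
    p = qre_probs(payoff, lam)
    return sum(np.log(p[a]) for a in actions)

def mh_sample(actions, n_samples=200, step=0.3):
    """Metropolis-Hastings over the entries of an unknown 2x2 payoff matrix
    (flat prior, symmetric Gaussian proposal)."""
    payoff = rng.normal(size=(2, 2))
    ll = log_likelihood(payoff, actions)
    samples = []
    for _ in range(n_samples):
        prop = payoff + step * rng.normal(size=(2, 2))
        ll_prop = log_likelihood(prop, actions)
        # Accept with probability min(1, exp(ll_prop - ll)).
        if np.log(rng.random()) < ll_prop - ll:
            payoff, ll = prop, ll_prop
        samples.append(payoff.copy())
    return samples
```

A batch of such samples would then feed the exploitability-minimizing linear program described above; that step is omitted here.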

Learning optimal policies in real-world domains with delayed rewards is a major challenge in Reinforcement Learning. We address the credit assignment problem by proposing a Gaussian Process (GP)-based immediate reward approximation algorithm and evaluate its effectiveness in 4 contexts where rewards can be delayed over long trajectories. In one GridWorld game and 8 Atari games, where immediate rewards are available, our results showed that on 7 out of 9 games the proposed GP-inferred reward policy performed at least as well as the immediate reward policy and significantly outperformed the corresponding delayed reward policy. In e-learning and healthcare applications, we combined GP-inferred immediate rewards with offline Deep Q-Network (DQN) policy induction and showed that the GP-inferred reward policies outperformed the policies induced using delayed rewards in both real-world contexts.
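The mechanics of GP-based reward redistribution can be illustrated with a toy sketch (this is an assumption-laden simplification, not the paper's algorithm): spread a single delayed return evenly over the trajectory's states as noisy targets, then smooth those targets with a GP posterior mean over state features, so that similar states receive similar inferred immediate rewards.

```python
import numpy as np

def rbf_kernel(A, B, length=1.0):
    # Squared-exponential kernel between two sets of state features.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def gp_immediate_rewards(states, delayed_return, noise=1e-2):
    """Toy GP-based credit assignment: the per-step equal share of the
    delayed return is treated as a noisy target and replaced by the GP
    regression posterior mean at the same states."""
    X = np.asarray(states, dtype=float)
    y = np.full(len(X), delayed_return / len(X))  # naive equal split
    K = rbf_kernel(X, X)
    alpha = np.linalg.solve(K + noise * np.eye(len(X)), y)
    return K @ alpha  # posterior mean of inferred immediate rewards
```

The inferred per-step rewards approximately preserve the trajectory's total return while varying smoothly with the state features; they could then replace the delayed reward when training any standard RL learner.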

We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon, and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
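A CCE of a general-sum matrix game can indeed be computed efficiently; one standard route (not necessarily the one used in the paper, which may solve a linear program instead) is to run no-regret learners and average their joint play, since the empirical joint distribution of regret-matching dynamics approaches the set of coarse correlated equilibria. A hedged sketch:

```python
import numpy as np

def regret_matching_cce(A, B, iters=2000, seed=0):
    """Approximate a CCE of a general-sum bimatrix game (payoff matrices
    A for player 1, B for player 2) by averaging the joint play of two
    regret-matching learners."""
    rng = np.random.default_rng(seed)
    n, m = A.shape
    r1, r2 = np.zeros(n), np.zeros(m)   # cumulative regrets
    joint = np.zeros((n, m))            # empirical joint play counts
    for _ in range(iters):
        # Play proportionally to positive regret; uniform if none.
        p = np.maximum(r1, 0.0)
        p = p / p.sum() if p.sum() > 0 else np.full(n, 1.0 / n)
        q = np.maximum(r2, 0.0)
        q = q / q.sum() if q.sum() > 0 else np.full(m, 1.0 / m)
        i = rng.choice(n, p=p)
        j = rng.choice(m, p=q)
        joint[i, j] += 1
        # Regret of each action vs. the action actually played.
        r1 += A[:, j] - A[i, j]
        r2 += B[i, :] - B[i, j]
    return joint / joint.sum()
```

In the optimistic scheme described above, A and B would be replaced by the upper and lower confidence bounds of the value function; here they are arbitrary payoff matrices for illustration.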

Yllka Velaj and Ulrich Berger (Ed.)
This paper considers a two-player game where each player chooses a resource from a finite collection of options. Each resource brings a random reward. Both players have statistical information regarding the rewards of each resource. Additionally, there exists an information asymmetry where each player has knowledge of the reward realizations of different subsets of the resources. If both players choose the same resource, the reward is divided equally between them, whereas if they choose different resources, each player gains the full reward of their resource. We first implement the iterative best response algorithm to find an ϵ-approximate Nash equilibrium for this game. This method of finding a Nash equilibrium may not be desirable when players do not trust each other and place no assumptions on the incentives of the opponent. To handle this case, we solve the problem of maximizing the worst-case expected utility of the first player. The solution leads to counterintuitive insights in certain special cases. To solve the general version of the problem, we develop an efficient algorithmic solution that combines online convex optimization and the drift-plus-penalty technique.
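The sharing rule and the iterative best response loop can be sketched in a few lines. This is a simplified deterministic instance (known expected rewards, pure strategies, no information asymmetry), so it only illustrates the game's payoff structure and the alternating best-response dynamic, not the paper's full algorithm:

```python
import numpy as np

def best_response(rewards, opp_probs):
    # Expected utility of each resource: full reward if the opponent is
    # elsewhere, half the reward if both pick the same resource.
    return np.argmax(rewards * (1 - opp_probs) + 0.5 * rewards * opp_probs)

def iterative_best_response(rewards, iters=50):
    """Alternate pure-strategy best responses until choices stop changing."""
    rewards = np.asarray(rewards, dtype=float)
    a, b = 0, (1 if len(rewards) > 1 else 0)
    for _ in range(iters):
        new_a = best_response(rewards, np.eye(len(rewards))[b])
        new_b = best_response(rewards, np.eye(len(rewards))[new_a])
        if (new_a, new_b) == (a, b):
            break
        a, b = new_a, new_b
    return a, b
```

For rewards like [10, 6, 1], the loop settles on the two players splitting across the top two resources rather than halving the best one, which is the intuition behind the equilibrium in this game.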

We propose a deep neural network-based algorithm to identify the Markovian Nash equilibrium of general large 𝑁-player stochastic differential games. Following the idea of fictitious play, we recast the 𝑁-player game into 𝑁 decoupled decision problems (one for each player) and solve them iteratively. Each individual decision problem is characterized by a semilinear Hamilton-Jacobi-Bellman equation, which we solve with the recently developed deep BSDE method. The resulting algorithm can solve large 𝑁-player games for which conventional numerical methods would suffer from the curse of dimensionality. Multiple numerical examples involving identical or heterogeneous agents, with risk-neutral or risk-sensitive objectives, are tested to validate the accuracy of the proposed algorithm in large group games. Even for a fifty-player game with the presence of common noise, the proposed algorithm still finds the approximate Nash equilibrium accurately, which, to the best of our knowledge, is difficult to achieve with other numerical algorithms.
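The fictitious play decoupling described above — each player best-responds to the others' current average behavior — is the classical idea; the paper's contribution is solving each resulting decision problem with a deep BSDE solver, which is beyond a short sketch. A toy analogue on a two-player zero-sum matrix game shows the iterative structure (all names and the game instance are illustrative):

```python
import numpy as np

def fictitious_play(A, iters=2000):
    """Classical fictitious play: each player best-responds to the
    opponent's empirical mixed strategy, accumulated over rounds."""
    n, m = A.shape
    c1, c2 = np.zeros(n), np.zeros(m)   # action counts
    c1[0] += 1
    c2[0] += 1
    for _ in range(iters):
        # Best response of the row (maximizing) player, then the column
        # (minimizing) player, each against the opponent's empirical mix.
        c1[np.argmax(A @ (c2 / c2.sum()))] += 1
        c2[np.argmin((c1 / c1.sum()) @ A)] += 1
    return c1 / c1.sum(), c2 / c2.sum()
```

On matching pennies the empirical strategies drift toward the mixed equilibrium (1/2, 1/2), mirroring how the 𝑁 decoupled problems are iterated until the responses stabilize.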