Consider a set of n players. We suppose that each game involves two players, that there is some unknown player who wins each game it plays with a probability greater than 1/2, and that our objective is to determine this best player. Under the requirement that the policy employed guarantees a correct choice with a probability of at least some specified value, we look for a policy that has a relatively small expected number of games played before decision. We consider this problem both under the assumption that the best player wins each game with a probability of at least some specified value >1/2, and under a Bayesian assumption that the probability that player i wins a game against player j is its value divided by the sum of the values, where the values are the unknown values of n independent and identically distributed exponential random variables. In the former case, we propose a policy where chosen pairs play a match that ends when one of them has had a specified number of wins more than the other; in the latter case, we propose a Thompson sampling type rule.
more »
« less
ModelFree Reinforcement Learning for Stochastic Parity Games
This paper investigates the use of modelfree reinforcement learning to compute the optimal value in twoplayer stochastic games with parity objectives. In this setting, two decision makers, player Min and player Max, compete on a finite game arena  a stochastic game graph with unknown but fixed probability distributions  to minimize and maximize, respectively, the probability of satisfying a parity objective. We give a reduction from stochastic parity games to a family of stochastic reachability games with a parameter ε, such that the value of a stochastic parity game equals the limit of the values of the corresponding simple stochastic games as the parameter ε tends to 0. Since this reduction does not require the knowledge of the probabilistic transition structure of the underlying game arena, modelfree reinforcement learning algorithms, such as minimax Qlearning, can be used to approximate the value and mutual bestresponse strategies for both players in the underlying stochastic parity game. We also present a streamlined reduction from 1 1/2player parity games to reachability games that avoids recourse to nondeterminism. Finally, we report on the experimental evaluations of both reductions
more »
« less
 Award ID(s):
 2009022
 NSFPAR ID:
 10237400
 Date Published:
 Journal Name:
 31st International Conference on Concurrency Theory (CONCUR 2020)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


null (Ed.)A twoplayer finite game is represented by two payoff matrices (A, B), one for each player. Imitation games are a subclass of twoplayer games in which B is the identity matrix, implying that the second player gets a positive payoff only if she "imitates" the first. Given that the problem of computing a Nash equilibrium (NE) is known to be provably hard, even to approximate, we ask if it is any easier for imitation games. We show that much like the general case, for any c > 0, computing a 1 over n^c approximate NE of imitation games remains PPADhard, where n is the number of moves available to the players. On the other hand, we design a polynomialtime algorithm to find εapproximate NE for any given constant ε > 0 (PTAS). The former result also rules out the smooth complexity being in P, unless PPAD ⊂ RP.more » « less

We propose Adversarially Trained Actor Critic (ATAC), a new modelfree algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a twoplayer Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds dataconsistent scenarios where the actor is inferior to the datacollection behavior policy. We prove that, when the actor attains no regret in the twoplayer game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms stateoftheart offline RL algorithms on a range of continuous control tasks.more » « less

Francisco Ruiz, Jennifer Dy (Ed.)We study the sample complexity of identifying an approximate equilibrium for twoplayer zerosum n × 2 matrix games. That is, in a sequence of repeated game plays, how many rounds must the two players play before reaching an approximate equilibrium (e.g., Nash)? We derive instancedependent bounds that define an ordering over game matrices that captures the intuition that the dynamics of some games converge faster than others. Specifically, we consider a stochastic observation model such that when the two players choose actions i and j, respectively, they both observe each other’s played actions and a stochastic observation Xij such that E [Xij ] = Aij . To our knowledge, our work is the first case of instancedependent lower bounds on the number of rounds the players must play before reaching an approximate equilibrium in the sense that the number of rounds depends on the specific properties of the game matrix A as well as the desired accuracy. We also prove a converse statement: there exist player strategies that achieve this lower bound.more » « less

null (Ed.)We obtain global, nonasymptotic convergence guarantees for independent learning algorithms in competitive reinforcement learning settings with two agents (i.e., zerosum stochastic games). We consider an episodic setting where in each episode, each player independently selects a policy and observes only their own actions and rewards, along with the state. We show that if both players run policy gradient methods in tandem, their policies will converge to a minmax equilibrium of the game, as long as their learning rates follow a twotimescale rule (which is necessary). To the best of our knowledge, this constitutes the first finitesample convergence result for independent policy gradient methods in competitive RL; prior work has largely focused on centralized, coordinated procedures for equilibrium computation.more » « less