challenging when the number of information states is large. Policy Space Response Oracles (PSRO) is a deep reinforcement learning algorithm grounded in game theory that is guaranteed to converge to an approximate Nash equilibrium. However, PSRO requires training a reinforcement learning policy at each iteration, making it too slow for large games. We show through counterexamples and experiments that DCH and Rectified PSRO, two existing approaches to scaling up PSRO, fail to converge even in small games. We introduce Pipeline PSRO (P2SRO), the first scalable PSRO-based method for finding approximate Nash equilibria in large zero-sum imperfect-information games. P2SRO is able more »
Solving the Rubik's Cube with Approximate Policy Iteration
Recently, Approximate Policy Iteration (API) algorithms have achieved superhuman
proficiency in two-player zero-sum games such as Go, Chess, and Shogi
without human data. These API algorithms iterate between two policies: a slow
policy (tree search), and a fast policy (a neural network). In these two-player games,
a reward is always received at the end of the game. However, the Rubik’s Cube has
only a single solved state, and episodes are not guaranteed to terminate. This poses
a major problem for these API algorithms since they rely on the reward received at
the end of the game. We introduce Autodidactic Iteration: an API algorithm that
overcomes the problem of sparse rewards by training on a distribution of states that
allows the reward to propagate from the goal state to states farther away. Autodidactic
Iteration is able to learn how to solve the Rubik’s Cube without relying on
human data. Our algorithm is able to solve 100% of randomly scrambled cubes
while achieving a median solve length of 30 moves, less than or equal to that of
solvers that employ human domain knowledge.
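The core of Autodidactic Iteration's training distribution can be sketched as follows: states are generated by scrambling backward from the solved state, so each training state lies a known number of moves from the goal and the terminal reward can propagate outward as scramble depth grows. This is a minimal toy sketch, using a hypothetical 8-element permutation puzzle and made-up move set in place of the actual Rubik's Cube environment; the names `apply_move` and `training_states` are illustrative, not from the paper.

```python
import random

# Toy stand-in for the cube: a permutation of 0..7; "solved" is sorted order.
# The move set below is hypothetical; the paper's environment is the 3x3x3
# Rubik's Cube with its quarter-turn moves.
SOLVED = tuple(range(8))
MOVES = [(0, 1, 2), (2, 3, 4), (4, 5, 6), (5, 6, 7)]

def apply_move(state, move):
    """Apply a 3-cycle of positions (a hypothetical move)."""
    s = list(state)
    a, b, c = move
    s[a], s[b], s[c] = s[c], s[a], s[b]
    return tuple(s)

def training_states(max_depth, samples_per_depth, rng=random.Random(0)):
    """Autodidactic Iteration's state distribution: scramble backward from
    the goal, so every training state has a known distance-to-goal. States
    one move from the goal see the terminal reward directly, letting the
    value estimate propagate to states farther away as depth increases."""
    batch = []
    for depth in range(1, max_depth + 1):
        for _ in range(samples_per_depth):
            state = SOLVED
            for _ in range(depth):
                state = apply_move(state, rng.choice(MOVES))
            batch.append((state, depth))  # depth serves as a curriculum label
    return batch

batch = training_states(max_depth=5, samples_per_depth=4)
```

In the paper, the network's value and policy targets for each sampled state are then built from a one-step lookahead, which is where the reward propagation happens; the sketch above only covers the state-sampling side.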
Award ID(s): 1839429
Publication Date:
NSFPAR ID: 10120458
Journal Name: International Conference on Learning Representations
Sponsoring Org: National Science Foundation
More Like this


We propose a deep neural network-based algorithm to identify the Markovian Nash equilibrium of general large 𝑁-player stochastic differential games. Following the idea of fictitious play, we recast the 𝑁-player game into 𝑁 decoupled decision problems (one for each player) and solve them iteratively. The individual decision problem is characterized by a semilinear Hamilton-Jacobi-Bellman equation, to solve which we employ the recently developed deep BSDE method. The resulting algorithm can solve large 𝑁-player games for which conventional numerical methods would suffer from the curse of dimensionality. Multiple numerical examples involving identical or heterogeneous agents, with risk-neutral or risk-sensitive objectives, are more »
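The fictitious-play decoupling described above can be illustrated in a much simpler setting than the paper's stochastic differential games: a finite zero-sum matrix game (rock-paper-scissors). At each iteration every player solves an individual decision problem, a best response to the empirical mixture of the opponent's past play, and the empirical mixtures converge to a Nash equilibrium. This toy sketch stands in for, and does not implement, the paper's deep BSDE solver.

```python
# Row player's payoff for rock-paper-scissors; the game is zero-sum.
A = [[0, -1, 1],
     [1, 0, -1],
     [-1, 1, 0]]

def best_response(mix):
    """Pure best response to an opponent mixture: one decoupled subproblem."""
    values = [sum(A[i][j] * mix[j] for j in range(3)) for i in range(3)]
    return max(range(3), key=values.__getitem__)

counts1, counts2 = [1, 1, 1], [1, 1, 1]  # smoothed empirical action counts
for _ in range(5000):
    mix1 = [c / sum(counts1) for c in counts1]
    mix2 = [c / sum(counts2) for c in counts2]
    counts1[best_response(mix2)] += 1
    # Player 2's payoff matrix is -A^T, which equals A for this
    # skew-symmetric game, so the same best-response routine applies.
    counts2[best_response(mix1)] += 1

eq1 = [c / sum(counts1) for c in counts1]  # approaches the uniform equilibrium
```

In the paper, each decoupled subproblem is a semilinear HJB equation solved by a neural network rather than a one-line argmax, but the outer iteration has this same shape.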

We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game. In this paper, we first provide new modeling results about computing such an optimal distribution by drawing a connection to a different more »

We focus on the problem of finding an optimal strategy for a team of players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In this setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game. In this paper, we first provide new modeling results about computing such an optimal distribution by drawing a connection to a different literature more »