Often (for example, in war games, strategy video games, and financial simulations) the game is given to us only as a black-box simulator in which we can play it. In these settings, since the game may have unknown nature action distributions (from which we can only obtain samples) and/or be too large to expand fully, it can be difficult to compute strategies with guarantees on exploitability. Recent work (Zhang and Sandholm 2020) introduced a notion of certificate for extensive-form games that allows exploitability guarantees without expanding the full game tree. However, that work assumed that the black box can sample or expand arbitrary nodes of the game tree at any time, and that a series of exact game solves (via, for example, linear programming) can be conducted to compute the certificate. Each of those two assumptions severely restricts the practical applicability of that method. In this work, we relax both assumptions. We show that high-probability certificates can be obtained with a black box that can do nothing more than play through games, using only a regret minimizer as a subroutine. As a bonus, we obtain an equilibrium-finding algorithm with Õ(1/√T) convergence rate in the extensive-form…
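The abstract above uses a regret minimizer as its only subroutine. As a minimal, hedged sketch of that kind of subroutine (not the paper's actual algorithm), here is regret matching in self-play on a small zero-sum matrix game; the time-averaged strategies approach a Nash equilibrium, with the duality gap shrinking at the O(1/√T) rate the abstract mentions:

```python
import numpy as np

def regret_matching_selfplay(payoff, iters=20000):
    """Self-play with regret matching on a zero-sum matrix game (a sketch).

    payoff[i, j] is the row player's payoff. The time-averaged strategies
    approach a Nash equilibrium, with the duality gap shrinking at the
    O(1/sqrt(T)) rate typical of regret-minimizer subroutines.
    """
    n, m = payoff.shape
    reg_x, reg_y = np.zeros(n), np.zeros(m)
    avg_x, avg_y = np.zeros(n), np.zeros(m)
    for _ in range(iters):
        # Regret matching: play proportionally to positive cumulative regret.
        x = np.maximum(reg_x, 0.0)
        x = x / x.sum() if x.sum() > 0 else np.full(n, 1.0 / n)
        y = np.maximum(reg_y, 0.0)
        y = y / y.sum() if y.sum() > 0 else np.full(m, 1.0 / m)
        u_x = payoff @ y            # row player's utility per pure action
        u_y = -(x @ payoff)         # column player's (zero-sum) utility
        reg_x += u_x - x @ u_x
        reg_y += u_y - y @ u_y
        avg_x += x
        avg_y += y
    return avg_x / iters, avg_y / iters

# A 2x2 zero-sum game whose unique equilibrium is x = y = (1/3, 2/3).
A = np.array([[3.0, -1.0], [-1.0, 1.0]])
x_bar, y_bar = regret_matching_selfplay(A)
# Duality gap of the average strategies; it upper-bounds exploitability.
gap = (A @ y_bar).max() - (x_bar @ A).min()
```

The point of the sketch is that nothing beyond repeated play and regret bookkeeping is needed: no node-level oracle access and no exact game solves.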
Efficient exploration of zero-sum stochastic games
We investigate the increasingly important and common game-solving setting where we do not have an explicit description of the game but only oracle access to it through gameplay, such as in financial or military simulations and computer games. During a limited-duration learning phase, the algorithm can control the actions of both players in order to try to learn the game and how to play it well. After that, the algorithm has to produce a strategy that has low exploitability. Our motivation is to quickly learn strategies that have low exploitability in situations where evaluating the payoffs of a queried strategy profile is costly. For the stochastic game setting, we propose using the distribution of state-action value functions induced by a belief distribution over possible environments. We compare the performance of various exploration strategies for this task, including generalizations of Thompson sampling and Bayes-UCB to this new setting. These two consistently outperform other strategies.
 Award ID(s):
 1901403
 Publication Date:
 NSFPAR ID:
 10288491
 Journal Name:
 ArXiv.org
 Volume:
 arXiv:2002.10524v1
 ISSN:
 2331-8422
 Sponsoring Org:
 National Science Foundation
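The abstract above generalizes Thompson sampling from bandits to belief distributions over stochastic-game environments. As a hedged one-shot analogue (the classic bandit case, not the paper's game-setting algorithm), here is Thompson sampling on a Bernoulli bandit, where the "belief distribution over possible environments" is just a Beta posterior per arm:

```python
import random

def thompson_bernoulli(true_means, horizon=5000, seed=1):
    """Thompson sampling on a Bernoulli bandit (a one-shot sketch).

    Keeps a Beta(successes + 1, failures + 1) posterior per arm; each
    round it samples a plausible mean from every posterior and plays the
    argmax. Exploration falls out of posterior uncertainty: arms we know
    little about still occasionally produce the largest sample.
    """
    rng = random.Random(seed)
    k = len(true_means)
    wins, losses, pulls = [0] * k, [0] * k, [0] * k
    for _ in range(horizon):
        samples = [rng.betavariate(wins[i] + 1, losses[i] + 1) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = 1 if rng.random() < true_means[arm] else 0
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return pulls

# Three arms with success probabilities 0.2, 0.5, and 0.8.
pulls = thompson_bernoulli([0.2, 0.5, 0.8])
```

Play concentrates on the best arm as its posterior sharpens, which is the behavior the abstract's generalization carries over to distributions of state-action value functions.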
More Like this


Tree-form sequential decision making (TFSDM) extends classical one-shot decision making by modeling tree-form interactions between an agent and a potentially adversarial environment. It captures the online decision-making problems that each player faces in an extensive-form game, as well as Markov decision processes and partially observable Markov decision processes where the agent conditions on observed history. Over the past decade, there has been considerable effort devoted to designing online optimization methods for TFSDM. Virtually all of that work has been in the full-feedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where that assumption is reversed (no counterfactual information is available), despite this latter setting being well understood for almost 20 years in one-shot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) linear-time iterations (in the size of the decision tree) and (ii) O(√T) cumulative regret in expectation compared to any fixed strategy, at all times T. This is made possible by new results that we derive, which may have independent uses as…
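To illustrate the bandit-feedback contrast the abstract draws, here is a hedged sketch of EXP3, the standard algorithm for the well-understood one-shot case (this is not the paper's TFSDM algorithm): only the played action's reward is observed, and an importance-weighted estimate keeps expected regret O(√T):

```python
import math
import random

def exp3(reward_fn, n_actions, horizon, gamma=0.1, seed=0):
    """EXP3 for adversarial multi-armed bandits (a sketch).

    Unlike the full-feedback setting, no counterfactual information is
    available: only reward_fn(t, a) for the played action a is revealed.
    The importance-weighted estimate reward / probs[a] is unbiased for
    every action's reward, which is what makes O(sqrt(T)) regret possible.
    """
    rng = random.Random(seed)
    weights = [1.0] * n_actions
    total = 0.0
    for t in range(horizon):
        wsum = sum(weights)
        # Mix the exponential-weights strategy with uniform exploration.
        probs = [(1 - gamma) * w / wsum + gamma / n_actions for w in weights]
        a = rng.choices(range(n_actions), weights=probs, k=1)[0]
        reward = reward_fn(t, a)          # assumed to lie in [0, 1]
        total += reward
        est = reward / probs[a]           # unbiased importance-weighted estimate
        weights[a] *= math.exp(gamma * est / n_actions)
        # Rescale to avoid floating-point overflow over long horizons.
        top = max(weights)
        weights = [w / top for w in weights]
    return total

# Toy environment: action 1 pays 0.9 every round, the others pay 0.1.
avg = exp3(lambda t, a: 0.9 if a == 1 else 0.1, n_actions=3, horizon=10000) / 10000
```

The paper's contribution, per the abstract, is extending this kind of guarantee from a single decision node to an entire decision tree with linear-time iterations.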


We focus on the problem of finding an optimal strategy for a team of two players that faces an opponent in an imperfect-information zero-sum extensive-form game. Team members are not allowed to communicate during play but can coordinate before the game. In that setting, it is known that the best the team can do is sample a profile of potentially randomized strategies (one per player) from a joint (a.k.a. correlated) probability distribution at the beginning of the game. In this paper, we first provide new modeling results about computing such an optimal distribution by drawing a connection to a different literature on extensive-form correlation. Second, we provide an algorithm that computes such an optimal distribution by only using profiles where only one of the team members gets to randomize in each profile. We can also cap the number of such profiles we allow in the solution. This begets an anytime algorithm by increasing the cap. We find that often a handful of well-chosen such profiles suffices to reach optimal utility for the team. This enables team members to reach coordination through a relatively simple and understandable plan. Finally, inspired by this observation and leveraging theoretical concepts that we introduce, we…
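The ex-ante coordination mechanism described above can be pictured with a small sketch. The profile names and weights here are invented placeholders, not values from the paper; the point is only the single shared draw before play:

```python
import random

# Hypothetical pre-agreed correlated distribution: each entry pairs one
# (possibly randomized) strategy per team member; weights are made up.
profiles = [
    ("p1_aggressive", "p2_passive"),
    ("p1_passive", "p2_aggressive"),
    ("p1_balanced", "p2_balanced"),
]
weights = [0.5, 0.3, 0.2]

def sample_joint_profile(seed):
    """Draw ONE profile before play; both members then follow their half.

    Correlation comes entirely from this single shared draw, so no
    in-game communication is needed.
    """
    rng = random.Random(seed)
    return rng.choices(profiles, weights=weights, k=1)[0]

p1_strategy, p2_strategy = sample_joint_profile(seed=42)
```

Capping the number of profiles, as the abstract describes, corresponds to keeping this list short, which is what makes the resulting plan simple for the team to follow.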
