Treeform sequential decision making (TFSDM) extends classical oneshot decision making by modeling treeform interactions between an agent and a potentially adversarial environment. It captures the online decisionmaking problems that each player faces in an extensiveform game, as well as Markov decision processes and partiallyobservable Markov decision processes where the agent conditions on observed history. Over the past decade, there has been considerable effort into designing online optimization methods for TFSDM. Virtually all of that work has been in the fullfeedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where that assumption is reversed (no counterfactual information is available), despite this latter setting being well understood for almost 20 years in oneshot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) lineartime iterations (in the size of the decision tree) and (ii) O(T−−√) cumulative regret in expectation compared to any fixed strategy, at all times T. This is made possible by new results that we derive, which may have independent uses asmore »
Bandit Linear Optimization for Sequential Decision Making and ExtensiveForm Games
Treeform sequential decision making (TFSDM) extends classical oneshot decision making by modeling treeform interactions between an agent and a potentially adversarial environment. It captures the online decisionmaking problems that each player faces in an extensiveform game, as well as Markov decision processes and partiallyobservable Markov decision processes where the agent conditions on observed history. Over the past decade, there has been considerable effort into designing online optimization methods for TFSDM. Virtually all of that work has been in the fullfeedback setting, where the agent has access to counterfactuals, that is, information on what would have happened had the agent chosen a different action at any decision node. Little is known about the bandit setting, where that assumption is reversed (no counterfactual information is available), despite this latter setting being well understood for almost 20 years in oneshot decision making. In this paper, we give the first algorithm for the bandit linear optimization problem for TFSDM that offers both (i) lineartime iterations (in the size of the decision tree) and (ii) O(T−−√) cumulative regret in expectation compared to any fixed strategy, at all times T. This is made possible by new results that we derive, which may have independent uses as more »
 Award ID(s):
 1733556
 Publication Date:
 NSFPAR ID:
 10296645
 Journal Name:
 AAAI Annual conference
 Sponsoring Org:
 National Science Foundation
More Like this


Regret minimization has proved to be a versatile tool for tree form sequential decision making and extensiveform games. In large twoplayer zerosum imperfectinformation games, mod ern extensions of counterfactual regret minimization (CFR) are currently the practical state of the art for computing a Nash equilibrium. Most regretminimization algorithms for treeform sequential decision making, including CFR, require (i) an exact model of the player’s decision nodes, observation nodes, and how they are linked, and (ii) full knowledge, at all times t, about the payoffs—even in parts of the decision space that are not encountered at time t. Recently, there has been growing interest towards relaxing some of those restric tions and making regret minimization applicable to settings for which reinforcement learning methods have traditionally been used—for example, those in which only blackbox access to the environment is available. We give the first, to our knowl edge, regretminimization algorithm that guarantees sublinear regret with high probability even when requirement (i)—and thus also (ii)—is dropped. We formalize an online learning setting in which the strategy space is not known to the agent and gets revealed incrementally whenever the agent encoun ters new decision points. We give an efficient algorithm that achieves O(T 3/4)more »

In this paper, we study a class of online optimization problems with longterm budget constraints where the objective functions are not necessarily concave (nor convex), but they instead satisfy the Diminishing Returns (DR) property. In this online setting, a sequence of monotone DRsubmodular objective functions and linear budget functions arrive over time and assuming a limited total budget, the goal is to take actions at each time, before observing the utility and budget function arriving at that round, to achieve sublinear regret bound while the total budget violation is sublinear as well. Prior work has shown that achieving sublinear regret and total budget violation simultaneously is impossible if the utility and budget functions are chosen adversarially. Therefore, we modify the notion of regret by comparing the agent against the best fixed decision in hindsight which satisfies the budget constraint proportionally over any window of length W. We propose the Online Saddle Point Hybrid Gradient (OSPHG) algorithm to solve this class of online problems. For W = T, we recover the aforementioned impossibility result. However, if W is sublinear in T, we show that it is possible to obtain sublinear bounds for both the regret and the total budget violation.

We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in T slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have a tight O( T ) regret and constraint viola tions simultaneously. To obtain such a bound, we combine several new ingredients including ergodicity and mixing time bound in weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas’ Lemma.

We develop provably efficient reinforcement learning algorithms for twoplayer zerosum finitehorizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the leastsquares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turnbased Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solvingmore »