MonteCarlo planning, as exemplified by MonteCarlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider MonteCarlo planning in an environment with continuous stateaction spaces, a much less understood problem with important applications in control and robotics. We introduce POLYHOOT , an algorithm that augments MCTS with a continuous armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in nonstationary bandit problems. Using this result as a building block, we establish nonasymptotic convergence guarantees for POLYHOOT : the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings.
This content will become publicly available on May 1, 2023
Nonasymptotic Analysis of Monte Carlo Tree Search
In this work, we consider the popular treebased search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinitehorizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior works, uses “logarithmic” bonus term for balancing exploration and exploitation within the treebased search, following the insights from stochastic multiarm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has a claimed property. Interestingly enough, empirically successful approaches use a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue more »
 Award ID(s):
 1955997
 Publication Date:
 NSFPAR ID:
 10381247
 Journal Name:
 Operations Research
 ISSN:
 0030364X
 Sponsoring Org:
 National Science Foundation
More Like this


We consider the following general network design problem. The input is an asymmetric metric (V, c), root [Formula: see text], monotone submodular function [Formula: see text], and budget B. The goal is to find an rrooted arborescence T of cost at most B that maximizes f(T). Our main result is a simple quasipolynomial time [Formula: see text]approximation algorithm for this problem, in which [Formula: see text] is the number of vertices in an optimal solution. As a consequence, we obtain an [Formula: see text]approximation algorithm for directed (polymatroid) Steiner tree in quasipolynomial time. We also extend our main result to a setting with additional length bounds at vertices, which leads to improved [Formula: see text]approximation algorithms for the singlesource buyatbulk and priority Steiner tree problems. For the usual directed Steiner tree problem, our result matches the best previous approximation ratio but improves significantly on the running time. For polymatroid Steiner tree and singlesource buyatbulk, our result improves prior approximation ratios by a logarithmic factor. For directed priority Steiner tree, our result seems to be the first nontrivial approximation ratio. Under certain complexity assumptions, our approximation ratios are the best possible (up to constant factors).

We develop provably efficient reinforcement learning algorithms for twoplayer zerosum finitehorizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problems are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the leastsquares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turnbased Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solvingmore »

We study the dynamic assortment planning problem, where for each arriving customer, the seller offers an assortment of substitutable products and the customer makes the purchase among offered products according to an uncapacitated multinomial logit (MNL) model. Because all the utility parameters of the MNL model are unknown, the seller needs to simultaneously learn customers’ choice behavior and make dynamic decisions on assortments based on the current knowledge. The goal of the seller is to maximize the expected revenue, or, equivalently, to minimize the expected regret. Although dynamic assortment planning problem has received an increasing attention in revenue management, most existing policies require the estimation of mean utility for each product and the final regret usually involves the number of products [Formula: see text]. The optimal regret of the dynamic assortment planning problem under the most basic and popular choice model—the MNL model—is still open. By carefully analyzing a revenue potential function, we develop a trisectionbased policy combined with adaptive confidence bound construction, which achieves an itemindependent regret bound of [Formula: see text], where [Formula: see text] is the length of selling horizon. We further establish the matching lower bound result to show the optimality of our policy. There aremore »

We consider the periodic review dynamic pricing and inventory control problem with fixed ordering cost. Demand is random and price dependent, and unsatisfied demand is backlogged. With complete demand information, the celebrated [Formula: see text] policy is proved to be optimal, where s and S are the reorder point and orderupto level for ordering strategy, and [Formula: see text], a function of onhand inventory level, characterizes the pricing strategy. In this paper, we consider incomplete demand information and develop online learning algorithms whose average profit approaches that of the optimal [Formula: see text] with a tight [Formula: see text] regret rate. A number of salient features differentiate our work from the existing online learning researches in the operations management (OM) literature. First, computing the optimal [Formula: see text] policy requires solving a dynamic programming (DP) over multiple periods involving unknown quantities, which is different from the majority of learning problems in OM that only require solving singleperiod optimization questions. It is hence challenging to establish stability results through DP recursions, which we accomplish by proving uniform convergence of the profittogo function. The necessity of analyzing actiondependent state transition over multiple periods resembles the reinforcement learning question, considerably more difficult thanmore »