skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Friday, July 12 until 2:00 AM ET on Saturday, July 13 due to maintenance. We apologize for the inconvenience.


Title: Nonasymptotic Analysis of Monte Carlo Tree Search
In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior works, uses “logarithmic” bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multiarm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has a claimed property. Interestingly enough, empirically successful approaches use a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a “policy improvement” operator; that is, it iteratively improves value function approximation for all states because of combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε approximation of the value function with respect to [Formula: see text] norm, MCTS combined with nearest neighbor requires a sample size scaling as [Formula: see text], where d is the dimension of the state space. This is nearly optimal because of a minimax lower bound of [Formula: see text], suggesting the strength of the variant of MCTS we propose here and our resulting analysis.  more » « less
Award ID(s):
1955997
NSF-PAR ID:
10381247
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Operations Research
ISSN:
0030-364X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT , an algorithm that augments MCTS with a continuous armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for POLY-HOOT : the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings. 
    more » « less
  2. We consider the following general network design problem. The input is an asymmetric metric (V, c), root [Formula: see text], monotone submodular function [Formula: see text], and budget B. The goal is to find an r-rooted arborescence T of cost at most B that maximizes f(T). Our main result is a simple quasi-polynomial time [Formula: see text]-approximation algorithm for this problem, in which [Formula: see text] is the number of vertices in an optimal solution. As a consequence, we obtain an [Formula: see text]-approximation algorithm for directed (polymatroid) Steiner tree in quasi-polynomial time. We also extend our main result to a setting with additional length bounds at vertices, which leads to improved [Formula: see text]-approximation algorithms for the single-source buy-at-bulk and priority Steiner tree problems. For the usual directed Steiner tree problem, our result matches the best previous approximation ratio but improves significantly on the running time. For polymatroid Steiner tree and single-source buy-at-bulk, our result improves prior approximation ratios by a logarithmic factor. For directed priority Steiner tree, our result seems to be the first nontrivial approximation ratio. Under certain complexity assumptions, our approximation ratios are the best possible (up to constant factors). 
    more » « less
  3. In this study, we simulated the algorithmic performance of a small neutral atom quantum computer and compared its performance when operating with all-to-all versus nearest-neighbor connectivity. This comparison was made using a suite of algorithmic benchmarks developed by the Quantum Economic Development Consortium. Circuits were simulated with a noise model consistent with experimental data from [Nature 604, 457 (2022)]. We find that all-to-all connectivity improves simulated circuit fidelity by [Formula: see text]–[Formula: see text], compared to nearest-neighbor connectivity.

     
    more » « less
  4. A braided fusion category is said to have Property F if the associated braid group representations factor through a finite group. We verify integral metaplectic modular categories have property F by showing these categories are group-theoretical. For the special case of integral categories [Formula: see text] with the fusion rules of [Formula: see text] we determine the finite group [Formula: see text] for which [Formula: see text] is braided equivalent to [Formula: see text]. In addition, we determine the associated classical link invariant, an evaluation of the 2-variable Kauffman polynomial at a point. 
    more » « less
  5. Let [Formula: see text] be a directed graph associated with a weight [Formula: see text]. For an edge-cut [Formula: see text] of [Formula: see text], the average weight of [Formula: see text] is denoted and defined as [Formula: see text]. An optimal edge-cut with average weight is an edge-cut [Formula: see text] such that [Formula: see text] is maximum among all edge-cuts (or minimum, symmetrically). In this paper, a polynomial algorithm for this problem is proposed for finding an optimal edge-cut in a rooted tree separating the root and the set of all leafs. This algorithm enables us to develop an automatic clustering method with more accurate detection of community output. 
    more » « less