In this work, we consider the popular treebased search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinitehorizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior works, uses “logarithmic” bonus term for balancing exploration and exploitation within the treebased search, following the insights from stochastic multiarm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has a claimed property. Interestingly enough, empirically successful approaches use a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we arguemore »
This content will become publicly available on March 29, 2023
Hidden Integrality and Semirandom Robustness of SDP Relaxation for SubGaussian Mixture Model
We consider the problem of estimating the discrete clustering structures under the subGaussian mixture model. Our main results establish a hidden integrality property of a semidefinite programming (SDP) relaxation for this problem: while the optimal solution to the SDP is not integervalued in general, its estimation error can be upper bounded by that of an idealized integer program. The error of the integer program, and hence that of the SDP, are further shown to decay exponentially in the signaltonoise ratio. In addition, we show that the SDP relaxation is robust under the semirandom setting in which an adversary can modify the data generated from the mixture model. In particular, we generalize the hidden integrality property to the semirandom model and thereby show that SDP achieves the optimal error bound in this setting. These results together highlight the “globaltolocal” mechanism that drives the performance of the SDP relaxation. To the best of our knowledge, our result is the first exponentially decaying error bound for convex relaxations of mixture models. A corollary of our results shows that in certain regimes, the SDP solutions are in fact integral and exact. More generally, our results establish sufficient conditions for the SDP to correctly recover more »
 Award ID(s):
 2047910
 Publication Date:
 NSFPAR ID:
 10327714
 Journal Name:
 Mathematics of Operations Research
 ISSN:
 0364765X
 Sponsoring Org:
 National Science Foundation
More Like this


We study the dynamic assortment planning problem, where for each arriving customer, the seller offers an assortment of substitutable products and the customer makes the purchase among offered products according to an uncapacitated multinomial logit (MNL) model. Because all the utility parameters of the MNL model are unknown, the seller needs to simultaneously learn customers’ choice behavior and make dynamic decisions on assortments based on the current knowledge. The goal of the seller is to maximize the expected revenue, or, equivalently, to minimize the expected regret. Although dynamic assortment planning problem has received an increasing attention in revenue management, most existing policies require the estimation of mean utility for each product and the final regret usually involves the number of products [Formula: see text]. The optimal regret of the dynamic assortment planning problem under the most basic and popular choice model—the MNL model—is still open. By carefully analyzing a revenue potential function, we develop a trisectionbased policy combined with adaptive confidence bound construction, which achieves an itemindependent regret bound of [Formula: see text], where [Formula: see text] is the length of selling horizon. We further establish the matching lower bound result to show the optimality of our policy. There aremore »

Pricebased revenue management is an important problem in operations management with many practical applications. The problem considers a seller who sells one or multiple products over T consecutive periods and is subject to constraints on the initial inventory levels of resources. Whereas, in theory, the optimal pricing policy could be obtained via dynamic programming, computing the exact dynamic programming solution is often intractable. Approximate policies, such as the resolving heuristics, are often applied as computationally tractable alternatives. In this paper, we show the following two results for pricebased network revenue management under a continuous price set. First, we prove that a natural resolving heuristic attains O(1) regret compared with the value of the optimal policy. This improves the [Formula: see text] regret upper bound established in the prior work by Jasin in 2014. Second, we prove that there is an [Formula: see text] gap between the value of the optimal policy and that of the fluid model. This complements our upper bound result by showing that the fluid is not an adequate informationrelaxed benchmark when analyzing pricebased revenue management algorithms. Funding: This work was supported in part by the National Science Foundation [Grant CMMI2145661].

Queueing models that are used to capture various service settings typically assume that customers require a single unit of resource (server) to be processed. However, there are many service settings where such an assumption may fail to capture the heterogeneity in resource requirements of different customers. We propose a multiserver queueing model with multiple customer classes in which customers from different classes may require different amounts of resources to be served. We study the optimal scheduling policy for such systems. To balance holding costs, service rates, resource requirement, and priorityinduced idleness, we develop an indexbased policy that we refer to as the idleavoid [Formula: see text] rule. For a twoclass twoserver model, where policyinduced idleness can have a big impact on system performance, we characterize cases where the idleavoid [Formula: see text] rule is optimal. In other cases, we establish a uniform performance bound on the amount of suboptimality incurred by the idleavoid [Formula: see text] rule. For general multiclass multiserver queues, we establish the asymptotic optimality of the idleavoid [Formula: see text] rule in the manyserver regime. For longtime horizons, we show that the idleavoid [Formula: see text] is throughput optimal. Our theoretical results, along with numerical experiments, providemore »

We consider a revenuemaximizing seller with m heterogeneous items and a single buyer whose valuation for the items may exhibit both substitutes and complements. We show that the better of selling the items separately and bundling them together— guarantees a [Formula: see text]fraction of the optimal revenue, where d is a measure of the degree of complementarity; it extends prior work showing that the same simple mechanism achieves a constantfactor approximation when buyer valuations are subadditive (the most general class of complementfree valuations). Our proof is enabled by a recent duality framework, which we use to obtain a bound on the optimal revenue in the generalized setting. Our technical contributions are domain specific to handle the intricacies of settings with complements. One key modeling contribution is a tractable notion of “degree of complementarity” that admits meaningful results and insights—we demonstrate that previous definitions fall short in this regard.