This content will become publicly available on November 1, 2023

Title: Constant Regret Resolving Heuristics for Price-Based Revenue Management
Price-based revenue management is an important problem in operations management with many practical applications. The problem considers a seller who sells one or multiple products over T consecutive periods and is subject to constraints on the initial inventory levels of resources. Whereas, in theory, the optimal pricing policy could be obtained via dynamic programming, computing the exact dynamic programming solution is often intractable. Approximate policies, such as the resolving heuristics, are often applied as computationally tractable alternatives. In this paper, we show the following two results for price-based network revenue management under a continuous price set. First, we prove that a natural resolving heuristic attains O(1) regret compared with the value of the optimal policy. This improves the [Formula: see text] regret upper bound established in the prior work by Jasin in 2014. Second, we prove that there is an [Formula: see text] gap between the value of the optimal policy and that of the fluid model. This complements our upper bound result by showing that the fluid is not an adequate information-relaxed benchmark when analyzing price-based revenue management algorithms. Funding: This work was supported in part by the National Science Foundation [Grant CMMI-2145661].
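As an illustration of the resolving idea described in the abstract, here is a minimal sketch for a deliberately simplified setting. It is not the paper's algorithm or model. Everything specific is an assumption made for the example: a single product and a single resource, a linear demand curve d(p) = a - b*p with a <= 1, at most one Bernoulli sale per period, and the hypothetical function names `fluid_rate`, `price_for_rate`, `fluid_value`, and `simulate_resolving`. In each period the heuristic re-solves the fluid (deterministic) problem with the remaining inventory and remaining horizon, then posts the resulting price.

```python
import random


def fluid_rate(inventory, periods_left, a):
    """Per-period demand rate that solves the fluid problem.

    Assumed model: d(p) = a - b*p, so revenue per period as a function of
    the demand rate lam is lam * (a - lam) / b, which is concave with its
    unconstrained maximum at lam = a / 2. The inventory constraint caps the
    rate at inventory / periods_left.
    """
    if periods_left <= 0 or inventory <= 0:
        return 0.0
    return min(a / 2.0, inventory / periods_left)


def price_for_rate(rate, a, b):
    """Invert the assumed linear demand curve: p = (a - rate) / b."""
    return (a - rate) / b


def fluid_value(inventory, horizon, a, b):
    """Revenue of the fluid model over the whole horizon."""
    rate = fluid_rate(inventory, horizon, a)
    return horizon * rate * price_for_rate(rate, a, b)


def simulate_resolving(inventory, horizon, a, b, rng):
    """Simulate one sample path of the resolving heuristic.

    Returns (revenue, units_sold). Demand in each period is assumed to be
    Bernoulli with success probability equal to the targeted rate.
    """
    revenue = 0.0
    remaining = inventory
    for t in range(horizon):
        if remaining <= 0:
            break
        # Re-solve the fluid problem from the current state.
        rate = fluid_rate(remaining, horizon - t, a)
        price = price_for_rate(rate, a, b)
        if rng.random() < rate:
            revenue += price
            remaining -= 1
    return revenue, inventory - remaining


if __name__ == "__main__":
    rng = random.Random(0)
    a, b, inventory, horizon = 0.8, 1.0, 30, 100
    runs = [simulate_resolving(inventory, horizon, a, b, rng)[0] for _ in range(2000)]
    print("fluid value:", fluid_value(inventory, horizon, a, b))
    print("mean resolving revenue:", sum(runs) / len(runs))
```

Running the script prints the fluid value next to the average simulated revenue. In this toy setting the difference between the two can be tracked as the horizon and inventory are scaled up together, which is the kind of comparison the abstract's second result concerns.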
Journal Name:
Operations Research
Page Range or eLocation-ID:
3538 to 3557
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the periodic review dynamic pricing and inventory control problem with fixed ordering cost. Demand is random and price dependent, and unsatisfied demand is backlogged. With complete demand information, the celebrated [Formula: see text] policy is proved to be optimal, where s and S are the reorder point and order-up-to level for ordering strategy, and [Formula: see text], a function of on-hand inventory level, characterizes the pricing strategy. In this paper, we consider incomplete demand information and develop online learning algorithms whose average profit approaches that of the optimal [Formula: see text] with a tight [Formula: see text] regret rate. A number of salient features differentiate our work from the existing online learning research in the operations management (OM) literature. First, computing the optimal [Formula: see text] policy requires solving a dynamic program (DP) over multiple periods involving unknown quantities, which is different from the majority of learning problems in OM that only require solving single-period optimization questions. It is hence challenging to establish stability results through DP recursions, which we accomplish by proving uniform convergence of the profit-to-go function. The necessity of analyzing action-dependent state transition over multiple periods resembles the reinforcement learning question, which is considerably more difficult than existing bandit learning algorithms. Second, the pricing function [Formula: see text] is of infinite dimension, and approaching it is much more challenging than approaching a finite number of parameters as seen in existing research. The demand-price relationship is estimated based on upper confidence bound, but the confidence interval cannot be explicitly calculated due to the complexity of the DP recursion. Finally, because of the multiperiod nature of [Formula: see text] policies, the actual distribution of the randomness in demand plays an important role in determining the optimal pricing strategy [Formula: see text], which is unknown to the learner a priori. In this paper, the demand randomness is approximated by an empirical distribution constructed using dependent samples, and a novel Wasserstein metric-based argument is employed to prove convergence of the empirical distribution. This paper was accepted by J. George Shanthikumar, big data analytics.
  2. We study the dynamic assortment planning problem, where for each arriving customer, the seller offers an assortment of substitutable products and the customer makes the purchase among offered products according to an uncapacitated multinomial logit (MNL) model. Because all the utility parameters of the MNL model are unknown, the seller needs to simultaneously learn customers’ choice behavior and make dynamic decisions on assortments based on the current knowledge. The goal of the seller is to maximize the expected revenue, or, equivalently, to minimize the expected regret. Although the dynamic assortment planning problem has received increasing attention in revenue management, most existing policies require the estimation of mean utility for each product and the final regret usually involves the number of products [Formula: see text]. The optimal regret of the dynamic assortment planning problem under the most basic and popular choice model—the MNL model—is still open. By carefully analyzing a revenue potential function, we develop a trisection-based policy combined with adaptive confidence bound construction, which achieves an item-independent regret bound of [Formula: see text], where [Formula: see text] is the length of selling horizon. We further establish the matching lower bound result to show the optimality of our policy. There are two major advantages of the proposed policy. First, the regret of all our policies has no dependence on [Formula: see text]. Second, our policies are almost assumption-free: there is no assumption on mean utility nor any “separability” condition on the expected revenues for different assortments. We also extend our trisection search algorithm to capacitated MNL models and obtain the optimal regret [Formula: see text] (up to logarithmic factors) without any assumption on the mean utility parameters of items.
  3. The prevalence of e-commerce has made customers’ detailed personal information readily accessible to retailers, and this information has been widely used in pricing decisions. When using personalized information, the question of how to protect the privacy of such information becomes a critical issue in practice. In this paper, we consider a dynamic pricing problem over T time periods with an unknown demand function of posted price and personalized information. At each time t, the retailer observes an arriving customer’s personal information and offers a price. The customer then makes the purchase decision, which will be utilized by the retailer to learn the underlying demand function. There is potentially a serious privacy concern during this process: a third-party agent might infer the personalized information and purchase decisions from price changes in the pricing system. Using the fundamental framework of differential privacy from computer science, we develop a privacy-preserving dynamic pricing policy, which tries to maximize the retailer revenue while avoiding information leakage of individual customer’s information and purchasing decisions. To this end, we first introduce a notion of anticipating [Formula: see text]-differential privacy that is tailored to the dynamic pricing problem. Our policy achieves both the privacy guarantee and the performance guarantee in terms of regret. Roughly speaking, for d-dimensional personalized information, our algorithm achieves the expected regret at the order of [Formula: see text] when the customers’ information is adversarially chosen. For stochastic personalized information, the regret bound can be further improved to [Formula: see text]. This paper was accepted by J. George Shanthikumar, big data analytics.
  4. In this work, we consider the popular tree-based search strategy within the framework of reinforcement learning, the Monte Carlo tree search (MCTS), in the context of the infinite-horizon discounted cost Markov decision process (MDP). Although MCTS is believed to provide an approximate value function for a given state with enough simulations, the claimed proof of this property is incomplete. This is because the variant of MCTS, the upper confidence bound for trees (UCT), analyzed in prior works, uses “logarithmic” bonus term for balancing exploration and exploitation within the tree-based search, following the insights from stochastic multiarm bandit (MAB) literature. In effect, such an approach assumes that the regret of the underlying recursively dependent nonstationary MABs concentrates around their mean exponentially in the number of steps, which is unlikely to hold, even for stationary MABs. As the key contribution of this work, we establish polynomial concentration property of regret for a class of nonstationary MABs. This in turn establishes that the MCTS with appropriate polynomial rather than logarithmic bonus term in UCB has a claimed property. Interestingly enough, empirically successful approaches use a similar polynomial form of MCTS as suggested by our result. Using this as a building block, we argue that MCTS, combined with nearest neighbor supervised learning, acts as a “policy improvement” operator; that is, it iteratively improves value function approximation for all states because of combining with supervised learning, despite evaluating at only finitely many states. In effect, we establish that to learn an ε approximation of the value function with respect to [Formula: see text] norm, MCTS combined with nearest neighbor requires a sample size scaling as [Formula: see text], where d is the dimension of the state space. This is nearly optimal because of a minimax lower bound of [Formula: see text], suggesting the strength of the variant of MCTS we propose here and our resulting analysis.
  5. This work concerns the asymptotic behavior of solutions to a (strictly) subcritical fluid model for a data communication network, where file sizes are generally distributed and the network operates under a fair bandwidth-sharing policy. Here we consider fair bandwidth-sharing policies that are a slight generalization of the [Formula: see text]-fair policies introduced by Mo and Walrand [Mo J, Walrand J (2000) Fair end-to-end window-based congestion control. IEEE/ACM Trans. Networks 8(5):556–567.]. Since the year 2000, it has been a standing problem to prove stability of the data communications network model of Massoulié and Roberts [Massoulié L, Roberts J (2000) Bandwidth sharing and admission control for elastic traffic. Telecommunication Systems 15(1):185–201.], with general file sizes and operating under fair bandwidth sharing policies, when the offered load is less than capacity (subcritical conditions). A crucial step in an approach to this problem is to prove stability of subcritical fluid model solutions. In 2012, Paganini et al. [Paganini F, Tang A, Ferragut A, Andrew LLH (2012) Network stability under alpha fair bandwidth allocation with general file size distribution. IEEE Trans. Automatic Control 57(3):579–591.] introduced a Lyapunov function for this purpose and gave an argument, assuming that fluid model solutions are sufficiently smooth in time and space that they are strong solutions of a partial differential equation and assuming that no fluid level on any route touches zero before all route levels reach zero. The aim of the current paper is to prove stability of the subcritical fluid model without these strong assumptions. Starting with a slight generalization of the Lyapunov function proposed by Paganini et al., assuming that each component of the initial state of a measure-valued fluid model solution, as well as the file size distributions, have no atoms and have finite first moments, we prove absolute continuity in time of the composition of the Lyapunov function with any subcritical fluid model solution and describe the associated density. We use this to prove that the Lyapunov function composed with such a subcritical fluid model solution converges to zero as time goes to infinity. This implies that each component of the measure-valued fluid model solution converges vaguely on [Formula: see text] to the zero measure as time goes to infinity. Under the further assumption that the file size distributions have finite pth moments for some p > 1 and that each component of the initial state of the fluid model solution has finite pth moment, it is proved that the fluid model solution reaches the measure with all components equal to the zero measure in finite time and that the time to reach this zero state has a uniform bound for all fluid model solutions having a uniform bound on the initial total mass and the pth moment of each component of the initial state. In contrast to the analysis of Paganini et al., we do not need their strong smoothness assumptions on fluid model solutions and we rigorously treat the realistic, but singular situation, where the fluid level on some routes becomes zero, whereas other route levels remain positive.