In many operations management problems, we need to make decisions sequentially to minimize cost while satisfying certain constraints. One modeling approach to such problems is the constrained Markov decision process (CMDP). In this work, we develop a data-driven primal-dual algorithm to solve CMDPs. Our approach alternately applies regularized policy iteration to improve the policy and subgradient ascent to maintain the constraints. Under mild regularity conditions, we show that the algorithm converges at rate $O(1/\sqrt{T})$, where $T$ is the number of iterations, for both the discounted and long-run average cost formulations. Our algorithm can easily be combined with advanced deep learning techniques to handle complex large-scale problems, with the additional benefit of a straightforward convergence analysis. When the CMDP has a weakly coupled structure, our approach can further reduce the computational complexity through an embedded decomposition. We apply the algorithm to two operations management problems: multiclass queue scheduling and multiproduct inventory management. Numerical experiments demonstrate that our algorithm, when combined with appropriate value function approximations, generates policies that achieve superior performance compared with state-of-the-art heuristics. This paper was accepted by Baris Ata, stochastic models and simulation. Funding: Y. Chen was supported by the Hong Kong Research Grants Council, Early Career Scheme Fund [Grant 26508924], and partially supported by a grant from the National Natural Science Foundation of China [Grant 72495125]. J. Dong was supported by the National Science Foundation [Grant 1944209]. Supplemental Material: The data files are available at https://doi.org/10.1287/mnsc.2022.03736.
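To make the primal-dual alternation concrete, here is a minimal sketch on a hypothetical tabular discounted CMDP. The transition kernel, costs, budget, and step sizes below are illustrative assumptions, and exact policy evaluation stands in for the paper's regularized policy iteration with value function approximation:

```python
import numpy as np

# Hypothetical tabular discounted CMDP (all sizes and data are illustrative).
rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition kernel P[s, a, s']
c = rng.uniform(size=(S, A))                 # objective cost c(s, a)
d = rng.uniform(size=(S, A))                 # constraint cost d(s, a)
budget, lam, eta, tau = 5.0, 0.0, 0.05, 0.1  # discounted budget, dual var, step sizes

def policy_value(pi, cost):
    """Exact policy evaluation: solve (I - gamma * P_pi) v = cost_pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    cost_pi = np.einsum('sa,sa->s', pi, cost)
    return np.linalg.solve(np.eye(S) - gamma * P_pi, cost_pi)

pi = np.full((S, A), 1.0 / A)                # start from the uniform policy
for t in range(300):
    lagr = c + lam * d                       # Lagrangian stage cost
    q = lagr + gamma * np.einsum('sap,p->sa', P, policy_value(pi, lagr))
    pi = np.exp(-q / tau)                    # entropy-regularized improvement step
    pi /= pi.sum(axis=1, keepdims=True)
    slack = policy_value(pi, d).mean() - budget   # average constraint violation
    lam = max(0.0, lam + eta * slack)             # dual subgradient ascent
print(f"lambda = {lam:.3f}, constraint value = {policy_value(pi, d).mean():.3f}")
```

The dual variable prices the constraint: ascent raises it while the constraint is violated, and the entropy regularization keeps each policy update stable, which is the kind of smoothing a convergence analysis leans on.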
This content will become publicly available on June 9, 2026
Optimal Policy for Inventory Management with Periodic and Controlled Resets
Problem definition: Inventory management problems with periodic and controllable resets occur in the context of managing water storage in the developing world and dynamically optimizing endcap promotion duration in retail outlets. In this paper, we consider a set of sequential decision problems in which the decision maker must not only balance holding and shortage costs but also discard all inventory before a fixed number of decision epochs, with the option of an early inventory reset. Methodology/results: Finding optimal policies for these problems through dynamic programming presents unique challenges because of the nonconvex nature of the resulting value functions. Moreover, this structure cannot be readily analyzed even with extended convexity definitions, such as K-convexity. Managerial implications: Our key contribution is to present sufficient conditions that ensure the optimal policy has an easily interpretable structure, which generalizes the well-known $(s, S)$ policy from the operations management literature. Furthermore, we demonstrate that, under these rather mild conditions, the optimal policy exhibits a four-threshold structure. We conclude with computational experiments illustrating the policy structures that can be extracted in various inventory management scenarios. Funding: This work was supported by the National Science Foundation [Grant CMMI-1847666] and the Division of Graduate Education [Grant DGE-2125913]. Supplemental Material: The online appendix is available at https://doi.org/10.1287/msom.2022.0318.
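As a concrete illustration of the dynamic program behind such problems, the sketch below solves a small finite-horizon instance with a forced discard at the horizon and an optional early reset. All cost parameters and the demand distribution are assumptions for illustration, not the paper's model or calibration:

```python
import numpy as np

# Illustrative finite-horizon inventory DP with a forced reset at the horizon
# and an optional early reset; parameters and demand pmf are assumptions.
T_horizon, max_inv = 8, 20
h, p, K, c_unit, r_cost = 1.0, 4.0, 5.0, 2.0, 3.0  # holding, shortage, fixed, unit, reset
demand = np.arange(6)                              # demand support {0, ..., 5}
probs = np.full(6, 1.0 / 6.0)                      # uniform demand pmf

V = np.zeros((T_horizon + 1, max_inv + 1))
V[T_horizon] = r_cost * (np.arange(max_inv + 1) > 0)   # discard leftover stock

for t in range(T_horizon - 1, -1, -1):
    for x in range(max_inv + 1):                   # x = 0 first, so V[t, 0] is ready below
        best = np.inf
        for y in range(x, max_inv + 1):            # order up to y (y == x: no order)
            order = K + c_unit * (y - x) if y > x else 0.0
            onhand = np.maximum(y - demand, 0)
            short = np.maximum(demand - y, 0)
            best = min(best, order + probs @ (h * onhand + p * short + V[t + 1, onhand]))
        if x > 0:                                  # early reset: discard all, restart empty
            best = min(best, r_cost + V[t, 0])
        V[t, x] = best
```

Recording the minimizing action at each (t, x) and scanning it in x is how one would numerically inspect the threshold structure the paper characterizes.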
- Award ID(s): 2125913
- PAR ID: 10627356
- Publisher / Repository: Manufacturing & Service Operations Management
- Date Published:
- Journal Name: Manufacturing & Service Operations Management
- ISSN: 1523-4614
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- We consider optimal transport-based distributionally robust optimization (DRO) problems with locally strongly convex transport cost functions and affine decision rules. Under conventional convexity assumptions on the underlying loss function, we obtain structural results about the value function, the optimal policy, and the worst-case optimal transport adversarial model. These results expose a rich structure embedded in the DRO problem (e.g., strong convexity even when the non-DRO problem is not strongly convex, and a suitable scaling of the Lagrangian for the DRO constraint) that is crucial for the design of efficient algorithms. As a consequence of these results, one can develop efficient optimization procedures that have the same sample and iteration complexity as a natural non-DRO benchmark algorithm, such as stochastic gradient descent.
- Ozay, Necmiye; Balzano, Laura; Panagou, Dimitra; Abate, Alessandro (Ed.) Many optimal and robust control problems are nonconvex and potentially nonsmooth in their policy optimization forms. In this paper, we introduce the Extended Convex Lifting (ECL) framework, which reveals hidden convexity in classical optimal and robust control problems from a modern optimization perspective. Our ECL framework offers a bridge between nonconvex policy optimization and convex reformulations. Despite nonconvexity and nonsmoothness, the existence of an ECL for a policy optimization problem not only reveals that the problem is equivalent to a convex problem but also certifies a class of first-order nondegenerate stationary points to be globally optimal. We further show that this ECL framework encompasses many benchmark control problems, including LQR as well as state-feedback and output-feedback H-infinity robust control. We believe that ECL will also be of independent interest for analyzing nonconvex problems beyond control.
- We consider a stochastic inventory control problem under censored demand, lost sales, and positive lead times. This is a fundamental problem in inventory management, with a significant literature establishing the near optimality of a simple class of policies called "base-stock policies" as well as the convexity of the long-run average cost under those policies. We consider the relatively less studied problem of designing a learning algorithm for this setting when the underlying demand distribution is unknown. The goal is to bound the regret of the algorithm when compared with the best base-stock policy. Our main contribution is a learning algorithm with a regret bound of $\tilde{O}(L\sqrt{T} + D\sqrt{T})$ for the inventory control problem. Here, $L$ is the fixed and known lead time, and $D$ is an unknown parameter of the demand distribution, described roughly as the expected number of time steps needed to generate enough demand to deplete one unit of inventory. Notably, our regret bounds depend linearly on $L$, which significantly improves the previously best-known regret bounds for this problem, where the dependence on $L$ was exponential. Our techniques utilize the convexity of the long-run average cost and a newly derived bound on the "bias" of base-stock policies to establish an almost black-box connection between the problem of learning in Markov decision processes (MDPs) with these properties and the stochastic convex bandit problem. The techniques presented here may be of independent interest for other settings that involve large structured MDPs but with convex asymptotic average cost functions. (A simulation sketch of a base-stock policy under lost sales appears after this list.)
- This letter studies how a stochastic gradient algorithm (SG) can be controlled to hide the estimate of the local stationary point from an eavesdropper. Such problems are of significant interest in distributed optimization settings like federated learning and inventory management. A learner queries a stochastic oracle and incentivizes the oracle to obtain noisy gradient measurements and perform SG. The oracle probabilistically returns either a noisy gradient of the function or a non-informative measurement, depending on the oracle state and incentive. The learner's query and incentive are visible to an eavesdropper who wishes to estimate the stationary point. This letter formulates the problem of the learner performing covert optimization by dynamically incentivizing the stochastic oracle and obfuscating the eavesdropper as a finite-horizon Markov decision process (MDP). Using interval-dominance conditions on the cost and transition probability structure, we show that the optimal policy for the MDP has a monotone threshold structure. We propose searching for the optimal stationary policy with the threshold structure using a stochastic approximation algorithm and a multi-armed bandit approach. The effectiveness of our methods is numerically demonstrated on a covert federated learning hate-speech classification task. (A toy threshold-search sketch appears after this list.)
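For the base-stock entry above, here is a minimal simulator of the long-run average cost of a base-stock policy under lost sales and a positive lead time. The Poisson demand, cost parameters, and horizon are illustrative assumptions; the cited algorithm would treat such a simulator as a noisy black box and exploit the convexity of this cost in the base-stock level:

```python
import numpy as np

def average_cost(base_stock, L=2, h=1.0, p=4.0, T=20_000, seed=0):
    """Simulated long-run average cost of a base-stock policy with lost sales
    and lead time L; demand is illustratively Poisson(3)."""
    rng = np.random.default_rng(seed)
    on_hand = base_stock
    pipeline = [0] * L                      # orders placed but not yet delivered
    total = 0.0
    for _ in range(T):
        on_hand += pipeline.pop(0)          # receive the order placed L periods ago
        position = on_hand + sum(pipeline)
        pipeline.append(max(base_stock - position, 0))  # order up to the base stock
        demand = rng.poisson(3)
        sales = min(on_hand, demand)
        on_hand -= sales
        total += h * on_hand + p * (demand - sales)     # unmet demand is lost
    return total / T

# Grid evaluation just for illustration; a learner only sees noisy queries.
costs = {s: average_cost(s) for s in range(4, 14)}
print("best base-stock level:", min(costs, key=costs.get))
```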
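And for the covert-optimization entry, a toy epsilon-greedy bandit search over stationary threshold policies. The cost model inside the simulator (incentive paid, information leaked, progress made) is entirely hypothetical and stands in for the episode cost of the true finite-horizon MDP:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulated_cost(threshold, episodes=200):
    """Hypothetical average episode cost of a threshold policy: querying
    informatively pays an incentive and leaks information to the eavesdropper,
    but makes progress proportional to the remaining estimation error."""
    cost = 0.0
    for _ in range(episodes):
        state = rng.uniform()                # remaining estimation error (toy)
        if state >= threshold:               # informative, incentivized query
            cost += 0.3 + 0.2 - 1.2 * state  # incentive + leakage - progress
        else:                                # cheap obfuscating query
            cost += 0.05
    return cost / episodes + rng.normal(0, 0.02)

# Epsilon-greedy bandit over a grid of candidate thresholds.
grid = np.linspace(0.0, 1.0, 11)
counts = np.zeros(len(grid))
means = np.zeros(len(grid))
for _ in range(1000):
    i = rng.integers(len(grid)) if rng.uniform() < 0.1 else int(np.argmin(means))
    counts[i] += 1
    means[i] += (simulated_cost(grid[i]) - means[i]) / counts[i]
print("selected threshold:", grid[int(np.argmin(means))])
```

In the letter's setting, the same kind of one-dimensional search over the certified class of threshold policies would be driven by stochastic approximation or a bandit scheme rather than solving the full MDP.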