-
Large language models (LLMs) demonstrate impressive reasoning abilities, but translating reasoning into actions in the real world remains challenging. In particular, it is unclear how to complete a given task provably within a minimum number of interactions with the external environment, e.g., through an internal mechanism of reasoning. To this end, we propose the first framework with provable regret guarantees to orchestrate reasoning and acting, which we call “reason for future, act for now” (RAFA). Specifically, we design a prompt template for reasoning that learns from the memory buffer and plans a future trajectory over a long horizon (“reason for future”). At each step, the LLM agent takes the initial action of the planned trajectory (“act for now”), stores the collected feedback in the memory buffer, and reinvokes the reasoning routine to replan the future trajectory from the new state. The key idea is to cast reasoning in LLMs as learning and planning in Bayesian adaptive Markov decision processes (MDPs). Correspondingly, we prompt LLMs with the memory buffer to estimate the unknown environment (learning) and to generate an optimal trajectory for multiple future steps that maximizes a value function (planning). The learning and planning subroutines are performed in an “in-context” manner to emulate the actor-critic update for MDPs. Our theoretical analysis establishes a √T regret bound, where T denotes the number of online interactions, while our experimental validation demonstrates superior empirical performance.
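The orchestration described above amounts to a simple control loop: plan a long-horizon trajectory from the memory buffer, execute only its first action, record the feedback, and replan from the new state. Below is a minimal sketch of that loop; `env`, `llm_reason`, and the step interface are hypothetical stand-ins for illustration, not the authors' implementation.

```python
# Minimal sketch of the "reason for future, act for now" (RAFA) loop.
# `llm_reason` is a hypothetical routine that prompts the LLM with the memory
# buffer to (i) estimate the unknown environment from past transitions
# (learning) and (ii) plan a multi-step trajectory of actions from the current
# state that maximizes the estimated value (planning).

def rafa_loop(env, llm_reason, initial_state, num_steps):
    memory = []              # buffer of (state, action, reward, next_state)
    state = initial_state
    for t in range(num_steps):
        # "Reason for future": plan a trajectory over a long horizon,
        # conditioning on everything stored in the memory buffer.
        planned_actions = llm_reason(memory, state)

        # "Act for now": execute only the initial action of the planned
        # trajectory, then observe feedback from the real environment.
        action = planned_actions[0]
        next_state, reward, done = env.step(action)

        # Store the collected feedback; the next iteration replans
        # from the new state.
        memory.append((state, action, reward, next_state))
        state = next_state
        if done:
            break
    return memory
```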
-
As one of the most fundamental concepts in transportation science, Wardrop equilibrium (WE) has always had a relatively weak behavioral underpinning. To strengthen this foundation, one must reckon with bounded rationality in human decision-making processes, such as the lack of accurate information, limited computing power, and suboptimal choices. This retreat from behavioral perfectionism in the literature, however, was typically accompanied by a conceptual modification of WE. Here, we show that giving up perfect rationality need not force a departure from WE. On the contrary, WE can be reached with global stability in a routing game played by boundedly rational travelers. We achieve this result by developing a day-to-day (DTD) dynamical model that mimics how travelers gradually adjust their route valuations, hence choice probabilities, based on past experiences. Our model, called cumulative logit (CumLog), resembles the classical DTD models but makes a crucial change; whereas the classical models assume that routes are valued based on the cost averaged over historical data, our model values the routes based on the cost accumulated. To describe route choice behaviors, the CumLog model only uses two parameters, one accounting for the rate at which the future route cost is discounted in the valuation relative to the past ones and the other describing the sensitivity of route choice probabilities to valuation differences. We prove that the CumLog model always converges to WE, regardless of the initial point, as long as the behavioral parameters satisfy certain mild conditions. Our theory thus upholds WE’s role as a benchmark in transportation systems analysis. It also explains why equally good routes at equilibrium may be selected with different probabilities, which solves the instability problem posed by Harsanyi.
Funding: This research is funded by the National Science Foundation [Grants CMMI #2225087 and ECCS #2048075].
Supplemental Material: The online appendix is available at https://doi.org/10.1287/trsc.2023.0132.
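To make the valuation-accumulation idea in the abstract above concrete, here is an illustrative rendering of one day-to-day step of a CumLog-style update; it is a sketch, not the paper's exact specification. The names `eta` (weight discounting the newly experienced route cost relative to the costs already accumulated) and `theta` (sensitivity of route choice probabilities to valuation differences) are stand-ins for the two behavioral parameters described in the abstract.

```python
import numpy as np

def cumlog_step(valuations, experienced_costs, eta, theta):
    # Accumulate (rather than average) the newly experienced route costs,
    # with eta weighting the new cost relative to the accumulated past costs.
    valuations = valuations + eta * experienced_costs

    # Logit choice: routes with lower accumulated valuation are chosen with
    # higher probability; theta controls the sensitivity to valuation gaps.
    scores = -theta * valuations
    scores = scores - scores.max()          # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return valuations, probs
```

A classical averaging model would replace the accumulation line with a running mean of experienced costs; keeping the running sum is the change the abstract identifies as crucial for convergence to WE.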
-
Nearly Dimension-Independent Sparse Linear Bandit over Small Action Spaces via Best Subset Selection
-
We develop provably efficient reinforcement learning algorithms for two-player zero-sum finite-horizon Markov games with simultaneous moves. To incorporate function approximation, we consider a family of Markov games where the reward function and transition kernel possess a linear structure. Both the offline and online settings of the problem are considered. In the offline setting, we control both players and aim to find the Nash equilibrium by minimizing the duality gap. In the online setting, we control a single player playing against an arbitrary opponent and aim to minimize the regret. For both settings, we propose an optimistic variant of the least-squares minimax value iteration algorithm. We show that our algorithm is computationally efficient and provably achieves an [Formula: see text] upper bound on the duality gap and regret, where d is the linear dimension, H the horizon, and T the total number of timesteps. Our results do not require additional assumptions on the sampling model. Our setting requires overcoming several new challenges that are absent in Markov decision processes or turn-based Markov games. In particular, to achieve optimism with simultaneous moves, we construct both upper and lower confidence bounds of the value function, and then compute the optimistic policy by solving a general-sum matrix game with these bounds as the payoff matrices. As finding the Nash equilibrium of a general-sum game is computationally hard, our algorithm instead solves for a coarse correlated equilibrium (CCE), which can be obtained efficiently. To the best of our knowledge, such a CCE-based scheme for optimism has not appeared in the literature and might be of interest in its own right.
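Because the optimism scheme above only needs an efficient CCE oracle for a general-sum matrix game (with the upper and lower confidence bounds playing the role of the two payoff matrices), a generic linear-programming computation suffices. The sketch below illustrates that subroutine using scipy; it is an illustration under that assumption, not the authors' code, and `A`, `B` stand in for the two payoff matrices.

```python
import numpy as np
from scipy.optimize import linprog

def coarse_correlated_equilibrium(A, B):
    """Compute a CCE of a two-player general-sum matrix game via an LP.

    A[a, b] is player 1's payoff and B[a, b] is player 2's payoff when
    player 1 plays a and player 2 plays b. Returns a joint distribution
    pi over action pairs such that neither player gains in expectation by
    unilaterally committing to a fixed action before the joint action is drawn.
    """
    n1, n2 = A.shape
    num_vars = n1 * n2                      # one variable per joint action (a, b)

    rows = []
    # Player 1 incentive: for every deviation a', E_pi[A(a', b) - A(a, b)] <= 0.
    for a_dev in range(n1):
        gain = A[a_dev, :][None, :] - A     # entry (a, b): payoff change under deviation
        rows.append(gain.flatten())
    # Player 2 incentive: for every deviation b', E_pi[B(a, b') - B(a, b)] <= 0.
    for b_dev in range(n2):
        gain = B[:, b_dev][:, None] - B
        rows.append(gain.flatten())

    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))
    A_eq = np.ones((1, num_vars))           # probabilities sum to one
    b_eq = np.array([1.0])

    # Any feasible point is a CCE; here we pick one maximizing total payoff.
    c = -(A + B).flatten()
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, method="highs")
    return res.x.reshape(n1, n2)
```

Unlike Nash equilibrium computation in general-sum games, this LP has a number of variables and constraints polynomial in the action counts, which is what makes the CCE-based optimism step tractable.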