NSF PAR Search | NSF Public Access Repository

Provable Policy Gradient for Robust Average-Reward MDPs Beyond Rectangularity

Wang, Qiuhao; Zha, Yuqi; Ho, Chin Pang; Petrik, Marek (July 2025, International Conference on Machine Learning)

Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we present the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
Free, publicly-accessible full text available July 18, 2026
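
To make the inner worst-case evaluation problem mentioned in the abstract concrete, the following is a minimal sketch under strong simplifying assumptions: the policy is fixed, the ambiguity set is a small finite list of policy-induced transition matrices, and each induced chain is ergodic. It computes the average reward by power iteration and takes the minimum over the list. The function names and the toy numbers are hypothetical; this brute-force enumeration is not RP2G and not the gradient-based algorithms the paper proposes.

```python
def stationary_distribution(P, iters=10_000, tol=1e-12):
    """Power iteration for the stationary distribution of an ergodic chain.

    P[s][t] is the probability of moving from state s to state t under a
    fixed policy (an assumption of this sketch).
    """
    n = len(P)
    d = [1.0 / n] * n
    for _ in range(iters):
        new = [sum(d[s] * P[s][t] for s in range(n)) for t in range(n)]
        if max(abs(a - b) for a, b in zip(new, d)) < tol:
            return new
        d = new
    return d


def average_reward(P, r):
    """Long-run average reward: stationary distribution dotted with rewards."""
    d = stationary_distribution(P)
    return sum(ds * rs for ds, rs in zip(d, r))


def worst_case_average_reward(kernels, r):
    """Minimum average reward over a finite, illustrative ambiguity set."""
    return min(average_reward(P, r) for P in kernels)


# Toy example: two candidate transition matrices, reward 1 in state 0 only.
kernels = [
    [[0.9, 0.1], [0.5, 0.5]],
    [[0.5, 0.5], [0.5, 0.5]],
]
rewards = [1.0, 0.0]
worst = worst_case_average_reward(kernels, rewards)
```

With a general (non-finite, non-rectangular) ambiguity set the minimum cannot be found by enumeration, which is the setting the abstract says its gradient-based evaluation algorithms target.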
Q-learning for Quantile MDPs: A Decomposition, Performance, and Convergence Analysis

Hau, Jia Lin; Delage, Erick; Derman, Esther; Ghavamzadeh, Mohammad; Petrik, Marek (May 2025, International Conference on Artificial Intelligence and Statistics)

Free, publicly-accessible full text available May 5, 2026
Risk-averse Total-reward MDPs with ERM and EVaR

https://doi.org/10.1609/aaai.v39i19.34275

Su, Xihong; Petrik, Marek; Grand-Clément, Julien (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Optimizing risk-averse objectives in discounted MDPs is challenging because most models do not admit direct dynamic programming equations and require complex history-dependent policies. In this paper, we show that the risk-averse total reward criterion, under the Entropic Risk Measure (ERM) and Entropic Value at Risk (EVaR) risk measures, can be optimized by a stationary policy, making it simple to analyze, interpret, and deploy. We propose exponential value iteration, policy iteration, and linear programming to compute optimal policies. Compared with prior work, our results only require the relatively mild condition of transient MDPs and allow for both positive and negative rewards. Our results indicate that the total reward criterion may be preferable to the discounted criterion in a broad range of risk-averse reinforcement learning domains.
Free, publicly-accessible full text available April 11, 2026
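
For readers unfamiliar with the Entropic Risk Measure named in the abstract, here is a minimal sketch for a discrete reward distribution. It assumes one common convention for rewards, ERM_β[X] = -(1/β) log E[exp(-βX)] with β > 0; the paper's exact notation may differ, and the function name `erm` is hypothetical.

```python
import math


def erm(values, probs, beta):
    """Entropic risk of a discrete reward distribution (assumed convention).

    Computes -(1/beta) * log(sum_i probs[i] * exp(-beta * values[i])),
    using a log-sum-exp shift for numerical stability.
    """
    if beta <= 0:
        raise ValueError("beta must be positive")
    exponents = [-beta * v for v in values]
    shift = max(exponents)
    total = sum(p * math.exp(e - shift) for p, e in zip(probs, exponents))
    return -(shift + math.log(total)) / beta


# A fair coin paying 0 or 2 has mean 1; the risk-averse value is lower.
risky = erm([0.0, 2.0], [0.5, 0.5], beta=1.0)
# A certain payoff is valued at face value.
certain = erm([2.0, 2.0], [0.5, 0.5], beta=1.0)
```

Under this convention, larger β weights bad outcomes more heavily, and as β approaches 0 the value approaches the ordinary expectation.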
Non-adaptive Online Finetuning for Offline Reinforcement Learning

Huang, Audrey; Ghavamzadeh, Mohammad; Jiang, Nan; Petrik, Marek (August 2024, RL Conference Proceedings)

Full Text Available
ROIL: Robust Offline Imitation Learning without Trajectories

Doko, Gersi; Yang, Guang; Brown, Daniel S; Petrik, Marek (August 2024, Proceedings of the RL Conference)

Full Text Available
Bayesian Regret Minimization in Offline Bandits

Petrik, Marek; Tennenholtz, Guy; Ghavamzadeh, Mohammad (July 2024, Proceedings of the International Conference on Machine Learning)

Full Text Available
On Dynamic Programming Decompositions of Static Risk Measures in Markov Decision Processes

Hau, Jia Lin; Delage, Erick; Ghavamzadeh, Mohammad; Petrik, Marek (December 2023, Advances in Neural Information Processing Systems)

Full Text Available
Percentile Criterion Optimization in Offline Reinforcement Learning

Cousins, Cyrus; Lobo, Elita; Petrik, Marek; Zick, Yair (December 2023, Advances in Neural Information Processing Systems)

Full Text Available
Reducing Blackwell and Average Optimality to Discounted MDPs via the Blackwell Discount Factor

Grand-Clément, Julien; Petrik, Marek (December 2023, Advances in Neural Information Processing Systems)

Full Text Available
Solving multi-model MDPs by coordinate ascent and dynamic programming

Su, Xihong; Petrik, Marek (January 2023, Conference on Uncertainty in Artificial Intelligence)

The multi-model Markov decision process (MMDP) is a promising framework for computing policies that are robust to parameter uncertainty in MDPs. MMDPs aim to find a policy that maximizes the expected return over a distribution of MDP models. Because MMDPs are NP-hard to solve, most methods resort to approximations. In this paper, we derive the policy gradient of MMDPs and propose CADP, which combines a coordinate ascent method and a dynamic programming algorithm for solving MMDPs. The main innovation of CADP compared with earlier algorithms is to take the coordinate ascent perspective to adjust model weights iteratively to guarantee monotone policy improvements to a local maximum. A theoretical analysis of CADP proves that it never performs worse than previous dynamic programming algorithms like WSU. Our numerical results indicate that CADP substantially outperforms existing methods on several benchmark problems.
Full Text Available
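
The MMDP objective described in the abstract, the expected return over a distribution of MDP models, can be sketched as follows. This assumes a finite-horizon tabular setting, a deterministic time-dependent policy, and a finite list of weighted models; the data layout and function names are hypothetical. The code only evaluates the objective for a given policy and is not CADP, which additionally adjusts model weights and improves the policy.

```python
def policy_return(P, R, policy, p0, horizon):
    """Expected total reward of a policy in one model, by backward induction.

    P[s][a][t]: transition probability, R[s][a]: reward,
    policy[t][s]: action taken at time t in state s, p0: initial distribution.
    """
    n = len(p0)
    v = [0.0] * n
    for t in reversed(range(horizon)):
        v = [
            R[s][policy[t][s]]
            + sum(P[s][policy[t][s]][s2] * v[s2] for s2 in range(n))
            for s in range(n)
        ]
    return sum(p0[s] * v[s] for s in range(n))


def mmdp_objective(models, weights, policy, p0, horizon):
    """Weighted average of per-model returns for a single shared policy."""
    return sum(
        w * policy_return(P, R, policy, p0, horizon)
        for w, (P, R) in zip(weights, models)
    )


# Toy example: two models that share dynamics but disagree on rewards.
to_state_0 = [[[1.0, 0.0], [1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]]]
model_a = (to_state_0, [[1.0, 0.0], [1.0, 0.0]])  # action 0 pays 1
model_b = (to_state_0, [[0.0, 1.0], [0.0, 1.0]])  # action 1 pays 1
always_action_0 = [[0, 0]] * 3  # horizon of 3 steps
value = mmdp_objective(
    [model_a, model_b], [0.5, 0.5], always_action_0, [1.0, 0.0], horizon=3
)
```

Because a single policy must be used in every model, the best policy for the weighted objective generally differs from the best policy of any individual model, which is what makes the problem hard.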
