Robust Markov decision processes (MDPs) aim to find a policy that optimizes the worst-case performance over an uncertainty set of MDPs. Existing studies have mostly focused on robust MDPs under the discounted-reward criterion, leaving those under the average-reward criterion largely unexplored. In this paper, we develop the first comprehensive and systematic study of robust average-reward MDPs, where the goal is to optimize the long-term average performance under the worst case. Our contributions are four-fold: (1) we prove the uniform convergence of the robust discounted value function to the robust average-reward function as the discount factor γ goes to 1; (2) we derive the robust average-reward Bellman equation, characterize the structure of its solution set, and prove the equivalence between solving the robust Bellman equation and finding the optimal robust policy; (3) we design robust dynamic programming algorithms, and theoretically characterize their convergence to the optimal policy; and (4) we design two model-free algorithms utilizing the multi-level Monte Carlo approach, and prove their asymptotic convergence.
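For reference, contribution (1) can be stated informally as the following limit, written here in generic notation as a sketch: the $(1-\gamma)$ scaling and the symbols $V^{\pi}_{\mathcal{P},\gamma}$ (robust discounted value under uncertainty set $\mathcal{P}$) and $g^{\pi}_{\mathcal{P}}$ (robust average reward) are assumptions of this illustration, not quoted from the paper.

$$\lim_{\gamma \to 1}\,(1-\gamma)\, V^{\pi}_{\mathcal{P},\gamma}(s) \;=\; g^{\pi}_{\mathcal{P}}(s) \quad \text{uniformly over policies } \pi \text{ and states } s.$$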
Robust Average-Reward Markov Decision Processes
In robust Markov decision processes (MDPs), the uncertainty in the transition kernel is addressed by finding a policy that optimizes the worst-case performance over an uncertainty set of MDPs. While much of the literature has focused on discounted MDPs, robust average-reward MDPs remain largely unexplored. In this paper, we focus on robust average-reward MDPs, where the goal is to find a policy that optimizes the worst-case average reward over an uncertainty set. We first take an approach that approximates average-reward MDPs using discounted MDPs. We prove that the robust discounted value function converges to the robust average reward as the discount factor goes to 1; moreover, when the discount factor is sufficiently large, any optimal policy of the robust discounted MDP is also an optimal policy of the robust average-reward MDP. We further design a robust dynamic programming approach and theoretically characterize its convergence to the optimum. Then, we investigate robust average-reward MDPs directly without using discounted MDPs as an intermediate step. We derive the robust Bellman equation for robust average-reward MDPs, prove that the optimal policy can be derived from its solution, and further design a robust relative value iteration algorithm that provably finds its solution, or equivalently, the optimal robust policy.
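To make the robust relative value iteration idea concrete, here is a minimal tabular sketch, not the paper's algorithm: it assumes an (s,a)-rectangular contamination uncertainty set $\{(1-R)p^0_{s,a} + Rq\}$, for which the worst-case expected value of a vector $w$ has the closed form $(1-R)\,(p^0_{s,a})^{\top}w + R\min_{s'} w(s')$. All names (`P0`, `r`, `radius`, `ref_state`) are hypothetical.

```python
import numpy as np

def robust_rvi_contamination(P0, r, radius, ref_state=0, tol=1e-8, max_iters=10_000):
    """Sketch of robust relative value iteration for the contamination set
    {(1 - R) * p0 + R * q : q any distribution} (hypothetical names).

    P0: nominal kernel, shape (S, A, S); r: rewards, shape (S, A);
    radius: contamination level R in [0, 1)."""
    S, A, _ = P0.shape
    w = np.zeros(S)          # relative-value (bias-like) estimate
    gain = 0.0
    for _ in range(max_iters):
        # Closed-form worst case over the contamination set, for every (s, a):
        # (1 - R) * E_{p0}[w] + R * min_s w(s)
        worst = (1.0 - radius) * (P0 @ w) + radius * w.min()   # shape (S, A)
        q = r + worst
        v = q.max(axis=1)                                      # greedy backup
        gain = v[ref_state]                                    # RVI gain estimate
        w_new = v - gain                                       # keep iterates bounded
        if np.max(np.abs(w_new - w)) < tol:
            w = w_new
            break
        w = w_new
    policy = q.argmax(axis=1)
    return gain, w, policy
```

The reference-state subtraction is the usual relative value iteration normalization; the only robust ingredient in this sketch is replacing the nominal expectation with the contamination set's closed-form worst case.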
- PAR ID: 10488459
- Publisher / Repository: AAAI
- Date Published:
- Journal Name: Proceedings of the AAAI Conference on Artificial Intelligence
- Volume: 37
- Issue: 12
- ISSN: 2159-5399
- Page Range / eLocation ID: 15215 to 15223
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Krause, Andreas; Brunskill, Emma; Cho, Kyunghyun; Engelhardt, Barbara; Sabato, Sivan; Scarlett, Jonathan (Ed.) Robust Markov decision processes (MDPs) address the challenge of model uncertainty by optimizing the worst-case performance over an uncertainty set of MDPs. In this paper, we focus on robust average-reward MDPs under the model-free setting. We first theoretically characterize the structure of solutions to the robust average-reward Bellman equation, which is essential for our later convergence analysis. We then design two model-free algorithms, robust relative value iteration (RVI) TD and robust RVI Q-learning, and theoretically prove their convergence to the optimal solution. We provide several widely used uncertainty sets as examples, including those defined by the contamination model, total variation, Chi-squared divergence, Kullback-Leibler (KL) divergence, and Wasserstein distance.
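As a rough illustration of the abstract above (not the authors' algorithm): under the contamination model, the worst-case next-state value can be estimated from a single observed transition as $(1-R)V(s') + R\min_s V(s)$, so one tabular update in the spirit of robust RVI Q-learning might look like the hypothetical sketch below.

```python
import numpy as np

def robust_rvi_q_update(Q, s, a, rwd, s_next, radius, lr, ref=(0, 0)):
    """One hypothetical tabular update in the spirit of robust RVI Q-learning
    under the contamination uncertainty set {(1 - R) * p0 + R * q}.
    The worst case is estimated from the single observed next state as
    (1 - R) * V(s_next) + R * min_s V(s). `ref` is a fixed reference
    state-action pair whose value plays the role of the RVI offset."""
    V = Q.max(axis=1)
    worst_next = (1.0 - radius) * V[s_next] + radius * V.min()
    td_target = rwd + worst_next - Q[ref]      # subtract reference value (gain proxy)
    Q[s, a] += lr * (td_target - Q[s, a])
    return Q
```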
- Robust Markov Decision Processes (MDPs) offer a promising framework for computing reliable policies under model uncertainty. While policy gradient methods have gained increasing popularity in robust discounted MDPs, their application to the average-reward criterion remains largely unexplored. This paper proposes Robust Projected Policy Gradient (RP2G), the first generic policy gradient method for robust average-reward MDPs (RAMDPs) that is applicable beyond the typical rectangularity assumption on transition ambiguity. In contrast to existing robust policy gradient algorithms, RP2G incorporates an adaptive decreasing tolerance mechanism for efficient policy updates at each iteration. We also present a comprehensive convergence analysis of RP2G for solving ergodic tabular RAMDPs. Furthermore, we provide the first study of the inner worst-case transition evaluation problem in RAMDPs, proposing two gradient-based algorithms tailored for rectangular and general ambiguity sets, each with provable convergence guarantees. Numerical experiments confirm the global convergence of our new algorithm and demonstrate its superior performance.
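For context on the abstract above, the "projected" part refers to the generic projected-gradient update on the policy simplex. Below is a minimal sketch of that generic update (not the paper's RP2G, and ignoring its adaptive tolerance mechanism); all names are hypothetical.

```python
import numpy as np

def project_to_simplex(x):
    """Euclidean projection of a vector onto the probability simplex
    (standard sorting-based routine)."""
    u = np.sort(x)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / (np.arange(len(x)) + 1) > 0)[0][-1]
    tau = (1.0 - css[rho]) / (rho + 1)
    return np.maximum(x + tau, 0.0)

def projected_policy_gradient_step(policy, grad, step_size):
    """One generic projected-gradient ascent step on a tabular policy
    (rows = states, columns = action probabilities)."""
    new_policy = policy + step_size * grad
    return np.apply_along_axis(project_to_simplex, 1, new_policy)
```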
- We study model-free reinforcement learning (RL) algorithms for infinite-horizon average-reward Markov decision processes (MDPs), a setting more appropriate for applications that involve continuing operations not divided into episodes. In contrast to episodic/discounted MDPs, theoretical understanding of model-free RL algorithms is relatively inadequate for the average-reward setting. In this paper, we consider both the online setting and the setting with access to a simulator. We develop computationally efficient model-free algorithms that achieve sharper guarantees on regret/sample complexity compared with existing results. In the online setting, we design an algorithm, UCB-AVG, based on an optimistic variant of variance-reduced Q-learning. We show that UCB-AVG achieves a regret bound $\widetilde{O}(S^5 A^2\, sp(h^*)\sqrt{T})$ after $T$ steps, where $S \times A$ is the size of the state-action space and $sp(h^*)$ is the span of the optimal bias function. Our result provides the first computationally efficient model-free algorithm that achieves the optimal dependence on $T$ (up to log factors) for weakly communicating MDPs, which is necessary for low regret. In contrast, prior results are either suboptimal in $T$ or require strong ergodicity or uniform-mixing assumptions. In the simulator setting, we adapt the idea of UCB-AVG to develop a model-free algorithm that finds an $\epsilon$-optimal policy with sample complexity $\widetilde{O}(SA\, sp^2(h^*)\epsilon^{-2} + S^2 A\, sp(h^*)\epsilon^{-1})$. This sample complexity is near-optimal for weakly communicating MDPs, in view of the minimax lower bound $\Omega(SA\, sp(h^*)\epsilon^{-2})$. Existing work mainly focuses on ergodic MDPs, and the results typically depend on $t_{mix}$, the worst-case mixing time induced by a policy. We remark that the diameter $D$ and mixing time $t_{mix}$ are both lower bounded by $sp(h^*)$, and $t_{mix}$ can be arbitrarily large for certain MDPs. On the technical side, our approach integrates two key ideas: learning a $\gamma$-discounted MDP as an approximation, and leveraging a reference-advantage decomposition to control the variance in optimistic Q-learning. As recognized in prior work, a naive approximation by discounted MDPs results in suboptimal guarantees. A distinguishing feature of our method is maintaining estimates of the value difference between state pairs to provide a sharper bound on the variance of the reference advantage. We also crucially use a careful choice of the discount factor $\gamma$ to balance the approximation error due to discounting against the statistical learning error, and we are able to maintain a good-quality reference value function with $O(SA)$ space complexity.
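The "discounted MDP as an approximation" idea in the abstract above rests on a standard relation for weakly communicating MDPs; one commonly cited form (stated here as background, not quoted from the paper) is

$$\bigl|(1-\gamma)\,V^{*}_{\gamma}(s) - \rho^{*}\bigr| \;\le\; (1-\gamma)\,sp(h^{*}) \qquad \text{for all } s,$$

so pushing $\gamma$ toward 1 shrinks the approximation error while inflating the effective horizon $1/(1-\gamma)$; balancing these two effects is the discount-factor trade-off the abstract refers to.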
- The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective balancing expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses that can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty rather than seeking to uniquely identify the demonstrator's reward function.
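To illustrate what a "soft-robust objective that balances expected performance and risk" can look like for the abstract above, here is a hedged sketch that blends the mean return with an empirical CVaR over sampled reward hypotheses. The weighting `lam` and tail level `alpha` are hypothetical knobs, and this is not claimed to be PG-BROIL's exact objective.

```python
import numpy as np

def soft_robust_objective(returns, lam=0.5, alpha=0.1):
    """Blend of expected performance and risk over reward hypotheses.

    returns: the policy's expected return under each sampled reward hypothesis.
    lam:     weight on the mean (1 - lam goes to the risk term).
    alpha:   tail fraction for the empirical CVaR (worst alpha of hypotheses)."""
    returns = np.asarray(returns, dtype=float)
    k = max(1, int(np.ceil(alpha * len(returns))))
    cvar = np.sort(returns)[:k].mean()          # average of the k worst outcomes
    return lam * returns.mean() + (1.0 - lam) * cvar
```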