We propose an empirical relative value learning (ERVL) algorithm for non-parametric MDPs with continuous state space, finite actions, and the average reward criterion. The ERVL algorithm relies on nearest-neighbor function approximation and minibatch samples for value function updates. It is universal (it works for any MDP), computationally simple, and yet provides arbitrarily good approximation with high probability in finite time. To our knowledge, this is the first algorithm for non-parametric (and continuous state space) MDPs with the average reward criterion that has these provable properties. Numerical evaluation on a benchmark optimal replacement problem suggests good performance.
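A minimal sketch of the two ingredients named above, nearest-neighbor function approximation and minibatch sampling, applied to relative value iteration on a toy one-dimensional MDP. The dynamics, reward, and all names below are our own illustration, not the paper's algorithm or its benchmark:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D MDP on [0, 1] with two actions (illustrative, not the paper's benchmark).
def sample_next_states(s, a, n):
    drift = 0.1 if a == 1 else -0.1
    return np.clip(s + drift + 0.05 * rng.standard_normal(n), 0.0, 1.0)

def reward(s, a):
    return -abs(s - 0.5)  # reward peaks at the middle of the state space

anchors = np.linspace(0.0, 1.0, 21)  # states at which V is stored
V = np.zeros_like(anchors)

def V_nn(states):
    """1-nearest-neighbor estimate of V at arbitrary query states."""
    idx = np.abs(np.asarray(states)[:, None] - anchors[None, :]).argmin(axis=1)
    return V[idx]

batch = 32
for _ in range(200):
    Q = np.empty((len(anchors), 2))
    for a in (0, 1):
        # Minibatch of simulated next states from each anchor state
        nxt = np.stack([sample_next_states(s, a, batch) for s in anchors])
        Q[:, a] = (np.array([reward(s, a) for s in anchors])
                   + V_nn(nxt.ravel()).reshape(-1, batch).mean(axis=1))
    T_V = Q.max(axis=1)
    V = T_V - T_V[0]  # relative values: pin the reference state's value to zero
```

Subtracting the updated value at a fixed reference state is what makes this a relative value update, keeping the iterates bounded under the average-reward criterion.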
An Approximately Optimal Relative Value Learning Algorithm for Averaged MDPs with Continuous States and Actions
It has long been a challenging problem to design algorithms for Markov decision processes (MDPs) with continuous states and actions that are provably approximately optimal and can provide arbitrarily good approximation for any MDP. In this paper, we propose an empirical value learning algorithm for average-reward MDPs with continuous states and actions that combines empirical value iteration with nonparametric function approximation and kernel density estimation of the transition probability distribution. We view each iteration as the application of a random operator and argue convergence using the probabilistic contraction analysis method that the authors (along with others) have recently developed.
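As a hedged illustration of the kernel-density-estimation ingredient: given a batch of sampled next states, one can estimate the transition density and use it to evaluate the expected next-state value. The toy value function and all names below are our own, not the paper's construction:

```python
import numpy as np

rng = np.random.default_rng(1)

def kde_density(samples, grid, bandwidth=0.05):
    """Gaussian-kernel density estimate of the sampled next-state distribution."""
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

# Sampled next states after some (state, action) pair; true mean 0.5, std 0.05.
next_states = 0.5 + 0.05 * rng.standard_normal(500)

grid = np.linspace(0.0, 1.0, 201)
dx = grid[1] - grid[0]
V = 1.0 - np.abs(grid - 0.5)  # a stand-in value function on the grid

p_hat = kde_density(next_states, grid)
expected_V = float((p_hat * V).sum() * dx)  # approximates E[V(s')] under the KDE
```

The estimated density stands in for the unknown transition kernel, so the Bellman expectation becomes a numerical integral against `p_hat` instead of an intractable exact expectation.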
- Award ID(s):
- 1817212
- PAR ID:
- 10128111
- Date Published:
- Journal Name:
- 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton)
- Page Range / eLocation ID:
- 734 to 740
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We propose a new simple and natural algorithm for learning the optimal Q-value function of a discounted-cost Markov decision process (MDP) when the transition kernels are unknown. Unlike the classical learning algorithms for MDPs, such as Q-learning and actor-critic algorithms, this algorithm does not depend on a stochastic approximation-based method. We show that our algorithm, which we call the empirical Q-value iteration algorithm, converges to the optimal Q-value function. We also give a rate of convergence or a nonasymptotic sample complexity bound and show that an asynchronous (or online) version of the algorithm will also work. Preliminary experimental results suggest a faster rate of convergence to a ballpark estimate for our algorithm compared with stochastic approximation-based algorithms.
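The empirical Q-value iteration idea described above can be sketched as follows: replace the exact Bellman expectation with an average over a fresh batch of simulated next states at each iteration. The toy MDP below is our own example, not one from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 3-state, 2-action discounted MDP (illustrative, not from the paper).
n_s, n_a, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))  # true kernel, hidden from the learner
R = rng.uniform(0.0, 1.0, size=(n_s, n_a))

# Empirical Q-value iteration: expectations replaced by minibatch averages.
Q = np.zeros((n_s, n_a))
for _ in range(300):
    Q_new = np.empty_like(Q)
    for s in range(n_s):
        for a in range(n_a):
            nxt = rng.choice(n_s, size=256, p=P[s, a])  # simulated next states
            Q_new[s, a] = R[s, a] + gamma * Q[nxt].max(axis=1).mean()
    Q = Q_new

# Exact Q-value iteration with the true kernel, for comparison.
Q_exact = np.zeros((n_s, n_a))
for _ in range(300):
    Q_exact = R + gamma * P @ Q_exact.max(axis=1)
```

Because each iteration draws a fresh batch rather than mixing new samples into a running average, no stochastic-approximation step sizes are needed; the batch size controls the per-iteration noise.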
-
In this paper, we propose an approximate relative value learning (ARVL) algorithm for non-parametric MDPs with continuous state space and finite actions and average reward criterion. It is a sampling based algorithm combined with kernel density estimation and function approximation via nearest neighbors. The theoretical analysis is done via a random contraction operator framework and stochastic dominance argument. This is the first such algorithm for continuous state space MDPs with average reward criteria with these provable properties which does not require any discretization of state space as far as we know. We then evaluate the proposed algorithm on a benchmark problem numerically.
-
Robust Markov decision processes (MDPs) compute reliable solutions for dynamic decision problems with partially-known transition probabilities. Unfortunately, accounting for uncertainty in the transition probabilities significantly increases the computational complexity of solving robust MDPs, which limits their scalability. This paper describes new, efficient algorithms for solving the common class of robust MDPs with s- and sa-rectangular ambiguity sets defined by weighted L1 norms. We propose partial policy iteration, a new, efficient, flexible, and general policy iteration scheme for robust MDPs. We also propose fast methods for computing the robust Bellman operator in quasi-linear time, nearly matching the ordinary Bellman operator's linear complexity. Our experimental results indicate that the proposed methods are many orders of magnitude faster than the state-of-the-art approach, which uses linear programming solvers combined with a robust value iteration.
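The inner problem behind the robust Bellman operator with an (unweighted) L1 ambiguity set admits a simple greedy solution, which is what makes fast robust updates possible. Below is a hedged sketch of that standard greedy method in our own minimal form, not the paper's partial policy iteration or its weighted-norm algorithms:

```python
import numpy as np

def worst_case_value(p_bar, v, kappa):
    """min of p @ v over {p in simplex : ||p - p_bar||_1 <= kappa}.

    Greedy: move up to kappa/2 probability mass onto the lowest-value
    state, taking it from the highest-value states first.
    """
    p = np.asarray(p_bar, dtype=float).copy()
    v = np.asarray(v, dtype=float)
    i_min = int(np.argmin(v))
    budget = min(kappa / 2.0, 1.0 - p[i_min])  # mass we are allowed to move
    p[i_min] += budget
    for i in np.argsort(v)[::-1]:              # states in decreasing value order
        if budget <= 0 or i == i_min:
            continue
        take = min(budget, p[i])
        p[i] -= take
        budget -= take
    return float(p @ v)
```

For example, with nominal distribution [0.25, 0.25, 0.25, 0.25], values [1, 2, 3, 4], and budget kappa = 0.2, the greedy shift moves 0.1 of mass from the best state to the worst, lowering the expected value from 2.5 to 2.2. After sorting, the whole operation is linear in the number of states, which is the kind of per-state cost that quasi-linear robust Bellman updates rely on.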
-
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs) with general function approximation. Existing methods such as sequential importance sampling estimators suffer from the curse of horizon in POMDPs. To circumvent this problem, we develop a novel model-free OPE method by introducing future-dependent value functions that take future proxies as inputs and perform a similar role to that of classical value functions in fully-observable MDPs. We derive a new off-policy Bellman equation for future-dependent value functions as conditional moment equations that use history proxies as instrumental variables. We further propose a minimax learning method to learn future-dependent value functions using the new Bellman equation. We obtain the PAC result, which implies our OPE estimator is close to the true policy value under Bellman completeness, as long as futures and histories contain sufficient information about latent states.