Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and the planning phases, and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has received relatively little investigation. In this paper, we aim to address this fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL…
Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal
This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model-based planning: is this naïve “plug-in” approach — where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP — non-asymptotically, minimax optimal? Our main result answers this question positively. With regard to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using \emph{any} high-accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.
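The plug-in pipeline the abstract describes can be made concrete: query the generative model a fixed number of times per state-action pair, form the empirical-frequency (maximum likelihood) transition estimate, and hand the empirical MDP to any planner. The sketch below is illustrative only; the function names, the per-pair sample budget, and the use of plain value iteration as the black-box planner are assumptions, not the paper's pseudocode.

```python
import numpy as np

def plug_in_planner(sample_next_state, reward, S, A, gamma,
                    n_samples=100, tol=1e-8):
    """Sketch of the 'plug-in' approach: estimate the transition model
    by maximum likelihood from a generative model, then plan in the
    resulting empirical MDP.

    sample_next_state(s, a) plays the role of the generative-model
    oracle; reward is an (S, A) array of known rewards."""
    # Maximum-likelihood (empirical-frequency) transition estimate.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a)] += 1
    P_hat /= n_samples

    # Any high-accuracy black-box planner suffices; value iteration
    # is used here purely for concreteness.
    V = np.zeros(S)
    while True:
        Q = reward + gamma * P_hat @ V          # (S, A) Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return Q.argmax(axis=1), V                  # greedy policy, values
```

For example, on a small deterministic chain where only the last state is rewarding, the returned greedy policy moves right and then stays put.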
Award ID(s): 1703574
Publication Date:
NSF-PAR ID: 10177060
Journal Name: Proceedings of Machine Learning Research
Volume: 125
Page Range or eLocation-ID: 67-83
ISSN: 2640-3498
Sponsoring Org: National Science Foundation
More Like this


This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\pi \in \Pi} \|Q^{\pi} - \hat{Q}^{\pi}\| < \epsilon$, is a stronger guarantee than pointwise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S / d_m \epsilon^2)$ lower bound (over the model-based family) for global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(H^2 / d_m \epsilon^2)$ for the \emph{local} uniform convergence that applies to all \emph{near-empirically-optimal} policies for the…
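The model-based OPE procedure this abstract studies evaluates a policy by backward induction in an empirical episodic model; the uniform guarantee then asks that this estimate be accurate simultaneously over a whole policy class. The sketch below is a minimal illustration under assumed shapes (a step-indexed empirical model, deterministic policies), not the paper's algorithm.

```python
import numpy as np

def empirical_Q(P_hat, r, pi, H):
    """Model-based OPE in an episodic (finite-horizon) MDP: evaluate a
    deterministic policy pi by backward induction in the empirical
    model. Shapes assumed for illustration: P_hat is (H, S, A, S),
    r is (H, S, A), pi is (H, S) giving pi's action at each step/state."""
    _, S, A, _ = P_hat.shape
    Q = np.zeros((H, S, A))
    V_next = np.zeros(S)                         # value at step h + 1
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P_hat[h] @ V_next          # (S, A) backup
        V_next = Q[h][np.arange(S), pi[h]]       # follow pi at step h
    return Q
```

Global uniform OPE demands that the error of this estimate be small for \emph{every} policy in the class at once, while the local guarantee in the abstract only requires it over near-empirically-optimal policies, which is what removes the extra factor of $S$.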

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned…
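Step (a) of the framework, building a pessimistic MDP from the offline dataset, can be sketched as follows. This is an illustrative tabular construction under assumptions of my own (a count threshold to decide which state-action pairs are "known", and a minimal-reward absorbing HALT state for the rest); it is not the paper's exact algorithm.

```python
import numpy as np

def pessimistic_mdp(P_hat, r, counts, threshold, r_min):
    """Illustrative P-MDP construction: state-action pairs with too few
    offline samples are treated as unknown and rerouted to an absorbing
    HALT state with the lowest reward, so any planner run on the
    returned model is pessimistic about poorly covered regions."""
    S, A, _ = P_hat.shape
    halt = S                                   # index of the HALT state
    P_p = np.zeros((S + 1, A, S + 1))
    r_p = np.full((S + 1, A), r_min)           # HALT yields minimal reward
    P_p[halt, :, halt] = 1.0                   # HALT is absorbing
    known = counts >= threshold                # (S, A) coverage mask
    for s in range(S):
        for a in range(A):
            if known[s, a]:
                P_p[s, a, :S] = P_hat[s, a]    # keep the empirical model
                r_p[s, a] = r[s, a]
            else:
                P_p[s, a, halt] = 1.0          # unknown pair -> HALT
    return P_p, r_p
```

Step (b) then plans in the returned model with any MDP solver; because unvisited regions lead to HALT, the resulting policy avoids them.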

In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test for an agent's ability to effectively use historical data, which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP.…

Learning to plan over long horizons is a central challenge in episodic reinforcement learning problems. A fundamental question is to understand how the difficulty of the problem scales as the horizon increases. Here the natural measure of sample complexity is a normalized one: we are interested in the \emph{number of episodes} it takes to provably discover a policy whose value is $\epsilon$-close to the optimal value, where the value is measured by the \emph{normalized} cumulative reward in each episode. In a COLT 2018 open problem, Jiang and Agarwal conjectured that, for tabular, episodic reinforcement learning problems, there…