In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to modelfree RL approaches. In this work, we present MOReL, an algorithmic framework for modelbased offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (PMDP) using the offline dataset; (b) learning a nearoptimal policy in this PMDP. The learnedmore »
MOReL: ModelBased Offline Reinforcement Learning
In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test for an agent's ability to effectively use historical data which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to modelfree RL approaches. In this work, we present MOReL, an algorithmic framework for modelbased offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a nearoptimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lowerbounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for purposes of policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds stateoftheart results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (for e.g., in model learning, planning etc.) to improvements in offline RL.
 Award ID(s):
 1901168
 Publication Date:
 NSFPAR ID:
 10309945
 Journal Name:
 Advances in neural information processing systems
 ISSN:
 10495258
 Sponsoring Org:
 National Science Foundation
More Like this


This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with modelbased methods (for episodic MDP) and provides a unified framework towards optimal learning for several wellmotivated offline tasks. Uniform OPE supΠQπ−Q̂ π<ϵ is a stronger measure than the pointwise OPE and ensures offline learning when Π contains all policies (the global class). In this paper, we establish an Ω(H2S/dmϵ2) lower bound (over modelbased family) for the global uniform OPE and our main result establishes an upper bound of Õ (H2/dmϵ2) for the \emph{local} uniform convergence that applies to all \emph{nearempirically optimal} policies for themore »

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a rewardmaximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the informationtheoretical lower bounds based on different datacoverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matchesmore »

Offline or batch reinforcement learning seeks to learn a nearoptimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of modelbased algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their modelfree counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Qlearning inmore »

Banerjee, Arindam and (Ed.)While reinforcement learning has witnessed tremendous success recently in a wide range of domains, robustness–or the lack thereof–remains an important issue that remains inadequately addressed. In this paper, we provide a distributionally robust formulation of offline learning policy in tabular RL that aims to learn a policy from historical data (collected by some other behavior policy) that is robust to the future environment arising as a perturbation of the training environment. We first develop a novel policy evaluation scheme that accurately estimates the robust value (i.e. how robust it is in a perturbed environment) of any given policy and establishmore »