MOReL: Model-Based Offline Reinforcement Learning
In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., in model learning, planning etc.) to directly translate into improvements for offline RL.
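To make step (a) concrete, below is a minimal sketch (not the paper's exact implementation) of a P-MDP built from an ensemble of learned dynamics models: state-action pairs where the ensemble members disagree beyond a threshold are treated as unknown and routed to an absorbing HALT state carrying a large negative reward. The ensemble interface, `disagreement_threshold`, and `halt_penalty` are illustrative placeholders.

```python
import numpy as np

class PessimisticMDP:
    """Minimal P-MDP sketch built around an ensemble of learned dynamics models
    (illustrative only; all hyperparameters are placeholders)."""

    HALT = None  # sentinel for the absorbing HALT state

    def __init__(self, model_ensemble, reward_fn, disagreement_threshold, halt_penalty):
        self.models = model_ensemble            # list of callables (s, a) -> predicted s'
        self.reward_fn = reward_fn              # known or learned reward (s, a) -> r
        self.threshold = disagreement_threshold # unknown state-action detector threshold
        self.halt_penalty = halt_penalty        # large negative reward for unknown regions

    def step(self, state, action):
        if state is self.HALT:
            # HALT is absorbing: stay there and keep receiving the penalty.
            return self.HALT, self.halt_penalty, True

        # Predict the next state with each ensemble member.
        preds = np.stack([m(state, action) for m in self.models])
        # Maximum pairwise disagreement acts as the unknown state-action detector.
        disagreement = np.max(np.linalg.norm(preds[:, None] - preds[None, :], axis=-1))

        if disagreement > self.threshold:
            # Outside the data support: transition to HALT with a penalty, so any
            # policy that visits unknown regions is pessimistically penalized.
            return self.HALT, self.halt_penalty, True

        # Inside the support: use (for example) the ensemble mean as the dynamics.
        next_state = preds.mean(axis=0)
        return next_state, self.reward_fn(state, action), False
```

Any planner or policy optimizer can then be run against `step` as if it were a simulator; policies that stray outside the support of the offline data are penalized by construction, which is what yields the lower-bound property described above.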
Award ID(s): 1740822
Publication Date:
NSF-PAR ID: 10196971
Journal Name: NeurIPS
Sponsoring Org: National Science Foundation
More Like this


This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDPs) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE, $\sup_{\pi \in \Pi} \|Q^{\pi} - \widehat{Q}^{\pi}\| < \epsilon$, is a stronger measure than pointwise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S / d_m \epsilon^2)$ lower bound (over the model-based family) for the global uniform OPE, and our main result establishes an upper bound of $\tilde{O}(H^2 / d_m \epsilon^2)$ for the \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for the …
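For intuition on the model-based estimator $\widehat{Q}^{\pi}$ referenced above, the following sketch (an assumed, simplified instance, not the paper's construction) builds the empirical episodic MDP from offline transitions and evaluates a policy by backward induction; uniform OPE then requires the resulting error to be small simultaneously over the whole policy class $\Pi$.

```python
import numpy as np

def plugin_ope(dataset, policy, S, A, H):
    """Plug-in (model-based) off-policy evaluation for a tabular episodic MDP.

    dataset: list of (h, s, a, r, s_next) transitions collected offline.
    policy:  array of shape (H, S, A) with pi_h(a | s).
    Returns Q_hat of shape (H, S, A). Illustrative sketch only.
    """
    counts = np.zeros((H, S, A, S))
    rew_sum = np.zeros((H, S, A))

    for h, s, a, r, s_next in dataset:
        counts[h, s, a, s_next] += 1
        rew_sum[h, s, a] += r

    n = counts.sum(axis=-1)                                   # visit counts N_h(s, a)
    P_hat = np.divide(counts, np.maximum(n, 1)[..., None])    # empirical transitions
    r_hat = np.divide(rew_sum, np.maximum(n, 1))              # empirical rewards

    # Backward induction in the empirical MDP.
    Q_hat = np.zeros((H, S, A))
    V_next = np.zeros(S)
    for h in reversed(range(H)):
        Q_hat[h] = r_hat[h] + P_hat[h] @ V_next
        V_next = (policy[h] * Q_hat[h]).sum(axis=-1)
    return Q_hat
```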

We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data coming from a policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds based on different data-coverage assumptions, and their upper bounds are expressed by covering coefficients which lack an explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches …
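As a rough illustration of pessimistic value iteration in the tabular finite-horizon setting, here is a generic lower-confidence-bound variant, not the exact APVI algorithm; the $c/\sqrt{n}$ penalty is a simplified stand-in for its data-dependent term.

```python
import numpy as np

def pessimistic_value_iteration(P_hat, r_hat, n, H, c=1.0):
    """Generic pessimistic value iteration on an estimated finite-horizon MDP.

    P_hat: (H, S, A, S) empirical transitions, r_hat: (H, S, A) empirical rewards,
    n: (H, S, A) visit counts. The 1/sqrt(n) bonus is a simplified placeholder
    for the data-dependent penalty used in APVI-style analyses.
    """
    _, S, A, _ = P_hat.shape
    V_next = np.zeros(S)
    policy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        bonus = c * (H - h) / np.sqrt(np.maximum(n[h], 1))   # pessimism penalty
        Q = r_hat[h] + P_hat[h] @ V_next - bonus             # lower-confidence Q
        Q = np.clip(Q, 0.0, H - h)                           # keep values in valid range
        policy[h] = Q.argmax(axis=-1)
        V_next = Q.max(axis=-1)
    return policy
```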

Offline or batch reinforcement learning seeks to learn a near-optimal policy using historical data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has recently been introduced to mitigate the high bias of estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts, which do not require explicit model estimation, have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in …
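A minimal sketch of what a pessimistic, model-free update can look like in the tabular finite-horizon setting is given below; the step size and penalty are simplified placeholders rather than the specific algorithm analyzed in the work above.

```python
import numpy as np

def pessimistic_q_learning(dataset, S, A, H, c=1.0):
    """Model-free pessimistic Q-learning sketch (finite-horizon, tabular).

    Processes offline transitions once, subtracting a lower-confidence-bound
    penalty instead of adding an optimistic bonus. Illustrative only.
    """
    Q = np.zeros((H + 1, S, A))          # Q[H] stays zero (terminal step)
    V = np.zeros((H + 1, S))
    counts = np.zeros((H, S, A))

    for h, s, a, r, s_next in dataset:   # single pass over the offline data
        counts[h, s, a] += 1
        n = counts[h, s, a]
        alpha = (H + 1) / (H + n)        # standard Q-learning step size
        bonus = c * np.sqrt(H**2 / n)    # subtracted: pessimism, not optimism
        target = r + V[h + 1, s_next] - bonus
        Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * target
        V[h, s] = np.clip(Q[h, s].max(), 0.0, H)

    return Q[:H].argmax(axis=-1)         # greedy policy w.r.t. pessimistic Q
```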

Reinforcement learning (RL) in low-data and risk-sensitive domains requires performant and flexible deployment policies that can readily incorporate constraints during deployment. One such class of policies are the semi-parametric H-step lookahead policies, which select actions using trajectory optimization over a dynamics model for a fixed horizon with a terminal value function. In this work, we investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function learned by a model-free off-policy algorithm, named Learning Off-Policy with Online Planning (LOOP). We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value …
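The sketch below illustrates the general H-step lookahead idea with a learned dynamics model and terminal value function, using simple random-shooting trajectory optimization as a stand-in for the optimizer used in LOOP; `dynamics`, `reward_fn`, and `terminal_value` are assumed learned components.

```python
import numpy as np

def h_step_lookahead_action(state, dynamics, reward_fn, terminal_value,
                            action_dim, horizon=5, n_candidates=256, rng=None):
    """H-step lookahead policy sketch: random-shooting trajectory optimization
    over a learned dynamics model with a terminal value function.

    dynamics(s, a) -> s', reward_fn(s, a) -> r, terminal_value(s) -> V(s) are
    assumed learned components; the optimizer here is a simple placeholder.
    """
    if rng is None:
        rng = np.random.default_rng()
    # Sample candidate action sequences of length `horizon`.
    candidates = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))

    returns = np.zeros(n_candidates)
    for i, actions in enumerate(candidates):
        s, total = state, 0.0
        for a in actions:                 # roll out the learned model
            total += reward_fn(s, a)
            s = dynamics(s, a)
        total += terminal_value(s)        # bootstrap with the terminal value
        returns[i] = total

    best = candidates[returns.argmax()]
    return best[0]                        # execute only the first action (MPC-style)
```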