skip to main content


Title: MOReL: Model-Based Offline Reinforcement Learning
In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned P-MDP has the property that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the P-MDP. This enables it to serve as a good surrogate for purposes of policy evaluation and learning, and overcome common pitfalls of model-based RL like model exploitation. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Through experiments, we show that MOReL matches or exceeds state-of-the-art results in widely studied offline RL benchmarks. Moreover, the modular design of MOReL enables future advances in its components (e.g., in model learning, planning etc.) to directly translate into improvements for offline RL.  more » « less
Award ID(s):
1740822
NSF-PAR ID:
10196971
Author(s) / Creator(s):
Date Published:
Journal Name:
Neurips
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test for an agent's ability to effectively use historical data which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for purposes of policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds state-of-the-art results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (for e.g., in model learning, planning etc.) to improvements in offline RL. 
    more » « less
  2. This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDP) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE supΠ|Qπ−Q̂ π|<ϵ is a stronger measure than the point-wise OPE and ensures offline learning when Π contains all policies (the global class). In this paper, we establish an Ω(H2S/dmϵ2) lower bound (over model-based family) for the global uniform OPE and our main result establishes an upper bound of Õ (H2/dmϵ2) for the \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for the MDPs with \emph{stationary} transition. Here dm is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate Õ (H2/dmϵ2) is our design of \emph{singleton absorbing MDP}, which is a new sharp analysis tool that works with the model-based approach. We generalize such a model-based framework to the new settings: offline task-agnostic and the offline reward-free with optimal complexity Õ (H2log(K)/dmϵ2) (K is the number of tasks) and Õ (H2S/dmϵ2) respectively. These results provide a unified solution for simultaneously solving different offline RL problems. 
    more » « less
  3. We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using the data coming from a policy $\mu$. In particular, we consider the sample complexity problems of offline RL for the finite horizon MDPs. Prior works derive the information-theoretical lower bounds based on different data-coverage assumptions and their upper bounds are expressed by the covering coefficients which lack the explicit characterization of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive the suboptimality upper bound that nearly matches $ O\left(\sum_{h=1}^H\sum_{s_h,a_h}d^{\pi^\star}_h(s_h,a_h)\sqrt{\frac{\mathrm{Var}_{P_{s_h,a_h}}{(V^\star_{h+1}+r_h)}}{d^\mu_h(s_h,a_h)}}\sqrt{\frac{1}{n}}\right). $ We also prove an information-theoretical lower bound to show this quantity is required under the weak assumption that $d^\mu_h(s_h,a_h)>0$ if $d^{\pi^\star}_h(s_h,a_h)>0$. Here $\pi^\star$ is a optimal policy, $\mu$ is the behavior policy and $d(s_h,a_h)$ is the marginal state-action probability. We call this adaptive bound the \emph{intrinsic offline reinforcement learning bound} since it directly implies all the existing optimal results: minimax rate under uniform data-coverage assumption, horizon-free setting, single policy concentrability, and the tight problem-dependent results. Later, we extend the result to the \emph{assumption-free} regime (where we make no assumption on $ \mu$) and obtain the assumption-free intrinsic bound. Due to its generic form, we believe the intrinsic bound could help illuminate what makes a specific problem hard and reveal the fundamental challenges in offline RL. 
    more » « less
  4. Reinforcement learning (RL) methods can be used to develop a controller for the heating, ventilation, and air conditioning (HVAC) systems that both saves energy and ensures high occupants’ thermal comfort levels. However, the existing works typically require on-policy data to train an RL agent, and the occupants’ personalized thermal preferences are not considered, which is limited in the real-world scenarios. This paper designs a high-performance model-based offline RL algorithm for personalized HVAC systems. The proposed algorithm can quickly adapt to different occupants’ thermal preferences with a few thermal feedbacks, guaranteeing the high occupants’ personalized thermal comfort levels efficiently. First, we use a meta-supervised learning algorithm to train an occupant's thermal preference model. Then, we train an ensemble neural network to predict the thermal states of the considered zone. In addition, the obtained ensemble networks can indicate the regions in the state and action spaces covered by the offline dataset. With the personalized thermal preference model updated via meta-testing, model-based RL is used to derive the optimal HVAC controller. Since the proposed algorithm only requires offline datasets and a few online thermal feedbacks for training, it contributes to a more practical deployment of the RL algorithm to HVAC systems. We use the ASHRAE database II to verify the effectiveness and advantage of the meta-learning algorithm for modeling different occupants’ thermal preferences. Numerical simulations on the EnergyPlus environment demonstrate that the proposed algorithm can guarantee personalized thermal preferences with a slight increase of power consumption of 1.91% compared with the model-based RL algorithm with on-policy data aggregation. 
    more » « less
  5. We develop a framework to learn bio-inspired foraging policies using human data. We conduct an experiment where humans are virtually immersed in an open field foraging environment and are trained to collect the highest amount of rewards. A Markov Decision Process (MDP) framework is introduced to model the human decision dynamics. Then, Imitation Learning (IL) based on maximum likelihood estimation is used to train Neural Networks (NN) that map human decisions to observed states. The results show that passive imitation substantially underperforms humans. We further refine the human-inspired policies via Reinforcement Learning (RL) using the on-policy Proximal Policy Optimization (PPO) algorithm which shows better stability than other algorithms and can steadily improve the policies pre-trained with IL. We show that the combination of IL and RL match human performance and that the artificial agents trained with our approach can quickly adapt to reward distribution shift. We finally show that good performance and robustness to reward distribution shift strongly depend on combining allocentric information with an egocentric representation of the environment.

     
    more » « less