We study the \emph{offline reinforcement learning} (offline RL) problem, where the goal is to learn a reward-maximizing policy in an unknown \emph{Markov Decision Process} (MDP) using data generated by a behavior policy $\mu$. In particular, we consider the sample complexity of offline RL for finite-horizon MDPs. Prior works derive information-theoretic lower bounds under different data-coverage assumptions, and their upper bounds are expressed via coverage coefficients that lack an explicit characterization in terms of system quantities. In this work, we analyze the \emph{Adaptive Pessimistic Value Iteration} (APVI) algorithm and derive a suboptimality upper bound that nearly matches […]
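The pessimism principle behind pessimistic value iteration — planning against value estimates lowered by a data-dependent penalty — can be illustrated with a toy tabular sketch. This is not the paper's APVI algorithm: the count-based bonus c*sqrt(1/n(s,a)) used here is a deliberate simplification (APVI's penalty is variance-aware), and all function and variable names are hypothetical.

```python
import numpy as np

def pessimistic_vi(P_hat, r_hat, counts, H, c=1.0):
    """Toy pessimistic value iteration on a finite-horizon tabular MDP.
    P_hat:  (H, S, A, S) empirical transition probabilities
    r_hat:  (H, S, A)    empirical mean rewards
    counts: (H, S, A)    visit counts from the offline dataset
    """
    _, S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        # illustrative count-based penalty; larger where data is scarce
        bonus = c * np.sqrt(1.0 / np.maximum(counts[h], 1))
        Q = r_hat[h] + P_hat[h] @ V[h + 1] - bonus  # pessimistic Q-estimate
        Q = np.clip(Q, 0.0, H)                      # values live in [0, H]
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return V, pi
```

On a tiny example, the penalty steers the greedy policy toward well-covered actions even when a poorly covered action has a slightly higher empirical reward.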
This content will become publicly available on January 1, 2023
Pessimistic Q-Learning for Offline Reinforcement Learning: Towards Optimal Sample Complexity
Offline or batch reinforcement learning seeks to learn a near-optimal policy using history data without active exploration of the environment. To counter the insufficient coverage and sample scarcity of many offline datasets, the principle of pessimism has been recently introduced to mitigate high bias of the estimated values. While pessimistic variants of model-based algorithms (e.g., value iteration with lower confidence bounds) have been theoretically investigated, their model-free counterparts — which do not require explicit model estimation — have not been adequately studied, especially in terms of sample efficiency. To address this inadequacy, we study a pessimistic variant of Q-learning in the context of finite-horizon Markov decision processes, and characterize its sample complexity under the single-policy concentrability assumption, which does not require full coverage of the state-action space. In addition, a variance-reduced pessimistic Q-learning algorithm is proposed to achieve near-optimal sample complexity. Altogether, this work highlights the efficiency of model-free algorithms in offline RL when used in conjunction with pessimism and variance reduction.
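The model-free flavor described above can be sketched as follows: process the offline transitions with ordinary Q-learning updates (no transition model is ever estimated), then lower the estimates by a count-based penalty before acting greedily. This is a simplified, hypothetical illustration — the paper's algorithm incorporates the lower-confidence-bound penalty inside each update with a carefully chosen schedule, which this sketch does not reproduce.

```python
import numpy as np

def pessimistic_q_learning(dataset, S, A, H, c=1.0):
    """Toy pessimistic Q-learning on offline data.
    dataset: iterable of (h, s, a, r, s_next) transitions from the
    behavior policy. Model-free: each sample updates Q directly.
    """
    Q = np.zeros((H + 1, S, A))
    N = np.zeros((H, S, A))  # visit counts for learning rates and penalties
    for h, s, a, r, s_next in dataset:
        N[h, s, a] += 1
        n = N[h, s, a]
        lr = (H + 1) / (H + n)  # rescaled linear learning rate
        target = r + Q[h + 1, s_next].max()
        Q[h, s, a] += lr * (target - Q[h, s, a])
    # illustrative count-based pessimism, applied after the updates
    bonus = c * np.sqrt(1.0 / np.maximum(N, 1))
    Q_pess = np.clip(Q[:H] - bonus, 0.0, H)
    return Q_pess, Q_pess.argmax(axis=2)
```

Unvisited state-action pairs receive the maximal penalty, so the greedy policy avoids them — the behavioral signature of pessimism under partial coverage.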
NSF-PAR ID: 10340928
Journal Name: Proceedings of the 39th International Conference on Machine Learning
Sponsoring Org: National Science Foundation
More Like this


In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. The ability to train RL policies offline would greatly expand where RL can be applied, its data efficiency, and its experimental velocity. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP (P-MDP) using the offline dataset; (b) learning a near-optimal policy in this P-MDP. The learned […]
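Step (a) above — constructing a pessimistic MDP from the offline dataset — can be sketched in tabular form: state-action pairs with too little data are redirected to an absorbing low-reward HALT state, so a planner never profits from leaving the data's support. This is a hedged simplification; MOReL detects "unknown" pairs via disagreement among an ensemble of learned dynamics models, whereas the threshold on raw visit counts below is purely illustrative, as are all names.

```python
import numpy as np

def build_pmdp(P_hat, r_hat, counts, threshold=5, halt_penalty=-10.0):
    """Toy pessimistic MDP: (s, a) pairs with fewer than `threshold`
    visits are redirected to an absorbing HALT state (index S) that
    yields a large negative reward."""
    S, A, _ = P_hat.shape
    P = np.zeros((S + 1, A, S + 1))
    P[:S, :, :S] = P_hat
    r = np.full((S + 1, A), halt_penalty)
    r[:S] = r_hat
    unknown = counts < threshold            # (S, A) mask of unknown pairs
    for s, a in zip(*np.nonzero(unknown)):
        P[s, a] = 0.0
        P[s, a, S] = 1.0                    # redirect to HALT
        r[s, a] = halt_penalty
    P[S, :, S] = 1.0                        # HALT is absorbing
    return P, r
```

Any standard planner run on (P, r) then implements step (b): it optimizes in the pessimistic model, where straying off-support is provably unattractive.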


Offline reinforcement learning, which seeks to utilize offline/historical data to optimize sequential decision-making strategies, has gained surging prominence in recent studies. Because appropriate function approximators can help mitigate the sample complexity burden in modern reinforcement learning problems, existing endeavors usually employ powerful function representation models (e.g., neural networks) to learn the optimal policies. However, a precise understanding of the statistical limits of function representations remains elusive, even when such a representation is linear. Toward this goal, we study the statistical limits of offline reinforcement learning with linear model representations. To derive the tight offline learning bound, […]

Model-based reinforcement learning (RL), which finds an optimal policy using an empirical model, has long been recognized as one of the cornerstones of RL. It is especially suitable for multi-agent RL (MARL), as it naturally decouples the learning and planning phases and avoids the non-stationarity problem that arises when all agents improve their policies simultaneously using samples. Though intuitive and widely used, the sample complexity of model-based MARL algorithms has received comparatively little investigation. In this paper, we aim to address this fundamental open question about the sample complexity of model-based MARL. We study arguably the most basic MARL […]
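The "empirical model" at the heart of the model-based approach is the plug-in estimate of the transition kernel: draw samples of next states for each state-action pair and use their frequencies as the model that learning and planning are then decoupled around. A minimal sketch under a generative-model assumption (the sampler, names, and parameters here are hypothetical, not taken from the paper):

```python
import numpy as np

def empirical_model(sample_next_state, S, A, n_samples, rng):
    """Plug-in transition estimate: draw `n_samples` next states per
    (s, a) pair from a generative model and record their frequencies."""
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_samples):
                P_hat[s, a, sample_next_state(s, a, rng)] += 1
    return P_hat / n_samples  # rows are empirical distributions
```

In the multi-agent setting the same estimate is shared across agents; planning (e.g., solving the resulting Markov game) then proceeds against the fixed P_hat, which is what sidesteps non-stationarity during learning.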