This content will become publicly available on March 25, 2025
Deep reinforcement learning (RL) has shown remarkable success in specific offline decisionmaking scenarios, yet its theoretical guarantees are still under development. Existing works on offline RL theory primarily emphasize a few trivial settings, such as linear MDP or general function approximation with strong assumptions and independent data, which lack guidance for practical use. The coupling of deep learning and Bellman residuals makes this problem challenging, in addition to the difficulty of data dependence. In this paper, we establish a nonasymptotic estimation error of pessimistic offline RL using general neural network approximation with Cmixing data regarding the structure of networks, the dimension of datasets, and the concentrability of data coverage, under mild assumptions. Our result shows that the estimation error consists of two parts: the first converges to zero at a desired rate on the sample size with partially controllable concentrability, and the second becomes negligible if the residual constraint is tight. This result demonstrates the explicit efficiency of deep adversarial offline RL frameworks. We utilize the empirical process tool for Cmixing sequences and the neural network approximation theory for the Holder class to achieve this. We also develop methods to bound the Bellman estimation error caused by function approximation with empirical Bellman constraint perturbations. Additionally, we present a result that lessens the curse of dimensionality using data with low intrinsic dimensionality and function classes with low complexity. Our estimation provides valuable insights into the development of deep offline RL and guidance for algorithm model design.
more » « less NSFPAR ID:
 10509696
 Publisher / Repository:
 AAAI Press, Washington, DC, USA
 Date Published:
 Journal Name:
 Proceedings of the AAAI Conference on Artificial Intelligence
 Volume:
 38
 Issue:
 14
 ISSN:
 21595399
 Page Range / eLocation ID:
 15868 to 15877
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Sampleefficiency guarantees for offline reinforcement learning (RL) often rely on strong assumptions on both the function classes (e.g., Bellmancompleteness) and the data coverage (e.g., allpolicy concentrability). Despite the recent efforts on relaxing these assumptions, existing works are only able to relax one of the two factors, leaving the strong assumption on the other factor intact. As an important open problem, can we achieve sampleefficient offline RL with weak assumptions on both factors? In this paper we answer the question in the positive. We analyze a simple algorithm based on the primaldual formulation of MDPs, where the dual variables (discounted occupancy) are modeled using a densityratio function against offline data. With proper regularization, the algorithm enjoys polynomial sample complexity, under only realizability and singlepolicy concentrability. We also provide alternative analyses based on different assumptions to shed light on the nature of primaldual algorithms for offline RL.more » « less

We propose and analyze a reinforcement learning principle that approximates the Bellman equations by enforcing their validity only along an userdefined space of test functions. Focusing on applications to modelfree offline RL with function approximation, we exploit this principle to derive confidence intervals for offpolicy evaluation, as well as to optimize over policies within a prescribed policy class. We prove an oracle inequality on our policy optimization procedure in terms of a tradeoff between the value and uncertainty of an arbitrary comparator policy. Different choices of test function spaces allow us to tackle different problems within a common framework. We characterize the loss of efficiency in moving from onpolicy to offpolicy data using our procedures, and establish connections to concentrability coefficients studied in past work. We examine in depth the implementation of our methods with linear function approximation, and provide theoretical guarantees with polynomialtime implementations even when Bellman closure does not hold.more » « less

In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rateoptimal oracle inequalities up to logarithmic factors. The algorithm, MODBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel onesided generalization test, MODBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.more » « less

We consider a challenging theoretical problem in offline reinforcement learning (RL): obtaining sampleefficiency guarantees with a dataset lacking sufficient coverage, under only realizabilitytype assumptions for the function approximators. While the existing theory has addressed learning under realizability and under nonexploratory data separately, no work has been able to address both simultaneously (except for a concurrent work which we compare in detail). Under an additional gap assumption, we provide guarantees to a simple pessimistic algorithm based on a version space formed by marginalized importance sampling (MIS), and the guarantee only requires the data to cover the optimal policy and the function classes to realize the optimal value and densityratio functions. While similar gap assumptions have been used in other areas of RL theory, our work is the first to identify the utility and the novel mechanism of gap assumptions in offline RL with weak function approximation.more » « less

We study the problem of estimating an unknown function from noisy data using shallow ReLU neural networks. The estimators we study minimize the sum of squared datafitting errors plus a regularization term proportional to the squared Euclidean norm of the network weights. This minimization corresponds to the common approach of training a neural network with weight decay. We quantify the performance (meansquared error) of these neural network estimators when the datagenerating function belongs to the secondorder Radondomain bounded variation space. This space of functions was recently proposed as the natural function space associated with shallow ReLU neural networks. We derive a minimax lower bound for the estimation problem for this function space and show that the neural network estimators are minimax optimal up to logarithmic factors. This minimax rate is immune to the curse of dimensionality. We quantify an explicit gap between neural networks and linear methods (which include kernel methods) by deriving a linear minimax lower bound for the estimation problem, showing that linear methods necessarily suffer the curse of dimensionality in this function space. As a result, this paper sheds light on the phenomenon that neural networks seem to break the curse of dimensionality.more » « less