skip to main content


Title: Offline RL Without Off-Policy Evaluation
Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The simple one-step baseline achieves this strong performance without many of the tricks used by previously proposed iterative algorithms and is more robust to hyperparameters. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those high-variance estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.  more » « less
Award ID(s):
1845360 1816753
NSF-PAR ID:
10299544
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Advances in neural information processing systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This onestep algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The one-step baseline achieves this strong performance while being notably simpler and more robust to hyperparameters than previously proposed iterative algorithms. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy. 
    more » « less
  2. Abstract

    In reinforcement learning, importance sampling is a widely used method for evaluating an expectation under the distribution of data of one policy when the data has in fact been generated by a different policy. Importance sampling requires computing the likelihood ratio between the action probabilities of a target policy and those of the data-producing behavior policy. In this article, we study importance sampling where the behavior policy action probabilities are replaced by their maximum likelihood estimate of these probabilities under the observed data. We show this general technique reduces variance due to sampling error in Monte Carlo style estimators. We introduce two novel estimators that use this technique to estimate expected values that arise in the RL literature. We find that these general estimators reduce the variance of Monte Carlo sampling methods, leading to faster learning for policy gradient algorithms and more accurate off-policy policy evaluation. We also provide theoretical analysis showing that our new estimators are consistent and have asymptotically lower variance than Monte Carlo estimators.

     
    more » « less
  3. null (Ed.)
    This paper presents a policy-driven sequential image augmentation approach for image-related tasks. Our approach applies a sequence of image transformations (e.g., translation, rotation) over a training image, one transformation at a time, with the augmented image from the previous time step treated as the input for the next transformation. This sequential data augmentation substantially improves sample diversity, leading to improved test performance, especially for data-hungry models (e.g., deep neural networks). However, the search for the optimal transformation of each image at each time step of the sequence has high complexity due to its combination nature. To address this challenge, we formulate the search task as a sequential decision process and introduce a deep policy network that learns to produce transformations based on image content. We also develop an iterative algorithm to jointly train a classifier and the policy network in the reinforcement learning setting. The immediate reward of a potential transformation is defined to encourage transformations producing hard samples for the current classifier. At each iteration, we employ the policy network to augment the training dataset, train a classifier with the augmented data, and train the policy net with the aid of the classifier. We apply the above approach to both public image classification benchmarks and a newly collected image dataset for material recognition. Comparisons to alternative augmentation approaches show that our policy-driven approach achieves comparable or improved classification performance while using significantly fewer augmented images. The code is available at https://github.com/Paul-LiPu/rl_autoaug. 
    more » « less
  4. A major challenge in real-world reinforcement learning (RL) is the sparsity of reward feedback. Often, what is available is an intuitive but sparse reward function that only indicates whether the task is completed partially or fully. However, the lack of carefully designed, fine grain feedback implies that most existing RL algorithms fail to learn an acceptable policy in a reasonable time frame. This is because of the large number of exploration actions that the policy has to perform before it gets any useful feedback that it can learn from. In this work, we address this challenging problem by developing an algorithm that exploits the offline demonstration data generated by a sub-optimal behavior policy for faster and efficient online RL in such sparse reward settings. The proposed algorithm, which we call the Learning Online with Guidance Offline (LOGO) algorithm, merges a policy improvement step with an additional policy guidance step by using the offline demonstration data. The key idea is that by obtaining guidance from - not imitating - the offline data, LOGO orients its policy in the manner of the sub-optimal policy, while yet being able to learn beyond and approach optimality. We provide a theoretical analysis of our algorithm, and provide a lower bound on the performance improvement in each learning episode. We also extend our algorithm to the even more challenging incomplete observation setting, where the demonstration data contains only a censored version of the true state observation. We demonstrate the superior performance of our algorithm over state-of-the-art approaches on a number of benchmark environments with sparse rewards and censored state. Further, we demonstrate the value of our approach via implementing LOGO on a mobile robot for trajectory tracking and obstacle avoidance, where it shows excellent performance. 
    more » « less
  5. Abstract Purpose

    The constrained one‐step spectral CT image reconstruction (cOSSCIR) algorithm with a nonconvex alternating direction method of multipliers optimizer is proposed for addressing computed tomography (CT) metal artifacts caused by beam hardening, noise, and photon starvation. The quantitative performance of cOSSCIR is investigated through a series of photon‐counting CT simulations.

    Methods

    cOSSCIR directly estimates basis material maps from photon‐counting data using a physics‐based forward model that accounts for beam hardening. The cOSSCIR optimization framework places constraints on the basis maps, which we hypothesize will stabilize the decomposition and reduce streaks caused by noise and photon starvation. Another advantage of cOSSCIR is that the spectral data need not be registered, so that a ray can be used even if some energy window measurements are unavailable. Photon‐counting CT acquisitions of a virtual pelvic phantom with low‐contrast soft tissue texture and bilateral hip prostheses were simulated. Bone and water basis maps were estimated using the cOSSCIR algorithm and combined to form a virtual monoenergetic image for the evaluation of metal artifacts. The cOSSCIR images were compared to a “two‐step” decomposition approach that first estimated basis sinograms using a maximum likelihood algorithm and then reconstructed basis maps using an iterative total variation constrained least‐squares optimization (MLE+TV). Images were also compared to a nonspectral TV reconstruction of the total number of counts detected for each ray with and without normalized metal artifact reduction (NMAR) applied. The simulated metal density was increased to investigate the effects of increasing photon starvation. The quantitative error and standard deviation in regions of the phantom were compared across the investigated algorithms. The ability of cOSSCIR to reproduce the soft‐tissue texture, while reducing metal artifacts, was quantitatively evaluated.

    Results

    Noiseless simulations demonstrated the convergence of the cOSSCIR and MLE+TV algorithms to the correct basis maps in the presence of beam‐hardening effects. When noise was simulated, cOSSCIR demonstrated a quantitative error of −1 HU, compared to 2 HU error for the MLE+TV algorithm and −154 HU error for the nonspectral TV+NMAR algorithm. For the cOSSCIR algorithm, the standard deviation in the central iodine region of interest was 20 HU, compared to 299 HU for the MLE+TV algorithm, 41 HU for the MLE+TV+Mask algorithm that excluded rays through metal, and 55 HU for the nonspectral TV+NMAR algorithm. Increasing levels of photon starvation did not impact the bias or standard deviation of the cOSSCIR images. cOSSCIR was able to reproduce the soft‐tissue texture when an appropriate regularization constraint value was selected.

    Conclusions

    By directly inverting photon‐counting CT data into basis maps using an accurate physics‐based forward model and a constrained optimization algorithm, cOSSCIR avoids metal artifacts due to beam hardening, noise, and photon starvation. The cOSSCIR algorithm demonstrated improved stability and accuracy compared to a two‐step method of decomposition followed by reconstruction.

     
    more » « less