skip to main content

Title: Policy Learning Using Weak Supervision
Most existing policy learning solutions require the learning agents to receive high-quality supervision signals, e.g., rewards in reinforcement learning (RL) or high-quality expert demonstrations in behavioral cloning (BC). These quality supervisions are either infeasible or prohibitively expensive to obtain in practice. We aim for a unified framework that leverages the available cheap weak supervisions to perform policy learning efficiently. To handle this problem, we treat the weak supervision'' as imperfect information coming from a peer agent, and evaluate the learning agent's policy based on a correlated agreement'' with the peer agent's policy (instead of simple agreements). Our approach explicitly punishes a policy for overfitting to the weak supervision. In addition to theoretical guarantees, extensive evaluations on tasks including RL with noisy reward, BC with weak demonstrations, and standard policy co-training (RL + BC) show that our method leads to substantial performance improvements, especially when the complexity or the noise of the learning environments is high.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Advances in neural information processing systems
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Reinforcement learning (RL) has demonstrated its superiority in solving sequential decision-making problems. However, heavy dependence on immediate reward feedback impedes the wide application of RL. On the other hand, imitation learning (IL) tackles RL without relying on environmental supervision by leveraging external demonstrations. In practice, however, collecting sufficient expert demonstrations can be prohibitively expensive, yet the quality of demonstrations typically limits the performance of the learning policy. To address a practical scenario, in this work, we propose Self-Adaptive Imitation Learning (SAIL), which, provided with a few demonstrations from a sub-optimal teacher, can perform well in RL tasks with extremely delayed rewards, where the only reward feedback is trajectory-wise ranking. SAIL bridges the advantages of IL and RL by interactively exploiting the demonstrations to catch up with the teacher and exploring the environment to yield demonstrations that surpass the teacher. Extensive empirical results show that not only does SAIL significantly improve the sample efficiency, but it also leads to higher asymptotic performance across different continuous control tasks, compared with the state-of-the-art. 
    more » « less
  2. In offline reinforcement learning (RL), the goal is to learn a highly rewarding policy based solely on a dataset of historical interactions with the environment. This serves as an extreme test for an agent's ability to effectively use historical data which is known to be critical for efficient RL. Prior work in offline RL has been confined almost exclusively to model-free RL approaches. In this work, we present MOReL, an algorithmic framework for model-based offline RL. This framework consists of two steps: (a) learning a pessimistic MDP using the offline dataset; (b) learning a near-optimal policy in this pessimistic MDP. The design of the pessimistic MDP is such that for any policy, the performance in the real environment is approximately lower-bounded by the performance in the pessimistic MDP. This enables the pessimistic MDP to serve as a good surrogate for purposes of policy evaluation and learning. Theoretically, we show that MOReL is minimax optimal (up to log factors) for offline RL. Empirically, MOReL matches or exceeds state-of-the-art results on widely used offline RL benchmarks. Overall, the modular design of MOReL enables translating advances in its components (for e.g., in model learning, planning etc.) to improvements in offline RL. 
    more » « less
  3. Applying reinforcement learning (RL) to sparse reward domains is notoriously challenging due to insufficient guiding signals. Common RL techniques for addressing such domains include (1) learning from demonstrations and (2) curriculum learning. While these two approaches have been studied in detail, they have rarely been considered together. This paper aims to do so by introducing a principled task-phasing approach that uses demonstrations to automatically generate a curriculum sequence. Using inverse RL from (suboptimal) demonstrations we define a simple initial task. Our task phasing approach then provides a framework to gradually increase the complexity of the task all the way to the target task, while retuning the RL agent in each phasing iteration. Two approaches for phasing are considered: (1) gradually increasing the proportion of time steps an RL agent is in control, and (2) phasing out a guiding informative reward function. We present conditions that guarantee the convergence of these approaches to an optimal policy. Experimental results on 3 sparse reward domains demonstrate that our task-phasing approaches outperform state-of-the-art approaches with respect to asymptotic performance.

    more » « less
  4. ABSTRACT Introduction

    Remote military operations require rapid response times for effective relief and critical care. Yet, the military theater is under austere conditions, so communication links are unreliable and subject to physical and virtual attacks and degradation at unpredictable times. Immediate medical care at these austere locations requires semi-autonomous teleoperated systems, which enable the completion of medical procedures even under interrupted networks while isolating the medics from the dangers of the battlefield. However, to achieve autonomy for complex surgical and critical care procedures, robots require extensive programming or massive libraries of surgical skill demonstrations to learn effective policies using machine learning algorithms. Although such datasets are achievable for simple tasks, providing a large number of demonstrations for surgical maneuvers is not practical. This article presents a method for learning from demonstration, combining knowledge from demonstrations to eliminate reward shaping in reinforcement learning (RL). In addition to reducing the data required for training, the self-supervised nature of RL, in conjunction with expert knowledge-driven rewards, produces more generalizable policies tolerant to dynamic environment changes. A multimodal representation for interaction enables learning complex contact-rich surgical maneuvers. The effectiveness of the approach is shown using the cricothyroidotomy task, as it is a standard procedure seen in critical care to open the airway. In addition, we also provide a method for segmenting the teleoperator’s demonstration into subtasks and classifying the subtasks using sequence modeling.

    Materials and Methods

    A database of demonstrations for the cricothyroidotomy task was collected, comprising six fundamental maneuvers referred to as surgemes. The dataset was collected by teleoperating a collaborative robotic platform—SuperBaxter, with modified surgical grippers. Then, two learning models are developed for processing the dataset—one for automatic segmentation of the task demonstrations into a sequence of surgemes and the second for classifying each segment into labeled surgemes. Finally, a multimodal off-policy RL with rewards learned from demonstrations was developed to learn the surgeme execution from these demonstrations.


    The task segmentation model has an accuracy of 98.2%. The surgeme classification model using the proposed interaction features achieved a classification accuracy of 96.25% averaged across all surgemes compared to 87.08% without these features and 85.4% using a support vector machine classifier. Finally, the robot execution achieved a task success rate of 93.5% compared to baselines of behavioral cloning (78.3%) and a twin-delayed deep deterministic policy gradient with shaped rewards (82.6%).


    Results indicate that the proposed interaction features for the segmentation and classification of surgical tasks improve classification accuracy. The proposed method for learning surgemes from demonstrations exceeds popular methods for skill learning. The effectiveness of the proposed approach demonstrates the potential for future remote telemedicine on battlefields.

    more » « less
  5. Learning from Demonstrations (LfD) via Behavior Cloning (BC) works well on multiple complex tasks. However, a limitation of the typical LfD approach is that it requires expert demonstrations for all scenarios, including those in which the algorithm is already well-trained. The recently proposed Learning from Interventions (LfI) overcomes this limitation by using an expert overseer. The expert overseer only intervenes when it suspects that an unsafe action is about to be taken. Although LfI significantly improves over LfD, the state-of-the-art LfI fails to account for delay caused by the expert's reaction time and only learns short-term behavior. We address these limitations by 1) interpolating the expert's interventions back in time, and 2) by splitting the policy into two hierarchical levels, one that generates sub-goals for the future and another that generates actions to reach those desired sub-goals. This sub-goal prediction forces the algorithm to learn long-term behavior while also being robust to the expert's reaction time. Our experiments show that LfI using sub-goals in a hierarchical policy framework trains faster and achieves better asymptotic performance than typical LfD. 
    more » « less