skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Learning reward functions from diverse sources of human feedback: Optimally integrating demonstrations and preferences
Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero-in on their true reward. This algorithm not only enables us combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human’s ability to provide data: yielding user-friendly preference queries which are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.  more » « less
Award ID(s):
1941722
PAR ID:
10312803
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
The International Journal of Robotics Research
ISSN:
0278-3649
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Humans can leverage physical interaction to teach robot arms. This physical interaction takes multiple forms depending on the task, the user, and what the robot has learned so far. State-of-the-art approaches focus on learning from a single modality, or combine some interaction types. Some methods do so by assuming that the robot has prior information about the features of the task and the reward structure. By contrast, in this article, we introduce an algorithmic formalism that unites learning from demonstrations, corrections, and preferences. Our approach makes no assumptions about the tasks the human wants to teach the robot; instead, we learn a reward model from scratch by comparing the human’s input to nearby alternatives, i.e., trajectories close to the human’s feedback. We first derive a loss function that trains an ensemble of reward models to match the human’s demonstrations, corrections, and preferences. The type and order of feedback is up to the human teacher: We enable the robot to collect this feedback passively or actively. We then apply constrained optimization to convert our learned reward into a desired robot trajectory. Through simulations and a user study, we demonstrate that our proposed approach more accurately learns manipulation tasks from physical human interaction than existing baselines, particularly when the robot is faced with new or unexpected objectives. Videos of our user study are available at https://youtu.be/FSUJsTYvEKU 
    more » « less
  2. We study the problem of cross-embodiment inverse reinforcement learning, where we wish to learn a reward function from video demonstrations in one or more embodiments and then transfer the learned reward to a different embodiment (e.g., different action space, dynamics, size, shape, etc.). Learning reward functions that transfer across embodiments is important in settings such as teaching a robot a policy via human video demonstrations or teaching a robot to imitate a policy from another robot with a different embodiment. However, prior work has only focused on cases where near-optimal demonstrations are available, which is often difficult to ensure. By contrast, we study the setting of cross-embodiment reward learning from mixed-quality demonstrations. We demonstrate that prior work struggles to learn generalizable reward representations when learning from mixed-quality data. We then analyze several techniques that leverage human feedback for representation learning and alignment to enable effective cross-embodiment learning. Our results give insight into how different representation learning techniques lead to qualitatively different reward shaping behaviors and the importance of human feedback when learning from mixed-quality, mixed-embodiment data. 
    more » « less
  3. Robots can learn to imitate humans by inferring what the human is optimizing for. One common framework for this is Bayesian reward learning, where the robot treats the human's demonstrations and corrections as observations of their underlying reward function. Unfortunately, this inference is doubly-intractable: the robot must reason over all the trajectories the person could have provided and all the rewards the person could have in mind. Prior work uses existing robotic tools to approximate this normalizer. In this letter, we group previous approaches into three fundamental classes and analyze the theoretical pros and cons of their approach. We then leverage recent research from the statistics community to introduce Double MH reward learning, a Monte Carlo method for asymptotically learning the human's reward in continuous spaces. We extend Double MH to conditionally independent settings (where each human correction is viewed as completely separate) and conditionally dependent environments (where the human's current correction may build on previous inputs). Across simulations and user studies, our proposed approach infers the human's reward parameters more accurately than the alternate approximations when learning from either demonstrations or corrections. 
    more » « less
  4. Reinforcement Learning from Human Feedback (RLHF) is a powerful paradigm for aligning foundation models to human values and preferences. However, current RLHF techniques cannot account for the naturally occurring differences in individual human preferences across a diverse population. When these differences arise, traditional RLHF frameworks simply average over them, leading to inaccurate rewards and poor performance for individual subgroups. To address the need for pluralistic alignment, we develop a class of multimodal RLHF methods. Our proposed techniques are based on a latent variable formulation - inferring a novel user-specific latent and learning reward models and policies conditioned on this latent without additional user-specific data. While conceptually simple, we show that in practice, this reward modeling requires careful algorithmic considerations around model architecture and reward scaling. To empirically validate our proposed technique, we first show that it can provide a way to combat under- specification in simulated control problems, inferring and optimizing user-specific reward functions. Next, we conduct experiments on pluralistic language datasets representing diverse user preferences and demonstrate improved reward function accuracy. We additionally show the benefits of this probabilistic framework in terms of measuring uncertainty, and actively learning user preferences. This work enables learning from diverse populations of users with divergent preferences, an important challenge that naturally occurs in problems from robot learning to foundation model alignment. 
    more » « less
  5. This paper addresses the problem of preference learning, which aims to align robot behaviors through learning user specific preferences (e.g. "good pull-over location") from visual demonstrations. Despite its similarity to learning factual concepts (e.g. "red door"), preference learning is a fundamentally harder problem due to its subjective nature and the paucity of person-specific training data. We address this problem using a novel framework called SYNAPSE, which is a neuro-symbolic approach designed to efficiently learn preferential concepts from limited data. SYNAPSE represents preferences as neuro-symbolic programs, facilitating inspection of individual parts for alignment, in a domain-specific language (DSL) that operates over images and leverages a novel combination of visual parsing, large language models, and program synthesis to learn programs representing individual preferences. We perform extensive evaluations on various preferential concepts as well as user case studies demonstrating its ability to align well with dissimilar user preferences. Our method significantly outperforms baselines, especially when it comes to out of distribution generalization. We show the importance of the design choices in the framework through multiple ablation studies. 
    more » « less