Title: Assisted Robust Reward Design
Real-world robotic tasks require complex reward functions. When we define the problem the robot needs to solve, we pretend that a designer specifies this complex reward exactly, and it is set in stone from then on. In practice, however, reward design is an iterative process: the designer chooses a reward, eventually encounters an "edge-case" environment where the reward incentivizes the wrong behavior, revises the reward, and repeats. What would it mean to rethink robotics problems to formally account for this iterative nature of reward design? We propose that the robot not take the specified reward for granted, but rather have uncertainty about it, and account for the future design iterations as future evidence. We contribute an Assisted Reward Design method that speeds up the design process by anticipating and influencing this future evidence: rather than letting the designer eventually encounter failure cases and revise the reward then, the method actively exposes the designer to such environments during the development phase. We test this method in a simplified autonomous driving task and find that it more quickly improves the car's behavior in held-out environments by proposing environments that are "edge cases" for the current reward.
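As a rough illustration of the idea above (not the authors' implementation), the sketch below assumes rewards that are linear in hand-designed features and scores candidate environments by how much a crude posterior over the true reward disagrees about the best behavior in them; every name and modeling choice here is a hypothetical stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: rewards are linear in features, r = w . phi, and each
# candidate environment is summarized by the feature counts of a handful of
# trajectories a planner could produce in it.
def feature_counts(env, n_trajectories=8):
    # Stand-in for planning: random feature counts, shifted by the
    # environment's own feature profile.
    return rng.normal(size=(n_trajectories, env.size)) + env

def best_trajectory(phi, w):
    # Index of the trajectory a planner would pick under reward weights w.
    return int(np.argmax(phi @ w))

def reward_posterior_samples(proxy_w, n=50, noise=0.3):
    # Crude stand-in for the robot's uncertainty about the true reward given
    # the designer's current proxy: samples centered on the proxy weights.
    return proxy_w + noise * rng.normal(size=(n, proxy_w.size))

def edge_case_score(env, proxy_w):
    # Score an environment by how much the reward posterior disagrees about
    # what the robot should do there; high disagreement ~ likely edge case.
    phi = feature_counts(env)
    picks = [best_trajectory(phi, w) for w in reward_posterior_samples(proxy_w)]
    _, counts = np.unique(picks, return_counts=True)
    return 1.0 - counts.max() / len(picks)

proxy_w = np.array([1.0, -0.5, 0.2, 0.0])             # designer's current reward
candidate_envs = [rng.normal(size=4) for _ in range(20)]
next_env = max(candidate_envs, key=lambda e: edge_case_score(e, proxy_w))
print("propose this environment to the designer:", np.round(next_env, 2))
```

In the method described in the abstract, environments proposed this way are what the designer is exposed to during development, and the reward revisions they trigger serve as the evidence that updates the robot's uncertainty about the true reward.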
Award ID(s):
1734633
PAR ID:
10480085
Author(s) / Creator(s):
;
Publisher / Repository:
Conference on Robot Learning
Date Published:
Journal Name:
Conference on Robot Learning
Format(s):
Medium: X
Location:
London
Sponsoring Org:
National Science Foundation
More Like this
  1. Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations about the underlying true reward, we derive an approach to infer a common reward across all environments (a toy version of this inference is sketched after this list). We conduct user studies in an abstract grid world domain and in a motion planning domain for a 7-DOF manipulator that measure user effort and solution quality. We show that our method is faster, easier to use, and produces a higher-quality solution than the typical method of designing a reward jointly across all environments. We additionally conduct a series of experiments that measure the sensitivity of these results to different properties of the reward design task, such as the number of environments, the number of feasible solutions per environment, and the fraction of the total features that vary within each environment. We find that independent reward design outperforms the standard joint reward design process, but works best when the design problem can be divided into simpler subproblems.
  2. Robot design is a complex cognitive activity that requires the designer to iteratively navigate multiple engineering disciplines and the relations between them. In this paper, we explore how people approach robot design and how trends in design strategy vary with the level of expertise of the designer. Using our interactive Build-a-Bot software tool, we recruited 39 participants from the 2022 IEEE International Conference on Robotics and Automation. These participants varied in age from 19 to 56 years, and had between 0 and 17 years of robotics experience. We tracked the participants' design decisions over the course of a 15-minute task of designing a ground robot to cross an uneven environment. Our results showed that participants engaged in iterative testing and modification of their designs, but unlike previous studies, there was no statistically significant effect of participants' expertise on the frequency of iterations. We additionally found that, across levels of expertise, participants were vulnerable to design fixation, in which they latched onto an initial design concept and insufficiently adjusted the design, even when confronted with difficulties developing the concept into a satisfactory solution. The results raise interesting questions for how future engineers can avoid fixation and how design tools can assist in both efficient assessment and optimization of design workflow for complex design tasks.
  3. We describe a physical interactive system for human-robot collaborative design (HRCD) consisting of a tangible user interface (TUI) and a robotic arm that simultaneously manipulates the TUI with the human designer. In an observational study of 12 participants exploring a complex design problem together with the robot, we find that human designers have to negotiate both the physical and the creative space with the machine. They also often ascribe social meaning to the robot's pragmatic behaviors. Based on these findings, we propose four considerations for future HRCD systems: managing the shared workspace, communicating preferences about design goals, respecting different design styles, and taking into account the social meaning of design acts. 
  4. Real-world choice options have many features or attributes, whereas the reward outcome from those options only depends on a few features or attributes. It has been shown that humans combine feature-based learning with more complex conjunction-based learning to tackle the challenges of learning in naturalistic reward environments. However, it remains unclear how different learning strategies interact to determine which features or conjunctions should be attended to and control choice behavior, and how subsequent attentional modulations influence future learning and choice. To address these questions, we examined the behavior of male and female human participants during a three-dimensional learning task in which reward outcomes for different stimuli could be predicted based on a combination of an informative feature and conjunction. Using multiple approaches, we found that both choice behavior and reward probabilities estimated by participants were most accurately described by attention-modulated models that learned the predictive values of both the informative feature and the informative conjunction. Specifically, in the reinforcement learning model that best fit the choice data, attention was controlled by the difference in the integrated feature and conjunction values. The resulting attention weights modulated learning by increasing the learning rate on attended features and conjunctions (a toy version of such an attention-modulated learner is sketched after this list). Critically, modulating decision-making by attention weights did not improve the fit to the data, providing little evidence for direct attentional effects on choice. These results suggest that in multidimensional environments, humans direct their attention not only to selectively process reward-predictive attributes but also to find parsimonious representations of the reward contingencies for more efficient learning.
  5. Reward engineering has long been a challenge in Reinforcement Learning (RL) research, as it often requires extensive human effort and iterative processes of trial and error to design effective reward functions. In this paper, we propose RL-VLM-F, a method that automatically generates reward functions for agents to learn new tasks, using only a text description of the task goal and the agent's visual observations, by leveraging feedback from vision language foundation models (VLMs). The key to our approach is to query these models to give preferences over pairs of the agent's image observations based on the text description of the task goal, and then learn a reward function from the preference labels, rather than directly prompting these models to output a raw reward score, which can be noisy and inconsistent (a rough sketch of this preference-based recipe appears below). We demonstrate that RL-VLM-F successfully produces effective rewards and policies across various domains - including classic control, as well as manipulation of rigid, articulated, and deformable objects - without the need for human supervision, outperforming prior methods that use large pretrained models for reward generation under the same assumptions.
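For the preference-based recipe in abstract 5, here is a minimal, self-contained sketch: a hidden linear scorer stands in for the VLM preference query, and a reward that is linear in (assumed) observation features is fit with a Bradley-Terry / logistic loss. The feature dimensionality, the stand-in oracle, and the goal string are illustrative assumptions, not the RL-VLM-F implementation.

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for the VLM preference query: given (features of) two image
# observations and the goal text, say which observation looks closer to the
# goal. A hidden linear scorer plays the VLM's role here; in RL-VLM-F the
# label would come from prompting a vision-language model.
true_w = rng.normal(size=8)
def vlm_prefers_second(x1, x2, goal="the drawer is open"):   # goal string is illustrative
    return int(x2 @ true_w > x1 @ true_w)

# Fit a reward r(x) = w . x to the preference labels with a Bradley-Terry /
# logistic loss, rather than asking the oracle for raw reward scores.
w, lr = np.zeros(8), 0.1
for step in range(2000):
    x1, x2 = rng.normal(size=8), rng.normal(size=8)   # two observation feature vectors
    label = vlm_prefers_second(x1, x2)
    p2 = 1.0 / (1.0 + np.exp(-(x2 @ w - x1 @ w)))     # P(second preferred | w)
    w -= lr * (p2 - label) * (x2 - x1)                # cross-entropy gradient step

cos = true_w @ w / (np.linalg.norm(true_w) * np.linalg.norm(w))
print("alignment between learned and hidden reward:", round(float(cos), 3))
```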
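For the independent reward design idea in abstract 1, the following toy sketch treats each per-environment proxy reward as soft evidence about a shared true reward, using a maximum-entropy-style likelihood over candidate weights; the likelihood and all numbers are illustrative choices rather than the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical setup: linear rewards over features. Each environment is a
# small set of candidate trajectories (rows of feature counts), and the
# designer has tuned a separate proxy reward for each environment.
n_envs, n_trajs, n_feat, beta = 3, 10, 4, 5.0
envs = [rng.normal(size=(n_trajs, n_feat)) for _ in range(n_envs)]
proxy_ws = [rng.normal(size=n_feat) for _ in range(n_envs)]

def induced_features(phi, w):
    # Feature counts of the trajectory that a given reward selects.
    return phi[np.argmax(phi @ w)]

# Treat each proxy as soft evidence about a shared true reward: a candidate
# true w is plausible if the behavior each proxy induces in its environment
# also scores well under w (a max-ent-style likelihood, chosen here for
# illustration).
candidates = rng.normal(size=(500, n_feat))
log_post = np.zeros(len(candidates))
for phi, pw in zip(envs, proxy_ws):
    f = induced_features(phi, pw)
    log_post += beta * (candidates @ f)
    log_post -= np.log(np.exp(beta * (candidates @ phi.T)).sum(axis=1))
post = np.exp(log_post - log_post.max())
post /= post.sum()
common_w = post @ candidates          # posterior-mean estimate of the shared reward
print("inferred common reward weights:", np.round(common_w, 2))
```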
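For the attention-modulated model in abstract 4, here is a toy learner in the same spirit: attention is split between a feature-based and a conjunction-based value system according to the difference in their current values, and that attention scales each system's learning rate. The task statistics and parameter values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy task: each stimulus has one informative feature (3 levels) and one
# informative conjunction (9 levels); reward depends mostly on the feature.
n_feat_levels, n_conj_levels = 3, 9
V_feat = np.full(n_feat_levels, 0.5)   # learned values of feature levels
V_conj = np.full(n_conj_levels, 0.5)   # learned values of conjunction levels
alpha, temp = 0.3, 5.0                 # base learning rate, attention sharpness

def attention(f_idx, c_idx):
    # Split attention between the feature and the conjunction according to
    # the difference in their current values (softmax over the two sources).
    logits = temp * np.array([V_feat[f_idx], V_conj[c_idx]])
    e = np.exp(logits - logits.max())
    return e / e.sum()                 # (weight on feature, weight on conjunction)

for trial in range(2000):
    f_idx = int(rng.integers(n_feat_levels))
    c_idx = int(rng.integers(n_conj_levels))
    w_f, w_c = attention(f_idx, c_idx)
    reward = float(rng.random() < (0.8 if f_idx == 0 else 0.2))
    # Attention scales the learning rate of each value system.
    V_feat[f_idx] += alpha * w_f * (reward - V_feat[f_idx])
    V_conj[c_idx] += alpha * w_c * (reward - V_conj[c_idx])

print("feature-level values:", np.round(V_feat, 2))
print("conjunction values:  ", np.round(V_conj, 2))
```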