NSF PAR Search | NSF Public Access Repository

Confronting Reward Model Overoptimization with Constrained RLHF

Moskovitz, T; Singh, A; Strouse, DJ; Sandholm, T; Salakhutdinov, R; Dragan, A; McAleer, S (May 2024, ICLR)

Large language models are typically aligned with human preferences by optimizing reward models (RMs) fitted to human feedback. However, human preferences are multi-faceted, and it is increasingly common to derive reward from a composition of simpler reward models which each capture a different aspect of language quality. This itself presents a challenge, as it is difficult to appropriately weight these component RMs when combining them. Compounding this difficulty, because any RM is only a proxy for human evaluation, this process is vulnerable to overoptimization, wherein past a certain point, accumulating higher reward is associated with worse human ratings. In this paper, we perform, to our knowledge, the first study on overoptimization in composite RMs, showing that correlation between component RMs has a significant effect on the locations of these points. We then introduce an approach to solve this issue using constrained reinforcement learning as a means of preventing the agent from exceeding each RM’s threshold of usefulness. Our method addresses the problem of weighting component RMs by learning dynamic weights, naturally expressed by Lagrange multipliers. As a result, each RM stays within the range at which it is an effective proxy, improving evaluation performance. Finally, we introduce an adaptive method using gradient-free optimization to identify and optimize towards these points during a single run.
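The idea of dynamic weights expressed by Lagrange multipliers can be illustrated with a minimal sketch. This is not the authors' algorithm: the function names, the step size, and the assumed constraint form (each component RM's average value should stay at or below its threshold) are illustrative assumptions.

```python
from typing import List


def update_multipliers(
    multipliers: List[float],
    rm_values: List[float],
    thresholds: List[float],
    step_size: float = 0.1,
) -> List[float]:
    """One projected dual-ascent step on the Lagrange multipliers.

    A multiplier grows while its reward model's average value exceeds the
    threshold and shrinks otherwise; it is clipped so it never goes negative.
    """
    return [
        max(0.0, lam + step_size * (value - limit))
        for lam, value, limit in zip(multipliers, rm_values, thresholds)
    ]


def combined_reward(rm_scores: List[float], multipliers: List[float]) -> float:
    """Weight each component score by (1 - multiplier).

    Assumed form: the Lagrangian of "maximize the summed reward subject to
    each component staying at or below its threshold", constants dropped.
    """
    return sum((1.0 - lam) * score for lam, score in zip(multipliers, rm_scores))


# Hypothetical numbers: the first RM is past its threshold, the second is not.
multipliers = update_multipliers([0.0, 0.5], [2.0, 0.0], [1.0, 1.0], step_size=0.5)
print(multipliers)                               # [0.5, 0.0]
print(combined_reward([2.0, 4.0], multipliers))  # 5.0
```

In this sketch, a component that has passed its threshold is down-weighted on the next policy update, while a component still below its threshold keeps its full weight.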
Full Text Available
Learning Optimal Advantage from Preferences and Mistaking it for Reward.

https://doi.org/10.1609/aaai.v38i9.28870

Knox, WB; Hatgis-Kessell, S; Adalgeirsson, SO; Booth, S; Dragan, A; Stone, P; Niekum, S (February 2024, Proceedings of the AAAI Conference on Artificial Intelligence)

Full Text Available
On Complementing End-To-End Human Behavior Predictors with Planning.

https://doi.org/10.15607/RSS.2021.XVII.037

Sun, L.; Jia, X.; Dragan, A. (January 2021, Robotics: Science and Systems)

High-capacity end-to-end approaches for human motion (behavior) prediction have the ability to represent subtle nuances in human behavior, but struggle with robustness to out-of-distribution inputs and tail events. Planning-based prediction, on the other hand, can reliably output decent-but-not-great predictions: it is much more stable in the face of distribution shift (as we verify in this work), but it has high inductive bias, missing important aspects that drive human decisions, and ignoring cognitive biases that make human behavior suboptimal. In this work, we analyze one family of approaches that strive to get the best of both worlds: use the end-to-end predictor on common cases, but do not rely on it for tail events / out-of-distribution inputs -- switch to the planning-based predictor there. We contribute an analysis of different approaches for detecting when to make this switch, using an autonomous driving domain. We find that promising approaches based on ensembling or generative modeling of the training distribution might not be reliable, but that there are very simple methods that can perform surprisingly well -- including training a classifier to pick up on tell-tale issues in predicted trajectories.
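The switching scheme described in this abstract can be sketched as a wrapper around two predictors. Everything named here is hypothetical: the toy predictors, the `max_plausible_step` value, and the hand-written step-length check, which stands in for the trained classifier the abstract mentions.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Trajectory = List[Point]


def max_step_length(trajectory: Trajectory) -> float:
    """Largest distance between consecutive points (0.0 if fewer than two)."""
    return max(
        (math.dist(a, b) for a, b in zip(trajectory, trajectory[1:])),
        default=0.0,
    )


def looks_unreliable(trajectory: Trajectory, max_plausible_step: float = 2.0) -> bool:
    """Stand-in detector: flag predictions with implausibly large jumps."""
    return max_step_length(trajectory) > max_plausible_step


def predict_with_fallback(
    observation: Point,
    end_to_end: Callable[[Point], Trajectory],
    planner: Callable[[Point], Trajectory],
    detector: Callable[[Trajectory], bool] = looks_unreliable,
) -> Trajectory:
    """Use the end-to-end prediction unless the detector flags it."""
    candidate = end_to_end(observation)
    return planner(observation) if detector(candidate) else candidate


def toy_end_to_end(observation: Point) -> Trajectory:
    # Hypothetical learned model: sensible near the origin, erratic far away.
    x, y = observation
    jump = 10.0 if abs(x) > 5.0 else 1.0
    return [(x, y), (x + jump, y), (x + 2 * jump, y)]


def toy_planner(observation: Point) -> Trajectory:
    # Hypothetical planner: always predicts steady unit steps.
    x, y = observation
    return [(x, y), (x + 1.0, y), (x + 2.0, y)]


print(predict_with_fallback((0.0, 0.0), toy_end_to_end, toy_planner))
# [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  (end-to-end output kept)
print(predict_with_fallback((8.0, 0.0), toy_end_to_end, toy_planner))
# [(8.0, 0.0), (9.0, 0.0), (10.0, 0.0)]  (planner used as fallback)
```

Passing the detector as a parameter keeps the switch rule replaceable, so an ensemble-based or learned detector could be dropped in without changing the wrapper.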
Full Text Available
Analyzing Human Models that Adapt Online.

https://doi.org/10.1109/ICRA48506.2021.9561652

Bajcsy, A; Siththaranjan, A; Tomlin, C; Dragan, A (January 2021, IEEE International Conference on Robotics and Automation)

Full Text Available
Analyzing Human Models that Adapt Online

Sripathy, A; Bobu, A; Brown, D; Dragan, A. (January 2021, IEEE International Conference on Robotics and Automation)

Full Text Available
B-Pref: Benchmarking Preference-Based Reinforcement Learning

Lee, K; Smith, L; Dragan, A.; Abbeel, P (January 2021, Neural Information Processing Systems (NeurIPS))
Full Text Available
A Robust Control Framework for Human Motion Prediction.

https://doi.org/10.1109/LRA.2020.3028049

Bajcsy, A; Bansal, S.; Ratner, E; Tomlin, C; Dragan, A (January 2021, IEEE Robotics and Automation Letters)

Full Text Available
Simplifying Reward Design through Divide-and-Conquer

Ratner, E.; Hadfield-Menell, D.; Dragan, A. (January 2018, Robotics: Science and Systems)

Designing a good reward function is essential to robot planning and reinforcement learning, but it can also be challenging and frustrating. The reward needs to work across multiple different environments, and that often requires many iterations of tuning. We introduce a novel divide-and-conquer approach that enables the designer to specify a reward separately for each environment. By treating these separate reward functions as observations about the underlying true reward, we derive an approach to infer a common reward across all environments. We conduct user studies in an abstract grid world domain and in a motion planning domain for a 7-DOF manipulator that measure user effort and solution quality. We show that our method is faster, easier to use, and produces a higher quality solution than the typical method of designing a reward jointly across all environments. We additionally conduct a series of experiments that measure the sensitivity of these results to different properties of the reward design task, such as the number of environments, the number of feasible solutions per environment, and the fraction of the total features that vary within each environment. We find that independent reward design outperforms the standard, joint, reward design process but works best when the design problem can be divided into simpler subproblems.
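One way to picture "separate rewards as observations about a common reward" is a small Bayesian update over candidate reward weights. This is a simplified stand-in, not the inference procedure from the paper: it assumes a uniform prior, a Boltzmann-style likelihood, two hypothetical features, and that each environment's observation is the feature vector of the trajectory its per-environment reward selects.

```python
import math
from typing import List, Sequence, Tuple

Features = Tuple[float, float]
# (features of the selected trajectory, features of all feasible trajectories)
Observation = Tuple[Features, List[Features]]


def dot(weights: Sequence[float], features: Sequence[float]) -> float:
    return sum(w * f for w, f in zip(weights, features))


def choice_likelihood(
    weights: Features, chosen: Features, options: List[Features], beta: float = 1.0
) -> float:
    """Boltzmann probability of the chosen trajectory under candidate weights."""
    scores = [beta * dot(weights, phi) for phi in options]
    peak = max(scores)  # subtracted for numerical stability
    normalizer = sum(math.exp(s - peak) for s in scores)
    return math.exp(beta * dot(weights, chosen) - peak) / normalizer


def posterior_over_candidates(
    candidates: List[Features], observations: List[Observation], beta: float = 1.0
) -> List[float]:
    """Uniform-prior posterior: multiply per-environment likelihoods, normalize."""
    unnormalized = []
    for weights in candidates:
        likelihood = 1.0
        for chosen, options in observations:
            likelihood *= choice_likelihood(weights, chosen, options, beta)
        unnormalized.append(likelihood)
    total = sum(unnormalized)
    return [value / total for value in unnormalized]


# Hypothetical setup: each environment alone favors a different single feature.
candidates = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
observations = [
    ((2.0, 0.0), [(2.0, 0.0), (0.0, 1.0)]),  # environment A
    ((0.0, 2.0), [(0.0, 2.0), (1.0, 0.0)]),  # environment B
]
posterior = posterior_over_candidates(candidates, observations)
print(candidates[posterior.index(max(posterior))])  # (1.0, 1.0)
```

In this toy case, neither single-feature candidate explains both environments, so the posterior concentrates on the candidate that values both features.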
Full Text Available
