Principled Reinforcement Learning with Human Feedback from Pairwise or K-wise Comparisons

Zhu, Banghua and

Citation Details

We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($$K$$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $$K$$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning. more »

Award ID(s):: 1901252 1909499

PAR ID:: 10440808

Author(s) / Creator(s):: Zhu, Banghua and

Editor(s):: Krause, Andreas and

Date Published:: 2023-01-01

Journal Name:: Proceedings of Machine Learning Research

Volume:: 202

ISSN:: 2640-3498

Page Range / eLocation ID:: 43037-43067

Format(s):: Medium: X

Sponsoring Org:: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript1.0
Conference Paper:
The DOI is not currently available.

More Like this