skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Interactive Linear Regression with Pairwise Comparisons
A general goal of interactive learning is to investigate broad ways of leveraging human feedback, and understand the benefits of learning from potentially complex feedback. We study a special case of linear regression with access to comparisons between pairs of samples. Learning from such queries is motivated by several important applications, where obtaining comparisons can be much easier than direct labels, and/or when comparisons can be more reliable. We develop an interactive algorithm that utilizes both labels and comparisons to obtain a linear estimator, and show that it only requires a very small amount of direct labels to achieve low error. We also give minimax lower bounds for the problem, showing that our algorithm is optimal up to log factors. Finally, experiments show that our algorithm outperforms label-only algorithms when labels are scarce, and it can be practical for real world applications  more » « less
Award ID(s):
1763734
PAR ID:
10091686
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
52nd Asilomar Conference on Signals, Systems, and Computers
Page Range / eLocation ID:
636--640
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Krause, Andreas and (Ed.)
    We provide a theoretical framework for Reinforcement Learning with Human Feedback (RLHF). We show that when the underlying true reward is linear, under both Bradley-Terry-Luce (BTL) model (pairwise comparison) and Plackett-Luce (PL) model ($$K$$-wise comparison), MLE converges under certain semi-norm for the family of linear reward. On the other hand, when training a policy based on the learned reward model, we show that MLE fails while a pessimistic MLE provides policies with good performance under certain coverage assumption. We also show that under the PL model, both the true MLE and a different MLE which splits the $$K$$-wise comparison into pairwise comparisons converge, while the true MLE is asymptotically more efficient. Our results validate the empirical success of the existing RLHF algorithms, and provide new insights for algorithm design. Our analysis can also be applied for the problem of online RLHF and inverse reinforcement learning. 
    more » « less
  2. Interactive decision making, encompassing bandits, contextual bandits, and reinforcement learning, has recently been of interest to theoretical studies of experimentation design and recommender system algorithm research. Recently, it has been shown that the wellknown Graves-Lai constant being zero is a necessary and sufficient condition for achieving bounded (or constant) regret in interactive decision making. As this condition may be a strong requirement for many applications, the practical usefulness of pursuing bounded regret has been questioned. In this paper, we show that the condition of the Graves-Lai constant being zero is also necessary to achieve delay model robustness when reward delays are unknown (i.e., when feedbacks are anonymous). Here, model robustness is measured in terms of ✏-robustness, one of the most widely used and one of the least adversarial robustness concepts in the robust statistics literature. In particular, we show that ✏-robustness cannot be achieved for a consistent (i.e., uniformly sub-polynomial regret) algorithm however small the nonzero ✏ value is when the Grave-Lai constant is not zero. While this is a strongly negative result, we also provide a positive result for linear rewards models (Linear contextual bandits, Reinforcement learning with linear MDP) that the Grave-Lai constant being zero is also sufficient for achieving bounded regret without any knowledge of delay models, i.e., the best of both the efficiency world and the delay robustness world. 
    more » « less
  3. null (Ed.)
    Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions. 
    more » « less
  4. null (Ed.)
    Conveying complex objectives to reinforcement learning (RL) agents can often be difficult, involving meticulous design of reward functions that are sufficiently informative yet easy enough to provide. Human-in-the-loop RL methods allow practitioners to instead interactively teach agents through tailored feedback; however, such approaches have been challenging to scale since human feedback is very expensive. In this work, we aim to make this process more sample- and feedback-efficient. We present an off-policy, interactive RL algorithm that capitalizes on the strengths of both feedback and off-policy learning. Specifically, we learn a reward model by actively querying a teacher’s preferences between two clips of behavior and use it to train an agent. To enable off-policy learning, we relabel all the agent’s past experience when its reward model changes. We additionally show that pre-training our agents with unsupervised exploration substantially increases the mileage of its queries. We demonstrate that our approach is capable of learning tasks of higher complexity than previously considered by human-in-the-loop methods, including a variety of locomotion and robotic manipulation skills. We also show that our method is able to utilize real-time human feedback to effectively prevent reward exploitation and learn new behaviors that are difficult to specify with standard reward functions. 
    more » « less
  5. We consider the bandit problem of selecting K out of N arms at each time step. The joint reward can be a non-linear function of the rewards of the selected individual arms. The direct use of a multi-armed bandit algorithm requires choosing among all possible combinations, making the action space large. To simplify the problem, existing works on combinatorial bandits typically assume feedback as a linear function of individual rewards. In this paper, we prove the lower bound for top-K subset selection with bandit feedback with possibly correlated rewards. We present a novel algorithm for the combinatorial setting without using individual arm feedback or requiring linearity of the reward function. Additionally, our algorithm works on correlated rewards of individual arms. Our algorithm, aDaptive Accept RejecT (DART), sequentially finds good arms and eliminates bad arms based on confidence bounds. DART is computationally efficient and uses storage linear in N. Further, DART achieves a regret bound of Õ(K√KNT) for a time horizon T, which matches the lower bound in bandit feedback up to a factor of √log 2NT. When applied to the problem of cross-selling optimization and maximizing the mean of individual rewards, the performance of the proposed algorithm surpasses that of state-of-the-art algorithms. We also show that DART significantly outperforms existing methods for both linear and non-linear joint reward environments. 
    more » « less