Title: Learning behaviors via human-delivered discrete feedback: modeling implicit feedback strategies to speed up learning
Award ID(s): 1643413
PAR ID: 10026429
Author(s) / Creator(s):
Date Published:
Journal Name: Autonomous Agents and Multi-Agent Systems
Volume: 30
Issue: 1
ISSN: 1387-2532
Page Range / eLocation ID: 30 to 59
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We explore unconstrained natural language feedback as a learning signal for artificial agents. Humans use rich and varied language to teach, yet most prior work on interactive learning from language assumes a particular form of input (e.g., commands). We propose a general framework which does not make this assumption, instead using aspect-based sentiment analysis to decompose feedback into sentiment over the features of a Markov decision process. We then infer the teacher's reward function by regressing the sentiment on the features, an analogue of inverse reinforcement learning. To evaluate our approach, we first collect a corpus of teaching behavior in a cooperative task where both teacher and learner are human. We implement three artificial learners: sentiment-based "literal" and "pragmatic" models, and an inference network trained end-to-end to predict rewards. We then re-run our initial experiment, pairing human teachers with these artificial learners. All three models successfully learn from interactive human feedback. The inference network approaches the performance of the "literal" sentiment model, while the "pragmatic" model nears human performance. Our work provides insight into the information structure of naturalistic linguistic feedback as well as methods to leverage it for reinforcement learning.
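    The sentiment-to-reward step described above can be pictured with a small sketch: regress signed sentiment scores onto the MDP features each utterance refers to, and treat the fitted weights as a linear reward function. The feature matrix, sentiment values, and function name below are illustrative placeholders, not the authors' implementation.

    ```python
    # Minimal sketch (not the authors' code): infer a linear reward over MDP
    # features by regressing teacher sentiment on the features each feedback
    # utterance refers to.
    import numpy as np

    def infer_reward_weights(features, sentiments, ridge=1e-3):
        """features: (n_utterances, n_features) indicators of which MDP features
        each utterance mentions (output of aspect-based sentiment analysis);
        sentiments: (n_utterances,) signed sentiment scores.
        Returns w such that reward(s) is approximated by w . phi(s)."""
        X = np.asarray(features, dtype=float)
        y = np.asarray(sentiments, dtype=float)
        # Ridge-regularized least squares keeps the fit stable when some
        # features are rarely mentioned in feedback.
        return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ y)

    # Toy usage: three utterances over two features.
    phi = [[1, 0],   # e.g. praise mentioning feature 0
           [0, 1],   # e.g. criticism mentioning feature 1
           [1, 1]]   # mixed utterance touching both
    print(infer_reward_weights(phi, [0.8, -0.9, 0.1]))
    ```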
  2. The potential for pre-trained large language models (LLMs) to use natural language feedback at inference time has been an exciting recent development. We build upon this observation by formalizing an algorithm for learning from natural language feedback at training time instead, which we call Imitation learning from Language Feedback (ILF). ILF requires only a small amount of human-written feedback during training and does not require the same feedback at test time, making it both user-friendly and sample-efficient. We further show that ILF can be seen as a form of minimizing the KL divergence to the target distribution and demonstrate proof-of-concepts on text summarization and program synthesis tasks. For code generation, ILF improves a Codegen-Mono 6.1B model’s pass@1 rate from 22% to 36% on the MBPP benchmark, outperforming both fine-tuning on MBPP and on human-written repaired programs. For summarization, we show that ILF can be combined with learning from human preferences to improve a GPT-3 model’s summarization performance to be comparable to human quality, outperforming fine-tuning on human-written summaries. Overall, our results suggest that ILF is both more effective and sample-efficient than training exclusively on demonstrations for improving an LLM’s performance on a variety of tasks.
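    Read procedurally, the ILF recipe described above amounts to: sample model outputs, collect written feedback, generate refinements conditioned on that feedback, keep only refinements judged to be improvements, and fine-tune on the survivors. The sketch below shows that data flow with hypothetical helper callables (write_feedback, is_improvement, finetune); it is a schematic of the loop as summarized in the abstract, not the paper's implementation.

    ```python
    # Schematic of one ILF round (hypothetical helpers, not the paper's code):
    # sample -> human language feedback -> refine -> filter -> fine-tune.

    def ilf_round(model, refiner, inputs, write_feedback, is_improvement, finetune):
        pairs = []
        for x in inputs:
            y = model.generate(x)                # initial model output
            fb = write_feedback(x, y)            # human natural-language feedback
            y_ref = refiner.generate(x, y, fb)   # refinement conditioned on feedback
            if is_improvement(x, y, y_ref):      # e.g. passes tests / preferred output
                pairs.append((x, y_ref))         # keep only improved outputs
        # Supervised fine-tuning on the retained pairs nudges the model toward the
        # feedback-defined target distribution (the KL-minimization view above).
        return finetune(model, pairs)
    ```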
  3. The efficiency with which a learner processes external feedback has implications for both learning speed and performance. A growing body of literature suggests that the feedback-related negativity (FRN) event-related potential (ERP) and the fronto-central positivity (FCP) ERP reflect the extent to which feedback is used by a learner to improve performance. To determine whether the FRN and FCP predict learning speed, 82 participants aged 7:6 - 11:0 learned the non-word names of 20 novel objects in a two-choice feedback-based declarative learning task. Participants continued the task until reaching the learning criterion of 2 consecutive training blocks with accuracy greater than 90%, or until 10 blocks were completed. Learning speed was determined by the total number of incorrect responses before reaching the learning criterion. Using linear regression models, the FRN amplitude in response to positive feedback was found to be a significant predictor of learning speed when controlling for age. The FCP amplitude in response to negative feedback was significantly negatively associated with learning speed, meaning that large FCP amplitudes in response to negative feedback predicted faster learning. An interaction between FCP and age suggested that for older children in this sample, smaller FCP amplitude in response to positive feedback was associated with increased speed, while for younger children, larger FCP amplitude predicted faster learning. These results suggest that the feedback-related ERP components are associated with learning speed and can reflect developmental changes in feedback-based learning.
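    The analysis structure reported above (ERP amplitude predicting errors-to-criterion while controlling for age, plus an FCP-by-age interaction) can be sketched with a standard regression formula; the data file and column names below are placeholders, not the study's dataset or analysis code.

    ```python
    # Rough sketch of the regression structure described above
    # (placeholder file and column names, not the study's data or scripts).
    import pandas as pd
    import statsmodels.formula.api as smf

    df = pd.read_csv("erp_learning.csv")  # hypothetical: one row per participant

    # FRN amplitude to positive feedback predicting learning speed
    # (errors before reaching criterion), controlling for age.
    frn_model = smf.ols("errors_to_criterion ~ frn_positive + age_months",
                        data=df).fit()

    # FCP amplitude with an FCP-by-age interaction, mirroring the reported
    # age moderation of the FCP effect.
    fcp_model = smf.ols("errors_to_criterion ~ fcp_amplitude * age_months",
                        data=df).fit()

    print(frn_model.summary())
    print(fcp_model.summary())
    ```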
  4. Recent work introduced the model of learning from discriminative feature feedback, in which a human annotator not only provides labels of instances, but also identifies discriminative features that highlight important differences between pairs of instances. It was shown that such feedback can be conducive to learning, and makes it possible to efficiently learn some concept classes that would otherwise be intractable. However, these results all relied upon perfect annotator feedback. In this paper, we introduce a more realistic, robust version of the framework, in which the annotator is allowed to make mistakes. We show how such errors can be handled algorithmically, in both an adversarial and a stochastic setting. In particular, we derive regret bounds in both settings that, as in the case of a perfect annotator, are independent of the number of features. We show that this result cannot be obtained by a naive reduction from the robust setting to the non-robust setting.
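    To make the interaction protocol concrete, the toy sketch below shows one round of discriminative feature feedback with abstract learner and annotator objects: the learner predicts, and on a mistake the annotator returns the true label together with a feature separating the instance from the one the learner confused it with. This illustrates the setting only; the robust algorithm and regret analysis are in the paper, and the method names here are hypothetical.

    ```python
    # Toy sketch of one round of the discriminative-feature-feedback protocol
    # (interface only; hypothetical method names, not the paper's algorithm).

    def feedback_round(learner, annotator, x):
        guess = learner.predict(x)
        # The annotator returns the true label and, on a mistake, a discriminative
        # feature separating x from the instance the learner confused it with.
        # In the robust setting studied in the paper, this feedback may
        # occasionally be erroneous.
        true_label, disc_feature = annotator.respond(x, guess)
        if guess != true_label:
            learner.update(x, true_label, disc_feature)
        return guess == true_label
    ```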