Title: Capturing failures of large language models via human cognitive biases
Large language models generate complex, open-ended outputs: instead of outputting a class label, they write summaries, generate dialogue, or produce working code. In order to assess the reliability of these open-ended generation systems, we aim to identify qualitative categories of erroneous behavior, beyond identifying individual errors. To hypothesize and test for such qualitative errors, we draw inspiration from human cognitive biases -- systematic patterns of deviation from rational judgement. Specifically, we use cognitive biases as motivation to (i) generate hypotheses for problems that models may have, and (ii) develop experiments that elicit these problems. Using code generation as a case study, we find that OpenAI's Codex errs predictably based on how the input prompt is framed, adjusts outputs towards anchors, and is biased towards outputs that mimic frequent training examples. We then use our framework to elicit high-impact errors such as incorrectly deleting files. Our results indicate that experimental methodology from cognitive science can help characterize how machine learning systems behave.
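For illustration, an anchoring-style probe of the kind described in the abstract could be set up roughly as sketched below. This is a minimal sketch, not the authors' released code: `query_model`, `make_prompt`, and `anchoring_probe` are hypothetical names, and the prompt format is only one way to compare a plain prompt against one that prepends an irrelevant or subtly wrong "anchor" solution.

```python
# Illustrative sketch (not the paper's code) of an anchoring-style probe
# for a code-generation model. `query_model` is a hypothetical placeholder
# for whatever completion API is under test.
from typing import Callable, Optional


def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to a code-generation model, return its completion."""
    raise NotImplementedError("Wire this up to the model under test.")


def make_prompt(docstring: str, anchor: Optional[str] = None) -> str:
    """Build a function-completion prompt, optionally prepending an 'anchor'
    solution to test whether the model drifts toward it."""
    anchor_block = f"{anchor}\n\n" if anchor else ""
    return f'{anchor_block}def solution(lst):\n    """{docstring}"""\n'


def anchoring_probe(docstring: str, anchor: str,
                    is_correct: Callable[[str], bool]) -> dict:
    """Generate completions with and without the anchor and report whether the
    anchored completion copies anchor fragments or loses correctness."""
    baseline = query_model(make_prompt(docstring))
    anchored = query_model(make_prompt(docstring, anchor=anchor))
    return {
        "baseline_correct": is_correct(baseline),
        "anchored_correct": is_correct(anchored),
        "anchor_copied": anchor.strip() in anchored,
    }
```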
Award ID(s):
2031899
NSF-PAR ID:
10432656
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Advances in Neural Information Processing Systems
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this work, we explore a useful but often neglected methodology for robustness analysis of text generation evaluation metrics: stress tests with synthetic data. Basically, we design and synthesize a wide range of potential errors and check whether they result in a commensurate drop in the metric scores. We examine a range of recently proposed evaluation metrics based on pretrained language models, for the tasks of open-ended generation, translation, and summarization. Our experiments reveal interesting insensitivities, biases, or even loopholes in existing metrics. For example, we find that BERTScore is confused by truncation errors in summarization, and MAUVE (built on top of GPT-2) is insensitive to errors at the beginning or middle of generations. Further, we investigate the reasons behind these blind spots and suggest practical workarounds for a more reliable evaluation of text generation. We have released our code and data at https://github.com/cloudygoose/blindspot_nlg. 
  2. Open-ended programming increases students' motivation by allowing them to solve authentic problems and connect programming to their own interests. However, such open-ended projects are also challenging, as they often encourage students to explore new programming features and attempt tasks that they have not learned before. Code examples are effective learning materials for students and are well-suited to supporting open-ended programming. However, there is little work to understand how novices learn with examples during open-ended programming, and few real-world deployments of such tools. In this paper, we explore novices' learning barriers when interacting with code examples during open-ended programming. We deployed Example Helper, a tool that offers galleries of code examples to search and use, with 44 novice students in an introductory programming classroom, working on an open-ended project in Snap. We found three high-level barriers that novices encountered when using examples: decision, search, and integration barriers. We discuss how these barriers arise and design opportunities to address them.
  3. Open-ended programming engages students by connecting computing with their real-world experience and personal interest. However, such open-ended programming tasks can be challenging, as they require students to implement features that they may be unfamiliar with. Code examples help students to generate ideas and implement program features, but students also encounter many learning barriers when using them. We explore how to design code examples to support novices' effective example use by presenting our experience of building and deploying Example Helper, a system that supports students with a gallery of code examples during open-ended programming. We deployed Example Helper in an undergraduate CS0 classroom to investigate students' example usage experience, finding that students used different strategies to browse, understand, experiment with, and integrate code examples and that students who made more sophisticated plans also used more examples in their projects.
  4. A growing swath of NLP research is tackling problems related to generating long text, including tasks such as open-ended story generation, summarization, dialogue, and more. However, we currently lack appropriate tools to evaluate these long outputs of generation models: classic automatic metrics such as ROUGE have been shown to perform poorly, and newer learned metrics do not necessarily work well for all tasks and domains of text. Human rating and error analysis remains a crucial component for any evaluation of long text generation. In this paper, we introduce FALTE, a web-based annotation toolkit designed to address this shortcoming. Our tool allows researchers to collect fine-grained judgments of text quality from crowdworkers using an error taxonomy specific to the downstream task. Using the task interface, annotators can select and assign error labels to text span selections in an incremental paragraph-level annotation workflow. The latter functionality is designed to simplify the document-level task into smaller units and reduce cognitive load on the annotators. Our tool has previously been used to run a large-scale annotation study that evaluates the coherence of long generated summaries, demonstrating its utility. 
  5. Mixed-initiative visual analytics systems incorporate well-established design principles that improve users' abilities to solve problems. As these systems consider whether to take initiative towards achieving user goals, many current systems address the potential for cognitive bias in human initiatives statically, relying on fixed initiatives they can take instead of identifying, communicating and addressing the bias as it occurs. We argue that mixed-initiative design principles can and should incorporate cognitive bias mitigation strategies directly through development of mitigation techniques embedded in the system to address cognitive biases in situ. We identify domain experts in machine learning adopting visual analytics techniques and systems that incorporate existing mixed-initiative principles and examine their potential to support bias mitigation strategies. This examination considers the unique perspective these experts bring to visual analytics and is situated in existing user-centered systems that make exemplary use of design principles informed by cognitive theory. We then suggest informed opportunities for domain experts to take initiative toward addressing cognitive biases in light of their existing contributions to the field. Finally, we contribute open questions and research directions for designers seeking to adopt visual analytics techniques that incorporate bias-aware initiatives in future systems.