NSF PAR Search | NSF Public Access Repository

(QA)^2: Question Answering with Questionable Assumptions

Kim, Najoung; Htut, Phu Mon; Bowman, Samuel R.; Petty, Jackson (July 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics)

Naturally-occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers to information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical when question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)^2 (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally-occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)^2, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. We find that current models do struggle with handling questionable assumptions -- the best performing model achieves 59% human rater acceptability on abstractive QA with (QA)^2 questions, leaving substantial headroom for progress.
Full Text Available
Pretraining Language Models with Human Preferences

Korbak, Tomasz; Shi, Kejian; Chen, Angelica; Bhalerao, Rasika; Buckley, Christopher L.; Phang, Jason; Bowman, Samuel R.; Perez, Ethan (July 2023, International Conference on Machine Learning)

Language models (LMs) are pretrained to imitate text from large and diverse datasets that contain content that would violate human preferences if generated by an LM: falsehoods, offensive comments, personally identifiable information, and low-quality or buggy code, among others. Here, we explore alternative objectives for pretraining LMs in a way that also guides them to generate text aligned with human preferences. We benchmark five objectives for pretraining with human feedback across three tasks and study how they affect the alignment and capabilities of pretrained LMs. We find a Pareto-optimal and simple approach among those we explored: conditional training, or learning a distribution over tokens conditional on their human preference scores. Conditional training reduces the rate of undesirable content by up to an order of magnitude, both when generating without a prompt and with an adversarially-chosen prompt. Moreover, conditional training maintains the downstream task performance of standard LM pretraining, both before and after task-specific finetuning. Pretraining with human feedback results in much better preference satisfaction than standard LM pretraining followed by finetuning with feedback, i.e., learning and then unlearning undesirable behavior. Our results suggest that we should move beyond imitation learning when pretraining LMs and incorporate human preferences from the start of training.
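The conditional-training idea in this abstract can be illustrated with a minimal data-preparation sketch. It is not the authors' implementation: the control-token strings, the 0.5 threshold, and the function names tag_segments and generation_prompt are all assumptions made for illustration, and the preference scores are taken as already given. The sketch prefixes each text segment with a token reflecting its score, so that a model trained on the result learns a token distribution conditioned on that score.

```python
from typing import List, Tuple

# Hypothetical control tokens; the abstract does not specify any.
GOOD = "<|good|>"
BAD = "<|bad|>"


def tag_segments(segments: List[Tuple[str, float]], threshold: float = 0.5) -> str:
    """Prefix each (text, score) segment with a control token.

    Segments scoring at or above the (assumed) threshold get GOOD,
    the rest get BAD. The tagged segments are concatenated into one
    training string.
    """
    tagged = []
    for text, score in segments:
        token = GOOD if score >= threshold else BAD
        tagged.append(f"{token}{text}")
    return "".join(tagged)


def generation_prompt(user_prompt: str = "") -> str:
    """At sampling time, condition on the preferred control token."""
    return f"{GOOD}{user_prompt}"


if __name__ == "__main__":
    corpus = [("Thanks for your help.", 0.9), ("You are useless.", 0.1)]
    print(tag_segments(corpus))
    print(generation_prompt("Dear team,"))
```

Under these assumptions, standard next-token training on the tagged text is unchanged; only the data is modified, and generation is steered by starting from the preferred token.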
Full Text Available
The Dangers of Underclaiming: Reasons for Caution When Reporting How NLP Systems Fail

https://doi.org/10.18653/v1/2022.acl-long.516

Bowman, Samuel (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers)

Full Text Available
SQuALITY: Building a Long-Document Summarization Dataset the Hard Way

https://doi.org/10.18653/v1/2022.emnlp-main.75

Wang, Alex; Pang, Richard Yuanzhe; Chen, Angelica; Phang, Jason; Bowman, Samuel R. (May 2022, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing)

Summarization datasets are often assembled either by scraping naturally occurring public-domain summaries -- which are nearly always in difficult-to-work-with technical domains -- or by using approximate heuristics to extract them from everyday text -- which frequently yields unfaithful summaries. In this work, we turn to a slower but more straightforward approach to developing summarization benchmark data: We hire highly-qualified contractors to read stories and write original summaries from scratch. To amortize reading time, we collect five summaries per document, with the first giving an overview and the subsequent four addressing specific questions. We use this protocol to collect SQuALITY, a dataset of question-focused summaries built on the same public-domain short stories as the multiple-choice dataset QuALITY (Pang et al., 2021). Experiments with state-of-the-art summarization systems show that our dataset is challenging and that existing automatic evaluation metrics are weak indicators of quality.
Full Text Available
What Do NLP Researchers Believe? Results of the NLP Community Metasurvey

https://doi.org/10.18653/v1/2023.acl-long.903

Michael, Julian; Holtzman, Ari; Parrish, Alicia; Mueller, Aaron; Wang, Alex; Chen, Angelica; Madaan, Divyam; Nangia, Nikita; Pang, Richard Yuanzhe; Phang, Jason; et al (January 2023, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
What Makes Reading Comprehension Questions Difficult?

https://doi.org/10.18653/v1/2022.acl-long.479

Sugawara, Saku; Nangia, Nikita; Warstadt, Alex; Bowman, Samuel (January 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics Volume 1: Long Papers)

Full Text Available
Single-Turn Debate Does Not Help Humans Answer Hard Reading-Comprehension Questions

https://doi.org/10.18653/v1/2022.lnls-1.3

Parrish, Alicia; Trivedi, Harsh; Perez, Ethan; Chen, Angelica; Nangia, Nikita; Phang, Jason; Bowman, Samuel (January 2022, Proceedings of the First Workshop on Learning with Natural Language Supervision)

Full Text Available
BBQ: A hand-built bias benchmark for question answering

https://doi.org/10.18653/v1/2022.findings-acl.165

Parrish, Alicia; Chen, Angelica; Nangia, Nikita; Padmakumar, Vishakh; Phang, Jason; Thompson, Jana; Htut, Phu Mon; Bowman, Samuel (January 2022, Findings of the Association for Computational Linguistics: ACL 2022)

Full Text Available
QuALITY: Question Answering with Long Input Texts, Yes!

Bowman, Samuel R.; Chen, Angelica; He, He; Joshi, Nitish; Ma, Johnny; Nangia, Nikita; Padmakumar, Vishakh; Pang, Richard Yuanzhe; Parrish, Alicia; Phang, Jason; et al (May 2022, NAACL 2022)

To enable building and testing models on long-document comprehension, we introduce QuALITY, a multiple-choice QA dataset with context passages in English that have an average length of about 5,000 tokens, much longer than typical current models can process. Unlike in prior work with passages, our questions are written and validated by contributors who have read the entire passage, rather than relying on summaries or excerpts. In addition, only half of the questions are answerable by annotators working under tight time constraints, indicating that skimming and simple search are not enough to consistently perform well. Our baseline models perform poorly on this task (55.4%) and significantly lag behind human performance (93.5%).
Full Text Available
Fine-Tuned Transformers Show Clusters of Similar Representations Across Layers

https://doi.org/10.18653/v1/2021.blackboxnlp-1.42

Phang, Jason; Liu, Haokun; Bowman, Samuel R. (January 2021, Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP)

Full Text Available