

Title: (QA)^2: Question Answering with Questionable Assumptions
Naturally-occurring information-seeking questions often contain questionable assumptions -- assumptions that are false or unverifiable. Questions containing questionable assumptions are challenging because they require a distinct answer strategy that deviates from typical answers to information-seeking questions. For instance, the question "When did Marie Curie discover Uranium?" cannot be answered as a typical when question without addressing the false assumption "Marie Curie discovered Uranium". In this work, we propose (QA)^2 (Question Answering with Questionable Assumptions), an open-domain evaluation dataset consisting of naturally-occurring search engine queries that may or may not contain questionable assumptions. To be successful on (QA)^2, systems must be able to detect questionable assumptions and also be able to produce adequate responses for both typical information-seeking questions and ones with questionable assumptions. We find that current models do struggle with handling questionable assumptions -- the best performing model achieves 59% human rater acceptability on abstractive QA with (QA)^2 questions, leaving substantial headroom for progress.
Award ID(s):
2046556
PAR ID:
10441675
Journal Name:
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics
Volume:
1
Page Range / eLocation ID:
8466-8487
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. During the preschool years, children’s question-explanation exchanges with teachers serve as a powerful mechanism for their early STEM knowledge acquisition. Utilizing naturalistic longitudinal classroom data, we examined how such conversations in an inquiry-based preschool classroom change during an extended scientific inquiry unit. We were particularly interested in information-seeking questions (causal, e.g., “How will you construct a pathway?”; fact-based, e.g., “Where’s the marble?”). Videos (n = 18; 14 hours) were collected during a three-week inquiry unit on forces and motion and transcribed in CLAN-CHILDES software at the utterance level. Utterances were coded for delivery (question vs. statement) and content (e.g., fact-based, causal). Although teachers ask more questions than children, we found a significant increase in information-seeking questions during Weeks 2 and 3. We explored the content of information-seeking questions and found that the majority of these questions were asked by teachers and focused on facts. However, the timing of fact-based and causal questions varied. Whereas more causal questions occurred in earlier weeks, more fact-based questions were asked towards the end of the inquiry. These findings provide insight into how children’s and teachers’ questions develop during an inquiry, informing our understanding of early science learning. Even in an inquiry-learning environment, teachers guide interactions, asking questions to support children’s learning. Children’s information-seeking questions increase during certain weeks, suggesting that providing opportunities to ask questions may allow children to be more active in constructing knowledge. Such findings are important for considering how science questions are naturally embedded in an inquiry-based learning classroom.
  2. To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just “good enough” in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pre-trained models and recent prior datasets to construct powerful question conversion and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to evaluate QA models’ proposed answers. We show that our approach improves the confidence estimation of a QA model across different domains, evaluated in a selective QA setting. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence cannot address all aspects of the question. 
  3. Understanding and characterizing how people interact in information-seeking conversations will be a crucial component in developing effective conversational search systems. In this paper, we introduce a new dataset designed for this purpose and use it to analyze information-seeking conversations by user intent distribution, co-occurrence, and flow patterns. The MSDialog dataset is a labeled conversation dataset of question answering (QA) interactions between information seekers and providers from an online forum on Microsoft products. The dataset contains more than 2,000 multi-turn QA dialogs with 10,000 utterances that are annotated with user intents on the utterance level. Annotations were done using crowdsourcing. With MSDialog, we find some highly recurring patterns in user intent during an information-seeking process. They could be useful for designing conversational search systems. We will make our dataset freely available to encourage exploration of information-seeking conversation models. 
  4. Clinical question answering (QA) aims to automatically answer questions from medical professionals based on clinical texts. Studies show that neural QA models trained on one corpus may not generalize well to new clinical texts from a different institute or a different patient group, where large-scale QA pairs are not readily available for model retraining. To address this challenge, we propose a simple yet effective framework, CliniQG4QA, which leverages question generation (QG) to synthesize QA pairs on new clinical contexts and boosts QA models without requiring manual annotations. In order to generate diverse types of questions that are essential for training QA models, we further introduce a seq2seq-based question phrase prediction (QPP) module that can be used together with most existing QG models to diversify the generation. Our comprehensive experiment results show that the QA corpus generated by our framework can improve QA models on the new contexts (up to 8% absolute gain in terms of Exact Match), and that the QPP module plays a crucial role in achieving the gain. 
  5. Question Answering (QA) naturally reduces to an entailment problem, namely, verifying whether some text entails the answer to a question. However, for multi-hop QA tasks, which require reasoning with multiple sentences, it remains unclear how best to utilize entailment models pre-trained on large-scale datasets such as SNLI, which are based on sentence pairs. We introduce Multee, a general architecture that can effectively use entailment models for multi-hop QA tasks. Multee uses (i) a local module that helps locate important sentences, thereby avoiding distracting information, and (ii) a global module that aggregates information by effectively incorporating importance weights. Importantly, we show that both modules can use entailment functions pre-trained on large-scale NLI datasets. We evaluate performance on MultiRC and OpenBookQA, two multi-hop QA datasets. When using an entailment function pre-trained on NLI datasets, Multee outperforms QA models trained only on the target QA datasets and the OpenAI transformer models.