Title: Weakly-Supervised Open-Retrieval Conversational Question Answering
Recent work on Question Answering (QA) and Conversational QA (ConvQA) emphasizes the role of retrieval: a system first retrieves evidence from a large collection and then extracts answers. This open-retrieval setting typically assumes that each question is answerable by a single span of text within a particular passage (a span answer). The supervision signal is thus derived from whether or not the system can recover an exact match of this ground-truth answer span from the retrieved passages. This method is referred to as span-match weak supervision. However, information-seeking conversations are challenging for this span-match method since long answers, especially freeform answers, are not necessarily strict spans of any passage. Therefore, we introduce a learned weak supervision approach that can identify a paraphrased span of the known answer in a passage. Our experiments on the QuAC and CoQA datasets show that although a span-match weak supervisor can handle conversations with span answers, it is not sufficient for freeform answers generated by people. We further demonstrate that our method is more flexible since it can handle both span answers and freeform answers. In particular, our method outperforms the span-match method on conversations with freeform answers, and it can be more powerful when combined with the span-match method. We also conduct in-depth analyses to provide further insights into open-retrieval ConvQA under a weak supervision setting.
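The contrast between span answers and freeform answers described in the abstract can be illustrated with a minimal sketch. The first function implements span-match weak supervision as the abstract defines it: a passage yields a supervision signal only if it contains an exact match of the known answer. The second function is not the paper's learned weak supervisor; it is a simple token-overlap F1 search, swapped in here only to show what locating an approximate span for a freeform answer might look like. All function names, the tokenization rule, and the `max_len` parameter are assumptions made for illustration.

```python
import re
from collections import Counter
from typing import List, Optional, Tuple


def normalize(text: str) -> List[str]:
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def span_match(answer: str, passage: str) -> Optional[Tuple[int, int]]:
    """Span-match weak supervision: return the (start, end) token offsets of an
    exact match of the answer in the passage, or None if there is no match."""
    a, p = normalize(answer), normalize(passage)
    if not a:
        return None
    for i in range(len(p) - len(a) + 1):
        if p[i:i + len(a)] == a:
            return i, i + len(a)
    return None


def token_f1(answer_tokens: List[str], span_tokens: List[str]) -> float:
    """Token-level F1 between the known answer and a candidate passage span."""
    common = sum((Counter(answer_tokens) & Counter(span_tokens)).values())
    if common == 0:
        return 0.0
    precision = common / len(span_tokens)
    recall = common / len(answer_tokens)
    return 2 * precision * recall / (precision + recall)


def best_overlap_span(
    answer: str, passage: str, max_len: int = 30
) -> Tuple[float, Optional[Tuple[int, int]]]:
    """Illustrative stand-in (NOT the paper's learned supervisor): brute-force
    search for the passage span with the highest token F1 against the answer."""
    a, p = normalize(answer), normalize(passage)
    best_score, best_span = 0.0, None
    for i in range(len(p)):
        for j in range(i + 1, min(len(p), i + max_len) + 1):
            score = token_f1(a, p[i:j])
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_score, best_span


passage = "The band formed in 1994 and released its first album two years later."

# A span answer appears verbatim, so span-match supervision finds it.
print(span_match("released its first album", passage))

# A freeform answer is not a strict span, so span matching yields no signal.
freeform = "They put out their first album in 1996"
print(span_match(freeform, passage))

# The overlap-based stand-in still proposes an approximate span.
print(best_overlap_span(freeform, passage))
```

In this toy case the exact matcher returns nothing for the freeform answer, which is the failure mode the abstract attributes to span-match supervision on freeform answers; the overlap search only hints at the kind of approximate span a more flexible supervisor would need to find.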
Award ID(s):
1715095
NSF-PAR ID:
10277182
Journal Name:
Proceedings of the 43rd European Conference on Information Retrieval (ECIR 2021)
Page Range or eLocation-ID:
529-543
Sponsoring Org:
National Science Foundation
More Like this
  1. Open-domain question answering answers a question based on evidence retrieved from a large corpus. State-of-the-art neural approaches require intermediate evidence annotations for training. However, such intermediate annotations are expensive, and methods that rely on them cannot transfer to the more common setting, where only question-answer pairs are available. This paper investigates whether models can learn to find evidence from a large corpus, with only distant supervision from answer labels for model training, thereby generating no additional annotation cost. We introduce a novel approach (DISTDR) that iteratively improves over a weak retriever by alternately finding evidence from the up-to-date model and encouraging the model to learn the most likely evidence. Without using any evidence labels, DISTDR is on par with fully-supervised state-of-the-art methods on both multi-hop and single-hop QA benchmarks. Our analysis confirms that DISTDR finds more accurate evidence over iterations, which leads to model improvements. The code is available at https://github.com/henryzhao5852/DistDR.
  2. Conversational search is one of the ultimate goals of information retrieval. Recent research approaches conversational search by simplified settings of response ranking and conversational question answering, where an answer is either selected from a given candidate set or extracted from a given passage. These simplifications neglect the fundamental role of retrieval in conversational search. To address this limitation, we introduce an open-retrieval conversational question answering (ORConvQA) setting, where we learn to retrieve evidence from a large collection before extracting answers, as a further step towards building functional conversational search systems. We create a dataset, OR-QuAC, to facilitate research on ORConvQA. We build an end-to-end system for ORConvQA, featuring a retriever, a reranker, and a reader that are all based on Transformers. Our extensive experiments on OR-QuAC demonstrate that a learnable retriever is crucial for ORConvQA. We further show that our system can make a substantial improvement when we enable history modeling in all system components. Moreover, we show that the reranker component contributes to the model performance by providing a regularization effect. Finally, further in-depth analyses are performed to provide new insights into ORConvQA.
  3. This work studies product question answering (PQA), which aims to answer product-related questions based on customer reviews. Most recent PQA approaches adopt end2end semantic matching methodologies, which map questions and answers to a latent vector space to measure their relevance. Such methods often achieve superior performance, but it tends to be difficult to interpret why. On the other hand, simple keyword-based search methods exhibit natural interpretability through matched keywords, but often suffer from the lexical gap problem. In this work, we develop a new PQA framework (named Riker) that enjoys the benefits of both interpretability and effectiveness. Riker mines rich keyword representations of a question with two major components, internal word re-weighting and external word association, which predict the importance of each question word and associate the question with outside relevant keywords respectively, and can be jointly trained under weak supervision with large-scale QA pairs. The keyword representations from Riker can be directly used as input to a keyword-based search module, enabling the whole process to be effective while preserving good interpretability. We conduct extensive experiments using Amazon QA and review datasets from 5 different departments, and our results show that Riker substantially outperforms previous state-of-the-art methods in both synthetic settings and real user evaluations. In addition, we compare keyword representations from Riker and those from attention mechanisms popularly used for deep neural networks through case studies, showing that the former are more effective and interpretable.
  4. Current textual question answering (QA) models achieve strong performance on in-domain test sets, but often do so by fitting surface-level patterns, so they fail to generalize to out-of-distribution settings. To make a more robust and understandable QA system, we model question answering as an alignment problem. We decompose both the question and context into smaller units based on off-the-shelf semantic representations (here, semantic roles), and align the question to a subgraph of the context in order to find the answer. We formulate our model as a structured SVM, with alignment scores computed via BERT, and we can train end-to-end despite using beam search for approximate inference. Our use of explicit alignments allows us to explore a set of constraints with which we can prohibit certain types of bad model behavior arising in cross-domain settings. Furthermore, by investigating differences in scores across different potential answers, we can seek to understand what particular aspects of the input lead the model to choose the answer without relying on post-hoc explanation techniques. We train our model on SQuAD v1.1 and test it on several adversarial and out-of-domain datasets. The results show that our model is more robust than the standard BERT QA model, and constraints derived from alignment scores allow us to effectively trade off coverage and accuracy.
  5. Relevance feedback techniques assume that users provide relevance judgments for the top k (usually 10) documents and then re-rank using a new query model based on those judgments. Even though this is effective, there has been little research recently on this topic because requiring users to provide substantial feedback on a result list is impractical in a typical web search scenario. In new environments such as voice-based search with smart home devices, however, feedback about result quality can potentially be obtained during users' interactions with the system. Since there are severe limitations on the length and number of results that can be presented in a single interaction in this environment, the focus should move from browsing result lists to iterative retrieval and from retrieving documents to retrieving answers. In this paper, we study iterative relevance feedback techniques with a focus on retrieving answer passages. We first show that iterative feedback can be at least as effective as the top-k approach on standard TREC collections, and more effective on answer passage collections. We then propose an iterative feedback model for answer passages based on semantic similarity at passage level and show that it can produce significant improvements compared to both word-based iterative feedback models and those based on term-level semantic similarity.