<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Can NLI Models Verify QA Systems’ Predictions?</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10380028</idno>
					<idno type="doi">10.18653/v1/2021.findings-emnlp.324</idno>
					<title level='j'>Findings of the Association for Computational Linguistics: EMNLP 2021</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Jifan Chen</author><author>Eunsol Choi</author><author>Greg Durrett</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[To build robust question answering systems, we need the ability to verify whether answers to questions are truly correct, not just “good enough” in the context of imperfect QA datasets. We explore the use of natural language inference (NLI) as a way to achieve this goal, as NLI inherently requires the premise (document context) to contain all necessary information to support the hypothesis (proposed answer to the question). We leverage large pre-trained models and recent prior datasets to construct powerful question conversion and decontextualization modules, which can reformulate QA instances as premise-hypothesis pairs with very high reliability. Then, by combining standard NLI datasets with NLI examples automatically derived from QA training data, we can train NLI models to evaluate QA models’ proposed answers. We show that our approach improves the confidence estimation of a QA model across different domains, evaluated in a selective QA setting. Careful manual analysis over the predictions of our NLI model shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence cannot address all aspects of the question.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Recent question answering systems perform well on benchmark datasets <ref type="bibr">(Seo et al., 2017;</ref><ref type="bibr">Devlin et al., 2019;</ref><ref type="bibr">Guu et al., 2020)</ref>, but these models often lack the ability to verify whether an answer is correct or not; they can correctly reject some unanswerable questions <ref type="bibr">(Rajpurkar et al., 2018;</ref><ref type="bibr">Kwiatkowski et al., 2019;</ref><ref type="bibr">Asai and Choi, 2021)</ref>, but are not always well-calibrated to spot spurious answers under distribution shifts <ref type="bibr">(Jia and Liang, 2017;</ref><ref type="bibr">Kamath et al., 2020)</ref>. Natural language inference (NLI) <ref type="bibr">(Dagan et al., 2005;</ref><ref type="bibr">Bowman et al., 2015)</ref> suggests one way to address this shortcoming: logical entailment provides a more rigorous notion for when a hypothesis statement is entailed by a premise statement. By viewing the answer sentence in context as the premise, paired with the question and its proposed answer as a hypothesis (see Figure <ref type="figure">1</ref>), we can use NLI systems to verify that the answer proposed by a QA model satisfies the entailment criterion <ref type="bibr">(Harabagiu and Hickl, 2006;</ref><ref type="bibr">Richardson et al., 2013)</ref>.</p><p>Figure <ref type="figure">1</ref>: An example from the Natural Questions dataset demonstrating how to convert a (question, context, answer) triplet to a (premise, hypothesis) pair. The underlined text denotes the sentence containing the answer Ted Danson, which is then decontextualized by replacing The series with The series The Good Place. Although Ted Danson is the right answer, an NLI model determines that the hypothesis is not entailed by the premise due to missing information. Panel labels: decontextualization of the answer sentence; question conversion to a declarative statement; NLI model. Annotation: the answer is correct, but information about Michael being the bad guy is missing in the premise (not entailed, answer rejected).</p><p>Prior work has paved the way for this application of NLI. Pieces of our pipeline like converting a question to a declarative sentence <ref type="bibr">(Wang et al., 2018;</ref><ref type="bibr">Demszky et al., 2018)</ref> and reformulating an answer sentence to stand on its own <ref type="bibr">(Choi et al., 2021)</ref> have been explored. Moreover, an abundance of NLI datasets <ref type="bibr">(Bowman et al., 2015;</ref><ref type="bibr">Williams et al., 2018)</ref> and related fact verification datasets <ref type="bibr">(Thorne et al., 2018)</ref> provide ample resources to train reliable models. 
We draw on these tools to enable NLI models to verify the answers from QA systems, and critically investigate the benefits and pitfalls of such a formulation.</p><p>Mapping QA to NLI enables us to exploit both NLI and QA datasets for answer verification, but as Figure <ref type="figure">1</ref> shows, it relies on a pipeline for mapping a (question, answer, context) triplet to a (premise, hypothesis) NLI pair. We implement a strong pipeline here: we extract a concise yet sufficient premise through decontextualization <ref type="bibr">(Choi et al., 2021)</ref>, which rewrites a single sentence from a document such that it can retain the semantics when presented alone without the document. We improve a prior question conversion model <ref type="bibr">(Demszky et al., 2018)</ref> with a stronger pre-trained seq2seq model, namely T5 <ref type="bibr">(Raffel et al., 2020)</ref>. Our experimental results show that both steps are critical for mapping QA to NLI. Furthermore, our error analysis shows that these two steps of the process are quite reliable and only account for a small fraction of the NLI verification model's errors.</p><p>Our evaluation focuses on two factors. First, can NLI models be used to improve calibration of QA models or boost their confidence in their decisions? Second, how does the entailment criterion of NLI, which is defined somewhat coarsely by crowd annotators <ref type="bibr">(Williams et al., 2018)</ref>, transfer to QA? We train a QA model on Natural Questions <ref type="bibr">(Kwiatkowski et al., 2019, NQ)</ref> and test whether using an NLI model helps it better generalize to four out-of-domain datasets from the MRQA shared task <ref type="bibr">(Fisch et al., 2019)</ref>. We show that by using the question converter, the decontextualization model, and the automatically generated NLI pairs from QA datasets, our NLI model improves the calibration over the base QA model across five different datasets. 
<ref type="foot">1</ref> For example, in the selective QA setting <ref type="bibr">(Kamath et al., 2020)</ref>, our approach improves the F1 score of the base QA model from 81.6 to 87.1 when giving answers on the 20% of questions it is most confident about. Our pipeline further identifies the cases where there exists an information mismatch between the premise and the hypothesis. We find that existing QA datasets encourage models to return answers when the context does not actually contain sufficient information, suggesting that fully verifying the answers is a challenging endeavor.</p><p>Footnote 1: The converted NLI datasets, the question converter, the decontextualizer, and the NLI model are available at <ref type="url">https://github.com/jifan-chen/QA-Verification-Via-NLI</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Using NLI as a QA Verifier</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Background and Motivation</head><p>Using entailment for QA is an old idea; our high-level approach resembles the approach discussed in <ref type="bibr">Harabagiu and Hickl (2006)</ref>. Yet, the execution of this idea differs substantially as we exploit modern neural systems and newly proposed annotated data for passage and question reformulation. <ref type="bibr">Richardson et al. (2013)</ref> explore a similar pipeline, but find that it works quite poorly, possibly due to the low performance of entailment systems at the time <ref type="bibr">(Stern and Dagan, 2011)</ref>. We believe that a combination of recent advances in natural language generation <ref type="bibr">(Demszky et al., 2018;</ref><ref type="bibr">Choi et al., 2021)</ref> and strong models for NLI <ref type="bibr">(Liu et al., 2019)</ref> equips us to re-evaluate this approach.</p><p>Moreover, the focus of other recent work in this space has been on transforming QA datasets into NLI datasets, which is a different end. <ref type="bibr">Demszky et al. (2018)</ref> and <ref type="bibr">Mishra et al. (2021)</ref> argue that QA datasets feature more diverse reasoning and can lead to stronger NLI models, particularly those better suited to long contexts, but less attention has been paid to whether this agrees with classic definitions of entailment <ref type="bibr">(Dagan et al., 2005)</ref> or short-context NLI settings <ref type="bibr">(Williams et al., 2018)</ref>.</p><p>Our work particularly aims to shed light on information sufficiency in question answering. Other work in this space has focused on validating answers to unanswerable questions <ref type="bibr">(Rajpurkar et al., 2018;</ref><ref type="bibr">Kwiatkowski et al., 2019)</ref>, but such questions may be nonsensical in context; these efforts do not address whether all aspects of a question have been covered. 
Methods to handle adversarial SQuAD examples <ref type="bibr">(Jia and Liang, 2017)</ref> attempt to do this <ref type="bibr">(Chen and Durrett, 2021)</ref>, but these are again geared towards detecting specific kinds of mismatches between examples and contexts, like a changed modifier of a noun phrase. <ref type="bibr">Kamath et al. (2020)</ref> frame their selective question answering techniques in terms of spotting out-of-domain questions that the model is likely to get wrong rather than more general confidence estimation. What is missing in these threads of literature is a formal criterion like entailment: when is an answer truly sufficient and when are we confident that it addresses the question?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Our Approach</head><p>Our pipeline consists of an answer candidate generator, a question converter, and a decontextualizer, which form the inputs to the final entailment model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer Generation</head><p>In this work, we focus our attention on extractive QA <ref type="bibr">(Hermann et al., 2015;</ref><ref type="bibr">Rajpurkar et al., 2016)</ref>, for which we can get an answer candidate by running a pre-trained QA model. <ref type="foot">2</ref> We use the Bert-joint model proposed by <ref type="bibr">Alberti et al. (2019)</ref> for its simplicity and relatively high performance.</p><p>Question Conversion Given a question q and an answer candidate a, our goal is to convert the (q, a) pair to a declarative answer sentence d which can be treated as the hypothesis in an NLI system <ref type="bibr">(Demszky et al., 2018;</ref><ref type="bibr">Khot et al., 2018)</ref>. While rule-based approaches have long been employed for this purpose <ref type="bibr">(Cucerzan and Agichtein, 2005)</ref>, the work of <ref type="bibr">Demszky et al. (2018)</ref> showed a benefit from more sophisticated neural modeling of the distribution <formula notation="tex">P(d \mid q, a)</formula>. We fine-tune a seq2seq model, T5-3B <ref type="bibr">(Raffel et al., 2020)</ref>, using the (q, a, d) triples annotated by <ref type="bibr">Demszky et al. (2018)</ref>.</p><p>While the conversion is trivial on many examples (e.g., replacing the wh-word with the answer and inverting the wh-movement), we see improvement on challenging examples like the following NQ question: the first vice president of India who became the president later was? The rule-based system from <ref type="bibr">Demszky et al. (2018)</ref> just replaces who with the answer Venkaiah Naidu. Our neural model successfully appends the answer to the end of the question and gets the correct hypothesis.</p><p>Decontextualization Ideally, the full context containing the answer candidate could be treated as the premise to make the entailment decision. 
But the full context often contains many irrelevant sentences and is much longer than the premises in single-sentence NLI datasets <ref type="bibr">(Williams et al., 2018;</ref><ref type="bibr">Bowman et al., 2015)</ref>. This length has several drawbacks. First, it makes transferring models from the existing datasets challenging. Second, performing inference over longer forms of text requires a multitude of additional reasoning skills like coreference resolution, event detection, and abduction <ref type="bibr">(Mishra et al., 2021)</ref>. Finally, the presence of extraneous information makes it harder to evaluate the entailment model's judgments for correctness; in the extreme, we might have to judge whether a fact about an entity is true based on its entire Wikipedia article, which is impractical.</p><p>We tackle this problem by decontextualizing the sentence containing the answer from the full context to make it stand alone. Recent work <ref type="bibr">(Choi et al., 2021)</ref> proposed a sentence decontextualization task in which a sentence is taken together with its context and rewritten to be interpretable out of context if feasible, while preserving its meaning. This procedure can involve name completion (e.g., Stewart &#8594; Kristen Stewart), noun phrase/pronoun swap, bridging anaphora resolution, and more.</p><p>More formally, given a sentence <formula notation="tex">S_a</formula> containing the answer and its corresponding context <formula notation="tex">C</formula>, decontextualization learns a model <formula notation="tex">P(S_d \mid S_a, C)</formula>, where <formula notation="tex">S_d</formula> is the decontextualized form of <formula notation="tex">S_a</formula>. We train a decontextualizer by fine-tuning the T5-3B model to decode <formula notation="tex">S_d</formula> from the concatenation of the <formula notation="tex">(S_a, C)</formula> pair, following the original work. More details about the models we discuss here can be found in Appendix B.</p></div>
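<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the contrast between trivial and challenging conversions concrete, the following is a minimal sketch of a naive wh-word replacement rule. It is not the rule-based system of Demszky et al. (2018) nor the fine-tuned T5 converter; the names WH_WORDS and naive_convert are hypothetical, and the first example question is invented for illustration. The second call uses the NQ question quoted above and shows how plain replacement yields an ungrammatical hypothesis.</p>
<ab><![CDATA[
```python
import re

# Hypothetical list of wh-words handled by this toy rule.
WH_WORDS = ("who", "whom", "what", "which", "where", "when")


def naive_convert(question: str, answer: str) -> str:
    """Replace the first wh-word in the question with the answer.

    Illustrative baseline only: it performs no syntactic analysis and
    does not invert wh-movement.
    """
    text = question.strip().rstrip("?").strip()
    pattern = re.compile(r"\b(" + "|".join(WH_WORDS) + r")\b", re.IGNORECASE)
    converted = pattern.sub(answer, text, count=1)
    if not converted:
        return ""
    return converted[0].upper() + converted[1:] + "."


# Invented easy case: the wh-word sits in subject position.
simple = naive_convert("who plays Michael in The Good Place?", "Ted Danson")

# Challenging NQ question from the text: replacing "who" breaks the sentence.
hard = naive_convert(
    "the first vice president of India who became the president later was?",
    "Venkaiah Naidu",
)

print(simple)
print(hard)
```
]]></ab>
<p>On the second question the naive rule inserts the answer in the middle of the relative clause, whereas the desired hypothesis places it at the end of the sentence; this is the kind of case where a learned converter helps.</p></div>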
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Experimental Settings</head><p>Our experiments seek to validate the utility of NLI for verifying answers primarily under distribution shifts, following recent work on selective question answering <ref type="bibr">(Kamath et al., 2020)</ref>. We transfer an NQ-trained QA model to a range of datasets and evaluate whether NLI improves answer confidence.</p><p>Datasets We use five English-language span-extractive QA datasets: Natural Questions <ref type="bibr">(Kwiatkowski et al., 2019, NQ)</ref>, TriviaQA <ref type="bibr">(Joshi et al., 2017)</ref>, BioASQ <ref type="bibr">(Tsatsaronis et al., 2015)</ref>, Adversarial SQuAD <ref type="bibr">(Jia and Liang, 2017, SQuAD-adv)</ref>, and SQuAD 2.0 <ref type="bibr">(Rajpurkar et al., 2018)</ref>. For TriviaQA and BioASQ, we use processed versions from MRQA <ref type="bibr">(Fisch et al., 2019)</ref>. These datasets cover a wide range of domains including biology (BioASQ), trivia questions (TriviaQA), real user questions (NQ), and human-synthetic challenging sets (SQuAD2.0 and SQuAD-adv). For NQ, we use the rule-based system proposed by <ref type="bibr">Demszky et al. (2018)</ref> to filter out the examples in which the questions are narrative statements rather than questions. We also exclude the examples based on tables because they are not compatible with the task formulation of NLI.<ref type="foot">foot_1</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><table><head>Figure 2: Two examples taken from SQuAD2.0</head>
<row role="label"><cell></cell><cell>Answerable</cell><cell>Unanswerable</cell></row>
<row><cell>Question</cell><cell>Where was Dyrrachium located?</cell><cell>What naval base fell to the Normans?</cell></row>
<row><cell>QA Prediction</cell><cell>Adriatic</cell><cell>Dyrrachium</cell></row>
<row><cell>Hypothesis</cell><cell>Dyrrachium was located in Adriatic.</cell><cell>The naval base Dyrrachium fell to the Normans.</cell></row>
<row><cell>Premise</cell><cell>Dyrrachium - one of the most important naval bases of the Adriatic - fell again to Byzantine hands.</cell><cell>Dyrrachium - one of the most important naval bases of the Adriatic - fell again to Byzantine hands.</cell></row>
<row><cell>NLI Prediction</cell><cell>Entail</cell><cell>Not Entail</cell></row>
</table>
<p>Base QA Model We train our base QA model <ref type="bibr">(Alberti et al., 2019)</ref> with the NQ dataset.</p><p>To study robustness across different datasets, we fix the base QA model and investigate its capacity to transfer. We chose NQ for its high quality and the diverse topics it covers.</p><p>Base NLI Model We use the RoBERTa-based NLI model trained using Multi-Genre Natural Language Inference <ref type="bibr">(Williams et al., 2018, MNLI)</ref> from AllenNLP <ref type="bibr">(Gardner et al., 2018)</ref> for its broad coverage and high accuracy.</p><p>QA-enhanced NLI Model As there might exist different reasoning patterns in the QA datasets which are not covered by the MNLI model <ref type="bibr">(Mishra et al., 2021)</ref>, we study whether NLI pairs generated from QA datasets can be used jointly with the MNLI data to improve the performance of an NLI model. To do so, we run the QA instances in the NQ training set through our QA-to-NLI conversion pipeline, resulting in a dataset we call NQ-NLI, containing (premise, hypothesis) pairs from NQ with binary labels. As answer candidates, we use the predictions of the base QA model. If the predicted answer is correct, we label the (premise, hypothesis) pair as positive (entailed), otherwise negative (not entailed). To combine NQ-NLI with MNLI, we treat the examples in MNLI labeled with "entailment" as positive and the others as negative.</p><p>We sample from MNLI the same number of examples as in NQ-NLI and shuffle the two sets together to get a mixed dataset which we call NQ-NLI+MNLI. We use these dataset names to indicate NLI models trained on these datasets. Some basic statistics for each dataset after processing with our pipeline are shown in Appendix A.</p></div>
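<div xmlns="http://www.tei-c.org/ns/1.0"><p>The labeling and mixing procedure described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the released code: the dictionary keys, the function names qa_to_binary, mnli_to_binary and mix, the fixed random seed, and the toy examples are all hypothetical.</p>
<ab><![CDATA[
```python
import random


def qa_to_binary(premise, hypothesis, answer_is_correct):
    """NQ-NLI label: 1 (entailed) if the QA prediction was correct, else 0."""
    return {"premise": premise, "hypothesis": hypothesis,
            "label": int(answer_is_correct)}


def mnli_to_binary(example):
    """Collapse three-way MNLI labels: entailment -> 1, everything else -> 0."""
    return {"premise": example["premise"],
            "hypothesis": example["hypothesis"],
            "label": int(example["label"] == "entailment")}


def mix(nq_nli, mnli, seed=0):
    """Sample as many MNLI examples as there are NQ-NLI examples, then shuffle.

    Assumption: if MNLI were smaller than NQ-NLI we would simply use all of it.
    """
    rng = random.Random(seed)
    k = min(len(nq_nli), len(mnli))
    sampled = rng.sample(mnli, k)
    mixed = list(nq_nli) + [mnli_to_binary(e) for e in sampled]
    rng.shuffle(mixed)
    return mixed


# Toy data, invented for illustration.
nq_nli = [
    qa_to_binary("premise A", "hypothesis A", True),
    qa_to_binary("premise B", "hypothesis B", False),
]
mnli = [
    {"premise": "p1", "hypothesis": "h1", "label": "entailment"},
    {"premise": "p2", "hypothesis": "h2", "label": "neutral"},
    {"premise": "p3", "hypothesis": "h3", "label": "contradiction"},
]
mixed = mix(nq_nli, mnli)
print(len(mixed), [ex["label"] for ex in mixed])
```
]]></ab>
</div>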
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Improving QA Calibration with NLI</head><p>In this section, we explore to what extent either off-the-shelf or QA-augmented NLI models work as verifiers across a range of QA datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Rejecting Unanswerable Questions</head><p>We start by testing how well a pre-trained MNLI model, with an accuracy of 90.2% on held-out MNLI examples, can identify unanswerable questions in SQuAD2.0. We run our pre-trained QA model on the unanswerable questions to produce answer candidates, then convert them to the NLI pairs through our pipeline, including question conversion and decontextualization. We run the entailment model trained on MNLI to see how frequently it is able to reject the answer by predicting either "neutral" or "contradiction". For questions with annotated answers, we also generate the NLI pairs with the gold answer and see if the entailment model trained on MNLI can accept the answer.</p><p>The MNLI model successfully rejects 78.5% of the unanswerable examples and accepts 82.5% of the answerable examples. Two examples taken from SQuAD2.0 are shown in Figure <ref type="figure">2</ref>. We can see the MNLI model is quite sensitive to the information mismatch between the hypothesis and the premise. In the case where there is no information about Normans in the premise, it rejects the answer. Without seeing any data from SQuAD2.0, MNLI can already act as a strong verifier in the unanswerable setting where it is hard for a QA model to generalize <ref type="bibr">(Rajpurkar et al., 2018)</ref>.</p></div>
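<div xmlns="http://www.tei-c.org/ns/1.0"><p>The accept/reject decision used in this experiment can be written down in a few lines. The sketch below assumes the NLI model exposes a probability per class under the keys "entailment", "neutral" and "contradiction"; the function names and the toy probabilities are hypothetical and do not reproduce the reported numbers.</p>
<ab><![CDATA[
```python
def verify(label_probs):
    """Accept the answer only if 'entailment' is the most probable class.

    Predicting 'neutral' or 'contradiction' counts as rejecting the answer.
    """
    predicted = max(label_probs, key=label_probs.get)
    return predicted == "entailment"


def rejection_rate(all_probs):
    """Fraction of (premise, hypothesis) pairs that the verifier rejects."""
    rejected = sum(1 for probs in all_probs if not verify(probs))
    return rejected / len(all_probs)


# Invented NLI outputs for four unanswerable questions.
unanswerable = [
    {"entailment": 0.10, "neutral": 0.80, "contradiction": 0.10},
    {"entailment": 0.20, "neutral": 0.30, "contradiction": 0.50},
    {"entailment": 0.70, "neutral": 0.20, "contradiction": 0.10},
    {"entailment": 0.05, "neutral": 0.90, "contradiction": 0.05},
]
print(rejection_rate(unanswerable))
```
]]></ab>
</div>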
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Calibration</head><p>To analyze the effectiveness of the NLI models in a more systematic way, we test whether they can improve calibration of QA models or improve model performance in a "selective" QA setting <ref type="bibr">(Kamath et al., 2020)</ref>. That is, if our model can choose to answer only the k percentage of examples it is most confident about (the coverage), what F1 can it achieve? We first rank the examples by the confidence score of a model; for our base QA models, this score is the posterior probability of the answer span, and for our NLI-augmented models, it is the posterior probability associated with the "entailment" class. We then compute F1 scores at different coverage values.  <ref type="formula">2021</ref>) since it follows their procedure. All of the models are initialized using RoBERTa-large <ref type="bibr">(Liu et al., 2019)</ref> and trained using the same configurations.</p></div>
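<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coverage-based evaluation described in this section can be sketched as below: rank examples by confidence, keep the most confident fraction, and average their F1. The function name f1_at_coverage, the rounding rule for the number of kept examples, and the toy scores are assumptions made for illustration.</p>
<ab><![CDATA[
```python
def f1_at_coverage(confidences, f1_scores, coverage):
    """Mean F1 over the `coverage` fraction of examples with highest confidence.

    confidences: one confidence score per example (QA posterior, or the
                 posterior of the entailment class for an NLI-based model).
    f1_scores:   per-example F1 of the predicted answer.
    coverage:    fraction in (0, 1] of examples the model chooses to answer.
    """
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    if len(confidences) != len(f1_scores) or not confidences:
        raise ValueError("need equally sized, non-empty inputs")
    ranked = sorted(zip(confidences, f1_scores),
                    key=lambda pair: pair[0], reverse=True)
    # Assumption: keep at least one example, rounding to the nearest count.
    k = max(1, round(coverage * len(ranked)))
    kept = ranked[:k]
    return sum(f1 for _, f1 in kept) / k


# Toy scores, invented for illustration.
conf = [0.9, 0.2, 0.8, 0.4]
f1 = [1.0, 0.0, 0.5, 1.0]
half = f1_at_coverage(conf, f1, 0.5)
full = f1_at_coverage(conf, f1, 1.0)
print(half, full)
```
]]></ab>
<p>Sweeping the coverage value from small to 1.0 and plotting the resulting F1 gives a coverage curve of the kind this evaluation compares across systems.</p></div>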
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Comparison Systems NLI model variants</head><p>NLI+QA We explore combining complementary strengths of the NLI posteriors and the base QA posteriors. We take the posterior probabilities of the two models as features, learn a binary classifier <formula notation="tex">y = \mathrm{logistic}(w_1 p_{\mathrm{QA}} + w_2 p_{\mathrm{NLI}})</formula> as the combined entailment model, and tune the model on 100 held-out NQ examples. +QA denotes this combination with any of our NLI models.</p></div>
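<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of the NLI+QA combiner follows, fitting the two weights of the logistic model by batch gradient descent on the log loss. The optimizer, learning rate, epoch count, function names and the six toy training points are all assumptions; the text above only specifies the model form (two weights, no bias term) and that it is tuned on 100 held-out NQ examples.</p>
<ab><![CDATA[
```python
import math


def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))


def combined_score(p_qa, p_nli, w1, w2):
    """y = logistic(w1 * p_QA + w2 * p_NLI), used as the new confidence."""
    return logistic(w1 * p_qa + w2 * p_nli)


def fit_combiner(features, labels, lr=0.5, epochs=2000):
    """Fit (w1, w2) by batch gradient descent on the mean log loss."""
    w1 = w2 = 0.0
    n = len(features)
    for _ in range(epochs):
        g1 = g2 = 0.0
        for (p_qa, p_nli), y in zip(features, labels):
            err = combined_score(p_qa, p_nli, w1, w2) - y
            g1 += err * p_qa
            g2 += err * p_nli
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
    return w1, w2


# Toy held-out set, invented: (p_QA, p_NLI) -> 1 if the answer was correct.
features = [(0.9, 0.8), (0.8, 0.9), (0.6, 0.9),
            (0.9, 0.1), (0.3, 0.2), (0.7, 0.2)]
labels = [1, 1, 1, 0, 0, 0]

w1, w2 = fit_combiner(features, labels)
print(w1, w2)
print(combined_score(0.9, 0.9, w1, w2), combined_score(0.9, 0.1, w1, w2))
```
]]></ab>
<p>Because the combined score is only used to rank examples by confidence, what matters is the ordering it induces rather than any fixed threshold; the weights learned on this toy set carry no meaning beyond the illustration.</p></div>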
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QA-Ensemble</head><p>To compare with NLI+QA, we train another identical QA model, Bert-joint, using the same configurations and ensemble the two QA models in the same way as NLI+QA.</p><p>Selective QA <ref type="bibr">Kamath et al. (2020)</ref> train a calibrator to make models better able to selectively answer questions in new domains. The calibrator is a binary classifier with seven features: passage length, the length of the predicted answer, and the top five softmax probabilities output by the QA model. We use the same configuration as (Kamath </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Results and Analysis</head><p>Figure <ref type="figure">3</ref> shows the macro-averaged results over the five QA datasets. Please refer to Appendix C for the per-dataset breakdown.</p><p>Our NQ-NLI+QA system, which combines the QA model's posteriors with an NQ-NLI-trained system, already shows improvement over using the base QA posteriors. Surprisingly, additionally training the NLI model on MNLI (NQ-NLI+MNLI+QA) gives even stronger results. The NLI models appear to be complementary to the QA model, improving performance even on out-of-domain data. We also see that our NQ-NLI+MNLI+QA outperforms <ref type="bibr">Mishra et al. (2021)</ref>+QA by a large margin. By inspecting the performance breakdown in Appendix C, we see the gap is mainly on SQuAD2.0 and SQuAD-adv. This is because these datasets often introduce subtle mismatches by slight modification of the question or context; even if the NLI model is able to overcome other biases, these are challenging contrastive examples from the standpoint of the NLI model. This observation also indicates that to better utilize the complementary strength of MNLI, the proposed decontextualization phase in our pipeline is quite important.</p><p>Selective QA shows similar performance to using the posterior from the QA model, which is the most important feature for the calibrator.</p><p>Combining the NLI model with the base QA model's posteriors is necessary for this strong performance.</p><p>Figure <ref type="figure">4</ref> shows the low performance achieved by the NLI models alone, indicating that NLI models trained exclusively on NLI datasets (FEVER-NLI, MNLI) cannot be used by themselves as effective verifiers for QA. This also indicates a possible domain or task mismatch between FEVER, MNLI, and the other QA datasets. NQ-NLI helps bridge the gap between the QA datasets and MNLI. 
In Figure <ref type="figure">4</ref>, both NQ-NLI and NQ-NLI+MNLI achieve similar performance to the original QA model. We also find that training using both NQ-NLI and MNLI achieves slightly better performance than training using NQ-NLI alone. This suggests that we are not simply training a QA model of a different form by using the NQ-NLI data; rather, the NQ-NLI pairs are compatible with the MNLI pairs, and the MNLI examples are useful for the model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Effectiveness of the Proposed Pipeline</head><p>We present an ablation study on our pipeline to see how each component contributes to the final performance. For simplicity, we use the off-the-shelf MNLI model since it does not involve training using the data generated through the pipeline. Figure <ref type="figure">5</ref> shows the average results across five datasets and Figure <ref type="figure">6</ref> presents individual performance on three datasets.</p><p>We see that both the question converter and the decontextualizer contribute to the performance of the MNLI model. In both figures, removing either module harms the performance for all datasets. On NQ and BioASQ, using the full context is better than the decontextualized sentence, which hints that there are cases where the full context provides necessary information. We have a more comprehensive analysis in Section 6.2.</p><p>Moreover, we see that MNLI outperforms the base QA posteriors on SQuAD2.0 and SQuAD-adv. Figure <ref type="figure">6</ref>(a) also shows that the largest gap between the QA and NLI model is on NQ, which is unsurprising since the QA model is trained on NQ. These results show how the improvement in the last section is achieved: the complementary strengths of MNLI and NQ datasets lead to the best overall performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Understanding the Behavior of NQ-NLI</head><p>We perform manual analysis on 300 examples drawn from the NQ, TriviaQA, and SQuAD2.0 datasets where the NQ-NLI+MNLI model produced an error. We classify errors into one of 7 classes, described in Sections 6.1 and 6.2. All of the authors of this paper conducted the annotation. The annotations agree with a Fleiss' kappa value of 0.78, with disagreements usually being between closely related categories among our 7 error classes, e.g., annotation error vs. span shifting, wrong context vs. insufficient context, as we will see later. The breakdown of the errors in each dataset is shown in Table <ref type="table">1</ref>.</p></div>
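<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the agreement statistic reported above, the following sketch computes Fleiss' kappa from a table of per-item category counts using the standard formula. The function name and the toy rating table (three raters, three categories) are hypothetical and unrelated to the actual annotations.</p>
<ab><![CDATA[
```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table where counts[i][j] is the number of raters
    who assigned item i to category j. Every item needs the same number of
    raters (at least two)."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item must be rated by the same number (>= 2) of raters")
    n_categories = len(counts[0])

    # Proportion of all assignments that went to each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(n_categories)]
    # Observed agreement per item.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]

    p_bar = sum(p_item) / n_items          # mean observed agreement
    p_exp = sum(p * p for p in p_cat)      # agreement expected by chance
    return (p_bar - p_exp) / (1 - p_exp)


# Toy table, invented: 4 items, 3 raters, 3 categories.
toy = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 0, 3]]
kappa = fleiss_kappa(toy)
print(round(kappa, 4))
```
]]></ab>
</div>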
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Errors from the Pipeline</head><p>We see that across the three different datasets, the number of errors attributed to our pipeline approach is below 10%. This demonstrates that the question converter and the decontextualization model are quite effective at converting a (question, answer, context) triplet to a (premise, hypothesis) NLI pair. For the question converter, errors mainly happen in two scenarios as shown in Figure <ref type="figure">7</ref>.</p><p>(1) The question converter is given an answer of the wrong type for the question. For example, the question asks "How old...", but the answer returned is "Mike Pence", which does not fit the question. The question converter puts Mike Pence back into the question and yields an unrelated statement. Adding a presupposition checking stage to the question converter could further improve its performance <ref type="bibr">(Kim et al., 2021)</ref>. (2) The question is long and syntactically complex; the question converter just copies a long question without answer replacement.</p><p>For the decontextualization model, errors usually happen when the model fails to recall one of the required modifications. As shown in the example in Figure <ref type="figure">7</ref>, the model fails to replace The work with its full entity name The Art of War.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Errors from the NLI Model</head><p>Most of the errors are attributed to the entailment model. We investigate these cases closely and ask ourselves if these really are errors. We categorize them into the following categories.</p><p>Entailment These errors are truly mistakes by the entailment model: in our view, the pair of sentences should exhibit a different relationship than what was predicted.</p><p>Wrong Context The QA model gets the right answer for the wrong reason. The example in Figure <ref type="figure">8</ref> shows that John Von Neumann is the annotated answer but it is not entailed by the premise because no information about CPU is provided. Although the answer is correct, we argue it is better for the model to reject this case. This again demonstrates one of the key advantages of using an NLI model as a verifier for QA models: it can identify cases of information mismatch like this where the model didn't retrieve suitable context to show to the user of the QA system.</p><p>Insufficient Context (out of scope for decontextualization) The premise lacks essential information that could be found in the full context, typically later in the context. In Figure <ref type="figure">8</ref>, the answer Roxette is in the first sentence. However, we do not know that she wrote the song It Must Have Been Love until we go further in the context. The need to add future information is beyond the scope of the decontextualization <ref type="bibr">(Choi et al., 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Span Shifting</head><p>The predicted answer of the QA model overlaps with the gold answer and is acceptable as a correct answer. For example, a question asks What Missouri town calls itself the Live Music Show Capital? Both Branson and Branson, Missouri can be accepted as the right answer.</p><p>Annotation Error These errors are introduced by incomplete or wrong annotations: some acceptable answers are missing or the annotated answer is wrong.</p><p>From Table <ref type="table">1</ref>, we see that "wrong context" cases account for 25% and 40% of the errors for NQ and TriviaQA, respectively, while they rarely happen on SQuAD2.0. This is because the supporting snippets for NQ and TriviaQA are retrieved from Wikipedia and web documents, so the information contained may not be sufficient to support the question. For SQuAD2.0, the supporting document is given to the annotators, so no such errors happen. This observation indicates that the NLI model can be particularly useful in the open-domain setting where it can reject answers that are not well supported. In particular, we believe that this raises a question about answers in TriviaQA. The supporting evidence for the answer is often insufficient to validate all aspects of the question. What should a QA model do in this case: make an educated guess based on partial evidence, or reject the answer outright? This choice is application-specific, but our approach can help system designers make these decisions explicit.</p><p>Around 10% to 15% of errors happen due to insufficient context. Such errors could potentially be fixed in future work by learning a question-conditioned decontextualizer which aims to gather all information related to the question.</p><p>[...] (2019) used NLI models to detect factual errors in abstractive summaries. 
For question answering, <ref type="bibr">Harabagiu and Hickl (2006)</ref> showed that textual entailment can be used to enhance the accuracy of open-domain QA systems; <ref type="bibr">Trivedi et al. (2019)</ref> used a pretrained NLI model to select relevant sentences for multi-hop question answering; <ref type="bibr">Yin et al. (2020)</ref> tested whether NLI models generalize to the QA setting in a few-shot learning scenario.</p><p>Our work is most relevant to <ref type="bibr">Mishra et al. (2021)</ref>; they also learn an NLI model using examples generated from QA datasets. Our work differs from theirs in a few chief ways. First, we improve the conversion pipeline significantly with decontextualization and a better question converter. Second, we use our framework to improve QA performance by using NLI as a verifier, which is only possible because the decontextualization allows us to focus on a single sentence. We also study whether the converted dataset is compatible with other off-the-shelf NLI datasets. By contrast, <ref type="bibr">Mishra et al. (2021)</ref> use their converted NLI dataset to aid other tasks such as fact-checking. Finally, the contrast we establish here allows us to conduct a thorough human analysis over the converted NLI data and show how the task specifications of NLI and QA are different (Section 6.2).</p><p>Robust Question Answering Modern QA systems often give incorrect answers in challenging settings that require generalization <ref type="bibr">(Rajpurkar et al., 2018;</ref><ref type="bibr">Chen and Durrett, 2019;</ref><ref type="bibr">Wallace et al., 2019;</ref><ref type="bibr">Gardner et al., 2020;</ref><ref type="bibr">Kaushik et al., 2019)</ref>. Models focusing on robustness and generalizability have been proposed in recent years: <ref type="bibr">Wang and Bansal (2018)</ref>; <ref type="bibr">Khashabi et al. (2020)</ref>; <ref type="bibr">Liu et al. 
(2020)</ref> use perturbation-based methods and adversarial training; <ref type="bibr">Lewis and Fan (2018)</ref> propose generative QA to prevent the model from overfitting to simple patterns; Yeh and Chen ( <ref type="formula">2019</ref> Another line of work makes models more robust by introducing answer verification <ref type="bibr">(Hu et al., 2019;</ref><ref type="bibr">Kamath et al., 2020;</ref><ref type="bibr">Wang et al., 2020;</ref><ref type="bibr">Zhang et al., 2021)</ref> as a final step for question answering models. Our work is in the same vein, but has certain advantages from using an NLI model.</p><p>First, the answer verification process is more explicit, so that one is able to spot where the error emerges. Second, we can incorporate NLI datasets from other domains into the training of our verifier, reducing reliance on in-domain labeled QA data.</p><p>Entailment Error (NLI Prediction: Not Entail). Question: What were the results of the development of Florida's railroads? Predicted / Gold Answer: towns grew and farmland was cultivated / towns grew and farmland was cultivated. Hypothesis: The results of the development of Florida's railroads were that towns grew and farmland was cultivated. Premise: Henry Flagler built a railroad along the east coast of Florida and eventually to Key West; towns grew and farmland was cultivated along the rail line.</p></div>
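<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of answer verification as a final step, the sketch below abstains whenever a verifier score falls under a threshold and reports coverage and accuracy on the answered subset. It is only a sketch under stated assumptions: the names VerifiedAnswer, select_answer, and coverage_and_accuracy, the field entail_prob, and the default threshold of 0.5 are all hypothetical and are not taken from the paper.</p>

```python
from typing import Iterable, NamedTuple, Optional, Tuple


class VerifiedAnswer(NamedTuple):
    answer: str
    # Hypothetical verifier score: probability that the premise entails the hypothesis.
    entail_prob: float


def select_answer(candidate: VerifiedAnswer, threshold: float = 0.5) -> Optional[str]:
    """Return the answer if the verifier score reaches the threshold; otherwise abstain."""
    # The 0.5 default is an assumption for illustration, not a value from the paper.
    return candidate.answer if candidate.entail_prob >= threshold else None


def coverage_and_accuracy(
    scored: Iterable[Tuple[float, bool]], threshold: float
) -> Tuple[float, float]:
    """Given (entail_prob, is_correct) pairs, return (coverage, accuracy on answered)."""
    items = list(scored)
    # Keep the correctness flags of the examples the system chooses to answer.
    answered = [is_correct for prob, is_correct in items if prob >= threshold]
    coverage = len(answered) / len(items) if items else 0.0
    accuracy = sum(answered) / len(answered) if answered else 0.0
    return coverage, accuracy
```

<p>Sweeping the threshold over a range of values would trace out a coverage versus accuracy curve, which is one common way to compare confidence estimators in a selective QA setting.</p></div>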
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>This work presents a strong pipeline for converting QA examples into NLI examples, with the intent of verifying the answer with NLI predictions. The answer to the question posed in the title is yes (NLI models can validate these examples), with two caveats. First, it is helpful to create QA-specific data for the NLI model. Second, the information that is sufficient for a question to be fully answered may not align with annotations in the QA dataset. We encourage further exploration of the interplay between these tasks and careful analysis of the predictions of QA models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Statistics of the Converted Datasets</head><p>The statistics of the datasets after processing through our pipeline are shown in Table <ref type="table">2</ref>. The premise and hypothesis lengths are quite similar across datasets, except for the premise length of TriviaQA, even though the original context lengths differ greatly <ref type="bibr">(Fisch et al., 2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Model Details</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Answer Generator</head><p>We train our Bert-joint model on the full NQ training set for 1 epoch. We initialize the model with bert-large-uncased-wwm. The batch size is set to 8, the window size is set to 512, and the optimizer is Adam (Kingma and Ba, 2015) with an initial learning rate of 3e-5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Question Converter</head><p>Each instance of the input is constructed as</p><p>, where [CLS] and [S] are the classification and separator tokens of the T5 model, respectively. The output is the target sentence d.</p><p>The model is trained using the seq2seq framework of Huggingface <ref type="bibr">(Wolf et al., 2020)</ref>. The maximum source sequence length is set to 256 and the maximum target sequence length is set to 512. The batch size is set to 12, and we use Deepspeed for memory optimization <ref type="bibr">(Rasley et al., 2020)</ref>. We train the model on 86k question-answer pairs for 1 epoch with the Adam optimizer and an initial learning rate of 3e-5. 95% of the question-answer pairs come from SQuAD and the remaining 5% come from four other question answering datasets <ref type="bibr">(Demszky et al., 2018)</ref>.</p><p>In Table <ref type="table">2</ref>, "Prem len" and "Hyp len" denote the average number of words, with stop words removed, in the premise and hypothesis, respectively; "Word Overlap" denotes the Jaccard similarity between the premise and the hypothesis.</p></div>
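<div xmlns="http://www.tei-c.org/ns/1.0"><p>The word-overlap statistic described above can be sketched as a Jaccard similarity over content-word sets. This is a minimal sketch, not the authors' script: the stop-word list, the whitespace tokenization with punctuation stripping, and the function names content_words and jaccard are assumptions made for illustration, since the paper does not specify these details.</p>

```python
# Hypothetical stop-word list for illustration only; the paper does not say which list was used.
STOP_WORDS = frozenset(
    {"a", "an", "the", "of", "in", "on", "to", "and", "is", "was", "were", "that"}
)


def content_words(text: str) -> list[str]:
    """Lowercase, split on whitespace, strip surrounding punctuation, drop stop words."""
    tokens = (tok.strip(".,;:!?\"'()") for tok in text.lower().split())
    return [tok for tok in tokens if tok and tok not in STOP_WORDS]


def jaccard(premise: str, hypothesis: str) -> float:
    """Jaccard similarity between the content-word sets of premise and hypothesis."""
    p = set(content_words(premise))
    h = set(content_words(hypothesis))
    union = p.union(h)
    # Two empty texts are treated as having zero overlap (a convention chosen here).
    return len(p.intersection(h)) / len(union) if union else 0.0
```

<p>Under the same assumptions, the "Prem len" and "Hyp len" columns would correspond to averaging len(content_words(text)) over all premises or hypotheses in a dataset.</p></div>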
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Decontextualizer</head><p>Each instance of the input is constructed as follows:</p><p>[CLS] T [S] x_1, ..., x_{t-1} [S] x_t [S] x_{t+1}, ..., x_n [S], where [CLS] and [S] are the classification and separator tokens of the T5 model, respectively. T denotes the context title, which could be empty. x_i denotes the i-th sentence in the context and x_t is the target sentence to decontextualize.</p><p>The model is trained using the seq2seq framework of Huggingface <ref type="bibr">(Wolf et al., 2020)</ref>. The maximum sequence length for both source and target is set to 512. The batch size is set to 4, and we use Deepspeed for memory optimization <ref type="bibr">(Rasley et al., 2020)</ref>. We train the model on 11k question-answer pairs <ref type="bibr">(Choi et al., 2021)</ref> for 5 epochs with the Adam optimizer and an initial learning rate of 3e-5.</p></div>
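<div xmlns="http://www.tei-c.org/ns/1.0"><p>The input layout described above can be sketched as a small string-building helper. This is an illustrative sketch only: the function name build_decontext_input, the literal marker strings, and the choice to join neighbouring sentences with a single space are assumptions, and the actual special tokens depend on the tokenizer used.</p>

```python
from typing import Sequence

# Placeholder marker strings; the real special tokens depend on the tokenizer (assumption).
CLS, SEP = "[CLS]", "[S]"


def build_decontext_input(title: str, sentences: Sequence[str], target_index: int) -> str:
    """Lay out title, preceding context, target sentence, and following context."""
    if target_index not in range(len(sentences)):
        raise IndexError("target_index must point at one of the context sentences")
    # Sentences before and after the target are joined with spaces (an assumption).
    before = " ".join(sentences[:target_index])
    target = sentences[target_index]
    after = " ".join(sentences[target_index + 1:])
    # The title may be the empty string, matching the note that T could be empty.
    return f"{CLS}{title}{SEP}{before}{SEP}{target}{SEP}{after}{SEP}"
```

<p>For example, calling build_decontext_input("Florida", ["A.", "B.", "C."], 1) places "B." between the second and third separators as the sentence to decontextualize.</p></div>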
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 NQ-NLI</head><p>The generated NQ-NLI training and development sets contain 191k and 4,855 (premise, hypothesis) pairs from NQ, respectively. We initialize the model with roberta-large <ref type="bibr">(Liu et al., 2019)</ref> and train it for 5 epochs. The batch size is set to 16, with Adam as the optimizer and an initial learning rate of 2e-6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Performance Breakdown on All Datasets</head></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Our approach could be adapted to multiple choice QA, in which case this step could be omitted.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>After filtering, we have 191,022/4,855 examples for the training and development sets respectively. For comparison, the original NQ contains 307,373/7,842 examples for training and development.</p></note>
		</body>
		</text>
</TEI>
