<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Natural Language Deduction through Search over Statement Compositions</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10423390</idno>
					<idno type="doi"></idno>
					<title level='j'>Findings of the Association for Computational Linguistics: EMNLP 2022</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Kaj Bostrom</author><author>Zayne Sprague</author><author>Swarat Chaudhuri</author><author>Greg Durrett</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In settings from fact-checking to question answering, we frequently want to know whether a collection of evidence (premises) entails a hypothesis. Existing methods primarily focus on the end-to-end discriminative version of this task, but less work has treated the generative version in which a model searches over the space of statements entailed by the premises to constructively derive the hypothesis. We propose a system for doing this kind of deductive reasoning in natural language by decomposing the task into separate steps coordinated by a search procedure, producing a tree of intermediate conclusions that faithfully reflects the system’s reasoning process. Our experiments on the EntailmentBank dataset (Dalvi et al., 2021) demonstrate that the proposed system can successfully prove true statements while rejecting false ones. Moreover, it produces natural language explanations with a 17% absolute higher step validity than those produced by an end-to-end T5 model.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>When we read a passage from a novel, a Wikipedia entry, or any other piece of text, we gather meaning from it beyond what is written on the page. We make inferences based on the text by combining information across multiple statements and by applying our background knowledge. This ability to synthesize meaning and determine the consequences of a set of statements is a significant part of natural language understanding. Humans are able to give step-by-step explanations of the reasoning that they do as part of these processes. However, approaches that involve end-to-end discriminative fine-tuning of pre-trained language models have no such notion of step-by-step deduction; these models are black boxes and do not offer explanations for their predictions. This limitation prevents users from understanding and accommodating models' affordances <ref type="bibr">(Hase and Bansal, 2020;</ref><ref type="bibr">Bansal et al., 2021)</ref>.</p><p>A simple way of representing this kind of reasoning process is through an entailment tree <ref type="bibr">(Dalvi et al., 2021)</ref>: a derivation indicating how each intermediate conclusion was composed from its premises, exemplified in Figure <ref type="figure">1</ref>. Generative sequence-to-sequence models can be fine-tuned to carry out this task given example trees <ref type="bibr">(Tafjord et al., 2021;</ref><ref type="bibr">Dalvi et al., 2021)</ref>. However, we argue in this paper that this conflates the generation of individual reasoning steps with the planning of the overall reasoning process. An end-to-end model is encouraged by its training objective to generate steps that arrive at the goal, but it is not constrained to do so by following a sound structure. As we will show, the outputs of such methods may skip steps or draw unsound conclusions from unrelated premises while claiming a hypothesis is proven. 
Generating an explanation is not enough; we need explanations to be consistent and faithfully reflect a reasoning process to which the model commits <ref type="bibr">(Jacovi and Goldberg, 2020)</ref>.</p><p>This paper proposes a system that factors the reasoning process into discrete search over intermediate steps. The core of our system is a step deduction module that generates the direct consequence of composing a pair of statements. This module is used as a primitive in a search procedure over entailed statements, guided by a learned heuristic function. By decoupling the deduction itself from the search over statement combinations, our system's design ensures that each step's conclusion builds on its inputs, avoiding the pitfalls of end-to-end generation. We evaluate our method on the EntailmentBank dataset <ref type="bibr">(Dalvi et al., 2021)</ref>. Thanks to our system's factored design and its ability to capitalize on additional semi-synthetic single step data <ref type="bibr">(Bostrom et al., 2021)</ref>, we observe that 82% of the reasoning steps produced by our approach are sound, a 17% absolute increase compared to an end-to-end model replicated from prior work.</p><p>Our contributions are: (1) A factored, interpretable system for natural language deduction, separating the concerns of generating intermediate conclusions from those of planning entailment tree structures; (2) Exploration of several search heuristics, including a learned goal-oriented heuristic; (3) Comparison to an end-to-end model from prior work <ref type="bibr">(Dalvi et al., 2021)</ref> in two settings of varying difficulty.</p><p>[Figure: search illustration with the example premises "Winter has the least sunlight", "December is during the winter in the northern hemisphere", and "New York is a state in the United States of America"; a composition step labeled "x1, x4"; and the annotation "&#8627; True, so return".]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Description and Motivation</head><p>The general setting we consider is shown in Figure <ref type="figure">2</ref>. We assume we are given a collection of evidence sentences X = {x_1, ..., x_n}<note place="foot" n="1">In the settings we consider in this paper, n does not exceed 25. However, this is not a hard limit imposed by our approach, and for very large premise sets, a retrieval system could be used to prune the premise set to a manageable size.</note> and a goal statement g. We want to construct an entailment tree deriving g from X. An entailment tree is a tree of statements with the property that each statement is directly entailed by its children. Thus, if we can produce a tree with root g and leaves X, it follows that g is transitively entailed by the premises in X.</p><p>Crucially, we assume that the intermediate nodes of this tree must be generated and are not present in the input premise set X. This condition differs from prior work on question answering with multihop reasoning <ref type="bibr">(Yang et al., 2018;</ref><ref type="bibr">Chen et al., 2019)</ref> or models that build a proof structure but do not generate new statements <ref type="bibr">(Saha et al., 2020,</ref> <ref type="bibr">2021)</ref>. We therefore require a generative step deduction model S to produce intermediate conclusions given their immediate children, a concept explored by <ref type="bibr">Tafjord et al. (2021)</ref> and <ref type="bibr">Bostrom et al. (2021)</ref>.</p><p>S yields a distribution p_S(y | x_1, ..., x_m) over step conclusions y conditioned on inputs x_i. In our approach, we assume that each step has exactly m = 2 inputs. Some nodes in our evaluation dataset, EntailmentBank, have more than two children, but in preliminary investigation we found it possible to express the reasoning in these steps in a binary branching format. We do not apply this arity constraint to baseline models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Methods</head><p>Our proposed system consists of a step model S, a search procedure involving a heuristic h, and an entailment model which judges whether a generated conclusion entails the goal. An overview of the responsibilities and task data required by each module is presented in Table <ref type="table">1</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Step Deduction Model</head><p>Our step models are instances of the T5 pretrained sequence-to-sequence model <ref type="bibr">(Raffel et al., 2020)</ref> fine-tuned on a combination of deduction step datasets from prior work <ref type="bibr">(Dalvi et al., 2021;</ref> <ref type="bibr">Bostrom et al., 2021)</ref>, which we discuss further in Section 4.4. At inference time, we decode from these models using nucleus sampling <ref type="bibr">(Holtzman et al., 2020)</ref> with p = 0.9.</p><p>Table <ref type="table">1</ref>: Each module in our proposed system operates independently, and most components can leverage existing resources without the need for full tree data. *Our best-performing parametric heuristic uses full tree supervision, but nonparametric heuristics are supported.</p></div>
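<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the decoding setting concrete, the following is a minimal sketch of nucleus (top-p) filtering with p = 0.9 over a single next-token distribution. The toy vocabulary and probabilities are invented for illustration, and the function names are hypothetical; the actual system decodes from T5 rather than from a hand-written table.</p>
<p><![CDATA[
```python
import random

def nucleus_filter(probs, p=0.9):
    """Keep the smallest set of most-probable tokens whose total mass reaches p,
    then renormalize the kept probabilities so they sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}

def nucleus_sample(probs, p=0.9, rng=random):
    """Draw one token from the renormalized nucleus."""
    filtered = nucleus_filter(probs, p)
    candidates = list(filtered)
    weights = [filtered[t] for t in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Hypothetical next-token distribution (not taken from the paper).
toy = {"stars": 0.5, "planets": 0.3, "moons": 0.15, "paper": 0.05}
nucleus = nucleus_filter(toy, p=0.9)   # the low-probability tail ("paper") is dropped
sampled = nucleus_sample(toy, p=0.9, rng=random.Random(0))
```
]]></p>
<p>Because the tail of the distribution is truncated before sampling, repeated runs with different seeds can yield different conclusions for the same pair of inputs while still avoiding very unlikely tokens.</p></div>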
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Search</head><p>We define a search procedure over sentence compositions that we call SCSEARCH, described in Algorithm 1. SCSEARCH is a best-first search procedure where the search fringe data structure is a max-heap of statement pairs ordered by the heuristic scoring function h. SCSEARCH iteratively combines statement pairs using the step deduction model, adding generated conclusions to the forest of entailed statements. The heuristic function h({x_1, ..., x_m}, g) &#8594; R accepts candidate step inputs x_i and a goal hypothesis g and returns a real-valued score indicating the priority of the potential step in the expansion order.</p><p>Algorithm 1: procedure SCSEARCH(X = {x_1, ..., x_n}, g)</p><p>These priority values reflect multiple factors. We want to prioritize compatible compositions: combining statements from which we can make a meaningful inference, as opposed to unrelated sentences. We also want to prioritize useful steps: among compatible compositions, we should prefer those that are likely to derive statements that help prove the hypothesis.</p><p>We consider several potential realizations of the heuristic function h:</p><p>Breadth-first: Naively, the earliest fringe items are explored first; all compositions of initial premises will be explored before any composition involving an intermediate conclusion.</p><p>Overlap: This heuristic scores potential steps according to the number of tokens shared between input sentences. This heuristic is focused on compatibility, as overlap indicates expressions that might be unifiable (e.g., paper in the first composition step of Figure <ref type="figure">1</ref>). In the Overlap+Goal version, token overlap with the goal hypothesis is also incorporated into the score.</p><p>Repetition: Past work on step deduction models <ref type="bibr">(Bostrom et al., 2021)</ref> has identified that these models tend to "back off" to copying the input when given incompatible premises.</p><p>This heuristic aims to exploit this behavior as a measure of premise compatibility. Potential steps are scored according to -p_S(x_1 | x_1, ..., x_n), i.e., the negative likelihood of repeating the first input.</p><p>Learned: This heuristic uses an additional pretrained model fine-tuned to predict whether input statements are part of a gold explanation of the hypothesis or not. We train this model on sets of step inputs drawn from a collection of valid steps augmented with negative samples produced by replacing one input with a random statement.</p><p>The Learned+Goal version of this heuristic is also trained with the goal hypothesis in its input, so as to be able to select useful premises and guide search towards the goal. Note that the step model S which generates statements still does not see the goal; the goal only informs which compositions are explored first during the search. See Appendix A for training details of our learned heuristic models.</p></div>
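<div xmlns="http://www.tei-c.org/ns/1.0"><p>The search described above can be sketched as a best-first loop over a priority queue of statement pairs. This is an illustrative reading of the prose, not a reproduction of Algorithm 1: the step model is replaced by a hypothetical lookup table, goal entailment by string equality, and the Overlap+Goal score by a simple token count. All function names and the toy data are assumptions.</p>
<p><![CDATA[
```python
import heapq
from itertools import combinations, count

def tokens(sentence):
    """Lowercased whitespace tokens with the final period removed."""
    return set(sentence.lower().rstrip(".").split())

def overlap_goal(pair, goal):
    """Illustrative Overlap+Goal score: tokens shared by the inputs plus tokens shared with the goal."""
    a, b = pair
    return len(tokens(a) & tokens(b)) + len((tokens(a) | tokens(b)) & tokens(goal))

def sc_search(premises, goal, step_model, entails, heuristic, max_steps=20):
    """Best-first search over statement pairs; returns (proving statement or None, derivations)."""
    statements = list(premises)
    derived_from = {}            # conclusion -> the pair of statements it was generated from
    tie = count()                # tie-breaker so equal scores pop in insertion order
    fringe = []                  # heapq is a min-heap, so scores are negated
    for pair in combinations(statements, 2):
        heapq.heappush(fringe, (-heuristic(pair, goal), next(tie), pair))
    for _ in range(max_steps):
        if not fringe:
            break
        _, _, (a, b) = heapq.heappop(fringe)
        conclusion = step_model(a, b)       # the step model never sees the goal
        if conclusion in statements:
            continue                        # nothing new was derived
        derived_from[conclusion] = (a, b)
        if entails(conclusion, goal):
            return conclusion, derived_from
        for other in statements:
            pair = (other, conclusion)
            heapq.heappush(fringe, (-heuristic(pair, goal), next(tie), pair))
        statements.append(conclusion)
    return None, derived_from

# Toy stand-ins (hypothetical): a lookup table instead of T5, string equality instead of DeBERTa.
s1 = "leo is a kind of constellation."
s2 = ("the earth revolving around the sun causes stars to appear in different "
      "areas in the sky at different times of year.")
s3 = "a constellation contains stars."
int1 = "leo is a constellation containing stars."
goal = ("the earth revolving around the sun causes leo to appear in different "
        "areas in the sky at different times of year.")
RULES = {frozenset([s1, s3]): int1, frozenset([int1, s2]): goal}

def toy_step_model(a, b):
    # Incompatible inputs fall back to copying the first input.
    return RULES.get(frozenset([a, b]), a)

def toy_entails(statement, target):
    return statement == target

proved, derived_from = sc_search([s1, s2, s3], goal, toy_step_model, toy_entails, overlap_goal)
```
]]></p>
<p>In this sketch the goal influences only the order in which pairs are expanded, which mirrors the separation between planning and generation that the section describes. Each entry in the derivation map records the two inputs of one step, so the resulting tree can be read off by following conclusions back to premises.</p></div>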
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Goal Entailment</head><p>In order to determine when the search has succeeded, we need a module to judge the entailment relationship between each generated conclusion and the goal. We use a DeBERTa model <ref type="bibr">(He et al., 2021)</ref> fine-tuned on the WANLI dataset <ref type="bibr">(Liu et al., 2022)</ref> to predict the probability that each derived statement entails the goal. In order to mitigate the domain mismatch between WANLI and the scientific facts that make up EntailmentBank, we also fine-tune our goal entailment model on EBENTAIL, a set of 300 examples of generated conclusions sampled from SCSEARCH paired with corresponding goal hypotheses which we manually label for their entailment relationship.</p><p>EBENTAIL: To produce a set of reference judgments for threshold selection and entailment model evaluation, we sampled 150 instances of generated conclusions from SCSEARCH inference over EntailmentBank examples with their corresponding gold goals. Three annotators labeled the entailment relationship between these generated inferences and the goal. We select a consensus annotation with a majority vote. These examples form an in-domain evaluation set which we call EBENTAIL. To simultaneously train and evaluate on these judgments, we use 3-fold cross-validation where each cross-validation fold is constructed to contain goals not seen in its respective training fold.</p><p>We extend EBENTAIL with EBENTAIL-ACTIVE, consisting of 150 additional instances with the lowest confidence (highest prediction entropy) following the initial fine-tuning, which are then manually labeled by at least one annotator.</p><p>Thresholding: During inference, rather than returning the highest-scoring class, a threshold value &#945; is applied to the predicted probability of the 'entailment' class. This threshold allows for better control over the trade-off between precision and recall. For our main experiments, we use an entailment score threshold of &#945; = 0.81 selected via cross-validation on EBENTAIL.</p>
<figure><head>Figure <ref type="figure">3</ref>: An example of the EntailmentWriter end-to-end system's linearized input and output format. EntailmentWriter takes the goal hypothesis as input, making it possible to hallucinate content based on it, and generation of &#8594; hypot may occur prematurely, before enough evidence is included to truly derive the goal.</head><p>Input to T5: hypot: the earth revolving around the sun causes leo to appear in different areas in the sky at different times of year. sent1: leo is a kind of constellation. sent2: the earth revolving around the sun causes stars to appear in different areas in the sky at different times of year. sent3: a constellation contains stars.</p><p>Output: sent1 &amp; sent3 &#8594; int1: leo is a constellation containing stars. int1 &amp; sent2 &#8594; hypot</p></figure>
<table><head>Table <ref type="table">2</ref>: Examples of instances from the EBENTAIL dataset. The goal hypotheses are from EntailmentBank, and the statements are from the SCSEARCH algorithm applied to the original EntailmentBank premises. Although the third example contains a contradiction, the 'neutral' and 'contradiction' labels are merged in EBENTAIL, as both reflect a failure to entail the goal.</head>
<row role="label"><cell>Statement</cell><cell>Goal Hypothesis</cell><cell>Label</cell></row>
<row><cell>A fish is a kind of scaled animal that uses their scales to defend themselves.</cell><cell>Scales are used for protection by fish.</cell><cell>Entailment</cell></row>
<row><cell>Information in an organism's chromosomes causes an inherited characteristic to be passed from parent to offspring by dna.</cell><cell>Children usually resemble parents.</cell><cell>Neutral</cell></row>
<row><cell>A rheumatoid arthritis will happen first before other diseases affect the body's tissues.</cell><cell>The immune system becomes disordered first before rheumatoid arthritis occurs.</cell><cell>Neutral</cell></row>
</table></div>
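<div xmlns="http://www.tei-c.org/ns/1.0"><p>The effect of thresholding the 'entailment' probability can be illustrated with a short sketch. The scores and labels below are invented, the function names are hypothetical, and only the value &#945; = 0.81 comes from the text; the sketch shows how a stricter threshold trades recall for precision.</p>
<p><![CDATA[
```python
def reaches_goal(entailment_probs, alpha=0.81):
    """A search counts as successful if any generated conclusion scores above alpha."""
    return any(p > alpha for p in entailment_probs)

def precision_recall(scores, labels, alpha):
    """Precision and recall of the 'entailment' decision at threshold alpha."""
    predicted = [s > alpha for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical entailment probabilities and gold labels (1 = entails the goal).
scores = [0.95, 0.85, 0.70, 0.60, 0.30]
labels = [1, 1, 0, 1, 0]
strict = precision_recall(scores, labels, alpha=0.81)    # fewer false positives, lower recall
lenient = precision_recall(scores, labels, alpha=0.50)   # higher recall, one false positive
```
]]></p>
<p>On this toy data the stricter threshold accepts only the two highest-scoring positives, while the lenient threshold recovers the third positive at the cost of admitting a negative.</p></div>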
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experimental Setup</head><p>Our experiments assess whether our SCSEARCH system, which factors the deduction process into separate step generation and search modules, can do better than end-to-end baselines on two axes:</p><p>(1) proving correct (and only correct) goals, and (2) producing more consistent entailment steps in the tree.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Evaluation: Goal Discrimination</head><p>We evaluate our models in two settings, both derived from the validation and test sets of the English-language EntailmentBank dataset. Each setting consists of a 1:1 mixture of examples with valid goals (the original EntailmentBank validation examples) and negative examples with invalid goals, produced by replacing the goal of each positive example with a distinct one drawn from another example. Each system is evaluated on whether it can prove the correct goals with valid steps and successfully reject incorrect goals.</p><p>To construct hard negatives, candidate replacement goals are ranked according to TF-IDF weighted overlap between tokens in the destination example and tokens in their original example. For each negative example, the replacement goal with the highest overlap score is selected, excluding goals from examples whose premise sets are subsets of the destination example's premises. We manually check negative examples to ensure they cannot be derived from the provided premises.</p><p>In Task 1, examples contain only gold premises (between 2 and 15), while in Task 2, each premise set is expanded to 25 premises through the addition of distractors retrieved from the original premise corpora. We set the maxSteps hyperparameter of the SCSEARCH algorithm to 20. The maximum gold tree step count in EntailmentBank is 17, so our approach can theoretically recover any binarized gold tree given the right heuristic scores.</p><p>Note that we focus our evaluation on this goal discrimination task and validating that individual steps of entailment trees are correct, not on recovering the exact trees in EntailmentBank. Our deduction model frequently constructs correct entailment trees that do not match the reference, particularly since our approach is not trained end-to-end on this dataset.</p></div>
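<div xmlns="http://www.tei-c.org/ns/1.0"><p>The hard-negative selection can be sketched as follows. The exact weighting used in the paper is not specified here, so this sketch assumes a simple variant: each example is treated as a bag of tokens (premises plus goal), IDF is computed over examples, and the overlap score is the summed IDF of shared tokens. The example data, field names, and function names are hypothetical.</p>
<p><![CDATA[
```python
import math

def example_tokens(example):
    """Token set of an example: all premises plus the goal, lowercased, periods removed."""
    text = " ".join(example["premises"]) + " " + example["goal"]
    return set(text.lower().replace(".", "").split())

def idf_table(examples):
    """Inverse document frequency of each token, treating each example as one document."""
    n = len(examples)
    df = {}
    for example in examples:
        for tok in example_tokens(example):
            df[tok] = df.get(tok, 0) + 1
    return {tok: math.log(n / d) for tok, d in df.items()}

def pick_hard_negative_goal(dest, examples, idf):
    """Return the goal of the candidate with the highest IDF-weighted token overlap with dest,
    skipping candidates whose premises are a subset of dest's premises."""
    best_goal, best_score = None, -1.0
    dest_tokens = example_tokens(dest)
    for cand in examples:
        if cand is dest or cand["goal"] == dest["goal"]:
            continue
        if set(cand["premises"]) <= set(dest["premises"]):
            continue  # its goal might be derivable from dest's premises
        score = sum(idf[t] for t in dest_tokens & example_tokens(cand))
        if score > best_score:
            best_goal, best_score = cand["goal"], score
    return best_goal

# Hypothetical miniature dataset.
A = {"premises": ["leo is a kind of constellation", "a constellation contains stars"],
     "goal": "leo contains stars"}
B = {"premises": ["stars are located in space", "orion is a kind of constellation"],
     "goal": "orion is located in space"}
C = {"premises": ["a fish has scales", "scales are used for protection"],
     "goal": "a fish uses scales for protection"}
D = {"premises": ["a constellation contains stars"],
     "goal": "stars are in a constellation"}

hard_negative = pick_hard_negative_goal(A, [A, B, C], idf_table([A, B, C]))
without_subset = pick_hard_negative_goal(A, [A, D, C], idf_table([A, D, C]))
```
]]></p>
<p>The subset exclusion matters because a goal taken from an example whose premises are all present in the destination example could still be provable, which would make it an invalid negative. As in the paper, automatically selected negatives would still need a manual check.</p></div>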
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">End-to-end Baseline</head><p>We compare against an End-to-end T5 model that we train following the EntailmentWriter paradigm <ref type="bibr">(Dalvi et al., 2021)</ref>. The EntailmentWriter system involves fine-tuning a sequence-to-sequence language model to generate an entire entailment tree in linearized form, conditioned on the concatenation of a set of premises and a hypothesis. The EntailmentWriter tree linearization format is shown in Figure <ref type="figure">3</ref>. In the original work, <ref type="bibr">Dalvi et al. (2021)</ref> fine-tune T5-11b <ref type="bibr">(Raffel et al., 2020)</ref>; we replicate their training setup using T5-3b instead for parity with our other experiments.</p><p>In order to evaluate whether an end-to-end model intrinsically distinguishes between valid and invalid entailment tree structures, we use the average output confidence over trees generated by the model trained without negative examples, computed as the mean token log-likelihood of the generated tree, i.e., the sum of the token log-probabilities divided by the number of generated tokens. This is motivated by the hypothesis that a model trained as a density estimator for trees composed of sound steps should assign low likelihood to unsound trees. We fit a linear model to predict the goal validity &#8712; {0, 1} based on this quantity. We refer to this discriminative setup as End-to-end (Intrinsic).</p><p>We also train a variant of the end-to-end baseline, End-to-end T5 (Classify), to explicitly predict whether a given goal is valid by including a flag token T or F at the start of the model's output. We augment the model's training data with an equal number of negative examples by randomly resampling goal hypotheses. We prepend T to the target sequence of positive examples, while the target sequence for negative examples is F. Note that this model can predict T and then output a nonsensical entailment tree, as the trees are post-hoc explanations of the decision.</p></div>
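<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small worked example of the intrinsic confidence score, the mean token log-likelihood can be computed from per-token probabilities as below. The probabilities are invented and the function name is hypothetical; in practice the values would come from the decoder's output distribution.</p>
<p><![CDATA[
```python
import math

def mean_token_log_likelihood(token_probs):
    """Average of log p(token) over the generated sequence (length-normalized confidence)."""
    return sum(math.log(p) for p in token_probs) / len(token_probs)

# Hypothetical per-token probabilities for two generated trees.
confident_tree = [0.9, 0.8, 0.95, 0.85]
hesitant_tree = [0.9, 0.2, 0.95, 0.3]

confident_score = mean_token_log_likelihood(confident_tree)
hesitant_score = mean_token_log_likelihood(hesitant_tree)
```
]]></p>
<p>Dividing by the number of tokens keeps long trees from being penalized simply for their length, so the score can be compared across outputs of different sizes.</p></div>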
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Metrics</head><p>For both Task 1 and Task 2, consistent models should be able to reach gold goals while also failing to prove invalid goals. To measure the former, we report the number of valid goals reached by each system as Goal%. For our search systems, this metric is computed as the proportion of positive examples for which any generated conclusion had a goal entailment score higher than the threshold &#945; = 0.81 (see Section 3.3). Goal% scores for the end-to-end model correspond to its self-reported success rate: the proportion of positive examples for which the model emits the T token.</p><p>We report the average number of steps expanded before reaching a valid goal as #Steps. This metric is averaged over examples for which a system is able to reach the goal.</p><p>To measure whether systems are able to distinguish invalid goals from valid goals, we also compute precision-recall curves and report the area under the receiver operating characteristic (AUROC) of each system for both tasks. We produce these curves for search models by varying the goal entailment threshold &#945;. For the End-to-end (Intrinsic) model, we vary a threshold on the average generated token likelihood, and for the End-to-end (Classify) model, we vary a threshold on the value of p(T)/(p(T) + p(F)), the score assigned to the "valid" flag token by the model out of the two possible validity flags.</p><p>In addition to evaluating end-to-end performance through the above tasks, we would also like to understand the internal consistency of generated trees. To that end, we conduct a manual study of step validity. We sample 100 steps uniformly across valid-goal Task 2 examples for each of three systems: our full system, our system without mid-training, and the end-to-end system. The resulting set of 300 steps is shuffled and labeled by three annotators without knowledge of which examples came from which system. For each example, the annotators assess whether a single step's conclusion can be inferred from the input premises with minimal additional world knowledge.</p><p>We discuss these results in Section 5.2.</p>
<table><head>Table <ref type="table">3</ref>: Results from our main experiments on the EntailmentBank test sets. Mean &#177; standard deviation is reported for each metric, taken across 10 trials varying the random seed used for nucleus sampling. Goal% indicates the proportion of valid goals reached by a system's generated trees using the &#945; = 0.81 threshold. AUROC indicates the area under the receiver operating characteristic when attempting to distinguish gold goals from invalid goals. #Steps indicates the average number of steps expanded before reaching the goal among trees which reached valid goals; this metric's standard deviation is computed at the example level. See Section 4.3 for more details.</head>
<row role="label"><cell>System</cell><cell>Task 1 Goal%</cell><cell>Task 1 AUROC</cell><cell>Task 1 #Steps</cell><cell>Task 2 Goal%</cell><cell>Task 2 AUROC</cell><cell>Task 2 #Steps</cell></row>
<row><cell>End-to-end (Classify)</cell><cell>100.0 &#177; 0.0</cell><cell>0.97 &#177; 0.00</cell><cell>2.8 &#177; 1.6</cell><cell>100.0 &#177; 0.0</cell><cell>0.95 &#177; 0.00</cell><cell>2.2 &#177; 1.2</cell></row>
<row><cell>End-to-end (Intrinsic)</cell><cell>-</cell><cell>0.57 &#177; 0.02</cell><cell>-</cell><cell>-</cell><cell>0.62 &#177; 0.02</cell><cell>-</cell></row>
</table>
<table><head>Table <ref type="table">4</ref>: Steps are sampled from inference on Task 2. Macroaverage inter-annotator agreement (Cohen's &#954;) is 0.72.</head>
<row role="label"><cell>System</cell><cell>Step Validity</cell></row>
<row><cell>Learned (Goal)</cell><cell>74.0%</cell></row>
<row><cell>SCSearch</cell><cell>82.3%</cell></row>
<row><cell>End-to-end</cell><cell>65.0%</cell></row>
</table></div>
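<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two scoring quantities used for goal discrimination can be sketched in a few lines: the normalized flag score p(T)/(p(T) + p(F)) and an AUROC computed from its rank interpretation (the probability that a randomly chosen valid goal is scored above a randomly chosen invalid one). The inputs are invented and the function names are hypothetical; this is not the evaluation code used in the paper.</p>
<p><![CDATA[
```python
def valid_flag_score(p_true, p_false):
    """Probability of the 'valid' flag renormalized over the two flag tokens."""
    return p_true / (p_true + p_false)

def auroc(scores, labels):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))

# Hypothetical scores for four goals (label 1 = valid goal, 0 = invalid goal).
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
area = auroc(scores, labels)
flag = valid_flag_score(0.6, 0.2)
```
]]></p>
<p>An AUROC of 0.5 corresponds to chance-level ranking, which gives a reference point for reading the End-to-end (Intrinsic) values reported in Table 3.</p></div>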
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Implementation and Training</head><p>All sequence-to-sequence language models we experiment with are instances of T5-3b <ref type="bibr">(Raffel et al., 2020)</ref> with 3 billion parameters. Bridging entailment models and learned heuristic models are derived from DeBERTa Large <ref type="bibr">(He et al., 2021)</ref> with 350M parameters. Fine-tuning is performed with the Hugging Face transformers library <ref type="bibr">(Wolf et al., 2020)</ref>. Further details including fine-tuning hyperparameters are included in the appendix.</p><p>Step deduction model We train our step deduction model using gold steps sampled from trees in the EntailmentBank (EB) training split, totaling 2,762 examples. Our full system, which we refer to as SCSearch, is also "mid-trained" on ParaPattern substitution data <ref type="bibr">(Bostrom et al., 2021)</ref>.</p><p>The ParaPattern data is derived semi-synthetically from English Wikipedia, totaling &#8764;120k examples.</p><p>In the mid-training configuration, an instance of T5 is fine-tuned for one epoch on the ParaPattern data and then for one epoch on the EntailmentBank data, after which optimal validation loss is reached.</p><p>Learned heuristic models Data for our learned heuristic models is constructed by taking step inputs from the EntailmentBank training set's gold trees. For each positive step example, a corresponding negative example is produced by replacing one input statement with a sentence drawn at random from all statements in the training set that do not appear in the original step's subtree.</p><p>Examples used to train the Learned (Goal) heuristic additionally contain the original step's gold goal concatenated to their input. Our full SCSearch system uses the Learned (Goal) heuristic.</p></div>
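<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction of training pairs for the learned heuristic can be sketched as follows. The statements, the data layout, and the function name are hypothetical; the sketch assumes each gold step comes with the set of statements in its subtree and that a pool of training-set statements is available.</p>
<p><![CDATA[
```python
import random

def make_heuristic_examples(step_inputs, subtree, pool, goal=None, rng=random):
    """Build one positive and one negative example from a gold step.

    The negative replaces one input with a random statement from the pool that
    does not appear in the step's subtree. If a goal is given, it is appended to
    both inputs, as in the goal-conditioned variant of the heuristic."""
    candidates = [s for s in pool if s not in subtree]
    corrupted = list(step_inputs)
    corrupted[rng.randrange(len(corrupted))] = rng.choice(candidates)
    suffix = [goal] if goal is not None else []
    positive = (list(step_inputs) + suffix, 1)
    negative = (corrupted + suffix, 0)
    return positive, negative

# Hypothetical statements.
a = "leo is a kind of constellation."
b = "a constellation contains stars."
c = "a fish has scales."
d = "scales are used for protection."
goal = "leo is a constellation containing stars."

positive, negative = make_heuristic_examples(
    [a, b], subtree={a, b}, pool=[a, b, c, d], goal=goal, rng=random.Random(0)
)
```
]]></p>
<p>Restricting replacements to statements outside the step's subtree reduces the chance that a sampled negative is in fact a valid input for that step.</p></div>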
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Results</head><p>The results of our experiments on Task 1 and Task 2 are shown in Table <ref type="table">3</ref>. Table <ref type="table">4</ref> shows the results of our manual step validity evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Complete Deductions</head><p>The end-to-end approach is not a sound model of entailment tree structure. End-to-end T5 (Classify) is nominally able to "prove" 100% of gold goals by finishing a proof with -&gt; hypot. However, as shown in Figure <ref type="figure">4</ref>, using the generation confidence of End-to-end T5 to discriminate between valid goals and invalid goals (the Intrinsic method) is not much better than random chance, since the model has similar confidence when "proving" invalid goals as it does when generating trees for valid goals. This means that the model's output distribution does not penalize the generation of invalid steps.</p><p>Our approach is able to prove most goals while exhibiting much better internal consistency than the end-to-end approach. Our full SCSearch system nearly matches the performance of the End-to-end (Classify) system on Task 1 and Task 2, losing chiefly in high-recall settings. Critically, Section 5.2 will show that it achieves a much higher rate of step validity in the process. Mid-training the step deduction model also increases the proportion of reachable valid goals by 3-5% (compare Learned (Goal) to SCSearch in Table <ref type="table">3</ref>). It is worth noting that the chosen threshold for our goal entailment module sacrifices recall in favor of avoiding false positives, as shown in Table <ref type="table">5</ref>, meaning that our reported Goal% rate is an underestimate. In Figure <ref type="figure">4</ref>, we see that our SCSearch method can achieve 80% recall at roughly 80% precision with a slightly lower threshold.</p><p>A goal-oriented heuristic is critical. If we compare Breadth-first, Repetition, our two Overlap methods, and our two Learned heuristics, both Table <ref type="table">3</ref> and Figure <ref type="figure">4</ref> show that incorporating goal information into the planning process is essential for Task 2, as only the Overlap (Goal) and Learned (Goal) heuristics are able to reach a reasonable number of valid goals in the presence of distractor premises. We can see in Figure <ref type="figure">4</ref> how much breadth-first degrades from Task 1 to Task 2, largely due to timing out.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Individual Step Validity</head><p>The most crucial divergence between our method and the end-to-end method arises in the evaluation of individual steps. The End-to-end (Classify) method can produce correct decisions, and as shown in Figure <ref type="figure">3</ref> it can always claim to have produced the goal statement, but is its reasoning sound?</p><p>Table <ref type="table">4</ref> shows that our best SCSearch model produces valid inferences 82% of the time, improving by 17% absolute over the end-to-end model as well as improving over the version of our system without mid-training on ParaPattern. This result confirms our hypothesis about the end-to-end method: despite often predicting the right label and predicting an associated entailment tree, these entailment trees are not as likely to be valid due to their post-hoc nature and the conflation of search and generation.</p><p>We can use our step validity rate to approximate the expected number of fully-valid trees based on the observed depth distribution. Under the conservative assumption that observed errors are distributed uniformly w.r.t. depth, the expected number of fully valid trees in a dataset D = {T_1, ..., T_|D|} for validity rate v can be computed as <formula notation="TeX">\sum_{i=1}^{|D|} v^{|T_i|}</formula>, where |T_i| is the number of steps in tree T_i. At a step validity rate of 82% we should expect &#8764;58% of trees generated by our system to be error-free. Under the same assumption, according to the end-to-end model's step validity rate we should expect only &#8764;35% of its trees to be fully valid.</p><p>In Section 5.4 we examine observed error patterns in invalid steps. Crucially, even when our system produces a tree involving an invalid step, it is easy to audit the tree and determine exactly where the reasoning error occurred, since each step is conditioned only on its immediate premises. In contrast, the end-to-end model attends to all premises and the hypothesis at every step, meaning that when an inconsistent step is generated, it is difficult to diagnose the cause.</p><p>The results in Table <ref type="table">3</ref> depend on an accurate assessment of when we have successfully deduced the hypothesis. To that end, we evaluate our goal entailment model against labeled test data.</p></div>
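<div xmlns="http://www.tei-c.org/ns/1.0"><p>The fully-valid-tree estimate in Section 5.2 can be illustrated numerically. The sketch below assumes that each step is valid independently with probability v, so a tree with k steps is fully valid with probability v^k. The step counts are invented and are not the observed distribution from the paper, so the outputs are not expected to reproduce the reported percentages.</p>
<p><![CDATA[
```python
def expected_fully_valid_fraction(step_counts, v):
    """Expected fraction of error-free trees when each step is valid with probability v."""
    return sum(v ** k for k in step_counts) / len(step_counts)

# Hypothetical step counts for a small set of generated trees.
step_counts = [1, 2, 2, 3, 4, 6]

ours = expected_fully_valid_fraction(step_counts, v=0.82)
end_to_end = expected_fully_valid_fraction(step_counts, v=0.65)
```
]]></p>
<p>Because the per-tree probability decays exponentially in the number of steps, a gap in step validity widens into a larger gap in the share of fully valid trees, especially for deeper trees.</p></div>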
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Goal Entailment</head><p>Table <ref type="table">5</ref> shows the results of our evaluation on the EBENTAIL dataset described in Section 3.3. We view precision as more important than recall, as a stringent criterion for determining whether a tree has reached the goal increases confidence in our evaluation results. Our best F1 score, using EBENTAIL-ACTIVE, is only slightly lower than the lowest F1 score of the annotators when evaluated against the majority vote. Annotator agreement is moderate; macro-averaged inter-annotator F1 is 0.83 and Cohen's &#954; is 0.54. This indicates that the problem of determining when a statement straightforwardly entails the goal is subjective; the boundary between 'trivial' entailment and a case which needs an additional reasoning step is somewhat fuzzy.</p></div>
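<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the agreement statistics in Section 5.3, the sketch below computes precision, recall, F1, and Cohen's &#954; for binary entailment labels using their standard definitions. The function names <code>precision_recall_f1</code> and <code>cohens_kappa</code> and the toy labels are hypothetical; they are not the EBENTAIL annotations, and the exact averaging behind the reported macro-averaged F1 may differ from this simplified version.</p>
<eg><![CDATA[
```python
from typing import Sequence, Tuple


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive class (1 = entails the goal)."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def cohens_kappa(a: Sequence[int], b: Sequence[int]) -> float:
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    a, b = list(a), list(b)
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("label sequences must be non-empty and of equal length")
    # Observed agreement: fraction of items on which the annotators agree.
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    labels = set(a) | set(b)
    p_chance = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if p_chance == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy labels (an assumption for illustration only).
annotator_a = [1, 1, 0, 0, 1, 0]
annotator_b = [1, 0, 0, 0, 1, 1]
print(precision_recall_f1(annotator_a, annotator_b))
print(cohens_kappa(annotator_a, annotator_b))
```
]]></eg>
<p>A &#954; well below the raw agreement rate, as reported above, signals that a sizeable share of the observed agreement could be attributed to chance given the label distribution.</p></div>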
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Error Analysis: Step Model</head><p>Although our step model cannot hallucinate based on the hypothesis, it can still fail to produce valid intermediate steps due to other challenges.</p><p>One error type we see is indiscriminate unification: when given incompatible premises, the step model will sometimes still attempt to combine them, resulting in improper conclusions. For example, given the premises "Earthquake can change the earth's surface. In a short amount of time is similar to rapidly." one conclusion generated by the model is "Earthquake changes the earth's surface rapidly." This could be avoided through the use of more selective heuristics, or by explicitly supervising step models with negative examples in order to encourage conservative conclusions in these cases.</p><p>We also observe compounding errors in conclusions generated from erroneous premises. For example, given the premises "Offspring will inherit a scar on both knees except not both knees. Offspring will inherit a scar on the knee from parents." the model generates "Parents will inherit a scar on both knees." This kind of relation assignment mistake is uncommon outside of instances involving bad premises. These errors could potentially be mitigated by training heuristics to avoid corrupted premises.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Error Analysis: Goal Entailment</head><p>An additional source of error arises from cases where the goal entailment model is unable to predict the correct entailment relationship between an output from the step model and the goal hypothesis.</p><p>One reason this arises is definitional knowledge. When given the premise "Sugar is soluble in water," the model does not predict entailment of the goal hypothesis "Sugar cubes will dissolve in water when they are combined." English speakers who know the definition of 'soluble' and recognize that sugar cubes are made of sugar could reasonably understand this as entailed. However, the degree of definitional knowledge that should be expected of the entailment model is subjective and often a source of annotator disagreement.</p><p>Another cause of errors is ill-formed statements. For example, the model predicts that "Lichens and soil are similar to being produced by breaking down rocks." entails "Lichens breaking down rocks can form soil." However, it is unclear what is "similar" in the generated statement due to poor syntax. Labels for examples like this often vary depending on how the annotator understood the step model's output. Improving the step model to reduce compounding generation errors will mitigate this issue.</p><p>Finally, the entailment may sometimes be predicated on context. The model predicts that "A new moon will occur on june 30 when the moon orbits the earth." entails "The next new moon will occur on june 30." In this case, the model is assuming that 'a new moon' occurring is equivalent to 'the next new moon' occurring. 
Depending on annotator assumptions, cases like this can also be somewhat subjective.</p><p>Future work could expand the training data of our NLI model to account for the subjectivity of NLI judgments <ref type="bibr">(Pavlick and Kwiatkowski, 2019;</ref><ref type="bibr">Chen et al., 2020;</ref><ref type="bibr">Nie et al., 2020)</ref>, particularly by modifying our data collection procedure <ref type="bibr">(Zhang et al., 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Related Work</head><p>Our work is an outgrowth of several lines of work in reading comprehension and textual reasoning. Multi-hop question answering models <ref type="bibr">(Chen et al., 2019;</ref><ref type="bibr">Min et al., 2019;</ref><ref type="bibr">Nishida et al., 2019</ref>, inter alia) also build derivations linking multiple statements to support a conclusion. However, these models organize selected premises into a chain or leave them unstructured as opposed to composing them into an explicit tree.</p><p>The NLProlog system <ref type="bibr">(Weber et al., 2019</ref>) frames multi-hop reading comprehension explicitly as a proof process, performing proof search using soft rule unification over vector representations of predicates and arguments. Similar backward search ideas were used in <ref type="bibr">Arabshahi et al. (2021)</ref>. PRover <ref type="bibr">(Saha et al., 2020)</ref> and ProofWriter <ref type="bibr">(Tafjord et al., 2021</ref>) also frame natural language deduction as proof search, although both systems are evaluated in a synthetic domain of limited complexity. <ref type="bibr">Betz and Richardson (2021)</ref> also use synthetic data to improve reasoning models through mid-training, although the improvements they observe are limited to premise selection performance. <ref type="bibr">Hu et al. (2020)</ref> and <ref type="bibr">Chen et al. (2021)</ref> propose systems which perform single-sentence natural language inference through proof search in the natural logic space. Our work also relates to earlier efforts on natural logic <ref type="bibr">(MacCartney and Manning, 2009;</ref><ref type="bibr">Angeli et al., 2016)</ref> but is able to cover far more phenomena by relaxing the strict constraints of this framework. 
Finally, the Leap of Thought system <ref type="bibr">(Talmor et al., 2020)</ref> tackles some related ideas in a discriminative reasoning framework.</p><p>The recent chain-of-thought <ref type="bibr">(Wei et al., 2022</ref>) and Scratchpads <ref type="bibr">(Nye et al., 2021)</ref> methods also generate intermediate text as part of answer prediction. However, like the end-to-end baseline we consider, these techniques are free to generate unsound derivations. Published results with these techniques are strongest for tasks involving mathematical reasoning or programmatic execution, whereas on textual reasoning datasets like StrategyQA <ref type="bibr">(Geva et al., 2021)</ref> they only mildly outperform a few-shot baseline.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion and Conclusion</head><p>In this work, we propose a system that performs natural language reasoning through generative deduction and heuristic-guided search. We demonstrate that our system produces entailment trees that are more internally consistent than those of an end-to-end model, and that its factored design allows it to successfully prove valid goals while being unable to hallucinate trees for invalid goals. We believe that this modular deduction framework can be readily extended to empower future reasoning systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Limitations</head><p>The baseline approach we consider in this work, end-to-end modeling of entailment tree generation, enjoys the convenience of simple inference and quadratic complexity. However, the computational overhead of sequence-to-sequence models places a hard limit on the tree size and premise count that can be handled in the end-to-end setting; moreover, recent results call into question how well end-to-end Transformers can generalize this type of reasoning <ref type="bibr">(Zhang et al., 2022)</ref>. Our structured approach allows arbitrarily large premise sets and step counts. However, because the SCSearch procedure discretizes the reasoning, exhaustive search has a runtime that is theoretically exponential in proof size. In practice, we limit our search to a finite horizon and find that this suffices to provide a practical wall-clock runtime, never exceeding 5 seconds for any single example. Future work on deeper trees may have to reckon with the theoretical limitations of this procedure, possibly through the use of better heuristics.</p><p>Our experiments are conducted exclusively on English datasets. We hypothesize that our approach would work equally well for another language given a pretrained sequence-to-sequence model for that language with equivalent capacity. However, such models are not available for all languages, which is an obstacle to transferring our results to languages beyond English.</p><p>Furthermore, the EntailmentBank dataset on which we train and evaluate targets the elementary science domain, raising a question of domain specificity. In future work, we plan to evaluate deduction models on additional datasets with different style, conceptual content, and types of reasoning in order to verify that the factored approach is equally applicable across diverse settings.</p></div></body>
		</text>
</TEI>
