<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Augmenting Neural Networks with First-order Logic</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019 July</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10175282</idno>
					<idno type="doi">10.18653/v1/P19-1028</idno>
					<title level='j'>Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Tao Li</author><author>Vivek Srikumar</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Today, the dominant paradigm for training neural networks involves minimizing task loss on a large dataset. Using world knowledge to inform a model, and yet retain the ability to perform end-to-end training remains an open question. In this paper, we present a novel framework for introducing declarative knowledge to neural network architectures in order to guide training and prediction. Our framework systematically compiles logical statements into computation graphs that augment a neural network without extra learnable parameters or manual redesign. We evaluate our modeling strategy on three tasks: machine comprehension, natural language inference, and text chunking. Our experiments show that knowledge-augmented networks can strongly improve over baselines, especially in low-data regimes.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Neural models demonstrate remarkable predictive performance across a broad spectrum of NLP tasks: e.g., natural language inference <ref type="bibr">(Parikh et al., 2016)</ref>, machine comprehension <ref type="bibr">(Seo et al., 2017)</ref>, machine translation <ref type="bibr">(Bahdanau et al., 2015)</ref>, and summarization <ref type="bibr">(Rush et al., 2015)</ref>. These successes can be attributed to their ability to learn robust representations from data. However, such end-to-end training demands a large number of training examples; for example, training a typical network for machine translation may require millions of sentence pairs (e.g. <ref type="bibr">Luong et al., 2015)</ref>. The difficulties and expense of curating large amounts of annotated data are well understood and, consequently, massive datasets may not be available for new tasks, domains or languages.</p><p>In this paper, we argue that we can combat the data hungriness of neural networks by taking advantage of domain knowledge expressed as first-order logic. As an example, consider the task of reading comprehension, where the goal is to answer a question based on a paragraph of text (Fig. <ref type="figure">1</ref>). Attention-driven models such as BiDAF <ref type="bibr">(Seo et al., 2017)</ref> learn to align words in the question with words in the text as an intermediate step towards identifying the answer. While alignments (e.g. author to writing) can be learned from data, we argue that models can reduce their data dependence if they are guided by easily stated rules such as: Prefer aligning phrases that are marked as similar according to an external resource, e.g., ConceptNet <ref type="bibr">(Liu and Singh, 2004)</ref>. If such declaratively stated rules can be incorporated into training neural networks, then they can provide the inductive bias that can reduce data dependence for training.</p><p>[Figure 1. Paragraph: Gaius Julius Caesar (July 100 BC - 15 March 44 BC), Roman general, statesman, Consul and notable author of Latin prose, played a critical role in the events that led to the demise of the Roman Republic and the rise of the Roman Empire through his various military campaigns. Question: Which Roman general is known for writing prose?]</p><p>That general neural networks can represent such Boolean functions is known and has been studied both from the theoretical and empirical perspectives (e.g. <ref type="bibr">Maass et al., 1994;</ref> <ref type="bibr">Anthony, 2003;</ref> <ref type="bibr">Pan and Srikumar, 2016)</ref>. Recently, <ref type="bibr">Hu et al. (2016)</ref> exploited this property to train a neural network to mimic a teacher network that uses structured rules. In this paper, we seek to directly incorporate such structured knowledge into a neural network architecture without substantial changes to the training methods. We focus on three questions:</p><p>1. Can we integrate declarative rules with end-to-end neural network training?</p><p>2. Can such rules help ease the need for data?</p><p>3. How does incorporating domain expertise compare against large training resources powered by pre-trained representations?</p><p>The first question poses the key technical challenge we address in this paper. 
On one hand, we wish to guide training and prediction with neural networks using logic, which is non-differentiable. On the other hand, we seek to retain the advantages of gradient-based learning without having to redesign the training scheme. To this end, we propose a framework that allows us to systematically augment an existing network architecture using constraints about its nodes by deterministically converting rules into differentiable computation graphs. To allow for the possibility of such rules being incorrect, our framework is designed to admit soft constraints from the ground up. Our framework is compatible with off-the-shelf neural networks without extensive redesign or any additional trainable parameters.</p><p>To address the second and the third questions, we empirically evaluate our framework on three tasks: machine comprehension, natural language inference, and text chunking. In each case, we use a general off-the-shelf model for the task, and study the impact of simple logical constraints on observed neurons (e.g., attention) for different data sizes. We show that our framework can successfully improve an existing neural design, especially when the number of training examples is limited.</p><p>In summary, our contributions are:</p><p>1. We introduce a new framework for incorporating first-order logic rules into neural network design in order to guide both training and prediction.</p><p>2. We evaluate our approach on three different NLP tasks: machine comprehension, textual entailment, and text chunking. We show that augmented models lead to large performance gains in the low training data regimes.<ref type="foot">foot_0</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Problem Setup</head><p>In this section, we will introduce the notation and assumptions that form the basis of our formalism for constraining neural networks.</p><p>Neural networks are directed acyclic computation graphs G = (V, E), consisting of nodes (i.e. neurons) V and weighted directed edges E that represent information flow. Although not all neurons have explicitly grounded meanings, some nodes indeed can be endowed with semantics tied to the task. Node semantics may be assigned during model design (e.g. attention), or incidentally discovered in post hoc analysis (e.g., <ref type="bibr">Le et al., 2012;</ref><ref type="bibr">Radford et al., 2017, and others)</ref>. In either case, our goal is to augment a neural network with such named neurons using declarative rules.</p><p>The use of logic to represent domain knowledge has a rich history in AI (e.g. <ref type="bibr">Russell and Norvig, 2016)</ref>. In this work, to capture such knowledge, we will primarily focus on conditional statements of the form L &#8594; R, where the expression L is the antecedent (or the left-hand side) that can be conjunctions or disjunctions of literals, and R is the consequent (or the right-hand side) that consists of a single literal. Note that such rules include Horn clauses and their generalizations, which are well studied in the knowledge representation and logic programming communities (e.g. <ref type="bibr">Chandra and Harel, 1985)</ref>.</p><p>Integrating rules with neural networks presents three difficulties. First, we need a mapping between the predicates in the rules and nodes in the computation graph. Second, logic is not differentiable; we need an encoding of logic that admits training using gradient based methods. Finally, computation graphs are acyclic, but user-defined rules may introduce cyclic dependencies between the nodes. 
Let us look at these issues in order.</p><p>As mentioned before, we will assume named neurons are given. By associating predicates with such nodes that are endowed with symbolic meaning, we can introduce domain knowledge about a problem in terms of these predicates. In the rest of the paper, we will use lowercase letters (e.g., a_i, b_j) to denote nodes in a computation graph, and uppercase letters (e.g., A_i, B_j) for predicates associated with them.</p><p>To deal with the non-differentiability of logic, we will treat the post-activation value of a named neuron as the degree to which the associated predicate is true. In &#167;3, we will look at methods for compiling conditional statements into differentiable statements that augment a given network.</p><p>[Figure 2: an example computation graph, with one statement that is cyclic with respect to the graph and one that is acyclic.]</p><p>Cyclicity of Constraints Since we will augment computation graphs with compiled conditional forms, we should be careful to avoid creating cycles. To formalize this, let us define cyclicity of conditional statements with respect to a neural network.</p><p>Given two nodes a and b in a computation graph, we say that the node a is upstream of node b if there is a directed path from a to b in the graph.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 1 (Cyclic and Acyclic Implications).</head><p>Let G be a computation graph. An implicative statement L &#8594; R is cyclic with respect to G if, for any literal R_i &#8712; R, the node r_i associated with it is upstream of the node l_j associated with some literal L_j &#8712; L. An implicative statement is acyclic if it is not cyclic.</p><p>Fig. <ref type="figure">2</ref> and its caption give examples of cyclic and acyclic implications. A cyclic statement can sometimes be converted to an equivalent acyclic statement by constructing its contrapositive. For example, the constraint B_1 &#8594; A_1 is equivalent to &#172;A_1 &#8594; &#172;B_1. While the former is cyclic, the latter is acyclic. Generally, we can assume that we have acyclic implications.<ref type="foot">2</ref></p></div>
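<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of Definition 1, the following minimal sketch checks whether an implication is cyclic with respect to a graph. It assumes the graph is given as a hypothetical adjacency dictionary, and it reads "for any literal" as "for at least one literal"; the function and node names are illustrative and are not taken from the released code.</p><p><![CDATA[
```python
from collections import deque


def is_upstream(edges, a, b):
    """Return True if there is a directed path from node a to node b."""
    seen = set()
    queue = deque(edges.get(a, ()))
    while queue:
        node = queue.popleft()
        if node == b:
            return True
        if node not in seen:
            seen.add(node)
            queue.extend(edges.get(node, ()))
    return False


def is_cyclic(edges, antecedent_nodes, consequent_nodes):
    """An implication L -> R is cyclic if some consequent node is upstream
    of some antecedent node (assumed reading of Definition 1)."""
    return any(
        is_upstream(edges, r, l)
        for r in consequent_nodes
        for l in antecedent_nodes
    )


# Hypothetical graph in which the a-nodes feed the b-nodes.
EDGES = {"a1": ["b1", "b2"], "a2": ["b1", "b2"]}

print(is_cyclic(EDGES, ["b1"], ["a1"]))  # B_1 -> A_1
print(is_cyclic(EDGES, ["a1"], ["b1"]))  # A_1 -> B_1
```
]]></p><p>In this toy graph, the first statement is cyclic because a1 is upstream of b1, while the second is acyclic.</p></div>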
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">A Framework for Augmenting Neural Networks with Constraints</head><p>To create constraint-aware neural networks, we will extend the computation graph of an existing network with additional edges defined by constraints. In &#167;3.1, we will focus on the case where the antecedent is conjunctive/disjunctive and the consequent is a single literal. In &#167;3.2, we will cover more general antecedents.</p><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2"><p>As we will see in &#167;3.3, the contrapositive does not always help because we may end up with a complex right-hand side that we cannot yet compile into the computation graph.</p></note></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Constraints Beget Distance Functions</head><p>Given a computation graph, suppose we have an acyclic conditional statement Z &#8594; Y, where Z is a conjunction or a disjunction of literals and Y is a single literal. We define the neuron associated with Y to be y = g(Wx), where g denotes an activation function, W are network parameters, and x is the immediate input to y. Further, let the vector z represent the neurons associated with the predicates in Z. While the nodes z need to be named neurons, the immediate input x need not necessarily have symbolic meaning.</p><p>Constrained Neural Layers Our goal is to augment the computation of y so that whenever Z is true, the pre-activated value of y increases if the literal Y is not negated (and decreases if it is). To do so, we define a constrained neural layer as</p><p>y = g(Wx + &#961; d(z)). (1)</p><p>Here, we will refer to the function d as the distance function that captures, in a differentiable way, whether the antecedent of the implication holds. The importance of the entire constraint is decided by a real-valued hyper-parameter &#961; &#8805; 0.</p><p>The definition of the constrained neural layer says that, by compiling an implicative statement into a distance function, we can regulate the pre-activation scores of the downstream neurons based on the states of upstream ones.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Designing the distance function</head><p>The key consideration in the compilation step is the choice of an appropriate distance function for logical statements. The ideal distance function we seek is the indicator for the statement Z:</p><p>d_ideal(z) = 1 if Z holds, and 0 otherwise.</p><p>However, since the function d_ideal is not differentiable, we need smooth surrogates.</p><p>In the rest of this paper, we will define and use distance functions that are inspired by probabilistic soft logic (cf. <ref type="bibr">Klement et al., 2013)</ref> and its use of the &#321;ukasiewicz T-norm and T-conorm to define a soft version of conjunctions and disjunctions.<ref type="foot">3</ref> Table <ref type="table">1</ref> summarizes distance functions corresponding to conjunctions and disjunctions. In all cases, recall that the z_i's are the states of neurons and are assumed to be in the range [0, 1]. Examining the table, we see that with a conjunctive antecedent (first row), the distance becomes zero if even one of the conjuncts is false. For a disjunctive antecedent (second row), the distance becomes zero only when all the disjuncts are false; otherwise, it increases as the disjuncts become more likely to be true.</p><p>Negating Predicates Both the antecedent (the Z's) and the consequent (Y) could contain negated predicates. We will consider these separately.</p><p>For any negated antecedent predicate, we modify the distance function by substituting the corresponding z_i with 1 &#8722; z_i in Table <ref type="table">1</ref>. The last two rows of the table list two special cases, where the entire antecedents are negated; they can be derived from the first two rows.</p><p>To negate the consequent Y, we need to reduce the pre-activation score of neuron y. To achieve this, we can simply negate the entire distance function.</p><p>Scaling factor &#961; In Eq. 1, the distance function serves to promote or inhibit the value of the downstream neuron. The extent is controlled by the scaling factor &#961;. For instance, with &#961; = +&#8734;, the pre-activation score of the downstream neuron is dominated by the distance function. In this case, we have a hard constraint. In contrast, with a small &#961;, the output state depends on both Wx and the distance function. In this case, the soft constraint serves more as a suggestion. Ultimately, the network parameters might overrule the constraint. We will see an example in &#167;4 where a noisy constraint prefers a small &#961;.</p></div>
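<div xmlns="http://www.tei-c.org/ns/1.0"><p>The constrained layer of Eq. 1 can be sketched for a single neuron as follows. The two distance functions are the standard &#321;ukasiewicz T-norm and T-conorm, which is an assumption chosen to match the prose description of the conjunctive and disjunctive cases; the sigmoid activation and all function names are hypothetical choices made for illustration.</p><p><![CDATA[
```python
import math


def d_conjunction(z):
    """Lukasiewicz T-norm: zero as soon as one conjunct is fully false."""
    return max(0.0, sum(z) - len(z) + 1.0)


def d_disjunction(z):
    """Lukasiewicz T-conorm: zero only when all disjuncts are false."""
    return min(1.0, sum(z))


def negate(z):
    """Negated antecedent literals: substitute each z_i with 1 - z_i."""
    return [1.0 - zi for zi in z]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def constrained_neuron(w, x, z, rho, distance, negated_consequent=False):
    """y = g(w.x + rho * d(z)); the sign of d flips for a negated consequent."""
    pre_activation = sum(wi * xi for wi, xi in zip(w, x))
    d = distance(z)
    if negated_consequent:
        d = -d
    return sigmoid(pre_activation + rho * d)


# With rho = 0 the constraint is ignored; a larger rho pushes y towards 1
# when the antecedent holds.
print(constrained_neuron([1.0], [0.0], [1.0, 1.0], 0.0, d_conjunction))
print(constrained_neuron([1.0], [0.0], [1.0, 1.0], 2.0, d_conjunction))
```
]]></p><p>Because the distance enters only as an additive term on the pre-activation score, no trainable parameter is introduced; &#961; stays a hyper-parameter.</p></div>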
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">General Boolean Antecedents</head><p>So far, we exclusively focused on conditional statements with either conjunctive or disjunctive antecedents. In this section, we will consider general antecedents.</p><p>As an illustrative example, suppose we have an antecedent (&#172;A &#8744; B) &#8743; (C &#8744; D). By introducing auxiliary variables, we can convert it into the conjunctive form P &#8743; Q, where (&#172;A &#8744; B) &#8596; P and (C &#8744; D) &#8596; Q. To perform such an operation, we need to: (1) introduce auxiliary neurons associated with the auxiliary predicates P and Q, and (2) define these neurons to be exclusively determined by the biconditional constraint.</p><p>To be consistent in terminology, when considering the biconditional statement (&#172;A &#8744; B) &#8596; P, we will call the auxiliary literal P the consequent, and the original literals A and B the antecedents.</p><p>Because the implication is bidirectional in a biconditional statement, it violates our acyclicity requirement in &#167;3.1. However, since the auxiliary neuron state does not depend on any other nodes, we can still create an acyclic sub-graph by defining the new node to be the distance function itself.</p><p>Constrained Auxiliary Layers With a biconditional statement Z &#8596; Y, where Y is an auxiliary literal, we define a constrained auxiliary layer as</p><p>y = d(z), (2)</p><p>where d is the distance function for the statement, z are the upstream neurons associated with Z, and y is the downstream neuron associated with Y. Note that, compared to Eq. 1, we do not need an activation function since the distance, which is in [0, 1], can be interpreted as producing normalized scores. Note that this construction only applies to auxiliary predicates in biconditional statements. The advantage of this layer definition is that we can use the same distance functions as before (i.e., Table 1). Furthermore, the same design considerations in &#167;3.1 still apply here, including how to negate the left- and right-hand sides.</p><p>Constructing augmented networks To complete the modeling framework, we summarize the workflow needed to construct an augmented neural network given a conditional statement and a computation graph: (1) Convert the antecedent into a conjunctive or a disjunctive normal form if necessary. (2) Convert the conjunctive/disjunctive antecedent into distance functions using Table 1 (with appropriate corrections for negations). (3) Use the distance functions to construct constrained layers and/or auxiliary layers to augment the computation graph by replacing the original layer with the constrained one. (4) Finally, use the augmented network for end-to-end training and inference. We will see complete examples in &#167;4.</p></div>
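<div xmlns="http://www.tei-c.org/ns/1.0"><p>The workflow above can be sketched for the illustrative antecedent (&#172;A &#8744; B) &#8743; (C &#8744; D). The sketch assumes the &#321;ukasiewicz forms for the two distance functions and uses hypothetical function names; each auxiliary neuron is set to the distance of its own biconditional statement, following Eq. 2.</p><p><![CDATA[
```python
def d_conjunction(z):
    """Assumed Lukasiewicz T-norm for a conjunctive antecedent."""
    return max(0.0, sum(z) - len(z) + 1.0)


def d_disjunction(z):
    """Assumed Lukasiewicz T-conorm for a disjunctive antecedent."""
    return min(1.0, sum(z))


def antecedent_distance(a, b, c, d):
    """Distance for (not A or B) and (C or D), built from auxiliary neurons."""
    # Auxiliary layer for (not A or B) <-> P: the negated literal uses 1 - a.
    p = d_disjunction([1.0 - a, b])
    # Auxiliary layer for (C or D) <-> Q.
    q = d_disjunction([c, d])
    # The converted antecedent P and Q is now purely conjunctive.
    return d_conjunction([p, q])


# The result would be used as d(z) in a constrained layer y = g(Wx + rho * d(z)).
print(antecedent_distance(0.0, 0.0, 1.0, 0.0))
print(antecedent_distance(1.0, 0.0, 1.0, 1.0))
```
]]></p><p>The first call evaluates an assignment where both clauses hold, the second one where the first clause fails, so the conjunction contributes nothing to the downstream neuron.</p></div>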
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Discussion</head><p>Not only does our design not add any more trainable parameters to the existing network, it also admits efficient implementation with modern neural network libraries.</p><p>When posing multiple constraints on the same downstream neuron, there could be combinatorial conflicts. In this case, our framework relies on the base network to handle the consistency issue. In practice, we found that summing the constrained pre-activation scores for a neuron is a good heuristic (as we will see in &#167;4.3).</p><p>For a conjunctive consequent, we can decompose it into multiple individual constraints. That is equivalent to constraining downstream nodes independently. Handling more complex consequents is a direction of future research.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we will answer the research questions raised in &#167;1 by focusing on the effectiveness of our augmentation framework. Specifically, we will explore three types of constraints by augmenting: 1) intermediate decisions (i.e. attentions); 2) output decisions constrained by intermediate states; 3) output decisions constrained using label dependencies.</p><p>To this end, we instantiate our framework on three tasks: machine comprehension, natural language inference, and text chunking. Across all experiments, our goal is to study the modeling flexibility of our framework and its ability to improve performance, especially with decreasing amounts of training data.</p><p>To study low data regimes, our augmented networks are trained using varying amounts of training data to see how performances vary from baselines. For detailed model setup, please refer to the appendices.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Machine Comprehension</head><p>Attention is a widely used intermediate state in several recent neural models. To explore the augmentation over such neurons, we focus on attention-based machine comprehension models on the SQuAD (v1.1) dataset <ref type="bibr">(Rajpurkar et al., 2016)</ref>. We seek to use word relatedness from external resources (i.e., ConceptNet) to guide alignments, and thus to improve model performance.</p><p>Model We base our framework on two models: BiDAF <ref type="bibr">(Seo et al., 2017)</ref> and its ELMo-augmented variant <ref type="bibr">(Peters et al., 2018)</ref>. Here, we provide an abstraction of the two models which our framework will operate on:</p><p>p, q = encoder(p), encoder(q) (3)</p><p>\overleftarrow{a}, \overrightarrow{a} = &#963;(layers(p, q)) (4)</p><p>y, z = &#963;(layers(p, q, \overleftarrow{a}, \overrightarrow{a})) (5)</p><p>where p and q are the paragraph and query respectively, &#963; refers to the softmax activation, \overleftarrow{a} and \overrightarrow{a} are the bidirectional attentions from q to p and vice versa, and y and z are the probabilities of answer boundaries. All other aspects are abstracted as encoder and layers.</p><p>Augmentation By construction of the attention neurons, we expect that related words should be aligned. In a knowledge-driven approach, we can use ConceptNet to guide the attention values in the model in Eq. 4.</p><p>We consider two rules to illustrate the flexibility of our framework. Both statements are in first-order logic and are dynamically grounded to the computation graph for a particular paragraph and query. First, we define the following predicates:</p><p>K_{i,j}: word p_i is related to word q_j in ConceptNet via edges {Synonym, DistinctFrom, IsA, Related}.</p><p>\overleftarrow{A}_{i,j}: unconstrained model decision that word q_j best matches word p_i.</p><p>\overleftarrow{A}'_{i,j}: constrained model decision for the above alignment.</p><p>Using these predicates, we will study the impact of the following two rules, defined over a set C of content words in p and q:</p><p>R_1: &#8704;i, j &#8712; C, K_{i,j} &#8594; \overleftarrow{A}'_{i,j}.</p><p>R_2: &#8704;i, j &#8712; C, K_{i,j} &#8743; \overleftarrow{A}_{i,j} &#8594; \overleftarrow{A}'_{i,j}.</p><p>The rule R_1 says that two words should be aligned if they are related. Interestingly, compiling this statement using the distance functions in Table 1 is essentially the same as adding word relatedness as a static feature. The rule R_2 is more conservative as it also depends on the unconstrained model decision.</p><table><head>Table 2</head><row role="label"><cell>%Train</cell><cell>BiDAF</cell><cell>+R_1</cell><cell>+R_2</cell><cell>+ELMo</cell><cell>+ELMo,R_1</cell></row><row><cell>10%</cell><cell>57.5</cell><cell>61.5</cell><cell>60.7</cell><cell>71.8</cell><cell>73.0</cell></row><row><cell>20%</cell><cell>65.7</cell><cell>67.2</cell><cell>66.6</cell><cell>76.9</cell><cell>77.7</cell></row><row><cell>40%</cell><cell>70.6</cell><cell>72.6</cell><cell>71.9</cell><cell>80.3</cell><cell>80.9</cell></row><row><cell>100%</cell><cell>75.7</cell><cell>77.4</cell><cell>77.0</cell><cell>83.9</cell><cell>84.1</cell></row></table><p>Can our framework use rules over named neurons to improve model performance? The answer is yes. We experiment with rules R_1 and R_2 on incrementally larger training data. Performances are reported in Table <ref type="table">2</ref>.</p><p>How does it compare to pretrained encoders? Pretrained encoders (e.g. ELMo and BERT <ref type="bibr">(Devlin et al., 2018)</ref>) improve neural models with improved representations, while our framework augments the graph using first-order logic. It is important to study the interplay of these two orthogonal directions. As Table <ref type="table">2</ref> shows, our augmented model consistently outperforms the baseline even in the presence of ELMo embeddings.</p><p>Does the conservative constraint R_2 help? We explored two options to incorporate word relatedness; one is a straightforward constraint (i.e. R_1), the other is its conservative variant (i.e. R_2). It is a design choice as to which to use. In Table 2, constraint R_1 consistently outperforms its conservative alternative R_2, even though R_2 is better than the baseline. In the next task, we will see an example where a conservative constraint performs better with large training data.</p></div>
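<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the compilation of the two attention rules concrete, here is a minimal sketch for one row of attention scores. It assumes that the relatedness indicator k_j is a 0/1 value looked up from an external resource, that R_1 compiles to adding &#961; k_j to the pre-softmax score, and that R_2 uses the &#321;ukasiewicz conjunction of k_j with the unconstrained attention; the function names are hypothetical.</p><p><![CDATA[
```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_r1(scores, related, rho):
    """R_1 (K -> A'): add rho * k_j to each pre-softmax score."""
    return softmax([s + rho * k for s, k in zip(scores, related)])


def attention_r2(scores, related, rho):
    """R_2 (K and A -> A'): the boost also depends on the unconstrained
    attention a_j, through the assumed distance max(0, k_j + a_j - 1)."""
    unconstrained = softmax(scores)
    boosted = [
        s + rho * max(0.0, k + a - 1.0)
        for s, k, a in zip(scores, related, unconstrained)
    ]
    return softmax(boosted)


# Three candidate words with equal scores; only the second one is related.
print(attention_r1([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], rho=2.0))
print(attention_r2([0.0, 0.0, 0.0], [0.0, 1.0, 0.0], rho=2.0))
```
]]></p><p>In this toy case both rules raise the attention on the related word, and the conservative rule raises it by less, since its boost is scaled by how much the unconstrained model already attends to that word.</p></div>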
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Natural Language Inference</head><p>Unlike in the machine comprehension task, here we explore logic rules that bridge attention neurons and output neurons. We use the SNLI dataset <ref type="bibr">(Bowman et al., 2015)</ref>, and base our framework on a variant of the decomposable attention (DAtt, <ref type="bibr">Parikh et al., 2016</ref>) model where we replace its projection encoder with a bidirectional LSTM (namely L-DAtt).</p><p>Model Again, we abstract the pipeline of the L-DAtt model, only focusing on layers which our framework works on. Given a premise p and a hypothesis h, we summarize the model as:</p><p>p, h = encoder(p), encoder(h) (6)</p><p>\overleftarrow{a}, \overrightarrow{a} = &#963;(layers(p, h)) (7)</p><p>y = &#963;(layers(p, h, \overleftarrow{a}, \overrightarrow{a})) (8)</p><p>Here, &#963; is the softmax activation, \overleftarrow{a} and \overrightarrow{a} are bidirectional attentions, and y are probabilities for the labels Entailment, Contradiction, and Neutral.</p><p>Augmentation We will borrow the predicate notation defined in the machine comprehension task (&#167;4.1), and ground them on premise and hypothesis words, e.g. K_{i,j} now denotes the relatedness between premise word p_i and hypothesis word h_j.</p><p>In addition, we define the predicate Y_l to indicate that the label is l. As in &#167;4.1, we define two rules governing attention:</p><p>N_1: &#8704;i, j &#8712; C, K_{i,j} &#8594; A'_{i,j}.</p><p>N_2: &#8704;i, j &#8712; C, K_{i,j} &#8743; A_{i,j} &#8594; A'_{i,j}.</p><p>where C is the set of content words. Note that the two constraints apply to both attention directions.</p><p>Intuitively, if a hypothesis content word is not aligned, then the prediction should not be Entailment. To use this knowledge, we define a rule N_3 whose auxiliary predicates Z_1 and Z_2 are tied to the Y_Entail predicate. The details of N_3 are illustrated in Fig. <ref type="figure">4</ref>.</p><p>How does our framework perform with large training data? The SNLI dataset is a large dataset with over half a million examples. We train our models using incrementally larger percentages of data and report the average performance in Table 3. Similar to &#167;4.1, we observe strong improvements from augmented models trained on small percentages (&#8804;10%) of data. The straightforward constraint N_1 performs strongly with &#8804;2% data while its conservative alternative N_2 works better with a larger set. However, with the full dataset, our augmented models perform only on par with the baseline even with a lowered scaling factor &#961;. These observations suggest that if a large dataset is available, it may be better to believe the data, but with smaller datasets, constraints can provide useful inductive bias for the models.</p><p>[Table 3 (caption fragment): &#961; = (8, 8, 8, 8, 4) for the five different percentages. For the noisy constraint N_3, &#961; = (2, 2, 1, 1, 1).]</p><p>The constraint N_3 performed even worse than the baseline, which suggests it contains noise. In fact, we found a significant number of counterexamples to N_3 during preliminary analysis. Yet, even a noisy rule can improve model performance with &#8804;10% data. The same observation holds for N_1, which suggests conservative constraints could be a way to deal with noise. Finally, by comparing N_2 and N_{2,3}, we find that the good constraint N_2 can not only augment the network, but also amplify the noise in N_3 when they are combined. This results in degrading performance in the N_{2,3} column starting from 5% of the data, much earlier than using N_3 alone.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Text Chunking</head><p>Attention layers are a modeling choice and do not exist in all networks. To illustrate that our framework is not necessarily grounded to attention, we turn to an application where we use knowledge about the output space to constrain predictions. We focus on the sequence labeling task of text chunking using the CoNLL2000 dataset <ref type="bibr">(Tjong Kim Sang and Buchholz, 2000)</ref>.</p><p>In such sequence tagging tasks, global inference is widely used, e.g., BiLSTM-CRF <ref type="bibr">(Huang et al., 2015)</ref>. Our framework, on the other hand, aims to promote local decisions. To explore the interplay of global model and local decision augmentation, we will combine CRF with our framework.</p><p>Model Our baseline is a BiLSTM tagger, where x is the input sentence, &#963; is softmax, and y are the output probabilities of BIO tags.</p><p>Augmentation We define the following predicates for input and output neurons:</p><p>Y'_{t,l}: the constrained decision that the t-th word has label l.</p><p>N_t: the t-th word is a noun.</p><p>Then we can write rules for pairwise label dependency. For instance, if word t has a B/I-tag for a certain label, word t+1 cannot have an I-tag with a different label.</p><p>[Rules C_1 to C_4: pairwise label-dependency constraints.]</p><p>Our second set of rules is also intuitive: a noun should not have a non-NP label.</p><p>C_5: &#8704;t, N_t &#8594; &#8896;_{l &#8712; {B-VP, I-VP, B-PP, I-PP}} &#172;Y'_{t,l}</p><p>While all of the above rules can be applied as hard constraints in the output space, our framework provides a differentiable way to inform the model during training and prediction.</p><p>How does local augmentation compare with global inference? We report performances in Table <ref type="table">4</ref>. While a first-order Markov model (e.g., the BiLSTM-CRF) can learn pairwise constraints such as C_{1:4}, we see that our framework can better inform the model. Interestingly, the CRF model performed even worse than the baseline with &#8804;40% data. This suggests that global inference relies on more training examples to learn its scoring function. In contrast, our constrained models performed strongly even with small training sets. By combining these two orthogonal methods, our locally augmented CRF performed the best with full data.</p></div>
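<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a sketch of how the noun rule could act on the output layer for a single token, the code below lowers the pre-softmax scores of the four non-NP tags named in the rule by &#961; times the noun indicator. It assumes a negated consequent is compiled by subtracting the distance, and that the noun indicator is a value in [0, 1] supplied by an external tagger; the tag inventory and function names are hypothetical.</p><p><![CDATA[
```python
import math

# Tags whose scores the noun rule inhibits (from the rule's label set).
NON_NP_TAGS = ("B-VP", "I-VP", "B-PP", "I-PP")


def softmax_dict(scores):
    """Softmax over a {label: score} mapping."""
    m = max(scores.values())
    exps = {label: math.exp(s - m) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}


def constrain_tag_scores(scores, is_noun, rho):
    """Apply N_t -> not Y'_{t,l} for each non-NP tag l of one token.

    The antecedent is a single literal, so its distance is is_noun itself;
    the consequent is negated, so the distance is subtracted.
    """
    adjusted = {
        label: (s - rho * is_noun) if label in NON_NP_TAGS else s
        for label, s in scores.items()
    }
    return softmax_dict(adjusted)


# One token with equal scores over an illustrative tag set.
uniform = {t: 0.0 for t in ("B-NP", "I-NP", "B-VP", "I-VP", "B-PP", "I-PP", "O")}
print(constrain_tag_scores(uniform, is_noun=1.0, rho=4.0))
```
]]></p><p>With a noun indicator of zero the distribution is left unchanged, which matches the soft-constraint reading: the rule only acts when its antecedent holds.</p></div>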
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work and Discussion</head><p>Artificial Neural Networks and Logic Our work is related to neural-symbolic learning (e.g. <ref type="bibr">Besold et al., 2017)</ref> which seeks to integrate neural networks with symbolic knowledge. For example, <ref type="bibr">Cingillioglu and Russo (2019)</ref> proposed neural models that perform multi-hop logical reasoning.</p><p>KBANN <ref type="bibr">(Towell et al., 1990</ref>) constructs artificial neural networks using connections expressed in propositional logic. Along these lines, Fran&#231;a et al. (2014, CILP++) build neural networks from a rule set for relation extraction. Our distinction is that we use first-order logic to augment a given architecture instead of designing a new one. Also, our framework is related to <ref type="bibr">Kimmig et al. (2012, PSL)</ref> which uses a smooth extension of standard Boolean logic. <ref type="bibr">Hu et al. (2016)</ref> introduced an imitation learning framework where a specialized teacher-student network is used to distill rules into network parameters. This work could be seen as an instance of knowledge distillation <ref type="bibr">(Hinton et al., 2015)</ref>. Instead of such extensive changes to the learning procedure, our framework retains the original network design and augments existing interpretable layers.</p><p>Regularization with Logic Several recent lines of research seek to guide training neural networks by integrating logical rules in the form of additional terms in the loss functions (e.g., <ref type="bibr">Rockt&#228;schel et al., 2015)</ref> that essentially promote constraints among output labels (e.g., <ref type="bibr">Du et al., 2019;</ref> <ref type="bibr">Mehta et al., 2018)</ref>, promote agreement <ref type="bibr">(Hsu et al., 2018)</ref> or reduce inconsistencies across predictions <ref type="bibr">(Minervini and Riedel, 2018)</ref>.</p><p>Furthermore, <ref type="bibr">Xu et al. (2018)</ref> proposed a general design of loss functions using symbolic knowledge about the outputs. <ref type="bibr">Fischer et al. (2019)</ref> describe a method for deriving losses that are friendly to gradient-based learning algorithms. <ref type="bibr">Wang and Poon (2018)</ref> proposed a framework for integrating indirect supervision expressed via probabilistic logic into neural networks.</p><p>Learning with Structures Traditional structured prediction models (e.g. Smith, 2011) naturally admit constraints of the kind described in this paper. Indeed, our approach for using logic as a template language is similar to Markov Logic Networks <ref type="bibr">(Richardson and Domingos, 2006)</ref>, where logical forms are compiled into Markov networks. Our formulation, which augments model scores with constraint penalties, is reminiscent of the Constrained Conditional Model of <ref type="bibr">Chang et al. (2012)</ref>.</p><p>Recently, we have seen some work that allows backpropagating through structures (e.g. <ref type="bibr">Huang et al., 2015;</ref> <ref type="bibr">Kim et al., 2017;</ref> <ref type="bibr">Yogatama et al., 2017;</ref> <ref type="bibr">Niculae et al., 2018;</ref> <ref type="bibr">Peng et al., 2018</ref>, and the references within). Our framework differs from them in that structured inference is not mandatory here. We believe that there is room to study the interplay of these two approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusions</head><p>In this paper, we presented a framework for introducing constraints in the form of logical statements to neural networks. We demonstrated the process of converting first-order logic into differentiable components of networks without extra learnable parameters or extensive redesign. Our experiments were designed to explore the flexibility of our framework with different constraints in diverse tasks. As our experiments showed, our framework allows neural models to benefit from external knowledge during learning and prediction, especially when training data is limited.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The code used for our experiments is archived here: https://github.com/utahnlp/layer_augmentation</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>The definitions of the distance functions here as surrogates for the non-differentiable d_ideal are reminiscent of the use of hinge loss as a surrogate for the zero-one loss. In both cases, other surrogates are possible.</p></note>
		</body>
		</text>
</TEI>
