<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>A Regularization Approach for Incorporating Event Knowledge and Coreference Relations into Neural Discourse Parsing</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10141345</idno>
					<idno type="doi">10.18653/v1/D19-1295</idno>
					<title level='j'>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zeyu Dai</author><author>Ruihong Huang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We argue that external commonsense knowledge and linguistic constraints need to be incorporated into neural network models for mitigating data sparsity issues and further improving the performance of discourse parsing. Realizing that external knowledge and linguistic constraints may not always apply in understanding a particular context, we propose a regularization approach that tightly integrates these constraints with contexts for deriving word representations. Meanwhile, it balances attentions over contexts and constraints through adding a regularization term into the objective function. Experiments show that our knowledge regularization approach outperforms all previous systems on the benchmark dataset PDTB for discourse parsing.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Discourse parsing and identifying rhetorical discourse relations between two text spans (i.e., discourse units, either clauses or sentences) is crucial and beneficial for a wide variety of downstream tasks and applications such as machine translation <ref type="bibr">(Webber et al., 2017)</ref>, text generation <ref type="bibr">(Mann, 1984;</ref><ref type="bibr">Bosselut et al., 2018)</ref> and text summarization <ref type="bibr">(Gerani et al., 2014)</ref>.</p><p>In PDTB-style discourse parsing <ref type="bibr">(Prasad et al., 2008)</ref>, we commonly distinguish implicit discourse relations from explicit relations, depending on whether a discourse connective (e.g., "because", "however") appears between two discourse units. In general, recognizing implicit discourse relations is more challenging due to the lack of a connective, and this task has recently drawn significant attention from NLP researchers.</p><p>Recent research on implicit discourse relation classification has mostly focused on applying powerful neural network models <ref type="bibr">(Qin et al., 2016a,b;</ref><ref type="bibr">Liu and Li, 2016;</ref><ref type="bibr">Lei et al., 2017;</ref><ref type="bibr">Bai and Zhao, 2018)</ref> for modeling compositional meanings and word-level interactions of two discourse units. More recent research has also explored utilizing broader contexts <ref type="bibr">(Dai and Huang, 2018)</ref> as well as leveraging external training data <ref type="bibr">(Xu et al., 2018)</ref>. Although progress has been made, the performance of implicit discourse relation identification remains low (macro F1 &lt; 50%).</p><p>We believe that the low performance is mainly due to the data sparsity issue <ref type="bibr">(Braud and Denis, 2015)</ref>, which hinders data-hungry neural network models from making further improvements. 
Consider the following example from PDTB with two discourse units (DUs):</p><p>DU1: The editorial of the WHO notes that tobacco consumption and lung-cancer mortality rates are rising in developing countries.</p><p>DU2: "No smoking should be established as the norm of social behavior" around the world, the editorial says, through the enactment of laws that limit advertising and promote antismoking education.</p><p>Discourse Relation: Implicit Contingency.Cause</p><p>Humans can easily recognize this discourse relation as "Cause" because we know that "smoking" is the key causal factor for "lung-cancer", but it is extremely difficult for neural network models trained with a limited amount of data to detect it, considering that the keyword "lung-cancer" appears only a few times in the whole PDTB data.</p><p>We further argue that external knowledge and linguistic constraints need to be considered for improving implicit discourse relation classification, since human annotators also rely on such commonsense knowledge (e.g., smoking causes lung cancer) to label discourse relations. First, we consider external event knowledge, because discourse relations (e.g., cause and temporal relations) are often defined as the relation between two events (situations in general) as described in two discourse units. As shown in the above example, the "Cause" discourse relation between the two DUs depends on the relation between two events "smoking" and "lung-cancer", with one event in each DU. Second, we consider entity coreference relations as a useful form of linguistic constraint in inferring discourse relations. 
This is motivated by prior work <ref type="bibr">(Rutherford and Xue, 2014;</ref><ref type="bibr">Ji and Eisenstein, 2015)</ref> showing that coreference-based features can improve entity mention representations within a DU, which facilitates recognizing coherence and discourse relations between DUs.</p><p>In this paper, we investigate how to incorporate external event knowledge and linguistic constraints based on entity coreference relations into neural network models for discourse parsing. One key difficulty we want to address is that event relations derived from external knowledge or hard linguistic constraints may not always apply when interpreting a particular context, and may hurt performance if used blindly <ref type="bibr">(Kishimoto et al., 2018)</ref>. Therefore, we propose to tightly integrate these constraints into the discourse relation inference process by manipulating hidden word representations to reflect relations between words, and meanwhile to balance attention to contexts and constraints by adding a knowledge regularization term to the final objective function.</p><p>Specifically, we choose the paragraph-level model we proposed <ref type="bibr">(Dai and Huang, 2018)</ref> as the base model, which exploits wider paragraph-level contexts and has been shown to be effective for PDTB-style discourse parsing. The model mainly consists of a two-level hierarchical BiLSTM <ref type="bibr">(Schuster and Paliwal, 1997)</ref> for modeling both word-level and DU-level inter-dependencies (with a brief description in Section 3.1). To implement the knowledge-guided regularization for discourse parsing, we first insert a new knowledge layer between the word-level BiLSTM layer and the DU-level BiLSTM layer. This knowledge layer modifies the hidden representations of words that participate in an event or coreference relation by applying a relation-type-specific feedforward neural network. 
Then, we compose a knowledge regularizer based on the word representation outputs of the knowledge layer, by adapting a classic knowledge embedding method, TransE <ref type="bibr">(Bordes et al., 2013)</ref>. The regularization term is added to the overall objective function and minimized during model training.</p><p>The experiments on PDTB v2.0 demonstrate that our proposed knowledge regularization approach can effectively utilize several types of externally obtained event knowledge and entity coreference relations<ref type="foot">foot_0</ref>, and improves the performance of both implicit and explicit discourse relation recognition compared to all previous work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Discourse Parsing on PDTB</head><p>With the release of the Penn Discourse Treebank (PDTB) <ref type="bibr">(Prasad et al., 2008)</ref>, the task of discourse parsing, especially implicit discourse relation recognition, has received a lot of attention from the NLP community and researchers <ref type="bibr">(Pitler and Nenkova, 2009;</ref><ref type="bibr">Lin et al., 2014;</ref><ref type="bibr">Xue et al., 2015;</ref><ref type="bibr">Rutherford and Xue, 2016)</ref>. A large body of previous work has attempted to model the semantic meanings of two discourse units using advanced neural network models <ref type="bibr">(Chen et al., 2016;</ref><ref type="bibr">Ji et al., 2016;</ref><ref type="bibr">Rutherford et al., 2017;</ref><ref type="bibr">Qin et al., 2017;</ref><ref type="bibr">Guo et al., 2018;</ref><ref type="bibr">Bai and Zhao, 2018)</ref>. Paragraph-wide contexts were considered for building better discourse unit representations in <ref type="bibr">Dai and Huang (2018)</ref>. Another research direction for improving implicit discourse relation classification is to expand the training data by leveraging explicit relations <ref type="bibr">(Liu et al., 2016;</ref><ref type="bibr">Lan et al., 2017)</ref> or discourse connective informed unlabeled data <ref type="bibr">(Rutherford and Xue, 2015;</ref><ref type="bibr">Xu et al., 2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Incorporate Knowledge into Discourse Parsing</head><p>Only a few previous studies <ref type="bibr">(Park and Cardie, 2012;</ref><ref type="bibr">Biran and McKeown, 2013;</ref><ref type="bibr">Lei et al., 2018)</ref> have exploited external knowledge, including WordNet features (e.g., Antonyms and Hypernyms) and Verb Class <ref type="bibr">(Levin, 1993)</ref>, in discourse parsing by deriving discrete indicator features and then feeding them into feature-based classifiers. Incorporating knowledge as additional features into neural network models often generalizes poorly due to the sparsity of features, as also shown in our experiments. Recently, <ref type="bibr">Kishimoto et al. (2018)</ref> incorporated the whole of ConceptNet into a MAGE-GRU <ref type="bibr">(Dhingra et al., 2017)</ref> based neural network, but their experiments show that it did not work well for improving implicit discourse relation identification compared with their own baseline. We interpret this negative result as the consequence of using irrelevant (noisy) knowledge types blindly without proper regularization.</p><p>There is also recent work <ref type="bibr">(Yang and Mitchell, 2017;</ref><ref type="bibr">Xu et al., 2017;</ref><ref type="bibr">Zhou et al., 2018)</ref> that incorporates external knowledge into neural network models for improving several other NLP tasks, including information extraction and conversation generation. This work mostly followed the two-step approach that first obtained representations of knowledge (in triplet format) from a knowledge base using knowledge graph embedding methods such as TransE <ref type="bibr">(Bordes et al., 2013)</ref>, and then utilized an attention mechanism (or added gates in an RNN cell <ref type="bibr">(Ma et al., 2018)</ref>) to integrate knowledge representations with hidden word vectors. 
This approach has two main drawbacks: (1) Knowledge representations learned from the first step are fixed without considering the influences of contexts, which may be suboptimal when used for understanding a particular context. (2) With no filtering or regularization, it is difficult for attention mechanisms to explicitly select and attend to the relevant knowledge. In contrast, our proposed regularization approach can be regarded as an end-to-end joint-learning framework for discourse parsing and knowledge representation learning, which not only considers both knowledge and contexts in knowledge-aware word representation learning, but also naturally balances attention to both contexts and knowledge through regularization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Model</head><p>Figure <ref type="figure">1</ref> illustrates the overall architecture of our model, which implements our knowledge regularization approach (the right part) on top of an existing model as the base model (the left part). We made only two modifications to the base model: (1) we insert a novel knowledge layer between the two BiLSTM layers of the base model; (2) we add a regularizer to the overall objective function. We will first briefly describe the base model, a replication<ref type="foot">foot_1</ref> of our recently proposed paragraph-level discourse parsing model <ref type="bibr">(Dai and Huang, 2018)</ref>. We will then explain the knowledge layer and knowledge regularizer we added.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Base Model</head><p>The base model processes one paragraph at a time, where a paragraph contains a sequence of discourse units, and predicts a sequence of discourse relations (both implicit and explicit relations), with one relation between each pair of adjacent discourse units (DUs). The base model utilizes a hierarchical BiLSTM to calculate both word-level and DU-level representations, followed by a prediction layer and a Conditional Random Field (CRF) layer <ref type="bibr">(Lafferty et al., 2001)</ref> for jointly predicting a sequence of discourse relations within a paragraph. The base model consists of the following layers:</p><p>Character-level CNN Layer: The character-level features, such as the prefix or suffix of a word, can help alleviate the out-of-vocabulary (OOV) problem and improve the word representation in neural nets <ref type="bibr">(Santos and Zadrozny, 2014)</ref>. In our base model, we use one layer of CNN<ref type="foot">foot_2</ref> with max-pooling to extract the character-level representation $w^{char}_i$ for the i-th word of the input paragraph.</p><p>Word-level BiLSTM Layer: Given a word sequence $X = (x_1, x_2, ..., x_L)$ as the input paragraph, for each word $x_i$, we construct the expanded word vector $w_i$ by concatenating its word embedding $w^{word}_i$ with its character-level representation $w^{char}_i$ and extra word-level features<ref type="foot">foot_3</ref>.</p><p>The word-level BiLSTM layer will process the sequence of expanded word vectors $(w_1, w_2, ..., w_L)$ and compute the hidden representation $h_{x_i}$ of word $x_i$ at each word index i.</p><p>DU-level BiLSTM Layer: Given the output of the word-level BiLSTM $(h_{x_1}, h_{x_2}, ..., h_{x_L})$, we calculate the raw DU representation by applying a max-pooling operation <ref type="bibr">(Conneau et al., 2017)</ref> over the sequence of word representations for all words within a discourse unit: $h^{0}_{DU_j} = \max_{x_i \in DU_j} h_{x_i}$. Then, the DU-level BiLSTM will process the sequence of raw DU 
representations and obtain the refined DU representation $h_{DU_j}$ for the j-th discourse unit in a paragraph.</p><p>Untied (Explicit vs. Implicit) Prediction Layer: Considering the different natures of explicit and implicit discourse relations <ref type="bibr">(Pitler et al., 2009;</ref><ref type="bibr">Lin et al., 2009)</ref>, the base model trains two independent linear layers with untied parameters for predicting explicit or implicit discourse relations, respectively, between every two adjacent DUs.</p><p>CRF Layer for Discourse Relation Sequence Labeling: A CRF layer <ref type="bibr">(Biran and McKeown, 2015)</ref> is added on top of the prediction layer to fine-tune the predicted sequence of discourse relations by capturing continuity and transition patterns (e.g., a temporal relation is likely to follow another temporal relation).</p><p>Given the hidden discourse relation representations $H$ and the target discourse relation label sequence $y$, both of length $T$, for the i-th training instance, we minimize a CRF loss function during model training.</p><p>During testing, the Viterbi algorithm is used to search for the optimal label sequence $y^{*}$ that maximizes the conditional probability $p(y|H)$.</p></div>
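<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Viterbi search used at test time can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function name viterbi_decode, the list-based score layout and the toy scores in the usage note are assumptions made only for this example.</p>
<p>
```python
def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions[t][k]   : score of label k at position t (hypothetical layout)
    transitions[a][b] : score of label a being followed by label b
    """
    num_labels = len(emissions[0])
    best = list(emissions[0])  # best score of any path ending in each label
    backpointers = []
    for scores in emissions[1:]:
        new_best, pointers = [], []
        for cur in range(num_labels):
            candidates = [best[prev] + transitions[prev][cur]
                          for prev in range(num_labels)]
            arg = max(range(num_labels), key=candidates.__getitem__)
            new_best.append(candidates[arg] + scores[cur])
            pointers.append(arg)
        best = new_best
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label.
    last = max(range(num_labels), key=best.__getitem__)
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]
```
</p>
<p>With emissions [[1.0, 0.0], [0.0, 0.5], [0.0, 1.0]] and transitions [[0.5, -1.0], [-1.0, 0.5]], choosing the best label at each position independently would give [0, 1, 1], whereas the decoder returns [1, 1, 1] with score 2.5, because the positive self-transition scores reward keeping the same label across adjacent positions.</p></div>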
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Knowledge Layer</head><p>We simply insert a knowledge layer between the word-level and DU-level BiLSTM layers of the base model, as shown in Figure <ref type="figure">1</ref>, for incorporating external knowledge and linguistic constraints. Although the knowledge layer can be easily extended to support other types of knowledge, we only consider event knowledge and coreference relations in this paper and leave the exploration of other knowledge types to future work.</p><p>Since there are some notable differences between event relations and entity coreference relations, we model event and coreference constraints in different ways considering their specificities. For example, (1) an event relation can be represented in the triple format (h, r, t), i.e., (head, relation, tail), where head and tail are two event words and relation indicates the relationship between the two events, while a coreference relation can have more than two coreferential entity mentions; (2) event relations are directed while coreference relations are undirected.</p><p>Event Knowledge: As the input of our knowledge layer, $E = (E_1, E_2, ..., E_M)$ denotes the collection of event knowledge triplets generated by matching the paragraph contexts with an external event knowledge base (we give more details in Section 4.1), where each triplet has the form (h, r, t), meaning that there is an event relation r (either temporal, causal or subevent in this work) between the head event $x_h$ at position h and the tail event $x_t$ at position t.</p><p>For each triplet $E_m = (h, r, t)$, we use a feedforward neural network<ref type="foot">foot_5</ref> $f_r()$ to update the hidden word representations of the head and tail events, where $W_r$ and $b_r$ are relation-specific weights and bias learned for each type of event relation r only.</p><p>Coreference Relations: Our system assumes that coreference relations in each paragraph are 
given in the form of coreference clusters, which are generated by running an existing coreference resolver <ref type="bibr">(Clark and Manning, 2016)</ref> from the latest version (3.9.2) of the Stanford CoreNLP toolkit.</p><p>Let $C = (C_1, C_2, ..., C_K)$ denote the coreference clusters in one paragraph, where $C_k$ contains the indices of the words that refer to the same entity. Similarly, we use one feedforward neural network $f_{coref}()$ to update the hidden word representation $h_{x_i}$ for words within each coreference cluster. Specifically, the output word vector is computed by $f_{coref}()$, where $W_{coref}$ and $b_{coref}$ are the weights and bias, and $h_{C_k}$ is a coreference vector calculated by applying max-pooling to all word representations in one cluster.</p><p>The role of the coreference vector is similar to the "context vector" utilized in the soft attention mechanism <ref type="bibr">(Bahdanau et al., 2015)</ref>, but we use simple max-pooling instead of computing weights<ref type="foot">foot_6</ref> for different word vectors.</p></div>
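<div xmlns="http://www.tei-c.org/ns/1.0"><p>The max-pooled coreference vector can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: hidden vectors are represented as lists of floats, clusters as lists of word indices, and the function names coref_vector and coref_vectors are hypothetical. How the pooled vector is then combined with each word vector inside the feedforward network is not shown here.</p>
<p>
```python
def coref_vector(hidden, cluster):
    """Element-wise max over the hidden vectors of all words in one cluster.

    hidden  : list of word vectors (one list of floats per word position)
    cluster : list of word indices that refer to the same entity
    """
    dim = len(hidden[cluster[0]])
    return [max(hidden[i][d] for i in cluster) for d in range(dim)]


def coref_vectors(hidden, clusters):
    """Compute one pooled vector per coreference cluster in a paragraph."""
    return [coref_vector(hidden, cluster) for cluster in clusters]
```
</p>
<p>For example, with hidden = [[0.1, 0.9], [0.5, 0.2], [0.3, 0.4], [0.7, -0.1]] and clusters = [[0, 2], [1, 3]], the pooled vectors are [0.3, 0.9] and [0.7, 0.2]. Unlike soft attention, no per-word weights are learned: each dimension simply keeps its largest value within the cluster.</p></div>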
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Knowledge Regularization</head><p>Inspired by the success of the TransE <ref type="bibr">(Bordes et al., 2013)</ref> approach in knowledge representation learning, we adapt the key assumption of TransE (i.e., we want $h + r \approx t$ when (h, r, t) holds) to our framework and hypothesize that the hidden representation of the tail t should be close to the hidden representation of the head h plus the relation-specific vector $h_r$ in vector space if (h, r, t) holds.</p><p>To guide the knowledge-aware word representation learning of the knowledge layer, we propose a knowledge regularizer based on the TransE<ref type="foot">foot_7</ref> score function $d_{transE}()$ and apply it to the output word vectors of the knowledge layer. The resulting regularization term is also minimized as a part of the objective function during model training. In other words, the knowledge regularization will smoothly penalize a constraint depending on whether this constraint can be applied for interpreting the relevant relation in a particular context.</p><p>Specifically, we use cosine similarity<ref type="foot">foot_8</ref> to measure the similarity of two vectors in the score function for a triplet (h, r, t), where $h_r$ is the relation-specific vector, which is updated as a parameter during model training. The event knowledge regularization term is computed by applying this score function to the event knowledge triplets.</p><p>For coreference relations, we create a special triplet (h, coref, t) for every two entity mentions h and t in one coreference cluster, and fix the relation-specific vector $h_{coref}$ to be a zero vector representing the relation of being "identical". The coreference relation regularization term is computed by applying the score function to these coreference triplets.</p><p>Hence, the overall loss function for our model combines the CRF loss with the event knowledge and coreference regularization terms, which are weighted by $\lambda_{event}$ and $\lambda_{coref}$ respectively.</p></div>
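<div xmlns="http://www.tei-c.org/ns/1.0"><p>A TransE-style regularizer of this kind can be sketched as follows. This is a hedged illustration, not the paper's exact formulation: the distance is assumed here to be one minus the cosine similarity between the translated head vector and the tail vector, triplet scores are assumed to be summed, coreference pairs are treated as unordered, and all function and parameter names (transe_score, event_regularizer, coref_regularizer, total_loss, lam_event, lam_coref) are hypothetical.</p>
<p>
```python
import math


def cosine(u, v):
    """Cosine similarity of two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def transe_score(head, rel, tail):
    """Assumed distance: 1 - cos(head + rel, tail); 0 when the triplet fits."""
    translated = [a + b for a, b in zip(head, rel)]
    return 1.0 - cosine(translated, tail)


def event_regularizer(vectors, triplets, rel_vectors):
    """Sum the score over event triplets (h, r, t) of word positions."""
    return sum(transe_score(vectors[h], rel_vectors[r], vectors[t])
               for h, r, t in triplets)


def coref_regularizer(vectors, clusters):
    """Score every unordered mention pair with a fixed zero relation vector."""
    total = 0.0
    for cluster in clusters:
        for a in range(len(cluster)):
            for b in range(a + 1, len(cluster)):
                head, tail = vectors[cluster[a]], vectors[cluster[b]]
                total += transe_score(head, [0.0] * len(head), tail)
    return total


def total_loss(crf_loss, reg_event, reg_coref, lam_event=0.5, lam_coref=0.5):
    """Weighted sum of the task loss and the two regularization terms."""
    return crf_loss + lam_event * reg_event + lam_coref * reg_coref
```
</p>
<p>Fixing the coreference relation vector to zero means the term only vanishes when two mentions of the same entity point in the same direction, which matches the reading of coreference as an "identical" relation. Setting both weights to 0 removes the regularizer entirely and leaves only the task loss.</p></div>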
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Dataset and Preprocessing</head><p>Dataset: We evaluate our model on PDTB v2.0 <ref type="bibr">(Prasad et al., 2008)</ref>, which is the largest annotated dataset containing 19K explicit discourse relations and 17K implicit discourse relations. To make our experimental results directly comparable with previous work, we adopted the most-used dataset splitting "PDTB-Ji" <ref type="bibr">(Ji and Eisenstein, 2015)</ref>, which uses sections 2-20, 0-1 and 21-22 as train, dev and test sets respectively. To recover the paragraph contexts and gold discourse units, we directly ran the source code<ref type="foot">foot_9</ref> of <ref type="bibr">Dai and Huang (2018)</ref>, and obtained 12,037/1,222/1,050 paragraph instances in train/dev/test sets respectively.</p><p>Knowledge Preprocessing: Table <ref type="table">1</ref> gives an overview of the relation types used in our experiments and the number of triplets (clusters) identified in the PDTB dataset. Specifically, for coreference relations, we utilized the Stanford CoreNLP coreference resolver to identify coreference clusters in each paragraph. For event knowledge, we considered three major event relation types: temporal, causal and subevent. We obtained event temporal knowledge from a previous work <ref type="bibr">(Yao and Huang, 2018)</ref><ref type="foot">10</ref>, and we retrieved the latter two types of event knowledge from ConceptNet<ref type="foot">foot_11</ref> <ref type="bibr">(Speer and Havasi, 2012)</ref>, which is a widely used commonsense knowledge base.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Experiment Setting</head><p>Evaluation Setting: Annotated discourse relation labels in PDTB v2.0 are organized in a three-level hierarchy. The top-level coarse-grained discourse relation classes include Comparison (Comp), Contingency (Cont), Expansion (Exp) and Temporal (Temp), which are further split into 16 fine-grained classes at the second level. To compare with previous work, we report the macro-average F1-score and accuracy<ref type="foot">foot_12</ref> on the top-level multi-class classification setting. Note that the macro-average F1-score is normally treated as the main evaluation metric in most previous work, considering the imbalanced distribution of discourse relations. In addition, we report class-wise F1-scores for the top-level implicit discourse relations. Unlike many previous studies that report class-wise F1-scores obtained with the one-versus-all binary classification setting, we instead report class-wise F1-scores using the 4-way multi-class classification setting, following <ref type="bibr">Dai and Huang (2018)</ref>. They pointed out that, in the one-versus-all binary classification setting, all binary classifiers may predict a positive label for one instance, so the multi-class classification setting is more appropriate for evaluating a practical end-to-end discourse parser because it needs no prediction conflict resolution. We additionally evaluate our models at the second level using the 11-way<ref type="foot">foot_13</ref> multi-class classification.</p><p>Training Setting: To make model tuning easy, we only chose $\lambda_{coref}$ and $\lambda_{event}$ from [0.1, 0.5, 1.0] and tuned them based on the best performance on the dev set. All the BiLSTM layers and our knowledge layer used a hidden state size of 512, so the dimension of all hidden vectors ($h_{*}$, $f_{coref}(h_{*})$ and $f_r(h_{*})$) is 512. 
To prevent the exploding gradient problem of LSTMs, we clipped the gradient L2-norm with threshold 5.0 and used L2 regularization with coefficient $10^{-8}$. We applied dropout with probability 0.5 on the input/output of BiLSTMs to alleviate overfitting. For the optimizer, we used SGD with momentum 0.9 and a batch size of 64, and we set the initial learning rate to 0.015, which decays by 5% after each epoch.</p><p>To diminish the effects of randomness in neural network model training, we ran our proposed model, its variants and our own base model 3 times using different random seeds and reported the average performance over the 3 runs. For fair comparison, we implemented all our models with PyTorch and tested them on an Nvidia 1080Ti GPU.</p></div>
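<div xmlns="http://www.tei-c.org/ns/1.0"><p>Two of the training settings can be made concrete with a small plain-Python sketch. It is an illustration only: the decay is assumed to be multiplicative (each epoch keeps 95% of the previous learning rate), which is one common reading of "decay by 5% after each epoch", and the helper names learning_rate and clip_by_l2_norm are hypothetical rather than taken from the authors' code.</p>
<p>
```python
import math


def learning_rate(epoch, initial=0.015, decay=0.05):
    """Learning rate after `epoch` completed epochs (assumed multiplicative)."""
    return initial * (1.0 - decay) ** epoch


def clip_by_l2_norm(gradient, threshold=5.0):
    """Rescale a gradient vector so that its L2 norm is at most `threshold`."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm > threshold:
        scale = threshold / norm
        return [g * scale for g in gradient]
    return list(gradient)
```
</p>
<p>Under this assumption, the learning rate is 0.015 in the first epoch and about 0.0135 after two epochs, and a gradient [6.0, 8.0] with norm 10 is rescaled to [3.0, 4.0], while gradients whose norm is already below 5.0 are left unchanged.</p></div>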
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Models for Comparison</head><p>We compare our proposed regularization models with the following base model, our own baselines and recently published discourse parsing systems:</p><p>&#8226; <ref type="bibr">(Dai and Huang, 2018)</ref></p><p>&#8226; Base Model: our replicated model of <ref type="bibr">(Dai and Huang, 2018)</ref> for paragraph-level discourse parsing.</p><p>&#8226; Base Model + Word Features: our own baseline that creates discrete features for each word. We create one feature for each type of relation, including three types of event relations and coreference relations, which counts the number of relation triplets that contain the word. We concatenate these word features with the input word vector $w_i$.</p><p>&#8226; Base Model + DU Features: our own baseline that creates discrete features for each DU $DU_j$. We create two features for each type of relation: one counts relation triplets that have both nodes within $DU_j$, and the other counts relation triplets that have one node in $DU_j$ and the other node in an adjacent DU. We concatenate these DU features with the hidden DU representation $h_{DU_j}$. Adding either word features or DU features imitates traditional feature-based approaches and incorporates event knowledge and coreference relation constraints as features.</p><p>&#8226; Base Model + two-step approach: our own baseline that follows the two-step approach for incorporating relational constraints, including both event relations and coreference relations. We re-implement the inference model proposed by <ref type="bibr">Chen et al. (2018)</ref><ref type="foot">14</ref>.</p></div>
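<div xmlns="http://www.tei-c.org/ns/1.0"><p>The word-feature baseline described above can be sketched in plain Python. The sketch makes several assumptions for illustration: relation triplets are given as (head position, relation type, tail position), coreference clusters have already been expanded into pairwise triplets with the type "coref", and the function name word_count_features is hypothetical.</p>
<p>
```python
def word_count_features(num_words, triplets, relation_types):
    """Count, per word and per relation type, the triplets containing the word.

    num_words      : number of words in the paragraph
    triplets       : iterable of (head_index, relation_type, tail_index)
    relation_types : ordered list of relation type names (feature columns)
    """
    column = {rel: k for k, rel in enumerate(relation_types)}
    features = [[0] * len(relation_types) for _ in range(num_words)]
    for head, rel, tail in triplets:
        # A set avoids double counting if head and tail are the same position.
        for position in {head, tail}:
            features[position][column[rel]] += 1
    return features
```
</p>
<p>Each row of the result is the discrete feature vector that would be concatenated with the corresponding input word vector. Most rows are all zeros in practice, which illustrates the feature sparsity that such baselines may suffer from.</p></div>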
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Experiment Results</head><p>Table <ref type="table">2</ref> shows the comparisons. The first section lists the results of previous models that were evaluated on PDTB using top-level multi-class classification. Note that many cells are empty because many previous publications chose to report the class-wise implicit relation prediction performance using the one-versus-all binary classification setting, which is not directly comparable with our class-wise results using the multi-class classification setting following our previous work <ref type="bibr">Dai and Huang (2018)</ref>. In addition, we also report the explicit relation results.<note place="foot" n="14">For the two-step approach baseline, following Chen et al. (2018), the parameter for each relation type was chosen from the range [0.1, 0.5, 1, 5, 10, 20, 50, 100], and we found that the best result was achieved when it is set to 0.1 for each type of relations.</note></p><p>In the second section, the replicated model (Base Model) achieves an overall similar performance<ref type="foot">15</ref> compared to the original model of <ref type="bibr">Dai and Huang (2018)</ref>. By incorporating coreference relations (+ Coreference) into the base model using our regularization approach, implicit discourse relation performance was improved for two classes, including Contingency. Adding event temporal knowledge (+ Event Temporal) into the base model significantly boosts the performance of Temporal discourse relation identification. Furthermore, adding the additional two types of event knowledge (+ Event), causal and subevent relations, yields clear performance gains in predicting another two classes of implicit discourse relations: Contingency and Expansion. These performance gains meet our expectations that event relations are correlated with discourse relations and that event relational knowledge facilitates predicting the corresponding discourse relations, with correspondences listed in Table <ref type="table">1</ref>. 
The full model considering both event knowledge and coreference relations (+ C&amp;E) achieves further improvements on implicit relation prediction, and outperforms the base model by 1.5 and 1.9 points on macro-average F1-score and accuracy respectively. Meanwhile, our full model obtains the best results for explicit relation prediction as well. As shown in the third section, our own baselines, which incorporate relational constraints either as features or via the two-step approach, perform only slightly better than the base model and clearly worse than the full model using the regularization approach. The first two baselines incorporate constraints as additional discrete features and may suffer from feature sparsity issues, while the two-step approach may fail to balance attention to contexts and knowledge constraints.</p><p>The last section presents the performance when using ELMo word embeddings. Our full model outperforms the base model on three out of four implicit discourse relations (all except Comp) and improves macro-average F1-score and accuracy by 2.1 and 3.2 points respectively. Furthermore, our full model outperforms the previous best system <ref type="bibr">(Bai and Zhao, 2018)</ref> using ELMo by over 1.8 points of macro-average F1-score.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Analysis</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Second-level Multi-class Classification</head><p>Table <ref type="table">3</ref> reports the performance of our models for predicting second-level fine-grained discourse relations. As at the top level, our full model consistently outperforms the base model and its variants using either word-level or DU-level features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Impact of the Knowledge Regularization</head><p>To study the necessity of the knowledge regularization in our full model, we removed the regularization terms from the objective function by setting $\lambda_{coref}$ and $\lambda_{event}$ to 0, which essentially means that we did not restrict or regulate the hidden knowledge-aware word vectors at all. From Table <ref type="table">4</ref>, we can see that the model without the knowledge regularizer performs significantly worse than the full model and even worse than the base model, which supports our hypothesis that using external knowledge or linguistic constraints blindly can hurt performance. We conclude that the knowledge regularizer plays a key role in achieving the state-of-the-art performance and that the knowledge layer must be used together with knowledge regularization in our framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Qualitative Analysis</head><p>To better understand the strengths and weaknesses of the regularization approach, we analyzed implicit discourse relation predictions made by our base model and full model on the dev set. In total, there are 507 implicit discourse relations that match with at least one event or coreference constraint across two DUs, while the remaining 717 instances do not involve those constraints. It turns out that the full model and the base model performed comparably on recognizing implicit discourse relations without event or coreference constraints, with 407 vs. 402 discourse relations correctly predicted by the full model and the base model respectively. Therefore, the overall performance gains achieved by the full model are mainly from better resolving implicit discourse relations with constraints, and as shown in Table <ref type="table">5</ref>, the full model made 24 fewer errors than the base model in predicting such implicit discourse relations. We further compared predictions made by the two models for implicit discourse relations with constraints. We found that 96 predictions in total have been changed by the full model, with clearly more corrections (60, i.e., the full model corrected predictions that were made wrongly by the base model) than false reversions (36, i.e., the correct predictions made by the base model were wrongly reverted by the full model). Here is one example from the 60 corrections made by the full model: DU1: Steve and his firm still worth a lot of money. DU2: A package of credit support was put together including the assets of Steve and his firm. 
Gold Discourse Relation: Implicit Contingency Base Model's prediction: Implicit Expansion Full Model's prediction: Implicit Contingency The event causal relation between "worth" and "support" identified using external event knowledge has enabled the full model to correctly recognize this Contingency discourse relation.</p><p>We further examined the 36 wrong reversions of decisions. Around one third of these errors were due to either noise of event relation knowledge or incorrect coreference relations produced by the external CoreNLP coreference resolver we used. The remaining errors came from over-reliance of the full model on constraints in general. Considering the following example from the 36 reversions: DU1: Another analyst thought that India may have pulled back because of the concern over the stock market. DU2: India may have felt that if there was a severe drop in the stock market and it affected sugar, it could buy at lower prices. Gold Discourse Relation: Implicit Expansion Base Model's prediction: Implicit Expansion Full Model's prediction: Implicit Temporal</p><p>The full model could have relied on the event temporal relation between "pulled back" and "drop" and made the wrong discourse relation prediction.</p></div>
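<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counts reported in this analysis can be cross-checked with a few lines of arithmetic. The sketch below only restates numbers given in the text; the variable names are illustrative.</p>

```python
# Dev-set counts as reported in the qualitative analysis.
with_constraints = 507      # relations matching an event or coreference constraint
without_constraints = 717   # relations matching no constraint
dev_total = with_constraints + without_constraints

# Among relations with constraints, predictions changed by the full model.
changed = 96
corrections = 60   # base model wrong, full model right
reversions = 36    # base model right, full model wrong

# In the reported counts, every changed prediction is one of the two kinds.
assert corrections + reversions == changed

# Net error reduction of the full model on relations with constraints.
net_fewer_errors = corrections - reversions

# Difference in correct predictions on relations without constraints.
gain_without_constraints = 407 - 402
```

<p>The net reduction of 24 errors on relations with constraints is far larger than the difference of 5 correct predictions on relations without constraints, which is consistent with the observation that the gains come mainly from relations with constraints.</p></div>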
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and Future Work</head><p>We have presented an effective regularization approach for incorporating external event knowledge and system-predicted coreference relations into an existing paragraph-level neural network model for discourse parsing. Our approach tightly integrates knowledge and linguistic constraints with contexts for deriving knowledge-aware word vectors, and meanwhile balances attention over contexts and constraints through regularization, which robustly improves both implicit and explicit discourse relation classification performance on the benchmark PDTB corpus. In the future, we will identify new types of commonsense knowledge for further improving the performance of discourse parsing. For example, antonyms (e.g., warm vs. cold) can directly indicate a contrast relation between two situations, and this type of knowledge has the potential to further improve the performance on Comparison discourse relations.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Entity coreference relations were generated using an existing coreference resolver from the Stanford CoreNLP toolkit.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>In our re-implementation, we made several minor modifications to the original base model by using character-level features as well as supporting both traditional fixed word embeddings (300D GloVe <ref type="bibr">(Pennington et al., 2014)</ref>) and the latest context-dependent word embeddings (1024D ELMo <ref type="bibr">(Peters et al., 2018)</ref>) for word embedding initialization.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Both the character embedding size and the CNN hidden size are 50.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>In this work, we used the capitalization (Cap) flag, part-of-speech (POS) tag and named entity (NER) tag of each word as extra word-level features. The embedding sizes for Cap/POS/NER are 5/35/20.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_4"><p>We used the Stanford CoreNLP toolkit (Manning et al., 2014) to generate POS and NER tags.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_5"><p>We also tried more complicated neural nets, including neural tensor networks <ref type="bibr">(Socher et al., 2013)</ref> and the self-attention mechanism <ref type="bibr">(Vaswani et al., 2017)</ref>, but neither performed better than a straightforward feedforward neural network. We even tried leaving the hidden word vectors unchanged (identity function), but this performed significantly worse.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_6"><p>We tried to employ weights with a soft-attention mechanism, but it did not show improvement in our experiments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_7"><p>We also tried TransD (Ji et al., 2015) and TransR <ref type="bibr">(Lin et al., 2015)</ref> for knowledge regularization, but neither showed clear improvement over TransE in our experiments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_8"><p>Note that cosine similarity performed better than L1 or L2 distance in our experiments.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_9"><p>Available at https://github.com/ZeyuDai/paragraph_implicit_discourse_relations. Relations (0.5%) between non-adjacent DUs were discarded.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_10"><p>We also tried to use VerbOcean <ref type="bibr">(Chklovski and Pantel, 2004)</ref>, which performed worse than our choices.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_11"><p>Available at http://conceptnet.io. To extract event causal knowledge, we merged the relations ['Causes', 'CausesDesire', 'Entails'] defined in ConceptNet. To extract subevent knowledge, we merged the relations ['HasSubevent', 'HasFirstSubevent', 'HasLastSubevent']. For simplicity, we removed relations containing multi-word events or non-event words (e.g., function words).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_12"><p>Note that 3% of the discourse relations in PDTB were annotated with more than one label. Following previous work <ref type="bibr">(Dai and Huang, 2018;</ref><ref type="bibr">Bai and Zhao, 2018)</ref>, we considered a prediction correct if it matches one of the gold labels.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_13"><p>Following <ref type="bibr">Ji and Eisenstein (2015)</ref>, we excluded 5 minor second-level classes in our experiments because none of these classes appear in the test or dev sets.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="16" xml:id="foot_14"><p>We downloaded the pretrained ELMo embedding (5.5B version) from AllenAI's website (https://allennlp.org/elmo) and froze its parameters during model training.</p><p>15. The performance changes on individual categories are due to minor modifications we made in replication, such as adding a char-level CNN and replacing word2vec with GloVe.</p></note>
		</body>
		</text>
</TEI>
