<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Latent Part-of-Speech Sequences for Neural Machine Translation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10173297</idno>
					<idno type="doi">10.18653/v1/D19-1072</idno>
					<title level='j'>Empirical Methods in Natural Language Processing</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Xuewen Yang</author><author>Yingru Liu</author><author>Dongliang Xie</author><author>Xin Wang</author><author>Niranjan Balasubramanian</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Learning target side syntactic structure has been shown to improve Neural Machine Translation (NMT). However, incorporating syntax through latent variables introduces additional complexity in inference, as the models need to marginalize over the latent syntactic structures. To avoid this, models often resort to greedy search, which only allows them to explore a limited portion of the latent space. In this work, we introduce a new latent variable model, LaSyn, that captures the co-dependence between syntax and semantics, while allowing for effective and efficient inference over the latent space. LaSyn decouples direct dependence between successive latent variables, which allows its decoder to exhaustively search through the latent syntactic choices, while keeping decoding speed proportional to the size of the latent variable vocabulary. We implement LaSyn by modifying a transformer-based NMT system and design a neural expectation maximization algorithm that we regularize with part-of-speech information as the latent sequences. Evaluations on four different MT tasks show that incorporating target side syntax with LaSyn improves translation quality and also provides an opportunity to improve diversity.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Syntactic information has been shown to improve translation quality in NMT models. On the source side, syntax can be incorporated in multiple ways: either directly during encoding <ref type="bibr">(Chen et al., 2018;</ref><ref type="bibr">Sennrich and Haddow, 2016;</ref><ref type="bibr">Eriguchi et al., 2016)</ref>, or indirectly via multi-task learning to produce syntax-informed representations <ref type="bibr">(Eriguchi et al., 2017;</ref><ref type="bibr">Baniata et al., 2018;</ref><ref type="bibr">Niehues and Cho, 2017;</ref><ref type="bibr">Zaremoodi et al., 2018)</ref>. On the target side, however, incorporating syntax is more challenging due to the additional complexity in inference when decoding over latent states. To avoid this, existing methods resort to approximate inference over the latent states using a two-step decoding process <ref type="bibr">(G&#363; et al., 2018;</ref><ref type="bibr">Wang et al., 2018a;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Aharoni and Goldberg, 2017)</ref>. Typically, the first stage decoder produces a beam of latent states, which serve as conditions to feed into the second stage decoder to obtain the target words. Thus, training and inference in these models can only explore a limited sub-space of the latent states.</p><p>In this work, we introduce LaSyn, a new target side syntax model that allows exhaustive exploration of the latent states to ensure a better translation quality. Similar to prior work, LaSyn approximates the co-dependence between syntax and semantics of the target sentences by modeling the joint conditional probability of the target words and the syntactic information at each position. However, unlike prior work, LaSyn eliminates the sequential dependence between the latent variables and simply infers the syntactic information at a given position based on the source text and the partial translation context. 
This allows LaSyn to search over a much larger set of latent state sequences. In terms of time complexity, unlike typical latent sequential models, LaSyn only introduces an additional term that is linear in the size of the latent variable vocabulary and the length of the sentence.</p><p>We implement LaSyn by modifying a transformer-based encoder-decoder model. The implementation uses a hybrid decoder that predicts two posterior distributions: the probability of the syntactic choices at each position, P(z_n | x, y_{&lt;n}), and the probability of the word choices at each position conditioned on each of the possible values of the latent state, P(y_n | z_n, x, y_{&lt;n}). The model cannot be trained by directly optimizing the data log-likelihood because the objective is non-convex. We therefore devise a neural expectation maximization (NEM) algorithm, whose E-step computes the posterior distribution of the latent states under the current model parameters and whose M-step updates the model parameters using gradients from back-propagation. Given some supervision signal for the latent variables, we can modify this EM algorithm to obtain a regularized training procedure. We use parts-of-speech (POS) tag sequences, automatically generated by an existing tagger, as the source of supervision.</p><p>Because the decoder is exposed to more latent states during training, it is more likely to generate diverse translation candidates. To obtain diverse sequences, we can decode the most likely translations for different POS tag sequences. This is a more explicit and effective way of performing diverse translation than other methods based on diverse or re-ranking beam search <ref type="bibr">(Vijayakumar et al., 2018;</ref><ref type="bibr">Li and Jurafsky, 2016)</ref>, or coarse codes planning <ref type="bibr">(Shu and Nakayama, 2018)</ref>.</p><p>We evaluate LaSyn on four translation tasks. 
Evaluations show that LaSyn outperforms models that only use partial exploration of the latent states for incorporating target side syntax. A diversity based evaluation also shows that when using different POS tag sequences during inference, LaSyn produces more diverse and meaningful translations compared to existing models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">A Latent Syntax Model for Decoding</head><p>In a standard sequence-to-sequence model, the decoder directly predicts the target sequence y conditioned on the source input x. The translation probability P(y | x) is modeled directly using the probability of each target word y_n at time step n conditioned on the source sequence x and the current partial target sequence y_{&lt;n} as follows:</p><p>P(y \mid x; \theta) = \prod_{n=1}^{N} P(y_n \mid x, y_{&lt;n}; \theta) \quad (1)</p><p>where &#952; denotes the parameters of both the encoder and the decoder.</p><p>In this work, we model syntactic information of target tokens using an additional sequence of variables, which captures the syntactic choices<ref type="foot">foot_0</ref> at each time step. There are multiple ways of incorporating this additional information in a sequence-to-sequence model.</p><p>Figure <ref type="figure">1</ref>: Target-side Syntax Models: (a) An ideal solution that captures full co-dependence between syntax and semantics. (b) A widely-used two-step decoding model <ref type="bibr">(Wang et al., 2018a;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Aharoni and Goldberg, 2017)</ref>. (c) LaSyn, our latent syntax model that uses non-sequential latent variables for exhaustive search of latent states.</p><p>An ideal solution should capture the co-dependence between syntax and semantics. In a sequential translation process, the word choices at each time step depend on both the semantics and the syntax of the words generated at the previous time steps. The same dependence also holds for the syntactic choices at each time step. Figure <ref type="figure">1a</ref> shows a graphical model that captures this full co-dependence between the syntactic variable sequence z_1, . . . , z_N and the output word sequence y_1, . . . , y_N. Such a model can be implemented using two decoders, one to decode syntax and the other to decode output words. 
The main difficulty, however, is that inference in this model is intractable since it involves marginalizing over the latent z sequences.</p><p>To keep inference tractable, existing approaches treat syntactic choices z as observed sequential variables <ref type="bibr">(G&#363; et al., 2018;</ref><ref type="bibr">Wang et al., 2018a;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Aharoni and Goldberg, 2017)</ref>, as shown in Figure <ref type="figure">1b</ref>. These models use a two-stage decoding process, where for each time step they first produce the most likely latent state z_n and then use this as input to a second stage that decodes words. However, this process is unsatisfactory in two respects. First, the inference of syntactic choices is still approximate as it does not explore the full space of z. Second, these models are not well-suited for controllable or diverse translations. Using such a model to decode from an arbitrary z sequence is a divergence from its training, where it only learns to decode from a limited space of z sequences.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Model Description</head><p>Our goal is to design a model that allows for exhaustive search over syntactic choices. We introduce LaSyn, a new latent model shown in Figure <ref type="figure">1c</ref>. The syntactic choices are modeled as true latent variables, i.e., unobserved variables. Compared to the ideal model in Figure <ref type="figure">1a</ref>, LaSyn includes two simplifications for tractability: (i) The dependence between successive syntactic choices is modeled indirectly, via the word choices made in the previous time steps. (ii) The word choice at each position depends only on the syntactic choice at the current position and the previous predicted words. Dependence on previous syntactic choices is modeled indirectly.</p><p>Under this model, the joint conditional probability of the target word y_n together with its corresponding latent syntactic choice z_n<ref type="foot">foot_1</ref> is given by:</p><p>P(y_n, z_n \mid x, y_{&lt;n}) = P(y_n \mid z_n, x, y_{&lt;n}) \, P(z_n \mid x, y_{&lt;n}) \quad (2)</p><p>We implement LaSyn by modifying the Transformer-based encoder-decoder architecture <ref type="bibr">(Vaswani et al., 2017)</ref>. As shown in Figure <ref type="figure">2</ref>, LaSyn consists of a shared encoder for encoding source sentence x and a hybrid decoder that manages the decoding of the latent sequence z (left branch) and the target sentence y (right branch) separately.</p><p>The encoder consists of the standard self-attention layer, which generates representations of each token in the source sentence x. The hybrid decoder consists of a self-attention layer that encodes the output generated thus far (i.e., the partial translation), followed by an inter-attention layer </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Inference with Exhaustive Search for Latent States</head><p>When using additional variables to model target side syntax, exact inference requires marginalizing over these additional variables.</p><p>To avoid this exponential complexity, prior works use a two-step decoding process with models similar to the one shown in Figure <ref type="figure">1b</ref>. They use greedy or beam search to explore a subset B(z) of the latent space to compute the posterior distribution as follows:</p><p>Finding the most likely translation using LaSyn also requires marginalizing over the latent states. However, because the latent states in LaSyn do not directly depend on each other, we can exhaustively search over the latent states. In particular, we can show that when y is fixed (observed), the \{z_n\}_{n=1}^{N} variables are d-separated <ref type="bibr">(Bishop, 2006)</ref>, i.e., are mutually independent. As a result, the time complexity for searching the latent sequence z drops from O(|V_z|^N) to O(|V_z| N).</p><p>The posterior distribution for the translation probability at a time step n can be computed as follows:</p><p>P(y_n \mid x, y_{&lt;n}) = \sum_{z_n \in F(z_n)} P(y_n, z_n \mid x, y_{&lt;n}) \quad (5)</p><p>where F(z_n) is the full search space of latent states z_n and the joint probability is factorized as specified in Equation <ref type="formula">2</ref>.</p><p>For decoding words, we use standard beam search<ref type="foot">foot_2</ref> with P(y_n, z_n | x, y_{&lt;n}) as the beam cost. With this inference scheme, we can easily control decoding for diversity by feeding different z sequences to the right branch of the decoder and decoding diverse y_n by directly using P(y_n | z_n, x, y_{&lt;n}) as the beam cost. Unlike the two-step decoding models, which only evaluate a small number of z_n values at each time step (constrained by beam size), LaSyn evaluates all possible values for z_n at each time step, while avoiding the evaluation of all possible sequences.</p></div>
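<div xmlns="http://www.tei-c.org/ns/1.0"><p>The per-step marginalization over the full tag set described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the list-based inputs and the toy probabilities are all assumptions made for illustration, and a real decoder would compute the same sum on batched tensors.</p><p><![CDATA[
```python
import math


def marginal_word_log_probs(log_p_z, log_p_y_given_z):
    """Marginalize the latent tag out of the per-step joint distribution.

    log_p_z[i]            : log P(z_n = i | x, y_<n)
    log_p_y_given_z[i][w] : log P(y_n = w | z_n = i, x, y_<n)
    Returns a list with log P(y_n = w | x, y_<n) for every word w.
    Inputs are assumed to be finite log-probabilities (hypothetical format).
    """
    num_words = len(log_p_y_given_z[0])
    marginals = []
    for w in range(num_words):
        # log of the joint P(y_n = w, z_n = i | x, y_<n) for every tag i
        joint = [lz + row[w] for lz, row in zip(log_p_z, log_p_y_given_z)]
        # log-sum-exp over all tags, shifted by the maximum for stability
        shift = max(joint)
        marginals.append(shift + math.log(sum(math.exp(j - shift) for j in joint)))
    return marginals


# Toy step with two tags and two candidate words (made-up numbers).
toy_log_p_z = [math.log(0.5), math.log(0.5)]
toy_log_p_y_given_z = [
    [math.log(0.9), math.log(0.1)],
    [math.log(0.2), math.log(0.8)],
]
toy_marginals = [
    math.exp(v) for v in marginal_word_log_probs(toy_log_p_z, toy_log_p_y_given_z)
]
```
]]></p><p>In this sketch every tag is visited once per candidate word at each step, so the extra work per time step grows linearly with the number of tags rather than with the number of tag sequences.</p></div>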
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Training with Neural Expectation Maximization</head><p>The log-likelihood of LaSyn's parameters &#952;<ref type="foot">foot_3</ref> computed on one training pair (x, y) &#8712; D is given by:</p><p>\mathcal{L}(\theta) = \sum_{n=1}^{N} \log \sum_{z_n \in V_z} P(y_n, z_n \mid x, y_{&lt;n}; \theta) \quad (6)</p><p>Directly optimizing the log-likelihood function (equation 6) with respect to the model parameters &#952; is challenging because of the highly non-convex function P(y_n, z_n | x, y_{&lt;n}) and the marginalization over z_n.<ref type="foot">foot_4</ref> Alternatively, we optimize the system parameters by Expectation Maximization (EM).</p><p>Using Jensen's inequality, equation (<ref type="formula">6</ref>) can be bounded from below as:</p><p>\mathcal{L}(\theta) \geq \sum_{n=1}^{N} \sum_{z_n \in V_z} Q(z_n) \log \frac{P(y_n, z_n \mid x, y_{&lt;n})}{Q(z_n)} = \mathcal{L}_{lower}(Q, \theta) \quad (7)</p><p>where L_lower(Q, &#952;) is the lower bound of the log-likelihood and Q is any auxiliary probability distribution defined on z_n. &#952; is omitted from the expression for simplicity.</p><p>We set Q(z_n) = P(z_n | x, y_{&#8804;n}; &#952;_old), the probability of the latent state computed by the decoder (shown in the left branch in Figure <ref type="figure">2</ref>). Substituting this in equation (<ref type="formula">7</ref>), we obtain the lower bound as</p><p>\mathcal{L}_{lower}(Q, \theta) = Q(\theta, \theta^{old}) - \sum_{n=1}^{N} \sum_{z_n \in V_z} Q(z_n) \log Q(z_n)</p><p>where</p><p>Q(\theta, \theta^{old}) = \sum_{n=1}^{N} \sum_{z_n \in V_z} P(z_n \mid x, y_{\leq n}; \theta^{old}) \log P(y_n, z_n \mid x, y_{&lt;n}; \theta)</p><p>The EM algorithm for optimizing Q(&#952;, &#952;_old) consists of two major steps. In the E-step, we compute the posterior distribution of z_n with respect to &#952;_old by</p><p>\gamma(z_n = i) = P(z_n = i \mid x, y_{\leq n}; \theta^{old}) = \frac{P(y_n, z_n = i \mid x, y_{&lt;n}; \theta^{old})}{\sum_{j \in V_z} P(y_n, z_n = j \mid x, y_{&lt;n}; \theta^{old})}</p><p>where &#947;(z_n = i) is the responsibility of z_n = i given y_n, and can be calculated by equation (<ref type="formula">2</ref>).</p><p>In the M-step, we aim to find the configuration of &#952; that would maximize the expected log-likelihood using the posteriors computed in the E-step. In the conventional EM algorithm for shallow probabilistic graphical models, the M-step generally has a closed-form solution. However, we model the probabilistic dependencies by deep neural networks, where Q(&#952;, &#952;_old) is highly non-convex and non-linear with respect to the network parameters &#952;. 
Therefore, there exists no analytical solution to maximize it. However, since deep neural networks are differentiable, we can update &#952; by taking a gradient ascent step:</p><p>\theta \leftarrow \theta^{old} + \eta \, \frac{\partial Q(\theta, \theta^{old})}{\partial \theta}</p><p>The resulting algorithm belongs to the class of generalized EM algorithms and is guaranteed (for a sufficiently small learning rate &#951;) to converge to a (local) optimum of the data log-likelihood <ref type="bibr">(Wu, 1983)</ref>.</p></div>
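<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the E-step and of the quantity that the M-step pushes upward, the sketch below works on a single target position with plain Python lists. It assumes that the responsibility of each tag is the joint probability of the observed word and that tag, normalized over all tags, which is what Bayes' rule gives for the factorization in Section 2.1. The function names and toy numbers are hypothetical; in a neural implementation the responsibilities would be held constant while an automatic differentiation library supplies the gradient.</p><p><![CDATA[
```python
import math


def responsibilities(p_z, p_y_given_z):
    """E-step for one position (illustrative only).

    p_z[i]         : P(z_n = i | x, y_<n) under the old parameters
    p_y_given_z[i] : P(y_n | z_n = i, x, y_<n) for the observed word y_n
    Returns gamma[i], the joint probability normalized over all tags.
    """
    joint = [pz * py for pz, py in zip(p_z, p_y_given_z)]
    total = sum(joint)
    return [j / total for j in joint]


def expected_joint_log_likelihood(gamma, p_z, p_y_given_z):
    """M-step objective for one position: sum_i gamma[i] * log P(y_n, z_n = i)."""
    return sum(
        g * math.log(pz * py)
        for g, pz, py in zip(gamma, p_z, p_y_given_z)
        if g > 0.0  # tags with zero responsibility contribute nothing
    )


# Toy position with two tags (made-up numbers).
toy_p_z = [0.5, 0.5]
toy_p_y_given_z = [0.9, 0.1]
toy_gamma = responsibilities(toy_p_z, toy_p_y_given_z)
toy_objective = expected_joint_log_likelihood(toy_gamma, toy_p_z, toy_p_y_given_z)
```
]]></p><p>With the responsibilities set to the exact posterior, adding the entropy of the responsibilities to this objective recovers the log of the marginal probability of the observed word, which is the sense in which the lower bound is tight after the E-step.</p></div>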
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Regularized EM training</head><p>The EM training we derived does not assume any supervision for the latent variables z. This can be seen as inferring the latent syntax of the target sentences by clustering the target side tokens into |V_z| different categories. Given some token-level syntactic information, we can modify the training procedure to regularize the generation of the latent sequence P(z_n | x, y_{&lt;n}) such that true latent sequences have higher probabilities under the model. In this work, we consider parts-of-speech sequences of the target sentences for regularization.</p><p>The regularized EM training objective is thus redefined as</p><p>where L_lower(&#952;) is the EM lower bound in equation (7), L_z(&#952;) denotes the cross entropy loss between P(z_n | x, y_{&lt;n}) and the true POS tag sequences, and &#955; is a hyper-parameter that controls the impact of the regularization.</p></div>
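<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way the regularized objective could be assembled is sketched below. Two conventions here are assumptions rather than details taken from the text: the combined quantity is written as a loss to be minimized (the negated lower bound plus the weighted cross entropy), and the cross entropy is averaged over positions. The function names and toy numbers are hypothetical.</p><p><![CDATA[
```python
import math


def tag_cross_entropy(p_z_per_position, gold_tags):
    """Mean negative log-probability that the tag predictor gives the gold tag.

    p_z_per_position[n][i] : P(z_n = i | x, y_<n)
    gold_tags[n]           : index of the gold POS tag at position n
    Averaging over positions is an assumption of this sketch.
    """
    nll = [-math.log(p_z[t]) for p_z, t in zip(p_z_per_position, gold_tags)]
    return sum(nll) / len(nll)


def regularized_loss(lower_bound, p_z_per_position, gold_tags, lam):
    """Loss to minimize: negated EM lower bound plus lam times the tag loss.

    The sign convention is assumed, not taken from the paper.
    """
    return -lower_bound + lam * tag_cross_entropy(p_z_per_position, gold_tags)


# Toy sentence with two positions and two tags (made-up numbers).
toy_p_z = [[0.5, 0.5], [0.25, 0.75]]
toy_gold = [0, 1]
toy_ce = tag_cross_entropy(toy_p_z, toy_gold)
toy_loss = regularized_loss(-1.0, toy_p_z, toy_gold, lam=2.0)
```
]]></p><p>Setting lam to zero in this sketch recovers the unsupervised EM objective, while larger values pull the tag predictor toward the tagger-provided POS sequences.</p></div>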
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Evaluation</head><p>We evaluate LaSyn on four translation tasks, including three with moderately sized datasets from IWSLT 2014 <ref type="bibr">(Cettolo et al., 2015)</ref>, German-English (De-En), English-German (En-De) and English-French (En-Fr), and one with a relatively larger dataset, the WMT 2014 English-German (En-De). We describe the datasets in more detail in the appendix.</p><p>We compare against three types of baselines: (i) general Seq2Seq models that use no syntactic information, (ii) models that incorporate source side syntax directly, and multitask learning models which include syntax indirectly, and (iii) models that use syntax on the target side. We also define a LaSyn Empirical Upper Bound (EUB), which is our proposed model using true POS tag sequences for inference.</p><p>We use BLEU as the evaluation metric <ref type="bibr">(Papineni et al., 2002)</ref> for translation quality. For diverse translation evaluation, we utilize the distinct-1 score <ref type="bibr">(Li et al., 2016)</ref> as the evaluation metric, which is the number of distinct unigrams divided by the total number of generated words.</p><p>For all translation tasks, we choose the base configuration of the Transformer with d_model = 512. During training, we use the Adam optimizer (Kingma and Ba, 2015) with &#946;_1 = 0.9 and &#946;_2 = 0.98, an initial learning rate of 0.0002, and 4000 warm-up steps. We describe additional implementation and training details in the Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Results on IWSLT'14 Tasks</head><p>Table <ref type="table">1</ref> compares LaSyn versions against some of the state-of-the-art models on the IWSLT'14 dataset. LaSyn-K rows show results when varying the number of EM update steps per batch (K).</p><p>On the De-En task, LaSyn provides a 1.7-point improvement over the Transformer baseline, indicating that LaSyn's gains come from incorporating target side syntax effectively. This result is also better than a transformer-based source side syntax model by 1.5 points. LaSyn results are also better than the published numbers for LSTM-based models that use multi-task learning for source side syntax and models that use target side syntax. Note that since the underlying architectures are different, we only present these results to show that the results with LaSyn are comparable with other models that have incorporated syntax.</p><p>On the En-De task, our model achieves a BLEU score of 29.2, a 2.6-point improvement over the Transformer baseline and a 2.4-point improvement over the Transformer-based Source Side Syntax model. Compared with NPMT <ref type="bibr">(Huang et al., 2018)</ref>, which is a BiLSTM-based model, we achieve a 3.84-point improvement.</p><p>On the En-Fr task, our model sets a new state of the art with a BLEU score of 40.6, a 1.7-point improvement over the second best model, which uses a Transformer to incorporate source side syntax knowledge. Our model also surpasses the basic Transformer model by about 2.1 points.</p><p>We notice that across all tasks, the performance of our model improves with the number of EM update steps per batch (K). With larger K values, we get better lower bounds L_lower(&#952;) on each training batch, thus leading to better optimization. 
For update steps beyond K &gt; 5, the performance does not improve any more.</p><p>Last, the EUB row indicates the performance that can be obtained when feeding in the true POS tags. The large improvement here shows the potential for improvement when modeling target side syntax.</p><p>Table <ref type="table">2</ref> shows one example where LaSyn produces correct translations for a long input sentence. The output of LaSyn is close to the reference and the output of LaSyn when given the gold POS tag sequence is even better, demonstrating the benefits of modeling syntax. The transformer model however fails to decode the later portions of the long input accurately.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Speed</head><p>We compare the speeds of our (un-optimized) implementation of LaSyn with a vanilla transformer with no latents in its decoder. Table <ref type="table">3</ref> shows the training time per epoch and the inference time for the whole test set, computed on the IWSLT'14 De-En task. When K = 1, LaSyn takes almost twice as much time as the vanilla transformer for training. Increasing K increases training time further. For inference, LaSyn takes close to four times as much time as the vanilla Transformer. In terms of complexity, LaSyn only adds a linear term (in the POS tag set size) to the decoding complexity. Specifically, its decoding complexity is</p><p>where B is the beam size, m is a constant proportional to the tag set size and N is the output size. As the table shows, empirically, our current implementation incurs m &#8776; 4. We leave further optimizations for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Diversity</head><p>We compare the diversity of translations using distinct-1 score <ref type="bibr">(Li et al., 2016)</ref>, which is simply the number of distinct unigrams divided by total number of generated words. We use our model to generate 10 translations for each source sentence of the test dataset. We then compare our results with baseline Transformer. The result is shown in Table <ref type="table">4</ref>. Much like translation quality, LaSyn's diversity increases with number of EM updates and is better than diversity of the transformer and a source side encoder model.</p></div>
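<div xmlns="http://www.tei-c.org/ns/1.0"><p>The distinct-1 score as defined in this section (distinct unigrams divided by total generated words) can be computed with a few lines of code. The sketch below assumes whitespace tokenization and pools all generated translations together; both choices, like the function name, are illustrative assumptions and may differ from the evaluation script used for the reported numbers.</p><p><![CDATA[
```python
def distinct_1(sentences):
    """Number of distinct unigrams divided by the total number of tokens.

    sentences: iterable of generated translations as plain strings.
    Tokens are obtained by whitespace splitting (an assumption of this sketch).
    """
    tokens = [token for sentence in sentences for token in sentence.split()]
    if not tokens:
        return 0.0  # avoid division by zero on empty output
    return len(set(tokens)) / len(tokens)


# Toy usage: five tokens in total, three of them distinct.
toy_score = distinct_1(["a b a", "b c"])
```
]]></p></div>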
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.1">Controlling Diversity with POS Sequences</head><p>One of the main strengths of LaSyn is that it can generate translations conditioned on a given POS sequence. First, we present some examples that we generate by decoding over different POS tag sequences. Given a source sentence, we use LaSyn to provide the most likely target POS tag sequence. Then, we obtain a random set of valid POS tag sequences that differ from this most likely sequence by some edit distance. For each of these randomly sampled POS tag sequences, we let LaSyn generate a translation that fits the sequence. Table <ref type="table">5</ref> shows some example sentences.</p><table><row><cell>SRC</cell><cell>letztes jahr habe ich diese beiden folien gezeigt , um zu veranschaulichen , dass die arktische eiskappe , die f&#252;r ann&#228;hernd drei millionen jahre die gr&#246;&#223;e der unteren 48 staaten hatte , um 40 prozent geschrumpft ist .</cell></row><row><cell>REF</cell><cell>last year i showed these two slides so that demonstrate that the arctic ice cap , which for most of the last three million years has been the size of the lower 48 states , has shrunk by 40 percent .</cell></row><row><cell>Transformer</cell><cell>last year , i showed these two slides to illustrate that the arctic ice caps that had the size of the lower 48 million states to 40 percent .</cell></row><row><cell>LaSyn</cell><cell>last year , i showed these two slides to illustrate that the arctic ice cap , which for nearly three million years had the size of the lower 48 states , was shrunk by 40 percent .</cell></row><row><cell>LaSyn (ground-truth POS)</cell><cell>last year i showed these two slides just to illustrate that the arctic ice cap , which for nearly about the last three million years has been the size of the lower 48 states , has shrunk by 40 percent .</cell></row></table><p>LaSyn is able to generate diverse translations that reflect the sentence structure implied by the input POS tags. 
However, in trying to fit the translation into the specified sequence, it deviates somewhat from the ideal translation.</p><p>To understand how diversity plays against translation quality, we also conduct a small scale quantitative evaluation. We pick a subset of the test dataset, and for each source sentence in this subset, we sample 10 POS tag sequences whose edit distance to the corresponding Top-1 POS tag sequence equals a specific value. We then use them to decode translations and calculate their BLEU and distinct-1 scores. The results are shown in Figure 3. As the edit distance increases, diversity increases dramatically but at the cost of translation quality. Since the POS tag sequence acts as a template for generation, as we move further from the most likely template, the model struggles to fit the content accurately. Understanding this trade-off can be useful for re-ranking or other scoring functions.</p></div>
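<div xmlns="http://www.tei-c.org/ns/1.0"><p>Selecting POS tag sequences at a fixed edit distance from the Top-1 sequence requires a distance function over tag sequences. The sketch below uses the standard Levenshtein distance with unit costs for insertion, deletion and substitution; the text does not state which edit operations or costs were used, so this choice, the helper names and the example tags are assumptions.</p><p><![CDATA[
```python
def edit_distance(a, b):
    """Levenshtein distance between two tag sequences (lists of strings)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j]
    prev = list(range(len(b) + 1))
    for i, tag_a in enumerate(a, start=1):
        curr = [i]
        for j, tag_b in enumerate(b, start=1):
            substitution = prev[j - 1] + (0 if tag_a == tag_b else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def sequences_at_distance(top1, candidates, distance):
    """Keep the candidate tag sequences exactly `distance` edits from top1."""
    return [c for c in candidates if edit_distance(top1, c) == distance]


# Toy usage with hypothetical Penn Treebank style tags.
toy_top1 = ["DT", "NN", "VBD"]
toy_candidates = [
    ["DT", "NN", "VBD"],
    ["DT", "JJ", "NN", "VBD"],
    ["PRP", "VBD"],
]
toy_selected = sequences_at_distance(toy_top1, toy_candidates, 1)
```
]]></p><p>A sampler built on this helper would still need a source of valid candidate sequences, which is outside the scope of the sketch.</p></div>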
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Results on WMT'14 En-De</head><p>To assess the impact on a larger dataset, we show results on the WMT'14 English-German task in Table 6. Compared to the previously reported systems, we see that our transformer implementation is a strong baseline. LaSyn produces small gains, with the best gain at K=5, a BLEU score improvement of 0.6. This suggests that syntactic information contributes more to translation quality on smaller datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table <ref type="table">6</ref>: WMT'14 English-German results. Shown are the BLEU scores of various models on TED talks translation tasks. We highlight the best model in bold.</head><table><row><cell>Model</cell><cell>BLEU</cell></row><row><cell>BiRNN+GCN <ref type="bibr">(Bastings et al., 2017)</ref></cell><cell>23.9</cell></row><row><cell>ConvS2S <ref type="bibr">(Gehring et al., 2017)</ref></cell><cell>25.16</cell></row><row><cell>MoE <ref type="bibr">(Shazeer et al., 2017)</ref></cell><cell>26.03</cell></row><row><cell>Transformer (base)</cell><cell>27.3</cell></row><row><cell>LaSyn (K=1) (base)</cell><cell>27.6</cell></row><row><cell>LaSyn (K=3) (base)</cell><cell>27.8</cell></row><row><cell>LaSyn (K=5) (base)</cell><cell>27.9</cell></row></table></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Related Work</head><p>Attention-based Neural Machine Translation (NMT) models have shown promising results in various large scale translation tasks <ref type="bibr">(Bahdanau et al., 2015;</ref><ref type="bibr">Luong et al., 2015;</ref><ref type="bibr">Sennrich et al., 2016;</ref><ref type="bibr">Vaswani et al., 2017)</ref> using a Seq2Seq structure. Many Statistical Machine Translation (SMT) approaches have leveraged benefits from modeling syntactic information <ref type="bibr">(Chiang et al., 2009;</ref><ref type="bibr">Huang and Knight, 2006;</ref><ref type="bibr">Shen et al., 2008)</ref>. Recent efforts have demonstrated that incorporating syntax can also be useful in neural methods.</p><p>One branch uses features on the source side to help improve the translation performance <ref type="bibr">(Sennrich and Haddow, 2016;</ref><ref type="bibr">Morishita et al., 2018;</ref><ref type="bibr">Eriguchi et al., 2016)</ref>. <ref type="bibr">Sennrich et al. (2016)</ref> explored linguistic features like lemmas, morphological features, POS tags and dependency labels and concatenated their embeddings with sentence features to improve the translation quality. In a similar vein, <ref type="bibr">Morishita et al. (2018)</ref> and <ref type="bibr">Eriguchi et al. (2016)</ref> incorporated hierarchical subword features and phrase structure into the source side representations. 
Despite the promising improvements, these approaches are limited in that the trained translation model requires the availability of external tools during inference: the source text needs to be processed first to extract syntactic structure <ref type="bibr">(Eriguchi et al., 2017)</ref>.</p><p>Another branch uses multitask learning, where the encoder of the NMT model is trained to perform multiple tasks such as POS tagging, named-entity recognition, syntactic parsing or semantic parsing <ref type="bibr">(Eriguchi et al., 2017;</ref><ref type="bibr">Baniata et al., 2018;</ref><ref type="bibr">Niehues and Cho, 2017;</ref><ref type="bibr">Zaremoodi et al., 2018)</ref>. These can be seen as models that implicitly generate syntax-informed representations during encoding. With careful model selection, these methods have demonstrated some benefits in NMT.</p><p>The third branch directly models the syntax of the target sentence during decoding <ref type="bibr">(G&#363; et al., 2018;</ref><ref type="bibr">Wang et al., 2018a;</ref><ref type="bibr">Wu et al., 2017;</ref><ref type="bibr">Aharoni and Goldberg, 2017;</ref><ref type="bibr">Bastings et al., 2017;</ref><ref type="bibr">Li et al., 2018)</ref>. <ref type="bibr">Aharoni et al. (2017)</ref> treated constituency trees as sequential strings and trained a Seq2Seq model to translate source sentences into these tree sequences. <ref type="bibr">Wang et al. (2018a)</ref> and <ref type="bibr">Wu et al. (2017)</ref> proposed to use two RNNs, a Rule RNN and a Word RNN, to generate a target sentence and its corresponding tree structure. <ref type="bibr">Gu et al. (2018)</ref> proposed a model to translate and parse at the same time. However, apart from the complex tree structure to model, they all have a similar architecture as shown in Figure <ref type="figure">1b</ref>, which limits them to only exploring a small portion of the syntactic space during inference. 
LaSyn uses simpler parts-of-speech information in a latent syntax model, avoiding the typical exponential search complexity in the latent space with a linear search complexity and is optimized by EM. This allows for better translation quality as well as diversity. Similar to our work, <ref type="bibr">(Shankar et al., 2018)</ref> and <ref type="bibr">(Shankar and Sarawagi, 2019)</ref> proposed a latent attention mechanism to further reduce the complexity of model implementation by taking a top-K approximation instead of the EM algorithm as in LaSyn.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusion</head><p>Modeling target-side syntax through true latent variables is difficult because of the additional inference complexity. In this work, we presented LaSyn, a latent syntax model that allows for efficient exploration of a large space of latent sequences. This yields significant gains on four translation tasks, IWSLT'14 English-German, German-English, English-French and WMT'14 English-German. The model also allows for better decoding of diverse translation candidates. This work only explored parts-of-speech sequences for syntax. Further extensions are needed to tackle tree-structured syntax information.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The variables can be used to model any linguistic information that can be expressed as choices for each word position (e.g., morphological choices).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Note that zn &#8712; Vz, where Vz is the vocabulary of latent syntax for the target, which differs from language to language.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Note our primary goal is to perform exhaustive search in the latent space. Search in the target vocabulary space remains exponential in our model.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>This includes the trainable parameters of the encoder, decoder, and the latent state embeddings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>Note that marginalization is an issue during training, unlike in inference. As P(y_n, z_n) is already a non-convex function with respect to &#952;, summing P(y_n, z_n) over different values of z_n makes the function more complicated. Besides, we also need to compute gradients to update the parameters, and computing the gradient of a log-of-sum function is costly and unstable. During translation, we only need to compute the value of L(&#952;) as the score for beam search. Therefore, the marginalization is not an issue.</p></note>
		</body>
		</text>
</TEI>
