<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Sentiment-based Candidate Selection for NMT</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>08/16/2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10291477</idno>
					<idno type="doi"></idno>
					<title level='j'>MT Summit</title>
<idno></idno>
<biblScope unit="volume">Volume 1: Research Track</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Alex Jones</author><author>Derry Wijaya</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The explosion of user-generated content (UGC), e.g. social media posts, comments, and reviews, has motivated the development of NLP applications tailored to these types of informal texts. Prevalent among these applications have been sentiment analysis and machine translation (MT). Grounded in the observation that UGC features highly idiomatic, sentiment-charged language, we propose a decoder-side approach that incorporates automatic sentiment scoring into the MT candidate selection process. We train monolingual sentiment classifiers in English and Spanish, in addition to a multilingual sentiment model, by fine-tuning BERT and XLM-RoBERTa. Using n-best candidates generated by a baseline MT model with beam search, we select the candidate that minimizes the absolute difference between the sentiment score of the source sentence and that of the translation, and perform two human evaluations to assess the produced translations. Unlike previous work, we select this minimally divergent translation by considering the sentiment scores of the source sentence and translation on a continuous interval, rather than using e.g. binary classification, allowing for more fine-grained selection of translation candidates. The results of human evaluations show that, in comparison to the open-source MT baseline model on top of which our sentiment-based pipeline is built, our pipeline produces more accurate translations of colloquial, sentiment-heavy source texts.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The Web, widespread internet access, and social media have transformed the way people create, consume, and share content, resulting in the proliferation of user-generated content (UGC). UGC, such as social media posts, comments, and reviews, has proven to be of paramount importance both for users and organizations/institutions <ref type="bibr">(Pozzi et al., 2016)</ref>. As users enjoy the freedoms of sharing their opinions in this relatively unconstrained environment, corporations can analyze user sentiments and extract insights for their decision-making processes <ref type="bibr">(Timoshenko and Hauser, 2019)</ref>, or translate UGC to other languages to widen the company's scope and impact. For example, <ref type="bibr">Hale (2016)</ref> shows that translating UGC between certain language pairs has beneficial effects on the overall ratings customers gave to attractions and shows on TripAdvisor, while the absence of translation hurts ratings. However, translating UGC comes with its own challenges that differ from those of translating well-formed documents like news articles. UGC is shorter and noisier, characterized by idiomatic and colloquial expressions <ref type="bibr">(Pozzi et al., 2016)</ref>. Translating idiomatic expressions is hard, as they often convey figurative meaning that cannot be reconstructed from the meaning of their parts <ref type="bibr">(Wasow et al., 1983)</ref>, and remains one of the open challenges in machine translation (MT) <ref type="bibr">(Fadaee et al., 2018)</ref>. Idiomatic expressions, however, typically carry an additional property: they imply an affective stance rather than a neutral one <ref type="bibr">(Wasow et al., 1983)</ref>. The sentiment of an idiomatic expression, therefore, can be a useful signal for translation. 
In this paper, we hypothesize that a good translation of an idiomatic text, such as those prevalent in UGC, should be one that retains its underlying sentiment, and explore the use of textual sentiment analysis to improve translations.</p><p>Our motivations for adding sentiment analysis model(s) to the NMT pipeline are several. First, with the sorts of texts prevalent in UGC (namely, idiomatic, sentiment-charged ones), the sentiment of a translated text is often arguably as important as the quality of the translation in other respects, such as adequacy, fluency, grammatical correctness, etc. Second, while a sentiment classifier can be trained particularly well to analyze the sentiment of various texts, including idiomatic expressions <ref type="bibr">(Williams et al., 2015)</ref>, these idiomatic texts may be difficult for even state-of-the-art (SOTA) MT systems to handle consistently. This can be due to problems such as literal translation of figurative speech, but also to less obvious errors such as truncation (i.e. failing to translate crucial parts of the source sentence). Our assumption, however, is that with open-source translation systems such as OPUS MT<ref type="foot">foot_1</ref>, the correct translation of a sentiment-laden, idiomatic text often lies somewhere lower among the predictions of the MT system, and that the sentiment analysis model can help signal the right translation by re-ranking candidates based on sentiment. Our contributions are as follows:</p><p>&#8226; We explore the idea of choosing translations that minimize source-target sentiment differences on a continuous scale (0-1). Previous works that addressed the integration of sentiment into the MT process have treated this difference as a simple polarity (i.e., positive, negative, or neutral) difference that does not account for the degree of difference between the source text and translation. 
&#8226; We focus in particular on idiomatic, sentiment-charged texts sampled from real-world UGC, and show, both through human evaluation and qualitative examples, that our method improves a baseline MT model's ability to select sentiment-preserving and accurate translations in notable cases. &#8226; We extend our method of using monolingual English and Spanish sentiment classifiers to aid in MT by replacing the two classifiers with a single multilingual sentiment classifier, and analyze the results of this second MT pipeline on lower-resource English-Indonesian translation, illustrating the generalizability of our approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Several papers in recent years have addressed the incorporation of sentiment into the MT process. Perhaps the earliest of these is <ref type="bibr">Sennrich et al. (2016)</ref>, which examined the effects of using honorific marking in training data to help MT systems pick up on the T-V distinction (e.g. informal tu vs. formal vous in French) that serves to convey formality or familiarity. <ref type="bibr">Si et al. (2019)</ref> used sentiment-labeled sentences containing one of a fixed set of sentiment-ambiguous words, as well as valence-sensitive word embeddings for these words, to train models such that users could input the desired sentiment at translation time and receive the translation with the appropriate valence. Lastly, <ref type="bibr">Lohar et al. (2017, 2018)</ref> experimented with training sentiment-isolated MT models, that is, MT models trained only on texts that had been pre-categorized into a set number of sentiment classes, i.e., positive-only texts or negative-only texts. Our approach is novel in using sentiment to re-rank candidate translations of UGC in an MT pipeline and in using precise sentiment scores rather than simple polarity matching to aid the translation process.</p><p>In terms of sentiment analysis models of non-English languages, <ref type="bibr">Can et al. (2018)</ref> experimented with using an RNN-based English sentiment model to analyze the sentiment of texts translated into English from other languages, while <ref type="bibr">Balahur and Turchi (2012)</ref> used SMT to generate sentiment training corpora in non-English languages. <ref type="bibr">Dashtipour et al. (2016)</ref> provides an overview and comparison of various techniques used to tackle multilingual sentiment analysis.</p><p>As for MT candidate re-ranking, Hadj Ameur et al. 
(<ref type="formula">2019</ref>) provides an extensive overview of the various features and tools that have been used to aid in the candidate selection process, and also proposes a feature ensemble approach that doesn't rely on external NLP tools. Others who have used candidate selection or re-ranking to improve MT performance include <ref type="bibr">Shen et al. (2004)</ref> and <ref type="bibr">Yuan et al. (2016)</ref>. To the best of our knowledge, however, no previous re-ranking methods have used sentiment for re-ranking despite findings that MT often alters sentiment, especially when ambiguous words or figurative language such as metaphors or idioms are present or when the translation exhibits incorrect word order <ref type="bibr">(Mohammad et al., 2016)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Models and Data</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Sentiment Classifiers</head><p>For the first portion of our experiments, we train monolingual sentiment classifiers, one for English and another for Spanish. For the English classifier, we fine-tune the BERT Base uncased model <ref type="bibr">(Devlin et al., 2019)</ref>, as it achieves SOTA or nearly SOTA results on various text classification tasks. We construct our BERT-based sentiment classifier model using BertForSequenceClassification, following <ref type="bibr">McCormick and Ryan (2019)</ref>. For our English training and development data, we sample 50K positive and 50K negative tweets from the automatically annotated sentiment corpus described in <ref type="bibr">Go et al. (2009)</ref> and use 90K tweets for training and the rest for development. For the English test set, we use the human-annotated sentiment corpus also described in <ref type="bibr">Go et al. (2009)</ref>, which consists of 359 total tweets after neutral-labeled tweets are removed. We use BertTokenizer with 'bert-base-uncased' as our vocabulary file and fine-tune a BERT model for one epoch on one NVIDIA V100 GPU to classify the tweets as positive or negative, using the Adam optimizer <ref type="bibr">(Kingma and Ba, 2014)</ref> with weight decay (AdamW in PyTorch) and a linear learning rate schedule with warmup. We use a batch size of 32, a learning rate of 2e-5, and an epsilon value of 1e-8 for Adam. We experiment with all hyperparameters manually, but find that the model converges very quickly (i.e. additional training after one epoch improves test accuracy negligibly, or causes overfitting). 
We achieve an accuracy of 85.2% on the English test set.</p><p>For the Spanish sentiment classifier, we fine-tune XLM-RoBERTa Large, a multilingual language model that has been shown to significantly outperform multilingual BERT (mBERT) on a variety of cross-lingual transfer tasks <ref type="bibr">(Conneau et al., 2020)</ref>, also using one NVIDIA V100 GPU. We construct our XLM-RoBERTa-based sentiment classifier model again following <ref type="bibr">McCormick and Ryan (2019)</ref>. The Spanish training and development data were collected from <ref type="bibr">Mozeti&#269; et al. (2016)</ref>. After removing neutral tweets, we obtain roughly 27.8K training tweets and 1.5K development tweets. The Spanish test set is a human-annotated sentiment corpus<ref type="foot">foot_2</ref> containing 7.8K tweets, of which we use roughly 3K after removing neutral tweets and evening out the number of positive and negative tweets. We use the XLMRobertaTokenizer with vocabulary file 'xlm-roberta-large' and fine-tune the XLM-RoBERTa model to classify the tweets into positive or negative labels. The optimizer, epsilon value, number of epochs, learning rate, and batch size are the same as those of the English model, determined via experimentation (without grid search or a more regimented method). Unlike with the English model, we found that fine-tuning the Spanish model sometimes produced unreliable results, and so employ multiple random restarts and select the best model, a technique used in the original BERT paper <ref type="bibr">(Devlin et al., 2019)</ref>. The test accuracy on the Spanish model was 77.8%.</p></div>
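<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the "linear learning rate schedule with warmup" mentioned above, the following sketch computes the learning rate at a given optimizer step for the English setup (90K training tweets, batch size 32, one epoch, peak learning rate 2e-5). The function name and the number of warmup steps are assumptions made for illustration only; the text does not state the warmup length, and this is not the training code used in the experiments.</p>
<eg><![CDATA[
```python
import math


def linear_warmup_lr(step: int, total_steps: int, warmup_steps: int,
                     peak_lr: float = 2e-5) -> float:
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


TRAIN_TWEETS = 90_000  # English training set size, from the text
BATCH_SIZE = 32        # batch size, from the text
WARMUP_STEPS = 100     # hypothetical value: the text does not report it
TOTAL_STEPS = math.ceil(TRAIN_TWEETS / BATCH_SIZE)  # optimizer steps in one epoch

# Learning rate at the start, at the end of warmup, and at the last step.
schedule_preview = [
    linear_warmup_lr(s, TOTAL_STEPS, WARMUP_STEPS)
    for s in (0, WARMUP_STEPS, TOTAL_STEPS)
]
```
]]></eg>
<p>Under these assumptions, one epoch corresponds to 2,813 optimizer steps, and the learning rate rises from zero to 2e-5 during warmup before decaying linearly back to zero.</p></div>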
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Baseline MT Models</head><p>The baseline MT models we use for both English-Spanish and Spanish-English translation are the publicly available Helsinki-NLP/OPUS MT models released by Hugging Face and based on Marian NMT <ref type="bibr">(Tiedemann and Thottingal, 2020;</ref><ref type="bibr">Junczys-Dowmunt et al., 2018;</ref><ref type="bibr">Wolf et al., 2019)</ref>. Namely, we use both the en-ROMANCE and ROMANCE-en Transformer-based models, which were both trained using the OPUS dataset <ref type="bibr">(Tiedemann, 2017)</ref> <ref type="foot">foot_3</ref> with SentencePiece tokenization and using training procedures and hyperparameters specified on the OPUS MT GitHub page<ref type="foot">foot_4</ref> and in <ref type="bibr">Tiedemann and Thottingal (2020)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Method: Sentiment-based Candidate Selection</head><p>We propose the use of two language-specific sentiment classifiers (which, as we will describe later in the paper, can be reduced to one multilingual sentiment model), one applied to the input sentence in the source language and another to the candidate translation in the target language, to help an MT system select the candidate translation that diverges the least, in terms of sentiment, from the source sentence.</p><p>Using the baseline MT model described in Section 3.2, we first generate n = 10 best candidate translations using a beam size of 10 at decoding time. We decided on 10 as our candidate number based on the fact that one can expect a relatively low drop off in translation quality with this parameter choice <ref type="bibr">(Hasan et al., 2007)</ref>, while also maintaining a suitably high likelihood of getting variable translations. Additionally, decoding simply becomes too slow in practice beyond a certain beam size.</p><p>Once our model generates the 10 candidate translations for a given input sentence, we use the sentiment classifier trained in the appropriate language to score the sentiment of both the input sentence and each of the translations in the interval [0, 1]. To compute the sentiment score S(x) for an input sentence x, we first compute a softmax over the array of logits returned by our sentiment model to get a probability distribution over all m possible classes (here, m = 2, since we only used positive- and negative-labeled tweets). Representing the negative and positive classes using the values 0 and 1, respectively, we define S(x) to be the expected value of the class conditioned on x, namely <formula notation="TeX">S(x) = \sum_{i=1}^{m} P(c_i \mid x)\, v_i</formula>, where <formula notation="TeX">c_i</formula> is the ith class and <formula notation="TeX">v_i</formula> is the value corresponding to that class. In our case, since we have only two classes and the negative class is represented with value 0, S(x) = P(positive class | x). 
After computing the sentiment scores, we take the absolute difference between the input sentence x's score and the candidate translation <formula notation="TeX">t_i</formula>'s score for i = 1, 2, ..., 10 to obtain the sentiment divergence of each candidate. We select the candidate translation that minimizes the sentiment divergence, namely <formula notation="TeX">y = \arg\min_{t_i} \lvert S(t_i) - S(x) \rvert</formula>. Our method of selecting a translation differs from previous works in our use of the proposed sentiment divergence, which takes into account the degree of the sentiment difference (and not just polarity difference) between the input sentence and the candidate translation.</p></div>
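<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring and selection steps described in this section can be sketched as follows. This is a minimal illustration rather than the code used in our experiments: the function names are hypothetical, the logits are made-up values standing in for classifier outputs in [negative, positive] order, and ties are broken in favor of the higher-ranked beam candidate, which is an assumption the text does not specify.</p>
<eg><![CDATA[
```python
import math
from typing import Sequence, Tuple


def sentiment_score(logits: Sequence[float],
                    values: Sequence[float] = (0.0, 1.0)) -> float:
    """Expected class value under softmax(logits): S(x) = sum_i P(c_i | x) * v_i."""
    peak = max(logits)  # subtract the max logit for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return sum((e / total) * v for e, v in zip(exps, values))


def select_candidate(source_logits: Sequence[float],
                     candidate_logits: Sequence[Sequence[float]]) -> Tuple[int, float]:
    """Return (index, divergence) of the candidate minimizing |S(t_i) - S(x)|."""
    source_score = sentiment_score(source_logits)
    divergences = [abs(sentiment_score(l) - source_score) for l in candidate_logits]
    # min() returns the first minimum, so ties go to the higher-ranked candidate.
    best = min(range(len(divergences)), key=lambda i: divergences[i])
    return best, divergences[best]


# Hypothetical classifier outputs, in [negative, positive] order.
source_logits = [2.0, -1.0]  # a clearly negative source sentence
candidates = ["candidate A", "candidate B", "candidate C"]
candidate_logits = [[-1.0, 2.0], [1.5, -0.5], [0.0, 0.0]]

best_index, divergence = select_candidate(source_logits, candidate_logits)
chosen = candidates[best_index]
```
]]></eg>
<p>With two classes valued 0 and 1, the score reduces to the softmax probability of the positive class, so the selected candidate here is the one whose positive-class probability is closest to that of the source sentence.</p></div>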
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">English-Spanish Evaluation Data</head><p>The aim of our human evaluation was to discover how Spanish-English bilingual speakers assess both the quality and the degree of sentiment preservation of our proposed sentiment-sensitive MT model's translations in comparison to those of a human (a professional translator), the baseline MT model (Helsinki-NLP/OPUS MT), and a SOTA MT model, namely Google Translate.</p><p>The human evaluation data consisted of 30 English (en) tweets, each translated using the above four methods to Spanish. We sample 30 English tweets from the English sentiment datasets that we do not use in training (Section 3.1) as well as from another English sentiment corpus (CrowdFlower, 2020)<ref type="foot">6</ref>. In assembling this evaluation set, we aimed to find a mix of texts that were highly idiomatic and sentiment-loaded (and thus presumably difficult to translate), but also ones that were more neutral in affect, less idiomatic, or some combination of the two.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">English-Spanish Evaluation Setup</head><p>For the English-Spanish evaluation, we hired two fully bilingual professional translators using the contracting site Freelancer<ref type="foot">foot_6</ref>. Both evaluators were asked to provide proof of competency in both languages beforehand. The evaluation itself consisted of four translations (one generated by each method: human, baseline, sentiment-MT, Google Translate) for each of the 30 English tweets above, totaling 120 texts to be evaluated. For each of these texts, evaluators were asked to:</p><p>1. Rate the accuracy of the translation on a 0-5 scale, with 0 being the worst quality and 5 being the best. 2. Rate the sentiment divergence of the translation on a 0-2 scale, with 0 indicating no sentiment change and 2 indicating sentiment reversal. 3. Indicate the reasons for which they believe the sentiment changed in translation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">English-Spanish Evaluation Results</head><p>As depicted in Table <ref type="table">1</ref>, the results of the English-Spanish human evaluation show improvements across the board for our modified pipeline over the vanilla baseline model. For the purposes of analysis, we divide the 30 English sentences (120 translations) into two categories: "all" (consisting of all 120 translations) and "idiomatic," consisting of 13 sentences (52 translations) deemed particularly idiomatic in nature. Although methods exist for identifying idiomatic texts systematically, e.g. <ref type="bibr">Peng et al. (2014)</ref>, we opt to hand-pick idiomatic texts ourselves. We do this in hopes of curating not only texts that contain idiomatic "multi-word" expressions, but also ones that are idiomatic in less concrete ways, which will enable us to gain more qualitative insights in the evaluation. Examples of such sentences are discussed in Section 7.</p><p>In the "all" subset of the data, we see a +0.12 gain for our modified pipeline over the baseline in terms of accuracy (where higher accuracy is better), as well as a 0.11 reduction in sentiment divergence (where smaller divergence is better). On the idiomatic subset, the differences are more pronounced: we see a +0.80 gain over the baseline for accuracy and a 0.35 reduction in sentiment divergence. While our pipeline lags behind Google Translate in all metrics for English-Spanish (due to the superiority of Google Translate over OPUS MT in multiple regards, such as training data size, parameters, multilinguality, and compute power), our modification moves OPUS MT closer to this SOTA system. 
As a benchmark and to validate the soundness of our evaluation set, we include results for translations performed by a professional human translator, which, as expected, are vastly superior to those for any of the NMT systems used across all metrics and subsets of the data.</p><p>We also provide qualitative insights gained from the evaluations, in which evaluators were asked to identify why they believe the sentiment of the text per se changed in translation. The codes corresponding to these qualitative results are listed in the rightmost column of Table <ref type="table">1</ref>, and may be identified as follows:</p><p>&#8226; "MI" indicates the Mistranslation of Idiomatic/figurative language per se &#8226; "MO" indicates the Mistranslation of Other types of language &#8226; "IG" indicates Incorrect Grammatical structure in the translation &#8226; "IR" indicates IRrecoverability of the source text's meaning, i.e. even the gist of the sentence was gone &#8226; "LT" indicates a Lack of Translatability of the source text to the language in question &#8226; "O" indicates some Other reason for sentiment divergence</p><p>Table 1: The BLEU scores on the Tatoeba dataset, the accuracy and sentiment divergence scores on Twitter data, and the top 3 reasons given for sentiment divergence for each translation method, language pair, and chosen subset of the Twitter data: all vs. idiomatic. en&#8594;es represents English-Spanish, and en&#8594;id represents English-Indonesian. Note that ratings for each language are given by different sets of evaluators, and should not be compared on a cross-lingual basis.</p><p>The top three most frequently cited causes of sentiment divergence for both the baseline and Google Translate were mistranslation of idiomatic language per se, mistranslation of other types of language, and other reasons not listed on the evaluation form. 
For our modified pipeline, the only distinctive top-three cause of sentiment divergence was incorrect grammatical structure in the translation; additionally, one human translation was surprisingly flagged as rendering the source text's meaning "irrecoverable." However, the actual frequency of these error codes varied among models. For instance, 'MO' was given 5 times to human translations but 13 times to the baseline model's, and 'O' was given 3 times to Google Translate's translations and 7 times to our pipeline's. Some translations flagged with the 'Other' category are deemed to be of special interest and are discussed in Section 7.</p><p>We also noted strong and statistically significant (p &lt;&lt; 0.05) negative correlations between accuracy and sentiment divergence for both the whole and idiomatic subsets of the data; the values of Pearson's r <ref type="bibr">(Lewis-Beck et al., 2004)</ref> with their corresponding p-values are reported in Table <ref type="table">2</ref>.</p><p>Additionally, we measure agreement between the two English-Spanish evaluators using Krippendorff's inter-annotator agreement measure &#945; <ref type="bibr">(Krippendorff, 2011)</ref>, which we choose as a metric in order to compare with previous work examining human agreement on sentiment judgments. In line with <ref type="bibr">Provoost et al. (2019)</ref>'s findings of moderate agreement (&#945; = 0.51), we see &#945; values ranging from 0.638 to 0.673 for the whole and idiomatic subsets of the data, respectively.</p><p>In terms of automatic MT evaluation, we note that our method causes a decrease in BLEU score on the Tatoeba test data for both languages (Table <ref type="table">1</ref>: SentimentMT vs. 
Baseline). This is to be expected, as Tatoeba consists of "general" texts as opposed to UGC, and we select potentially non-optimal candidates during re-ranking. Nevertheless, our method improves over the baseline for the Spanish tweets (and more so on the idiomatic tweets) on which the human evaluation was conducted. This result supports the efficacy of our model in the context of highly-idiomatic, affective UGC, and highlights the different challenges that UGC presents in comparison to more "formal" text.</p><p>Google Translate still outperforms the baseline and our method in terms of BLEU score on Tatoeba and the tweets. The explanation here is simply that the baseline model is not SOTA, which is to be expected given that it is a free, flexible, open-source system. However, as our pipeline is orthogonal to any MT model, including SOTA, it could be used to improve a SOTA MT model for UGC.</p></div>
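<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the correlation statistic used in this section, the sketch below computes Pearson's r between accuracy and sentiment divergence ratings. The ratings are invented solely to illustrate the computation and are not taken from our evaluation; the function name is likewise hypothetical, and the p-value computation is omitted.</p>
<eg><![CDATA[
```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson's correlation coefficient between two equally long rating lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant ratings")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings for six translations (not real evaluation data):
# accuracy on the 0-5 scale, sentiment divergence on the 0-2 scale.
accuracy = [5, 4, 4, 2, 1, 3]
divergence = [0, 0, 1, 2, 2, 1]

r = pearson_r(accuracy, divergence)  # negative when accurate translations diverge less
```
]]></eg>
<p>A strongly negative value of r, as in this toy example, corresponds to the pattern reported above: translations rated as more accurate tend to be rated as diverging less in sentiment.</p></div>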
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Method Extension</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Translation with Multilingual Sentiment Classifier</head><p>As highlighted in Hadj Ameur et al. ( <ref type="formula">2019</ref>), one of the major criticisms of decoder-side reranking approaches for MT is their reliance on language-specific external NLP tools, such as the sentiment classifiers described in Section 3.1. To address the issue of language specificity and to develop a sentiment analysis model that can be used in tandem with MT between any two languages, we develop a multilingual sentiment classifier following <ref type="bibr">Misra (2020)</ref>. Specifically, we fine-tune the XLM-RoBERTa model using the training and development data used to train the English sentiment classifier, and the same tokenizer, vocabulary file, hyperparameters, and compute resources (GPU) used in training the Spanish classifier. We then use this multilingual language model fine-tuned on English sentiment data to perform zero-shot sentiment classification on various languages, and incorporate it into our beam search candidate selection pipeline for MT.</p><p>We test the model using the same test data used previously. On the English test data, this multilingual model achieves an accuracy of 83.8%, comparable to the accuracy score achieved using the BERT monolingual model (85.2%). On the Spanish test set, the multilingual model achieves a somewhat lower score of 73.6% (cf. 77.8% for the monolingual trained model), perhaps showing the limitations of this massively multilingual model on performing zero-shot downstream tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">English-Indonesian Evaluation Setup</head><p>We use the multilingual sentiment classifier in our sentiment-sensitive MT pipeline to perform translations on a handful of languages; examples from this experimentation are displayed in Tables <ref type="table">4</ref> and <ref type="table">5</ref> in the appendix.</p><p>We perform another human evaluation, this time involving English&#8594;Indonesian translations in place of English&#8594;Spanish. We choose Indonesian, as it is a medium-resource language (unlike Spanish, which is high-resource) <ref type="bibr">(Joshi et al., 2020)</ref>, and because we were able to obtain two truly bilingual annotators for this language pair.</p><p>The setup of the evaluation essentially mirrors that of the en&#8594;es evaluation, except we don't obtain professional human translations as a benchmark for Indonesian, due to the difficulty of obtaining the quality of translation required. Thus, the resulting evaluation set contains only 30 * 3 = 90 translations instead of 120.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">English-Indonesian Evaluation Results</head><p>The accuracy and sentiment divergence averages for different subsets of the en-id data are located in Table <ref type="table">1</ref>, and we direct readers to Section 5.3 for a qualitative discussion of these results. Quantitatively, we observe that our modified model outperforms the baseline in accuracy and sentiment divergence on every subset of the en-id data, while being comparable to or better than Google Translate on the "all" and idiomatic subsets, respectively (Table <ref type="table">1</ref>). Specifically, on the "all" subset we see improvements of +0.33 and +0.12 over the baseline for accuracy and sentiment divergence, respectively, and on the idiomatic subset we see respective improvements of +0.70 and +0.36. Google Translate achieves slightly better accuracy and sentiment preservation overall (+0.26 and +0.10 over our pipeline for accuracy and sentiment divergence, respectively), but lags behind our pipeline in the idiomatic category (-0.20 and -0.30 for accuracy and sentiment divergence, respectively, compared to our pipeline).</p><p>Qualitatively, we see very similar reasons listed for sentiment divergence as we did for English-Spanish: each of the NMT systems we looked at had errors most frequently in the MI, MO, and O categories, denoting mistranslation of idiomatic language, mistranslation of other types of language, and other reasons for sentiment divergence, respectively. MO was more frequent than MI in the English-Indonesian evaluations, potentially due to lower MT performance for this language than for Spanish (i.e., the BLEU score of the modified English-Indonesian model is 20.85 on the Tatoeba dataset, compared to 22.15 for English-Spanish). However, as noted in the analysis of the previous evaluation, not all of these errors occurred with equal frequency across systems. 
For instance, Google Translate and the human translator produced fewer errors overall than the OPUS MT system, so the error codes should be interpreted as indicating the relative frequency and prevalence of certain translation errors that affect sentiment, not as markers to be compared on a system-to-system basis. As with the English-Spanish evaluation, certain qualitative observations made by our evaluators will be discussed further in Section 7. In line with results on the previous evaluation, accuracy and sentiment divergence are shown to be strongly negatively correlated, with Pearson's r values of -0.570 and -0.756 for the whole and idiomatic subsets of the data, respectively, both of which are statistically significant (p &lt;&lt; 0.05) and are displayed in Table <ref type="table">2</ref>. Table <ref type="table">3</ref> shows Krippendorff's alpha agreement measure <ref type="bibr">(Krippendorff, 2011)</ref> for accuracy and sentiment divergence across both subsets, indicating moderate agreement, with higher agreement on accuracy. As was found with the English-Spanish evaluation, this is in line with previous findings of moderate human agreement on sentiment judgement (Krippendorff's &#945;=0.51) <ref type="bibr">(Provoost et al., 2019)</ref>.</p></div>
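<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the agreement measure concrete, the following sketch computes Krippendorff's alpha for complete ratings from two (or more) evaluators. It assumes the interval difference metric (squared difference between ratings); the text does not state which metric was used for the reported values, so this choice, the function name, and the toy ratings are all illustrative assumptions.</p>
<eg><![CDATA[
```python
from typing import List, Sequence


def krippendorff_alpha_interval(units: Sequence[Sequence[float]]) -> float:
    """Krippendorff's alpha with the interval metric.

    units[u] holds every rating given to item u (here, one per evaluator).
    """
    # Items with fewer than two ratings cannot form a pair and are dropped.
    pairable: List[Sequence[float]] = [u for u in units if len(u) >= 2]
    values = [v for u in pairable for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: squared differences between ratings of the same item.
    d_o = sum(
        sum((a - b) ** 2 for i, a in enumerate(u) for j, b in enumerate(u) if i != j)
        / (len(u) - 1)
        for u in pairable
    ) / n
    # Expected disagreement: squared differences between all pairable ratings.
    d_e = sum(
        (a - b) ** 2
        for i, a in enumerate(values)
        for j, b in enumerate(values)
        if i != j
    ) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("alpha is undefined when all ratings are identical")
    return 1.0 - d_o / d_e


# Hypothetical accuracy ratings (0-5) from two evaluators for five translations.
ratings = [(5, 5), (4, 3), (2, 2), (0, 1), (3, 3)]
alpha = krippendorff_alpha_interval(ratings)
```
]]></eg>
<p>An alpha of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement.</p></div>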
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Discussion</head><p>Our experimentation with the various MT models generated a number of interesting example cases concerning the translation of idiomatic language. For example, given the tweet "Time Warner Road Runner customer support here absolutely blows," the baseline MT gives a literal translation of the word "blows" as "pukulan" (literally, "hits") in Indonesian; Google Translate gives a translation "hebat" ("awesome") that is opposite in sentiment to the idiomatic sense of the word "blows" ("sucks") in English; and our model gives a translation closest in meaning and sentiment to "blows," namely "kacau" (approx. "messed up" in Indonesian). There are also cases where our model gives a translation that is closer in degree of sentiment than what Google Translate produces. Given the source text "Yo @Apple fix your shitty iMessage," Google Translate produces "Yo @Apple perbaiki iMessage buruk Anda" ("Yo @Apple fix your bad iMessage"), which has roughly the same polarity as the source tweet. By contrast, our proposed model produces "Yo @Apple perbaiki imessage menyebalkan Anda," using the word "menyebalkan" ("annoying") instead of "buruk," which conveys a closer sentiment to "shitty" than simply "bad".</p><p>The evaluators of the English-Spanish translations provided us with rich qualitative commentary as well. For the sentence "Just broke my 3rd charger of the month. Get your shit together @apple," which is translated by the professional translator as "Se acaba de romper mi tercer cargador del mes. Sean m&#225;s eficientes @apple," one evaluator acutely notes that "The expression 'Get your shit together' was translated in a more formal way (it loses the vulgarism). I would have translated it as 'Poneos las pilas, joder' to keep the same sentiment. We could say that this translation has a different diaphasic variation than the source text." 
This demonstrates that sentiment preservation is a problem not only for NMT systems, but for human translators as well. Other problems stem from the challenges of machine translating informal texts. Acronyms such as "tbh" and "smh" made for another interesting case, as they were not translated by any of the MT models for any language pair, despite their common occurrence in UGC. The same evaluator also notes that "The acronym 'tbh' was not translated" in the sentence "@Apple tbh annoyed with Apple's shit at the moment," and says "this acronym is important for the sentiment because it expresses the modality of the speaker." In another example, our sentiment-sensitive pipeline helps the baseline make a semantically fine-grained distinction, namely that between "hope" and "wish": the baseline translates the sentence "@Iberia Ojal&#225; que encuentres pronto tu equipaje!!" as "@Iberia I wish you'd find your luggage soon!!," while our pipeline correctly chooses "@Iberia I hope you will find your luggage soon!!." We observe that similar issues contribute to sentiment divergence in Spanish and Indonesian, despite the fact that these are typologically disparate languages with different amounts of training data in the MT system.</p><p>In terms of automatic MT evaluation, our method improves over the baseline for the Spanish tweets on which the human evaluation was conducted. This result supports the efficacy of our model in the context of highly idiomatic, affective UGC. Google Translate still outperforms both the baseline and our pipeline in terms of BLEU score on Tatoeba (for both languages) and on the tweets (for which only Spanish had a gold-standard benchmark), given that the baseline model on which we built our pipeline is not SOTA. However, our pipeline can be added to any MT system, and could therefore also improve SOTA MT for UGC. Furthermore, our approach lends itself to many practical scenarios, e.g. 
companies that are interested in producing sentiment-preserving translations of large bodies of UGC but that lack sufficient funds to use a subscription API like Google Cloud Translation. In these contexts, it may be beneficial, or even necessary, to improve free, open-source software in a way that is tailored to one's particular use case (thus the idea of "customized MT" that many companies now offer), instead of opting for the SOTA but more costly software.</p><p>More generally, our approach shows that the performance of an MT model for a particular use case (here, UGC translation) can be improved using signals beyond translation data that are relevant to the task at hand (here, sentiment). It will therefore be interesting to explore other signals that are relevant for improving MT performance in other use cases. It will also be interesting to explore the addition of these signals in a pipeline (our current method), as implicit feedback such as in <ref type="bibr">Wijaya et al. (2017)</ref>, or as explicit feedback in an end-to-end MT model, for example as additional loss terms in supervised <ref type="bibr">(Wu et al., 2016)</ref>, weakly-supervised <ref type="bibr">(Kuwanto et al., 2021)</ref>, or unsupervised <ref type="bibr">(Artetxe et al., 2017)</ref> MT models. Beyond the potential engineering contribution for low-resource, budget-constrained settings, our experiments also offer rich qualitative insights regarding the causes of sentiment change in (machine) translation, opening up avenues to more disciplined efforts in mitigating and exploring these problems.</p></div>
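<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection step that such a pipeline adds on top of an existing MT system can be sketched as follows: among the n-best candidates, pick the one whose sentiment score is closest to that of the source. This is a minimal sketch under stated assumptions, not the released implementation. The names select_candidate, score_source and score_target are hypothetical, the scores are assumed to lie on a continuous interval such as [0, 1], and the lookup tables stand in for the fine-tuned sentiment classifiers and for the beam-search output of the MT model.</p><p><![CDATA[
```python
from typing import Callable, Sequence


def select_candidate(
    source: str,
    candidates: Sequence[str],
    score_source: Callable[[str], float],
    score_target: Callable[[str], float],
) -> str:
    """Return the candidate whose sentiment score is closest to the source's.

    Ties are resolved in favour of the earlier candidate, i.e. the one the
    MT model ranked higher, because min() keeps the first minimum it sees.
    """
    if not candidates:
        raise ValueError("candidates must be non-empty")
    src_score = score_source(source)
    return min(candidates, key=lambda c: abs(src_score - score_target(c)))


# Hypothetical stand-ins for the sentiment classifiers: fixed scores in [0, 1]
# (0 = most negative, 1 = most positive), invented purely for illustration.
TOY_SOURCE_SCORES = {"customer support here absolutely blows": 0.10}
# Single words stand in for full n-best translation candidates.
TOY_TARGET_SCORES = {"pukulan": 0.40, "hebat": 0.90, "kacau": 0.15}

best = select_candidate(
    "customer support here absolutely blows",
    ["pukulan", "hebat", "kacau"],
    score_source=TOY_SOURCE_SCORES.__getitem__,
    score_target=TOY_TARGET_SCORES.__getitem__,
)
print(best)  # kacau
```
]]></p><p>Because the function only needs a list of candidates and two scoring callables, it is independent of the MT model that produced the candidates, which is what allows this kind of selection step to be placed after any system that exposes n-best output.</p></div>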
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>In this paper, we use several distinct sentiment classifiers trained on Twitter data to help machine translation models select sentiment-preserving translations of highly idiomatic source texts. Unlike previous work, we use continuous (rather than binary or categorical) sentiment scores to select minimally divergent translations, and we test the performance of our pipeline with automated and human evaluations for English-Spanish and English-Indonesian translations.</p><p>Furthermore, we implement our sentiment-aware translation pipeline on free, open-source MT models available on Hugging Face. Although many of these models are non-SOTA, our choice to use them represents a real-world scenario: many users and companies do not have the resources or budget to subscribe to a SOTA translation API or to train their own MT model from scratch. Our pipeline offers a lightweight solution for getting more with less, in a somewhat niche yet ubiquitous translation context (social media posts).</p><p>In future work, we would like to evaluate the effect of sentiment classifier performance on the downstream MT results, including the effects of classifier architecture and of the number of sentiment categories and their distribution in the training data (e.g., UGC with more informal words may contain more affective text). We would also like to investigate how continuous sentiment scoring compares with binary or categorical scoring for this task, using a larger evaluation set of idiomatic texts, either an existing one (e.g. in English <ref type="bibr">(Michel and Neubig, 2018)</ref> or constructed in other languages <ref type="bibr">(Wibowo et al., 2021)</ref>) or a dataset we create ourselves. Finally, further work should establish benchmarks and put forth improvements for cross-lingual sentiment classification (i.e. 
the extent to which sentences that are translations of each other are assigned similar sentiments), including the problem of zero-shot transfer, building on recent work in cross-lingual performance benchmarks <ref type="bibr">(Hu et al., 2020;</ref><ref type="bibr">Liang et al., 2020)</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Code and reference materials are available at https://github.com/AlexJonesNLP/SentimentMT</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://github.com/Helsinki-NLP/Opus-MT</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>https://www.kaggle.com/c/spanish-arilines-tweets-sentiment-analysis</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>http://opus.nlpl.eu</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_4"><p>https://github.com/Helsinki-NLP/OPUS-MT-train</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_5"><p>https://data.world/crowdflower/apple-twitter-sentiment</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_6"><p>https://www.freelancer.com/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_7"><p>Proceedings of the 18th Biennial Machine Translation Summit</p></note>
		</body>
		</text>
</TEI>
