<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Neural Polysynthetic Language Modelling</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>05/13/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10184461</idno>
					<idno type="doi"></idno>
					<title level='j'>ArXivorg</title>
<idno>2331-8422</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Lane Schwartz</author><author>Francis Tyers</author><author>Lori Levin</author><author>Christo Kirov</author><author>Patrick Littell</author><author>Chi-kiu Lo</author><author>Emily Prud'hommeaux</author><author>Hyunji Hayley Park</author><author>Kenneth Steimel</author><author>Rebecca Knowles</author><author>Jeffrey Micher</author><author>Lonny Strunk</author><author>Han Liu</author><author>Coleman Haley</author><author>Katherine J. Zhang</author><author>Robbie Jimmerson</author><author>Vasilisa Andriyanets</author><author>Aldrian Obaja Muis</author><author>Naoki Otani</author><author>Jong Hyuk Park</author><author>Zhisong Zhang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are “language agnostic” – that is, that they will also work across the typologically diverse languages of the world. In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically-distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any given root, and that most or all of those variants will appear in a large enough corpus (conditioned on assumptions about domain, etc.) so that the model can adequately learn statistics about each variant. Approaches like stemming, lemmatization, morphological analysis, subword segmentation, or other normalization techniques are frequently used when either of those assumptions is likely to be violated, particularly in the case of synthetic languages like Czech and Russian that have more inflectional morphology than English. Within the NLP literature, agglutinative languages like Finnish and Turkish are commonly held up as extreme examples of morphological complexity that challenge common modelling assumptions. Yet, when considering all of the world’s languages, Finnish and Turkish are closer to the average case in terms of synthesis. 
When we consider polysynthetic languages (those at the extreme of morphological complexity), even approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena (words appearing only once in a corpus), underscoring the need for appropriate morphological handling of words, without which there is no hope for a model to capture enough statistical information about those words. Moreover, many of these languages have only very small text corpora, substantially magnifying these challenges. To this end, we examine the current state-of-the-art in language modelling, machine translation, and predictive text completion in the context of four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yup’ik, and Inuktitut. We have a particular focus on Inuit-Yupik, a highly challenging family of endangered polysynthetic languages that ranges geographically from Greenland through northern Canada and Alaska to far eastern Russia. The languages in this family are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries. Finally, we propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations (Smolensky, 1990) in order to enable successful neural language models capable of handling the full linguistic variety of typologically variant languages.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 1 Introduction</head><p>Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are "language agnostic", that is, that they will also work across the typologically diverse languages of the world.<ref type="foot">1</ref> In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically-distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any given root, and that most or all of those variants will appear in a large enough corpus (conditioned on assumptions about domain, etc.) so that the model can adequately learn statistics about each variant. Approaches like stemming, lemmatization, morphological analysis, subword segmentation, or other normalization techniques are frequently used when either of those assumptions is likely to be violated, particularly in the case of synthetic languages like Czech and Russian that have more inflectional morphology than English.</p><p>Within the NLP literature, agglutinative languages like Finnish and Turkish are commonly held up as extreme examples of morphological complexity that challenge common modelling assumptions. Yet, when considering all of the world's languages, Finnish and Turkish are closer to the average case in terms of synthesis. When we consider polysynthetic languages (those at the extreme of morphological complexity), approaches like stemming, lemmatization, or subword modelling may not suffice. 
These languages have very high numbers of hapax legomena (words appearing only once in a corpus), underscoring the need for appropriate morphological handling of words, without which there is no hope for a model to capture enough statistical information about those words. Moreover, many of these languages have only very small text corpora, substantially magnifying these challenges. The remainder of this work is structured as follows.</p><p>In Chapter 2 we briefly review the relevant background literature in finite-state morphology, language modelling, and machine translation. We review finite-state approaches to morphological analysis. We review the major approaches to language modelling, including n-gram language models, feed-forward language models, and recurrent neural language models.</p><p>In Chapter 3 we present a set of polysynthetic languages which we will consider throughout this work and detail the resources available for each. We have a particular focus on Inuit-Yupik, a highly challenging family of endangered polysynthetic languages that ranges geographically from Greenland through northern Canada and Alaska to far eastern Russia. The languages in this family are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries.</p><p>In Chapters 4-6 we examine the current state-of-the-art in language modelling, machine translation, and predictive text completion in the context of four polysynthetic languages: Guaran&#237;, St. Lawrence Island Yupik, Central Alaskan Yup'ik, and Inuktitut. 
In Chapter 4 we present experiments and results on machine translation into, out of, and between polysynthetic languages; we carry out experiments between various Inuit-Yupik languages and English, as well as between Guaran&#237; and Spanish, showing that multilingual approaches incorporating data from higher-resource members of the language family can effectively improve translation into lower-resource languages. In Chapter 5, we present language modelling experiments across a range of languages and vocabularies. In Chapter 6 we present practical applications which we anticipate will benefit from our language model and multilingual approaches, along with preliminary experimental results and discussion of future work.</p><note place="foot" n="1">Emily Bender provides a thorough discussion of this problem in <ref type="url">https://thegradient.pub/the-benderrule-on-naming-the-languages-we-study-and-why-it-matters/</ref>.</note><figure><head>Overview of the tangible artefacts, models, and applications in this report. We start with all of the available resources for a given language, including (bi-)texts, grammars, and dictionaries. These are used to create finite-state morphological analyzers and MT systems (&#167;4) directly. The finite-state morphological analyzers are then applied to corpora to create segmented or analyzed corpora (&#167;2). These are used both to build language models (&#167;5) and machine translation systems (&#167;4) based on the segmented morphemes and to create interpretable morpheme-based language models using tensor product representations (&#167;7). The final results are predictive keyboards that use morphemes as the unit of prediction (&#167;6), with potential future work (greyed out) including automatic speech recognition and morpheme-based machine translation.</head></figure><p>Finally, in Chapter 7 we present a core theoretical contribution of this work: a feature-rich open-vocabulary interpretable language model designed to support a wide range of typologically and morphologically diverse languages. 
This approach uses a novel neural architecture that explicitly models characters and morphemes in addition to words and sentences, making explicit use of knowledge representations from finite-state morphological analyzers, in combination with Tensor Product Representations <ref type="bibr">(Smolensky, 1990)</ref>, to enable successful neural language models capable of handling the full linguistic variety of typologically variant languages. We present our conclusions in Chapter 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 2 Background</head><p>In this chapter we provide a brief overview of the background technologies that underlie this report, namely finite-state approaches to morphological analysis (&#167;2.1), n-gram and neural language modelling techniques (&#167;2.2), and neural machine translation (&#167;2.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Finite-state morphology</head><p>Initial approaches to modelling the morphology of natural languages in the mid-20th century tended to focus on unidirectional algorithmic solutions for particular languages, implemented in general-purpose (rather than domain-specific) programming languages. These included generators, which generated wordforms from an analysis specification; analyzers, which returned possible analyses for a given word; and lemmatizers or stemmers, which aimed to return a baseform, stem, or lemma given a wordform. These approaches had a number of downsides, the first being that the same code could not be used for analysis and generation, so for each language, separate code had to be written for these two tasks. In addition, descriptions could not be shared between related languages without much difficulty and there was little formalization.</p><p>In the early 1980s this changed with the introduction of finite-state morphology. In this formalization of morphology, the set of potential strings (wordform-analysis pairs) in a language is represented by a finite-state transducer. A finite-state transducer is a special class of finite-state automaton where each arc has both an input symbol and an output symbol. There are two main approaches to modelling morphophonological (or morphographemic) rules using finite-state approaches. The first consists of applying a sequence of rewrite rules in the form &#945; &#8594; &#946; / &#947; _ &#948;, where the alphabet symbol &#945; is rewritten as &#946; between &#947; and &#948;. The second approach is referred to as two-level morphology <ref type="bibr">(Koskenniemi, 1983)</ref>. In this approach, phonological rules are unordered constraints over possible symbol pairs. 
As <ref type="bibr">Karttunen (1993)</ref> notes, the two approaches are formally equivalent and all phonological phenomena that can be described with one can be described with the other.</p><p>Given a description, a finite-state morphological analyzer can produce both analyses of surface tokens (e.g. sequences of tags and lemmas such as those found in interlinear glosses) and segmentations of surface tokens. Consider the output of the analyzer for the Guaran&#237; sentence Reh&#243;tapa che rend&#225;pe. 'Will you come with me' in Example <ref type="bibr">(1)</ref>. The output includes the lemmas ho 'come', che 'my' and tenda 'place'; person and number tags such as &lt;p2&gt; 'second person' and &lt;sg&gt; 'singular'; and tags indicating word class, such as &lt;n&gt; 'noun' and &lt;v&gt; 'verb', among others.</p><p>(1) Input: Reh&#243;tapa che rend&#225;pe.<lb/>Analysis: re&lt;prn&gt;&lt;p2&gt;&lt;sg&gt;+ho&lt;v&gt;&lt;iv&gt;+ta&lt;fti&gt;+pa&lt;qst&gt; che&lt;prn&gt;&lt;pos&gt;&lt;p1&gt;&lt;sg&gt; r&lt;det&gt;+tenda&lt;n&gt;+pe&lt;post&gt;<lb/>Segmentation: Reh&#243;&gt;ta&gt;pa che r&gt;end&#225;&gt;pe</p><p>This is especially important for polysynthetic languages, as words can be made up of many morphemes. For example, the word &#241;aha'ar&#245;'&#7929;et&#233;va 'that we did not expect at all' in the sentence Oiko pete&#297; mba'e &#241;aha'ar&#245;'&#7929;et&#233;va. 'Something happened that we did not expect at all' can be decomposed as in Example (2) below.</p><p>(2) Input: &#241;aha'ar&#245;'&#7929;et&#233;va<lb/>Analysis: &#241;a&lt;prn&gt;&lt;p1&gt;&lt;pl&gt;+ha'ar&#245;+&#7929;&lt;neg&gt;+ete&lt;emph&gt;+va&lt;subs&gt;<lb/>Segmentation: &#241;a&gt;ha'ar&#245;&gt;'&#7929;&gt;ete&gt;va</p><p>The amount of time required to develop a finite-state description can vary widely, from two weeks, given a trained developer and a description of a related language (e.g. Kumyk in <ref type="bibr">Washington et al. (2014)</ref>), to a year for a developer completely unfamiliar with the tools and language. The speed is also affected by the available resources such as grammatical descriptions and machine-readable lexicons.</p><p>One shortcoming of many finite-state morphological analyzers is an inability to assign probabilities to analyses. Table 2.1 lists six example English sentences which each contain the word wound; each of these six uses is analyzed with a distinct linguistic analysis. When analyzing an English sentence that contains the word wound, an unweighted English morphological analyzer would posit all of these analyses, and would be unable to suggest which might be the most probable. Some finite-state morphological toolkits support the use of probabilities on arcs in constructed finite-state transducers <ref type="bibr">(Mohri, 2001)</ref>. This means that it is possible to make analyzers and segmenters where the output is ranked, either by probability or by some other metric. Arc probability weights can be obtained from corpus statistics or from other measures. This is especially important for polysynthetic languages, where words may potentially have many analyses. We describe the methods we used to weight our analyzers in Section 3.4.</p>
<table><head>Table 2.1: List of analyses for the wordform wound in English, along with example sentences and frequency according to the English treebanks from the Universal Dependencies project <ref type="bibr">(Nivre et al., 2016)</ref>.</head>
<row role="label"><cell>Analysis</cell><cell>Example</cell><cell>Frequency</cell><cell>Rel. frequency</cell></row>
<row><cell>'wind-PAST'</cell><cell>She wound the watch.</cell><cell>4</cell><cell>0.66</cell></row>
<row><cell>'wind-PP'</cell><cell>She had wound the watch.</cell><cell>1</cell><cell>0.16</cell></row>
<row><cell>'wound-N.SG'</cell><cell>The wound healed quickly.</cell><cell>1</cell><cell>0.16</cell></row>
<row><cell>'wound-INF'</cell><cell>Therefore I will wound you.</cell><cell>0</cell><cell>0</cell></row>
<row><cell>'wound-PRES'</cell><cell>They wound and they heal.</cell><cell>0</cell><cell>0</cell></row>
<row><cell>'wound-IMPER'</cell><cell>You wound me sir!</cell><cell>0</cell><cell>0</cell></row>
</table></div>
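<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of how corpus statistics could be used to rank the output of an analyzer, the following minimal sketch turns the frequency counts reported in Table 2.1 into relative frequencies and sorts the analyses accordingly. The function name rank_analyses and the dictionary layout are assumptions made for this example; they are not part of any finite-state toolkit, and the weighting method actually used in this report is the one described in Section 3.4.</p>
<eg xml:space="preserve">
```python
# Corpus counts for the analyses of "wound", copied from Table 2.1.
counts = {
    "wind-PAST": 4,
    "wind-PP": 1,
    "wound-N.SG": 1,
    "wound-INF": 0,
    "wound-PRES": 0,
    "wound-IMPER": 0,
}


def rank_analyses(counts):
    """Return (analysis, relative frequency) pairs, most frequent first.

    Ties are broken alphabetically so that the ranking is deterministic.
    """
    total = sum(counts.values())
    if total == 0:
        # No corpus evidence at all: every analysis gets frequency 0.
        return [(analysis, 0.0) for analysis in sorted(counts)]
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(analysis, count / total) for analysis, count in ordered]


ranked = rank_analyses(counts)
for analysis, rel_freq in ranked:
    print(analysis, round(rel_freq, 2))
```
</eg>
<p>Analyses that never occur in the corpus receive a relative frequency of zero in this sketch; a practical weighting scheme would probably need some form of smoothing so that unseen analyses are not ruled out entirely.</p></div>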
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Language modelling</head><p>A language model is any model that describes natural language. By that description, the finite-state models from the previous section could also be considered a form of language model. In this section, however, we use a narrower definition of a language model as a model of a probability distribution over sequences of vocabulary items (characters, words). Perhaps the simplest approximation to determine the probability of a sentence would be to use a unigram model over words. In such a model, the probability of a sentence is defined as the product of the probabilities of the individual words, which could be estimated by taking their relative frequency in a given corpus. While such a model could reasonably discriminate between the relative probabilities of sentences such as (a) "have a great trip" and (b) "have a superannuated tardigrade", it would not be able to distinguish the relative probability of (c) "great a have trip" and (a). A more accurate, but less tractable approximation would be to ask all speakers of a given language to rank all of the possible sentences in that language by some metric of 'goodness'. The idea of language modelling, then, is to find a tractable way to model the distribution of probability for sequences of linguistic symbols or tokens.</p><p>This simple model can be extended to n-gram language models <ref type="bibr">(Shannon, 1948</ref>, <ref type="bibr">1951)</ref>, whereby instead of modelling single units (characters, words), what is modelled is sequences of units. Thus in a bigram word model, the sequences modelled would be bigrams, e.g. {have a, a great, great trip} and {great a, a have, have trip} from examples (a) and (c) respectively. 
For languages where large amounts of monolingual training data are available, language models of order 5-7 have been widely used in applications such as machine translation and automatic speech recognition.</p><p>However, as the model is extended to cover longer sequences, the problem of out-of-vocabulary (OOV) items becomes more severe. This happens when the sequence we are attempting to estimate the probability of does not appear in our model. This can be illustrated with the example in (b) above. The sequence "superannuated tardigrade" returns no results when queried on several major search engines. It is therefore highly likely that a bigram language model trained using all English text available on the internet would estimate the probability of this sequence to be zero, and therefore the probability of the entire sentence would also be zero. There are two techniques that have been developed to deal with this problem. Smoothing techniques reserve a small amount of the probability mass to distribute to unseen n-grams <ref type="bibr">(Good, 1953;</ref> <ref type="bibr">Jelinek and Mercer, 1980;</ref> <ref type="bibr">Katz, 1987;</ref> <ref type="bibr">Witten and Bell, 1991;</ref> <ref type="bibr">Church and Gale, 1991;</ref> <ref type="bibr">Ney et al., 1994;</ref> <ref type="bibr">Kneser and Ney, 1995)</ref>, while backoff techniques allow combinations of lower-order n-grams to be used to estimate the probability of higher-order ones. In example (b) the probabilities of 'superannuated' and 'tardigrade' would be used to estimate the probability of 'superannuated tardigrade'.</p><p>One of the issues with n-gram language models is that parameters are not shared between tokens and sequences. For example, the token 'wonderful' is as far from 'great' as is the token 'superannuated'. So if we have the sequence "have a wonderful trip", the other shared contexts that 'wonderful' and 'great' appear in are not taken into account. 
A way of dealing with this problem is to use distributional representations of individual tokens, as in <ref type="bibr">Bengio et al. (2000</ref>, <ref type="bibr">2003)</ref>. Here each token is represented by a vector of real numbers, embedding each token in a shared vector space. In this kind of language model it is still necessary to specify a fixed n-gram context, which means that the amount of context that can be taken into account is limited to a fixed-sized window for each token. <ref type="bibr">Mikolov et al. (2010)</ref> describe using recurrent neural networks to model context, allowing whole-sentence context to be taken into account. In addition they introduce efficient methods of training the distributional vectors such that corpora numbering in the billions of words can be used in training. In both the models proposed by <ref type="bibr">Bengio et al. (2000)</ref> and <ref type="bibr">Mikolov et al. (2010)</ref> each token is represented by a single vector. As the examples above show, this is not always tenable: words in natural language are ambiguous (cf. wound and trip, 'to trip over something' or 'a nice trip'). In ELMo <ref type="bibr">(Peters et al., 2018)</ref> and BERT <ref type="bibr">(Devlin et al., 2019)</ref>, each word vector is context dependent, both on external, sentence-level context, and on word-internal context, so even if a given token has not been seen before, the model can generalize from forms that have similar surface forms and appear in similar contexts. This would seem to be an ideal model for polysynthetic languages; however, the downside is that these models typically contain very large numbers of parameters, which in turn require very large amounts of training data, far more than is available for most endangered languages.</p></div>
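<div xmlns="http://www.tei-c.org/ns/1.0"><p>The contrast between unigram and bigram models drawn in this section can be made concrete with a small sketch. It assumes a toy three-sentence corpus invented for this example, and the helper names unigram_prob and bigram_prob are hypothetical; both models are unsmoothed maximum-likelihood estimates rather than any of the models evaluated in this report.</p>
<eg xml:space="preserve">
```python
from collections import Counter
from math import prod

# Toy corpus invented for illustration; it is not drawn from the report's data.
corpus = [
    "have a great trip",
    "have a great day",
    "a great trip",
]


def unigram_prob(sentence, corpus):
    """Product of relative word frequencies; word order plays no role."""
    words = [w for s in corpus for w in s.split()]
    freq = Counter(words)
    return prod(freq[w] / len(words) for w in sentence.split())


def bigram_prob(sentence, corpus):
    """Product of unsmoothed conditional bigram probabilities P(w2 | w1)."""
    bigrams = Counter()
    history = Counter()
    for s in corpus:
        toks = ["BOS"] + s.split() + ["EOS"]
        for w1, w2 in zip(toks, toks[1:]):
            bigrams[(w1, w2)] += 1
            history[w1] += 1
    toks = ["BOS"] + sentence.split() + ["EOS"]
    p = 1.0
    for w1, w2 in zip(toks, toks[1:]):
        if history[w1] == 0:
            # The conditioning word was never seen, so no estimate exists.
            return 0.0
        p *= bigrams[(w1, w2)] / history[w1]
    return p


a = "have a great trip"   # sentence (a) from the text
c = "great a have trip"   # sentence (c): same words, scrambled order

print(unigram_prob(a, corpus), unigram_prob(c, corpus))
print(bigram_prob(a, corpus), bigram_prob(c, corpus))
```
</eg>
<p>Under these assumptions the unigram model assigns (a) and (c) the same probability up to floating-point rounding, while the bigram model assigns (c) a probability of zero because it contains unseen bigrams. That zero is the behaviour that smoothing and backoff are designed to soften.</p></div>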
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Machine Translation</head><p>In recent years, the machine translation community has gravitated toward neural approaches to machine translation. Midway through the 2010s, these began outperforming phrase-based statistical and other approaches in large-scale evaluations <ref type="bibr">(Bojar et al., 2016)</ref>. This success has driven a rapid sequence of approaches to building neural machine translation models, from sequence-to-sequence models <ref type="bibr">(Sutskever et al., 2014)</ref>, to models with attention <ref type="bibr">(Bahdanau et al., 2015)</ref>, to models that primarily rely on attention <ref type="bibr">(Vaswani et al., 2017)</ref>. In preparation for the workshop, we trained both statistical and neural machine translation models on the available training data. During the workshop, we focused solely on neural approaches to machine translation, and report those experiments in Chapter 4. As our experiments tended to examine variations of the input to the translation models rather than modifications to the networks themselves, we do not provide a thorough overview of the techniques here; for additional detail, please see the cited code and papers.</p><p>There does exist prior work on machine translation for polysynthetic languages, though it has generally been limited by small data sizes. In their recent overview of corpus resources for indigenous languages of the Americas, <ref type="bibr">Mager et al. (2018a)</ref> note that most of the parallel corpora they found were quite small (less than 250,000 lines of text). <ref type="bibr">Homola (2012)</ref> proposed the use of rule-based systems for polysynthetic languages, but this approach is still labor-intensive, as it requires the application of extensive linguistic knowledge or other tools. <ref type="bibr">Monson et al. (2006)</ref> report on Mapudungun and Quechua to Spanish machine translation systems. <ref type="bibr">Mager et al. 
(2018b)</ref> discuss challenges of translating between polysynthetic and fusional languages. This is not a complete account of all such work.</p><p>Of special note for the purposes of this work is existing research on two of the languages we worked on this summer: Inuktitut and Guaran&#237;. For translation between Guaran&#237; and Spanish, we are aware of an online gister (<ref type="url">http://iguarani.com/</ref>), Bible translations evaluated on stemmed output <ref type="bibr">(Rudnick, 2018)</ref>, and a system for translators called Mainumby by <ref type="bibr">Gasser (2018)</ref>. Previous work on translation between Inuktitut and English can be found in <ref type="bibr">Micher (2018b)</ref>, in which results of statistical machine translation for English and Inuktitut are reported. Micher makes use of a previous, morphologically analyzed version of the Nunavut Hansard corpus to enhance SMT systems. Details on developing this corpus can be found in <ref type="bibr">Micher (2018a)</ref>. The FST-based analyzer <ref type="bibr">(Farley, 2009)</ref> and the neural analyzer <ref type="bibr">(Micher, 2017)</ref> are used together to morphologically analyze this data set. <ref type="bibr">Klavans et al. (2018a)</ref> discuss some of the challenges of building such translation systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 3 Languages &amp; Resources</head><p>A central issue that arises when conducting research on polysynthetic languages is the lack of resources: many polysynthetic languages are very low resource. Because language modelling requires corpora, we directed effort towards locating existing corpora for polysynthetic languages and assessing their usability for different experiments. While we used only a subset of what we collected for experiments, this chapter provides an overview of all linguistic resources we gained access to in the process, in order to offer a glimpse into available polysynthetic language resources.</p><p>In what follows, we provide short descriptions of the language families and languages involved and the corpora we collected. We briefly discuss the characteristics of polysynthetic languages based on descriptive statistics and the texts we selected for subsequent experiments. Details regarding corpus preprocessing are described in the context of experiments discussed in later chapters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Language selection &amp; data collection</head><p>We obtained corpora and resources for six languages: Chukchi, St. Lawrence Island Yupik, Central Alaskan Yup'ik, Inuktitut, Crow, and Guaran&#237;. These languages, all of which are low-resource and polysynthetic, come from four different families. We focused in particular on the Inuit-Yupik-Unangan family, from which three of the languages were selected. The Inuit-Yupik-Unangan languages, historically known as Eskimo-Aleut, are a language family native to the Russian Far East, Alaska, Canada, and Greenland. The family is divided into two branches: Inuit-Yupik and Unangan. St. Lawrence Island Yupik, Central Alaskan Yup'ik, and Inuktitut belong to the Inuit-Yupik branch of the family.</p><p>In preparation for the workshop, we gathered spoken and written corpora for the selected polysynthetic languages. In addition to written and spoken corpora, where available, we also gathered dictionaries, reference grammars, and finite-state morphological analyzers. Table <ref type="table">3.1</ref> provides a summary of the resources we had in each language. We refer to each language by name or by ISO 639-3 code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1">Chukchi</head><p>Chukchi (ckt) is the most widely spoken language in the Chukotko-Kamchatkan family, with approximately 5000 speakers. The Chukotko-Kamchatkan languages are native to the Russian Far East, and Chukchi is spoken in the easternmost part, mainly on the Chukotka Peninsula. We obtained audio data and transcripts for Chukchi from <ref type="url">http://chuklang.ru</ref>, a website dedicated to materials and research on Chukchi funded by the Russian Science Foundation. The audio data contains two books of the Bible, the Book of Jonah and the Gospel of Luke, and short stories in the language. The stories represent a valuable resource for the endangered language. The transcripts are in both Latin and Cyrillic scripts. There also exists a prototype finite-state morphological analyzer for Chukchi <ref type="bibr">(Andriyanets and Tyers, 2018)</ref>. This analyzer was expanded on during the workshop using the transcripts of the audio data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2">St. Lawrence Island Yupik</head><p>St. Lawrence Island Yupik (ess) is an endangered language in the Inuit-Yupik family spoken on St. Lawrence Island, Alaska and on the Chukotka Peninsula of the Russian Far East. We collected a corpus consisting primarily of scanned and digitized books, including educational materials <ref type="bibr">(Apassingok et al., 1993</ref>, <ref type="bibr">1994</ref>, <ref type="bibr">1995)</ref>, oral narratives <ref type="bibr">(Nagai, 2001;</ref> <ref type="bibr">Apassingok et al., 1985</ref>, <ref type="bibr">1987</ref>, <ref type="bibr">1989;</ref> <ref type="bibr">Slwooko, 1977</ref>, <ref type="bibr">1979</ref>) and a reference grammar <ref type="bibr">(Jacobson, 2001)</ref>. In addition, we made use of the Yupik translation of the New Testament<ref type="foot">foot_0</ref> <ref type="bibr">(Wycliffe, 2018)</ref>. We made use of the <ref type="bibr">Chen and Schwartz (2018)</ref> finite-state morphological analyzer, which was based on the Yupik grammar of <ref type="bibr">Jacobson (2001)</ref> and incorporated Yupik lexical entries from the Badten et al. (<ref type="formula">2008</ref>) dictionary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.3">Central Alaskan Yup'ik</head><p>Central Alaskan Yup'ik (esu) is an official language of Alaska that is spoken by about 10,000 speakers in the western and southwestern parts of the state. There are five major dialects of Central Alaskan Yup'ik, of which General Central Yup'ik (Yugtun) is the most widely spoken.</p><p>This workshop made use of a Yup'ik translation<ref type="foot">foot_1</ref> of the Bible. As one of our team members speaks the language, we were able to align it with a corresponding English Bible <ref type="bibr">(Good News Translation, Today's English Version, Second Edition)</ref>. The parallel data were used for both machine translation and language modelling experiments. Additionally, the Yup'ik Bible and a dictionary <ref type="bibr">(Jacobson, 1984)</ref> were used to begin development on a Yup'ik finite-state morphological analyzer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.4">Inuktitut</head><p>Inuktut (a term that includes the variants Inuktitut and Inuinnaqtun) is one of the official languages of Nunavut, the largest territory of Canada, and is spoken by approximately 39,770 people in Canada <ref type="bibr">(Statistics Canada, 2017)</ref>. It also has official recognition in several other areas and is part of the Inuit-Yupik-Unangan language family. Inuktut can be written in syllabics or in roman orthography, and regional variations use different special characters and spelling conventions.</p><p>As Inuktut is an official language of government in Nunavut, some resources are available in this language at a much larger scale than for most other languages in the same family, notably a parallel corpus with English. Since its formation in 1999, the Legislative Assembly of Nunavut has been publishing its proceedings (known as a Hansard) in both Inuktitut (iku) and English. <ref type="foot">3</ref> In the subsequent 20 years, the collected Nunavut Hansard has grown to be a substantial bilingual corpus <ref type="bibr">(Martin et al., 2003</ref>, <ref type="bibr">2005;</ref> <ref type="bibr">Farley, 2008;</ref> <ref type="bibr">Joanis et al., 2020)</ref>, putting Inuktitut in the perhaps unique position of a polysynthetic language with a parallel corpus of more than a million sentence pairs. We discuss the different versions of this data, and their preprocessing for machine translation, in Section 4.2.</p><p>We also made use of an Inuktitut translation<ref type="foot">foot_3</ref> of the Bible for language modelling experiments. We decided to exclude the Hansard from the language modelling experiments, as including it would make the Inuktitut dataset substantially different from the other datasets and thus hard to compare with other languages. How we preprocessed the data for language modelling is discussed in Chapter 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.5">Crow</head><p>Crow (Aps&#225;alooke, language code cro) is one of the most widely spoken languages of the Siouan family, with approximately 3500 speakers. The Siouan languages are native primarily to the Great Plains of North America, and Crow specifically is spoken in southern Montana. Our primary resource for Crow was a series of audio recordings for a dictionary developed by the Language Conservancy, an organization that protects and revitalizes Native American languages. This corpus consists of 11.7 hours of recordings produced by 14 speakers. The data is composed entirely of single words and short phrases from the online Crow Dictionary project <ref type="bibr">(The Crow Language Conservancy, 2019)</ref>. This data was obtained with special permission from the Language Conservancy and is not publicly available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.6">Guaran&#237;</head><p>Guaran&#237; (grn) is a Tupian language native to South America. It is an official language of Paraguay and the most widely spoken language in the country with almost 5 million speakers. It is also the only indigenous language of the Americas with a large number of non-indigenous native speakers.</p><p>We were able to obtain Guaran&#237;-Spanish parallel Bible translations. The Guaran&#237; Bible was translated and published by the Sociedad B&#237;blica Paraguaya. The parallel translations were used for language modelling and machine translation experiments. A morphological analyser developed by <ref type="bibr">Kuznetsova and Tyers (2019)</ref>, apertium-grn, was also used.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Descriptive statistics of the corpora</head><p>The polysynthetic languages described above differ significantly from languages such as English and Spanish. One major point of difference is in the ratio of word types to word tokens; given the number of word tokens and the number of unique word types, the type-token ratio is calculated as TTR = |types| / |tokens|. Another useful metric, proposed by <ref type="bibr">Hasegawa-Johnson et al. (2017a)</ref> and used for polysynthetic languages by <ref type="bibr">Schwartz et al. (2020)</ref>, calculates the mean distance to the next novel word type (MDN).</p><p>Table <ref type="table">3</ref>.2 displays these text metrics for all textual corpora used. Large differences exist between different languages and between different corpora of the same language with respect to these metrics. The polysynthetic languages examined display higher type-token ratios and lower average distances to the next novel word type in comparison to the non-polysynthetic languages (English and Spanish). This is particularly pronounced for parallel corpora: the New Testament in English has a type-token ratio approximately nine times lower than that of St. Lawrence Island Yupik. This is somewhat expected, as the central focus of this work is determining effective strategies for working with highly morphologically complex polysynthetic languages, and previous research <ref type="bibr">(Kettunen, 2014)</ref> has indicated that morphological complexity is correlated with metrics like TTR.</p><p>The datasets utilized cover a large number of different domains as well, including religious texts, parliamentary proceedings, audio transcriptions, and data scraped from internet resources. These domain differences contribute to the differences in corpus properties as well. 
For example, both the English Bible and the English Nunavut Hansard corpus have lower type-token ratios and higher mean distances to the next novel type than their polysynthetic counterparts. However, the formulaic language of parliamentary proceedings causes the English Hansard corpus to have a type-token ratio seven times lower than that of the English Bible used. These domain differences were controlled for in the language modelling experiments described in Chapter 5 by using the New Testament for several different languages. For the other tasks, comparisons between languages are used sparingly if similar genres of text are not available for both languages.</p></div>
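<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two corpus statistics can be illustrated with a short sketch. The TTR function follows the definition given above. The MDN function is only an assumed reading of "mean distance to the next novel word type" (the mean gap, in tokens, between first occurrences of consecutive new types); the exact definition in Hasegawa-Johnson et al. (2017a) may differ. The function names and the toy corpus are hypothetical.</p>

```python
from typing import List


def type_token_ratio(tokens: List[str]) -> float:
    """TTR = |types| / |tokens| over a tokenized corpus."""
    if not tokens:
        raise ValueError("corpus must contain at least one token")
    return len(set(tokens)) / len(tokens)


def mean_distance_to_novel_type(tokens: List[str]) -> float:
    """Assumed reading of MDN: mean gap, in tokens, between the first
    occurrences of consecutive novel word types."""
    seen = set()
    first_positions = []
    for position, token in enumerate(tokens):
        if token not in seen:
            seen.add(token)
            first_positions.append(position)
    gaps = [b - a for a, b in zip(first_positions, first_positions[1:])]
    if not gaps:
        raise ValueError("corpus must contain at least two word types")
    return sum(gaps) / len(gaps)


# Toy English example (not taken from any corpus used in this work).
toy_corpus = "the dog saw the dog and the cat".split()
ttr = type_token_ratio(toy_corpus)
mdn = mean_distance_to_novel_type(toy_corpus)
print(ttr, mdn)
```

<p>Under this reading, a corpus in which almost every token is a new word type has a TTR close to 1 and an MDN close to 1, which matches the direction of the differences reported for the polysynthetic corpora.</p></div>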
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Preprocessing</head><p>We preprocessed the corpora for 1) machine translation and 2) language modelling experiments. The general principles and strategies we adopted for preprocessing are very similar for both sets of experiments. We removed any redundant lines and verse numbers to clean up the corpora. We normalized apostrophes so that they remained part of a word after we tokenized the data using Moses scripts <ref type="bibr">(Koehn et al., 2007)</ref>. As truecasing is a common practice in machine translation, we truecased the text for machine translation experiments, but not for language modelling experiments. Using the cleaned-up datasets, we explored different tokenization strategies. FST and BPE segmentation methods were adopted for machine translation experiments, and character, BPE, Morfessor and FST segmentation levels were used for language modelling experiments. Details about how we selected and preprocessed the datasets for the two sets of experiments are discussed in Chapter 4 (Machine Translation) and Chapter 5 (Language modelling), respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Estimating weights for finite-state morphological analyzers</head><p>We used three approaches to estimate weights for our finite-state analysers: one supervised, one heuristic and one unsupervised. The supervised method was the simplest. We had a small corpus of annotated (manually disambiguated) text for Guaran&#237;, the test corpus from <ref type="bibr">Kuznetsova and Tyers (2019)</ref>. Using this corpus, we first assigned a default weight of 1 to all wordform-analysis pairs. For the wordform-analysis pairs found in the corpus, we then assigned a weight equal to 1 - P(a|w), where P(a|w) is the number of times the analysis occurs with the particular wordform divided by the total number of times the wordform appears. This is necessarily a number between zero and one, and thus analyses of wordforms seen in the corpus received a lower weight than unseen wordform-analysis pairs. Given the small size of the annotated corpus (2020 wordforms), the majority of the wordforms in the corpora we analysed were unseen. For both the Yupik analyser and the Guaran&#237; analyser we added a further heuristic: for each morpheme boundary, we increased the weight by 1. The motivation behind this heuristic is that we wanted to favor lexicalized forms and disfavor forms with very many derivations when there was a lexicalized alternative. In addition, we experimented with a novel unsupervised approach to weighting the transducers based on byte-pair encoding (BPE; <ref type="bibr">Sennrich et al., 2016)</ref>.</p><p>Chapter 4</p></div>
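<div xmlns="http://www.tei-c.org/ns/1.0"><p>The supervised and heuristic weighting schemes described above can be sketched as follows. This is a minimal illustration rather than the code used in the workshop: the function names, the boundary symbol and the placeholder wordforms and analyses are all assumptions, and the sketch further assumes that the per-boundary penalty is simply added on top of the supervised weight.</p>

```python
from collections import Counter

DEFAULT_WEIGHT = 1.0  # weight of pairs not found in the annotated corpus
BOUNDARY = ">"        # hypothetical morpheme-boundary symbol in analyses


def estimate_weights(annotated_pairs):
    """Map each seen (wordform, analysis) pair to 1 - P(a|w).

    annotated_pairs: list of (wordform, analysis) tuples taken from a
    manually disambiguated corpus.
    """
    pair_counts = Counter(annotated_pairs)
    form_counts = Counter(wordform for wordform, _ in annotated_pairs)
    return {
        (wordform, analysis): 1.0 - count / form_counts[wordform]
        for (wordform, analysis), count in pair_counts.items()
    }


def analysis_weight(wordform, analysis, seen_weights, boundary=BOUNDARY):
    """Supervised (or default) weight plus 1 for every morpheme boundary."""
    base = seen_weights.get((wordform, analysis), DEFAULT_WEIGHT)
    return base + analysis.count(boundary)


# Placeholder data: "w1" was annotated three times as "ab", once as "a>b".
annotated = [("w1", "ab"), ("w1", "ab"), ("w1", "ab"), ("w1", "a>b")]
weights = estimate_weights(annotated)
print(analysis_weight("w1", "ab", weights))     # lexicalized, frequent
print(analysis_weight("w1", "a>b", weights))    # one boundary, rarer
print(analysis_weight("w2", "x>y>z", weights))  # unseen, two boundaries
```

<p>Because lower weights are preferred, the frequent lexicalized analysis is ranked first, and unseen analyses with many morpheme boundaries are ranked last.</p></div>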
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Machine Translation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Introduction</head><p>This chapter discusses neural machine translation (NMT) experiments for translation into, out of, and between polysynthetic languages. While polysynthetic and, more generally, morphologically complex languages are often considered to pose a greater challenge for machine translation research than languages with relatively simple morphology <ref type="bibr">(Oflazer and Durgar El-Kahlout, 2007;</ref> <ref type="bibr">Bojar et al., 2015)</ref>, this challenge is often entangled with the challenges of low-resource machine translation. What really causes this challenge? Is it the length and complexity of the word forms? The type-token ratio and data sparsity? A lack of sufficient training data or a need for more training data than morphologically simple languages? A matter of many evaluation metrics being ill-suited to morphologically complex languages? Some combination of all of these?</p><p>In this work, we take steps towards answering two relevant questions through experiments on machine translation between English, Inuktitut, and Yupik as well as Guaran&#237; and Spanish. First, can we untangle the influences of small data and morphological complexity on the challenge of modelling these languages? Second, can we make use of higher-resource languages in the same language family to improve machine translation of lower-resource languages? We examine the first through translation of Inuktitut using a new, larger, pre-release version of the Nunavut Hansard,<ref type="foot">foot_4</ref> as described in Sections 3.1.4, 4.2.1 and 4.3. 
We examine the second through a series of experiments on low-resource machine translation (described in Section 4.4); our most promising experiments incorporate Inuktitut data into the translation of Yupik data (Table <ref type="table">4</ref>.8).</p><p>We first discuss the data resources for machine translation, providing more detail about data size, preprocessing, and the like (Section 4.2). This is followed by descriptions of our machine translation experiments. Section 4.3.3 briefly covers challenges of machine translation evaluation for polysynthetic languages.</p><p>The main contributions of our machine translation work during this workshop are as follows.</p><p>&#8226; We achieved state-of-the-art performance on translation between Inuktitut and English (since surpassed by <ref type="bibr">Joanis et al. (2020)</ref>).</p><p>&#8226; With first access to the beta version 3.0 of the Nunavut Hansard <ref type="bibr">(Joanis et al., 2020)</ref>, we were able to provide feedback and best practices for preprocessing the dataset and contributed to knowledge about existing character and spelling variations in the dataset.</p><p>&#8226; We collected empirical evidence on several well-known but unresolved challenges, such as best practices in token segmentation for MT into and out of polysynthetic languages, as well as an examination of how to evaluate MT into polysynthetic languages.</p><p>&#8226; We successfully used multilingual neural machine translation methods to improve translation quality into low-resource languages using data from related languages. Notably, our "low-resource" languages were lower resource than much of the literature, and we produced improvements without the use of large monolingual corpora (which are unavailable for these languages and many other languages of interest). We observed these improvements across both n-gram-oriented and semantic-oriented metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Parallel Data Resources</head><p>Chapter 3 describes the general data resources used throughout the workshop. Here we provide a more in-depth look at the resources used for machine translation specifically, including some notes on preprocessing. The machine translation resources available to us ranged from moderate to extremely low resource, as shown in Table <ref type="table">4</ref>.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Inuktitut-English Data</head><p>As described in Section 3.1.4, there have been several releases of the Nunavut Hansard. The first, version 1.0, was released to the natural language processing community in <ref type="bibr">Martin et al. (2003)</ref>, and consisted of 3.4 million English tokens and 1.6 million Inuktitut tokens of parallel data. A subsequent update, version 1.1, corrected some errors in version 1.0 <ref type="bibr">(Martin et al., 2005)</ref>. Version 2.0 covered proceedings from 1999 through late 2007 (excluding 2003) with about 5.5 million English tokens and 2.6 million Inuktitut tokens <ref type="bibr">(Farley, 2008)</ref>.</p><p>For the purposes of this workshop, we received pre-release access to a beta version of the Nunavut Hansard Inuktitut-English parallel corpus version 3.0, which contains 17.3 million English tokens and 8.1 million Inuktitut tokens, a huge increase over the original data. We refer to this pre-release version as 3.0 or 3.0 beta. We use deduplicated development and test sets in our experiments. The final version 3.0 of the Nunavut Hansard Inuktitut-English parallel corpus is now available and is described in <ref type="bibr">Joanis et al. (2020)</ref>. Through our early access to this corpus, we provided feedback on the corpus and on preprocessing best practices, which have been incorporated into the data release.</p><p>The corpus contains 17.3 million English tokens and 8.1 million Inuktitut tokens, spanning 1999 to 2017, a major increase over the version 1.0 and 2.0 releases <ref type="bibr">(Martin et al., 2003</ref><ref type="bibr">, 2005;</ref> <ref type="bibr">Farley, 2008)</ref>. This is the largest corpus we had access to for this workshop, and is arguably no longer truly "low-resource" for machine translation research. 
It is, however, very domain-specific, and differs in domain from the other parallel corpora we use in our experiments.</p><p>As prior machine translation work performed translation on romanized Inuktitut <ref type="bibr">(Micher, 2018b)</ref>, we chose to do the same. We converted Inuktitut data from syllabics as follows: we first applied uniconv,<ref type="foot">foot_5</ref> then repaired errors (e.g., incorrectly handled accented French characters in the Inuktitut data) using iconv, then identified and corrected other characters using a hand-built preprocessing script (including treating word-internal apostrophes as non-breaking characters on the Inuktitut side of the data).<ref type="foot">3</ref> We ran standard preprocessing scripts from Moses <ref type="bibr">(Koehn et al., 2007)</ref>: punctuation normalization, tokenization, cleaning, and truecasing. We discuss subword segmentation in Section 4.3.</p><p>St. Lawrence Island Yupik, and then all data was punctuation-normalized, tokenized, cleaned, and truecased using standard Moses scripts <ref type="bibr">(Koehn et al., 2007)</ref> with English default settings.</p><p>For Central Alaskan Yup'ik, we had access to the full Bible. For consistency, we still used Luke for development and validation and used John for testing. The remainder of the data was used for training. For Central Alaskan Yup'ik, we normalized apostrophes and converted characters with certain diacritics that would otherwise be split by the Moses tokenizer. Both Central Alaskan Yup'ik and its corresponding English translations were punctuation-normalized, tokenized, cleaned, and truecased using standard Moses scripts <ref type="bibr">(Koehn et al., 2007)</ref> with English default settings. In the case of Central Alaskan Yup'ik, we performed tokenization without aggressive hyphen-splitting. 
<ref type="foot">5</ref> Table <ref type="table">4</ref>.1 shows the number of lines in the datasets; the Central Alaskan Yup'ik training data is more than 5 times larger than the St. Lawrence Island Yupik training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Guaran&#237;-Spanish Data</head><p>As with the Yupik datasets, we had verse-aligned parallel Bible data available in Spanish and Guaran&#237;. We used Luke for development and validation and used John for testing, with the remaining data used for training. Guaran&#237; data was first preprocessed with quotation and apostrophe normalization, along with the removal of paragraph and other symbols that were artifacts of the initial data creation. Guaran&#237; and Spanish data were then punctuation-normalized, tokenized, cleaned, and truecased using standard Moses scripts <ref type="bibr">(Koehn et al., 2007)</ref> with Spanish defaults.</p></div>
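<div xmlns="http://www.tei-c.org/ns/1.0"><p>The book-level split used for the Bible corpora (Luke for development and validation, John for testing, everything else for training) can be sketched as below. The record layout, the book codes "LUK" and "JHN" and the function name are hypothetical; the actual verse identifiers in the workshop data may look different.</p>

```python
def split_by_book(verses, dev_book="LUK", test_book="JHN"):
    """Split verse-aligned parallel data by Bible book.

    verses: iterable of (book_code, source_text, target_text) records.
    Returns a dict with "train", "dev" and "test" lists of sentence pairs.
    """
    splits = {"train": [], "dev": [], "test": []}
    for book, source, target in verses:
        if book == dev_book:
            key = "dev"
        elif book == test_book:
            key = "test"
        else:
            key = "train"
        splits[key].append((source, target))
    return splits


# Placeholder records; the texts stand in for aligned verses.
toy_verses = [
    ("MAT", "source verse 1", "target verse 1"),
    ("LUK", "source verse 2", "target verse 2"),
    ("JHN", "source verse 3", "target verse 3"),
    ("ACT", "source verse 4", "target verse 4"),
]
toy_splits = split_by_book(toy_verses)
print({name: len(pairs) for name, pairs in toy_splits.items()})
```

<p>Holding out whole books keeps the development and test verses disjoint from the training verses, and using the same two books for every language pair keeps the test sets comparable across corpora.</p></div>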
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Inuktitut Machine Translation Experiments</head><p>Our Inuktitut-English machine translation efforts were largely concerned with doing initial experiments on the pre-release version of the Nunavut Hansard parallel corpus. Being substantially larger than previous releases (and, to our knowledge, by far the largest aligned parallel corpus of a polysynthetic language to date), this corpus offered a unique opportunity to try contemporary NMT methods on Inuktitut, and to consider whether methods of segmentation like byte-pair encoding (BPE; <ref type="bibr">Sennrich et al., 2016)</ref> are sufficient to handle a language of this level of complexity.</p><p>In the experiments that follow, our baseline systems, that is, conventional Transformer <ref type="bibr">(Vaswani et al., 2017)</ref> NMT systems using BPE and standard hyperparameter settings, always outperformed the experimental systems (which included special segmentation procedures and multi-source attention). This suggests that contemporary methods are indeed adequate for processing Inuktitut, although we do not consider the case closed, as there are many interesting possibilities for principled segmentation that we have not yet explored.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Segmentation experiments</head><p>In this set of experiments, we contrast automatic segmentation (by byte-pair encoding) with more morphological segmentations based on human knowledge of Inuktitut morphology, and also consider a simple method of combining them. We perform our machine translation experiments contrasting these approaches in the Inuktitut-to-English direction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Byte-Pair Encoding</head><p>Byte-pair encoding (BPE; <ref type="bibr">Sennrich et al., 2016)</ref>, broadly the segmentation of text at the character level into larger chunks by compressing the text and using the resulting compression units as the word segmentation, has become a ubiquitous practice in current machine translation. While the units discovered are not guaranteed to correspond to morphemes as such, the resulting systems do end up working at a more morpheme-like level, with units larger than a character but smaller than a word. Table <ref type="table">4</ref>.2 shows the segmentation of several words according to four BPE vocabulary sizes. The Inuktitut loanword siipiisiikkut (meaning CBC or Canadian Broadcasting Corporation) is frequent enough in the corpus that at 30000 merges it is represented as a single token. The word qimirruvita (meaning are we looking at, as in the context Are we looking at trying to find out? or qimirruvita qaujimanittinnuk) can be split into the morpheme qimirru- (to scan, to inspect<ref type="foot">foot_9</ref>) and the verb ending -vita? (are we (3+) ...?<ref type="foot">foot_10</ref>); we see that here BPE successfully respects the morpheme boundary at all sizes, segmenting exactly and only along that boundary with a vocabulary of 30000. For utaqqivita (meaning are we waiting for?, as in the context What are we waiting for? or kisumik utaqqivita?), the story is somewhat different. Though the word contains the same suffix -vita? with the verb root utaqqi- (to wait<ref type="foot">foot_11</ref>), BPE does not segment the word along the expected morpheme boundaries; the only segmentation that respects them (500) appears to oversegment. In these examples, we are able to see clear morpheme splits in the surface form, but this is not always the case. 
In many cases, the underlying forms may undergo phonological changes at the boundaries where two morphemes meet, making it impossible to segment the word such that the resulting units have a uniform representation across all examples of that morpheme. One of our topics of investigation was whether this procedure alone would be sufficient to pre-process Inuktitut for machine translation, whether more sophisticated morphological processing would be necessary, or whether a combination of the two (morphological processing where possible, BPE for the rest) might prevail.</p></div>
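<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers unfamiliar with the procedure, the following is a minimal sketch of how BPE merges are learned from word counts and then applied to a word. It is a simplified re-implementation in the spirit of Sennrich et al. (2016), not the tool used in our experiments: it omits end-of-word markers, and the function names, the tie-breaking rule and the toy counts are assumptions made for illustration.</p>

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    merged = []
    i = 0
    while len(symbols) > i:
        if len(symbols) > i + 1 and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(word_counts, num_merges):
    """Greedily learn up to `num_merges` merges from {word: count}."""
    vocab = {tuple(word): count for word, count in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, count in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pair_counts[(left, right)] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself so that
        # the result is deterministic (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges


def segment(word, merges):
    """Apply learned merges, in order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy counts built from two words discussed in the text.
toy_counts = {"qimirruvita": 1, "utaqqivita": 1}
merges = learn_bpe(toy_counts, num_merges=5)
print(segment("utaqqivita", merges))
```

<p>Since merges are chosen purely by frequency, nothing in the procedure ties a unit to a morpheme boundary; whether a boundary such as the one before -vita is respected depends on the corpus statistics and on the number of merges.</p></div>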
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Morphological Analysis</head><p>The Nunavut Hansard version 1.1 was the starting point for morphological analysis of the larger, later-released corpus (version 3.0). As version 1.1 is a subset of the days of debate included in version 3.0, we made use of prior morphological processing of version 1.1 when possible (processing described in <ref type="bibr">Micher (2018a)</ref> and summarized here). Every word type of the version 1.1 corpus was processed with the Uqailaut analyzer <ref type="bibr">(Farley, 2009)</ref>, which provides morpheme segmentation and labeling (including deep representation and morphological category tags). About 70% of the corpus was analyzable by this tool. The remaining 30% was subsequently processed using a neural morphological analyzer, which is trained on a subset of the Uqailaut-processed data <ref type="bibr">(Micher, 2017)</ref>. After filtering out noise (concatenations of numbers and alphanumerics), we were left with approximately 413K processed word types from version 1.1 of the corpus.</p><p>We then extracted the word types from the larger corpus, using the same noise filtering script as with version 1.1 and omitting the word types that had been successfully processed already from version 1.1. We ended up with &#8764;1.14M additional types. From these, another &#8764;9K words were identified as English and removed, yielding &#8764;1.13M types to process. However, we note a few differences between these corpora, which affected the processing pipeline. First, the romanization scheme performed for version 1.1 of the Hansard is not identical to the romanization we performed on version 3.0 beta. In many cases, the resulting romanizations of words match, but in the cases that do not, the morphological analysis needed to be performed anew. For example, there are differences in romanization between Hansard versions (e.g. "lh" vs. "&amp;" for the lateral fricative) and between dialects (e.g. "s" vs. 
"h" for a particular phoneme); since Uqailaut presumes "&amp;" and "h", these are substituted before re-processing. After all of the pre-processing, we followed the same procedure as with version 1.1 of the corpus, first processing what the Uqailaut analyzer would process, and sending the remaining types through the neural morphological analyzer. In total, we have 1,548,500 types, processed through one or the other analyzer.</p><p>For our work during the workshop, however, we are training and evaluating using only the Uqailaut segmentations (that is to say, without using the neural parser), as the neural parses were not yet finished at the time of these experiments. We expect that the more complete analyses of the neural parser will have a more positive effect on downstream performance in future experiments.</p><p>In the following experiments, the morphologically processed text uses "deep" forms, in the sense of <ref type="bibr">Micher (2017)</ref>, rather than the surface forms. Since Uqailaut, and thus the neural generalization of it, only parse surface words into deep forms (and do not generate surface words from deep forms), we present our experiments with different segmentation approaches solely in the Inuktitut to English translation direction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System configuration</head><p>The model uses a 3-layer encoder, a 3-layer decoder, a model dimension of 512 and 2048 hidden units in the feed-forward networks. The network was optimized using Adam <ref type="bibr">(Kingma and Ba, 2014)</ref>, with an initial learning rate of 1e-4, decreasing by a factor of 0.7 each time the development set BLEU did not improve for 8000 updates, and stopping early when BLEU did not improve for 32,000 updates.</p><p>In addition to the most common automatic MT evaluation metric, BLEU<ref type="foot">foot_12</ref> <ref type="bibr">(Papineni et al., 2002)</ref>, we also evaluated our MT experiments using two recently proposed metrics, chrF<ref type="foot">10</ref> (Popovi&#263;, 2015) and YiSi <ref type="bibr">(Lo, 2019)</ref>, which were shown to correlate better with human judgments on translation quality in English by <ref type="bibr">Ma et al. (2019)</ref>.</p></div>
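<div xmlns="http://www.tei-c.org/ns/1.0"><p>The learning-rate and early-stopping schedule described above can be written out as a small state machine. The sketch below is one plausible reading of that description and not the toolkit's actual implementation: it assumes that "decreasing by a factor of 0.7" means multiplying the rate by 0.7, that the decay counter restarts after each decay, and that BLEU is checked at a fixed interval of updates. The class name and the 4000-update interval in the usage lines are hypothetical.</p>

```python
class PlateauSchedule:
    """Decay the learning rate on BLEU plateaus and signal early stopping."""

    def __init__(self, lr=1e-4, factor=0.7, decay_after=8000, stop_after=32000):
        self.lr = lr
        self.factor = factor
        self.decay_after = decay_after  # updates without improvement before decay
        self.stop_after = stop_after    # updates without improvement before stop
        self.best_bleu = float("-inf")
        self.since_best = 0             # updates since the best dev BLEU
        self.since_decay = 0            # updates since last improvement or decay

    def step(self, updates, dev_bleu):
        """Record a dev evaluation made `updates` updates after the last one.

        Returns True when training should stop.
        """
        if dev_bleu > self.best_bleu:
            self.best_bleu = dev_bleu
            self.since_best = 0
            self.since_decay = 0
        else:
            self.since_best += updates
            self.since_decay += updates
            if self.since_decay >= self.decay_after:
                self.lr *= self.factor
                self.since_decay = 0
        return self.since_best >= self.stop_after


# One improving evaluation followed by a plateau, checked every 4000 updates.
schedule = PlateauSchedule()
stop_flags = [schedule.step(4000, bleu) for bleu in [20.0] + [19.0] * 8]
print(schedule.lr, stop_flags)
```

<p>With these settings, a run that never improves again sees four decays (after 8000, 16000, 24000 and 32000 stale updates) before it is stopped.</p></div>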
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>We compared BPE of various vocabulary sizes to the morphological analysis described above. In Table <ref type="table">4</ref>.3, we observe that morphological analysis underperforms BPE across all metrics.</p><p>We think this is not due to a problem in the morphological analysis itself (e.g. identifying morphemes incorrectly), but rather that the process left unanalyzable words intact, whereas BPE manages to segment all words into more manageable pieces. We therefore also made a preliminary attempt to combine them, in hopes of combining some of the benefits of true morphological analysis with the statistical advantages of BPE. First, we took the output of morphological analysis (i.e., the input corpus to the "Morphological" system in Table <ref type="table">4</ref>.3), trained a new BPE model on it, and segmented it according to this model. Manual inspection of the results of this process suggests that morphemes identified in morphological analysis were typically left intact by BPE (that is to say, they were identified as units by BPE as well) and only unanalyzed words were further segmented.</p><p>This system also underperformed the BPE-only system, but only by small margins. We think that this avenue is still promising, as there are many possible ways to integrate BPE and morphology. Many questions remain:</p><p>&#8226; Does one resegment only the unanalyzed words, or all words?</p><p>&#8226; Does one train the BPE model on only unanalyzed words, or all words?</p><p>&#8226; Do we use surface morphemes or underlying morphemes?</p><p>&#8226; Do we rejoin underlying forms or keep them segmented?<ref type="foot">foot_14</ref></p><p>Also, as not all of the corpus was fully analyzed, further development of the neural analysis will probably lead to improvements downstream.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Single source and multi-source experiments</head><p>One experimental theme we pursued in this workshop was whether multi-source techniques <ref type="bibr">(Zoph and Knight, 2016;</ref> <ref type="bibr">Nishimura et al., 2018;</ref> <ref type="bibr">Libovick&#253; and Helcl, 2017)</ref>, typically used for training MT systems with multiple source languages, could be of value when applied to multiple representations of the input text, as a potential way to combine the benefits of two different kinds of analysis.</p><p>A recent result in multilingual machine translation <ref type="bibr">(Littell et al., 2019)</ref> suggested that it can be valuable, when training MT on a corpus that has undergone significant processing (in that case, machine translation of the original source into Russian), to attend to both the original text and its processed version. That is to say, "attention" in MT makes it possible to avoid having to choose between using the original text or a process that may have been helpful (or may have destroyed useful information); rather, we can allow the model to attend to the results of any stage in the pipeline, and learn for itself which representations to attend to the most. The above result concerned a pre-processing step that was itself machine translation; that is to say, this was a "pivot" system in which L1 is translated to L2, and L2 is translated into L3. We wondered whether the result might also apply to processing steps that were not machine translation. Would it, for example, be fruitful to attend to two different pre-processings: say, BPE and morphological, syllabics or romanized, etc.?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System configuration</head><p>The following experiments were performed using the architecture in <ref type="bibr">Littell et al. (2019)</ref>, a variant of Transformer <ref type="bibr">(Vaswani et al., 2017)</ref> with multi-source attention, implemented in the Sockeye framework <ref type="bibr">(Hieber et al., 2017)</ref> for machine translation.</p><p>The model uses two 3-layer encoders (one for each source type), a 3-layer decoder, a model dimension of 512 and 2048 hidden units in the feed-forward networks. The decoder attended to each encoder using "flat" attention (that is, attending to each and combining the result by simple addition, rather than an additional, hierarchical attention layer). The network was optimized using Adam (Kingma and Ba, 2014), with an initial learning rate of 1e-4, decreasing by a factor of 0.7 each time the development set BLEU did not improve for 8000 updates, and stopping early when BLEU did not improve for 32,000 updates.</p></div>
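<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of "flat" multi-source attention, attending to each encoder separately and adding the resulting context vectors, can be illustrated with a deliberately stripped-down sketch. It uses single-head scaled dot-product attention with no learned projections, which is an assumption made for brevity; the real architecture of Littell et al. (2019) is a full Transformer, and all names and numbers below are hypothetical.</p>

```python
import math


def attend(query, states):
    """Single-head scaled dot-product attention of one query over states.

    The encoder states serve as both keys and values (no projections).
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * s for q, s in zip(query, state)) / scale for state in states]
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]  # stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(states[0])
    return [sum(w * state[d] for w, state in zip(weights, states)) for d in range(dim)]


def flat_multi_source_context(query, encoder_a_states, encoder_b_states):
    """Attend to each encoder independently and add the two contexts."""
    context_a = attend(query, encoder_a_states)
    context_b = attend(query, encoder_b_states)
    return [a + b for a, b in zip(context_a, context_b)]


# Two toy "encoders", e.g. two different segmentations of one sentence.
states_a = [[1.0, 0.0], [3.0, 0.0]]
states_b = [[0.0, 4.0], [0.0, 6.0]]
context = flat_multi_source_context([0.0, 0.0], states_a, states_b)
print(context)
```

<p>A hierarchical variant would instead apply a second attention step over the two context vectors, letting the decoder weight one source above the other at each position.</p></div>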
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>As an initial sanity check, we performed two tests of the idea:</p><p>1. Source 1: BPE vocab size 5000; source 2: BPE vocab size 30000.</p><p>2. Source 1: Inuktitut in syllabics, BPE vocab size 5000; source 2: Inuktitut romanized, BPE vocab size 5000.</p><p>We did not expect these to show significant gains, but we wanted to make sure the systems did not experience a serious drop in scores. Unfortunately, Table <ref type="table">4</ref>  We believe this is because the multi-source system greatly increases the number of parameters without an associated increase in information in the corpus. If we compare this to the positive results in <ref type="bibr">Littell et al. (2019)</ref>, the difference is that there the introduction of a third language greatly increases the amount of information available to the system: it is not just another view of the same data. So, rather than continue exploring additional monolingual multi-source setups (e.g., BPE and morphology together), we instead moved on to the multilingual multi-source experiments detailed in Section 4.4.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Challenges in Evaluation of English-to-Inuktitut MT</head><p>For questions of segmentation, we primarily looked at the Inuktitut-to-English direction, since our morphological analyzer was only able to parse, rather than generate. (That is to say, while we could output segmented, underlying morphemes, we could not, at that time, rejoin them into fluent outputs.) For English-to-Inuktitut, we only looked at BPE-based systems, since these can trivially be de-segmented. In this translation direction, we focused on questions of evaluation because morphologically complex languages pose a challenge in terms of the choice of automatic evaluation metric.</p><p>BLEU <ref type="bibr">(Papineni et al., 2002)</ref> is a common metric for evaluation of machine translation output given reference translations. However, because BLEU score is (typically) computed at the word level, an error in a single morpheme is penalized just as harshly as a completely incorrect choice of terminology. This can be expected to have a particularly detrimental effect when evaluating translation output in morphologically complex languages; even if the system chooses the correct lemma, any errors of morphological inflection will be counted as whole-word errors, decreasing the count of correctly predicted n-grams. BLEU score could also be computed over byte-pair encodings rather than words, but this poses challenges when trying to compare systems built with different vocabularies.</p><p>chrF sidesteps the segmentation issue by first removing whitespace before counting character n-grams, and computes a precision/recall-balanced score over the character n-gram counts. On the other hand, YiSi-0 respects the word boundaries in the MT output but uses the character-level longest common substring accuracy to evaluate the word-level similarities and aggregates the word-level similarity scores into the sentence-level score. 
These two automatic evaluation metrics, being based on character-level information, should be more suitable for evaluating MT output in morphologically complex languages. In fact, <ref type="bibr">Ma et al. (2018)</ref> showed that chrF correlates best with human judgments in evaluating Finnish translation output and that YiSi-0 correlates best with human judgments in evaluating Turkish translation output. However, we think it important to point out that the complexity of Inuktitut morphology is higher than that of Finnish or Turkish and that there is no existing work on MT evaluation for polysynthetic languages. This remains an area for future work.</p></div>
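<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the contrast between word-level and character-level scoring concrete, the sketch below implements a simplified chrF-style score: whitespace is removed, character n-gram precision and recall are averaged over several n-gram orders, and the two are combined into an F-score. The maximum order of 6 and beta of 2 are assumed defaults, not necessarily the settings used in our evaluation, and the function names are hypothetical; reported scores should come from a reference implementation.</p>

```python
from collections import Counter


def char_ngrams(text, n):
    """Counts of character n-grams after removing spaces."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF: F-beta over averaged n-gram precision and recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # a string is shorter than n characters
        overlap = sum(min(count, ref[gram]) for gram, count in hyp.items())
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    precision = sum(precisions) / len(precisions)
    recall = sum(recalls) / len(recalls)
    if precision + recall == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / (beta ** 2 * precision + recall)


# Two words from the text that share only the ending -vita.
hypothesis, reference = "qimirruvita", "utaqqivita"
word_match = float(hypothesis == reference)  # what a word-level metric sees
char_score = chrf(hypothesis, reference)
print(word_match, char_score)
```

<p>A word-level comparison gives this pair no credit at all, whereas the character-level score rewards the shared ending, which is the behaviour that motivates character-based metrics for morphologically complex output.</p></div>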
<div xmlns="http://www.tei-c.org/ns/1.0"><head>System configuration</head><p>The English-to-Inuktitut MT system was built using the same architecture as that of the system mentioned in Section 4.3.1. We evaluated the system at both word-level and 5k BPE-vocabulary segmentation using BLEU,<ref type="foot">foot_15</ref> chrF,<ref type="foot">foot_16</ref> and YiSi-0. Since YiSi-0 is a weighted harmonic mean of precision and recall, we also dissected YiSi-0 into pure precision and recall for further analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Results</head><p>First and foremost, we must emphasize that MT system scores for different translation directions are not directly comparable. Thus, one should not conclude from Table <ref type="table">4</ref>.5 that translating Inuktitut into English is an easier task than translating in the opposite direction, or that the translation quality of a system in one direction is better than that in the other. Instead, we would like to point out that there is a notable difference in word-level BLEU scores between the systems in the two translation directions because BLEU penalizes a failure to correctly inflect a word form just as harshly as the choice of an entirely incorrect word; thus MT systems translating into morphologically complex languages are expected to achieve lower word-level BLEU scores. A large difference can also be seen in YiSi-0 scores when word segmentation is used in evaluation. However, the chrF score difference between the two translation directions is marginal.</p><p>When evaluating translation output at the subword level, both BLEU and chrF showed a wider score difference when the translation direction was flipped, whereas YiSi-0 showed a smaller difference. These contradictory results show that evaluating translation output in polysynthetic languages is itself a challenging and unsolved research problem.</p><p>Without human evaluation of translation output in polysynthetic languages, we cannot conclude whether the quality of the English-to-Inuktitut MT system is acceptable (or whether it is sufficient for some use cases but not others). We hope that future human evaluation of machine translation into polysynthetic languages will provide a basis for the examination of different evaluation approaches, allowing future researchers to select the most appropriate evaluation metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Low-Resource Experiments</head><p>In keeping with the theme of the workshop, our low-resource machine translation experiments involve neural systems rather than phrase-based ones, despite the fact that they are built from extremely small datasets. While we perform our experiments with fairly simple modern neural models and minimal hyperparameter tuning, recent work <ref type="bibr">(Sennrich and Zhang, 2019)</ref> suggests that careful tuning of hyperparameters can result in NMT systems outperforming statistical machine translation systems even on datasets of around 5000 sentences (comparable to our smaller datasets).</p><p>Most of the low-resource machine translation experiments were performed using Sockeye <ref type="bibr">(Hieber et al., 2017)</ref>, and the multi-source generalization of Sockeye introduced in <ref type="bibr">Littell et</ref>  We first compare RNN and Transformer translation models using BPE vocabularies of 5000 symbols. The size of 5000 was selected for consistency with other experiments and because it was among the highest-performing vocabulary sizes in initial RNN experiments for several language pairs (not reported here). The RNN models were trained using OpenNMT <ref type="bibr">(Klein et al., 2017)</ref> with default settings, and the Transformer models were trained using Sockeye <ref type="bibr">(Hieber et al., 2017)</ref> with a 3-layer encoder, a 3-layer decoder, batch size 2048, optimization toward perplexity, and the remaining parameters set to defaults. As Table <ref type="table">4</ref>.6 shows, the Transformer system outperformed the RNN system in all but one case (which was within 0.1 BLEU); we use the Transformer system for all remaining experiments.</p><p>We compare using a BPE vocabulary of 5000 symbols to using a whole-word vocabulary. In all cases, the BPE vocabulary outperforms the whole-word vocabulary (by between 0.9 and 7.4 BLEU points). Using whole words, English-St. 
Lawrence Island Yupik experiments were run with vocabulary sizes of 4787 and 26888 (respectively, including special characters), while English-Central Alaskan Yup'ik whole word vocabularies consisted of 13501 and 106736 types respectively. Given the small data sizes and large Yupik vocabulary sizes, it is unsurprising that BPE outperforms whole words; there may simply not be enough examples of many types in the long tail for the system to accurately translate them, and the word system includes a large number of out of vocabulary items in the test set.</p><p>Following the results of the Yupik experiments, we omit the RNN experiments for Guaran&#237;-Spanish and instead start with a baseline of a Transformer model (3 layer encoder, 3 layer decoder, batch size 2048, optimized toward perplexity, remaining parameters set to defaults), using separately learned BPE encodings for Spanish and Guaran&#237; with vocabularies of 5000 types each. There does exist other work on machine translation for Guaran&#237;-Spanish, notably an online gister<ref type="foot">foot_17</ref> and work in <ref type="bibr">Rudnick (2018)</ref>. Though Rudnick (2018) also performs experiments on Bible translation, we do not compare directly, as those results are measured on stemmed output.</p><p>For Guaran&#237;-Spanish, we also experiment with full-word vocabularies, FST-segmented vocabulary (Guaran&#237; side only; Spanish side BPE 5000), and an FST-segmented vocabulary with backoff to BPE (all Guaran&#237; words left unsegmented by the FST were segmented by a BPE model learned for a BPE 5000 vocabulary on Guaran&#237;; Spanish side BPE 5000). As shown in Table <ref type="table">4</ref>.6, the baseline BPE model outperforms all other experiments.<ref type="foot">foot_18</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.2">Yupik Language Experiments</head><p>Our Yupik language experiments begin with baseline RNN and Transformer models. Finding that the Transformer strongly outperforms the RNN (Table <ref type="table">4</ref>.6), we perform the remainder of the experiments with the Transformer architecture only.</p><p>In addition to the baseline, we perform two experiments: multi-source experiments on a multi-parallel subset of the data and multilingual NMT system experiments. BPE vocabularies of size 5000 were learned separately on each language's training data using subword-nmt <ref type="bibr">(Sennrich et al., 2016)</ref>. Our most promising low-resource experiments, described in Section 4.4.2, involve the use of higher-resource languages from the same language family to build multilingual neural machine translation systems, which can then be fine-tuned for specific low-resource languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multisource</head><p>In order to experiment with multisource machine translation, we build a multiparallel verse-aligned corpus from the intersection of all available Yupik Bible data. The resulting New Testament corpus has 5449 lines for training, 1091 lines for development and validation, and 874 lines for testing. It contains data in St. Lawrence Island Yupik and Central Alaskan Yup'ik, as well as data from two English Bibles. We call the English Bibles eng ess (for the English Bible originally aligned to St. Lawrence Island Yupik) and eng esu (for the English Bible originally aligned to Central Alaskan Yup'ik). We preprocessed them identically to the baseline experiments, with one change: we removed verse numbers from Central Alaskan Yup'ik and its corresponding English (eng esu ) as those were not present in the St. Lawrence Island Yupik corpus.</p><p>We compared single-source (Sing.) and multi-source (Mult.) approaches, as described in &#167;4.3.2, as well as separately learned and jointly learned 5000 symbol BPE representations (the joint BPE representations were learned across all 4 sides of the multiparallel corpus). For the multi-source experiments, we tried translating into Central Alaskan Yup'ik using its corresponding English and St. Lawrence Island Yupik, as well as translating into St. Lawrence Island Yupik using its corresponding English and Central Alaskan Yup'ik. Without any major parameter search, we found that the joint BPE single-source systems performed the best.</p><p>As these BLEU scores are extremely low, it is quite difficult to draw any conclusions from this set of experiments; the following notes should be understood in that context. We do observe that for single-source, using a jointly trained BPE vocabulary performs better than separately trained BPE vocabularies. This may be due in part to improved translation of copied terms (e.g., names). 
We do not observe the same consistency in multisource. Perhaps unintuitively, in single-source experiments, we find that swapping the English Bibles (translating eng esu into ess and eng ess into esu) performs better than the "correct" pairs. This highlights several challenges of performing machine translation using Bible corpora: we do not have a guarantee in our case that the "source" English Bible is the version from which the Yupik Bibles were translated, Bible translations may rely on metaphor or other non-literal phrases, and verse alignment provides additional challenges due to mismatches between sentence and verse boundaries. In some cases, we observe that a sentence spans more than one verse, with a name appearing in the first verse in English and in the second verse in Yupik or vice versa, an impossible challenge for machine translation without extrasentential context to overcome; this is a known challenge in parallel Bible corpora <ref type="bibr">(Mayer and Cysouw, 2014)</ref>. We also did not perform hyperparameter optimization due to time constraints; more extensively tuned models may show different results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multilingual</head><p>Multilingual neural machine translation has been proposed as a means of improving neural machine translation of low-resource languages, using a variety of distinct approaches. These approaches can be divided into those that translate into low-resource languages and those that translate out of them. <ref type="bibr">Neubig and Hu (2018)</ref> explore the multilingual translation task of translating from multiple low-resource languages into a single high-resource language. <ref type="bibr">Gu et al. (2018)</ref> also work in the same translation direction, and incorporate large amounts of monolingual data and many closely related source languages.</p><p>Our interest is in translation into low-resource languages. In that direction, <ref type="bibr">Ha et al. (2016)</ref> perform multilingual neural machine translation by tagging each subword with a language-specific tag, and then building a system based on available training data. <ref type="bibr">Johnson et al. (2017)</ref> use a single special token at the beginning of input sentences to indicate the desired target language to translate into. <ref type="bibr">Rikters et al. (2018)</ref> follow this approach to do multilingual translation into and out of morphologically rich languages, though their low-resource setting consists of more than 3 million sentence pairs.</p><p>St. Lawrence Island Yupik, Central Alaskan Yup'ik, and Inuktitut belong to the same language family. Despite this, they have very limited vocabulary overlap in our parallel data (less than 1% type overlap between Inuktitut and Yupik, and less than 3% type overlap between St. Lawrence Island Yupik and Central Alaskan Yup'ik). This is certainly due in part to the different domains we had available: legislative text (Inuktitut) and Bible (Yupik). 
As described in Section 4.2.2 and Section 4.2.1, our data spans a wide range in terms of size, from approximately 5000 lines of text to approximately 1.3 million lines. We approximately follow the approach of <ref type="bibr">Johnson et al. (2017)</ref> when translating from English into Inuktitut and the Yupik languages.</p>
<table><head>Table <ref type="table">4</ref>.8: BLEU scores for experiments on multilingual neural machine translation. The baseline is the original Transformer baseline for each language pair. Multilingual is the single multilingual system (trained on Inuktitut and Yupik data), and the remaining two columns show that system fine-tuned on a particular variety of Yupik.</head>
<row role="label"><cell></cell><cell>Baseline</cell><cell>Multilingual</cell><cell>ess-Ad. Multi.</cell><cell>esu-Ad. Multi.</cell></row>
<row><cell>eng-ess</cell><cell>4.4</cell><cell>5.8</cell><cell>6.5</cell><cell>1.3</cell></row>
<row><cell>eng-esu</cell><cell>5.3</cell><cell>5.7</cell><cell>1.9</cell><cell>6.0</cell></row></table>
<table><head>Table <ref type="table">4</ref>.9: YiSi-1 scores (higher is better) computed using ess or esu BPE 5000 embeddings built by word2vec <ref type="bibr">(Mikolov et al., 2013)</ref> for experiments on multilingual neural machine translation. The baseline is the original Transformer baseline for each language pair. Multilingual is the single multilingual system (trained on Inuktitut and Yupik data), and the remaining two columns show that system fine-tuned on a particular variety of Yupik.</head>
<row role="label"><cell></cell><cell>Baseline</cell><cell>Multilingual</cell><cell>ess-Ad. Multi.</cell><cell>esu-Ad. Multi.</cell></row>
<row><cell>eng-ess</cell><cell>26.9</cell><cell>28.0</cell><cell>30.1</cell><cell>10.5</cell></row>
<row><cell>eng-esu</cell><cell>31.0</cell><cell>32.5</cell><cell>16.7</cell><cell>33.2</cell></row></table>
<p>We train joint BPE (vocabulary 5000) on Inuktitut, St. Lawrence Island Yupik, and Central Alaskan Yup'ik, downsampling the Inuktitut and upsampling St. Lawrence Island Yupik to match the size of Central Alaskan Yup'ik. We prepend a language tag (e.g. "&lt;ess&gt;") to each source and target sentence in the three sub-corpora. Next we train a Transformer model (our "multilingual baseline") on the concatenation of all available training data (with no sampling, 3 layer encoder, 3 layer decoder, 512 embedding size, early stopping on perplexity of the concatenated development data). For St. 
Lawrence Island Yupik and Central Alaskan Yup'ik, we then fine-tune the multilingual baseline on all language-specific training data (with early stopping based on perplexity on the language-specific development data). The BLEU score results are shown in Table <ref type="table">4</ref>.8. Table <ref type="table">4</ref>.9 reports YiSi results, which follow the same trend as the BLEU score results. As expected, fine-tuning on language specific data boosts performance on that particular language (while the output on the other language appears to exhibit catastrophic forgetting <ref type="bibr">(Kirkpatrick et al., 2017)</ref>), giving us our best performance. However, with BLEU scores in the single digits, it is clear that there is still a long way to go before the MT output may be genuinely useful (e.g. in post-editing or interactive translation) for these low-resource languages.</p><p>Chapter 5</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Language Modelling</head><p>In this chapter, we report on language modelling experiments, comparing different tokenization strategies for polysynthetic languages. We trained a state-of-the-art RNN language model using the character, BPE, Morfessor and FST as the unit for segmenting text data. In order to facilitate comparisons across the tokenization strategies, we carefully selected datasets for two experimental settings: 1) A setting where all the data available for a language is used and 2) a setting where only the New Testament in a language is used. The former setting provides us an opportunity to utilize all the data we have in a language while the latter allows us to draw a more precise comparison across languages. We use the average perplexity per character or the character-level perplexity as a metric to compare different models. The results show that the linguistically-oriented, FST segmentation strategy performed the best in modelling polysynthetic languages when it was available. In addition, difficulty of modelling different languages is compared using the average perplexity per word or the word-level perplexity. The potential of FST in aiding language modelling of polysynthetic languages and implications on comparing models for different languages are discussed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Data Preparation</head><p>After much consideration, we selected four low-resource, polysynthetic languages for our language modelling experiments (hereafter referred to by ISO 639-3 code): St. Lawrence Island Yupik (ess), Central Alaskan Yup'ik (esu), Inuktitut (iku) and Guaran&#237; (grn). These languages were chosen because we had the most available text data in them. We had at least the Bible, the Gospel books in New Testament in particular, in these languages, and that allowed us to have a commonality among the datasets to facilitate comparison across the languages. In addition to the polysynthetic languages, we included two well-researched, non-polysynthetic languages: English (eng) and Spanish (spa). The eng and spa data were included to provide comparison between polysynthetic languages and non-polysynthetic languages as esu and eng and grn and spa were parallel translations.</p><p>We designed two experimental settings to fully utilize available data while ensuring comparability across different languages. As for the 1) all data setting, we included any available monolingual data in a given language, including but not limited to the New Testament. The second setting, the 2) New Testament only setting, focused only on the New Testament data in order to further ensure comparability given the near-parallel data across different languages. Regardless of the settings, Luke was used as the development set and John as the test set to further facilitate fair comparison as we had the Gospel books in all languages. This ensured that different languages  shared a development set and a test set and a part of the train set (the rest of the New Testament) in common even though the exact train set available in each language may differ from one another. The train set in the 1) all data setting included the New Testament, but may also include the Old Testament, transcripts and oral narratives if available. 
This setting, therefore, fully utilizes the data we had in each language. In the 2) New Testament setting, the development and test sets stayed the same, but the train set included the rest of the New Testament only. It should be noted that we did not align the Bibles at the sentence level, and there was some variability among different Bible translations as discussed in Chapter 4. However, the esu and eng Bible translations and the grn and spa Bible translations were assumed to be parallel, and we assume that the other Bible translations provide comparable texts with similar intensions overall. While the 2) New Testament only setting may provide a more precise comparison, the 1) all data setting may be more representative of reality, given the limited size of the data for the former setting. Table <ref type="table">5</ref>.1 summarizes the two experimental settings and the dataset split.</p><p>Given the data split, we preprocessed the datasets systematically to further ensure comparability among subsequent language models. We removed redundant, bracketed texts when applicable, and normalized apostrophes, as they were meaningful in some languages and should not be tokenized separately from their surrounding words. Then, we normalized the punctuation and tokenized the texts using Moses scripts <ref type="bibr">(Koehn et al., 2007)</ref> with default settings. The overall preprocessing for the language modelling experiments resembles that for the machine translation experiments discussed in Chapter 4, except that we did not truecase the data for the language modelling experiments.</p><p>Tables 5.2 and 5.3 summarize descriptive statistics of the preprocessed data under each setting. Overall, it seems that the characteristics of a language as captured by the statistics are quite similar under the two settings. This may not be surprising given that the two settings concern very similar domains. 
While it remains to be seen whether these descriptive statistics would be similar under a different setting, we observed the following for the languages given our datasets. As discussed in Chapter 3, the languages seem to differ in TTR and in mean distance to the next unseen word. ess, esu and iku consistently show a higher TTR and a lower mean distance to the next unseen word than grn. While grn is considered a polysynthetic language, it seems to differ somewhat from the other polysynthetic languages (ess, esu, iku), which all belong to the same language family. Still, grn is distinct from spa and eng in that it has a higher TTR and a lower mean distance to the next unseen word. While the spa data seems more complex under the New Testament setting, eng and spa are consistently simpler than the polysynthetic languages in terms of TTR and mean distance to the next unseen word.</p><p>Across languages, the datasets are similar in terms of sentence counts within each experimental setting. While esu-eng and grn-spa differed slightly in terms of the exact sentence count, they are aligned at the verse level. The rest of the data are not aligned at the verse level, but they seem to contain a similar number of sentences under the respective data conditions. Note that we did not include the Hansard data for iku: including it would increase the amount of available data and the genre variability for that language too much to allow comparison across languages.</p><p>Given the similar number of sentences present in each dataset, it is noteworthy that the word count and type count differ across the languages. Again, ess, esu and iku seem similar to each other in that they have a smaller number of words and a larger number of types than the others. This reflects their typological characteristics: they tend to have longer words with more morphemes, which may lead to more unique tokens. 
grn still seems distinct from the other polysynthetic languages in that its datasets tend to have more words and fewer unique words. In fact, grn resembles spa in terms of the descriptive statistics, even though grn still has a lower mean distance to the next unseen word than spa. eng seems to be clearly more analytic than the other languages, as it has a higher word count and a lower type count. </p></div>
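<div xmlns="http://www.tei-c.org/ns/1.0"><p>The type-token ratio (TTR) used in the comparison above is the number of distinct word types divided by the number of word tokens. The following Python sketch shows the computation under the assumption that the text has already been tokenized into a list of words; the function name is chosen for this example only, and the mean distance to the next unseen word is not sketched here because its definition is given in Chapter 3.</p><p>
```python
def type_token_ratio(tokens):
    """Number of distinct types divided by number of tokens (0.0 if empty)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Toy English example: "the" occurs twice, so there are 4 types in 5 tokens.
example = "the dog saw the dogs".split()
ratio = type_token_ratio(example)
```
</p><p>In this toy example the ratio is 0.8. A language whose words pack more morphemes tends to repeat exact word forms less often, which pushes the ratio toward 1.</p></div>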
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Tokenization strategies</head><p>We considered five different tokenization strategies in modelling the languages: word, character, BPE, Morfessor and FST segmentation. In what follows, we briefly explain each tokenization strategy and why it might be helpful in segmenting polysynthetic languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.1">Word</head><p>A common tokenization strategy is to tokenize text by whitespace or by words. While it may be simple and seem intuitive, this tokenization strategy faces data sparsity and out-of-vocabulary (OOV) issues. For example, if we tokenize by words, dog and dogs will count as two separate tokens even though there is much shared information between the two. If the train set includes only the singular form and the test set contains only the plural form, the plural form in the test set will be considered as OOV.</p><p>(3) aghnaaguq aghnagh -&#8764;:(ng)u -&#8764; f (g/t)u--q woman -to.be -INTR.IND -3SG</p><p>'she is a woman' <ref type="bibr">(Jacobson, 2001, p.25-26)</ref> This tokenization method is particularly problematic for polysynthetic languages given their rich morphology. A word in polysynthetic languages may contain several morphemes to express a sentence-like intension. For example, a word in ess, aghnaaguq, consists of four morphemes and is translated as 'She is a woman' as shown in Example (3). Importantly, this results in a high rate of hapax legomena (words appearing only once), which results in much higher OOV rates than observed in most non-polysynthetic languages. In modelling polysynthetic languages, the word-level tokenization is too unrealistic to be useful in predicting the next word, and its performance may be over-estimated or under-estimated depending on how we reward or penalize OOVs. For example, if we do not penalize a model for predicting an OOV symbol for the next word, it may predict an OOV symbol repeatedly for a polysynthetic language to falsely record a good performance. If we do want to penalize OOV, we will have to come up with a metric that does that fairly given our data. Given that the model we adapted did not penalize OOV, we opted to use language models that would not over-generate OOVs.</p></div>
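<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sparsity problem described above can be made concrete with two simple statistics: the share of test tokens that never occur in the train set (the OOV rate) and the share of word types that occur exactly once (the hapax rate). The Python sketch below is illustrative only; the function names are assumptions for this example, and the input is assumed to be whitespace-tokenized text.</p><p>
```python
from collections import Counter


def oov_rate(train_tokens, test_tokens):
    """Fraction of test tokens whose word form never appears in the train set."""
    if not test_tokens:
        return 0.0
    vocab = set(train_tokens)
    unseen = sum(1 for token in test_tokens if token not in vocab)
    return unseen / len(test_tokens)


def hapax_rate(tokens):
    """Fraction of word types that occur exactly once (hapax legomena)."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    hapaxes = sum(1 for count in counts.values() if count == 1)
    return hapaxes / len(counts)


# Toy example: the train set has only singular "dog", the test set only "dogs".
train = ["the", "dog", "barks"]
test = ["the", "dogs", "bark"]
rate = oov_rate(train, test)
```
</p><p>In the toy example, two of the three test tokens are OOV even though both are closely related to forms seen in training. With word-level tokenization of a polysynthetic language, this effect applies to a much larger share of the test set.</p></div>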
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.2">Character</head><p>One possible solution to such issues of word-level tokenization is to tokenize text by character. Character-level tokenization rarely has OOV issues, because a text typically consists of a finite set of characters regardless of its morphological complexity. However, this tokenization method, again, cannot fully utilize the linguistic information present in a text, as it reduces all words to sequences drawn from a finite set of characters. The relationship between dog and dogs may be easily captured by a character-level model, but words with more complex morphology like Example (3) may be hard to model using the character as the tokenization unit.</p><p>While we report our results for character-level models as the baseline against which to compare other results, we note that character-level models may not be meaningful for downstream applications for polysynthetic languages such as keyboard prediction: predicting one character at a time when a word consists of several morphemes and a long sequence of characters may be too slow or too low-quality.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.3">BPE</head><p>If word-level tokenization is too coarse-grained and character-level tokenization is too fine-grained, we may need subword units to segment our data. As discussed in Section 4.3.1, byte pair encoding (BPE; <ref type="bibr">Sennrich et al., 2016</ref>) is an unsupervised segmentation method that uses subword units. Originally a data compression algorithm <ref type="bibr">(Gage, 1994)</ref>, BPE has become one of the standard techniques in neural machine translation since <ref type="bibr">Sennrich et al. (2016)</ref>. BPE starts from individual characters and repeatedly merges the most frequent pair of adjacent symbols until a fixed vocabulary size, chosen as a hyperparameter, is reached, so that the text is represented compactly by the resulting subword units. BPE segmentation may look like morpheme segmentation for some words, but it is data-driven rather than based on linguistic information. For example, given enough support in the data, BPE may segment 'lower' as 'low@@ er' (@@ marks a within-word segment boundary), which may seem linguistically motivated, but it is also possible to get different segmentations such as 'l@@ ow@@ er' with different hyperparameters and different data conditions. Refer to Table <ref type="table">4</ref>.2 for examples of BPE segmentations for machine translation of iku, some of which respect morphological boundaries and some of which do not.</p><p>We trained a BPE model on the training data and applied the model to all data using subword-nmt<ref type="foot">foot_19</ref> . We experimented with different vocabulary sizes for BPE segmentation, and report results on two vocabulary sizes: 500 and 5,000. While BPE provides an off-the-shelf method to segment words into subword units, it remains unclear whether the unsupervised method would prove useful in modelling polysynthetic languages.</p></div>
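<div xmlns="http://www.tei-c.org/ns/1.0"><p>The data-driven nature of BPE can be seen in a small Python sketch of the merge-learning loop. This is a simplified illustration and not the subword-nmt implementation: the function names are assumptions for this example, no end-of-word marker is used, ties between equally frequent pairs are broken by first occurrence, and learned merges are applied to new words in the order in which they were learned.</p><p>
```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of pair in symbols with a merged symbol."""
    merged, i = [], 0
    while len(symbols) > i:
        if symbols[i:i + 2] == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe(words, num_merges):
    """Learn up to num_merges merge operations from a list of word tokens."""
    # Each word type starts as a tuple of characters, weighted by frequency.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        merged_vocab = Counter()
        for symbols, freq in vocab.items():
            merged_vocab[merge_pair(symbols, best)] += freq
        vocab = merged_vocab
    return merges


def segment(word, merges):
    """Segment a word with learned merges, marking boundaries with '@@ '."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return "@@ ".join(symbols)
```
</p><p>Because the merges depend only on pair frequencies, the same word can be segmented differently when the training data or the number of merges changes, which is the behaviour described for 'lower' above.</p></div>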
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.4">Morfessor</head><p>We adopted another unsupervised segmentation method, Morfessor, to compare with BPE. Morfessor is a tool for unsupervised (and semi-supervised) morphological segmentation and has been used in speech recognition, MT, and speech retrieval. While there is no literature on its effectiveness in neural language modelling for polysynthetic languages, it has been reported to be useful in modelling languages with rich morphology such as Finnish, Estonian, German and Turkish <ref type="bibr">(Smit et al., 2014)</ref>. Morfessor uses maximum a posteriori (MAP) estimation to approximate morpheme segmentation, assuming that a word consists of one or more "morphs", yet its results may differ from a linguistically motivated morpheme segmentation. We used Morfessor 2.0 with the default settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.5">FST segmentation</head><p>The last segmentation strategy we considered was segmentation based on FSTs. FST segmentation is knowledge-based and rule-based, drawing on linguistic knowledge and analysis. Several FST-based morphological analyzers or morphological segmenters have been developed for polysynthetic languages, and we were able to experiment with two of them: ess (Chen and Schwartz, 2018) and grn <ref type="bibr">(Kuznetsova and Tyers, 2019)</ref>. The FST-based morphological analyzers produce zero or more morphological analyses for any given word. When more than one analysis was available for a word, we used heuristics (e.g. choose the shortest analysis) to select one analysis with which to segment the word. When no analysis was available, we used character (character backoff) or BPE (BPE backoff) segmentation for the word. The BPE backoff was performed using the existing BPE segmentations with vocabulary sizes of 500 and 5,000. While we were able to obtain these segmentation results for only two polysynthetic languages, they provide a point of comparison between supervised, linguistically motivated segmentation and unsupervised, data-driven segmentation.</p></div>
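<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection and backoff logic described above can be sketched in Python as follows. The analyzer interface is hypothetical: it is assumed to be a function that maps a word to a list of candidate analyses, each a list of morphemes, which is not the actual interface of the ess or grn analyzers. "Shortest" is taken here to mean the analysis with the fewest morphemes, which is also an assumption, and the toy lexicon is English for readability.</p><p>
```python
def segment_with_backoff(word, analyze, backoff=list):
    """Segment word with an analyzer, falling back when it yields no analysis.

    analyze: hypothetical callable, word -> list of analyses (lists of morphemes).
    backoff: callable used for unanalyzed words; list gives character backoff,
             and a BPE segmentation function could be passed instead.
    """
    analyses = analyze(word)
    if analyses:
        # Heuristic from the text: prefer the shortest analysis, interpreted
        # here as the one with the fewest morphemes.
        return min(analyses, key=len)
    return backoff(word)


# Toy stand-in for an FST analyzer: a lookup table with two competing analyses.
TOY_ANALYSES = {"dogs": [["d", "og", "s"], ["dog", "s"]]}


def toy_analyze(word):
    return TOY_ANALYSES.get(word, [])


covered = segment_with_backoff("dogs", toy_analyze)
uncovered = segment_with_backoff("cats", toy_analyze)
```
</p><p>Here the covered word is segmented as dog + s, while the uncovered word falls back to its individual characters. Passing a BPE segmentation function as the backoff argument gives the BPE backoff variant.</p></div>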
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">RNN-LSTM</head><p>We used a state-of-the-art language model <ref type="bibr">(Merity et al., 2017</ref><ref type="bibr">, 2018)</ref>  models and the hyperparameters for character-level enwik8 for character and BPE models. Table <ref type="table">5</ref>.4 summarizes the hyperparameters.</p><p>We acknowledge that none of these models (nor any other models to our knowledge) have been specifically designed to model polysynthetic languages or reported to have been used to model polysynthetic languages. In the absence of a language model designed for polysynthetic languages, we chose a state-of-the-art model that has proven competitive in modelling English instead.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Character-level perplexity</head><p>Perplexity is a measure of language modelling difficulty, calculated by exponentiating the average negative log-likelihood per token. Because perplexity in this form depends on the tokenization strategy, we calculate the character-level perplexity for each model to allow comparison among them. We define the character-level perplexity as the exponential of the average negative log-likelihood per character. We calculate it by multiplying the average token-level loss for a given tokenization by the number of tokens in the test set, which gives the total loss, dividing that total by the number of characters in the test set, and exponentiating the result. We count whitespace and the end-of-sentence symbol as separate tokens. This ensures a fair comparison among different tokenization strategies. The choice of the character as the common denominator is arbitrary; another unit, such as the word, could be used instead. Refer to <ref type="bibr">Mielke (2019)</ref>  </p></div>
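<div xmlns="http://www.tei-c.org/ns/1.0"><p>The renormalization from token-level loss to character-level perplexity can be written as a short Python sketch. The function and argument names are chosen for this example, and the loss is assumed to be an average negative log-likelihood in nats (natural logarithm), as is typical for cross-entropy losses reported by neural language modelling toolkits.</p><p>
```python
import math


def char_level_perplexity(avg_token_nll, num_tokens, num_chars):
    """Convert an average per-token loss into a per-character perplexity.

    avg_token_nll: average negative log-likelihood per token, in nats.
    num_tokens:    number of tokens in the test set under this tokenization.
    num_chars:     number of characters in the same test set, counted
                   identically for every tokenization being compared.
    """
    total_nll = avg_token_nll * num_tokens  # total loss over the test set
    return math.exp(total_nll / num_chars)  # spread the loss over characters
```
</p><p>For instance, a token-level perplexity of 8 over 10 tokens that together span 30 characters corresponds to a character-level perplexity of about 2 (the cube root of 8). Because the total loss and the character count do not depend on how the text was split, models with different tokenizations become comparable.</p></div>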
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Results &amp; Discussion</head><p>Tables 5.5 and 5.6 summarize the language modelling experiment results excluding FST segmentation for the 1) all data setting and 2) New Testament only setting, respectively. The results suggest that the character and Morfessor models might work better than BPE models for polysynthetic languages. In the 1) all data setting, tokenization by character resulted in the best performance in modelling ess and esu, while Morfessor models performed the best for iku and grn. BPE models with a vocabulary size of 5k worked the best for eng and spa. The same trend was observed in the 2) New Testament setting for ess, esu and iku: character models performed the best for ess and esu, while Morfessor led to the lowest perplexity for iku. However, in the 2) New Testament setting the character-level model resulted in the lowest character-level perplexity for grn, while the Morfessor model was the best for eng and spa. While it is unclear why a certain tokenization method worked better for a given language, we speculate that BPE might not be well suited to segmenting polysynthetic languages given their morphological richness. A word in a polysynthetic language might consist of several morphemes that are not immediately retrievable from the surface form. As shown in Example (3), a word in ess may contain a root, a derivational suffix and inflexional suffixes, which may look different in the surface form depending on the morphophonological rules that apply to the suffixation. For example, the derivational suffix (-&#8764;:(ng)u) in Example (3) has two morphophonological symbols (&#8764; and :), the latter of which applies to delete the gh ending of the root (for details see <ref type="bibr">Jacobson, 2001)</ref>. 
Given such characteristics of polysynthetic languages, the fact that character models worked the best for ess and esu might mean that those languages were hard to segment with unsupervised segmentation methods like Morfessor and BPE. Segmenting those languages might require getting at the underlying form with linguistically motivated segmentation rather than segmenting the surface form only.</p><p>Even though Morfessor models worked the best for iku under both settings and for grn under the 1) all data setting, the difference between the Morfessor models and character models is quite small. It should be noted that the hyperparameters for Morfessor and BPE operations were not optimized. While the BPE models with the two hyperparameters (V=500 and V=5k) did not result in the best model for any of the polysynthetic languages, it is possible that different hyperparameters might result in better (or worse) perplexity measures. On a similar note, different datasets in a language might work differently with Morfessor tokenization: the Morfessor segmentation was the best in modelling spa under the 2) New Testament only setting, but it was the very worst under the 1) all data setting. As a way to utilize rich morphology in modelling polysynthetic languages, we trained FST-based models for ess and grn. Table <ref type="table">5</ref>.7 summarizes the character-level perplexity values for all tokenization methods including FST segmentation only and FST segmentation with character or BPE backoff strategy for ess and grn. For all settings, FST-based segmentation resulted in the best model for the two languages. The clear difference between FST-based models and non-FST-based models suggests that the Morfessor and BPE models failed to capture the morphological information present in the data.</p><p>The fact that FST segmentation alone worked the best for ess might suggest that the FST segmentation for that language was more robust than the one for grn. 
Indeed, the FST segmentation only resulted in high perplexity in modelling grn under the 2) New Testament setting. With the BPE and character backoff, grn FST models still worked the best, but it is speculated that the FST morphological segmentation alone for grn might not have been reliable or the coverage of the FST was not as good as the ess FST.</p><p>After comparing different tokenization methods per language, we compared different languages to see which language is easier or harder to model. This line of inquiry has been pursued by several recent studies <ref type="bibr">(Cotterell et al., 2018;</ref><ref type="bibr">Mielke et al., 2019;</ref><ref type="bibr">Gerz et al., 2018)</ref>, where various languages are modeled using a state-of-the-art neural language model to compare relative difficulty of modelling a language with particular linguistic features. It should be noted that our data per language were not parallel so the comparison has to be drawn with caution. However, we still attempted the comparison here as comparing our models may provide insights for future studies given that we used the same or very similar RNN language models as the previous literature and that polysynthetic languages have not been discussed in this line of inquiry. If we compared the character-level perplexity, Table <ref type="table">5</ref>.5 and Table <ref type="table">5</ref>.6 show that iku was the easiest to model under the 1) all data setting and eng under the 2) New Testament setting. However, character-level perplexity may not be the right metric to use to compare different languages. The problem with the character-level measure is that it does not tell us much about real-life applications, where the difficulty of predicting an entire word might be more meaningful. More importantly, the character-level perplexity underestimates the difficulty of modelling polysynthetic languages as they tend to have longer, morphologically complex words. 
In fact, when we look at the word-level perplexity, the differences between polysynthetic languages and others become clearer. Table <ref type="table">5</ref>.9 and Table <ref type="table">5</ref>.10 show the word-level perplexity measures for the two experimental settings. When considering the difficulty of predicting the next word rather than the next character, iku is no longer the easiest to model under any condition. The word-level measure clearly shows that eng, followed by spa, was the easiest to model. Comparisons of the word-level perplexity values suggest that ess, esu and iku are similarly hard to model, while grn is less difficult even though it is still considerably harder to model than languages like eng and spa. This observation agrees with our previous observation about Of course, it might be unrealistic to expect that a model for polysynthetic languages would result in a word-level perplexity comparable to that for eng given the linguistic difference. Polysynthetic languages tend to have longer and more diverse word forms because of their richer morphology. Therefore, they are likely to be harder to model than other languages. However, comparing the character-level perplexity only may result in mistakenly arguing that iku is easier to model than eng.</p><p>While the relative performance of each tokenization method for a given language stays the same regardless of the unit, the choice of the unit for the perplexity measure should be carefully made if we are to compare different languages. As mentioned above, the datasets were not strictly parallel across the languages even under the 2) New Testament setting. Parallel texts and different evaluation methods might facilitate comparison across languages. For example, <ref type="bibr">Mielke et al. 
(2019)</ref> use the average surprisal (negative log-likelihood loss) per verse when comparing language models using data fully aligned at the verse level and also suggest a statistical method to estimate the difficulty coefficient of a language given some missing verses. Aligning a parallel corpus of polysynthetic languages and others at the verse or sentence level may lead to a more useful comparison in future research.</p></div>
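<div xmlns="http://www.tei-c.org/ns/1.0"><p>The relationship between the two units of measurement discussed above can be made explicit. The following derivation is an illustrative sketch; the symbols L (total negative log-likelihood of the test set, in nats), C (number of characters) and W (number of words) are notation introduced here, and the numbers in the worked example are hypothetical.</p>

```latex
\begin{gather*}
\mathrm{PPL}_{\mathrm{char}} = \exp\!\left(\frac{L}{C}\right), \qquad
\mathrm{PPL}_{\mathrm{word}} = \exp\!\left(\frac{L}{W}\right)
  = \left(\mathrm{PPL}_{\mathrm{char}}\right)^{C/W} \\[4pt]
\text{Example: } \mathrm{PPL}_{\mathrm{char}} = 2,\quad
C/W = 6 \;\Rightarrow\; \mathrm{PPL}_{\mathrm{word}} = 2^{6} = 64, \qquad
C/W = 12 \;\Rightarrow\; \mathrm{PPL}_{\mathrm{word}} = 2^{12} = 4096
\end{gather*}
```

<p>Under these assumptions, two languages with identical character-level perplexity can differ greatly at the word level whenever one of them has longer words on average, which is consistent with the reversal in ranking described above.</p></div>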
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Future Direction</head><p>The results clearly show that FST segmentation is helpful in modelling polysynthetic languages. While we had only two languages to experiment with FST segmentation, FST segmentation with or without a backoff strategy resulted in the best model by a large margin. Figure <ref type="figure">5</ref>.1 and Figure <ref type="figure">5</ref>.2 visualize the relative performance of the FST model v. BPE or character models at the sentence level for ess and grn, respectively. For both figures, points under the 45 degree line mean lower loss or better performance for the FST model than Morfessor, BPE or character model. For both languages, it is clear that the FST models resulted in lower loss (negative log-likelihood) per sentence overall as well as the entire text. This represents an opportunity to utilize an existing, linguistically-oriented system in aiding neural language modelling. While FSTs might not be as helpful in modelling high-resource languages with poor morphology, they will be essential in modelling low-resource polysynthetic languages.</p><p>Another line of inquiry we are currently pursuing is comparing polysynthetic languages with other languages in terms of language modelling difficulty. In order to compare different languages more precisely, we are using aligned Bible datasets and comparing a perplexity measure per verse. By modelling 149 Bibles in 94 languages, covering 24 language families, we aim to answer if polysynthetic languages are indeed harder to model than other languages and what kind of linguistic, typological features (if any) explain such difficulty.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 6</head><p>Applications &amp; Future Work</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">On-device Text Prediction</head><p>One of the goals of this workshop was to make progress in providing human language technologies that can actually be used by native speakers. As smartphones become ubiquitous in native communities, text entry is becoming an increasingly important use case.</p><p>In particular, users should have access to text entry methods, namely custom keyboards, that allow them to enter text quickly and accurately. Currently, most of the languages we consider have no form of predictive keyboard available.</p><p>Our goal was to develop a pipeline for constructing custom predictive keyboards for polysynthetic languages. We wanted the keyboards to allow both automatic completion of the current unit of text being typed by the user (where units could refer to morphemes or words) and prediction of the next unit when the user input reached a boundary. Both completion and prediction rely on language models to work, so the bulk of our efforts focused on adapting trained neural network language models for on-device use.</p><p>Ultimately, we successfully built functional prototype on-device keyboards for Guaran&#237; (grn) and St. Lawrence Island Yupik (ess). To our knowledge, these would be the first open-source predictive keyboards available for these languages on the Android platform.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.1">Open Source Stack</head><p>We chose to integrate our predictive LM models with the android branch of the open source Divvun toolkit<ref type="foot">1</ref>. Divvun was chosen since it is actively developed, and the project has a stated goal of enabling text entry for low-resource languages. The toolkit provides base IME front end source code that handles on-device keyboard display and capturing of user input. We rewrote Divvun's default back end to enable loading a trained neural LM that could be used to make future predictions based on the text buffer content the user has already typed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2">User Interface Considerations</head><p>Polysynthetic languages pose unique challenges for UI/UX design in the context of a predictive keyboard. A key question concerns the level of granularity at which predictions should be presented.</p><p>Existing keyboards almost exclusively make predictions over whole words. For polysynthetic languages, word-level prediction is problematic. For reasons introduced in Chapters 1 and 2, it isn't feasible to train an effective language model over words in languages with extremely productive morphology. Most words are composed on the fly, and so would not have been seen during training. Furthermore, polysynthetic morphology permits extremely long words (e.g., "o&#241;embohuguaipu'&#227;" in Guaran&#237;). The small prediction strip present on device keyboards would not be able to comfortably accommodate so many characters in a single prediction. As a compromise, we chose to use morphemes as the unit of prediction for our keyboard prototypes. As the user types, the prediction bar presents them with either completions of the morpheme they are currently typing, or predictions for the next morpheme if the language model predicts they are at a morpheme boundary.</p><p>The use of morphemes as units of prediction implies that we have access to morphological analysis and segmentation tools that can generate morpheme-level training data for our language models. These tools may not be available for all languages, in which case different subword units may need to be used. One option is to do modelling and prediction over BPE word chunks. However, these would likely appear unnatural to most users, since BPE segmentation is unsupervised and linguistically unaware, leading to segmentation that doesn't correspond to any natural boundaries. 
A better option would be to use syllables as units, since they can be extracted with a simple model that looks for consonant/vowel alternations, and do correspond to cognitively 'natural' linguistic units.</p></div>
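<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of how syllable-like units could be extracted from consonant/vowel alternations, consider the sketch below. It is a toy heuristic written for this illustration, not a component of the keyboards described here: the vowel inventory is an assumption, digraphs are treated as sequences of separate consonants, and adjacent vowels are never split.</p>

```python
VOWELS = set("aeiou")  # assumption: toy vowel inventory, to be replaced per language


def syllabify(word):
    """Start a new unit at every consonant that is directly followed by a vowel."""
    syllables = []
    current = ""
    for i, ch in enumerate(word):
        next_is_vowel = i + 1 != len(word) and word[i + 1] in VOWELS
        starts_onset = ch not in VOWELS and next_is_vowel
        # Only split once the current unit already contains a vowel (a nucleus).
        if starts_onset and any(c in VOWELS for c in current):
            syllables.append(current)
            current = ""
        current += ch
    syllables.append(current)
    return syllables


units = syllabify("neghtuk")
```

<p>A practical system would need a language-specific grapheme inventory, but the example shows that such units can be obtained without a morphological analyzer.</p></div>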
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.3">Adapting Neural Language Models for Mobile Devices</head><p>As shown in Figure <ref type="figure">6</ref>.1, we'd like to build an interface that uses the context typed into a buffer to present completions and predictions to the keyboard user. To do this, we need to feed the context data into a language model. Initially, we attempted to use the SOTA PyTorch-based language models tested in Chapter 5 directly on-device. However, this proved to be technically prohibitive. First, device resources are limited, and keyboards should be lightweight: they only account for text entry and shouldn't have a significant impact on other running applications. We set a goal of keeping our model size on the order of 10Mb. Second, there is little built-in support in Android for loading and running PyTorch models. In contrast, Google provides the TensorFlow Lite (TFLite) framework for loading models trained via TensorFlow and converted for on-device use.</p><p>We attempted to convert our PyTorch models to TensorFlow using the ONNX toolkit,<ref type="foot">foot_21</ref> but found that the automatic converter did not support many of the operations used. Ultimately, we settled on training custom models for keyboard operation building on TensorFlow sample code. <ref type="foot">3</ref> We trained our models using the full desktop version of TensorFlow, and successfully exported the portion of the resulting computation graph responsible for inference to TFLite.</p><p>For both Guaran&#237; and Yupik, language models were trained on text from the Bible that had been processed via the FSTs described in Chapters 3 and 7 to include morpheme boundaries. The data was split as described in Chapter 5 for consistency with the language modelling experiments described there. The training data covered all available Bible verses except the gospel of Luke (which was reserved as development data), and John (which was reserved as test data). 
The models were built at the character level, but with morpheme boundaries (@) marked directly on predicted symbols, as shown in Figure <ref type="figure">6</ref>.2. This modification enabled the model to guess when a morpheme boundary was reached (i.e., a symbol with @ was predicted/typed).<ref type="foot">foot_23</ref>  The model consisted of the following architecture. A single LSTM with 2 layers, and 200 hidden units per layer, read a 30-character context. The final hidden state of the LSTM was passed through a dense layer followed by a softmax to assign probabilities to each possible next symbol. The LSTM was trained with dropout (keep_prob=0.75) between layers, with dropout disabled during inference. Batches of 20 contexts were used for training. Optimization was done via Adam, with initial learning rate 1.0 and learning rate decay 0.5.</p><p>When the model was loaded on the device, our custom Divvun back end sent the last 30 chars of the input buffer the user had typed through the model, and used the greedy algorithm below to generate continuations and predictions to display to the user in the keyboard's prediction bar. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>append(prediction) end</head><p>Currently, prediction stops when the model predicts a morpheme or word boundary. This stopping condition can be altered as needed to, for example, avoid stopping if the current prediction is too short (e.g., a single character) or continue predicting until the total log probability of the predicted string drops below a given threshold. Predictions can also be reached by a different, less greedy search algorithm, such as a depth-first search starting at the current context. However, this has a high chance of producing many candidates with the same prefix. The method used here was chosen for its simplicity, and because it ensures candidates are diverse (no two candidates can share the same initial character). User testing might be able to determine if this bias towards diverse predictions is desirable.</p></div>
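<div xmlns="http://www.tei-c.org/ns/1.0"><p>A greedy procedure with the properties described above (distinct initial symbols, stopping at a morpheme or word boundary) might look like the following sketch. It is a reconstruction under stated assumptions, not the exact algorithm used on the device: the function names, the n_candidates and max_len parameters, and the toy probability table standing in for the trained LSTM are all hypothetical.</p>

```python
BOUNDARY_MARK = "@"  # morpheme boundary marked directly on a symbol, as described in the text
WORD_BOUNDARY = " "


def is_boundary(symbol):
    return symbol.endswith(BOUNDARY_MARK) or symbol == WORD_BOUNDARY


def greedy_candidates(next_symbol_probs, context, n_candidates=3, max_len=20):
    """Return up to n_candidates continuations, each starting with a different symbol."""
    first = next_symbol_probs(context)
    # Seed one candidate per top-ranked next symbol, so no two share an initial symbol.
    starts = sorted(first, key=first.get, reverse=True)[:n_candidates]
    candidates = []
    for symbol in starts:
        prediction = [symbol]
        # Extend greedily until a boundary symbol is produced (or a length cap is hit).
        while not is_boundary(prediction[-1]) and len(prediction) != max_len:
            probs = next_symbol_probs(context + prediction)
            prediction.append(max(probs, key=probs.get))
        candidates.append(prediction)
    return candidates


def toy_model(context):
    """Hypothetical stand-in for the trained model: conditions on the last symbol only."""
    table = {
        "n": {"e": 0.9, "a": 0.1},
        "e": {"k@": 0.7, "t": 0.3},
        "a": {"k@": 0.6, "n": 0.4},
    }
    last = context[-1] if context else ""
    return table.get(last, {"n": 0.5, "t": 0.3, "a": 0.2})


suggestions = greedy_candidates(toy_model, ["n"], n_candidates=2)
```

<p>Replacing the inner greedy step with a beam or depth-first search would change only the loop body, at the cost of the diversity guarantee discussed above.</p></div>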
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.4">Future Development</head><p>In Chapter 5, we evaluate our underlying language model quality via perplexity measures. Unfortunately, we did not have access to native speakers during the workshop and so could not perform direct user testing with our prototype keyboards.</p><p>Our ultimate goal would be to push our development back to the main Divvun project, so that it can receive ongoing support, and make it into the hands of native speakers. However, there are a number of evaluation measures that approximate the user experience related to prediction quality. Top-n prediction recall measures how often the correct prediction would have been shown to the user in the keyboard's prediction strip (assuming the user was typing a fixed script). Similarly, we can measure how many keystrokes a user can save by selecting a prediction (1 touch) versus typing it out (# touches corresponding to characters in the prediction unit).</p><p>Our prototype keyboards lack certain features that are standard on more mature offerings for languages like English. First, we assume the users touch exactly the keys they intended, and that they don't make spelling mistakes. The reality of using a touch device is that input is noisy and prone to error, with touches often sensed only in the vicinity of the intended key. A noisy channel model applied to the sequence of touch points received by the keyboard can be used to auto-correct these mistakes.</p><p>Second, our keyboard's predictions are at the mercy of the data used to train our language models. Without a whitelist of acceptable units, or a blacklist of units that shouldn't be predicted, there is nothing preventing the model from generating offensive language. Similarly, predictions can be significantly biased towards the style of the training data. In our case, our LMs are noticeably 'evangelical,' being trained almost exclusively on text from the Bible.</p></div>
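<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two offline measures mentioned above, top-n prediction recall and keystroke savings, could be computed along the following lines. This is an illustrative sketch only: the function names, the event format and the example data are hypothetical and were not part of the workshop evaluation.</p>

```python
def top_n_recall(events, n=3):
    """events: list of (ranked_predictions, intended_unit) pairs from a fixed script."""
    hits = sum(1 for predictions, target in events if target in predictions[:n])
    return hits / len(events)


def keystrokes_saved(unit):
    """Touches saved by selecting a predicted unit (1 touch) instead of typing each character."""
    return max(len(unit) - 1, 0)


# Hypothetical log: what the prediction strip showed versus what the user intended.
events = [
    (["tuk", "nga", "k"], "tuk"),   # intended unit was shown
    (["tu", "k", "gh"], "nga"),     # intended unit was not shown
]
recall = top_n_recall(events, n=3)
saved = sum(keystrokes_saved(target) for predictions, target in events if target in predictions[:3])
```

<p>Both measures assume error-free typing of a known script, so they approximate rather than replace testing with native speakers.</p></div>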
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Speech Recognition</head><p>Within this section, we discuss two experiments with automatic speech recognition on polysynthetic languages: preliminary experiments with Crow (cro) word prediction and experiments with Guaran&#237; speech recognition. First, we describe previous work on speech recognition for polysynthetic languages as well as some of the inherent difficulties that arise when constructing speech corpora. Then, we discuss our baseline approach to end-to-end neural speech recognition using the Deepspeech model <ref type="bibr">(Hannun et al., 2014)</ref>, the preliminary results obtained, and future directions for polysynthetic speech recognition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1">Related work</head><p>Speech recognition for polysynthetic languages is a relatively new area of research. Much of this is due to the necessity of large transcribed speech corpora. <ref type="bibr">Klavans et al. (2018b)</ref> present an overview of the challenges facing automatic speech recognition for polysynthetic languages. They note that there is a dearth of resources for polysynthetic languages, particularly transcribed speech corpora. These corpora require large volumes of data from skilled native language speakers. The size of the corpora required and the linguistic, technological and language-specific knowledge required make this a difficult task for communities to accomplish on their own. Hasegawa-<ref type="bibr">Johnson et al. (2017b)</ref> state that "transcribing even one hour of speech may be beyond the reach of communities that lack large-scale government funding" (as cited in <ref type="bibr">Klavans et al. (2018b)</ref>).</p><p>For Seneca, <ref type="bibr">Jimerson et al. (2018)</ref> investigated the application of different ASR models to a small spoken corpus of Seneca (consisting of approximately 155 minutes of recordings). They found that GMM ASR models from the Kaldi ASR toolkit of <ref type="bibr">Povey et al. (2011)</ref> yielded better results than neural approaches on this small dataset; the neural approaches required transfer learning from pretrained English ASR models and various augmentation procedures on both the text data and audio data to even approach GMM performance.</p><p>For Guaran&#237;, a relatively large speech corpus has been constructed as part of the IARPA Babel project (see <ref type="bibr">Gales et al. (2017)</ref>). Their paper also discusses a number of optimization methods for keyword search in speech data. They obtain a WER of 49.5 for their Guaran&#237; ASR system using stimulated network training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2">Methodology</head><p><ref type="bibr">Hannun et al. (2014)</ref> introduce Deepspeech, the end-to-end neural speech recognition system used for the following experiments. This system takes in short-time Fourier transform (STFT) features (referred to as 'spectrogram' features in the original work). These features go through three convolutional layers with ReLU activation, and then a single bidirectional RNN. Lastly, a softmax layer is used to give a probability distribution over the possible characters in the dataset.</p><p>We borrow from this original implementation with some modifications: instead of a simple recurrent layer, we utilize gated recurrent units, and instead of a single hidden recurrent layer, we utilize a number of different recurrent layers. <ref type="bibr">Hannun et al. (2014)</ref> use a non-gated recurrent final layer as they were seeking to avoid computing and storing the update, input and output gates used in Long Short-Term Memory (LSTM) recurrent units. As a compromise between LSTMs and non-gated RNNs, we utilize Gated Recurrent Units (GRUs). GRUs have an update gate but no output gate, thus saving some computation in comparison to an LSTM while still making the neural network less susceptible to exploding/vanishing gradients than a non-gated RNN. We also introduce more recurrent layers after the convolutional layers, which yields significant increases in performance at the cost of increased runtime.</p></div>
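<div xmlns="http://www.tei-c.org/ns/1.0"><p>For reference, one common formulation of the GRU update is given below. It is included as background rather than as a description of the exact layer implementation used in these experiments; the weight names are generic notation, and some libraries swap the roles of the update gate and its complement in the final interpolation.</p>

```latex
\begin{gather*}
z_t = \sigma\left(W_z x_t + U_z h_{t-1} + b_z\right) \qquad \text{(update gate)} \\
r_t = \sigma\left(W_r x_t + U_r h_{t-1} + b_r\right) \qquad \text{(reset gate)} \\
\tilde{h}_t = \tanh\left(W_h x_t + U_h \left(r_t \odot h_{t-1}\right) + b_h\right) \\
h_t = \left(1 - z_t\right) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{gather*}
```

<p>In this formulation the hidden state is exposed directly, without a separate output gate or cell state, which is the source of the computational saving relative to an LSTM.</p></div>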
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.3">Decoding</head><p>Language models can help improve automatic speech recognition systems by imposing constraints on the possible character co-occurrences. We present results for greedy decoding, where no language model is utilized and the network's predicted character sequence is not explicitly constrained. In the future, we will incorporate language models into the speech recognition system.</p></div>
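<div xmlns="http://www.tei-c.org/ns/1.0"><p>Greedy decoding without a language model can be sketched as follows. The sketch assumes that the network is trained with connectionist temporal classification (CTC), as in the original Deep Speech system, so that decoding consists of taking the most probable symbol per frame, collapsing repeats and removing blanks. The blank symbol, alphabet and probabilities are hypothetical.</p>

```python
from itertools import groupby

BLANK = "_"  # assumption: symbol standing for the CTC blank


def greedy_ctc_decode(frame_probs, alphabet):
    """frame_probs: one list of probabilities per audio frame, aligned with alphabet."""
    # 1. Pick the most probable symbol independently for every frame.
    best_path = [alphabet[max(range(len(p)), key=p.__getitem__)] for p in frame_probs]
    # 2. Collapse runs of repeated symbols.
    collapsed = [symbol for symbol, _ in groupby(best_path)]
    # 3. Drop blanks; a blank between two identical symbols keeps them distinct.
    return "".join(symbol for symbol in collapsed if symbol != BLANK)


alphabet = [BLANK, "a", "b"]
frames = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, collapsed)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.8, 0.1],    # a
    [0.1, 0.2, 0.7],    # b
]
decoded = greedy_ctc_decode(frames, alphabet)
```

<p>Because each frame is decided independently, nothing prevents implausible character sequences, which is the gap a language model would fill during decoding.</p></div>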
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.4">Preliminary results</head><p>Initial results for Crow word recognition and Guaran&#237; speech recognition are shown in the following sections. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.5">Crow</head><p>As noted in 3.1.5, the data available for Crow consists only of recordings of single words and small phrases. In addition, very little monolingual text data for Crow was available. Due to the lack of long phrases, as with the Guaran&#237; data, and the lack of large monolingual language resources, only a single recurrent layer was used in our model, similar to the original DeepSpeech implementation. In addition, the language model created from a very small collection of Crow monolingual stories was given very little weight due to the low coverage of the model. Initial experiments at word prediction proved unsuccessful. The neural net simply produced all spaces for output.</p><p>A pretrained English model trained on the Librispeech corpus was leveraged in an attempt to get any output at all from the Crow data. This pretrained model was then adapted to the available Crow data. The results from this adapted speech recognition model are shown in Table <ref type="table">6</ref>.1. While the results produced are very poor, the network was at least producing some output at this point.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.6">Guaran&#237;</head><p>For Guaran&#237;, a number of different recurrent layers were used. Character and word error rates for the development dataset from the IARPA corpus using greedy decoding are shown in Table <ref type="table">6</ref>.2. Both the development and training datasets used only utterances between 1 and 15 seconds in length, so the results shown are not directly comparable to <ref type="bibr">Hartmann et al. (2016)</ref>. Future experiments will be conducted on all the data for more direct comparison. All models were trained for 50 epochs with a starting learning rate of 10^-4 and learning rate annealing each epoch.</p></div>
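<div xmlns="http://www.tei-c.org/ns/1.0"><p>Character and word error rates of the kind reported here are conventionally computed as the edit distance between reference and hypothesis, divided by the reference length. The sketch below illustrates that convention; it is not the scoring script used for the reported results, and the example strings are placeholders rather than Guaran&#237; data.</p>

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (r != h)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def word_error_rate(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def char_error_rate(reference, hypothesis):
    return edit_distance(reference, hypothesis) / len(reference)


wer = word_error_rate("a b c d", "a x c")   # one substitution and one deletion over four words
cer = char_error_rate("a b c d", "a x c")   # three character edits over seven characters
```

<p>Since both rates are normalised by the reference length, they can exceed 1.0 when the hypothesis contains many insertions.</p></div>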
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.7">Future directions</head><p>Moving forward, we will incorporate neural language models into the speech recognition systems. Currently, the results displayed utilize simple greedy predictors with no explicit language modelling or conventional n-gram based language models <ref type="bibr">(Heafield, 2011)</ref> for decoding. <ref type="bibr">Gales et al. (2017)</ref> used an RNN language model with Pashto speech recognition and found that it had a minor effect on speech recognition but helped significantly with keyword search. However, their approach seems to involve a neural language model during the decoding stage. Incorporating a neural language model into the architecture using adversarial networks could enable still lower error rates as the model</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 7</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Feature-rich Open-vocabulary Interpretable Language Model</head><p>In this chapter, we present a novel general-purpose neural language modelling framework designed to be capable of handling a broad variety of typologically diverse languages, including languages whose morphology includes any or all of the following: prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and clitics. We motivate our language modelling framework using examples drawn primarily from St. Lawrence Island Yupik. St. Lawrence Island Yupik is a polysynthetic suffixing language in which words with 1 root, 0-3 derivational morphemes, and 1 inflectional morpheme are common, and words with up to 7 derivational morphemes have been attested <ref type="bibr">(de Reuse, 1994)</ref>. In Example <ref type="bibr">(4)</ref> we observe a sample two-word sentence from St. Lawrence Island Yupik. The first word qikmighhaak is a noun composed of a noun root qikmigh, a derivational suffix -ghhagh that serves as a diminutive, and an inflectional suffix -k that indicates the noun's case (absolutive) and number (dual). The second word neghtuk is a verb composed of a verb root negh and an inflectional suffix -tuk that indicates the verb's mood (indicative) and valence (intransitive), as well as the person (3rd person) and number (dual) of the verb's subject. Note that it is common for the form in which a morpheme surfaces in a word to differ from the underlying lexical form of that morpheme. 
In the morphemes' respective surface forms in this example, the final uvular fricatives of qikmigh and -ghhagh are each dropped, the vowel of -ghhagh is lengthened, and the final uvular fricative of negh devoices to match the adjacent voiceless stop at the beginning of -tuk.</p><p>(5) Mangteghaghrugllangllaghyunghitunga mangteghagh--ghrugllag--ngllagh--yug--nghite--tu--nga house--big--build--want.to--to.not--INTR.IND--1SG 'I didn't want to make a huge house' <ref type="bibr">(Jacobson, 2001, pg. 43)</ref> In Example (<ref type="formula">5</ref>), a single Yupik word represents an entire sentence. The word consists of a noun root mangteghagh, a derivational suffix ghrugllag that serves as an augmentative, a verbalizing derivational suffix ngllagh, a verb-elaborating derivational suffix yug, another verb-elaborating derivational suffix nghite, and inflectional suffixes tu and nga that mark mood (indicative) and valence (intransitive), as well as the person (1st person) and number (singular) of the verb's subject.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Language Model Desiderata</head><p>A language model capable of effectively modelling the full linguistic diversity found in human languages, including St. Lawrence Island Yupik and similar endangered and polysynthetic languages, should satisfy the following desiderata.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.1">Flexibility with respect to language typology</head><p>Typical methods of categorizing languages by morphological type include isolating, fusional, agglutinative and polysynthetic. There are also morphological affix types such as prefixes, suffixes, circumfixes, infixes and templatic morphology, and processes such as compounding and incorporation.</p><p>One can think of isolating languages as those (almost) without productive morphology, such as Chinese and Vietnamese. These languages are well served by existing approaches to language modelling which treat the word as the fundamental unit.</p><p>Fusional languages are those where a morpheme may represent multiple morphological or syntactic features. Most well-known Indo-European languages are of this type. They may also have complicated, irregular, or lexicalised phonological processes occurring when morphemes are joined together. Consider for example Catalan tener 'to have'-tinc 'I have'-tinga 'I have'. The stem is ten-, -er is the formant of the infinitive, -c is the formant of the first person singular present indicative and -nga is the formant of the first and third person present subjunctive. A vowel change in the stem occurs when the suffixes are attached to the stem. This example has two fusional features: multiple features per morpheme and stem-internal phonological changes caused by affixing. These languages are fairly well dealt with in existing approaches; the number of forms that can be generated by these processes may be larger than in isolating languages, but is essentially a finite set.</p><p>As mentioned, current ad hoc methods work fairly well with isolating and fusional languages, where there are a finite number of forms for a single word. Out-of-vocabulary items are a problem, but are typically related to unseen new stems rather than forms of seen stems. 
Agglutinating and polysynthetic languages have this problem too, but in addition they have the problem of unseen forms of previously seen stems.</p><p>In agglutinating languages, and in polysynthetic languages to an even greater extent, words are typically made up of many morphemes concatenated together. These are typically prefixes or suffixes, or a combination. The Yupik example in (<ref type="formula">4</ref>) is an example of suffixing, and indeed Yupik is an exclusively suffixing language. Guaran&#237; combines suffixes, which are primarily for tense, aspect, and mood (TAM) markers and subordination, with prefixes for valency changing and agreement. This is illustrated in Example <ref type="bibr">(6)</ref> where the ai- prefix indicates first-person singular agreement, and the -se suffix indicates volitional mood, and in Example <ref type="bibr">(7)</ref> where the &#241;a- prefix indicates agreement and the -va suffix indicates nominalisation. 'that we did not expect at all'</p><p>The negative form of Guaran&#237; verbs is formed by a circumfix of two morphemes, nd- and -i. These circumfixes go around verbal derivations, agreement and TAM markers etc., as in <ref type="bibr">(7.1.1)</ref>.</p><p>(8) ndojuhumo'&#227;i nd-o-juhu-mo'&#227;-i NEG-3-find-FUT-NEG</p><p>In Chukchi the comitative case is made up of a circumfix of two morphemes, /&#947;a/-and -/ma/. The noun /&#322;awt/ 'head' forms the associative singular /&#947;a&#322;awt&#515;ma/ by combining these and adding an epenthetic schwa.</p><p>Infixes are morphemes that break a given stem and appear inside it. For example, Seri, a language spoken in the north-west of Mexico, uses infixation after the first vowel in the root to create forms with number agreement. For example, ic 'to plant', i{t&#237;}c i 'did she plant it?' vs. i{t&#237;}{t&#243;o}c 'did they plant it?'.</p><p>In languages with templatic morphology, the root is typically represented as a consonant template, e.g. 
in Maltese, k-t-b 'book'. Inflection takes place by "filling" the slots in the root with other templates, such that e.g. ktieb 'book' (singular), kotba 'books', are formed by combining the root with the vowel templates {&#248;-ie, o-&#248;}, and in the plural the suffix -a.</p><p>An ideal language model would be able to encode all of these types of morphology in a generic and compositional manner without using language- or typology-specific tricks or assumptions (e.g. that productive morphological processes are exclusively suffixing). <ref type="foot">1</ref> It should allow for arbitrary subsets of characters in a given string to form meaningful, compositional units.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.2">Ability to incorporate external knowledge sources as features</head><p>In high-resource settings, neural networks commonly function as effective feature extractors <ref type="bibr">(Goodfellow et al., 2016)</ref>. In very low-resource settings such as St. Lawrence Island Yupik, extreme data sparsity means that neural models are likely to have too little data to extract such features reliably. To alleviate this issue, our language model should be capable of incorporating a rich array of features from supplementary knowledge sources when there is too little data to learn them. Finite-state morphological analyzers <ref type="bibr">(Beesley and Karttunen, 2003)</ref> in particular represent a mature technology capable of serving as a reliable source of rich linguistic features. In the Yupik Example (<ref type="formula">4</ref>) above, we make use of the finite-state morphological analyzer of <ref type="bibr">Chen and Schwartz (2018)</ref>. At a minimum, we expect such an analyzer to decompose a Yupik word, providing morpheme boundary information and the associated constituent morphemes. We expect that in most cases a morphological analyzer should also provide the underlying orthographic form of each root morpheme and each derivational morpheme, the set of linguistic features such as noun case, verb mood, person, and number associated with each inflectional morpheme, and the underlying type of each morpheme (such as noun, verb, nominalizing suffix, etc.). In some cases, an analyzer might also provide information regarding the phonemes in each morpheme.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.3">Open vocabulary</head><p>In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically-distinct variants (such as dog and dogs) as completely independent word types, rather than inflected variants of a common root. In polysynthetic languages in general, and in Yupik in particular, encountering previously unseen word forms is pervasive and should be considered the norm rather than the exception. In very low-resource settings, it is especially important that our language model be able to robustly handle and predict out-of-vocabulary tokens. Language models with a closed vocabulary are not viable in such settings. Instead, we require an open vocabulary language model in which the probability of a token given a history can be robustly calculated even when that token was not present in the training data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.4">Interpretability of predicted units</head><p>By definition, a language model provides a probabilistic model over a sequence of linguistic units. In other words, a language model must be able to provide a probability distribution over the identity of the current linguistic unit given a history representing the preceding linguistic units in the sequence. We use the term linguistic unit to refer to an instance of any well-defined linguistic level of analysis, such as a word, a morpheme, a syllable, a phoneme, or even a grapheme.</p><p>In our language model, we require that the computational mechanism that implements the linguistic unit be interpretable. For example, consider the case of a trained instance of our language model randomly generating a sequence of morphemes; when the model generates a morpheme, we should be able to recover whatever rich features may be encoded therein (see &#167;7.1.2), such as the underlying grapheme or phoneme sequence and the type of morpheme <ref type="bibr">(root, derivational, inflectional, etc)</ref>. This should be the case regardless of whether the generated unit was present in the training data or not (see &#167;7.1.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Sub-word language models</head><p>The rich morphology and phonology of Yupik and typologically similar languages result in an extreme type-token ratio. This fact, coupled with a very small corpus size, makes n-gram language models and recurrent neural language models over words highly unlikely to be effective. <ref type="bibr">Schwartz et al. (2019)</ref> examined the number of potential word forms in St. Lawrence Island Yupik, and estimated approximately 1.27 &#215; 10^23 morphotactically licensed word forms. This number is approximately equal to current estimates of the number of stars in the observable universe.<ref type="foot">2</ref> While this estimate does not take into account restrictions imposed by semantic felicity, the polysynthetic nature of the language ensures an extremely high fraction of hapax legomena in Yupik texts, with <ref type="bibr">Schwartz et al. (2020)</ref> reporting that approximately every other Yupik word token establishes a previously unseen word type. In contrast to the astronomical number of potential Yupik word forms, the complete collection of fully digitized St. Lawrence Island Yupik texts available at the time of the 2019 JSALT workshop consisted of a corpus of slightly over 81,000 word tokens (see Chapter 3 for more details). In lieu of word-based language models, we consider language models that utilize sub-word units.</p><p>Language models serve as an enabling technology for other downstream language technologies, including mobile text prediction. These technologies are mature and widespread for many high-resource languages, but relatively immature and rare for polysynthetic languages. In this section, we present several motivating use cases of sub-word language models for polysynthetic languages.</p></div>
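<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the two corpus statistics mentioned above, the following minimal Python sketch computes a type-token ratio and the fraction of word types that are hapax legomena. The function name and the toy token list are hypothetical and are not drawn from any Yupik corpus; whitespace tokenization is assumed to have been done already. A type-token ratio near 0.5 corresponds to the situation in which roughly every other token introduces a new word type.</p>

```python
from collections import Counter

def type_token_stats(tokens):
    """Return (type-token ratio, fraction of types that occur exactly once)."""
    counts = Counter(tokens)
    n_types = len(counts)
    n_hapax = sum(1 for c in counts.values() if c == 1)
    return n_types / len(tokens), n_hapax / n_types

# Toy, invented token list used only to exercise the function.
toy = ["a", "b", "a", "c", "d", "a"]
ttr, hapax_fraction = type_token_stats(toy)
print(ttr, hapax_fraction)
```

<p>On real data, both values would be expected to shrink as the corpus grows; the sketch makes no claim about how quickly.</p></div>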
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.1">Prediction of next morpheme</head><p>The core operation of a language model is estimating the conditional probability of a predicted next linguistic unit given a history of previous linguistic units. Figure <ref type="figure">7</ref>.1 illustrates a recurrent neural network language model that predicts the most likely next morpheme given a history of four immediately preceding morphemes, where each morpheme is encoded as a vector. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2.2">Prediction of next character</head><p>A closely related task applicable in the context of mobile text completion is the prediction of the next character given a preceding sequence of characters. In the polysynthetic language setting, it may be beneficial to augment such a model with a history of morphemes in situations where this information is available.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Neural morphological analysis</head><p>As discussed in &#167;2.1, finite-state morphological analyzers provide a mechanism for encoding linguistic knowledge in a finite-state transducer capable of analyzing a word and providing morpheme boundaries and other linguistically salient information about the underlying morphemes that comprise the word. Recent work has explored how a finite-state morphological analyzer can be used to bootstrap a neural morphological analyzer <ref type="bibr">(Micher, 2018b;</ref><ref type="bibr">Schwartz et al., 2019;</ref><ref type="bibr">Silfverberg and Tyers, 2019)</ref>. Building on that work, we propose a neural morphological analyzer that directly predicts morpheme vectors, rather than predicting a sequence of strings representing an analyzed form. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Tensor Product Representation</head><p>To satisfy the language model desiderata specified in &#167;7.1, we consider the Tensor Product Representation (TPR) proposed by <ref type="bibr">Smolensky (1990)</ref>. The use of TPRs provides a principled way of representing hierarchical symbolic information in vector spaces, such as those used as the input and output domains of neural networks. Developing a tensor-product-based representational scheme begins by decomposing a symbolic structure into roles and fillers. A symbolic structure can then be represented as the bindings of fillers to roles. Once decomposed, both roles and fillers are embedded into a vector space such that all roles are linearly independent from one another. Let b be a list of ordered pairs (i, j) representing filler i (with embedding vector f_i) being bound to role j (with embedding vector r_j). The tensor product representation T of the information is then given by <formula notation="TeX">T = \sum_{(i,j) \in b} f_i \otimes r_j</formula></p><p>This TPR may itself be used as a filler and subsequently be bound to another role vector. This process results in a TPR that represents hierarchical compositional structure.</p></div>
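<div xmlns="http://www.tei-c.org/ns/1.0"><p>The binding operation can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in this work: the helper names (outer, add, bind), the two positional roles, and the three character fillers are all hypothetical, and vectors are ordinary lists. The TPR is built as the sum of filler-role outer products over the bound pairs.</p>

```python
def outer(f, r):
    """Tensor (outer) product of filler vector f and role vector r."""
    return [[fi * rj for rj in r] for fi in f]

def add(a, b):
    """Element-wise sum of two matrices of equal shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def bind(bindings, fillers, roles):
    """Sum the outer products f_i (x) r_j over all (i, j) pairs in bindings."""
    dim_f = len(next(iter(fillers.values())))
    dim_r = len(next(iter(roles.values())))
    T = [[0.0] * dim_r for _ in range(dim_f)]
    for i, j in bindings:
        T = add(T, outer(fillers[i], roles[j]))
    return T

# Hypothetical toy embeddings: orthonormal roles, one-hot fillers.
roles = {"pos1": [1.0, 0.0], "pos2": [0.0, 1.0]}
fillers = {"c": [1.0, 0.0, 0.0], "a": [0.0, 1.0, 0.0], "t": [0.0, 0.0, 1.0]}
T = bind([("c", "pos1"), ("a", "pos2")], fillers, roles)
print(T)
```

<p>Binding the resulting matrix to a further role vector in the same way would add one more tensor axis, which is how hierarchical structure is accumulated.</p></div>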
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4.1">Unbinding</head><p>TPRs are useful because they embed arbitrary symbolic structure in a vector space in such a way that simple linear algebra operations may be used to retrieve the form of the symbolic structure, including its compositional structure. The core operation in retrieving this structure is called unbinding. We may use unbinding to query a role for its filler. Unbinding may be accomplished by any of several exact or approximate strategies. Exact unbinding requires linear independence of the roles; however, recent (unpublished) work points to the accuracy of approximate unbinding even in densely packed TPRs. In this work, we use self-addressing unbinding, as it is quick to compute and proved sufficiently accurate for our purposes. Self-addressing unbinding retrieves the filler f_i for the role r_i by simply computing the inner product between the role vector and the TPR: <formula notation="TeX">\hat{f}_i = T \cdot r_i</formula> This unbinding is exact if the role vectors are orthogonal to one another. Otherwise, the intrusion of the filler of role j, f_j, into the unbound filler of the role i, f_i, is given by <formula notation="TeX">(r_i \cdot r_j) \, f_j</formula> In our case, since we have a fixed filler vocabulary, we were able to snap our unbindings to the filler with the highest cosine similarity to the unbound vector with sufficient accuracy to render this intrusion irrelevant. Other unbinding strategies involve computing an inverse or pseudoinverse of a matrix of role vectors to perform a change of basis and decrease the intrusion.</p></div>
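<div xmlns="http://www.tei-c.org/ns/1.0"><p>A small worked example may help make the intrusion concrete. The sketch below is an assumption-laden illustration in plain Python: the function names (unbind, cosine, snap), the one-hot fillers, and the two unit-length role vectors are hypothetical, and the roles are deliberately chosen to be non-orthogonal (inner product 0.6) so that the unbound vector is contaminated by the other filler. Snapping to the filler with the highest cosine similarity still recovers the intended symbol.</p>

```python
import math

def unbind(T, r):
    """Self-addressing unbinding: contract the role axis of T with role vector r."""
    return [sum(t * x for t, x in zip(row, r)) for row in T]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def snap(vec, fillers):
    """Return the name of the filler with the highest cosine similarity to vec."""
    return max(fillers, key=lambda name: cosine(vec, fillers[name]))

fillers = {"c": [1.0, 0.0, 0.0], "a": [0.0, 1.0, 0.0], "t": [0.0, 0.0, 1.0]}
# Unit-length but non-orthogonal roles; their inner product is 0.6.
r1, r2 = [1.0, 0.0], [0.6, 0.8]
# T = c (x) r1 + a (x) r2, written out as a 3x2 matrix (filler axis by role axis).
T = [[1.0, 0.0], [0.6, 0.8], [0.0, 0.0]]

noisy_c = unbind(T, r1)          # filler "c" plus 0.6 times filler "a"
print(noisy_c, snap(noisy_c, fillers), snap(unbind(T, r2), fillers))
```

<p>With larger role inventories the accumulated intrusion can grow, which is the situation in which the inverse or pseudoinverse strategies mentioned above become attractive.</p></div>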
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5">Morpheme vector representations from TPRs</head><p>We use TPRs (&#167;7.4) to bridge the gap between the rich hierarchical symbolic information encoded in finite-state morphological transducers (such as <ref type="bibr">Chen and Schwartz, 2018</ref>) and the morpheme vectors needed by the neural models described in &#167;7.2 and &#167;7.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5.1">Morpheme TPRs</head><p>Given a language, a corpus of text in that language, and a finite-state morphological analyzer for that language, we can use the finite-state analyzer to obtain a morphological analysis for each word in the corpus. For each morpheme provided in an analysis, we extract a collection b of linguistically salient feature-value ordered pairs (i, j). Each linguistic feature j serves as a TPR role; each value i serves as a TPR filler. For each such feature j (such as noun case), we define r_j to be a role vector representing that feature; for each value i (such as ABS) associated with feature j, we define f_i to be a filler vector representing that value. This use of TPRs enables us to jointly encode latent structural information provided by a finite-state transducer with surface information in a principled manner. This process is depicted in Figure <ref type="figure">7</ref>.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.5.2">Learning morpheme vectors using an autoencoder</head><p>The morpheme tensors constructed in &#167;7.5.1 are potentially very high dimensional. Depending on how much linguistic information is encoded in each tensor, the morpheme tensors may consist of approximately 10^3 to 10^9 floating point values per tensor. Tensors of this size are far too large to be directly usable as morpheme representations in the neural models described in &#167;7.2 and &#167;7.3. To learn lower dimensional morpheme vectors, we make use of an autoencoder. The autoencoder is trained using the dictionary of previously constructed morpheme tensors. The trained autoencoder can be used to encode a low-dimensional morpheme vector from a high-dimensional morpheme tensor by running the morpheme tensor through the first half of the autoencoder, and can be used to obtain a high-dimensional morpheme tensor from a morpheme vector by running the morpheme vector through the latter half of the autoencoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.6">Unbinding loss</head><p>In order to effectively train the autoencoder in &#167;7.5.2, gold standard morpheme tensors must be compared against predicted morpheme tensors output by the autoencoder. However, the morpheme tensors are very high dimensional. In initial experiments, we used mean squared error as a loss function, but we found that training failed to converge when auto-encoding sparse TPRs.</p><p>To enable effective training of the autoencoder, we therefore define a novel loss function that makes use of the information encoded in the TPR. We define a loss function called unbinding loss that examines the unbinding properties of a predicted morpheme tensor to answer the question, "What filler is closest to the unbinding of each role in the TPR?" For simplicity, we assume the use of self-addressing unbinding in this section (which we also used in the work presented here), but the computations are analogous with other unbinding strategies, relying only on a fixed role and filler vocabulary and a fixed number of bindings. We call the output TPR T.</p><p>Given a predicted tensor, the first step in computing the unbinding loss is to recursively unbind roles until the leaves of the structure are reached; that is, unbind each role until the result of unbinding is a single vector (rather than a higher-dimensional tensor). When this point is reached, we compute the cosine similarity between the result of unbinding and all the fillers in the vocabulary. For example, assume a depth-3 structure is encoded in a TPR, where the fillers are character embeddings, the second level is left-to-right positional roles, and the highest level is morpheme identity. 
If we want to see what is bound to the first position of the English cat morpheme in T, we would first unbind from T as follows (assuming self-addressing unbinding): <formula notation="TeX">\hat{f}_{cat,1} = (T \cdot r_{cat}) \cdot r_1</formula></p><p>We then get the vector of similarities &#349;_{cat,1} between this filler and each of the character embedding vectors in the vocabulary matrix V as follows: <formula notation="TeX">\hat{s}_{cat,1} = \frac{\hat{f}_{cat,1} \cdot V}{\lVert \hat{f}_{cat,1} \rVert \sqrt{V_i V_i}} \quad (7.4)</formula> where \sqrt{V_i V_i} denotes the column-wise vector norm of the vocabulary matrix (using Einstein summation notation).</p><p>This similarity vector can be used to define a probability distribution over possible fillers through the use of a softmax. We take the logarithm of the result of this computation to obtain log-probabilities. We call this distribution P.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p><formula notation="TeX">P = \log \frac{e^{\hat{s}_{cat,1}}}{\sum_j e^{\hat{s}_{cat,1,j}}} \quad (7.5)</formula></p><p>We then treat each filler vocabulary word (in this case, each character) as a class, and compute the negative log-likelihood loss over this probability distribution. The resulting loss for the first character of cat being c is then <formula notation="TeX">\mathrm{loss}(\hat{s}_{cat,1}, c) = -\hat{s}_{cat,1,c} + \log\Big(\sum_j e^{\hat{s}_{cat,1,j}}\Big) \quad (7.6)</formula> In this example, we focus on the loss for a single filler; however, as we consider tree-structured representations, the number of fillers needing to be checked grows exponentially with the depth of our representation. In practice, we were able to overcome this difficulty by parallelizing the independent matrix computations for the loss of all the position roles for a given morpheme, trading space for time. For more complex TPRs, a potential avenue would be to exploit the fact that most roles will be empty (and their unbindings thus a matrix of zeros) by replacing the loss computations for unbound roles with mean squared error (which need only push that part of the representation to 0).</p></div>
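<div xmlns="http://www.tei-c.org/ns/1.0"><p>The loss for a single unbound filler can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions rather than the training code used in this work: the function names, the three one-hot character fillers, and the example unbound vector are hypothetical. The sketch computes cosine similarities against every filler in the vocabulary and then the negative log-likelihood of the target filler under a softmax over those similarities, which matches the form of the per-filler loss described above.</p>

```python
import math

def cosine_similarities(vec, vocab):
    """Cosine similarity between vec and every filler vector in vocab (a list)."""
    norm_v = math.sqrt(sum(x * x for x in vec))
    sims = []
    for f in vocab:
        dot = sum(a * b for a, b in zip(vec, f))
        sims.append(dot / (norm_v * math.sqrt(sum(b * b for b in f))))
    return sims

def unbinding_loss(sims, target):
    """Negative log-likelihood of index target under softmax(sims)."""
    log_z = math.log(sum(math.exp(s) for s in sims))
    return -sims[target] + log_z

# Hypothetical filler vocabulary for the characters c, a, t (indices 0, 1, 2).
vocab = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
unbound = [1.0, 0.6, 0.0]   # an unbound vector: mostly "c", with some intrusion
sims = cosine_similarities(unbound, vocab)
loss_correct = unbinding_loss(sims, 0)   # target is "c"
loss_wrong = unbinding_loss(sims, 2)     # target is "t"
print(loss_correct, loss_wrong)
```

<p>Because cosine similarities are bounded between -1 and 1, the loss in this form cannot reach zero; whether the similarities were rescaled before the softmax in practice is not stated here, so the sketch leaves them unscaled.</p></div>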
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Chapter 8</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Conclusions</head><p>In motivating this JSALT workshop on neural polysynthetic language modelling, we observed the following major assumptions (usually unstated) that are pervasive in most computational linguistics and natural language processing research:</p><p>&#8226; If a technique works well on English, the technique is likely to be "language agnostic" and is likely to work well on a large variety of other languages. Various other high-resource languages such as Spanish, French, German, or Chinese are sometimes used in place of English.</p><p>&#8226; For any given word stem, there will be a relatively small number of morphological variants of that stem.</p><p>&#8226; Most or all of the morphological variants of any given word stem will appear in a sufficiently large corpus to enable learning of robust statistics.</p><p>Our work was built around explicitly challenging all of these assumptions, using a variety of polysynthetic languages and a variety of natural language tasks. The polysynthetic languages that we chose to work with present numerous significant challenges. These languages are typologically very different from English and other widely-used high-resource languages. There is pervasive use of derivational and inflectional morphology. For most word stems, there are very large numbers of potential morphological variants, very few of which occur in any given corpus. For all of the selected languages (with the exception of Inuktitut), the corpus sizes are very small (fewer than 60,000 sentences).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.1">Contribution 1: Resources</head><p>One contributing factor to the dearth of prior computational research on endangered polysynthetic languages is the lack of easily available corpus resources. Nearly all endangered languages are very low resource. Most CL and NLP researchers do not have the personal connections with members of endangered language communities that are often critical for obtaining data for use in research. In preparation for this workshop, our team gathered together text and speech data from various sources for a variety of polysynthetic languages. In cases where we have connections with indigenous community stakeholders and rights-holders, we have begun discussions regarding community desires and possibilities for data distribution. For data that we have obtained permission to distribute, we have initiated a process of public data hosting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.2">Contribution 2: Machine Translation</head><p>The main contributions of our machine translation work during this workshop are as follows. With first access to the beta version 3.0 of the Nunavut Hansard <ref type="bibr">(Joanis et al., 2020)</ref>, we were able to provide feedback and best practices for preprocessing the dataset and shared knowledge about existing character and spelling variations in the dataset. This work contributed to the data release and publication of <ref type="bibr">Joanis et al. (2020)</ref>; that data is now being used in the Fifth Conference on Machine Translation (WMT20) Inuktitut-English news translation shared task.</p><p>Our work at the time constituted state-of-the-art performance on translation between Inuktitut and English. It has since been surpassed by <ref type="bibr">Joanis et al. (2020)</ref>, and we anticipate future improvements through the WMT20 shared task.</p><p>We collected empirical evidence on several well-known but unresolved challenges, such as best practices in token segmentation for MT into and out of polysynthetic languages, as well as an examination of how to evaluate MT into polysynthetic languages. We successfully used multilingual neural machine translation methods to improve translation quality into low-resource languages (St. Lawrence Island Yupik and Central Alaskan Yup'ik) using data from related languages (Inuktitut). Notably, our "low-resource" languages were lower resource than much of the literature, and we produced improvements without the use of large monolingual corpora (which are unavailable for these languages and many other languages of interest). We observed these improvements across both n-gram-oriented and semantic-oriented metrics.</p><p>There remain a number of open challenges in this space. 
We encourage caution in interpreting the automatic quality metrics, as we do not yet have human judgments of translation quality for the languages examined; human judgements from the WMT20 shared task may prove particularly valuable. Our initial results, using fairly conventional methods, for both multilingual and bilingual machine translation show promise, but we expect that there remains much room for improvement.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.3">Contribution 3: Language Models</head><p>To the best of our knowledge, this paper represents the first attempt at modeling polysynthetic languages using a state-of-the-art RNN model and comparing their language modeling difficulty with that of other languages. We conduct language modeling experiments on four low-resource, polysynthetic languages (St. Lawrence Island Yupik, Central Alaskan Yup'ik, Inuktitut, Guaran&#237;) and two high-resource, morphologically poor languages (English, Spanish), using four different segmentation methods: character, BPE, Morfessor and FST. By comparing the perplexity measure at the character level, we show that the FST segmentation method worked the best for polysynthetic languages when it was available. While the Morfessor segmentation method might improve language modeling performance for some polysynthetic languages, all the other segmentation methods we considered (character, BPE and Morfessor) failed to capture the rich morphology of polysynthetic languages better than the FST segmentation that is based on linguistic knowledge of the languages. We also compared the perplexity measure at the word level to illustrate how difficult it is to model polysynthetic languages.</p><p>All in all, this presents an exciting starting point for a line of inquiry into modeling polysynthetic languages and utilizing the linguistic knowledge realized in FSTs when modeling languages that are morphologically rich and low-resource. At the same time, we invite future research into linguistic characteristics that contribute to language modeling difficulty as we continue to investigate the effect of morphological complexity in our ongoing study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.4">Contribution 4: Mobile &amp; Speech Applications</head><p>As smartphones become ubiquitous in native communities, facilitating native-language communication through better technology will become an important aspect of language conservation and revitalization efforts. Building on freely available open source tools, we developed a pipeline for training neural language models that can run on-device, and loading them as a predictive back-end for on-device keyboards. This effort led to working keyboard prototypes for Guaran&#237; (grn) and St. Lawrence Island Yupik (ess), the first ever input methods for these language varieties to include intelligent next-unit prediction and completion. Building the prototypes highlighted the unique requirements posed by polysynthetic languages. Their complex, productive morphology results in very long words, many of which would never appear in the training data available for language modeling, and which would be unwieldy to show to keyboard users as prediction candidates. We dealt with these problems by training character-level models that were aware of morpheme boundaries, and using morphemes rather than words as units of prediction.</p><p>The low-resource nature of most polysynthetic languages is particularly acute for automatic speech recognition. While transfer learning can help alleviate some of the issues with data poverty, neural approaches to ASR are still not sufficient to enable usable systems.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://live.bible.is/bible/ESSWYI</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>bibles.org</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>It should be noted that Legislative Assembly of Nunavut discourse takes place in several Inuktut varieties, as well as English; a more detailed description of the construction and dialect situation of the Hansard will be available in <ref type="bibr">Joanis et al. (2020)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>bible.com</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_4"><p>While this was a pre-release at the time of this workshop, the data has now been made available publicly; see <ref type="bibr">Joanis et al. (2020)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_5"><p>uniconv is distributed with Yudit: www.yudit.org</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_6"><p><ref type="bibr">Joanis et al. (2020)</ref> provides slightly updated scripts; we note that neither those scripts nor the ones described here fully conform to spelling and romanization conventions as described in the Nunavut Utilities (www.gov.nu.ca/culture-and-heritage/information/computer-tools).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_7"><p>This consisted of rephrasings of entire verses, and was not present in all verses.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_8"><p>This keeps hyphenated suffixes attached, but has the downside of non-ideal interactions with subword segmentation, occasionally breaking suffixed biblical names into two parts, with the latter attached to the hyphen and Central Alaskan Yup'ik suffix.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_9"><p>https://uqausiit.ca/node/10333</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_10"><p>https://uqausiit.ca/verb-ending/vita</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_11"><p>https://uqausiit.ca/node/12189</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_12"><p>BLEU scores were computed against untokenized but punctuation-normalized references using SacreBLEU <ref type="bibr">(Post, 2018)</ref> with BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.2 settings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_13"><p>chrF scores were computed against untokenized but punctuation-normalized references using SacreBLEU with chrF2+case.mixed+numchars.6+numrefs.1+space.False+version.1.4.2 settings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="11" xml:id="foot_14"><p><ref type="bibr">Joanis et al. (2020)</ref> finds that using underlying forms, but rejoining them before BPE segmentation, gives a performance improvement over deep forms alone in corpus alignment.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="12" xml:id="foot_15"><p>In this table and table 4.4, BLEU scores were computed against untokenized but punctuation-normalized references using SacreBLEU with BLEU+case.mixed+numrefs.1+smooth.exp+tok.13a+version.1.4.2 settings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="13" xml:id="foot_16"><p>chrF scores were computed against untokenized but punctuation-normalized references using SacreBLEU with chrF2+case.mixed+numchars.6+numrefs.1+space.False+version.1.4.2 settings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="14" xml:id="foot_17"><p>http://iguarani.com/</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="15" xml:id="foot_18"><p>BLEU scores were computed against untokenized but punctuation-normalized references using SacreBLEU with BLEU+case.lc+numrefs.1+smooth.exp+tok.13.a+version.1.3.7 settings.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_19"><p>https://github.com/rsennrich/subword-nmt</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_20"><p>https://github.com/divvun/giellakbd-android</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_21"><p>https://onnx.ai</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_22"><p>https://www.tensorflow.org/tutorials/sequences/recurrent</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_23"><p>Note that morpheme boundaries never appear in the user's input buffer according to this scheme. This is different from a system based entirely on words, as the relevant boundaries, spaces and punctuation symbols, are visible.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_24"><p>Though as noted in <ref type="bibr">Gales et al. (2017)</ref>, the BABEL corpora are small in comparison to other corpora used in end-to-end neural ASR. <ref type="bibr">Hannun et al. (2014)</ref>, for example, used 5,000 hours of data.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_25"><p>These sections are denoted as &lt;no-speech&gt; in the transcription files</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_26"><p>We would note that treating words as basic units can also be considered to be a language-specific trick designed for isolating and fusional languages.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_27"><p>https://www.skyandtelescope.com/astronomy-resources/how-many-stars-are-there</p></note>
		</body>
		</text>
</TEI>
