With the increasing interest in low-resource languages, unsupervised morphological segmentation has become an active area of research, where approaches based on Adaptor Grammars achieve state-of-the-art results. We demonstrate the power of harnessing linguistic knowledge as priors within Adaptor Grammars in a minimally-supervised learning fashion. We introduce two types of priors: 1) grammar definition, where we design language-specific grammars; and 2) linguistprovided affixes, collected by an expert in the language and seeded into the grammars. We use Japanese and Georgian as respective case studies for the two types of priors and introduce new datasets for these languages, with gold morphological segmentation for evaluation. We show that the use of priors results in error reductions of 8.9 % and 34.2 %, respectively, over the equivalent state-of-the-art unsupervised system
more »
« less
Lightweight morpheme labeling in context: Using structured linguistic representations to support linguistic analysis for the language documentation context
Linguistic analysis is a core task in the process of documenting, analyzing, and describing endangered and less-studied languages. In addition to providing insight into the properties of the language being studied, having tools to automatically label words in a language for grammatical category and morphological features can support a range of applications useful for language pedagogy and revitalization. At the same time, most modern NLP methods for these tasks require both large amounts of data in the language and compute costs well beyond the capacity of most research groups and language communities. In this paper, we present a gloss-to-gloss (g2g) model for linguistic analysis (specifically, morphological analysis and part-of-speech tagging) that is lightweight in terms of both data requirements and computational expense. The model is designed for the interlinear glossed text (IGT) format, in which we expect the source text of a sentence in a low-resource language, a translation of that sentence into a language of wider communication, and a detailed glossing of the morphological properties of each word in the sentence. We first produce silver standard parallel glossed data by automatically labeling the high-resource translation. The model then learns to transform source language morphological labels into output labels for the target language, mediated by a structured linguistic representation layer. We test the model on both low-resource and high-resource languages, and find that our simple CNN-based model achieves comparable performance to a state-of-the-art transformer-based model, at a fraction of the computational cost.
more »
« less
- Award ID(s):
- 2149404
- PAR ID:
- 10539619
- Publisher / Repository:
- Association for Computational Linguistics
- Date Published:
- Page Range / eLocation ID:
- 78 to 92
- Format(s):
- Medium: X
- Location:
- Toronto, Canada
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are “language agnostic” – that is that they will also work across the typologically diverse languages of the world. In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically-distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any given root, and that most or all of those variants will appear in a large enough corpus (conditioned on assumptions about domain, etc.) so that the model can adequately learn statistics about each variant. Approaches like stemming, lemmatization, morphological analysis, subword segmentation, or other normalization techniques are frequently used when either of those assumptions are likely to be violated, particularly in the case of synthetic languages like Czech and Russian that have more inflectional morphology than English. Within the NLP literature, agglutinative languages like Finnish and Turkish are commonly held up as extreme examples of morphological complexity that challenge common modelling assumptions. Yet, when considering all of the world’s languages, Finnish and Turkish are closer to the average case in terms of synthesis. When we consider polysynthetic languages (those at the extreme of morphological complexity), even approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena (words appearing only once in a corpus), underscoring the need for appropriate morphological handling of words, without which there is no hope for a model to capture enough statistical information about those words. Moreover, many of these languages have only very small text corpora, substantially magnifying these challenges. To this end, we examine the current state-of-the-art in language modelling, machine translation, and predictive text completion in the context of four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yup’ik, and Inuktitut. We have a particular focus on Inuit-Yupik, a highly challenging family of endangered polysynthetic languages that ranges geographically from Greenland through northern Canada and Alaska to far eastern Russia. The languages in this family are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries. Finally, we propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations (Smolensky, 1990) in order to enable successful neural language models capable of handling the full linguistic variety of typologically variant languages.more » « less
-
null (Ed.)Many techniques in modern computational linguistics and natural language processing (NLP) make the assumption that approaches that work well on English and other widely used European (and sometimes Asian) languages are “language agnostic” – that is that they will also work across the typologically diverse languages of the world. In high-resource languages, especially those that are analytic rather than synthetic, a common approach is to treat morphologically-distinct variants of a common root (such as dog and dogs) as completely independent word types. Doing so relies on two main assumptions: that there exist a limited number of morphological inflections for any given root, and that most or all of those variants will appear in a large enough corpus (conditioned on assumptions about domain, etc.) so that the model can adequately learn statistics about each variant. Approaches like stemming, lemmatization, morphological analysis, subword segmentation, or other normalization techniques are frequently used when either of those assumptions are likely to be violated, particularly in the case of synthetic languages like Czech and Russian that have more inflectional morphology than English. Within the NLP literature, agglutinative languages like Finnish and Turkish are commonly held up as extreme examples of morphological complexity that challenge common modelling assumptions. Yet, when considering all of the world’s languages, Finnish and Turkish are closer to the average case in terms of synthesis. When we consider polysynthetic languages (those at the extreme of morphological complexity), even approaches like stemming, lemmatization, or subword modelling may not suffice. These languages have very high numbers of hapax legomena (words appearing only once in a corpus), underscoring the need for appropriate morphological handling of words, without which there is no hope for a model to capture enough statistical information about those words. Moreover, many of these languages have only very small text corpora, substantially magnifying these challenges. To this end, we examine the current state-of-the-art in language modelling, machine translation, and predictive text completion in the context of four polysynthetic languages: Guaraní, St. Lawrence Island Yupik, Central Alaskan Yup’ik, and Inuktitut. We have a particular focus on Inuit-Yupik, a highly challenging family of endangered polysynthetic languages that ranges geographically from Greenland through northern Canada and Alaska to far eastern Russia. The languages in this family are extraordinarily challenging from a computational perspective, with pervasive use of derivational morphemes in addition to rich sets of inflectional suffixes and phonological challenges at morpheme boundaries. Finally, we propose a novel framework for language modelling that combines knowledge representations from finite-state morphological analyzers with Tensor Product Representations (Smolensky, 1990) in order to enable successful neural language models capable of handling the full linguistic variety of typologically variant languages.more » « less
-
Arabnia, Hamid; Deligiannidis, Leonidas; Tinetti, Fernando; Tran, Quoc-Nam (Ed.)Millions of people with hearing disabilities use sign language for communication, creating a communication gap with those who are not fluent in ASL (American Sign Language). This paper aims to introduce an ASL interpreter system using a smart-glasses-based augmented reality system. We begin by introducing and comparing two models that translate spoken language into ASL poses. The first system translates spoken text to ASL Gloss, an intermediate representation, before generating ASL poses. The second system directly translates the text to ASL poses. Our analysis shows that using ASL Gloss as an intermediate step significantly improves the translation speed. We then explore a system of encoding ASL pose videos for display on smart glasses. The chosen translation method has a BLEU score of 66.5801 and a rate of 1.825 milliseconds per gloss translation. Our algorithm for mapping gloss text to ASL videos obtained a mean squared error of 0.05, indicating that our system has good translational accuracy and a low mapping error.more » « less
-
This paper presents a “road map” for the annotation of semantic categories in typologically diverse languages, with potentially few linguistic resources, and often no existing computational resources. Past semantic annotation efforts have focused largely on high-resource languages, or relatively low-resource languages with a large number of native speakers. However, there are certain typological traits, namely the synthesis of multiple concepts into a single word, that are more common in languages with a smaller speech community. For example, what is expressed as a sentence in a more analytic language like English, may be expressed as a single word in a more synthetic language like Arapaho. This paper proposes solutions for annotating analytic and synthetic languages in a comparable way based on existing typological research, and introduces a road map for the annotation of languages with a dearth of resources.more » « less
An official website of the United States government

