Title: A Digital Corpus of St. Lawrence Island Yupik
St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.
Award ID(s):
1761680
NSF-PAR ID:
10285560
Journal Name:
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-4)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. St. Lawrence Island Yupik is an endangered language of the Bering Strait region. In this paper, we describe our work on Yupik jointly leveraging computational morphology and linguistic fieldwork, outlining the multilayer virtuous cycle that we continue to refine in our work to document and build tools for the language. After developing a preliminary morphological analyzer from an existing pedagogical grammar of Yupik, we used it to help analyze new word forms gathered through fieldwork. While in the field, we augmented the analyzer to include insights into the lexicon, phonology, and morphology of the language as they were gained during elicitation sessions and subsequent data analysis. The analyzer and other tools we have developed are improved by a corpus that continues to grow through our digitization and documentation efforts, and the computational tools in turn allow us to improve and speed those same efforts. Through this process, we have successfully identified previously undescribed lexical, morphological, and phonological processes in Yupik while simultaneously increasing the coverage of the morphological analyzer. Given the polysynthetic nature of Yupik, a high-coverage morphological analyzer is a necessary prerequisite for the development of other high-level computational tools that have been requested by the Yupik community.
  3. St. Lawrence Island Yupik, an endangered language of the Bering Strait region spoken by fewer than one thousand people in western Alaska and far eastern Russia, is currently in a state of generational transition. We survey the existing body of Yupik literature and pedagogical resources developed during the twentieth century, examine the context and use of Yupik in the current educational setting, and describe current challenges for teaching the language in the schools. We then outline our integrated approach to language documentation currently being applied to Yupik, and address how existing resources can be integrated into research and development processes in a way that both supports research efforts and results in tangible modern educational tools for the Yupik community on St. Lawrence Island, and eventually in Russia. This approach is intentionally designed to closely integrate research processes from language documentation and computational linguistics such that the results of each research endeavour positively support the other, and such that both disciplines concretely support community-based efforts to revitalize and teach the language.
  4. St. Lawrence Island Yupik is an endangered polysynthetic language of the Bering Strait region. While conducting linguistic fieldwork between 2016 and 2019, we observed substantial support within the Yupik community for language revitalization and for resource development to support Yupik education. To that end, Chen & Schwartz (2018) implemented a finite-state morphological analyzer as a critical enabling technology for use in Yupik language education and technology. Chen & Schwartz (2018) reported a morphological analysis coverage rate of approximately 75% on a dataset of 60K Yupik tokens, leaving considerable room for improvement. In this work, we present a re-implementation of the Chen & Schwartz (2018) finite-state morphological analyzer for St. Lawrence Island Yupik that incorporates new linguistic insights; in particular, in this implementation we make use of the Paradigm Function Morphology (PFM) theory of morphology. We evaluate this new PFM-based morphological analyzer, and demonstrate that it consistently outperforms the existing analyzer of Chen & Schwartz (2018) with respect to accuracy and coverage rate across multiple datasets. 
  5. This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.