Title: A Digital Corpus of St. Lawrence Island Yupik
St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.
Award ID(s):
1761680 2243445
PAR ID:
10285560
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of the 4th Workshop on the Use of Computational Methods in the Study of Endangered Languages (ComputEL-4)
Sponsoring Org:
National Science Foundation
More Like this
  1. Akuzipik (Yupigestun/Yupik/St. Lawrence Island Yupik/Siberian Yupik/Chaplinski Yupik) is an endangered language belonging to the Yupik branch of the Inuit-Yupik-Unangan language family. It is currently spoken by 800-900 people in the Bering Strait region, mainly on St. Lawrence Island, Alaska (St. Lawrence Island Yupik), and on the coast of the Chukotka Peninsula, in Russia (Chaplinski Yupik) (de Reuse 1994; Schwartz et al. 2019). The linguistic differences between these two varieties seem to be minor and do not appear to affect mutual intelligibility (Krauss 1975). The language has been undergoing a rapid generational shift, beginning in the 1950s in Russia and in the 1990s in Alaska (Schwartz et al. 2019).
  2. St. Lawrence Island Yupik is an endangered polysynthetic language of the Bering Strait region. While conducting linguistic fieldwork between 2016 and 2019, we observed substantial support within the Yupik community for language revitalization and for resource development to support Yupik education. To that end, Chen & Schwartz (2018) implemented a finite-state morphological analyzer as a critical enabling technology for use in Yupik language education and technology. Chen & Schwartz (2018) reported a morphological analysis coverage rate of approximately 75% on a dataset of 60K Yupik tokens, leaving considerable room for improvement. In this work, we present a re-implementation of the Chen & Schwartz (2018) finite-state morphological analyzer for St. Lawrence Island Yupik that incorporates new linguistic insights; in particular, in this implementation we make use of the Paradigm Function Morphology (PFM) theory of morphology. We evaluate this new PFM-based morphological analyzer, and demonstrate that it consistently outperforms the existing analyzer of Chen & Schwartz (2018) with respect to accuracy and coverage rate across multiple datasets.
  3. St. Lawrence Island Yupik is a polysynthetic language indigenous to St. Lawrence Island, Alaska, and the Chukotka Peninsula of Russia. While the vast majority of St. Lawrence Islanders over the age of 40 are fluent L1 Yupik speakers, rapid language shift is underway among younger generations; language shift in Chukotka is even further advanced. This work presents a holistic proposal for language revitalization that takes into account numerous serious challenges, including the remote location of St. Lawrence Island and Chukotka, the high turnover rate among local teachers, socioeconomic challenges, and the lack of existing language learning materials.
  4. St. Lawrence Island Yupik, an endangered language of the Bering Strait region spoken by fewer than one thousand people in western Alaska and far eastern Russia, is currently in a state of generational transition. We survey the existing body of Yupik literature and pedagogical resources developed during the twentieth century, examine the context and use of Yupik in the current educational setting, and describe current challenges for teaching the language in the schools. We then outline our integrated approach to language documentation currently being applied to Yupik, and address how existing resources can be integrated into research and development processes in a way that both supports research efforts and results in tangible modern educational tools for the Yupik community on St. Lawrence Island, and eventually in Russia. This approach is intentionally designed to closely integrate research processes from language documentation and computational linguistics such that the results of each research endeavour positively support the other, and such that both disciplines concretely support community-based efforts to revitalize and teach the language.