NSF PAR Search | NSF Public Access Repository

Can we teach language models to gloss endangered languages?

https://doi.org/10.18653/v1/2024.findings-emnlp.337

Ginn, Michael; Hulden, Mans; Palmer, Alexis (November 2024, Association for Computational Linguistics)

Interlinear glossed text (IGT) is a popular format in language documentation projects, where each morpheme is labeled with a descriptive annotation. Automating the creation of interlinear glossed text would be desirable to reduce annotator effort and maintain consistency across annotated corpora. Prior research has explored a number of statistical and neural methods for automatically producing IGT. As large language models (LLMs) have shown promising results across multilingual tasks, even for rare, endangered languages, it is natural to wonder whether they can be utilized for the task of generating IGT. We explore whether LLMs can be effective at the task of interlinear glossing with in-context learning, without any traditional training. We propose new approaches for selecting examples to provide in-context, observing that targeted selection can significantly improve performance. We find that LLM-based methods beat standard transformer baselines, despite requiring no training at all. These approaches still underperform state-of-the-art supervised systems for the task, but are highly practical for researchers outside of the NLP community, requiring minimal effort to use.
Full Text Available
Decomposing Fusional Morphemes with Vector Embeddings

https://doi.org/10.18653/v1/2024.sigmorphon-1.7

Ginn, Michael; Palmer, Alexis (June 2024, Association for Computational Linguistics)

Distributional approaches have proven effective in modeling semantics and phonology through vector embeddings. We explore whether distributional representations can also effectively model morphological information. We train static vector embeddings over morphological sequences. Then, we explore morpheme categories for fusional morphemes, which encode multiple linguistic dimensions, and often have close relationships to other morphemes. We study whether the learned vector embeddings align with these linguistic dimensions, finding strong evidence that this is the case. Our work uses two low-resource languages, Uspanteko and Tsez, demonstrating that distributional morphological representations are effective even with limited data.
Full Text Available
BELT: Building Endangered Language Technology

Ginn, Michael; Saavedra-Beltran, David; Robayo, Camilo; Palmer, Alexis (August 2024, Association for Computational Linguistics)

The development of language technology (LT) for an endangered language is often identified as a goal in language revitalization efforts, but developing such technologies is typically subject to additional methodological challenges as well as social and ethical concerns. In particular, LT development has too often taken on colonialist qualities, extracting language data, relying on outside experts, and denying the speakers of a language sovereignty over the technologies produced. We seek to avoid such an approach through the development of the Building Endangered Language Technology (BELT) website, an educational resource designed for speakers and community members with limited technological experience to develop LTs for their own language. Specifically, BELT provides interactive lessons on basic Python programming, coupled with projects to develop specific language technologies, such as spellcheckers or word games. In this paper, we describe BELT’s design, the motivation underlying many key decisions, and preliminary responses from learners.
Full Text Available
Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context

https://doi.org/10.18653/v1/2023.genbench-1.7

Ginn, Michael; Palmer, Alexis (December 2023, Association for Computational Linguistics)

Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
Full Text Available
On the Robustness of Neural Models for Full Sentence Transformation

https://doi.org/10.18653/v1/2024.americasnlp-1.19

Ginn, Michael; Marashian, Ali; Shandilya, Bhargav; Post, Claire; Rice, Enora; Vásquez, Juan; Mcgregor, Marie; Buchholz, Matthew; Hulden, Mans; Palmer, Alexis (June 2024, Association for Computational Linguistics)

This paper describes the LECS Lab submission to the AmericasNLP 2024 Shared Task on the Creation of Educational Materials for Indigenous Languages. The task requires transforming a base sentence with regard to one or more linguistic properties (such as negation or tense). We observe that this task shares many similarities with the well-studied task of word-level morphological inflection, and we explore whether the findings from inflection research are applicable to this task. In particular, we experiment with a number of augmentation strategies, finding that they can significantly benefit performance, but that not all augmented data is necessarily beneficial. Furthermore, we find that our character-level neural models show high variability in performance on unseen data, and may not be the best choice when training data is limited.
Full Text Available
GlossLM: A Massively Multilingual Corpus and Pretrained Model for Interlinear Glossed Text

https://doi.org/10.18653/v1/2024.emnlp-main.683

Ginn, Michael; Tjuatja, Lindia; He, Taiqi; Rice, Enora; Neubig, Graham; Palmer, Alexis; Levin, Lori (January 2024, Association for Computational Linguistics)

Full Text Available
Findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing

https://doi.org/10.18653/v1/2023.sigmorphon-1.20

Ginn, Michael; Moeller, Sarah; Palmer, Alexis; Stacey, Anna; Nicolai, Garrett; Hulden, Mans; Silfverberg, Miikka (July 2023, Association for Computational Linguistics)

This paper presents the findings of the SIGMORPHON 2023 Shared Task on Interlinear Glossing. This first iteration of the shared task explores glossing of a set of six typologically diverse languages: Arapaho, Gitksan, Lezgi, Natügu, Tsez and Uspanteko. The shared task encompasses two tracks: a resource-scarce closed track and an open track, where participants are allowed to utilize external data resources. Five teams participated in the shared task. The winning team, Tü-CL, achieved a 23.99 percentage-point improvement over a baseline RoBERTa system in the closed track and a 17.42 percentage-point improvement in the open track.
Full Text Available