TAMS: Translation-Assisted Morphological Segmentation

Rice, Enora; Marashian, Ali; Gessler, Luke; Palmer, Alexis; von der Wense, Katharina

Citation Details

Canonical morphological segmentation is the process of analyzing words into the standard (aka underlying) forms of their constituent morphemes. This is a core task in endangered language documentation, and NLP systems have the potential to dramatically speed up this process. In typical language documentation settings, training data for canonical morpheme segmentation is scarce, making it difficult to train high-quality models. However, translation data is often much more abundant, and, in this work, we present a method that attempts to leverage translation data in the canonical segmentation task. We propose a character-level sequence-to-sequence model that incorporates representations of translations obtained from pretrained high-resource monolingual language models as an additional signal. Our model outperforms the baseline in a super-low-resource setting but yields mixed results on training splits with more data. Additionally, we find that we can achieve strong performance even without difficult-to-obtain word-level alignments. While further work is needed to make translations useful in higher-resource settings, our model shows promise in severely resource-constrained settings.

Award ID(s): 2149404

PAR ID: 10539618

Author(s) / Creator(s): Rice, Enora; Marashian, Ali; Gessler, Luke; Palmer, Alexis; von der Wense, Katharina

Publisher / Repository: Association for Computational Linguistics

Date Published: 2024-08-01

Format(s): Medium: X

Location: Bangkok, Thailand

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
