How to encode arbitrarily complex morphology in word embeddings, no corpus needed

Schwartz, Lane; Haley, Coleman; Tyers, Francis

Citation Details

In this paper, we present a straightforward technique for constructing interpretable word embeddings from morphologically analyzed examples (such as interlinear glosses) for all of the world’s languages. Currently, fewer than 300-400 languages out of approximately 7000 have more than a trivial amount of digitized text; of those, between 100 and 200 languages (most in the Indo-European language family) have enough text data for BERT embeddings of reasonable quality to be trained. The word embeddings in this paper are explicitly designed to be both linguistically interpretable and fully capable of handling the broad variety found in the world’s diverse set of 7000 languages, regardless of corpus size or morphological characteristics. We demonstrate the applicability of our representation through examples drawn from a typologically diverse set of languages whose morphology includes prefixes, suffixes, infixes, circumfixes, templatic morphemes, derivational morphemes, inflectional morphemes, and reduplication.

Award ID(s): 2243445; 1761680

PAR ID: 10505683

Author(s) / Creator(s): Schwartz, Lane; Haley, Coleman; Tyers, Francis

Editor(s): Serikov, Oleg; Voloshina, Ekaterina; Postnikova, Anna; Klyachko, Elena; Neminova, Ekaterina; Vylomova, Ekaterina; Shavrina, Tatiana; Le Ferrand, Eric; Malykh, Valentin; Tyers, Francis; Arkhangelskiy, Timofey; Mikhailov, Vladislav; Fenogenova, Alena

Publisher / Repository: International Conference on Computational Linguistics

Date Published: 2022-10-16

Journal Name: Proceedings of the First Workshop on NLP Applications to Field Linguistics

Page Range / eLocation ID: 64-76

Format(s): Medium: X Other: pdf

Location: Gyeongju, South Korea

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
