Morphology Matters: A Multilingual Language Modeling Analysis

Park, Hyunji Hayley; Zhang, Katherine J.; Haley, Coleman; Steimel, Kenneth; Liu, Han; Schwartz, Lane

doi:10.1162/tacl_a_00365

Abstract: Prior studies in multilingual language modeling (e.g., Cotterell et al., 2018; Mielke et al., 2019) disagree on whether or not inflectional morphology makes languages harder to model. We attempt to resolve the disagreement and extend those studies. We compile a larger corpus of 145 Bible translations in 92 languages and a larger number of typological features. We fill in missing typological data for several languages and consider corpus-based measures of morphological complexity in addition to expert-produced typological features. We find that several morphological measures are significantly associated with higher surprisal when LSTM models are trained with BPE-segmented data. We also investigate linguistically motivated subword segmentation strategies like Morfessor and Finite-State Transducers (FSTs) and find that these segmentation strategies yield better performance and reduce the impact of a language’s morphology on language modeling.

Award ID(s): 1761680; 2243445

PAR ID: 10347126

Author(s) / Creator(s): Park, Hyunji Hayley; Zhang, Katherine J.; Haley, Coleman; Steimel, Kenneth; Liu, Han; Schwartz, Lane

Date Published: 2021-03-17

Journal Name: Transactions of the Association for Computational Linguistics

Volume: 9

ISSN: 2307-387X

Page Range / eLocation ID: 261 to 276

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article:
https://doi.org/10.1162/tacl_a_00365
