skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Explaining patterns of fusion in morphological paradigms using the memory-surprisal tradeoff
Languages often express grammatical information through inflectional morphology, in which grammatical features are grouped into strings of morphemes. In this work, we propose that cross-linguistic generalizations about morphological fusion, in which multiple features are expressed through one morpheme, can be explained in part by optimization of processing efficiency, as formalized using the memory--surprisal tradeoff of Hahn et al. (2021). We show in a toy setting that fusion of highly informative neighboring morphemes can lead to greater processing efficiency under our processing model. Next, based on paradigm and frequency data from four languages, we consider both total fusion and gradable fusion using empirical measures developed by Rathi et al. (2021), and find that the degree of fusion is predicted by closeness of optimal morpheme ordering as determined by optimization of processing efficiency. Finally, we show that optimization of processing efficiency can successfully predict typological patterns involving suppletion.  more » « less
Award ID(s):
1947307
PAR ID:
10336254
Author(s) / Creator(s):
Editor(s):
J. Culbertson, A. Perfors
Date Published:
Journal Name:
Proceedings of the 44th Annual Conference of the Cognitive Science Society
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gong, Y.; Kpogo, F. (Ed.)
    In acquiring morphology, the language learner faces the challenge of identifying both the form of morphemes and their location within words. For example, individuals acquiring Chamorro (Austronesian) must learn an agreement morpheme with the form -um- that is infixed before the first vowel of the stem (1a). This challenge is more difficult when a morpheme has multiple forms and/or locations: in some varieties of Chamorro, the same agreement morpheme appears as mu- prefixed on verbs beginning with a nasal/liquid consonant (1b). The learner could potentially overcome the acquisition challenge by employing strong inductive biases. This hypothesis is consistent with the typological finding that, across languages, morphemes occupy a restricted set of prosodically-defined locations (Yu, 2007) and there are strong correlations between morpheme form and position (Anderson, 1972). We conducted a series of artificial morphology experiments, modeled after the Chamorro pattern, that provide converging evidence for such inductive biases (Pierrehumbert & Nair, 1995; Staroverov & Finley, 2021). 
    more » « less
  2. Distributional approaches have proven effective in modeling semantics and phonology through vector embeddings. We explore whether distributional representations can also effectively model morphological information. We train static vector embeddings over morphological sequences. Then, we explore morpheme categories for fusional morphemes, which encode multiple linguistic dimensions, and often have close relationships to other morphemes. We study whether the learned vector embeddings align with these linguistic dimensions, finding strong evidence that this is the case. Our work uses two low-resource languages, Uspanteko and Tsez, demonstrating that distributional morphological representations are effective even with limited data. 
    more » « less
  3. Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to not only poor embeddings for infrequent words in long-tailed text corpora but also weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest number of morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of of embedding evaluations and a downstream language modeling task. 
    more » « less
  4. Linguistic typology generally divides synthetic languages into groups based on their morphological fusion. However, this measure has long been thought to be best considered a matter of degree. We present an information-theoretic measure, called informational fusion, to quantify the degree of fusion of a given set of morphological features in a surface form, which naturally provides such a graded scale. Informational fusion is able to encapsulate not only concatenative, but also nonconcatenative morphological systems (e.g. Arabic), abstracting away from any notions of morpheme segmentation. We then show, on a sample of twenty-one languages, that our measure recapitulates the usual linguistic classifications for concatenative systems, and provides new measures for nonconcatenative ones. We also evaluate the long-standing hypotheses that more frequent forms are more fusional, and that paradigm size anticorrelates with degree of fusion. We do not find evidence for the idea that languages have characteristic levels of fusion; rather, the degree of fusion varies across part-of-speech within languages. 
    more » « less
  5. Sentiment Analysis is a popular text classification task in natural language processing. It involves developing algorithms or machine learning models to determine the sentiment or opinion expressed in a piece of text. The results of this task can be used by business owners and product developers to understand their consumers’ perceptions of their products. Asides from customer feedback and product/service analysis, this task can be useful for social media monitoring (Martin et al., 2021). One of the popular applications of sentiment analysis is for classifying and detecting the positive and negative sentiments on movie reviews. Movie reviews enable movie producers to monitor the performances of their movies (Abhishek et al., 2020) and enhance the decision of movie viewers to know whether a movie is good enough and worth investing time to watch (Lakshmi Devi et al., 2020). However, the task has been under-explored for African languages compared to their western counterparts, ”high resource languages”, that are privileged to have received enormous attention due to the large amount of available textual data. African languages fall under the category of the low resource languages which are on the disadvantaged end because of the limited availability of data that gives them a poor representation (Nasim & Ghani, 2020). Recently, sentiment analysis has received attention on African languages in the Twitter domain for Nigerian (Muhammad et al., 2022) and Amharic (Yimam et al., 2020) languages. However, there is no available corpus in the movie domain. We decided to tackle the problem of unavailability of Yoru`ba´ data for movie sentiment analysis by creating the first Yoru`ba´ sentiment corpus for Nollywood movie reviews. Also, we develop sentiment classification models using state-of-the-art pre-trained language models like mBERT (Devlin et al., 2019) and AfriBERTa (Ogueji et al., 2021). 
    more » « less