skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: An Information-Theoretic Characterization of Morphological Fusion
Linguistic typology generally divides synthetic languages into groups based on their morphological fusion. However, this measure has long been thought to be best considered a matter of degree. We present an information-theoretic measure, called informational fusion, to quantify the degree of fusion of a given set of morphological features in a surface form, which naturally provides such a graded scale. Informational fusion is able to encapsulate not only concatenative, but also nonconcatenative morphological systems (e.g. Arabic), abstracting away from any notions of morpheme segmentation. We then show, on a sample of twenty-one languages, that our measure recapitulates the usual linguistic classifications for concatenative systems, and provides new measures for nonconcatenative ones. We also evaluate the long-standing hypotheses that more frequent forms are more fusional, and that paradigm size anticorrelates with degree of fusion. We do not find evidence for the idea that languages have characteristic levels of fusion; rather, the degree of fusion varies across part-of-speech within languages.  more » « less
Award ID(s):
1947307
PAR ID:
10336250
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
An Information-Theoretic Characterization of Morphological Fusion
Page Range / eLocation ID:
10115 to 10120
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. J. Culbertson, A. Perfors (Ed.)
    Languages often express grammatical information through inflectional morphology, in which grammatical features are grouped into strings of morphemes. In this work, we propose that cross-linguistic generalizations about morphological fusion, in which multiple features are expressed through one morpheme, can be explained in part by optimization of processing efficiency, as formalized using the memory--surprisal tradeoff of Hahn et al. (2021). We show in a toy setting that fusion of highly informative neighboring morphemes can lead to greater processing efficiency under our processing model. Next, based on paradigm and frequency data from four languages, we consider both total fusion and gradable fusion using empirical measures developed by Rathi et al. (2021), and find that the degree of fusion is predicted by closeness of optimal morpheme ordering as determined by optimization of processing efficiency. Finally, we show that optimization of processing efficiency can successfully predict typological patterns involving suppletion. 
    more » « less
  2. This paper discusses the need for including morphological features in Japanese Universal Dependencies (UD). In the current version (v2.11) of the Japanese UD treebanks, sentences are tokenized at the morpheme level, and almost no morphological feature annotation is used. However, Japanese is not an isolating language that lacks morphological inflection but is an agglutinative language. Given this situation, we introduce a tentative scheme for retokenization and morphological feature annotation for Japanese UD. Then, we measure and compare the morphological complexity of Japanese with other languages to demonstrate that the proposed tokenizations show similarities to synthetic languages reflecting the linguistic typology. 
    more » « less
  3. We quantify the linguistic complexity of different languages’ morphological systems. We verify that there is a statistically significant empirical trade-off between paradigm size and irregularity: A language’s inflectional paradigms may be either large in size or highly irregular, but never both. We define a new measure of paradigm irregularity based on the conditional entropy of the surface realization of a paradigm— how hard it is to jointly predict all the word forms in a paradigm from the lemma. We estimate irregularity by training a predictive model. Our measurements are taken on large morphological paradigms from 36 typologically diverse languages. 
    more » « less
  4. null (Ed.)
    Abstract Two differences between signed and spoken languages that have been widely discussed in the literature are: the degree to which morphology is expressed simultaneously (rather than sequentially), and the degree to which iconicity is used, particularly in predicates of motion and location, often referred to as classifier predicates. In this paper we analyze a set of properties marking agency and number in four sign languages for their crosslinguistic similarities and differences regarding simultaneity and iconicity. Data from American Sign Language (ASL), Italian Sign Language (LIS), British Sign Language (BSL), and Hong Kong Sign Language (HKSL) are analyzed. We find that iconic, cognitive, phonological, and morphological factors contribute to the distribution of these properties. We conduct two analyses—one of verbs and one of verb phrases. The analysis of classifier verbs shows that, as expected, all four languages exhibit many common formal and iconic properties in the expression of agency and number. The analysis of classifier verb phrases (VPs)—particularly, multiple-verb predicates—reveals (a) that it is grammatical in all four languages to express agency and number within a single verb, but also (b) that there is crosslinguistic variation in expressing agency and number across the four languages. We argue that this variation is motivated by how each language prioritizes, or ranks, several constraints. The rankings can be captured in Optimality Theory. Some constraints in this account, such as a constraint to be redundant, are found in all information systems and might be considered non-linguistic; however, the variation in constraint ranking in verb phrases reveals the grammatical and arbitrary nature of linguistic systems. 
    more » « less
  5. We evaluate corpus-based measures of linguistic complexity obtained using Universal Dependencies (UD) treebanks. We propose a method of estimating robustness of the complexity values obtained using a given measure and a given treebank. The results indicate that measures of syntactic complexity might be on average less robust than those of morphological complexity. We also estimate the validity of complexity measures by comparing the results for very similar languages and checking for unexpected differences. We show that some of those differences that arise can be diminished by using parallel treebanks and, more importantly from the practical point of view, by harmonizing the language-specific solutions in the UD annotation. 
    more » « less