Title: Strident harmony from the perspective of an inductive learner
In languages with strident harmony, stridents within a particular domain are required to have the same minor place of articulation. Harmony is often required only of stridents within a root or stem morpheme, and doesn't trigger alternations. Harmony is also often quite local, applying exclusively or more strongly between stridents in the same or adjacent syllables. Finally, harmony may be morpheme-specific, triggering alternations in some affixes but not others. All of these specifics of a given harmony pattern give rise to exceptions to harmony at the level of the word, and may require a morphologically parsed learning corpus in order to be acquired. This paper explores the learnability of strident harmony in text corpora from three languages: Nkore-Kiga (Bantu), Papantla Totonac (Totonacan) and Navajo (Athapaskan). The analyses show that word-level exceptions largely obscure the harmony pattern as an overall phonotactic in a language. The three languages also serve as a test of the Projection Induction Learner (Gouskova and Gallagher 2020), which is found to be successful when the generalizations in the data are strong but may fail in the face of patterned exceptions.
Award ID(s):
1724753
PAR ID:
10226343
Author(s) / Creator(s):
Date Published:
Journal Name:
Phonological Data and Analysis
Volume:
2
Issue:
8
ISSN:
2642-1828
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Phonological alternations are often specific to morphosyntactic context. For example, stress shift in English occurs in the presence of some suffixes, -al, but not others, -ing: "Equation missing", "Equation missing", "Equation missing". In some cases a phonological process applies only in words of certain lexical categories. Previous theories have stipulated that such morphosyntactically conditioned phonology is word-bounded. In this paper we present a number of long-distance morphologically conditioned phonological effects, cases where phonological processes within one word are conditioned by another word or the presence of a morpheme in another word. We provide a model, Cophonologies by Phase, which extends Cophonology Theory, intended to capture word-internal and lexically specified phonological alternations, to cyclically generated syntactic constituents. We show that Cophonologies by Phase makes better predictions about the long-distance morphologically conditioned phonological effects we find across languages than previous frameworks. Furthermore, Cophonologies by Phase derives such effects without requiring the phonological component to directly reference syntactic features or structure.
  2. This paper describes the development of the first Universal Dependencies (UD) treebank for St. Lawrence Island Yupik, an endangered language spoken in the Bering Strait region. While the UD guidelines provided a general framework for our annotations, language-specific decisions were made necessary by the rich morphology of the polysynthetic language. Most notably, we annotated a corpus at the morpheme level as well as the word level. The morpheme level annotation was conducted using an existing morphological analyzer and manual disambiguation. By comparing the two resulting annotation schemes, we argue that morpheme-level annotation is essential for polysynthetic languages like St. Lawrence Island Yupik. Word-level annotation results in degenerate trees for some Yupik sentences and often fails to capture syntactic relations that can be manifested at the morpheme level. Dependency parsing experiments provide further support for morpheme-level annotation. Implications for UD annotation of other polysynthetic languages are discussed.
  3. Unsupervised cross-lingual projection for part-of-speech (POS) tagging relies on the use of parallel data to project POS tags from a source language for which a POS tagger is available onto a target language across word-level alignments. The projected tags then form the basis for learning a POS model for the target language. However, languages with rich morphology often yield sparse word alignments because words corresponding to the same citation form do not align well. We hypothesize that for morphologically complex languages, it is more efficient to use the stem rather than the word as the core unit of abstraction. Our contributions are: 1) we propose an unsupervised stem-based cross-lingual approach for POS tagging for low-resource languages of rich morphology; 2) we further investigate morpheme-level alignment and projection; and 3) we examine whether the use of linguistic priors for morphological segmentation improves POS tagging. We conduct experiments using six source languages and eight morphologically complex target languages of diverse typologies. Our results show that the stem-based approach improves the POS models for all the target languages, with an average relative error reduction of 10.3% in accuracy per target language, and outperforms the word-based approach that operates on three times more data for about two-thirds of the language pairs we consider. Moreover, we show that morpheme-level alignment and projection and the use of linguistic priors for morphological segmentation further improve POS tagging.
  4. Traditionally, many text-mining tasks treat individual word-tokens as the finest meaningful semantic granularity. However, in many languages and specialized corpora, words are composed by concatenating semantically meaningful subword structures. Word-level analysis cannot leverage the semantic information present in such subword structures. With regard to word embedding techniques, this leads to not only poor embeddings for infrequent words in long-tailed text corpora but also weak capabilities for handling out-of-vocabulary words. In this paper we propose MorphMine for unsupervised morpheme segmentation. MorphMine applies a parsimony criterion to hierarchically segment words into the fewest number of morphemes at each level of the hierarchy. This leads to longer shared morphemes at each level of segmentation. Experiments show that MorphMine segments words in a variety of languages into human-verified morphemes. Additionally, we experimentally demonstrate that utilizing MorphMine morphemes to enrich word embeddings consistently improves embedding quality on a variety of embedding evaluations and a downstream language modeling task.
  5. Gong, Y.; Kpogo, F. (Eds.)
    In acquiring morphology, the language learner faces the challenge of identifying both the form of morphemes and their location within words. For example, individuals acquiring Chamorro (Austronesian) must learn an agreement morpheme with the form -um- that is infixed before the first vowel of the stem (1a). This challenge is more difficult when a morpheme has multiple forms and/or locations: in some varieties of Chamorro, the same agreement morpheme appears as mu- prefixed on verbs beginning with a nasal/liquid consonant (1b). The learner could potentially overcome the acquisition challenge by employing strong inductive biases. This hypothesis is consistent with the typological finding that, across languages, morphemes occupy a restricted set of prosodically-defined locations (Yu, 2007) and there are strong correlations between morpheme form and position (Anderson, 1972). We conducted a series of artificial morphology experiments, modeled after the Chamorro pattern, that provide converging evidence for such inductive biases (Pierrehumbert & Nair, 1995; Staroverov & Finley, 2021). 