Title: Using Machine Learning Systems to Investigate Phonological Representations
Language learning is a complex issue of interest to linguists, computer scientists, and psychologists alike. While the different fields approach these questions at different levels of granularity, findings in one field profoundly affect how the others proceed. My dissertation examines perceptual and linguistic generalizations about the units that make up words (phonemes, morphemes, and vocal quality) in Polish and English to better understand how both humans and computers formulate these concepts in language. I use computational modeling and machine learning to investigate Polish morphophonology in two ways. First, I examine consonant clusters at the beginning of Polish words to see what parameters determine human-like learnability, compared against a survey of native speakers. I run several studies comparing learning from gradient versus categorical data, each at the cluster, bigram, and featural level. Second, I examine Polish yer alternation and study whether machine learning approaches can generalize morphophonological information to target this pattern when given a larger Polish lexicon. Using low-level neural networks and a classification-and-regression-tree (CART) decision algorithm, I examine how well these models use morphological and phonological information to make generalizations that capture a small subset of the Polish vocabulary. Additionally, I conduct a psycholinguistic experiment with English speakers to further establish what level of attention listeners may give when building phonological representations. I test this by extending a previous study finding that real-word primes make rejection of nonword primes more difficult, determining that the effect generalizes across speakers. This research addresses a tension in modeling the computational problem of language learning: between the formalization of representation and the mechanics of the learning apparatus. Different levels of abstraction can give more sophisticated insight into the data at hand, but at a cost that may not be representative of human learning. I argue that computational linguistic questions such as these provide an interesting window into the strengths and limitations of machine learning as compared to the human language learning faculty.
[The dissertation citations contained here are published with the permission of ProQuest LLC. Further reproduction is prohibited without permission. Copies of dissertations may be obtained by telephone (1-800-521-0600) or via the web: http://www.proquest.com/en-US/products/dissertations/individuals.shtml.] ERIC # ED663172
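To make the CART component concrete, here is a minimal, purely illustrative sketch using scikit-learn's DecisionTreeClassifier. The feature names, toy stems, and labels are invented for exposition and are not the dissertation's actual materials or results.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy, invented training data: one row per hypothetical noun stem.
# Columns: [sonority distance of the final cluster, stem syllable count,
#           1 if the suffix is vowel-initial else 0]
X = [
    [2, 1, 1],
    [2, 2, 0],
    [0, 1, 1],
    [0, 2, 0],
]
y = [1, 0, 1, 0]  # 1 = stem vowel alternates (yer), 0 = stable vowel

# Fit a shallow CART and print its learned rules.
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
print(export_text(tree, feature_names=[
    "sonority_distance", "n_syllables", "suffix_vowel_initial"]))
```

A decision tree like this makes its generalizations inspectable: each branch is an explicit condition over morphophonological predictors, which is part of what makes CART attractive for probing what the model has learned.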
Award ID(s):
2125295
PAR ID:
10615173
Publisher / Repository:
ERIC
Format(s):
Medium: X
Institution:
Stony Brook University
Sponsoring Org:
National Science Foundation
More Like this
  1. The experimental study of artificial language learning has become a widely used means of investigating the predictions of theories of language learning and representation. Although much is now known about the generalizations that learners make from various kinds of data, relatively little is known about how those representations affect speech processing. This paper presents an event-related potential (ERP) study of brain responses to violations of lab-learned phonotactics. Novel words that violated a learned phonotactic constraint elicited a larger Late Positive Component (LPC) than novel words that satisfied it. Similar LPCs have been found for violations of natively acquired linguistic structure, as well as for violations of other types of abstract generalizations, such as musical structure. We argue that lab-learned phonotactic generalizations are represented abstractly and affect the evaluation of speech in a manner that is similar to natively acquired syntactic and phonological rules. 
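As an illustration of how an LPC effect like this is typically quantified, the sketch below uses MNE-Python to compare mean amplitude between conditions in a late time window. The file name, condition labels, electrode, and 500-800 ms window are assumptions for exposition, not the study's actual pipeline.

```python
import mne

# Hypothetical epoched EEG data with labeled conditions (not the study's files).
epochs = mne.read_epochs("phonotactics-epo.fif")
viol = epochs["violation"].average()   # ERP to novel words violating the constraint
legal = epochs["legal"].average()      # ERP to novel words satisfying it

def mean_amp(evoked, ch="Pz", tmin=0.5, tmax=0.8):
    """Mean amplitude (volts) over a late positive window at one channel."""
    return evoked.copy().pick([ch]).crop(tmin, tmax).data.mean()

# A more positive value for violations is consistent with a larger LPC.
print("LPC difference (V):", mean_amp(viol) - mean_amp(legal))
```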
  2. Goldwater, M.; Anggoro, F.; Hayes, B.; Ong, D. (Eds.)
    Using the N400 event-related potential (ERP), this study investigated how observing pitch gestures conveying lexical tones and representational gestures conveying word meanings while learning L2 Mandarin words that differ in lexical tone affects L1 English speakers' subsequent semantic and phonological processing of those words. Larger N400s for English target words mismatching vs. matching Mandarin prime words in meaning were observed for words learned with pitch and representational gestures, but not for words learned with no gesture. Additionally, larger N400s for Mandarin target words mismatching vs. matching Mandarin prime words in lexical tone were observed for words learned with pitch gesture, but not with representational gesture or no gesture. These findings provide the first ERP evidence that observing gestures conveying phonological and semantic information during L2 word learning enhances subsequent phonological and semantic processing of the learned L2 words.
  3. While research in heritage language phonology has found that transfer from the majority language can lead to divergent attainment in adult heritage language grammars, the extent to which language transfer develops across a heritage speaker's lifespan is understudied. To explore such cross-linguistic transfer, I examine the rate of glottalization in consonant-to-vowel sequences at word junctures produced by child and adult Spanish heritage speakers (HSs) in both languages. My results show that, in Spanish, child HSs produce higher rates of vowel-initial glottal phonation than their age-matched, monolingually raised Spanish counterparts, suggesting that the child HSs' Spanish grammars are more permeable to transfer than those of the adult HSs. In English, child and adult HSs show similarly low rates of glottal phonation compared with their age-matched, monolingually raised English-speaking counterparts. The findings for English can be explained by an account of transfer at either the individual level or the community level.
  4. Though the study of metrics and poetic verse has long informed phonological theory, studies of musical adaptation remain on the fringe of linguistic theory. In this paper, I argue that musical adaptation provides a unique window into speakers' knowledge of their phonological system, which can supply crucial evidence for phonological theory. I draw on two case studies from my fieldwork in West Africa: tonal textsetting of sung folk music in Tommo So (Dogon, Mali) and the balafon surrogate language in Seenku (Mande, Burkina Faso). I show how the results of these studies provide evidence for different levels of phonological grammar, the phonetics-phonology interface, and incomplete application of grammatical tone. Further, the case of the balafon surrogate language shows how studying music can be a valuable tool in language documentation and phonological description. Finally, a preliminary study of Seenku tonal textsetting suggests important differences in the level of phonological encoding in vocal music vs. instrumental surrogate speech.
  5.
    One reason pretraining on self-supervised linguistic tasks is effective is that it teaches models features that are helpful for language understanding. However, we want pretrained models to learn not only to represent linguistic features, but also to use those features preferentially during fine-tuning. With this goal in mind, we introduce a new English-language diagnostic set called MSGS (the Mixed Signals Generalization Set), which consists of 20 ambiguous binary classification tasks that we use to test whether a pretrained model prefers linguistic or surface generalizations during fine-tuning. We pretrain RoBERTa from scratch on quantities of data ranging from 1M to 1B words and compare these models' performance on MSGS to the publicly available RoBERTa_BASE. We find that models can learn to represent linguistic features with little pretraining data, but require far more data to learn to prefer linguistic generalizations over surface ones. Eventually, with about 30B words of pretraining data, RoBERTa_BASE does demonstrate a linguistic bias with some regularity. We conclude that while self-supervised pretraining is an effective way to learn helpful inductive biases, there is likely room to improve the rate at which models learn which features matter.
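For readers unfamiliar with this kind of probe, here is a minimal sketch of fine-tuning a RoBERTa checkpoint on one binary classification task with the Hugging Face transformers library. The example sentences and labels are placeholders rather than actual MSGS items, and a real run would loop over batches with an optimizer.

```python
import torch
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tok = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)  # binary classification head

# Placeholder items; MSGS tasks pit a linguistic rule against a surface one,
# so the label can be explained by either generalization during training.
sentences = ["the cat who slept is happy", "a cat near the boys sleeps"]
labels = torch.tensor([1, 0])

batch = tok(sentences, padding=True, return_tensors="pt")
loss = model(**batch, labels=labels).loss
loss.backward()  # one gradient step of fine-tuning (optimizer omitted)
print("loss:", float(loss))
```

The diagnostic then checks which generalization the fine-tuned model applies to held-out items that disambiguate the two rules.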