The study of language variation examines how language varies between and within different groups of speakers, shedding light on how we use language to construct identities and how social contexts affect language use. A common method is to identify instances of a certain linguistic feature - say, the zero copula construction - in a corpus, and analyze the feature’s distribution across speakers, topics, and other variables, to either gain a qualitative understanding of the feature’s function or systematically measure variation. In this paper, we explore the challenging task of automatic morphosyntactic feature detection in low-resource English varieties. We present a human-in-the-loop approach to generate and filter effective contrast sets via corpus-guided edits. We show that our approach improves feature detection for both Indian English and African American English, demonstrate how it can assist linguistic research, and release our fine-tuned models for use by other researchers.
Collecting Texts in Endangered Language: The Chickasaw Narrative Bootcamp
While data collection early in the Americanist tradition included texts as part of the Boasian triad, later developments in the generative tradition moved away from narratives. With a resurgence of attention to texts in both linguistic theory and language documentation, the literature on methodologies is growing (e.g., Chelliah 2001, Chafe 1980, Burton & Matthewson 2015). We outline our approach to collecting Chickasaw texts in what we call a ‘narrative bootcamp.’ Chickasaw is a severely threatened language and no longer in common daily use. Facilitating narrative collection with elder fluent speakers is an important goal, as is the cultivation of second language speakers and the training of linguists and tribal language professionals. Our bootcamps meet these goals. Moreover, we show many positive outcomes of this approach, including a positive sense of language use and ‘fun’ voiced by the elders, the corpus expansion that occurs by collecting and processing narratives onsite in the workshop, and field methods training for novices. Importantly, we find that the sparking of personal recollections facilitates the collection of heretofore unrecorded narrative genres in Chickasaw. This approach offers an especially fruitful way to build and expand a text corpus for small communities of highly endangered languages.
- Award ID(s): 1263699
- PAR ID: 10039007
- Date Published:
- Journal Name: Language documentation and conservation
- Volume: 10
- ISSN: 1934-5275
- Page Range / eLocation ID: 522
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Narrative planning is the process of generating sequences of actions that form coherent and goal-oriented narratives. Classical implementations of narrative planning rely on heuristic search techniques to offer structured story generation but face challenges with scalability due to large branching factors and deep search requirements. Large Language Models (LLMs), with their extensive training on diverse linguistic datasets, excel in understanding and generating coherent narratives. However, their planning ability lacks the precision and structure needed for effective narrative planning. This paper explores a hybrid approach that uses LLMs as heuristic guides within classical search frameworks for narrative planning. We compare various prompt designs to generate LLM heuristic predictions and evaluate their performance against h+, hmax, and relaxed plan heuristics. Additionally, we analyze the ability of relaxed plans to predict the next action correctly, comparing it to the LLMs’ ability to make the same prediction. Our findings indicate that LLMs rarely exceed the accuracy of classical planning heuristics.
- It is now a common practice to compare models of human language processing by comparing how well they predict behavioral and neural measures of processing difficulty, such as reading times, on corpora of rich naturalistic linguistic materials. However, many of these corpora, which are based on naturally-occurring text, do not contain many of the low-frequency syntactic constructions that are often required to distinguish between processing theories. Here we describe a new corpus consisting of English texts edited to contain many low-frequency syntactic constructions while still sounding fluent to native speakers. The corpus is annotated with hand-corrected Penn Treebank-style parse trees and includes self-paced reading time data and aligned audio recordings. We give an overview of the content of the corpus, review recent work using the corpus, and release the data.
- St. Lawrence Island Yupik (ISO 639-3: ess) is an endangered polysynthetic language in the Inuit-Yupik language family indigenous to Alaska and Chukotka. This work presents a step-by-step pipeline for the digitization of written texts, and the first publicly available digital corpus for St. Lawrence Island Yupik, created using that pipeline. This corpus has great potential for future linguistic inquiry and research in NLP. It was also developed for use in Yupik language education and revitalization, with a primary goal of enabling easy access to Yupik texts by educators and by members of the Yupik community. A secondary goal is to support development of language technology such as spell-checkers, text-completion systems, interactive e-books, and language learning apps for use by the Yupik community.
- Interactive narratives in games utilize a combination of dynamic adaptability and predefined story elements to support player agency and enhance player engagement. However, crafting such narratives requires significant manual authoring and coding effort to translate scripts to playable game levels. Advances in pretrained large language models (LLMs) have introduced the opportunity to procedurally generate narratives. This paper presents NarrativeGenie, a framework to generate narrative beats as a cohesive, partially ordered sequence of events that shapes narrative progressions from brief natural language instructions. By leveraging LLMs for reasoning and generation, NarrativeGenie translates a designer’s story overview into a partially ordered event graph to enable player-driven narrative beat sequencing. Our findings indicate that NarrativeGenie can provide an easy and effective way for designers to generate an interactive game episode with narrative events that align with the intended story arc while at the same time granting players agency in their game experience. We extend our framework to dynamically direct the narrative flow by adapting real-time narrative interactions based on the current game state and player actions. Results demonstrate that NarrativeGenie generates narratives that are coherent and aligned with the designer’s vision.

