Title: Robust Generalization Strategies for Morpheme Glossing in an Endangered Language Documentation Context
Generalization is of particular importance in resource-constrained settings, where the available training data may represent only a small fraction of the distribution of possible texts. We investigate the ability of morpheme labeling models to generalize by evaluating their performance on unseen genres of text, and we experiment with strategies for closing the gap between performance on in-distribution and out-of-distribution data. Specifically, we use weight decay optimization, output denoising, and iterative pseudo-labeling, and achieve a 2% improvement on a test set containing texts from unseen genres. All experiments are performed using texts written in the Mayan language Uspanteko.
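The abstract's combination of weight-decay optimization and iterative pseudo-labeling can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the toy tagger, the random stand-in data, and the 0.9 confidence threshold are not the authors' actual setup.

```python
# Minimal sketch: weight-decay training plus iterative pseudo-labeling
# for a token-level glossing model. All sizes, data, and thresholds are
# illustrative placeholders.
import torch
import torch.nn as nn

VOCAB, N_TAGS, EMB = 5000, 60, 128        # hypothetical sizes

class GlossTagger(nn.Module):
    """Toy morpheme-gloss tagger: embedding plus per-token linear classifier."""
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, EMB)
        self.out = nn.Linear(EMB, N_TAGS)

    def forward(self, tokens):             # tokens: (batch, seq_len)
        return self.out(self.emb(tokens))  # logits: (batch, seq_len, n_tags)

def train_epoch(model, opt, batches, loss_fn):
    for tokens, tags in batches:
        opt.zero_grad()
        logits = model(tokens).reshape(-1, N_TAGS)
        loss_fn(logits, tags.reshape(-1)).backward()
        opt.step()

# Dummy in-domain (labeled) and out-of-domain (unlabeled) batches.
labeled = [(torch.randint(0, VOCAB, (8, 20)), torch.randint(0, N_TAGS, (8, 20)))]
unlabeled = [torch.randint(0, VOCAB, (8, 20))]

model = GlossTagger()
# AdamW applies weight decay, regularizing toward smaller weights, which
# tends to narrow the in-distribution / out-of-distribution gap.
opt = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.01)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)  # -100 marks masked tokens

pseudo_batches = []
for _ in range(3):                          # iterative pseudo-labeling rounds
    train_epoch(model, opt, labeled + pseudo_batches, loss_fn)
    pseudo_batches = []                     # re-label from scratch each round
    with torch.no_grad():
        for tokens in unlabeled:
            conf, pseudo = model(tokens).softmax(-1).max(-1)
            keep = conf >= 0.9              # arbitrary confidence cutoff
            if keep.any():
                pseudo[~keep] = -100        # drop low-confidence tokens
                pseudo_batches.append((tokens, pseudo))
```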
Award ID(s):
2149404
PAR ID:
10539621
Author(s) / Creator(s):
;
Publisher / Repository:
Association for Computational Linguistics
Date Published:
Page Range / eLocation ID:
89 to 98
Format(s):
Medium: X
Location:
Singapore
Sponsoring Org:
National Science Foundation
More Like this
  1. Online texts, across genres, registers, domains, and styles, are riddled with human stereotypes, expressed in overt or subtle ways. Word embeddings trained on these texts perpetuate and amplify these stereotypes, and propagate biases to machine learning models that use word embeddings as features. In this work, we propose a method to debias word embeddings in multiclass settings such as race and religion, extending the work of Bolukbasi et al. (2016) from binary settings such as binary gender. Next, we propose a novel methodology for the evaluation of multiclass debiasing. We demonstrate that our multiclass debiasing is robust and maintains efficacy on standard NLP tasks.
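As a rough illustration of how such multiclass debiasing can work, the sketch below follows the common hard-debiasing recipe: estimate a bias subspace by PCA over mean-centered sets of identity terms, then project that subspace out of words that should be neutral. The embeddings, word lists, and subspace dimension are placeholders, not the paper's actual data or full method (which also includes an equalization step).

```python
# Hedged sketch of multiclass hard debiasing (placeholder data throughout).
import numpy as np

rng = np.random.default_rng(0)
dim = 50
words = ["jewish", "christian", "muslim", "violent", "greedy", "peaceful"]
emb = {w: rng.normal(size=dim) for w in words}   # stand-in embeddings

# Defining sets: identity terms for the classes of one protected attribute.
defining_sets = [("jewish", "christian", "muslim")]

# Stack the mean-centered vectors of every defining set.
centered = []
for terms in defining_sets:
    vecs = np.stack([emb[t] for t in terms])
    centered.append(vecs - vecs.mean(axis=0))
centered = np.concatenate(centered)

# The top-k principal components span the multiclass bias subspace.
k = 2
_, _, vt = np.linalg.svd(centered, full_matrices=False)
bias_basis = vt[:k]                              # (k, dim), orthonormal rows

def neutralize(v, basis):
    """Remove the component of v lying in the bias subspace, renormalize."""
    v = v - basis.T @ (basis @ v)
    return v / np.linalg.norm(v)

for w in ["violent", "greedy", "peaceful"]:      # words meant to be neutral
    emb[w] = neutralize(emb[w], bias_basis)
```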
  2. While data collection early in the Americanist tradition included texts as part of the Boasian triad, later developments in the generative tradition moved away from narratives. With a resurgence of attention to texts in both linguistic theory and language documentation, the literature on methodologies is growing (e.g., Chafe 1980, Chelliah 2001, Burton & Matthewson 2015). We outline our approach to collecting Chickasaw texts in what we call a 'narrative bootcamp.' Chickasaw is a severely threatened language and is no longer in common daily use. Facilitating narrative collection with fluent elder speakers is an important goal, as is the cultivation of second-language speakers and the training of linguists and tribal language professionals. Our bootcamps meet these goals. Moreover, we show many positive outcomes of this approach, including a positive sense of language use and 'fun' voiced by the elders, the corpus expansion that occurs by collecting and processing narratives onsite in the workshop, and field methods training for novices. Importantly, we find that sparking personal recollections facilitates the collection of heretofore unrecorded narrative genres in Chickasaw. This approach offers an especially fruitful way to build and expand a text corpus for small communities of highly endangered languages.
  3. Due to the extreme scarcity of customer failure data, it is challenging to reliably screen out rare defects within the high-dimensional input feature space formed by the relevant parametric test measurements. In this paper, we study several unsupervised learning techniques on six industrial test datasets and propose to train a more robust unsupervised learning model by self-labeling the training data via a set of transformations. Using the labeled data, we train a multi-class classifier through supervised training. The quality of the multi-class classification decisions on unseen input data is used as a normality score to detect anomalies. Furthermore, we propose to use reversible, information-lossless transformations to retain the data information and boost the performance and robustness of the proposed self-labeling approach.
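The self-labeling idea lends itself to a short sketch: label each training vector by which reversible, lossless transformation was applied to it, train a multi-class classifier on those labels, and score an unseen input by how confidently the classifier recognizes its transformations. The transformations, data, and classifier below are illustrative stand-ins, not the paper's industrial setup.

```python
# Sketch: transformation self-labeling as a normality / anomaly score.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 16))      # stand-in parametric test vectors

# Reversible, information-lossless transformations of a feature vector.
transforms = [
    lambda x: x,                          # identity
    lambda x: -x,                         # negation
    lambda x: x[::-1],                    # reversal (a permutation)
    lambda x: np.roll(x, 4),              # cyclic shift
]

# Self-label: each (sample, transform) pair becomes one training example,
# labeled by the index of the transform that produced it.
X_aug = np.stack([t(x) for x in X_train for t in transforms])
y_aug = np.tile(np.arange(len(transforms)), len(X_train))

clf = LogisticRegression(max_iter=1000).fit(X_aug, y_aug)

def normality_score(x):
    """Mean probability assigned to the correct transform; low values
    suggest the input is unlike the (normal) training distribution."""
    probs = clf.predict_proba(np.stack([t(x) for t in transforms]))
    idx = np.arange(len(transforms))
    return probs[idx, idx].mean()

print(normality_score(rng.normal(size=16)))
```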
  4. Vivi Nastase; Ellie Pavlick; Mohammad Taher Pilehvar; Jose Camacho-Collados; Alessandro Raganato (Ed.)
    This paper describes the evolution of the PropBank approach to semantic role labeling over the last two decades. During this time, the PropBank frame files have been expanded to include non-verbal predicates such as adjectives, prepositions, and multi-word expressions. The number of domains, genres, and languages that have been PropBanked has also grown considerably, creating an opportunity for much more challenging and robust testing of the generalization capabilities of PropBank semantic role labeling systems. We also describe the substantial effort that has gone into ensuring the consistency and reliability of the various annotated datasets and resources, to better support the training and evaluation of such systems.
  5. Ideology is at the core of political science research. Yet there are still no general-purpose tools to characterize and predict ideology across different genres of text. To this end, we study pretrained language models using novel ideology-driven pretraining objectives that rely on comparing articles on the same story written by media outlets of different ideologies. We further collect a large-scale dataset, consisting of more than 3.6M political news articles, for pretraining. Our model, POLITICS, outperforms strong baselines and previous state-of-the-art models on ideology prediction and stance detection tasks. Further analyses show that POLITICS is especially good at understanding long or formally written texts, and is also robust in few-shot learning scenarios.
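As a loose illustration of what such a story-level comparison objective might look like, the sketch below contrasts embeddings of same-story articles from same-ideology versus different-ideology outlets with a triplet-style loss. The encoder outputs, cosine distance, and margin are assumptions for illustration, not necessarily the actual POLITICS objectives.

```python
# Hedged sketch of a story-level ideology-contrast objective.
import torch
import torch.nn.functional as F

def ideology_contrast_loss(anchor, same_ideo, diff_ideo, margin=0.5):
    """Triplet-style loss over article embeddings covering one story:
    same-ideology pairs should end up closer than different-ideology pairs."""
    pos = 1 - F.cosine_similarity(anchor, same_ideo, dim=-1)
    neg = 1 - F.cosine_similarity(anchor, diff_ideo, dim=-1)
    return F.relu(pos - neg + margin).mean()

# Dummy encoder outputs for a batch of 8 stories (256-dim embeddings).
anchor, same_ideo, diff_ideo = (torch.randn(8, 256) for _ in range(3))
print(ideology_contrast_loss(anchor, same_ideo, diff_ideo))
```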