Title: Endangered Data for Endangered Languages: Digitizing Print dictionaries
This paper describes ongoing work in dictionary digitization, in particular the processing of OCRed text into a structured lexicon. In processing the output of an OCR engine into structured data, we face three problems: (1) typographic errors; (2) conversion from the dictionary’s visual layout into a lexicographically structured, computer-readable format such as XML; and (3) conversion of each dictionary’s idiosyncratic structure into a standard tagging system. This paper deals mostly with the second issue but touches on the first and third.
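The second problem, moving from visual layout to structured markup, can be illustrated with a minimal sketch. It is not the authors' method: the one-line entry layout ("headword pos. gloss"), the regular expression, the function name `entry_to_xml`, and the element names `entry`, `headword`, `pos`, and `gloss` are all assumptions chosen for illustration, and real dictionaries typically need far more elaborate rules.

```python
import re
import xml.etree.ElementTree as ET

# Assumed (hypothetical) layout of one OCRed entry line:
#   "<headword> <part-of-speech abbreviation ending in '.'> <gloss>"
# Real print dictionaries vary widely; this pattern is for illustration only.
ENTRY_PATTERN = re.compile(
    r"^(?P<headword>\S+)\s+(?P<pos>[a-z]+\.)\s+(?P<gloss>.+)$"
)


def entry_to_xml(line: str) -> ET.Element:
    """Convert one entry line in the assumed layout into an <entry> element."""
    match = ENTRY_PATTERN.match(line.strip())
    if match is None:
        # Lines that do not fit the assumed layout are reported, not guessed at.
        raise ValueError(f"line does not match the assumed layout: {line!r}")
    entry = ET.Element("entry")
    ET.SubElement(entry, "headword").text = match["headword"]
    ET.SubElement(entry, "pos").text = match["pos"]
    ET.SubElement(entry, "gloss").text = match["gloss"]
    return entry


if __name__ == "__main__":
    element = entry_to_xml("lexicon n. the vocabulary of a language")
    print(ET.tostring(element, encoding="unicode"))
```

Raising an error on lines that do not match, rather than guessing, keeps OCR typographic errors (the first problem) visible for later correction instead of silently producing malformed entries.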
Award ID(s):
1644606
PAR ID:
10039206
Journal Name:
Proceedings of the 2nd Workshop on the Use of Computational Methods in the Study of Endangered Languages
Page Range / eLocation ID:
85 to 91
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Advances in speech and language processing have enabled the creation of applications that could, in principle, accelerate the process of language documentation, as speech communities and linguists work on urgent language documentation and reclamation projects. However, such systems have yet to make a significant impact on language documentation, as resource requirements limit the broad applicability of these new techniques. We aim to exploit the framework of shared tasks to focus the technology research community on tasks which address key pain points in language documentation. Here we present initial steps in the implementation of these new shared tasks, through the creation of data sets drawn from endangered language repositories and baseline systems to perform segmentation and speaker labeling of these audio recordings, which are important enabling steps in the documentation process. This paper motivates these tasks with a use case, describes data set curation and baseline systems, and presents results on this data. We then highlight the challenges and ethical considerations in developing these speech processing tools and tasks to support endangered language documentation.
  2. The memristor crossbar array has emerged as an intrinsically suitable matrix computation and low-power acceleration framework for DNN applications. Many techniques, such as memristor-based weight pruning and memristor-based quantization, have been studied. However, a high-accuracy solution for these techniques has yet to be found. In this paper, we propose a memristor-based DNN framework which combines structured weight pruning and quantization by incorporating the ADMM algorithm for better pruning and quantization performance. We also discover the non-optimality of the ADMM solution in weight pruning and the unused data path in a structured pruned model. We design a software-hardware co-optimization framework which contains the first proposed Network Purification and Unused Path Removal algorithms, targeting the post-processing of a structured pruned model after the ADMM steps. By taking memristor hardware constraints into account throughout our framework, we achieve extremely high compression rates with minimal accuracy loss. When quantizing a structured pruned model, our framework achieves nearly no accuracy loss after quantizing weights to an 8-bit memristor weight representation. We share our models at the anonymous link https://bit.ly/2VnMUy0.
  3. The development of language technology (LT) for an endangered language is often identified as a goal in language revitalization efforts, but developing such technologies is typically subject to additional methodological challenges as well as social and ethical concerns. In particular, LT development has too often taken on colonialist qualities, extracting language data, relying on outside experts, and denying the speakers of a language sovereignty over the technologies produced. We seek to avoid such an approach through the development of the Building Endangered Language Technology (BELT) website, an educational resource designed for speakers and community members with limited technological experience to develop LTs for their own language. Specifically, BELT provides interactive lessons on basic Python programming, coupled with projects to develop specific language technologies, such as spellcheckers or word games. In this paper, we describe BELT’s design, the motivation underlying many key decisions, and preliminary responses from learners.
  4. While data collection early in the Americanist tradition included texts as part of the Boasian triad, later developments in the generative tradition moved away from narratives. With a resurgence of attention to texts in both linguistic theory and language documentation, the literature on methodologies is growing (e.g., Chelliah 2001, Chafe 1980, Burton & Matthewson 2015). We outline our approach to collecting Chickasaw texts in what we call a ‘narrative bootcamp.’ Chickasaw is a severely threatened language and no longer in common daily use. Facilitating narrative collection with elder fluent speakers is an important goal, as is the cultivation of second language speakers and the training of linguists and tribal language professionals. Our bootcamps meet these goals. Moreover, we show many positive outcomes of this approach, including a positive sense of language use and ‘fun’ voiced by the elders, the corpus expansion that occurs by collecting and processing narratives onsite in the workshop, and field methods training for novices. Importantly, we find that the sparking of personal recollections facilitates the collection of heretofore unrecorded narrative genres in Chickasaw. This approach offers an especially fruitful way to build and expand a text corpus for small communities of highly endangered languages.
  5. Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task. 