Although the application of deep learning to automatic speech recognition (ASR) has resulted in dramatic reductions in word error rate (WER) for languages with abundant training data, ASR for languages with few resources has yet to benefit from deep learning to the same extent. In this paper, we investigate various methods of acoustic modeling and data augmentation with the goal of improving the accuracy of a deep learning ASR framework for a low-resource language with a high baseline WER. We compare several methods of generating synthetic acoustic training data via voice transformation and signal distortion, and we explore several strategies for integrating this data into the acoustic training pipeline. We evaluate our methods on an indigenous language of North America with minimal training resources. We show that training initially via transfer learning from an existing high-resource language acoustic model, refining weights using a heavily concentrated synthetic dataset, and finally fine-tuning to the target language using limited synthetic data reduces WER by 15% over transfer learning alone with deep recurrent methods. Further, we show a 19% improvement over traditional frameworks using a similar multi-stage training scheme with deep convolutional approaches.
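The signal-distortion side of this kind of augmentation can be illustrated with a short sketch. The snippet below is a minimal example under assumed settings, not the authors' pipeline: it takes 16 kHz mono recordings and uses librosa's tempo and pitch effects plus additive noise to produce several distorted copies of each training utterance; the file paths, perturbation factors, and SNR are illustrative placeholders.

```python
# Minimal sketch of signal-distortion data augmentation for acoustic training.
# Assumes 16 kHz mono WAV files; paths and parameter values are illustrative.
import os
import numpy as np
import librosa
import soundfile as sf

SR = 16000

def distorted_copies(wav_path):
    """Yield (suffix, waveform) pairs for several distorted versions of one utterance."""
    y, _ = librosa.load(wav_path, sr=SR)

    # Tempo perturbation: slightly faster / slower speech without changing pitch.
    for rate in (0.9, 1.1):
        yield f"tempo{rate}", librosa.effects.time_stretch(y, rate=rate)

    # Pitch perturbation: shift up / down by one semitone.
    for steps in (-1, 1):
        yield f"pitch{steps:+d}", librosa.effects.pitch_shift(y, sr=SR, n_steps=steps)

    # Additive noise at a fixed signal-to-noise ratio.
    snr_db = 20.0
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    yield f"noise{int(snr_db)}dB", y + scale * noise

if __name__ == "__main__":
    os.makedirs("train_aug", exist_ok=True)
    for suffix, wav in distorted_copies("train/utt001.wav"):  # hypothetical path
        sf.write(f"train_aug/utt001_{suffix}.wav", wav, SR)
```

Each distorted copy keeps the original transcript, so the synthetic utterances can be mixed into the acoustic training set at whatever ratio a multi-stage training schedule calls for.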
Improving ASR Output for Endangered Language Documentation
Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.
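As an illustration of the language-model side of such augmentation, the sketch below trains a small word-level LSTM on the available sentences and samples new ones from it. It is a toy stand-in rather than the paper's model: the corpus, vocabulary handling, hyperparameters, and sampling temperature are placeholder choices.

```python
# Toy word-level LSTM for generating synthetic LM training sentences.
# Corpus, hyperparameters, and sampling settings are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in corpus; in practice this would be the available target-language text.
corpus = ["<s> the dog runs </s>", "<s> the cat sleeps </s>", "<s> the dog sleeps </s>"]
tokens = sorted({w for sent in corpus for w in sent.split()})
stoi = {w: i for i, w in enumerate(tokens)}
itos = {i: w for w, i in stoi.items()}

class WordLSTM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.emb(x), state)
        return self.out(h), state

model = WordLSTM(len(tokens))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train to predict the next word at every position.
for epoch in range(200):
    for sent in corpus:
        ids = torch.tensor([[stoi[w] for w in sent.split()]])
        logits, _ = model(ids[:, :-1])
        loss = loss_fn(logits.squeeze(0), ids[0, 1:])
        opt.zero_grad(); loss.backward(); opt.step()

# Sample synthetic sentences to add to the LM training data.
def sample(max_len=12, temperature=1.0):
    words, state = ["<s>"], None
    x = torch.tensor([[stoi["<s>"]]])
    for _ in range(max_len):
        logits, state = model(x, state)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        words.append(itos[nxt])
        if itos[nxt] == "</s>":
            break
        x = torch.tensor([[nxt]])
    return " ".join(words)

print([sample() for _ in range(3)])
```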
- Award ID(s): 1761562
- PAR ID: 10087204
- Date Published:
- Journal Name: The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
- Page Range / eLocation ID: 187 to 191
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The application of deep learning to automatic speech recognition (ASR) has yielded dramatic accuracy increases for languages with abundant training data, but languages with limited training resources have yet to see accuracy improvements on this scale. In this paper, we compare a fully convolutional approach for acoustic modeling in ASR with a variety of established acoustic modeling approaches. We evaluate our method on Seneca, a low-resource endangered language spoken in North America. Our method yields word error rates up to 40% lower than those reported using both standard GMM-HMM approaches and established deep neural methods, with a substantial reduction in training time. These results show particular promise for languages like Seneca that are both endangered and lack extensive documentation.
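A minimal sketch of what a fully convolutional acoustic model can look like is given below; it is an illustrative stand-in, not the architecture from that paper. It stacks 1-D convolutions over log-Mel filterbank frames and is trained with a CTC objective; the layer sizes, number of Mel bins, and label inventory are placeholder choices.

```python
# Illustrative fully convolutional acoustic model trained with CTC.
# Feature dimensions, depth, and the label set are placeholders.
import torch
import torch.nn as nn

NUM_MELS, NUM_LABELS = 80, 40  # 40 output symbols incl. the CTC blank (index 0)

class ConvAcousticModel(nn.Module):
    def __init__(self, channels=256, layers=5):
        super().__init__()
        blocks, in_ch = [], NUM_MELS
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels),
                       nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*blocks)
        self.proj = nn.Conv1d(channels, NUM_LABELS, kernel_size=1)

    def forward(self, feats):                          # feats: (batch, NUM_MELS, frames)
        logits = self.proj(self.conv(feats))           # (batch, NUM_LABELS, frames)
        return logits.permute(2, 0, 1).log_softmax(-1)  # (frames, batch, labels) for CTC

model = ConvAcousticModel()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, NUM_MELS, 200)               # dummy batch of filterbank frames
targets = torch.randint(1, NUM_LABELS, (4, 30))     # dummy label sequences
log_probs = model(feats)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 30))
loss.backward()
```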
We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
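The transfer recipe itself is simple to express in code. The sketch below is a schematic stand-in, not that paper's system: a shared encoder-decoder is first trained on ASR (audio, source-language transcript) pairs, then its encoder weights are copied into an ST model that is fine-tuned on (audio, target-language translation) pairs. All module sizes and the checkpoint path are placeholders.

```python
# Schematic ASR pre-training followed by ST fine-tuning (placeholder sizes and paths).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Tiny encoder-decoder; real systems would use attention or conformer layers."""
    def __init__(self, feat_dim=80, vocab=1000, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, dim, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                  # (B, T, dim)
        dec, _ = self.decoder(self.emb(prev_tokens))  # (B, U, dim)
        # Crude fusion for the sketch: add mean-pooled acoustic context at each step.
        context = enc.mean(dim=1, keepdim=True)
        return self.out(dec + context)                # (B, U, vocab)

# Step 1: pre-train on high-resource ASR (e.g., English audio -> English text).
asr_model = Seq2Seq(vocab=1000)
# ... train asr_model on the ASR data, then save it ...
torch.save(asr_model.state_dict(), "asr_pretrained.pt")   # hypothetical path

# Step 2: initialize the ST encoder from the ASR checkpoint, then fine-tune on ST data.
st_model = Seq2Seq(vocab=8000)                             # target-language vocabulary
asr_state = torch.load("asr_pretrained.pt")
encoder_state = {k[len("encoder."):]: v for k, v in asr_state.items()
                 if k.startswith("encoder.")}
st_model.encoder.load_state_dict(encoder_state)
# ... fine-tune st_model (all parameters) on the small speech-translation set ...
```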
Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named AUGPE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that AUGPE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
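The core of a PE-style loop can be sketched without any model training: embed the private texts and a synthetic candidate pool, let each private example vote for its nearest synthetic neighbor, add noise to the vote histogram for differential privacy, and resample and vary the winners via the LLM API. The code below is a schematic illustration under those assumptions, not the AUGPE implementation; `llm_variation` and `embed` are hypothetical stand-ins for an LLM API call and a text-embedding call, and the noise scale is not calibrated to a specific (epsilon, delta).

```python
# Schematic PE-style loop for API-only DP synthetic text (not the AUGPE implementation).
# `llm_variation` and `embed` are hypothetical stand-ins for API calls.
import numpy as np

rng = np.random.default_rng(0)

def llm_variation(text: str) -> str:
    """Stand-in for an LLM API call that paraphrases/rewrites `text`."""
    return text  # placeholder

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for a text-embedding API; returns one vector per text."""
    return rng.normal(size=(len(texts), 16))  # placeholder

def pe_iteration(private, synthetic, noise_scale=1.0):
    priv_emb, syn_emb = embed(private), embed(synthetic)
    # Each private example votes for its nearest synthetic candidate.
    dists = np.linalg.norm(priv_emb[:, None, :] - syn_emb[None, :, :], axis=-1)
    votes = np.bincount(dists.argmin(axis=1), minlength=len(synthetic)).astype(float)
    # DP histogram: add Gaussian noise (scale must be calibrated to the privacy budget).
    noisy = votes + rng.normal(scale=noise_scale, size=votes.shape)
    # Resample candidates in proportion to their clipped noisy votes, then vary them.
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum() if probs.sum() > 0 else np.full(len(synthetic), 1 / len(synthetic))
    picked = rng.choice(len(synthetic), size=len(synthetic), p=probs)
    return [llm_variation(synthetic[i]) for i in picked]

private_texts = ["private example one", "private example two"]    # stand-ins
synthetic_pool = ["seed text a", "seed text b", "seed text c"]    # from initial API prompts
for _ in range(3):
    synthetic_pool = pe_iteration(private_texts, synthetic_pool)
```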
This thesis investigates the computational modeling of belief and related cognitive states as expressed in text and speech. Understanding how speakers or authors convey commitment, certainty, and emotions is crucial for language understanding, yet poses significant challenges for current NLP systems. We present a comprehensive study spanning multiple facets of belief prediction. We begin by re-examining the widely used FactBank corpus, correcting a critical projection error and establishing new state-of-the-art results for author-only belief prediction through multi-task learning and error analysis. We then tackle the more complex task of source-and-target belief prediction, introducing a novel generative framework using Flan-T5. This includes developing a structured database representation for FactBank and proposing a linearized tree generation approach, culminating in the BeLeaf system for visualization and analysis, which achieves state-of-the-art performance on both FactBank and the MDP corpus. With the rise of large language models (LLMs), we investigate their zero-shot capabilities for the source-and-target belief task. We propose Unified and Hybrid prompting frameworks, finding that while current LLMs struggle, particularly with nested beliefs, our Hybrid approach paired with reasoning-focused LLMs achieves new state-of-the-art results on FactBank. Finally, we explore the role of multimodality among multiple cognitive states. We present the first study on multimodal belief prediction using the CB-Prosody corpus, demonstrating that integrating audio features via fine-tuned Whisper models significantly improves performance over text-only BERT models. We further introduce Synthetic Audio Data (SAD), showing that even synthetic audio generated by TTS systems provides orthogonal, beneficial signals for various cognitive state tasks (belief, emotion, sentiment). We conclude by presenting OmniVox, the first systematic evaluation of omni-LLMs for zero-shot emotion recognition directly from audio, demonstrating their competitiveness with fine-tuned models and analyzing their acoustic reasoning capabilities.
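As a small illustration of the text-plus-audio fusion idea, the sketch below concatenates an utterance-level acoustic embedding with a sentence-level text embedding and feeds both to a shallow classifier. It is only a schematic stand-in for the thesis models: the feature extractors are abstracted away as precomputed vectors, and the dimensions and label set are placeholders.

```python
# Schematic late-fusion classifier over precomputed text and audio embeddings.
# Dimensions, label set, and the upstream feature extractors are placeholders.
import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM, NUM_CLASSES = 768, 512, 3  # e.g., BERT- and Whisper-sized vectors; 3 belief labels

class FusionClassifier(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + AUDIO_DIM, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_CLASSES),
        )

    def forward(self, text_vec, audio_vec):
        return self.net(torch.cat([text_vec, audio_vec], dim=-1))

model = FusionClassifier()
# Dummy batch standing in for pooled text features and pooled audio features.
text_feats = torch.randn(8, TEXT_DIM)
audio_feats = torch.randn(8, AUDIO_DIM)
logits = model(text_feats, audio_feats)          # (8, NUM_CLASSES)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, NUM_CLASSES, (8,)))
loss.backward()
```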