Although the application of deep learning to automatic speech recognition (ASR) has resulted in dramatic reductions in word error rate (WER) for languages with abundant training data, ASR for languages with few resources has yet to benefit from deep learning to the same extent. In this paper, we investigate various methods of acoustic modeling and data augmentation with the goal of improving the accuracy of a deep learning ASR framework for a low-resource language with a high baseline WER. We compare several methods of generating synthetic acoustic training data via voice transformation and signal distortion, and we explore several strategies for integrating this data into the acoustic training pipeline. We evaluate our methods on an indigenous language of North America with minimal training resources. We show that training initially via transfer learning from an existing high-resource language acoustic model, refining weights using a heavily concentrated synthetic dataset, and finally fine-tuning to the target language using limited synthetic data reduces WER by 15% over transfer learning alone with deep recurrent methods. Further, we show a 19% improvement over traditional frameworks using a similar multi-stage training scheme with deep convolutional approaches.
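The signal-distortion side of this kind of augmentation can be illustrated with a short sketch. The snippet below is a minimal example under assumed settings, not the authors' pipeline: it takes 16 kHz mono recordings and uses librosa's tempo and pitch effects plus additive noise to produce several distorted copies of each training utterance; the file paths, perturbation factors, and SNR are illustrative placeholders.

```python
# Minimal sketch of signal-distortion data augmentation for acoustic training.
# Assumes 16 kHz mono WAV files; paths and parameter values are illustrative.
import os
import numpy as np
import librosa
import soundfile as sf

SR = 16000

def distorted_copies(wav_path):
    """Yield (suffix, waveform) pairs for several distorted versions of one utterance."""
    y, _ = librosa.load(wav_path, sr=SR)

    # Tempo perturbation: slightly faster / slower speech without changing pitch.
    for rate in (0.9, 1.1):
        yield f"tempo{rate}", librosa.effects.time_stretch(y, rate=rate)

    # Pitch perturbation: shift up / down by one semitone.
    for steps in (-1, 1):
        yield f"pitch{steps:+d}", librosa.effects.pitch_shift(y, sr=SR, n_steps=steps)

    # Additive noise at a fixed signal-to-noise ratio.
    snr_db = 20.0
    noise = np.random.randn(len(y))
    scale = np.sqrt(np.mean(y ** 2) / (10 ** (snr_db / 10) * np.mean(noise ** 2)))
    yield f"noise{int(snr_db)}dB", y + scale * noise

if __name__ == "__main__":
    os.makedirs("train_aug", exist_ok=True)
    for suffix, wav in distorted_copies("train/utt001.wav"):  # hypothetical path
        sf.write(f"train_aug/utt001_{suffix}.wav", wav, SR)
```

Each distorted copy keeps the original transcript, so the synthetic utterances can be mixed into the acoustic training set at whatever ratio a multi-stage training schedule calls for.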
Improving ASR Output for Endangered Language Documentation
Documenting endangered languages supports the historical preservation of diverse cultures. Automatic speech recognition (ASR), while potentially very useful for this task, has been underutilized for language documentation due to the challenges inherent in building robust models from extremely limited audio and text training resources. In this paper, we explore the utility of supplementing existing training resources using synthetic data, with a focus on Seneca, a morphologically complex endangered language of North America. We use transfer learning to train acoustic models using both the small amount of available acoustic training data and artificially distorted copies of that data. We then supplement the language model training data with verb forms generated by rule and sentences produced by an LSTM trained on the available text data. The addition of synthetic data yields reductions in word error rate, demonstrating the promise of data augmentation for this task.
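As an illustration of the language-model side of such augmentation, the sketch below trains a small word-level LSTM on the available sentences and samples new ones from it. It is a toy stand-in rather than the paper's model: the corpus, vocabulary handling, hyperparameters, and sampling temperature are placeholder choices.

```python
# Toy word-level LSTM for generating synthetic LM training sentences.
# Corpus, hyperparameters, and sampling settings are placeholders.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in corpus; in practice this would be the available target-language text.
corpus = ["<s> the dog runs </s>", "<s> the cat sleeps </s>", "<s> the dog sleeps </s>"]
tokens = sorted({w for sent in corpus for w in sent.split()})
stoi = {w: i for i, w in enumerate(tokens)}
itos = {i: w for w, i in stoi.items()}

class WordLSTM(nn.Module):
    def __init__(self, vocab, dim=64):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab)

    def forward(self, x, state=None):
        h, state = self.lstm(self.emb(x), state)
        return self.out(h), state

model = WordLSTM(len(tokens))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# Train to predict the next word at every position.
for epoch in range(200):
    for sent in corpus:
        ids = torch.tensor([[stoi[w] for w in sent.split()]])
        logits, _ = model(ids[:, :-1])
        loss = loss_fn(logits.squeeze(0), ids[0, 1:])
        opt.zero_grad(); loss.backward(); opt.step()

# Sample synthetic sentences to add to the LM training data.
def sample(max_len=12, temperature=1.0):
    words, state = ["<s>"], None
    x = torch.tensor([[stoi["<s>"]]])
    for _ in range(max_len):
        logits, state = model(x, state)
        probs = torch.softmax(logits[0, -1] / temperature, dim=-1)
        nxt = torch.multinomial(probs, 1).item()
        words.append(itos[nxt])
        if itos[nxt] == "</s>":
            break
        x = torch.tensor([[nxt]])
    return " ".join(words)

print([sample() for _ in range(3)])
```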
- Award ID(s): 1761562
- PAR ID: 10087204
- Date Published:
- Journal Name: The 6th Intl. Workshop on Spoken Language Technologies for Under-Resourced Languages
- Page Range / eLocation ID: 187 to 191
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The application of deep learning to automatic speech recognition (ASR) has yielded dramatic accuracy increases for languages with abundant training data, but languages with limited training resources have yet to see accuracy improvements on this scale. In this paper, we compare a fully convolutional approach for acoustic modeling in ASR with a variety of established acoustic modeling approaches. We evaluate our method on Seneca, a low-resource endangered language spoken in North America. Our method yields word error rates up to 40% lower than those reported using both standard GMM-HMM approaches and established deep neural methods, with a substantial reduction in training time. These results show particular promise for languages like Seneca that are both endangered and lack extensive documentation.
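A minimal sketch of what a fully convolutional acoustic model can look like is given below; it is an illustrative stand-in, not the architecture from that paper. It stacks 1-D convolutions over log-Mel filterbank frames and is trained with a CTC objective; the layer sizes, number of Mel bins, and label inventory are placeholder choices.

```python
# Illustrative fully convolutional acoustic model trained with CTC.
# Feature dimensions, depth, and the label set are placeholders.
import torch
import torch.nn as nn

NUM_MELS, NUM_LABELS = 80, 40  # 40 output symbols incl. the CTC blank (index 0)

class ConvAcousticModel(nn.Module):
    def __init__(self, channels=256, layers=5):
        super().__init__()
        blocks, in_ch = [], NUM_MELS
        for _ in range(layers):
            blocks += [nn.Conv1d(in_ch, channels, kernel_size=5, padding=2),
                       nn.BatchNorm1d(channels),
                       nn.ReLU()]
            in_ch = channels
        self.conv = nn.Sequential(*blocks)
        self.proj = nn.Conv1d(channels, NUM_LABELS, kernel_size=1)

    def forward(self, feats):                          # feats: (batch, NUM_MELS, frames)
        logits = self.proj(self.conv(feats))           # (batch, NUM_LABELS, frames)
        return logits.permute(2, 0, 1).log_softmax(-1)  # (frames, batch, labels) for CTC

model = ConvAcousticModel()
ctc = nn.CTCLoss(blank=0)

feats = torch.randn(4, NUM_MELS, 200)               # dummy batch of filterbank frames
targets = torch.randint(1, NUM_LABELS, (4, 30))     # dummy label sequences
log_probs = model(feats)
loss = ctc(log_probs, targets,
           input_lengths=torch.full((4,), 200),
           target_lengths=torch.full((4,), 30))
loss.backward()
```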
We present a simple approach to improve direct speech-to-text translation (ST) when the source language is low-resource: we pre-train the model on a high-resource automatic speech recognition (ASR) task, and then fine-tune its parameters for ST. We demonstrate that our approach is effective by pre-training on 300 hours of English ASR data to improve Spanish-English ST from 10.8 to 20.2 BLEU when only 20 hours of Spanish-English ST training data are available. Through an ablation study, we find that the pre-trained encoder (acoustic model) accounts for most of the improvement, despite the fact that the shared language in these tasks is the target language text, not the source language audio. Applying this insight, we show that pre-training on ASR helps ST even when the ASR language differs from both source and target ST languages: pre-training on French ASR also improves Spanish-English ST. Finally, we show that the approach improves performance on a true low-resource task: pre-training on a combination of English ASR and French ASR improves Mboshi-French ST, where only 4 hours of data are available, from 3.5 to 7.1 BLEU.
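The transfer recipe itself is simple to express in code. The sketch below is a schematic stand-in, not that paper's system: a shared encoder-decoder is first trained on ASR (audio, source-language transcript) pairs, then its encoder weights are copied into an ST model that is fine-tuned on (audio, target-language translation) pairs. All module sizes and the checkpoint path are placeholders.

```python
# Schematic ASR pre-training followed by ST fine-tuning (placeholder sizes and paths).
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Tiny encoder-decoder; real systems would use attention or conformer layers."""
    def __init__(self, feat_dim=80, vocab=1000, dim=256):
        super().__init__()
        self.encoder = nn.LSTM(feat_dim, dim, num_layers=2, batch_first=True)
        self.decoder = nn.LSTM(dim, dim, num_layers=2, batch_first=True)
        self.emb = nn.Embedding(vocab, dim)
        self.out = nn.Linear(dim, vocab)

    def forward(self, feats, prev_tokens):
        enc, _ = self.encoder(feats)                  # (B, T, dim)
        dec, _ = self.decoder(self.emb(prev_tokens))  # (B, U, dim)
        # Crude fusion for the sketch: add mean-pooled acoustic context at each step.
        context = enc.mean(dim=1, keepdim=True)
        return self.out(dec + context)                # (B, U, vocab)

# Step 1: pre-train on high-resource ASR (e.g., English audio -> English text).
asr_model = Seq2Seq(vocab=1000)
# ... train asr_model on the ASR data, then save it ...
torch.save(asr_model.state_dict(), "asr_pretrained.pt")   # hypothetical path

# Step 2: initialize the ST encoder from the ASR checkpoint, then fine-tune on ST data.
st_model = Seq2Seq(vocab=8000)                             # target-language vocabulary
asr_state = torch.load("asr_pretrained.pt")
encoder_state = {k[len("encoder."):]: v for k, v in asr_state.items()
                 if k.startswith("encoder.")}
st_model.encoder.load_state_dict(encoder_state)
# ... fine-tune st_model (all parameters) on the small speech-translation set ...
```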
Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named AUGPE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that AUGPE produces DP synthetic text that yields competitive utility with the SOTA DP finetuning baselines. This underscores the feasibility of relying solely on API access of LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
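The core of a PE-style loop can be sketched without any model training: embed the private texts and a synthetic candidate pool, let each private example vote for its nearest synthetic neighbor, add noise to the vote histogram for differential privacy, and resample and vary the winners via the LLM API. The code below is a schematic illustration under those assumptions, not the AUGPE implementation; `llm_variation` and `embed` are hypothetical stand-ins for an LLM API call and a text-embedding call, and the noise scale is not calibrated to a specific (epsilon, delta).

```python
# Schematic PE-style loop for API-only DP synthetic text (not the AUGPE implementation).
# `llm_variation` and `embed` are hypothetical stand-ins for API calls.
import numpy as np

rng = np.random.default_rng(0)

def llm_variation(text: str) -> str:
    """Stand-in for an LLM API call that paraphrases/rewrites `text`."""
    return text  # placeholder

def embed(texts: list[str]) -> np.ndarray:
    """Stand-in for a text-embedding API; returns one vector per text."""
    return rng.normal(size=(len(texts), 16))  # placeholder

def pe_iteration(private, synthetic, noise_scale=1.0):
    priv_emb, syn_emb = embed(private), embed(synthetic)
    # Each private example votes for its nearest synthetic candidate.
    dists = np.linalg.norm(priv_emb[:, None, :] - syn_emb[None, :, :], axis=-1)
    votes = np.bincount(dists.argmin(axis=1), minlength=len(synthetic)).astype(float)
    # DP histogram: add Gaussian noise (scale must be calibrated to the privacy budget).
    noisy = votes + rng.normal(scale=noise_scale, size=votes.shape)
    # Resample candidates in proportion to their clipped noisy votes, then vary them.
    probs = np.clip(noisy, 0, None)
    probs = probs / probs.sum() if probs.sum() > 0 else np.full(len(synthetic), 1 / len(synthetic))
    picked = rng.choice(len(synthetic), size=len(synthetic), p=probs)
    return [llm_variation(synthetic[i]) for i in picked]

private_texts = ["private example one", "private example two"]    # stand-ins
synthetic_pool = ["seed text a", "seed text b", "seed text c"]    # from initial API prompts
for _ in range(3):
    synthetic_pool = pe_iteration(private_texts, synthetic_pool)
```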
This thesis investigates the computational modeling of belief and related cognitive states as expressed in text and speech. Understanding how speakers or authors convey commitment, certainty, and emotions is crucial for language understanding, yet poses significant challenges for current NLP systems. We present a comprehensive study spanning multiple facets of belief prediction. We begin by re-examining the widely used FactBank corpus, correcting a critical projection error and establishing new state-of-the-art results for author-only belief prediction through multi-task learning and error analysis. We then tackle the more complex task of source-and-target belief prediction, introducing a novel generative framework using Flan-T5. This includes developing a structured database representation for FactBank and proposing a linearized tree generation approach, culminating in the BeLeaf system for visualization and analysis, which achieves state-of-the-art performance on both FactBank and the MDP corpus. With the rise of large language models (LLMs), we investigate their zero-shot capabilities for the source-and-target belief task. We propose Unified and Hybrid prompting frameworks, finding that while current LLMs struggle, particularly with nested beliefs, our Hybrid approach paired with reasoning-focused LLMs achieves new state-of-the-art results on FactBank. Finally, we explore the role of multimodality among multiple cognitive states. We present the first study on multimodal belief prediction using the CB-Prosody corpus, demonstrating that integrating audio features via fine-tuned Whisper models significantly improves performance over text-only BERT models. We further introduce Synthetic Audio Data (SAD), showing that even synthetic audio generated by TTS systems provides orthogonal, beneficial signals for various cognitive state tasks (belief, emotion, sentiment). We conclude by presenting OmniVox, the first systematic evaluation of omni-LLMs for zero-shot emotion recognition directly from audio, demonstrating their competitiveness with fine-tuned models and analyzing their acoustic reasoning capabilities.
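As a small illustration of the text-plus-audio fusion idea, the sketch below concatenates an utterance-level acoustic embedding with a sentence-level text embedding and feeds both to a shallow classifier. It is only a schematic stand-in for the thesis models: the feature extractors are abstracted away as precomputed vectors, and the dimensions and label set are placeholders.

```python
# Schematic late-fusion classifier over precomputed text and audio embeddings.
# Dimensions, label set, and the upstream feature extractors are placeholders.
import torch
import torch.nn as nn

TEXT_DIM, AUDIO_DIM, NUM_CLASSES = 768, 512, 3  # e.g., BERT- and Whisper-sized vectors; 3 belief labels

class FusionClassifier(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(TEXT_DIM + AUDIO_DIM, hidden),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(hidden, NUM_CLASSES),
        )

    def forward(self, text_vec, audio_vec):
        return self.net(torch.cat([text_vec, audio_vec], dim=-1))

model = FusionClassifier()
# Dummy batch standing in for pooled text features and pooled audio features.
text_feats = torch.randn(8, TEXT_DIM)
audio_feats = torch.randn(8, AUDIO_DIM)
logits = model(text_feats, audio_feats)          # (8, NUM_CLASSES)
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, NUM_CLASSES, (8,)))
loss.backward()
```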