Children’s automatic speech recognition (ASR) remains difficult due, in part, to data scarcity, especially for kindergarten-aged children. When data are scarce, models tend to overfit the training data, so a good starting point for training is essential. Recently, meta-learning was proposed to learn a model initialization (MI) for ASR tasks across different languages, yielding good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on the training tasks (learner overfitting), and it is unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI for children’s ASR and attempt to alleviate learner overfitting. To apply model-agnostic meta-learning (MAML), we treat children’s speech at each age as a distinct task. To address learner overfitting, we propose a task-level augmentation method that simulates new ages using frequency warping techniques. Detailed experiments show the impact of task augmentation at each age for kindergarten-aged speech. Our approach achieves a relative word error rate (WER) improvement of 51% over a baseline system with no augmentation or initialization.
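The task-level augmentation described above can be sketched in code. The snippet below is a minimal illustration, not the paper's implementation: it simulates "new age" tasks by applying a linear frequency warp to spectrograms (VTLP-style), on the assumption that children of different ages differ mainly in vocal-tract length. The function names, warp factors, and the dictionary task format are all hypothetical.

```python
import numpy as np

def warp_frequency(spec, alpha):
    """Warp the frequency axis of a spectrogram by factor alpha.
    alpha > 1 pulls energy from higher bins (shorter vocal tract);
    alpha < 1 does the opposite. spec has shape (freq_bins, frames)."""
    n_bins = spec.shape[0]
    src = np.arange(n_bins)
    # Map each target bin back to a (fractional) source bin.
    warped_bins = np.clip(src * alpha, 0, n_bins - 1)
    warped = np.empty_like(spec)
    for t in range(spec.shape[1]):
        warped[:, t] = np.interp(warped_bins, src, spec[:, t])
    return warped

def augment_age_tasks(tasks, alphas=(0.9, 1.1)):
    """Create simulated 'new age' meta-learning tasks from the
    existing per-age tasks (hypothetical dict-of-lists format)."""
    new_tasks = dict(tasks)
    for age, utterances in tasks.items():
        for a in alphas:
            new_tasks[f"{age}_warp{a}"] = [warp_frequency(u, a) for u in utterances]
    return new_tasks
```

In a MAML loop, the augmented dictionary would simply enlarge the pool of tasks sampled for inner-loop adaptation, which is what reduces learner overfitting.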
LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children’s Dialects
This paper proposes a novel linear prediction coding (LPC)-based data augmentation method for children's low- and zero-resource dialect ASR. The augmentation procedure perturbs the formant peaks of the LPC spectrum during LPC analysis and reconstruction. The method is evaluated on two novel children's speech datasets, one containing California English from the Southern California area and the other containing a mix of Southern American English and African American English from the Atlanta, Georgia area. We test the proposed method when training both an HMM-DNN system and an end-to-end system to show model robustness, and demonstrate that the algorithm improves ASR performance, especially for the zero-resource dialect children's task, compared to common data augmentation methods such as VTLP, speed perturbation, and SpecAugment.
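The analysis-perturbation-reconstruction pipeline can be illustrated with a short sketch. This is an assumption-laden toy version, not the paper's algorithm: it fits LPC coefficients by the autocorrelation method, inverse-filters to get the residual, shifts formant frequencies by scaling the angles of the LPC filter's complex poles, and re-synthesizes. The `scale` factor and pole-angle scheme are hypothetical stand-ins for the paper's perturbation.

```python
import numpy as np
from scipy.signal import lfilter

def lpc_coeffs(frame, order):
    """Autocorrelation-method LPC via the normal equations.
    Returns A(z) = 1 - sum_k a_k z^-k as a coefficient vector."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate(([1.0], -a))

def perturb_formants(frame, order=12, scale=1.08):
    """Shift formant frequencies by scaling the angles of the LPC
    poles (hypothetical perturbation; the paper's exact scheme may
    differ), then re-synthesize from the unchanged residual."""
    a = lpc_coeffs(frame, order)
    residual = lfilter(a, [1.0], frame)  # inverse filtering
    new_poles = []
    for p in np.roots(a):
        if p.imag > 1e-8:  # perturb each conjugate pair once
            r, ang = np.abs(p), np.angle(p)
            q = r * np.exp(1j * min(ang * scale, 0.99 * np.pi))
            new_poles.extend([q, np.conj(q)])
        elif abs(p.imag) <= 1e-8:
            new_poles.append(p.real)  # leave real poles alone
    a_new = np.real(np.poly(new_poles))
    return lfilter([1.0], a_new, residual)  # re-synthesis
```

Because the pole radii are unchanged, the perturbed synthesis filter stays stable; only the formant center frequencies move, which mimics a dialect/speaker shift while preserving the excitation.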
- Award ID(s):
- Publication Date:
- NSF-PAR ID:
- Journal Name: 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Page Range or eLocation-ID: 8577 to 8581
- Sponsoring Org: National Science Foundation
More Like this
Toward Fully Autonomous Seismic Networks: Backprojecting Deep Learning-Based Phase Time Functions for Earthquake Monitoring on Continuous Recordings
Abstract: Accurate and (near) real-time earthquake monitoring provides the spatial and temporal behaviors of earthquakes for understanding the nature of earthquakes, and also helps in regional seismic hazard assessments and mitigations. Because of the increase in both the quality and quantity of seismic data, an automated earthquake monitoring system is needed. Most of the traditional methods for detecting earthquake signals and picking phases are based on analyses of features in recordings of an individual earthquake and/or their differences from background noises. When seismicity is high, the seismograms are complicated, and, therefore, traditional analysis methods often fail. With the development of machine learning algorithms, earthquake signal detection and seismic phase picking can be more accurate using the features obtained from a large amount of earthquake recordings. We have developed an attention recurrent residual U-Net algorithm, and used data augmentation techniques to improve the accuracy of earthquake detection and seismic phase picking on complex seismograms that record multiple earthquakes. The use of probability functions of P and S arrivals and potential P and S arrival pairs of earthquakes can increase the computational efficiency and accuracy of backprojection for earthquake monitoring in large areas. We applied our workflow to monitor the earthquake activity …
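The core of backprojecting phase-time functions can be sketched briefly. The toy below is an assumption: it stacks per-station phase-probability traces, each shifted by a precomputed travel time to a candidate source cell, so a coherent peak in the stacked image marks a source cell and origin time. The array shapes and function name are illustrative only.

```python
import numpy as np

def backproject(phase_prob, travel_time, dt):
    """Stack station phase-probability traces over candidate source cells.
    phase_prob:  (n_stations, n_samples) phase-arrival probability traces
    travel_time: (n_cells, n_stations) predicted travel times in seconds
    dt:          sample interval in seconds
    Returns an (n_cells, n_samples) image indexed by origin time."""
    n_cells, n_sta = travel_time.shape
    n_samp = phase_prob.shape[1]
    image = np.zeros((n_cells, n_samp))
    for c in range(n_cells):
        for s in range(n_sta):
            # Shift each trace back by its travel time so that, for the
            # true cell, all arrival peaks align at the origin time.
            shift = int(round(travel_time[c, s] / dt))
            image[c, :n_samp - shift] += phase_prob[s, shift:]
    return image / n_sta
```

Real systems vectorize the shifts and add an S-phase stack, but the alignment principle is the same.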
The lack of adequate training data is one of the major hurdles in WiFi-based activity recognition systems. In this paper, we propose Wi-Fringe, which is a WiFi CSI-based device-free human gesture recognition system that recognizes named gestures, i.e., activities and gestures that have a semantically meaningful name in the English language, as opposed to arbitrary free-form gestures. Given a list of activities (only their names in English text), along with zero or more training examples (WiFi CSI values) per activity, Wi-Fringe is able to detect all activities at runtime. We show for the first time that by utilizing the state-of-the-art semantic representation of English words, which is learned from datasets like Wikipedia (e.g., Google's word-to-vector) and verb attributes learned from how a word is defined (e.g., American Heritage Dictionary), we can enhance the capability of WiFi-based named gesture recognition systems that lack adequate training examples per class. We propose a novel cross-domain knowledge transfer algorithm between radio frequency (RF) and text to lessen the burden on developers and end-users from the tedious task of data collection for all possible activities. To evaluate Wi-Fringe, we collect data from four volunteers in a multi-person apartment and an office building for a …
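The zero-shot idea behind this cross-domain transfer can be shown in miniature. The sketch below assumes (hypothetically) that a learned projection has already mapped an RF/CSI feature into the word-embedding space; classification is then a nearest-neighbor match against the embeddings of the activity names. It is an illustration of the general technique, not Wi-Fringe's algorithm.

```python
import numpy as np

def zero_shot_classify(rf_embedding, word_vectors):
    """Match a projected RF (CSI) feature against activity-name word
    embeddings by cosine similarity; return the best-matching name.
    word_vectors: dict mapping activity name -> embedding vector."""
    names = list(word_vectors)
    W = np.stack([word_vectors[n] for n in names])
    sims = W @ rf_embedding / (
        np.linalg.norm(W, axis=1) * np.linalg.norm(rf_embedding) + 1e-8
    )
    return names[int(np.argmax(sims))]
```

Because the label side is pure text, new activities can be added at runtime by adding their word vectors, with zero new CSI examples.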
We evaluate several publicly available off-the-shelf (commercial and research) automatic speech recognition (ASR) systems on dialogue agent-directed English speech from speakers with General American vs. non-American accents. Our results show that the performance of the ASR systems for non-American accents is considerably worse than for General American accents. Depending on the recognizer, the absolute difference in performance between General American accents and all non-American accents combined can vary approximately from 2% to 12%, with relative differences varying approximately between 16% and 49%. This drop in performance becomes even larger when we consider specific categories of non-American accents, indicating a need for more diligent collection of and training on non-native English speaker data in order to narrow this performance gap. There are performance differences across ASR systems, and while the same general pattern holds, with more errors for non-American accents, there are some accents for which the best recognizer is different than in the overall case. We expect these results to be useful for dialogue system designers in developing more robust inclusive dialogue systems, and for ASR providers in taking into account performance requirements for different accents.
This paper describes experiments in training HMM-based text-to-speech (TTS) voices on data collected for Automatic Speech Recognition (ASR) training. We compare a number of filtering techniques designed to identify the best utterances from a noisy, multi-speaker corpus for training voices, to exclude speech containing noise and to include speech close in nature to more traditionally-collected TTS corpora. We also evaluate the use of automatic speech recognizers for intelligibility assessment in comparison with crowdsourcing methods. While the goal of this work is to develop natural-sounding and intelligible TTS voices in Low Resource Languages (LRLs) rapidly and easily, without the expense of recording data specifically for this purpose, we focus on English initially to identify the best filtering techniques and evaluation methods. We find that, when a large amount of data is available, selecting from the corpus based on criteria such as standard deviation of f0, fast speaking rate, and hypo-articulation produces the most intelligible voices.
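The corpus-filtering criteria mentioned above can be sketched as a simple ranking. This is a hypothetical illustration: the dict keys, the two criteria used (f0 standard deviation and speaking rate; hypo-articulation is omitted for brevity), and the equal-weight z-score combination are all assumptions, not the paper's selection procedure.

```python
import numpy as np

def select_tts_utterances(stats, k):
    """Rank utterances by high f0 standard deviation and fast speaking
    rate, then keep the top k for TTS training.
    stats: list of dicts with hypothetical keys
           'id', 'f0_std', 'speaking_rate'."""
    f0 = np.array([s["f0_std"] for s in stats])
    rate = np.array([s["speaking_rate"] for s in stats])

    def z(x):
        # z-score each criterion so they combine on a common scale
        return (x - x.mean()) / (x.std() + 1e-8)

    score = z(f0) + z(rate)
    order = np.argsort(-score)
    return [stats[i]["id"] for i in order[:k]]
```

In practice the per-utterance statistics would come from an f0 tracker and a forced aligner, and the weights would be tuned against intelligibility scores.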