Title: LPC Augment: an LPC-based ASR Data Augmentation Algorithm for Low and Zero-Resource Children’s Dialects
This paper proposes a novel linear predictive coding (LPC)-based data augmentation method for low- and zero-resource dialect ASR for children's speech. The augmentation procedure perturbs the formant peaks of the LPC spectrum during LPC analysis and reconstruction. The method is evaluated on two novel children's speech datasets: one containing California English from the Southern California area, and the other containing a mix of Southern American English and African American English from the Atlanta, Georgia area. We test the proposed method when training both an HMM-DNN system and an end-to-end system to show model robustness, and demonstrate that the algorithm improves ASR performance, especially on the zero-resource dialect children's task, compared to common data augmentation methods such as VTLP, speed perturbation, and SpecAugment.
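To make the procedure concrete, here is a minimal Python sketch of the core idea, assuming per-frame processing and a relative pole-angle jitter `alpha`; the function name, parameter values, and frame handling are illustrative assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy.signal import lfilter
import librosa

def lpc_augment_frame(frame, order=12, alpha=0.1, rng=None):
    """Perturb the formant peaks of one speech frame via LPC analysis
    and resynthesis (a sketch of the idea, not the paper's exact code)."""
    rng = rng or np.random.default_rng()
    # 1) LPC analysis: fit an all-pole model of the frame's spectral envelope
    a = librosa.lpc(frame.astype(float), order=order)
    # 2) Inverse-filter the frame to obtain the excitation (residual)
    residual = lfilter(a, [1.0], frame)
    # 3) Formant peaks sit at the angles of the complex pole pairs;
    #    randomly scale each angle to shift the corresponding formant
    new_poles = []
    for p in np.roots(a):
        if p.imag > 1e-8:                          # one pole per conjugate pair
            theta = np.angle(p) * rng.uniform(1 - alpha, 1 + alpha)
            q = np.abs(p) * np.exp(1j * np.clip(theta, 0.0, np.pi))
            new_poles += [q, np.conj(q)]
        elif p.imag < -1e-8:
            continue                               # covered by its conjugate
        else:
            new_poles.append(p)                    # leave real poles alone
    a_new = np.real(np.poly(new_poles))
    # 4) Resynthesize: drive the perturbed all-pole filter with the residual
    return lfilter([1.0], a_new, residual)
```

Jittering only the pole angles leaves the pole magnitudes, and hence filter stability, unchanged; windowing, overlap-add, and per-utterance perturbation schedules are omitted for brevity.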
Award ID(s):
1734380
NSF-PAR ID:
10354608
Journal Name:
2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Page Range / eLocation ID:
8577 to 8581
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. IEEE SIGNAL PROCESSING SOCIETY (Ed.)
    This paper presents a novel system that utilizes acoustic, phonological, morphosyntactic, and prosodic information for binary automatic dialect detection of African American English. We train the system on adult speech data and then evaluate it on both children's and adults' speech under unmatched training and testing scenarios. The proposed system combines novel and state-of-the-art architectures, including a multi-source transformer language model pre-trained on Twitter text data and fine-tuned on ASR transcripts, as well as an LSTM acoustic model trained on self-supervised learning representations, in order to learn a comprehensive view of dialect. We show robust, explainable performance across recording conditions for different features on adult speech, but find that fusing multiple features is important for good results on children's speech.
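As a rough illustration of the score-level fusion such a system might use, the following snippet combines a text-branch posterior (transformer LM over ASR transcripts) with an acoustic-branch posterior (LSTM over self-supervised features); the weighted-average rule, its weight, and the decision threshold are assumptions, not the paper's exact fusion scheme:

```python
import numpy as np

def fuse_dialect_scores(p_text, p_acoustic, w_text=0.5, threshold=0.5):
    """Late fusion of two branch posteriors for binary AAE detection.
    The weighted average and its weight are illustrative assumptions."""
    p = w_text * np.asarray(p_text) + (1.0 - w_text) * np.asarray(p_acoustic)
    return p, p > threshold

# Hypothetical per-utterance posteriors from the two branches
probs, decisions = fuse_dialect_scores([0.72, 0.31], [0.55, 0.40])
print(probs, decisions)   # fused scores and binary dialect decisions
```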
  2. Children’s automatic speech recognition (ASR) is difficult due, in part, to data scarcity, especially for kindergarten-aged children. When data are scarce, the model can overfit to the training data, so good starting points for training are essential. Recently, meta-learning was proposed to learn model initialization (MI) for ASR tasks across different languages, and this method leads to good performance when the model is adapted to an unseen language. However, MI is vulnerable to overfitting on training tasks (learner overfitting), and it is unknown whether MI generalizes to other low-resource tasks. In this paper, we validate the effectiveness of MI in children’s ASR and attempt to alleviate learner overfitting. To apply model-agnostic meta-learning (MAML), we regard children’s speech at each age as a different task. To address learner overfitting, we propose a task-level augmentation method that simulates new ages using frequency warping techniques. Detailed experiments show the impact of task augmentation on each age for kindergarten-aged speech. As a result, our approach achieves a relative word error rate (WER) improvement of 51% over a baseline system with no augmentation or initialization.
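A compact sketch of the meta-learning loop described above, using first-order MAML in PyTorch: treating each age (real or simulated via frequency warping) as a task follows the abstract, while the first-order approximation, the single inner step, and the toy interfaces are illustrative assumptions:

```python
import copy
import torch
import torch.nn as nn

def fomaml_step(model, tasks, meta_opt, inner_lr=1e-2,
                loss_fn=nn.CrossEntropyLoss()):
    """One first-order MAML meta-update over a batch of age-tasks.

    Each task is ((x_support, y_support), (x_query, y_query)); simulated
    ages created by frequency warping count as additional tasks.
    `meta_opt` is assumed to be an optimizer over model.parameters().
    """
    meta_opt.zero_grad()
    for (xs, ys), (xq, yq) in tasks:
        learner = copy.deepcopy(model)          # per-task fast weights
        # Inner loop: one adaptation step on the task's support set
        loss = loss_fn(learner(xs), ys)
        grads = torch.autograd.grad(loss, learner.parameters())
        with torch.no_grad():
            for p, g in zip(learner.parameters(), grads):
                p -= inner_lr * g
        # Outer loop: evaluate the adapted weights on the query set and
        # accumulate (first-order) meta-gradients onto the shared model
        q_loss = loss_fn(learner(xq), yq)
        q_grads = torch.autograd.grad(q_loss, learner.parameters())
        for p, g in zip(model.parameters(), q_grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```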
  3. ISCA (Ed.)
    In this paper, we explore automatic prediction of dialect density for African American English (AAE), where dialect density is defined as the percentage of words in an utterance that contain characteristics of the non-standard dialect. We investigate several acoustic and language modeling features, including the commonly used X-vector representation and ComParE feature set, in addition to information extracted from ASR transcripts of the audio files and prosodic information. To address the limited labeled data, we use a weakly supervised model to project prosodic and X-vector features into low-dimensional, task-relevant representations. An XGBoost model is then used to predict the speaker's dialect density from these features and to show which features are most significant during inference. We evaluate the utility of these features both alone and in combination for the given task. This work, which does not rely on hand-labeled transcripts, is performed on audio segments from the CORAAL database. We show a significant correlation between our predicted and ground-truth dialect density measures for AAE speech in this database, and propose this work as a tool for explaining and mitigating bias in speech technology.
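The final prediction stage might look like the following sketch, assuming the weakly supervised projections have already produced a low-dimensional feature matrix; the placeholder data and hyperparameters are illustrative, not the paper's settings:

```python
import numpy as np
from xgboost import XGBRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 16))  # placeholder: projected X-vector/prosodic features
y = rng.random(200)        # placeholder: dialect-density labels in [0, 1]

# Gradient-boosted trees regress dialect density from the fused features
model = XGBRegressor(n_estimators=200, max_depth=4, learning_rate=0.1)
model.fit(X, y)
print(model.predict(X[:3]))        # predicted dialect density per utterance
print(model.feature_importances_)  # which features matter most at inference
```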
  4. Model explanations that shed light on a model’s predictions are becoming a desired additional output of NLP models, alongside the predictions themselves. Challenges in creating these explanations include making them trustworthy and faithful to the model’s predictions. In this work, we propose a novel framework for guiding model explanations by supervising them explicitly. To this end, our method, LEXPLAIN, uses task-related lexicons to directly supervise model explanations. This approach consistently improves the plausibility of the model’s explanations without sacrificing performance on the task, as we demonstrate on sentiment analysis and toxicity detection. Our analyses show that our method also demotes spurious correlations (i.e., with respect to African American English dialect) on toxicity detection, improving fairness.
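One way to realize such lexicon supervision is an auxiliary loss that pushes per-token explanation scores toward lexicon membership; the BCE form, the mixing weight `lam`, and the score interface below are assumptions, not LEXPLAIN's exact objective:

```python
import torch
import torch.nn as nn

def lexicon_supervised_loss(token_scores, lexicon_mask, task_loss, lam=0.5):
    """Combine the task loss with an explanation loss that pushes per-token
    attribution logits toward a task lexicon (mask: 1 = token in lexicon).
    A minimal sketch of lexicon-based explanation supervision."""
    expl_loss = nn.functional.binary_cross_entropy_with_logits(
        token_scores, lexicon_mask.float())
    return task_loss + lam * expl_loss
```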
  5. Accurate and (near) real-time earthquake monitoring provides the spatial and temporal behaviors of earthquakes for understanding their nature, and also aids regional seismic hazard assessment and mitigation. Because of the increase in both the quality and quantity of seismic data, an automated earthquake monitoring system is needed. Most traditional methods for detecting earthquake signals and picking phases are based on analyses of features in recordings of an individual earthquake and/or their differences from background noise. When seismicity is high, the seismograms are complicated, and traditional analysis methods often fail. With the development of machine learning algorithms, earthquake signal detection and seismic phase picking can be made more accurate using features learned from a large number of earthquake recordings. We have developed an attention recurrent residual U-Net algorithm and used data augmentation techniques to improve the accuracy of earthquake detection and seismic phase picking on complex seismograms that record multiple earthquakes. Using probability functions of P and S arrivals, and potential P and S arrival pairs of earthquakes, increases the computational efficiency and accuracy of backprojection for earthquake monitoring over large areas. We applied our workflow to monitor the earthquake activity in southern California during the 2019 Ridgecrest sequence. The distribution of earthquakes determined by our method is consistent with that in the Southern California Earthquake Data Center (SCEDC) catalog, and the number of earthquakes in our catalog is more than three times that of the SCEDC catalog. Our method identifies additional earthquakes that are close in origin time and/or location but are not included in the SCEDC catalog, and it avoids misidentification of seismic phases for earthquake location. In general, our algorithm provides reliable earthquake monitoring over a large area, even during periods of high seismicity.
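As one small, concrete piece of such a workflow, converting the network's per-sample P/S arrival probability traces into discrete picks can be sketched as follows; the threshold, minimum peak separation, and interfaces are illustrative assumptions, not the authors' exact post-processing:

```python
import numpy as np
from scipy.signal import find_peaks

def pick_phases(prob_p, prob_s, dt=0.01, thr=0.5, min_sep=1.0):
    """Turn per-sample P/S arrival probability traces (e.g., the output of
    a U-Net-style picker) into arrival times in seconds. The threshold
    `thr` and minimum peak separation `min_sep` are assumed values."""
    dist = max(1, int(min_sep / dt))   # minimum gap between picks, in samples
    p_idx, _ = find_peaks(np.asarray(prob_p), height=thr, distance=dist)
    s_idx, _ = find_peaks(np.asarray(prob_s), height=thr, distance=dist)
    return p_idx * dt, s_idx * dt
```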