NSF PAR Search | NSF Public Access Repository

Estimating Shapley Values of Training Utterances for Automatic Speech Recognition Models

https://doi.org/10.1109/ICASSP49357.2023.10097237

Raza Syed, Ali; Mandel, Michael I. (June 2023, IEEE International Conference on Acoustics Speech and Signal Processing)

Data valuation in machine learning is concerned with quantifying the relative contribution of a training example to a model’s performance. Quantifying the importance of training examples is useful for identifying high and low quality data to curate training datasets and for addressing data quality issues. Shapley values have gained traction in machine learning for curating training data and identifying data quality issues. While computing the Shapley values of training examples is computationally prohibitive, approximation methods have been used successfully for classification models in computer vision tasks. We investigate data valuation for Automatic Speech Recognition (ASR) models, which perform a structured prediction task, and propose a method for estimating Shapley values for these models. We show that a proxy model can be learned for the acoustic model component of an end-to-end ASR and used to estimate Shapley values for acoustic frames. We present a method for using the proxy acoustic model to estimate Shapley values for variable length utterances and demonstrate that the Shapley values provide a signal of example quality.
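To make the idea of approximating Shapley values concrete, here is a minimal sketch of a generic permutation-sampling (Monte Carlo) estimator. It is not the method proposed in the paper, which relies on a proxy acoustic model; the function names, the `quality` scores, and the additive toy utility are all assumptions chosen for illustration.

```python
import random


def monte_carlo_shapley(n_points, utility, n_permutations=200, seed=0):
    """Estimate Shapley values by averaging marginal contributions
    over randomly sampled orderings of the training examples."""
    rng = random.Random(seed)
    values = [0.0] * n_points
    for _ in range(n_permutations):
        order = list(range(n_points))
        rng.shuffle(order)
        subset = set()
        prev = utility(subset)  # utility of the empty training set
        for i in order:
            subset.add(i)
            curr = utility(subset)
            values[i] += curr - prev  # marginal contribution of example i
            prev = curr
    return [v / n_permutations for v in values]


# Hypothetical toy utility: each example adds a fixed amount to model quality.
# A real utility would train and evaluate a model on the given subset.
quality = [0.5, 0.2, -0.3, 0.1]


def toy_utility(subset):
    return sum(quality[i] for i in subset)


estimates = monte_carlo_shapley(len(quality), toy_utility)
```

With an additive utility like this one, every marginal contribution equals the example's own score, so the estimates recover `quality` up to rounding. Real utilities are not additive, which is why many permutations (or a cheaper proxy model) are needed.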
Full Text Available
ImportantAug: A Data Augmentation Agent for Speech

https://doi.org/10.1109/ICASSP43922.2022.9747003

Trinh, Viet Anh; Salami Kavaki, Hassan; Mandel, Michael I (May 2022, IEEE International Conference on Acoustics, Speech and Signal Processing)

We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms the conventional noise augmentation and the baseline on two test sets with additional noise added.
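The core mixing step, adding noise only where the signal is judged unimportant, can be sketched as follows. This is an illustration under assumed conventions, not the authors' implementation: `importance` is taken to be a per-sample value in [0, 1] (1 meaning important), the signals are plain Python lists, and the function name is hypothetical.

```python
def mix_with_importance(speech, noise, importance):
    """Add noise scaled by (1 - importance), so that regions marked
    important (importance near 1) receive little or no noise."""
    if not (len(speech) == len(noise) == len(importance)):
        raise ValueError("speech, noise and importance must have equal length")
    return [s + (1.0 - m) * n for s, n, m in zip(speech, noise, importance)]


# Toy example: the first sample is fully protected, the second fully exposed.
speech = [1.0, 1.0, 1.0]
noise = [0.5, 0.5, 0.5]
importance = [1.0, 0.0, 0.5]
augmented = mix_with_importance(speech, noise, importance)
```

In the paper the importance is predicted by a trained agent; here it is simply supplied by hand.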
Full Text Available
Directly Comparing the Listening Strategies of Humans and Machines

https://doi.org/10.1109/TASLP.2020.3040545

Trinh, Viet Anh; Mandel, Michael (January 2021, IEEE/ACM Transactions on Audio, Speech, and Language Processing)
Full Text Available
Identifying Important Time-Frequency Locations in Continuous Speech Utterances

https://doi.org/10.21437/Interspeech.2020-2637

Kavaki, Hassan Salami; Mandel, Michael I. (January 2020, Proceedings of Interspeech)
Human listeners use specific cues to recognize speech and recent experiments have shown that certain time-frequency regions of individual utterances are more important to their correct identification than others. A model that could identify such cues or regions from clean speech would facilitate speech recognition and speech enhancement by focusing on those important regions. Thus, in this paper we present a model that can predict the regions of individual utterances that are important to an automatic speech recognition (ASR) “listener” by learning to add as much noise as possible to these utterances while still permitting the ASR to correctly identify them. This work utilizes a continuous speech recognizer to recognize multi-word utterances and builds upon our previous work that performed the same process for an isolated word recognizer. Our experimental results indicate that our model can apply noise to obscure 90.5% of the spectrogram while leaving recognition performance nearly unchanged.
Full Text Available
Large Scale Evaluation of Importance Maps in Automatic Speech Recognition

https://doi.org/10.21437/Interspeech.2020-2883

Trinh, Viet Anh; Mandel, Michael I. (January 2020, Proceedings of Interspeech)
This paper proposes a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate time-frequency points of the utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks, but is also appropriate for structured prediction tasks like sequence-to-sequence models. Additionally, we use this approach to perform a comparison of the importance maps created by our previously introduced technique using “bubble noise” to identify important points through correlation with a baseline approach based on smoothed speech energy and forced alignment. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus.
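The correlational idea behind bubble-noise importance maps can be illustrated with a simplified sketch: across many noisy trials, correlate whether each time-frequency point was audible with whether the word was recognized. The data layout and function names below are assumptions for illustration and do not reproduce the authors' exact analysis.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def importance_map(audible, correct):
    """audible[t][p] is 1 if point p was free of noise in trial t;
    correct[t] is 1 if the recognizer got trial t right."""
    n_points = len(audible[0])
    return [pearson([trial[p] for trial in audible], correct)
            for p in range(n_points)]


# Toy data: recognition succeeds exactly when point 0 is audible.
audible = [[1, 0], [1, 1], [0, 1], [0, 0]]
correct = [1, 1, 0, 0]
scores = importance_map(audible, correct)
```

Here point 0 is perfectly correlated with correct recognition while point 1 is uncorrelated, so only point 0 would be flagged as important.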
Full Text Available
The Bubble Noise Technique for Speech Perception Research

https://doi.org/10.1044/2019_PERS-19-00058

Mandel, Michael I.; Grover, Vikas; Zhao, Mengxuan; Choi, Jiyoung; Shafer, Valerie L. (December 2019, Perspectives of the ASHA Special Interest Groups)

Purpose: The “bubble noise” technique has recently been introduced as a method to identify the regions in time–frequency maps (i.e., spectrograms) of speech that are especially important for listeners in speech recognition. This technique identifies regions of “importance” that are specific to the speech stimulus and the listener, thus permitting these regions to be compared across different listener groups. For example, in cross-linguistic and second-language (L2) speech perception, this method identifies differences in regions of importance in accomplishing decisions of phoneme category membership. This research note describes the application of bubble noise to the study of language learning for 3 different language pairs: Hindi-English bilinguals' perception of the /v/–/w/ contrast in American English, native English speakers' perception of the tense/lax contrast for Korean fricatives and affricates, and native English speakers' perception of Mandarin lexical tone. Conclusion: We demonstrate that this technique provides insight on what information in the speech signal is important for native/first-language listeners compared to nonnative/L2 listeners. Furthermore, the method can be used to examine whether L2 speech perception training is effective in bringing the listener's attention to the important cues.
Full Text Available
Bubble Cooperative Networks for Identifying Important Speech Cues

https://doi.org/10.21437/Interspeech.2018-2377

Trinh, Viet Anh; McFee, Brian; Mandel, Michael I (September 2018, Interspeech 2018)

Predicting the intelligibility of noisy recordings is difficult and most current algorithms treat all speech energy as equally important to intelligibility. Our previous work on human perception used a listening test paradigm and correlational analysis to show that some energy is more important to intelligibility than other energy. In this paper, we propose a system called the Bubble Cooperative Network (BCN), which aims to predict important areas of individual utterances directly from clean speech. Given such a prediction, noise is added to the utterance in unimportant regions and then presented to a recognizer. The BCN is trained with a loss that encourages it to add as much noise as possible while preserving recognition performance, encouraging it to identify important regions precisely and place the noise everywhere else. Empirical evaluation shows that the BCN can obscure 97.7% of the spectrogram with noise while maintaining recognition accuracy for a simple speech recognizer that compares a noisy test utterance with a clean reference utterance. The masks predicted by a single BCN on several utterances show patterns that are similar to analyses derived from human listening tests that analyze each utterance separately, while exhibiting better generalization and less context-dependence than previous approaches.
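The trade-off described in the abstract, adding as much noise as possible while preserving recognition, can be written as a single scalar objective. The sketch below is one plausible form under stated assumptions, not the loss used in the paper: `mask` values lie in [0, 1] with 1 meaning "keep clean", and `noise_weight` is a hypothetical balancing hyperparameter.

```python
def bcn_style_loss(recognition_loss, mask, noise_weight=1.0):
    """Combine a recognition loss with a reward for masking more of the input.

    The fraction of the input covered by noise is mean(1 - mask); subtracting
    it means that, at equal recognition loss, noisier masks score lower.
    """
    noise_fraction = sum(1.0 - m for m in mask) / len(mask)
    return recognition_loss - noise_weight * noise_fraction


# Two masks with the same recognition loss: the one that hides more is preferred.
sparse_mask = [1.0, 0.0, 0.0, 0.0]  # keeps 25% clean
dense_mask = [1.0, 1.0, 1.0, 0.0]   # keeps 75% clean
loss_sparse = bcn_style_loss(0.4, sparse_mask, noise_weight=0.5)
loss_dense = bcn_style_loss(0.4, dense_mask, noise_weight=0.5)
```

Minimizing such an objective pushes the mask toward the smallest set of regions that still supports correct recognition, which is the behavior the abstract attributes to the BCN.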
Full Text Available