- PAR ID: 10341980
- Date Published:
- Journal Name: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- Page Range / eLocation ID: 4678 to 4682
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Fu, Qian-Jie (Ed.) Face masks are an important tool for preventing the spread of COVID-19. However, it is unclear how different types of masks affect speech recognition in different levels of background noise. To address this, we investigated the effects of four masks (a surgical mask, N95 respirator, and two cloth masks) on recognition of spoken sentences in multi-talker babble. In low levels of background noise, masks had little to no effect, with no more than a 5.5% decrease in mean accuracy compared to a no-mask condition. In high levels of noise, mean accuracy was 2.8-18.2% lower than the no-mask condition, but the surgical mask continued to show no significant difference. The results demonstrate that different types of masks generally yield similar accuracy in low levels of background noise, but differences between masks become more apparent in high levels of noise.
- Over the past two years, face masks have been a critical tool for preventing the spread of COVID-19. While previous studies have examined the effects of masks on speech recognition, much of this work was conducted early in the pandemic. Given that human listeners are able to adapt to a wide variety of novel contexts in speech perception, an open question concerns the extent to which listeners have adapted to masked speech during the pandemic. In order to evaluate this, we replicated Toscano and Toscano (PLOS ONE 16(2):e0246842, 2021), looking at the effects of several types of face masks on speech recognition in different levels of multi-talker babble noise. We also examined the effects of listeners' self-reported frequency of encounters with masked speech and the effects of the implementation of public mask mandates on speech recognition. Overall, we found that listeners' performance in the current experiment (with data collected in 2021) was similar to that of listeners in Toscano and Toscano (with data collected in 2020) and that performance did not differ based on mask experience. These findings suggest that listeners may have already adapted to masked speech by the time data were collected in 2020, are unable to adapt to masked speech, require additional context to be able to adapt, or that talkers also changed their productions over time. Implications for theories of perceptual learning in speech are discussed.
- The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.
- Most existing audio-text emotion recognition studies have focused on the computational modeling aspects, including strategies for fusing the modalities. An area that has received less attention is understanding the role of proper temporal synchronization between the modalities in the model performance. This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies to align text and speech. The approach creates chunks with alternative alignment strategies with different levels of dependency on the underlying lexical boundaries. A key contribution of this study is the multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries. For every epoch, the approach generates a different alignment for each sentence, serving as an effective regularization method for temporal dependency. Our experimental results based on the MSP-Podcast corpus indicate that providing precise temporal alignment information to create the audio-text chunks does not improve the performance of the system. The attention mechanisms in the transformer-based approach are able to compensate for imperfect synchronization between the modalities. However, using exact lexical boundaries makes the system highly vulnerable to missing modalities. In contrast, the model trained with the proposed multi-scale chunk regularization strategy using random alignment can significantly increase its robustness against missing data and remain effective, even under a single audio-only emotion recognition task. The code is available at: https://github.com/winston-lin-wei-cheng/MultiScale-Chunk-Regularization
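  As a rough illustration of the multi-scale chunk regularization idea described in that abstract, the sketch below draws a fresh random chunking of an utterance's acoustic frames on every epoch, ignoring lexical boundaries, and mean-pools the frames within each chunk. The function names, feature dimensions, and pooling choice are assumptions made for illustration only; the authors' actual implementation is at the linked repository.

  ```python
  import numpy as np

  def random_chunk_boundaries(num_frames, num_chunks, rng):
      """Sample chunk boundaries uniformly at random, ignoring word boundaries.

      Returns a list of (start, end) frame indices covering the utterance.
      """
      # Choose num_chunks - 1 distinct interior cut points, then sort them.
      cuts = rng.choice(np.arange(1, num_frames), size=num_chunks - 1, replace=False)
      edges = np.concatenate(([0], np.sort(cuts), [num_frames]))
      return list(zip(edges[:-1], edges[1:]))

  def pool_chunks(frame_features, boundaries):
      """Mean-pool frame-level features within each chunk to get chunk embeddings."""
      return np.stack([frame_features[s:e].mean(axis=0) for s, e in boundaries])

  # A new random alignment is drawn every epoch for each utterance, so the model
  # never sees the same audio-text chunking twice (the regularization effect).
  rng = np.random.default_rng()
  frames = np.random.randn(300, 64)  # e.g. 300 acoustic frames, 64-dim features (hypothetical sizes)
  for epoch in range(3):
      chunks = pool_chunks(frames, random_chunk_boundaries(len(frames), num_chunks=8, rng=rng))
      print(epoch, chunks.shape)  # (8, 64) chunk embeddings, different segmentation each epoch
  ```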
- Speech enhancement tasks have seen significant improvements with the advance of deep learning technology, but with the cost of increased computational complexity. In this study, we propose an adaptive boosting approach to learning locality sensitive hash codes, which represent audio spectra efficiently. We use the learned hash codes for single-channel speech denoising tasks as an alternative to a complex machine learning model, particularly to address the resource-constrained environments. Our adaptive boosting algorithm learns simple logistic regressors as the weak learners. Once trained, their binary classification results transform each spectrum of test noisy speech into a bit string. Simple bitwise operations calculate Hamming distance to find the K-nearest matching frames in the dictionary of training noisy speech spectra, whose associated ideal binary masks are averaged to estimate the denoising mask for that test mixture. Our proposed learning algorithm differs from AdaBoost in the sense that the projections are trained to minimize the distances between the self-similarity matrix of the hash codes and that of the original spectra, rather than the misclassification rate. We evaluate our discriminative hash codes on the TIMIT corpus with various noise types, and show comparative performance to deep learning methods in terms of denoising performance and complexity.
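  To make the hash-code lookup in that abstract concrete, here is a minimal sketch of the test-time denoising step: each spectrum is projected by the trained weak learners and thresholded into a bit string, Hamming distances to the dictionary codes select the K nearest training frames, and their ideal binary masks are averaged into the estimated mask. Array shapes, variable names, and the random "learned" projections below are placeholders, and NumPy elementwise comparisons stand in for the bitwise XOR/popcount operations the paper describes.

  ```python
  import numpy as np

  def spectra_to_bits(spectra, projections, biases):
      """Apply the learned linear projections (boosted weak learners) and threshold
      to turn each spectrum into a binary hash code of length num_bits."""
      return (spectra @ projections + biases > 0).astype(np.uint8)  # shape (N, num_bits)

  def knn_denoise(test_code, train_codes, train_masks, k=10):
      """Estimate a denoising mask for one test frame: find the K training frames
      with the smallest Hamming distance and average their ideal binary masks."""
      hamming = np.count_nonzero(train_codes != test_code, axis=1)  # Hamming distance per dictionary frame
      nearest = np.argsort(hamming)[:k]
      return train_masks[nearest].mean(axis=0)  # soft mask in [0, 1]

  # Toy example with random stand-ins for the learned projections (hypothetical sizes).
  rng = np.random.default_rng(0)
  num_bins, num_bits = 257, 64
  projections = rng.standard_normal((num_bins, num_bits))
  biases = rng.standard_normal(num_bits)

  train_spectra = rng.random((1000, num_bins))                       # dictionary of noisy training spectra
  train_masks = (rng.random((1000, num_bins)) > 0.5).astype(float)   # their ideal binary masks
  test_spectrum = rng.random((1, num_bins))

  train_codes = spectra_to_bits(train_spectra, projections, biases)
  test_code = spectra_to_bits(test_spectrum, projections, biases)[0]
  mask = knn_denoise(test_code, train_codes, train_masks, k=10)
  print(mask.shape)  # (257,) estimated mask to apply to the test spectrum
  ```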