-
Data valuation in machine learning is concerned with quantifying the relative contribution of a training example to a model's performance. Quantifying the importance of training examples is useful for identifying high- and low-quality data when curating training datasets and for addressing data quality issues, and Shapley values have gained traction in machine learning for both purposes. While computing exact Shapley values of training examples is computationally prohibitive, approximation methods have been used successfully for classification models in computer vision tasks. We investigate data valuation for automatic speech recognition (ASR) models, which perform a structured prediction task, and propose a method for estimating Shapley values for these models. We show that a proxy model can be learned for the acoustic model component of an end-to-end ASR system and used to estimate Shapley values for acoustic frames. We present a method for using the proxy acoustic model to estimate Shapley values for variable-length utterances and demonstrate that the Shapley values provide a signal of example quality.
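For context, a minimal sketch of permutation-sampling Shapley estimation over training examples follows. It is not the paper's proxy-model method; `utility` is a hypothetical stand-in for "model performance when trained on subset S".

```python
# Truncated Monte Carlo Shapley estimation (sketch, not the paper's method).
import numpy as np

def shapley_estimates(n_examples, utility, n_permutations=100, rng=None):
    """Estimate each training example's Shapley value by averaging its
    marginal contribution over random permutations."""
    rng = np.random.default_rng(rng)
    values = np.zeros(n_examples)
    for _ in range(n_permutations):
        perm = rng.permutation(n_examples)
        prev_score = utility(frozenset())      # utility of the empty set
        subset = set()
        for idx in perm:
            subset.add(idx)
            score = utility(frozenset(subset))
            values[idx] += score - prev_score  # marginal contribution of idx
            prev_score = score
    return values / n_permutations

# Toy usage: utility is just the sum of per-example "qualities".
quality = np.array([0.9, 0.1, 0.5])
print(shapley_estimates(3, lambda s: sum(quality[i] for i in s)))
```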
-
We introduce ImportantAug, a technique to augment training data for speech classification and recognition models by adding noise to unimportant regions of the speech and not to important regions. Importance is predicted for each utterance by a data augmentation agent that is trained to maximize the amount of noise it adds while minimizing its impact on recognition performance. The effectiveness of our method is illustrated on version two of the Google Speech Commands (GSC) dataset. On the standard GSC test set, it achieves a 23.3% relative error rate reduction compared to conventional noise augmentation, which applies noise to speech without regard to where it might be most effective. It also provides a 25.4% error rate reduction compared to a baseline without data augmentation. Additionally, the proposed ImportantAug outperforms both the conventional noise augmentation and the baseline on two test sets with additional noise added.
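A hedged sketch of the core idea, assuming a precomputed importance map per utterance (the paper instead learns this map with an augmentation agent trained jointly with the recognizer):

```python
# Add noise only where the importance map says the spectrogram is unimportant.
import torch

def important_aug(spec, importance, noise_scale=1.0, threshold=0.5):
    """spec, importance: (freq, time) tensors; importance values in [0, 1]."""
    unimportant = (importance < threshold).float()   # 1 where noise is safe to add
    noise = noise_scale * torch.randn_like(spec)
    return spec + noise * unimportant                # important regions stay clean

spec = torch.randn(80, 100)        # toy log-mel spectrogram
importance = torch.rand(80, 100)   # stand-in for the agent's predicted map
augmented = important_aug(spec, importance)
```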
-
The Arctic is warming at three times the rate of the global average, affecting the habitat and lifecycles of migratory species that reproduce there, like birds and caribou. Ecoacoustic monitoring can help efficiently track changes in animal phenology and behavior over large areas so that the impacts of climate change on these species can be better understood and potentially mitigated. We introduce here the Ecoacoustic Dataset from Arctic North Slope Alaska (EDANSA-2019), a dataset collected by a network of 100 autonomous recording units covering an area of 9,000 square miles over the course of the 2019 summer season on the North Slope of Alaska and neighboring regions. We labeled over 27 hours of this dataset according to 28 tags, with enough instances of 9 important environmental classes to train baseline convolutional recognizers. We are releasing this dataset and the corresponding baseline to the community to accelerate the recognition of these sounds and facilitate automated analyses of large-scale ecoacoustic databases.
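As a rough illustration, a small multi-label convolutional baseline of the kind mentioned above might look like the following; the layer sizes are illustrative, not the released baseline.

```python
# Toy multi-label CNN over log-mel spectrograms with 9 output classes.
import torch.nn as nn

baseline = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 9),  # one logit per environmental class
)
# Train with nn.BCEWithLogitsLoss, since a clip can carry multiple tags.
```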
-
Arctic boreal forests are warming at a rate 2–3 times faster than the global average. It is important to understand the effects of this warming on the activities of animals that migrate to these environments annually to reproduce. Acoustic sensors can monitor a wide area relatively cheaply, producing large amounts of data that need to be analyzed automatically. In such scenarios, only a small proportion of the recorded data can be labeled by hand, so we explore two methods for utilizing labels more efficiently: self-supervised learning using wav2vec 2.0 and data valuation using k-nearest neighbor approximations to compute Shapley values. We confirm that data augmentation and global temporal pooling improve performance by more than 30%, demonstrate for the first time the utility of Shapley data valuation for audio classification, and find that our wav2vec 2.0 model trained from scratch does not improve performance.
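The k-nearest-neighbor Shapley approximation mentioned above has a known closed form (Jia et al., 2019); a minimal sketch follows, with toy embeddings and labels standing in for real audio features.

```python
# Closed-form KNN-Shapley values for one test point (sketch).
import numpy as np

def knn_shapley(train_x, train_y, test_x, test_y, k=5):
    """Shapley value of each training point under a KNN utility."""
    n = len(train_x)
    order = np.argsort(np.linalg.norm(train_x - test_x, axis=1))  # nearest first
    match = (train_y[order] == test_y).astype(float)
    s = np.zeros(n)
    s[order[-1]] = match[-1] / n
    for i in range(n - 2, -1, -1):  # recurse from farthest to nearest
        s[order[i]] = s[order[i + 1]] + (match[i] - match[i + 1]) / k * min(k, i + 1) / (i + 1)
    return s

# Toy usage with random embeddings and binary labels.
x = np.random.randn(20, 8)
y = np.random.randint(0, 2, 20)
print(knn_shapley(x, y, np.random.randn(8), 1))
```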
-
Human listeners use specific cues to recognize speech, and recent experiments have shown that certain time-frequency regions of individual utterances are more important to their correct identification than others. A model that could identify such cues or regions from clean speech would facilitate speech recognition and speech enhancement by focusing on those important regions. Thus, in this paper we present a model that can predict the regions of individual utterances that are important to an automatic speech recognition (ASR) “listener” by learning to add as much noise as possible to these utterances while still permitting the ASR to correctly identify them. This work utilizes a continuous speech recognizer to recognize multi-word utterances and builds upon our previous work that performed the same process for an isolated word recognizer. Our experimental results indicate that our model can apply noise to obscure 90.5% of the spectrogram while leaving recognition performance nearly unchanged.
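A hedged sketch of the objective described above: a mask network is rewarded for admitting noise but penalized when the recognizer's loss rises. The mask network and ASR loss below are toy placeholders, not the paper's recognizer.

```python
# Trade recognition accuracy against the amount of admitted noise.
import torch
import torch.nn as nn

def importance_objective(spec, mask_net, asr_loss, noise_weight=1.0):
    """Loss for learning which regions must stay clean for the ASR."""
    mask = torch.sigmoid(mask_net(spec))               # ~1 = keep clean, ~0 = allow noise
    noisy = spec + torch.randn_like(spec) * (1 - mask)
    # Keep the recognizer correct on the noisy input while maximizing noise.
    return asr_loss(noisy) - noise_weight * (1 - mask).mean()

# Toy stand-ins so the sketch runs end to end:
mask_net = nn.Conv2d(1, 1, 3, padding=1)               # predicts an importance map
spec = torch.randn(4, 1, 80, 100)                      # batch of log-mel spectrograms
loss = importance_objective(spec, mask_net, lambda x: x.pow(2).mean())
```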
-
This paper proposes a metric that we call the structured saliency benchmark (SSBM) to evaluate importance maps computed for automatic speech recognizers on individual utterances. These maps indicate the time-frequency points of an utterance that are most important for correct recognition of a target word. Our evaluation technique is not only suitable for standard classification tasks but is also appropriate for structured prediction tasks like sequence-to-sequence models. Additionally, we use this approach to compare, via correlation, the importance maps created by our previously introduced “bubble noise” technique with a baseline approach based on smoothed speech energy and forced alignment. Our results show that the bubble analysis approach is better at identifying important speech regions than this baseline on 100 sentences from the AMI corpus.
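As a rough illustration of correlation-based comparison of importance maps (not the exact SSBM computation), one might do:

```python
# Pearson correlation between two flattened time-frequency importance maps.
import numpy as np

def map_correlation(map_a, map_b):
    a, b = map_a.ravel(), map_b.ravel()
    return np.corrcoef(a, b)[0, 1]

bubble_map = np.random.rand(80, 120)   # stand-in for bubble-noise analysis
energy_map = np.random.rand(80, 120)   # stand-in for smoothed-energy baseline
print(map_correlation(bubble_map, energy_map))
```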
-
Sound provides a valuable tool for long-term monitoring of sensitive animal habitats at a spatial scale larger than camera traps or field observations, while also providing more details than satellite imagery. Currently, the ability to collect such recordings outstrips the ability to analyze them manually, necessitating the development of automatic analysis methods. While several datasets and models of large corpora of video soundtracks have recently been released, it is not clear to what extent these models will generalize to environmental recordings and the scientific questions of interest in analyzing them. This paper investigates this generalization in several ways and finds that, while the models themselves display limited performance, their intermediate representations can be used to train successful models on small sets of labeled data.
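A hedged sketch of the transfer strategy the findings suggest: freeze a pretrained trunk, pool its intermediate representations over time, and fit a small head on the limited labels. The trunk below is a toy stand-in, not one of the released soundtrack models.

```python
# Use a frozen pretrained trunk as a fixed feature extractor.
import torch
import torch.nn as nn

pretrained = nn.Sequential(  # stand-in for a frozen soundtrack-model trunk
    nn.Conv1d(1, 64, 9, stride=4), nn.ReLU(),
    nn.Conv1d(64, 128, 9, stride=4), nn.ReLU(),
)
for p in pretrained.parameters():
    p.requires_grad = False                    # trunk stays fixed

classifier = nn.Linear(128, 10)                # small head trained on the labeled subset

def predict(waveform):                         # waveform: (batch, samples)
    feats = pretrained(waveform.unsqueeze(1))  # (batch, 128, time)
    pooled = feats.mean(dim=-1)                # average intermediate features over time
    return classifier(pooled)

print(predict(torch.randn(2, 16000)).shape)    # torch.Size([2, 10])
```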
-
Purpose: The “bubble noise” technique has recently been introduced as a method to identify the regions in time-frequency maps (i.e., spectrograms) of speech that are especially important for listeners in speech recognition. This technique identifies regions of “importance” that are specific to the speech stimulus and the listener, thus permitting these regions to be compared across different listener groups. For example, in cross-linguistic and second-language (L2) speech perception, this method identifies differences in regions of importance in accomplishing decisions of phoneme category membership. This research note describes the application of bubble noise to the study of language learning for 3 different language pairs: Hindi-English bilinguals' perception of the /v/–/w/ contrast in American English, native English speakers' perception of the tense/lax contrast for Korean fricatives and affricates, and native English speakers' perception of Mandarin lexical tone.
Conclusion: We demonstrate that this technique provides insight on what information in the speech signal is important for native/first-language listeners compared to nonnative/L2 listeners. Furthermore, the method can be used to examine whether L2 speech perception training is effective in bringing the listener's attention to the important cues.
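A minimal sketch of how bubble-noise stimuli can be generated: intense noise everywhere except a few randomly placed Gaussian time-frequency "bubbles" that reveal the underlying speech. Bubble count and size here are illustrative, not the study's settings.

```python
# Build a mask that is ~1 inside random Gaussian bubbles, ~0 elsewhere.
import numpy as np

def bubble_mask(n_freq, n_time, n_bubbles=5, sigma=6.0, rng=None):
    rng = np.random.default_rng(rng)
    f = np.arange(n_freq)[:, None]
    t = np.arange(n_time)[None, :]
    mask = np.zeros((n_freq, n_time))
    for _ in range(n_bubbles):
        cf, ct = rng.uniform(0, n_freq), rng.uniform(0, n_time)
        bubble = np.exp(-((f - cf) ** 2 + (t - ct) ** 2) / (2 * sigma ** 2))
        mask = np.maximum(mask, bubble)
    return mask

# Attenuate the noise inside the bubbles before mixing with the speech:
# noisy = speech + noise * (1 - bubble_mask(80, 200))
```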