NSF PAR Search | NSF Public Access Repository

The Interspeech 2025 Challenge on Speech Emotion Recognition in Naturalistic Conditions

https://doi.org/10.21437/Interspeech.2025-1972

Naini, Abinay Reddy; Goncalves, Lucas; Salman, Ali N; Mote, Pravin; Ulgen, Ismail R; Thebaud, Thomas; Velazquez, Laureano Moro; Garcia, Leibny Paola; Dehak, Najim; Sisman, Berrak; et al (August 2025, ISCA)

The Interspeech 2025 speech emotion recognition in naturalistic conditions challenge builds on previous efforts to advance speech emotion recognition (SER) in real-world scenarios. The focus is on recognizing emotions from spontaneous speech, moving beyond controlled datasets. It provides a framework for speaker-independent training, development, and evaluation, with annotations for both categorical and dimensional tasks. The challenge attracted 93 research teams, whose models significantly improved state-of-the-art results over competitive baselines. This paper summarizes the challenge, focusing on the key outcomes. We analyze top-performing methods, emerging trends, and innovative directions. We highlight the effectiveness of combining foundational models based on audio and text to achieve robust SER systems. The competition website, with leaderboards, baseline code, and instructions, is available at: https://lab-msp.com/MSP-Podcast_Competition/IS2025/.
Free, publicly-accessible full text available August 17, 2026
Slowness Regularized Contrastive Predictive Coding for Acoustic Unit Discovery

https://doi.org/10.1109/TASLP.2024.3350888

Bhati, Saurabhchand; Villalba, Jesús; Żelasko, Piotr; Moro-Velazquez, Laureano; Dehak, Najim (January 2024, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Model-Based Fairness Metric for Speaker Verification

https://doi.org/10.1109/ASRU57964.2023.10389804

Jahan, Maliha; Moro-Velazquez, Laureano; Thebaud, Thomas; Dehak, Najim; Villalba, Jesús (December 2023, 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU))

Ensuring that technological advancements benefit all groups of people equally is crucial. The first step towards fairness is identifying existing inequalities. The naive comparison of group error rates may lead to wrong conclusions. We introduce a new method to determine whether a speaker verification system is fair toward several population subgroups. We propose to model miss and false alarm probabilities as a function of multiple factors, including the population group effects, e.g., male and female, and a series of confounding variables, e.g., speaker effects, language, nationality, etc. This model can estimate error rates related to a group effect without the influence of confounding effects. We experiment with a synthetic dataset where we control group and confounding effects. Our metric achieves significantly lower false positive and false negative rates w.r.t. baseline. We also experiment with VoxCeleb and NIST SRE21 datasets on different ASV systems and present our conclusions.
Full Text Available
Discovering phonetic inventories with crosslingual automatic speech recognition

https://doi.org/10.1016/j.csl.2022.101358

Żelasko, Piotr; Feng, Siyuan; Moro Velázquez, Laureano; Abavisani, Ali; Bhati, Saurabhchand; Scharenborg, Odette; Hasegawa-Johnson, Mark; Dehak, Najim (July 2022, Computer Speech & Language)

Full Text Available
Unsupervised Speech Segmentation and Variable Rate Representation Learning Using Segmental Contrastive Predictive Coding

https://doi.org/10.1109/TASLP.2022.3180684

Bhati, Saurabhchand; Villalba, Jesus; Zelasko, Piotr; Moro-Velazquez, Laureano; Dehak, Najim (January 2022, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
Chunking Defense for Adversarial Attacks on ASR

https://doi.org/10.21437/Interspeech.2022-11096

Shao, Yiwen; Villalba, Jesus; Joshi, Sonal; Kataria, Saurabh; Khudanpur, Sanjeev; Dehak, Najim (January 2022, Proc. Interspeech 2022)

Full Text Available
Defense against Adversarial Attacks on Hybrid Speech Recognition System using Adversarial Fine-tuning with Denoiser

https://doi.org/10.21437/Interspeech.2022-10977

Joshi, Sonal; Kataria, Saurabh; Shao, Yiwen; Żelasko, Piotr; Villalba, Jesús; Khudanpur, Sanjeev; Dehak, Najim (January 2022, Proc. Interspeech 2022)

Full Text Available
The promise of AI and technology to improve quality of life and care for older adults

https://doi.org/10.1038/s43587-023-00430-0

Abadir, Peter M.; Chellappa, Rama; Choudhry, Niteesh; Demiris, George; Ganesan, Deepak; Karlawish, Jason; Li, Rose M.; Moore, Jason H.; Walston, Jeremy D.; Marlin, Benjamin; et al (June 2023, Nature Aging)

Full Text Available
Align or attend? Toward More Efficient and Accurate Spoken Word Discovery Using Speech-to-Image Retrieval

https://doi.org/10.1109/ICASSP39728.2021.9414418

Wang, Liming; Wang, Xinsheng; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (June 2021, ICASSP)
Multimodal word discovery (MWD) is often treated as a byproduct of the speech-to-image retrieval problem. However, our theoretical analysis shows that some kind of alignment/attention mechanism is crucial for an MWD system to learn meaningful word-level representations. We verify our theory by conducting retrieval and word discovery experiments on MSCOCO and Flickr8k, and empirically demonstrate that both neural MT with self-attention and statistical MT achieve word discovery scores that are superior to those of a state-of-the-art neural retrieval system, outperforming it by 2% and 5% alignment F1 scores respectively.
Full Text Available
How Phonotactics Affect Multilingual and Zero-Shot ASR Performance

https://doi.org/10.1109/ICASSP39728.2021.9414478

Feng, Siyuan; Zelasko, Piotr; Moro-Velazquez, Laureano; Abavisani, Ali; Hasegawa-Johnson, Mark; Scharenborg, Odette; Dehak, Najim (June 2021, ICASSP)
The idea of combining multiple languages’ recordings to train a single automatic speech recognition (ASR) model brings the promise of the emergence of universal speech representation. Recently, a Transformer encoder-decoder model has been shown to leverage multilingual data well in IPA transcriptions of languages presented during training. However, the representations it learned were not successful in zero-shot transfer to unseen languages. Because that model lacks an explicit factorization of the acoustic model (AM) and language model (LM), it is unclear to what degree the performance suffered from differences in pronunciation or the mismatch in phonotactics. To gain more insight into the factors limiting zero-shot ASR transfer, we replace the encoder-decoder with a hybrid ASR system consisting of a separate AM and LM. Then, we perform an extensive evaluation of monolingual, multilingual, and crosslingual (zero-shot) acoustic and language models on a set of 13 phonetically diverse languages. We show that the gain from modeling crosslingual phonotactics is limited, and imposing too strong a model can hurt the zero-shot transfer. Furthermore, we find that a multilingual LM hurts a multilingual ASR system’s performance, and retaining only the target language’s phonotactic data in LM training is preferable.
Full Text Available
