Modeling cross-lingual speech emotion recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt the feature, domain, or label across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. This work is framed in a twofold manner. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, reaching an unweighted average recall (UAR) of 58.64%.
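For reference, the UAR reported above is the unweighted (macro) average of per-class recalls, so every emotion class contributes equally regardless of class imbalance. A minimal sketch of the metric, assuming scikit-learn is available; the labels below are illustrative placeholders, not data from the corpora:

```python
import numpy as np
from sklearn.metrics import recall_score

def unweighted_average_recall(y_true, y_pred):
    """UAR = mean of per-class recalls (macro-averaged recall in sklearn terms)."""
    return recall_score(y_true, y_pred, average="macro")

# Illustrative example with four emotion classes (not data from the paper).
y_true = np.array([0, 0, 1, 1, 2, 2, 3, 3])
y_pred = np.array([0, 1, 1, 1, 2, 0, 3, 2])
print(f"UAR: {unweighted_average_recall(y_true, y_pred):.4f}")
```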
Personalized Adaptation with Pre-trained Speech Encoders for Continuous Emotion Recognition
There are individual differences in expressive behaviors driven by cultural norms and personality. This between-person variation can result in reduced emotion recognition performance. Therefore, personalization is an important step in improving the generalization and robustness of speech emotion recognition. In this paper, to achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method to compensate for the label distribution shifts by finding similar speakers and leveraging their label distributions from the training set. Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
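The label-distribution compensation step described above can be pictured, under simplifying assumptions, as shifting a test speaker's raw predictions toward the label statistics of the most similar training speakers. The sketch below is illustrative only; the variable names, cosine similarity, and top-k choice are assumptions, not the paper's exact procedure:

```python
import numpy as np

def compensate_predictions(test_speaker_emb, train_speaker_embs,
                           train_speaker_label_means, raw_predictions,
                           global_label_mean, k=5):
    """Illustrative label-shift compensation: shift raw valence predictions
    toward the mean label of the k most similar training speakers.
    All inputs are hypothetical placeholders."""
    # Cosine similarity between the test speaker and every training speaker.
    sims = train_speaker_embs @ test_speaker_emb
    sims /= (np.linalg.norm(train_speaker_embs, axis=1)
             * np.linalg.norm(test_speaker_emb) + 1e-8)
    top_k = np.argsort(sims)[-k:]
    # Estimate the test speaker's label mean from its nearest training speakers.
    neighbour_mean = train_speaker_label_means[top_k].mean()
    # Shift predictions by the estimated deviation from the global label mean.
    return raw_predictions + (neighbour_mean - global_label_mean)
```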
- Award ID(s): 2211550
- PAR ID: 10474297
- Publisher / Repository: ISCA
- Date Published:
- Journal Name: Proc. INTERSPEECH 2023
- Page Range / eLocation ID: 636-640
- Format(s): Medium: X
- Location: Dublin, Ireland
- Sponsoring Org: National Science Foundation
More Like this
The uncertainty in modeling emotions makes speech emotion recognition (SER) systems less reliable. An intuitive way to increase trust in SER is to reject predictions with low confidence. This approach assumes that an SER system is well calibrated, where highly confident predictions are often right and low-confidence predictions are often wrong. Hence, it is desirable to calibrate the confidence of SER classifiers. We evaluate the reliability of SER systems by exploring the relationship between confidence and accuracy, using the expected calibration error (ECE) metric. We develop a multi-label variant of the post-hoc temperature scaling (TS) method to calibrate SER systems, while preserving their accuracy. The best method combines an emotion co-occurrence weight penalty function, a class-balanced objective function, and the proposed multi-label TS calibration method. The experiments show the effectiveness of our developed multi-label calibration method in terms of accuracy and ECE.
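For context, standard single-label post-hoc temperature scaling and the ECE metric referenced above can be sketched as below; the paper's multi-label TS variant and its penalty and class-balanced objective functions are more involved and are not reproduced here. Function names and hyperparameters are assumptions:

```python
import numpy as np
import torch
import torch.nn.functional as F

def expected_calibration_error(probs, labels, n_bins=10):
    """ECE: bin predictions by confidence and average |accuracy - confidence|,
    weighted by the fraction of samples in each bin."""
    conf = probs.max(axis=1)
    pred = probs.argmax(axis=1)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            acc = (pred[mask] == labels[mask]).mean()
            ece += mask.mean() * abs(acc - conf[mask].mean())
    return ece

def fit_temperature(val_logits, val_labels, steps=200, lr=0.01):
    """Post-hoc temperature scaling: learn a single temperature T on
    held-out logits by minimizing the negative log-likelihood."""
    logits = torch.as_tensor(val_logits, dtype=torch.float32)
    labels = torch.as_tensor(val_labels, dtype=torch.long)
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so T stays positive
    opt = torch.optim.LBFGS([log_t], lr=lr, max_iter=steps)

    def closure():
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        return loss

    opt.step(closure)
    return log_t.exp().item()
```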
Individual variability of expressive behaviors is a major challenge for emotion recognition systems. Personalized emotion recognition strives to adapt machine learning models to individual behaviors, thereby enhancing emotion recognition performance and overcoming the limitations of generalized emotion recognition systems. However, existing datasets for audiovisual emotion recognition either have a very low number of data points per speaker or include a limited number of speakers. The scarcity of data significantly limits the development and assessment of personalized models, hindering their ability to effectively learn and adapt to individual expressive styles. This paper introduces EmoCeleb: a large-scale, weakly labeled emotion dataset generated via cross-modal labeling. EmoCeleb comprises over 150 hours of audiovisual content from approximately 1,500 speakers, with a median of 50 utterances per speaker. This dataset provides a rich resource for developing and benchmarking personalized emotion recognition methods, including those requiring substantial data per individual, such as set learning approaches. We also propose SetPeER: a novel personalized emotion recognition architecture employing set learning. SetPeER effectively captures individual expressive styles by learning representative speaker features from limited data, achieving strong performance with as few as eight utterances per speaker. By leveraging set learning, SetPeER overcomes the limitations of previous approaches that struggle to learn effectively from limited data per individual. Through extensive experiments on EmoCeleb and established benchmarks, i.e., MSP-Podcast and MSP-Improv, we demonstrate the effectiveness of our dataset and the superior performance of SetPeER compared to existing methods for emotion recognition. Our work paves the way for more robust and accurate personalized emotion recognition systems.
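A generic flavor of the set-learning idea mentioned above is to pool a small support set of a speaker's utterance embeddings into a single style vector and condition the classifier on it. The sketch below is not the SetPeER architecture; the attention pooling, dimensions, and class count are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class SetStylePooling(nn.Module):
    """Illustrative set-based speaker-style encoder (not the SetPeER model):
    attention-pool a set of utterance embeddings into one style vector and
    condition the emotion classifier on it."""
    def __init__(self, emb_dim=256, n_classes=4):
        super().__init__()
        self.attn = nn.Linear(emb_dim, 1)             # attention weights over the set
        self.classifier = nn.Linear(emb_dim * 2, n_classes)

    def forward(self, target_emb, support_set):
        # support_set: (set_size, emb_dim) embeddings from the same speaker.
        weights = torch.softmax(self.attn(support_set), dim=0)
        style = (weights * support_set).sum(dim=0)    # pooled speaker-style vector
        return self.classifier(torch.cat([target_emb, style], dim=-1))

# Usage: as few as eight support utterances per speaker, as noted in the abstract.
model = SetStylePooling()
logits = model(torch.randn(256), torch.randn(8, 256))
```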
In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker, thus fulfilling the requirement for zero-shot learning. To complement the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model, and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process with which users are reluctant to comply due to privacy concerns and the technical difficulty of recording clean voice. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance during the test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance. As expected, the student models underperform the state-of-the-art teacher models.
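The distillation idea above can be pictured as a test-time training loop in which the teacher's enhanced output stands in for the missing clean target. The model and optimizer interfaces below are hypothetical placeholders, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def personalize_student(student, teacher, noisy_batches, optimizer):
    """Illustrative test-time distillation step: the large teacher's enhanced
    output serves as a pseudo clean target, so no ground-truth clean speech
    from the user is needed. All interfaces are hypothetical."""
    teacher.eval()
    student.train()
    for noisy in noisy_batches:                      # the user's noisy recordings only
        with torch.no_grad():
            pseudo_clean = teacher(noisy)            # teacher's denoised estimate
        enhanced = student(noisy)                    # compact student's estimate
        loss = F.l1_loss(enhanced, pseudo_clean)     # match the pseudo target
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```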
Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence-level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, which are computed over time from low-level descriptors (LLDs), creating a fixed-dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated network or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results based on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTM, but also leads to computational efficiency.
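The chunking scheme described above can be illustrated, under assumed parameter values, by fixing the number and length of chunks and letting the stride (and hence the overlap) absorb the variation in sentence duration. A minimal sketch; chunk counts, lengths, and padding behavior are assumptions, not the paper's settings:

```python
import numpy as np

def extract_fixed_chunks(features, n_chunks=11, chunk_len=100):
    """Illustrative chunking: slice a variable-length feature sequence
    (frames x dims) into a fixed number of equal-length chunks by adjusting
    the overlap between consecutive chunks. Parameter values are placeholders."""
    n_frames, n_dims = features.shape
    if n_frames < chunk_len:                          # pad short sentences
        pad = np.zeros((chunk_len - n_frames, n_dims))
        features = np.vstack([features, pad])
        n_frames = chunk_len
    # Stride shrinks (more overlap) for short sentences and grows for long ones.
    stride = (n_frames - chunk_len) / max(n_chunks - 1, 1)
    starts = [int(round(i * stride)) for i in range(n_chunks)]
    return np.stack([features[s:s + chunk_len] for s in starts])

# Example: a 6.4 s sentence at 100 frames/s with 40-dim features.
chunks = extract_fixed_chunks(np.random.randn(640, 40))
print(chunks.shape)  # (11, 100, 40)
```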