Unsupervised domain adaptation offers significant potential for cross-lingual speech emotion recognition (SER). Most prior studies have treated this problem as a domain mismatch without considering phonetic emotional differences across languages. Our study explores universal discrete speech units obtained by vector quantization of WavLM representations of emotional speech in English, Taiwanese Mandarin, and Russian. We estimate cluster-wise distributions of quantized WavLM frames to quantify phonetic commonalities and differences across languages, vowels, and emotions. Our findings indicate that certain emotion-specific phonemes exhibit cross-linguistic similarities, and that the distribution of vowels varies with emotional content. Certain vowels across languages show close distributional proximity, offering anchor points for cross-lingual domain adaptation. We also propose and validate a method to quantify phoneme distribution similarities across languages.
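No code accompanies this abstract here, so the following Python sketch only illustrates one plausible reading of the pipeline under stated assumptions: frame embeddings (placeholders standing in for WavLM features) are quantized with k-means, per-condition unit histograms are formed, and distributions are compared with Jensen–Shannon divergence. The codebook size, the clustering method, and the similarity measure are all assumptions, not the paper's confirmed choices.

```python
# Sketch: quantize frame-level features into discrete units and compare
# cluster-wise distributions across conditions (language x vowel x emotion).
import numpy as np
from scipy.spatial.distance import jensenshannon
from sklearn.cluster import KMeans

N_UNITS = 64  # codebook size (assumed; not stated in the abstract)

def unit_histogram(units: np.ndarray, n_units: int = N_UNITS) -> np.ndarray:
    """Normalized histogram over discrete units (a point on the simplex)."""
    counts = np.bincount(units, minlength=n_units).astype(float)
    return counts / counts.sum()

rng = np.random.default_rng(0)
# Placeholder arrays standing in for 768-dim WavLM frame embeddings; in
# practice these would be extracted from emotional speech in each language.
pooled = rng.normal(size=(4000, 768))
kmeans = KMeans(n_clusters=N_UNITS, n_init=10, random_state=0).fit(pooled)

feats_a = rng.normal(size=(1500, 768))  # e.g., English /a/ under anger
feats_b = rng.normal(size=(1500, 768))  # e.g., Mandarin /a/ under anger

p = unit_histogram(kmeans.predict(feats_a))
q = unit_histogram(kmeans.predict(feats_b))

# jensenshannon returns the JS *distance*; squaring gives the divergence.
# Smaller values suggest closer unit distributions, i.e., candidate anchors.
print("JS divergence:", jensenshannon(p, q) ** 2)
```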
Vector Quantized Cross-lingual Unsupervised Domain Adaptation for Speech Emotion Recognition
Building speech emotion recognition (SER) models for low-resource languages is challenging due to the scarcity of labeled speech data. This limitation mandates the development of cross-lingual unsupervised domain adaptation techniques that effectively utilize labeled data from resource-rich languages. Inspired by the TransVQA framework, we propose a method that leverages a shared quantized feature space to enable knowledge transfer between labeled and unlabeled data across languages. The approach utilizes a quantized codebook to capture shared features while reducing the domain gap and aligning class distributions, thereby improving classification accuracy. Additionally, an information loss (InfoLoss) mechanism mitigates the loss of critical information during quantization; it achieves this by minimizing the loss within the simplex of posterior class label distributions. The proposed method demonstrates superior performance compared to state-of-the-art baseline approaches. Index Terms: Speech Emotion Recognition, Cross-lingual Unsupervised Domain Adaptation, Discrete Features, InfoLoss
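Neither the quantizer nor the exact InfoLoss formulation is spelled out in this abstract. The sketch below is a minimal PyTorch reading under stated assumptions: a VQ-VAE-style straight-through codebook shared across languages, and an InfoLoss stand-in implemented as a KL divergence between class posteriors (points on the label simplex) computed before and after quantization.

```python
# Sketch of a shared quantized feature space with an InfoLoss-style term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedVQ(nn.Module):
    """VQ-VAE-style quantizer over a codebook shared across languages."""
    def __init__(self, num_codes: int = 256, dim: int = 128):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)

    def forward(self, z: torch.Tensor):
        # Assign each feature vector to its nearest codebook entry.
        dists = torch.cdist(z, self.codebook.weight)      # (B, num_codes)
        idx = dists.argmin(dim=-1)
        zq = self.codebook(idx)
        # Straight-through estimator: gradients bypass the argmin.
        zq_st = z + (zq - z).detach()
        commit = F.mse_loss(z, zq.detach()) + F.mse_loss(zq, z.detach())
        return zq_st, commit

def info_loss(logits_pre: torch.Tensor, logits_post: torch.Tensor):
    """KL between class posteriors before/after quantization (assumed form)."""
    log_p_post = F.log_softmax(logits_post, dim=-1)
    p_pre = F.softmax(logits_pre, dim=-1)
    return F.kl_div(log_p_post, p_pre, reduction="batchmean")

# Toy usage: encode -> quantize -> classify, with the InfoLoss term added.
enc, vq, clf = nn.Linear(40, 128), SharedVQ(), nn.Linear(128, 4)
x, y = torch.randn(8, 40), torch.randint(0, 4, (8,))
z = enc(x)
zq, commit = vq(z)
loss = F.cross_entropy(clf(zq), y) + commit + 0.1 * info_loss(clf(z), clf(zq))
loss.backward()
```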
- Award ID(s):
- 2016719
- PAR ID:
- 10655474
- Publisher / Repository:
- ISCA
- Date Published:
- Page Range / eLocation ID:
- 126 to 130
- Format(s):
- Medium: X
- Location:
- Rotterdam, The Netherlands
- Sponsoring Org:
- National Science Foundation
More Like this
- Modeling cross-lingual speech emotion recognition (SER) has become more prevalent because of its diverse applications. Existing studies have mostly focused on technical approaches that adapt the feature, domain, or label across languages, without considering in detail the similarities between the languages. This study focuses on domain adaptation in cross-lingual scenarios using phonetic constraints. The work is framed in a twofold manner. First, we analyze emotion-specific phonetic commonality across languages by identifying common vowels that are useful for SER modeling. Second, we leverage these common vowels as an anchoring mechanism to facilitate cross-lingual SER. We consider American English and Taiwanese Mandarin as a case study to demonstrate the potential of our approach. This work uses two in-the-wild natural emotional speech corpora: MSP-Podcast (American English) and BIIC-Podcast (Taiwanese Mandarin). The proposed unsupervised cross-lingual SER model using these phonetic anchors outperforms the baselines, achieving an unweighted average recall (UAR) of 58.64%. (A minimal sketch of the vowel-selection step appears after this list.)
- The prevalence of cross-lingual speech emotion recognition (SER) modeling has significantly increased due to its wide range of applications. Previous studies have primarily focused on technical strategies to adapt features, domains, and labels across languages, often overlooking the underlying universalities between languages. In this study, we address the language adaptation challenge in cross-lingual scenarios by incorporating vowel-phonetic constraints. Our approach is structured in two main parts. First, we investigate the vowel-phonetic commonalities associated with specific emotions across languages, focusing on common vowels that prove valuable for SER modeling. Second, we utilize these identified common vowels as anchors to facilitate cross-lingual SER. To demonstrate the effectiveness of our approach, we conduct case studies on American English, Taiwanese Mandarin, and Russian with three naturalistic emotional speech corpora: MSP-Podcast, BIIC-Podcast, and Dusha. The proposed unsupervised cross-lingual SER model, leveraging this phonetic information, surpasses the performance of the baselines. This research highlights the importance of considering phonetic similarities across languages for effective language adaptation in cross-lingual SER scenarios. (A sketch of anchor-based alignment appears after this list.)
- Modern NLP applications have enjoyed a great boost from neural network models. Such deep neural models, however, are not applicable to most human languages due to the lack of annotated training data for various NLP tasks. Cross-lingual transfer learning (CLTL) is a viable method for building NLP models for a low-resource target language by leveraging labeled data from other (source) languages. In this work, we focus on the multilingual transfer setting, where training data in multiple source languages is leveraged to further boost target-language performance. Unlike most existing methods that rely only on language-invariant features for CLTL, our approach coherently utilizes both language-invariant and language-specific features at the instance level. Our model leverages adversarial networks to learn language-invariant features, and mixture-of-experts models to dynamically exploit the similarity between the target language and each individual source language. This enables our model to learn effectively what to share between various languages in the multilingual setup. Moreover, when coupled with unsupervised multilingual embeddings, our model can operate in a zero-resource setting where neither target-language training data nor cross-lingual resources are available. Our model achieves significant performance gains over prior art, as shown in an extensive set of experiments on multiple text classification and sequence tagging tasks. (A mixture-of-experts sketch appears after this list.)
- Cross-lingual speech emotion recognition (SER) is important for a wide range of everyday applications. While recent SER research relies heavily on large pretrained models for emotion training, existing studies often concentrate solely on the final transformer layer of these models. However, given the task-specific nature and hierarchical architecture of these models, each transformer layer encapsulates a different level of information. Leveraging this hierarchical structure, our study focuses on the information embedded across different layers. Through an examination of layer-feature similarity across languages, we propose a novel layer-anchoring mechanism to facilitate emotion transfer in cross-lingual SER tasks. Our approach is evaluated using affective corpora in two distinct languages (MSP-Podcast and BIIC-Podcast), achieving a best UAR of 60.21% on the BIIC-Podcast corpus. The analysis uncovers interesting insights into the behavior of popular pretrained models. (A layer-similarity sketch appears after this list.)
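For the first related abstract (emotion-specific common vowels), no code is published here; the Python sketch below only illustrates the general idea of ranking vowels by cross-lingual proximity within each emotion. The vowel inventory, feature dimensionality, and the cosine-distance criterion are illustrative assumptions, not the authors' exact procedure.

```python
# Sketch: pick, per emotion, the vowel whose feature centroid is closest
# across two languages; such vowels are candidate cross-lingual anchors.
import numpy as np

def centroid(feats: np.ndarray) -> np.ndarray:
    return feats.mean(axis=0)

def cosine_dist(a: np.ndarray, b: np.ndarray) -> float:
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
vowels = ["a", "i", "u", "e", "o"]          # shared inventory (assumed)
emotions = ["neutral", "happy", "angry", "sad"]

# feats[lang][emotion][vowel] -> placeholder frame embeddings per condition.
feats = {lang: {e: {v: rng.normal(size=(200, 64)) for v in vowels}
                for e in emotions}
         for lang in ("english", "mandarin")}

for e in emotions:
    dists = {v: cosine_dist(centroid(feats["english"][e][v]),
                            centroid(feats["mandarin"][e][v]))
             for v in vowels}
    anchor = min(dists, key=dists.get)
    print(f"{e}: closest cross-lingual vowel = /{anchor}/")
```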
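For the anchoring step described in the second related abstract, one plausible instantiation is an orthogonal Procrustes map fitted only on anchor-vowel centroids, which is then applied to all target-language frames. The Procrustes choice and the anchor set are assumptions for illustration, not necessarily the authors' method.

```python
# Sketch: align the target-language feature space to the source space with
# an orthogonal map fit on anchor-vowel centroids only.
import numpy as np
from scipy.linalg import orthogonal_procrustes

rng = np.random.default_rng(1)
dim = 64
anchors = ["a", "i", "u"]  # vowels found common across languages (assumed)

# Per-vowel centroids in each language's embedding space (placeholders).
src = np.stack([rng.normal(size=dim) for _ in anchors])  # source language
tgt = np.stack([rng.normal(size=dim) for _ in anchors])  # target language

R, _ = orthogonal_procrustes(tgt, src)  # rotation mapping tgt -> src space

# Map all target-language frames into the source space before feeding them
# to the SER model trained on the labeled source language.
tgt_frames = rng.normal(size=(1000, dim))
tgt_aligned = tgt_frames @ R
print(tgt_aligned.shape)
```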
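The multilingual transfer abstract describes an instance-level mixture-of-experts over source languages. A minimal PyTorch sketch of that gating idea follows; the sizes and names are illustrative, and the adversarial language-invariant branch is omitted for brevity.

```python
# Sketch: one expert per source language; a gate weights experts per example
# according to how similar the input looks to each source language.
import torch
import torch.nn as nn

class MoEClassifier(nn.Module):
    def __init__(self, dim: int, n_classes: int, n_sources: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Linear(dim, n_classes) for _ in range(n_sources)])
        self.gate = nn.Linear(dim, n_sources)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        w = torch.softmax(self.gate(h), dim=-1)                # (B, S)
        logits = torch.stack([e(h) for e in self.experts], 1)  # (B, S, C)
        return (w.unsqueeze(-1) * logits).sum(dim=1)           # (B, C)

model = MoEClassifier(dim=128, n_classes=5, n_sources=3)
out = model(torch.randn(4, 128))
print(out.shape)  # torch.Size([4, 5])
```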
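The layer-anchoring abstract examines layer-feature similarity across languages; linear CKA is one common similarity measure and is assumed here, since the paper's exact measure is not given in this excerpt. The sketch scores each transformer layer and picks the most similar one as the anchor.

```python
# Sketch: compare per-layer representations of two languages with linear
# CKA and select the most similar layer as the cross-lingual anchor.
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear centered kernel alignment between two feature matrices."""
    x = x - x.mean(axis=0)
    y = y - y.mean(axis=0)
    num = np.linalg.norm(x.T @ y, "fro") ** 2
    den = np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")
    return float(num / den)

rng = np.random.default_rng(0)
n_layers, n_frames, dim = 12, 500, 768
# Placeholder per-layer features; in practice these come from running the
# same pretrained model (e.g., a WavLM-like encoder) on each corpus.
layers_en = [rng.normal(size=(n_frames, dim)) for _ in range(n_layers)]
layers_zh = [rng.normal(size=(n_frames, dim)) for _ in range(n_layers)]

sims = [linear_cka(a, b) for a, b in zip(layers_en, layers_zh)]
anchor_layer = int(np.argmax(sims))
print(f"anchor layer: {anchor_layer}, CKA = {sims[anchor_layer]:.3f}")
```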