Previous studies on speech emotion recognition (SER) with categorical emotions have often formulated the task as a single-label classification problem, where the emotions are considered orthogonal to each other. However, prior work has indicated that emotions can co-occur, especially in more ambiguous emotional sentences (e.g., a mixture of happiness and surprise). Some studies have regarded SER as a multi-label task, predicting multiple emotional classes. However, this formulation does not leverage the relation between emotions during training, since the emotions are assumed to be independent. This study explores the idea that emotional classes are not necessarily independent and its implications on training SER models. In particular, we calculate the frequency of co-occurring emotions from perceptual evaluations in the train set to generate a matrix with class-dependent penalties, penalizing mistakes between distant emotional classes more heavily. We integrate the penalization matrix into three existing label-learning approaches (hard-label, multi-label, and distribution-label learning) using the proposed modified loss. We train SER models using the penalty loss and commonly used cost functions for SER tasks. The evaluation of our proposed penalization matrix on the MSP-Podcast corpus shows important relative improvements in macro F1-score for hard-label learning (17.12%), multi-label learning (12.79%), and distribution-label learning (25.8%).
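As an illustration of the idea, the sketch below builds a class-dependent penalty matrix from co-occurring annotator ratings and folds it into a classification loss. It is a minimal sketch, not the paper's implementation: the emotion inventory, the mapping from co-occurrence frequency to penalty, and the way the matrix enters the loss are all assumptions made for the example.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "sad", "neutral", "surprise"]  # hypothetical class set

def penalty_matrix(annotations, num_classes, eps=1e-6):
    """Build a class-dependent penalty matrix from co-occurring emotion ratings.

    `annotations` is a list of per-sentence rating sets, e.g. [{1, 4}, {0}, ...],
    where each set holds the class indices selected by the annotators. Classes
    that rarely co-occur receive a larger mutual penalty (assumed mapping).
    """
    cooc = np.zeros((num_classes, num_classes))
    for rated in annotations:
        for i in rated:
            for j in rated:
                cooc[i, j] += 1.0
    # Frequent co-occurrence -> small penalty; rare co-occurrence -> large penalty.
    freq = cooc / (cooc.sum(axis=1, keepdims=True) + eps)
    penalty = 1.0 - freq
    np.fill_diagonal(penalty, 0.0)  # no penalty on the true class itself
    return penalty

def penalized_loss(probs, target, penalty, lam=1.0):
    """One plausible way to fold the matrix into a hard-label loss: standard
    cross-entropy plus a cost on probability mass placed on classes that
    rarely co-occur with the target (the 'distant' emotions)."""
    ce = -np.log(probs[target] + 1e-9)
    distance_cost = lam * float(np.dot(penalty[target], probs))
    return ce + distance_cost

# Toy usage: happiness and surprise co-occur often in these made-up ratings,
# so confusing them costs less than confusing happiness with sadness.
ratings = [{1, 4}, {1, 4}, {2}, {0}, {1}, {2, 3}]
P = penalty_matrix(ratings, num_classes=len(EMOTIONS))
probs = np.array([0.05, 0.60, 0.05, 0.10, 0.20])
print(penalized_loss(probs, target=1, penalty=P))
```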
Dynamic Speech Emotion Recognition Using A Conditional Neural Process
The problem of predicting emotional attributes from speech has often focused on predicting a single value from a sentence or short speaking turn. These methods often ignore that natural emotions are both dynamic and dependent on context. To model the dynamic nature of emotions, we can treat the prediction of emotion from speech as a time-series problem. We refer to the problem of predicting these emotional traces as dynamic speech emotion recognition. Previous studies in this area have used models that treat all emotional traces as coming from the same underlying distribution. Since emotions are dependent on contextual information, these methods might obscure the context of an emotional interaction. This paper uses a neural process model with a segment-level speech emotion recognition (SER) model for this problem. This type of model leverages information from the time-series and predictions from the SER model to learn a prior that defines a distribution over emotional traces. Our proposed model performs 21% better than a bidirectional long short-term memory (BiLSTM) baseline when predicting emotional traces for valence.
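As a sketch of how such a model could look, the snippet below implements a minimal conditional neural process in PyTorch: context points pair inputs (e.g., a time stamp plus a segment-level SER score) with observed attribute values, and their aggregated representation conditions a decoder that outputs a Gaussian over the emotional trace at target times. The architecture sizes, input features, and training objective are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionalNeuralProcess(nn.Module):
    """Minimal conditional neural process sketch for dynamic SER. Context
    points pair inputs x (e.g., time stamp and a segment-level SER score)
    with observed attribute values y; the decoder predicts a Gaussian over
    the trace at the target inputs. All sizes are illustrative."""

    def __init__(self, x_dim=2, y_dim=1, r_dim=64, h_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(x_dim + y_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, r_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(x_dim + r_dim, h_dim), nn.ReLU(),
            nn.Linear(h_dim, 2 * y_dim),  # predicts mean and pre-scale std
        )

    def forward(self, x_ctx, y_ctx, x_tgt):
        # Encode each context point, then aggregate with mean pooling into a
        # single permutation-invariant summary of the interaction so far.
        r = self.encoder(torch.cat([x_ctx, y_ctx], dim=-1)).mean(dim=1)
        r = r.unsqueeze(1).expand(-1, x_tgt.size(1), -1)
        mean, raw_std = self.decoder(torch.cat([x_tgt, r], dim=-1)).chunk(2, dim=-1)
        return mean, 0.01 + F.softplus(raw_std)  # small floor keeps std positive

# Toy usage: 8 observed context frames per utterance, 20 target frames.
model = ConditionalNeuralProcess()
x_ctx, y_ctx = torch.rand(4, 8, 2), torch.rand(4, 8, 1)
x_tgt, y_tgt = torch.rand(4, 20, 2), torch.rand(4, 20, 1)
mean, std = model(x_ctx, y_ctx, x_tgt)
nll = 0.5 * (torch.log(2 * torch.pi * std ** 2) + (y_tgt - mean) ** 2 / std ** 2)
print("Gaussian NLL:", nll.mean().item())
```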
- Award ID(s): 2016719
- PAR ID: 10532847
- Editor(s): na
- Publisher / Repository: IEEE
- Date Published:
- ISBN: 979-8-3503-4485-1
- Page Range / eLocation ID: 12036 to 12040
- Subject(s) / Keyword(s): Speech Emotion Recognition, Dynamic Speech Emotion Recognition, Time-Continuous Emotional Traces.
- Format(s): Medium: X
- Location: Seoul, Korea, Republic of
- Sponsoring Org: National Science Foundation
More Like this
The advancement of Speech Emotion Recognition (SER) is significantly dependent on the quality of the emotional speech corpora used for model training. Researchers in the field of SER have developed various corpora by adjusting design parameters to enhance the reliability of the training source. In this study, we focus on the communication mode of collection, specifically analyzing spontaneous emotional speech gathered during conversations or monologues. While conversations are acknowledged as effective for eliciting authentic emotional expressions, systematic analyses are necessary to confirm their reliability as a better source of emotional speech data. We investigate this research question through the perceptual differences and acoustic variability present in both types of emotional speech. Our analyses on multi-lingual corpora show that, first, raters exhibit higher consistency for conversation recordings when evaluating categorical emotions, and second, perceptions and acoustic patterns observed in conversational samples align more closely with expected trends discussed in the relevant emotion literature. We further examine the impact of these differences on SER modeling, showing that a more robust and stable SER model can be trained using conversation data. This work provides comprehensive evidence suggesting that conversation may offer a better source than monologue for developing an SER model.
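The abstract does not name the agreement metric used to compare rater consistency; Fleiss' kappa is one common choice for categorical emotion labels, and the hypothetical comparison below sketches how per-mode consistency could be quantified. The rating counts are made up for illustration.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for categorical ratings. `counts` is an
    (items x categories) matrix where counts[i, j] is how many raters
    assigned item i to emotion category j (same rater count per item)."""
    counts = np.asarray(counts, dtype=float)
    n = counts.sum(axis=1)[0]                                   # raters per item
    p_i = (np.sum(counts ** 2, axis=1) - n) / (n * (n - 1))     # per-item agreement
    p_bar = p_i.mean()
    p_j = counts.sum(axis=0) / counts.sum()                     # category frequencies
    p_e = np.sum(p_j ** 2)                                      # chance agreement
    return (p_bar - p_e) / (1 - p_e)

# Hypothetical example: 3 items, 5 raters, 4 emotion categories per mode.
conversation = [[4, 1, 0, 0], [0, 5, 0, 0], [3, 1, 1, 0]]
monologue    = [[2, 1, 1, 1], [1, 2, 1, 1], [2, 2, 1, 0]]
print(fleiss_kappa(conversation), fleiss_kappa(monologue))  # conversation > monologue
```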
Unsupervised domain adaptation offers significant potential for cross-lingual speech emotion recognition (SER). Most relevant studies have addressed this problem as a domain mismatch without considering phonetic emotional differences across languages. Our study explores universal discrete speech units obtained with vector quantization of wavLM representations from emotional speech in English, Taiwanese Mandarin, and Russian. We estimate cluster-wise distributions of quantized wavLM frames to quantify phonetic commonalities and differences across languages, vowels, and emotions. Our findings indicate that certain emotion-specific phonemes exhibit cross-linguistic similarities. The distribution of vowels varies with emotional content. Certain vowels across languages show close distributional proximity, offering anchor points for cross-lingual domain adaptation. We also propose and validate a method to quantify phoneme distribution similarities across languages.
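A rough sketch of the pipeline described above, with random arrays standing in for wavLM frame embeddings: a shared k-means codebook provides the discrete speech units, and per-group unit histograms are compared across languages with the Jensen-Shannon distance. The codebook size, embedding dimensionality, and distance measure are assumptions for the example, not the paper's exact setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def cluster_histogram(frames, kmeans):
    """Distribution over discrete speech units for one group of frames
    (e.g., one language/vowel/emotion combination)."""
    units = kmeans.predict(frames)
    hist = np.bincount(units, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()

# Hypothetical stand-ins for wavLM frame embeddings (768-dimensional).
rng = np.random.default_rng(0)
english_happy = rng.normal(size=(500, 768))
mandarin_happy = rng.normal(size=(500, 768))

# Learn a shared codebook ("universal discrete speech units") on pooled frames.
kmeans = KMeans(n_clusters=64, n_init=10, random_state=0).fit(
    np.vstack([english_happy, mandarin_happy]))

# Compare the unit distributions of the same emotion across languages.
h_en = cluster_histogram(english_happy, kmeans)
h_zh = cluster_histogram(mandarin_happy, kmeans)
print("Jensen-Shannon distance:", jensenshannon(h_en, h_zh))
```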
Speech Emotion Recognition (SER) faces a distinct challenge compared to other speech-related tasks because its annotations reflect the subjective emotional perceptions of different annotators. Previous SER studies often treat the subjectivity of emotion perception as noise, using the majority rule or plurality rule to obtain consensus labels. However, these standard approaches overlook the valuable information in labels that do not agree with the consensus and make the test set artificially easier. Emotion perception can involve co-occurring emotions in realistic conditions, so it is unnecessary to regard disagreement between raters as noise. To recast SER as a multi-label task, we introduced an "all-inclusive rule" (AR), which considers all available data, ratings, and distributional labels as multi-label targets and yields a complete test set. We demonstrated that models trained with multi-label targets generated by the proposed AR outperform conventional single-label methods across incomplete and complete test sets.
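A minimal sketch of how multi-label and distributional targets might be built without discarding any rating; the exact construction of the all-inclusive rule in the paper may differ, and the emotion inventory below is hypothetical.

```python
import numpy as np

EMOTIONS = ["angry", "happy", "neutral", "sad", "surprise"]  # illustrative class set

def all_inclusive_targets(rater_labels, classes=EMOTIONS):
    """Turn the full set of annotator ratings for one sentence into targets.
    Unlike majority/plurality voting, no rating is discarded: the binary
    vector marks every emotion selected by at least one rater, and the
    distributional vector keeps the proportion of raters choosing each one."""
    counts = np.zeros(len(classes))
    for label in rater_labels:
        counts[classes.index(label)] += 1
    multi_label = (counts > 0).astype(float)   # multi-label target
    distribution = counts / counts.sum()       # distributional target
    return multi_label, distribution

# One sentence rated by five annotators with partial disagreement.
ratings = ["happy", "happy", "surprise", "happy", "neutral"]
hard, soft = all_inclusive_targets(ratings)
print(hard)   # every emotion chosen by at least one rater is marked positive
print(soft)   # proportion of raters per emotion
```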
The emotional content of several databases is annotated with continuous-time (CT) annotations, providing traces with frame-by-frame scores describing the instantaneous value of an emotional attribute. However, having a single score describing the global emotion of a short segment is more convenient for several emotion recognition formulations. A common approach is to derive sentence-level (SL) labels from CT annotations by aggregating the values of the emotional traces across time and annotators. How similar are these aggregated SL labels to labels originally collected at the sentence level? The release of the MSP-Podcast (SL annotations) and MSP-Conversation (CT annotations) corpora provides the resources to explore the validity of aggregating SL labels from CT annotations. There are 2,884 speech segments that belong to both corpora. Using this set, this study (1) compares both types of annotations using statistical metrics, (2) evaluates their inter-evaluator agreements, and (3) explores the effect of these SL labels on speech emotion recognition (SER) tasks. The analysis reveals benefits of using SL labels derived from CT annotations in the estimation of valence. This analysis also provides insights into how the two types of labels differ and how that could affect a model.
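A small sketch of the aggregation step and one way to compare the resulting labels; averaging over time and then over annotators, and using the concordance correlation coefficient as the comparison metric, are assumptions made for illustration rather than the study's exact procedure.

```python
import numpy as np

def aggregate_sentence_label(traces):
    """Derive a sentence-level (SL) score from continuous-time (CT) traces.
    `traces` is a list of per-annotator arrays of frame-by-frame scores for
    the same segment; they are averaged over time, then over annotators."""
    per_annotator = [np.mean(t) for t in traces]
    return float(np.mean(per_annotator))

def ccc(x, y):
    """Concordance correlation coefficient, a common metric for comparing
    two sets of emotional-attribute labels."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return 2 * cov / (x.var() + y.var() + (x.mean() - y.mean()) ** 2)

# Toy comparison: aggregated CT labels vs. labels collected at the sentence level.
aggregated = [aggregate_sentence_label([np.random.rand(100) for _ in range(3)])
              for _ in range(50)]
original = np.array(aggregated) + np.random.normal(0, 0.02, size=50)
print("CCC:", ccc(aggregated, original))
```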