NSF PAR Search | NSF Public Access Repository

Dynamic Speech Emotion Recognition Using A Conditional Neural Process

https://doi.org/10.1109/ICASSP48485.2024.10447805

Martinez-Lucas, Luz; Busso, Carlos (April 2024, IEEE)
The problem of predicting emotional attributes from speech has often focused on predicting a single value from a sentence or short speaking turn. These methods often ignore that natural emotions are both dynamic and dependent on context. To model the dynamic nature of emotions, we can treat the prediction of emotion from speech as a time-series problem. We refer to the problem of predicting these emotional traces as dynamic speech emotion recognition. Previous studies in this area have used models that treat all emotional traces as coming from the same underlying distribution. Since emotions are dependent on contextual information, these methods might obscure the context of an emotional interaction. This paper uses a neural process model with a segment-level speech emotion recognition (SER) model for this problem. This type of model leverages information from the time-series and predictions from the SER model to learn a prior that defines a distribution over emotional traces. Our proposed model performs 21% better than a bidirectional long short-term memory (BiLSTM) baseline when predicting emotional traces for valence.
Full Text Available
Deep temporal clustering features for speech emotion recognition

https://doi.org/10.1016/j.specom.2023.103027

Lin, Wei-Cheng; Busso, Carlos (February 2024, Speech Communication)
Deep clustering is a popular unsupervised technique for feature representation learning. We recently proposed the chunk-based DeepEmoCluster framework for speech emotion recognition (SER) to adopt the concept of deep clustering as a novel semi-supervised learning (SSL) framework, which achieved improved recognition performances over conventional reconstruction-based approaches. However, the vanilla DeepEmoCluster lacks critical sentence-level temporal information that is useful for SER tasks. This study builds upon the DeepEmoCluster framework, creating a powerful SSL approach that leverages temporal information within a sentence. We propose two sentence-level temporal modeling alternatives using either the temporal-net or the triplet loss function, resulting in a novel temporal-enhanced DeepEmoCluster framework to capture essential temporal information. The key contribution to achieving this goal is the proposed sentence-level uniform sampling strategy, which preserves the original temporal order of the data for the clustering process. An extra network module (e.g., gated recurrent unit) is utilized for the temporal-net option to encode temporal information across the data chunks. Alternatively, we can impose additional temporal constraints by using the triplet loss function while training the DeepEmoCluster framework, which does not increase model complexity. Our experimental results based on the MSP-Podcast corpus demonstrate that the proposed temporal-enhanced framework significantly outperforms the vanilla DeepEmoCluster framework and other existing SSL approaches in regression tasks for the emotional attributes arousal, dominance, and valence. The improvements are observed in fully-supervised learning or SSL implementations. Further analyses validate the effectiveness of the proposed temporal modeling, showing (1) high temporal consistency in the cluster assignment, and (2) well-separated emotional patterns in the generated clusters.
Full Text Available
Multimodal attention for lip synthesis using conditional generative adversarial networks

https://doi.org/10.1016/j.specom.2023.102959

Vidal, Andrea; Busso, Carlos (September 2023, Speech Communication)

Full Text Available
Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization

https://doi.org/10.1145/3577190.3614110

Lin, Wei-Cheng; Goncalves, Lucas; Busso, Carlos (October 2023, ACM)
Most existing audio-text emotion recognition studies have focused on the computational modeling aspects, including strategies for fusing the modalities. An area that has received less attention is understanding the role of proper temporal synchronization between the modalities in the model performance. This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies to align text and speech. The approach creates chunks with alternative alignment strategies with different levels of dependency on the underlying lexical boundaries. A key contribution of this study is the multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries. For every epoch, the approach generates a different alignment for each sentence, serving as an effective regularization method for temporal dependency. Our experimental results based on the MSP-Podcast corpus indicate that providing precise temporal alignment information to create the audio-text chunks does not improve the performance of the system. The attention mechanisms in the transformer-based approach are able to compensate for imperfect synchronization between the modalities. However, using exact lexical boundaries makes the system highly vulnerable to missing modalities. In contrast, the model trained with the proposed multi-scale chunk regularization strategy using random alignment can significantly increase its robustness against missing data and remain effective, even under a single audio-only emotion recognition task. The code is available at: https://github.com/winston-lin-wei-cheng/MultiScale-Chunk-Regularization
Full Text Available
Preference Learning Labels by Anchoring on Consecutive Annotations

https://doi.org/10.21437/Interspeech.2023-1108

Naini, Abinay Reddy; Salman, Ali N.; Busso, Carlos (August 2023, Interspeech 2023)

An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels.
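As a rough illustration of deriving preference labels from consecutive annotations, the Python sketch below compares each score a rater assigns with the score the same rater gave to the immediately preceding sample in the session. The function name, the `(sample_id, score)` input layout, and the `margin` threshold are assumptions made for this example; the abstract does not specify the authors' exact formulation.

```python
from typing import List, Tuple


def consecutive_preference_labels(
    session: List[Tuple[str, float]], margin: float = 0.0
) -> List[Tuple[str, str]]:
    """Return (preferred_id, other_id) pairs from one rater's ordered session.

    `session` lists (sample_id, score) in the order the rater annotated them.
    Only consecutive samples are compared. `margin` is a hypothetical
    threshold: score changes no larger than it produce no label.
    """
    pairs = []
    for (prev_id, prev_score), (cur_id, cur_score) in zip(session, session[1:]):
        diff = cur_score - prev_score
        if diff > margin:
            # The rater moved up: the current sample is preferred.
            pairs.append((cur_id, prev_id))
        elif diff < -margin:
            # The rater moved down: the previous sample is preferred.
            pairs.append((prev_id, cur_id))
        # Otherwise the trend is flat and the pair is skipped.
    return pairs


# Made-up scores from a single rater, in annotation order.
session = [("utt1", 3.0), ("utt2", 5.0), ("utt3", 5.0), ("utt4", 2.0)]
print(consecutive_preference_labels(session))
```

In this toy session the tie between `utt2` and `utt3` yields no label, so only two of the three consecutive pairs contribute training examples.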
Full Text Available
Learning Cross-Modal Audiovisual Representations with Ladder Networks for Emotion Recognition

https://doi.org/10.1109/ICASSP49357.2023.10096138

Goncalves, Lucas; Busso, Carlos (June 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023))

Representation learning is a challenging but essential task in audiovisual learning. A key challenge is to generate strong cross-modal representations while still capturing discriminative information contained in unimodal features. Properly capturing this information is important to increase accuracy and robustness in audiovisual tasks. Focusing on emotion recognition, this study proposes novel cross-modal ladder networks to capture modality-specific information while building strong cross-modal representations. Our method utilizes representations from a backbone network to implement unsupervised auxiliary tasks to reconstruct intermediate layer representations across the acoustic and visual networks. The skip connections between the cross-modal encoder and decoder provide powerful modality-specific and multimodal representations for emotion recognition. Our model on the CREMA-D corpus achieves high performance with precision, recall, and F1 scores over 80% on a six-class problem.
Full Text Available
Role of Lexical Boundary Information in Chunk-Level Segmentation for Speech Emotion Recognition

https://doi.org/10.1109/ICASSP49357.2023.10096861

Lin, Wei-Cheng; Busso, Carlos (June 2023, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023))

Chunk-level speech emotion recognition (SER) is a common modeling scheme to obtain better recognition performance than sentence-level formulations. A key open question is the role of lexical boundary information in the process of splitting a sentence into small chunks. Is there any benefit in providing precise lexical boundary information to segment the speech into chunks (e.g., word-level alignments)? This study analyzes the role of lexical boundary information by exploring alternative segmentation strategies for chunk-level SER. We compare six chunk-level segmentation strategies that either consider word-level alignments or traditional time-based segmentation methods by varying the number of chunks and the duration of the chunks. We conduct extensive experiments to evaluate these chunk-level segmentation approaches using multiple corpora and multiple acoustic feature sets. The results show a minor contribution of the word-level timing boundaries, where centering the chunks around words does not lead to significant performance gains. Instead, the critical factor to effectively segment a sentence into data chunks is to define the number of chunks according to the number of spoken words in the sentence.
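To make the closing finding concrete, the following Python sketch shows one possible time-based segmentation in which the number of chunks follows the word count while the chunk positions ignore word boundaries. The function name, the one-chunk-per-word mapping, and the 1.0-second chunk length are assumptions for illustration only; the abstract does not state the mapping or durations the authors used.

```python
from typing import List, Tuple


def chunk_boundaries(
    duration: float, n_words: int, chunk_len: float = 1.0
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds for fixed-length chunks.

    The chunk count equals the word count (a hypothetical mapping), and the
    chunks are spread evenly over the utterance, overlapping when needed.
    """
    if duration <= 0:
        raise ValueError("duration must be positive")
    n_chunks = max(1, n_words)
    if duration <= chunk_len:
        # Utterance shorter than one chunk: fall back to a single chunk.
        return [(0.0, duration)]
    if n_chunks == 1:
        return [(0.0, chunk_len)]
    # Evenly spaced start times so the last chunk ends at the utterance end.
    step = (duration - chunk_len) / (n_chunks - 1)
    return [(i * step, i * step + chunk_len) for i in range(n_chunks)]


# A 5-second sentence with 3 spoken words gives 3 non-overlapping chunks.
print(chunk_boundaries(5.0, 3))
```

With more words than the duration can hold without overlap (for example, 7 words in 3 seconds), the step becomes smaller than the chunk length and neighboring chunks overlap.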
Full Text Available
The Importance of Calibration: Rethinking Confidence and Performance of Speech Multi-label Emotion Classifiers

https://doi.org/10.21437/Interspeech.2023-1113

Chou, Huang-Cheng; Goncalves, Lucas; Leem, Seong-Gyun; Lee, Chi-Chun; Busso, Carlos (August 2023, Interspeech 2023)

The uncertainty in modeling emotions makes speech emotion recognition (SER) systems less reliable. An intuitive way to increase trust in SER is to reject predictions with low confidence. This approach assumes that an SER system is well calibrated, where high-confidence predictions are often right and low-confidence predictions are often wrong. Hence, it is desirable to calibrate the confidence of SER classifiers. We evaluate the reliability of SER systems by exploring the relationship between confidence and accuracy, using the expected calibration error (ECE) metric. We develop a multi-label variant of the post-hoc temperature scaling (TS) method to calibrate SER systems, while preserving their accuracy. The best method combines an emotion co-occurrence weight penalty function, a class-balanced objective function, and the proposed multi-label TS calibration method. The experiments show the effectiveness of our developed multi-label calibration method in terms of accuracy and ECE.
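For readers unfamiliar with the two building blocks named here, the Python sketch below shows the standard single-label forms of ECE (with equal-width confidence bins) and temperature scaling. It is not the multi-label variant the authors propose; the function names, the 10-bin default, and the toy inputs are assumptions for illustration.

```python
import math
from typing import List, Sequence


def expected_calibration_error(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Standard ECE: weighted mean of |accuracy - confidence| over bins."""
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins: List[list] = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, hit))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, h in members if h) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


def softmax_with_temperature(
    logits: Sequence[float], temperature: float = 1.0
) -> List[float]:
    """Softmax of logits / T. T > 1 softens confidence; the argmax is unchanged."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Overconfident toy classifier: 95% confident but right only 3 times out of 4.
print(expected_calibration_error([0.95] * 4, [True, True, True, False]))
print(softmax_with_temperature([2.0, 0.0, 0.0], temperature=2.0))
```

Because dividing the logits by a positive temperature never changes their ordering, the predicted class stays the same, which is why this kind of post-hoc calibration preserves accuracy.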
Full Text Available
Analyzing the Effect of Affective Priming on Emotional Annotations

https://doi.org/10.1109/ACII59096.2023.10388184

Martinez-Lucas, Luz; Salman, Ali; Leem, Seong-Gyun; Upadhyay, Shreya G; Lee, Chi-Chun; Busso, Carlos (September 2023, IEEE)
In the field of affective computing, emotional annotations are highly important for both the recognition and synthesis of human emotions. Researchers must ensure that these emotional labels are adequate for modeling general human perception. An unavoidable part of obtaining such labels is that human annotators are exposed to known and unknown stimuli before and during the annotation process that can affect their perception. Emotional stimuli cause an affective priming effect, which is a pre-conscious phenomenon in which previous emotional stimuli affect the emotional perception of a current target stimulus. In this paper, we use sequences of emotional annotations during a perceptual evaluation to study the effect of affective priming on emotional ratings of speech. We observe that previous emotional sentences with extreme emotional content push annotations of current samples to the same extreme. We create a sentence-level bias metric to study the effect of affective priming on speech emotion recognition (SER) modeling. The metric is used to identify subsets in the database with more affective priming bias, intentionally creating biased datasets. We train and test SER models using the full and biased datasets. Our results show that although the biased datasets have low inter-evaluator agreements, SER models for arousal and dominance trained with those datasets perform the best. For valence, the models trained with the less-biased datasets perform the best.
Full Text Available
Sequential Modeling by Leveraging Non-Uniform Distribution of Speech Emotion

https://doi.org/10.1109/TASLP.2023.3244527

Lin, Wei-Cheng; Busso, Carlos (January 2023, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
