NSF PAR Search | NSF Public Access Repository

Quantifying Emotional Similarity in Speech

https://doi.org/10.1109/TAFFC.2021.3127390

Harvill, John; Leem, Seong-Gyun; AbdelWahab, Mohammed; Lotfian, Reza; Busso, Carlos (April 2023, IEEE Transactions on Affective Computing)
This study proposes a novel formulation: measuring emotional similarity between speech recordings. This formulation explores the ordinal nature of emotions by comparing emotional similarities instead of predicting an emotional attribute or recognizing an emotional category. The proposed task determines which of two alternative samples has emotional content more similar to that of a given anchor. This task raises some interesting questions. Which emotional descriptor provides the most suitable space to assess emotional similarities? Can deep neural networks (DNNs) learn representations to robustly quantify emotional similarities? We address these questions by exploring alternative emotional spaces created with attribute-based descriptors and categorical emotions. We create the representation using a DNN trained with the triplet loss function, which relies on triplets formed with an anchor, a positive example, and a negative example. We select a positive sample that has emotional content similar to the anchor, and a negative sample that has emotional content dissimilar to the anchor. The task of our DNN is to identify the positive sample. The experimental evaluations demonstrate that we can learn a meaningful embedding to assess emotional similarities, achieving higher performance than human evaluators asked to complete the same task.
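The triplet objective named in this abstract can be illustrated with a minimal NumPy sketch. The abstract does not state the distance measure or the margin, so the Euclidean distance, the `margin=0.2` default, and the function names `triplet_loss` and `pick_more_similar` are assumptions made here for illustration only, not the authors' implementation.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on embedding vectors.

    Assumption: Euclidean distance and margin=0.2; the abstract gives neither.
    """
    d_pos = np.linalg.norm(anchor - positive, axis=-1)  # anchor to positive
    d_neg = np.linalg.norm(anchor - negative, axis=-1)  # anchor to negative
    # Loss is zero once the positive is closer than the negative by the margin.
    return np.maximum(d_pos - d_neg + margin, 0.0)

def pick_more_similar(anchor, cand_a, cand_b):
    """Return "a" or "b" for the candidate whose embedding is closer to the anchor."""
    d_a = np.linalg.norm(anchor - cand_a)
    d_b = np.linalg.norm(anchor - cand_b)
    return "a" if d_a <= d_b else "b"
```

At test time, the comparison task described above would reduce to a call such as `pick_more_similar(anchor_emb, emb_a, emb_b)` on embeddings produced by the trained network.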
Full Text Available
Generative Approach Using Soft-Labels to Learn Uncertainty in Predicting Emotional Attributes

https://doi.org/10.1109/ACII52823.2021.9597461

Sridhar, Kusha; Lin, Wei-Cheng; Busso, Carlos (September 2021, International Conference on Affective Computing and Intelligent Interaction (ACII 2021))

Full Text Available
DeepEmoCluster: A Semi-Supervised Framework for Latent Cluster Representation of Speech Emotions

https://doi.org/10.1109/ICASSP39728.2021.9414035

Lin, Wei-Cheng; Sridhar, Kusha; Busso, Carlos (June 2021, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2021))
Semi-supervised learning (SSL) is an appealing approach to address the generalization problem for speech emotion recognition (SER) systems. By utilizing large amounts of unlabeled data, SSL is able to gain extra information about the prior distribution of the data. Typically, it can lead to better and more robust recognition performance. Existing SSL approaches for SER include variations of encoder-decoder model structures such as autoencoders (AEs) and variational autoencoders (VAEs), where it is difficult to interpret the learning mechanism behind the latent space. In this study, we introduce a new SSL framework, which we refer to as the DeepEmoCluster framework, for attribute-based SER tasks. The DeepEmoCluster framework is an end-to-end model with mel-spectrogram inputs, which combines a self-supervised pseudo-labeling classification network with a supervised emotional attribute regressor. The approach encourages the model to learn latent representations by maximizing the emotional separation of K-means clusters. Our experimental results based on the MSP-Podcast corpus indicate that the DeepEmoCluster framework achieves competitive prediction performance in a fully supervised scheme, outperforming baseline methods in most of the conditions. The approach can be further improved by incorporating an extra unlabeled set. Moreover, our experimental results explicitly show that the latent clusters have emotional dependencies, enriching the geometric interpretation of the clusters.
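The K-means pseudo-labeling step mentioned in this abstract can be sketched in plain NumPy. This is only the generic clustering step, assuming pseudo-labels are the indices of the nearest centroids in the latent space; the function name `kmeans_pseudo_labels`, the iteration count, and the caller-supplied initial centroids are hypothetical, and the classification network and attribute regressor of the actual framework are not shown.

```python
import numpy as np

def _nearest_centroid(embeddings, centroids):
    # Pairwise Euclidean distances, shape (num_samples, num_clusters).
    dists = np.linalg.norm(embeddings[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)

def kmeans_pseudo_labels(embeddings, init_centroids, num_iters=10):
    """Run a few Lloyd iterations and return (pseudo_labels, centroids).

    Assumption: the cluster index of each latent vector serves as its
    pseudo-label for a self-supervised classification branch.
    """
    centroids = np.asarray(init_centroids, dtype=float).copy()
    for _ in range(num_iters):
        labels = _nearest_centroid(embeddings, centroids)
        for k in range(len(centroids)):
            members = embeddings[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    # Final assignment against the updated centroids.
    return _nearest_centroid(embeddings, centroids), centroids
```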
Full Text Available
Chunk-Level Speech Emotion Recognition: A General Framework of Sequence-to-One Dynamic Temporal Modeling

https://doi.org/10.1109/TAFFC.2021.3083821

Lin, Wei-Cheng; Busso, Carlos (May 2021, IEEE Transactions on Affective Computing)
A critical issue in current speech-based sequence-to-one learning tasks, such as speech emotion recognition (SER), is the dynamic temporal modeling of speech sentences with different durations. The goal is to extract an informative representation vector of the sentence from acoustic feature sequences of varying length. Traditional methods rely on static descriptions such as statistical functions or a universal background model (UBM), which are not capable of characterizing dynamic temporal changes. Recent advances in deep learning architectures provide promising results, directly extracting sentence-level representations from frame-level features. However, conventional cropping and padding techniques that deal with variable-length sequences are not optimal, since they truncate or artificially add sentence-level information. Therefore, we propose a novel dynamic chunking approach, which maps the original sequences of different lengths into a fixed number of chunks that have the same duration by adjusting their overlap. This simple chunking procedure creates a flexible framework that can incorporate different feature extractions and sentence-level temporal aggregation approaches to cope, in a principled way, with different sequence-to-one tasks. Our experimental results based on three databases demonstrate that the proposed framework provides: 1) improvement in recognition accuracy, 2) robustness toward different temporal length predictions, and 3) high computational efficiency.
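The idea of mapping a variable-length sequence into a fixed number of equal-length chunks by adjusting their overlap can be sketched as follows. The abstract does not give the exact step-size rule, so evenly spaced chunk starts, the requirement that the sequence be at least one chunk long, and the function name `dynamic_chunks` are assumptions for illustration.

```python
import numpy as np

def dynamic_chunks(features, chunk_len, num_chunks):
    """Split a (num_frames, feat_dim) array into num_chunks chunks of chunk_len frames.

    Assumption: chunk starts are spread evenly from the first frame to the
    last valid start, so shorter inputs get more overlap between chunks and
    longer inputs get less (or gaps), while the output shape stays fixed.
    """
    num_frames = features.shape[0]
    if num_frames < chunk_len:
        raise ValueError("sequence is shorter than one chunk")
    starts = np.linspace(0, num_frames - chunk_len, num_chunks).round().astype(int)
    return np.stack([features[s:s + chunk_len] for s in starts])
```

For example, a 100-frame input and a 30-frame input both yield an array of shape `(num_chunks, chunk_len, feat_dim)`, which a downstream model can aggregate into one sentence-level prediction.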
Full Text Available
An Efficient Temporal Modeling Approach for Speech Emotion Recognition by Mapping Varied Duration Sentences into Fixed Number of Chunks

https://doi.org/10.21437/Interspeech.2020-2636

Lin, Wei-Cheng; Busso, Carlos (October 2020, Interspeech 2020)
Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, which are computed over time from low-level descriptors (LLDs), creating a fixed-dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated networks or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results based on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTM, but also improves computational efficiency.
Full Text Available
Modeling Uncertainty in Predicting Emotional Attributes from Spontaneous Speech

https://doi.org/10.1109/ICASSP40776.2020.9054237

Sridhar, Kusha; Busso, Carlos (May 2020, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020))

A challenging task in affective computing is to build reliable speech emotion recognition (SER) systems that can accurately predict emotional attributes from spontaneous speech. To increase the trust in these SER systems, it is important to predict not only their accuracy, but also their confidence. An intriguing approach to predict uncertainty is Monte Carlo (MC) dropout, which obtains predictions from multiple feed-forward passes through a deep neural network (DNN) by using dropout regularization in both training and inference. This study evaluates this approach with regression models to predict emotional attribute scores for valence, arousal, and dominance. The analysis illustrates that predicting uncertainty in this problem is possible, where the performance is higher for samples in the test set with lower uncertainty. The study evaluates uncertainty estimation as a function of the emotional attributes, showing that samples with extreme values have lower uncertainty. Finally, we demonstrate the benefits of uncertainty estimation with a reject option, where a classifier can decline to give a prediction when its confidence is low. By rejecting only 25% of the test set with the highest uncertainty, we achieve relative performance gains of 7.34% for arousal, 13.73% for valence, and 8.79% for dominance.
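MC dropout and the reject option described here can be illustrated with a toy NumPy regressor. Everything concrete in this sketch is an assumption: the one-hidden-layer network, the weights passed in by the caller, the use of the standard deviation across passes as the uncertainty score, and the function names `mc_dropout_predict` and `reject_most_uncertain`. It is not the model evaluated in the paper.

```python
import numpy as np

def mc_dropout_predict(x, w_hidden, w_out, drop_prob=0.5, num_passes=50, seed=0):
    """Run several stochastic forward passes with dropout kept active.

    Returns the mean prediction and its standard deviation per sample;
    the standard deviation is used here as the uncertainty estimate.
    """
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(num_passes):
        hidden = np.maximum(x @ w_hidden, 0.0)             # ReLU hidden layer
        mask = rng.random(hidden.shape) >= drop_prob       # random dropout mask
        hidden = hidden * mask / (1.0 - drop_prob)         # inverted dropout scaling
        preds.append(hidden @ w_out)
    preds = np.stack(preds)                                # (num_passes, num_samples)
    return preds.mean(axis=0), preds.std(axis=0)

def reject_most_uncertain(uncertainty, reject_fraction=0.25):
    """Return sorted indices of the samples kept after rejecting the most uncertain ones."""
    num_reject = int(round(reject_fraction * len(uncertainty)))
    num_keep = len(uncertainty) - num_reject
    keep = np.argsort(uncertainty)[:num_keep]
    return np.sort(keep)
```

A performance metric would then be computed only on the indices returned by `reject_most_uncertain`, mirroring the 25% rejection setting reported in the abstract.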
Full Text Available
Semi-Supervised Speech Emotion Recognition With Ladder Networks

https://doi.org/10.1109/TASLP.2020.3023632

Parthasarathy, Srinivas; Busso, Carlos (January 2020, IEEE/ACM Transactions on Audio, Speech, and Language Processing)
Full Text Available
End-to-end audiovisual speech activity detection with bimodal recurrent neural models

https://doi.org/10.1016/j.specom.2019.07.003

Tao, Fei; Busso, Carlos (October 2019, Speech Communication)

Full Text Available
Retrieving Speech Samples with Similar Emotional Content Using a Triplet Loss Function

https://doi.org/10.1109/ICASSP.2019.8683273

Harvill, John; AbdelWahab, Mohammed; Lotfian, Reza; Busso, Carlos (May 2019, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019))

The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveillance, and emotional speech synthesis. While current formulations in speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learning offer appealing alternatives. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interesting research questions. How well can a machine complete this task? How does the accuracy of automatic algorithms compare to the performance of a human performing this task? This study addresses these questions by training a deep learning model using a triplet loss function, mapping the acoustic features into an embedding that is discriminative for this task. The network receives an anchor speech sample and two competing speech samples, and the task is to determine which of the candidate speech samples conveys the closest emotional content to the emotion conveyed by the anchor. By comparing the results from our model with human perceptual evaluations, this study demonstrates that the proposed approach has performance very close to human performance in retrieving samples with similar emotional content.
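Once such an embedding has been learned, retrieving emotionally similar samples for a query can be sketched as a nearest-neighbor search. The Euclidean distance, the `top_k` parameter, and the function name `retrieve_similar` are assumptions made for illustration; the abstract does not describe the retrieval procedure at this level of detail.

```python
import numpy as np

def retrieve_similar(query_emb, corpus_embs, top_k=3):
    """Rank corpus samples by distance to the query embedding.

    query_emb:   (dim,) embedding of the anchor/query sample.
    corpus_embs: (num_samples, dim) embeddings of candidate samples.
    Returns the indices of the top_k closest samples and their distances.
    """
    dists = np.linalg.norm(corpus_embs - query_emb, axis=1)
    order = np.argsort(dists, kind="stable")[:top_k]
    return order, dists[order]
```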
Full Text Available
Curriculum Learning for Speech Emotion Recognition From Crowdsourced Labels

https://doi.org/10.1109/TASLP.2019.2898816

Lotfian, Reza; Busso, Carlos (April 2019, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
