Title: Retrieving Speech Samples with Similar Emotional Content Using a Triplet Loss Function
The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveillance, and emotional speech synthesis. While current formulations in speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learning offer appealing alternatives. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interesting research questions. How well can a machine complete this task? How does the accuracy of automatic algorithms compare to the performance of a human performing this task? This study addresses these questions by training a deep learning model with a triplet loss function, mapping the acoustic features into an embedding that is discriminative for this task. The network receives an anchor speech sample and two competing speech samples, and the task is to determine which of the candidate speech samples conveys emotional content closest to the emotion conveyed by the anchor. By comparing the results from our model with human perceptual evaluations, this study demonstrates that the proposed approach performs very close to human level in retrieving samples with similar emotional content.
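As a concrete illustration of the training setup described above, the following is a minimal PyTorch sketch of a triplet-loss embedding network. The feature dimension, embedding size, margin, and layer shapes are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of triplet-loss training for emotional similarity.
# All dimensions and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class EmotionEmbedder(nn.Module):
    """Maps acoustic features to an embedding where distance reflects
    emotional similarity."""
    def __init__(self, feat_dim=88, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        # L2-normalize so Euclidean distance behaves like cosine distance
        return nn.functional.normalize(self.net(x), dim=-1)

model = EmotionEmbedder()
criterion = nn.TripletMarginLoss(margin=0.2)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# One training step on a toy batch: an anchor, a positive sample with
# similar emotion, and a negative sample with dissimilar emotion.
anchor, positive, negative = (torch.randn(32, 88) for _ in range(3))
optimizer.zero_grad()
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
optimizer.step()
```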
Award ID(s):
1453781
PAR ID:
10099017
Date Published:
Journal Name:
IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2019)
Page Range / eLocation ID:
7400 to 7404
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1.
    This study proposes the novel formulation of measuring emotional similarity between speech recordings. This formulation explores the ordinal nature of emotions by comparing emotional similarities instead of predicting an emotional attribute or recognizing an emotional category. The proposed task determines which of two alternative samples has the most similar emotional content to the emotion of a given anchor. This task raises some interesting questions. Which emotional descriptor provides the most suitable space to assess emotional similarities? Can deep neural networks (DNNs) learn representations to robustly quantify emotional similarities? We address these questions by exploring alternative emotional spaces created with attribute-based descriptors and categorical emotions. We create the representation using a DNN trained with the triplet loss function, which relies on triplets formed with an anchor, a positive example, and a negative example. We select a positive sample that has similar emotional content to the anchor, and a negative sample that has dissimilar emotional content to the anchor. The task of our DNN is to identify the positive sample. The experimental evaluations demonstrate that we can learn a meaningful embedding to assess emotional similarities, achieving higher performance than human evaluators asked to complete the same task.
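    At query time, the learned embedding reduces this two-alternative comparison to a distance check. A hypothetical sketch, assuming an embedding model like the one in the sketch above:

```python
# Hypothetical inference step: given an anchor and two candidates,
# pick the candidate whose embedding is closer to the anchor's.
import torch

def closer_candidate(model, anchor_feats, cand_a_feats, cand_b_feats):
    with torch.no_grad():
        e_anchor = model(anchor_feats)
        d_a = torch.dist(e_anchor, model(cand_a_feats))
        d_b = torch.dist(e_anchor, model(cand_b_feats))
    return "A" if d_a < d_b else "B"
```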
  2. An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels. 
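    As a rough illustration of the idea, the sketch below derives preference pairs from annotation trends within a single rater's session. The session structure and the minimum score gap are assumptions, not the paper's exact procedure.

```python
# Sketch: derive preference labels from annotation trends assigned by
# one rater to consecutive samples within an evaluation session.
# The (sample, score) session format and min_gap are assumptions.
def preference_pairs(session, min_gap=0.5):
    """session: time-ordered list of (sample_id, attribute_score) pairs
    from a single rater. Returns (preferred, dispreferred) pairs for
    consecutive samples whose scores clearly differ."""
    pairs = []
    for (s_prev, v_prev), (s_next, v_next) in zip(session, session[1:]):
        if v_next - v_prev > min_gap:      # rising trend
            pairs.append((s_next, s_prev))
        elif v_prev - v_next > min_gap:    # falling trend
            pairs.append((s_prev, s_next))
    return pairs
```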
  3.
    Psychotherapy, particularly for youth, is a pressing challenge in the health care system. Traditional methods are resource-intensive, and there is a need for objective benchmarks to guide therapeutic interventions. Automated emotion detection from speech, using artificial intelligence, presents an emerging approach to address these challenges. Speech can carry vital information about emotional states, which can be used to improve mental health care services, especially when the person is suffering. Objective: This study aims to develop and evaluate automated methods for detecting the intensity of emotions (anger, fear, sadness, and happiness) in audio recordings of patients' speech. We also demonstrate the viability of deploying the models. Our model was validated in a previous publication by Alemu et al with limited voice samples; this follow-up study used significantly more voice samples to validate the previous model. Methods: We used audio recordings of patients, specifically children with high adverse childhood experience (ACE) scores. The average ACE score was 5 or higher, placing these children at the highest risk for chronic disease and social or emotional problems; in the general population, only 1 in 6 people have a score of 4 or above. Each patient's structured voice sample was collected by reading a fixed script. In total, 4 highly trained therapists classified audio segments by scoring the intensity level of each of the 4 emotions. We experimented with various preprocessing methods, including denoising, voice-activity detection, and diarization. Additionally, we explored various model architectures, including convolutional neural networks (CNNs) and transformers. We trained emotion-specific transformer-based models and a generalized CNN-based model to predict emotion intensities. Results: The emotion-specific transformer-based model achieved a test-set precision and recall of 86% and 79%, respectively, for binary emotional intensity classification (high or low). In contrast, the CNN-based model, generalized to predict the intensity of 4 different emotions, achieved test-set precision and recall of 83% for each. Conclusions: Automated emotion detection from patients' speech using artificial intelligence models is feasible and achieves a high level of accuracy. The transformer-based model exhibited better performance in emotion-specific detection, while the CNN-based model showed promise in generalized emotion detection. These models can serve as valuable decision-support tools for pediatricians and mental health providers to triage youth to appropriate levels of mental health care services.
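    A minimal PyTorch sketch of a transformer-based binary intensity classifier of the kind described above; the input features, layer sizes, and mean-pooling are illustrative assumptions rather than the study's exact architecture.

```python
# Sketch of an emotion-specific transformer classifier for binary
# intensity (high/low). Sizes and pooling are illustrative assumptions.
import torch
import torch.nn as nn

class IntensityClassifier(nn.Module):
    def __init__(self, feat_dim=80, d_model=128, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, 2)   # logits: low / high intensity

    def forward(self, frames):              # frames: (batch, time, feat_dim)
        h = self.encoder(self.proj(frames))
        return self.head(h.mean(dim=1))     # mean-pool over time
```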
  4. Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pre-text tasks for SSL to improve SER. While our target application is SER, the proposed pre-text tasks include audio-visual formulations, leveraging the relationship between acoustic and facial features. Our proposed approach introduces three new unimodal and multimodal pre-text tasks that are carefully designed to learn better representations for predicting emotional cues from speech. Task 1 predicts energy variations (high or low) from a speech sequence. Task 2 uses speech features to predict facial activation (high or low) based on facial landmark movements. Task 3 performs a multi-class emotion recognition task on emotional labels obtained from combinations of action units (AUs) detected across a video sequence. We pre-train a network with 60.92 hours of unlabeled data, fine-tuning the model for the downstream SER task. The results on the CREMA-D dataset show that the model pre-trained on the proposed domain-specific pre-text tasks significantly improves the precision (up to 5.1%), recall (up to 4.5%), and F1-scores (up to 4.9%) of our SER system. 
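    To make Task 1 concrete, the sketch below generates self-supervised high/low energy labels directly from the waveform, so no human annotation is required. The frame length, hop size, and median threshold are assumptions, not the paper's exact recipe.

```python
# Sketch of pre-text Task 1: label speech frames as high or low energy
# using only the signal itself (self-supervised, no human labels).
import numpy as np

def energy_labels(waveform, frame_len=400, hop=160):
    """waveform: 1-D NumPy array of audio samples. Returns one label per
    frame: 1 (high energy) or 0 (low energy), relative to the
    utterance's median RMS energy."""
    frames = [waveform[i:i + frame_len]
              for i in range(0, len(waveform) - frame_len, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) for f in frames])
    return (rms > np.median(rms)).astype(int)
```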
  5. A challenging task in affective computing is to build reliable speech emotion recognition (SER) systems that can accurately predict emotional attributes from spontaneous speech. To increase the trust in these SER systems, it is important to predict not only their accuracy, but also their confidence. An intriguing approach to predict uncertainty is Monte Carlo (MC) dropout, which obtains predictions from multiple feed-forward passes through a deep neural network (DNN) by using dropout regularization in both training and inference. This study evaluates this approach with regression models to predict emotional attribute scores for valence, arousal and dominance. The analysis illustrates that predicting uncertainty in this problem is possible, where the performance is higher for samples in the test set with lower uncertainty. The study evaluates uncertainty estimation as a function of the emotional attributes, showing that samples with extreme values have lower uncertainty. Finally, we demonstrate the benefits of uncertainty estimation with a reject option, where a classifier can decline to give a prediction when its confidence is low. By rejecting only 25% of the test set with the highest uncertainty, we achieve relative performance gains of 7.34% for arousal, 13.73% for valence and 8.79% for dominance.
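    A minimal sketch of MC dropout with a reject option, assuming a regression DNN that outputs one attribute score per sample; the number of passes is an assumption, and the rejection fraction follows the 25% figure quoted above.

```python
# Sketch: MC dropout keeps dropout active at inference; the spread
# across passes estimates uncertainty, and the most uncertain fraction
# of inputs is declined (reject option).
import torch

def mc_dropout_predict(model, x, n_passes=30):
    model.train()                           # keep dropout active at inference
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_passes)])
    return preds.mean(0), preds.std(0)      # prediction, uncertainty

def predict_with_reject(model, batch, reject_frac=0.25):
    mean, std = mc_dropout_predict(model, batch)
    cutoff = torch.quantile(std.squeeze(-1), 1 - reject_frac)
    keep = std.squeeze(-1) <= cutoff        # decline the most uncertain 25%
    return mean[keep], keep
```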