

Title: Quantifying Emotional Similarity in Speech
This study proposes a novel formulation for measuring emotional similarity between speech recordings. This formulation explores the ordinal nature of emotions by comparing emotional similarities instead of predicting an emotional attribute or recognizing an emotional category. The proposed task determines which of two alternative samples has emotional content most similar to that of a given anchor. This task raises interesting questions. Which emotional descriptor provides the most suitable space to assess emotional similarities? Can deep neural networks (DNNs) learn representations to robustly quantify emotional similarities? We address these questions by exploring alternative emotional spaces created with attribute-based descriptors and categorical emotions. We create the representation using a DNN trained with the triplet loss function, which relies on triplets formed by an anchor, a positive example, and a negative example. We select a positive sample whose emotional content is similar to the anchor and a negative sample whose emotional content is dissimilar to the anchor. The task of our DNN is to identify the positive sample. The experimental evaluations demonstrate that we can learn a meaningful embedding to assess emotional similarities, achieving higher performance than human evaluators asked to complete the same task.
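To make the triplet setup concrete, the snippet below is a minimal sketch of the kind of training step the abstract describes. It is not the authors' implementation: the feature dimension (88), the network size, the margin, and the use of PyTorch's built-in triplet margin loss are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch only; dimensions, architecture, and margin are assumed,
# not taken from the paper.
class EmotionEmbedding(nn.Module):
    """Maps acoustic features to an embedding used to compare emotional content."""
    def __init__(self, feat_dim=88, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, emb_dim),
        )

    def forward(self, x):
        # L2-normalize so distances are comparable across samples
        return F.normalize(self.net(x), dim=-1)

model = EmotionEmbedding()
criterion = nn.TripletMarginLoss(margin=0.2)

# Toy batch: anchor, positive (similar emotion), and negative (dissimilar emotion)
anchor, positive, negative = (torch.randn(16, 88) for _ in range(3))
loss = criterion(model(anchor), model(positive), model(negative))
loss.backward()
```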
Award ID(s): 2016719, 1453781
NSF-PAR ID: 10532835
Publisher / Repository: IEEE
Journal Name: IEEE Transactions on Affective Computing
Volume: 14
Issue: 2
ISSN: 2371-9850
Page Range / eLocation ID: 1376-1390
Subject(s) / Keyword(s): Speech emotion recognition, ordinal affective computing, representation learning of emotion similarity, triplet loss function, speech emotion retrieval
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveillance, and emotional speech synthesis. While current formulations in speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learning offer appealing alternatives. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interesting research questions. How well can a machine complete this task? How does the accuracy of automatic algorithms compare to the performance of a human performing this task? This study addresses these questions by training a deep learning model with a triplet loss function, mapping the acoustic features into an embedding that is discriminative for this task. The network receives an anchor speech sample and two competing speech samples, and the task is to determine which of the candidate speech samples conveys emotional content closest to that of the anchor. By comparing the results from our model with human perceptual evaluations, this study demonstrates that the proposed approach performs very close to human level in retrieving samples with similar emotional content.
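Once such an embedding is trained, the query step reduces to a distance comparison. The sketch below is a hypothetical illustration of that inference step; the stand-in linear "embedding" and the 88-dimensional features are placeholders, not the paper's model.

```python
import torch
import torch.nn as nn

def pick_more_similar(embed, anchor_feats, cand_a_feats, cand_b_feats):
    """Return 'A' if candidate A is emotionally closer to the anchor in the
    learned embedding space, otherwise 'B'."""
    with torch.no_grad():
        za = embed(anchor_feats)
        zca = embed(cand_a_feats)
        zcb = embed(cand_b_feats)
        # Euclidean distance in the embedding space decides the comparison
        return "A" if torch.dist(za, zca) < torch.dist(za, zcb) else "B"

# Toy usage with a stand-in embedding network (a trained model would go here).
embed = nn.Linear(88, 32)
print(pick_more_similar(embed, torch.randn(1, 88), torch.randn(1, 88), torch.randn(1, 88)))
```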
  2. Ecological momentary assessment (EMA) methodology was used to examine the emotional context of nonsuicidal self‐injury (NSSI). Forty‐seven adolescents and young adults used a novel smartphone app to monitor their emotional experiences, NSSI thoughts, and NSSI behaviors for 2 weeks. Momentary changes in both negative and positive emotions predicted greater intensity of NSSI thoughts at the subsequent assessment, while only increases in negative emotion predicted NSSI behaviors. Immediately following NSSI behaviors participants reported reduced high‐arousal negative emotions and increased low‐arousal positive emotions, suggesting that NSSI may be an efficient and effective method of regulating emotion. Findings highlight the importance of addressing emotion regulation in NSSI interventions.
  3. An important task in human-computer interaction is to rank speech samples according to their expressive content. A preference learning framework is appropriate for obtaining an emotional rank for a set of speech samples. However, obtaining reliable labels for training a preference learning framework is a challenging task. Most existing databases provide sentence-level absolute attribute scores annotated by multiple raters, which have to be transformed to obtain preference labels. Previous studies have shown that evaluators anchor their absolute assessments on previously annotated samples. Hence, this study proposes a novel formulation for obtaining preference learning labels by only considering annotation trends assigned by a rater to consecutive samples within an evaluation session. The experiments show that the use of the proposed anchor-based ordinal labels leads to significantly better performance than models trained using existing alternative labels. 
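One plausible reading of this anchor-based rule can be sketched as follows: within a single evaluation session, each pair of consecutive samples yields a preference label whenever the rater's absolute score moved up or down, while ties yield no label. The helper name and the exact rule are illustrative assumptions, not the paper's procedure.

```python
# Illustrative simplification; the paper's exact labeling rule may differ.
def anchor_based_preferences(session):
    """session: list of (sample_id, score) in the order the rater annotated them."""
    pairs = []  # (preferred_id, non_preferred_id)
    for (prev_id, prev_score), (cur_id, cur_score) in zip(session, session[1:]):
        if cur_score > prev_score:
            pairs.append((cur_id, prev_id))   # score went up: later sample preferred
        elif cur_score < prev_score:
            pairs.append((prev_id, cur_id))   # score went down: earlier sample preferred
    return pairs

# Example: one rater's arousal scores for consecutive samples in a session
print(anchor_based_preferences([("s1", 4.0), ("s2", 5.5), ("s3", 5.5), ("s4", 3.0)]))
# -> [('s2', 's1'), ('s3', 's4')]
```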
  4. Longstanding theories of emotion socialization postulate that caregiver emotional and behavioral reactions to a child's emotions together shape the child's emotion displays over time. Despite the notable importance of positive valence system function, the majority of research on caregiver emotion socialization focuses on negative valence system emotions. In the current project, we leveraged a relatively large cross‐sectional study of caregivers (N = 234; 93.59% White) of preschool aged children to investigate whether, and to what degree, caregiver (1) emotional experiences or (2) external behaviors, in the context of preschoolers' positive emotion displays in caregiver–child interactions, are associated with children's general positive affect tendencies. Results indicated that, in the context of everyday caregiver–child interactions, caregiver‐reported positively valenced emotions but not approach behaviors were positively associated with child general positive affect tendencies. However, when examining specific caregiver behaviors in response to everyday child positive emotion displays, caregiver report of narrating the child's emotion and joining in the emotion with their child was positively associated with child general positive affect tendencies. Together, these results suggest that in everyday caregiver–child interactions, caregivers' emotional experiences and attunement with the child play a role in shaping preschoolers' overall tendencies toward positive affect.
  5. Previous studies on speech emotion recognition (SER) with categorical emotions have often formulated the task as a single-label classification problem, where the emotions are considered orthogonal to each other. However, previous studies have indicated that emotions can co-occur, especially for more ambiguous emotional sentences (e.g., a mixture of happiness and surprise). Some studies have regarded SER problems as a multi-label task, predicting multiple emotional classes. However, this formulation does not leverage the relation between emotions during training, since emotions are assumed to be independent. This study explores the idea that emotional classes are not necessarily independent and its implications on training SER models. In particular, we calculate the frequency of co-occurring emotions from perceptual evaluations in the train set to generate a matrix with class-dependent penalties, penalizing mistakes between distant emotional classes more heavily. We integrate the penalization matrix into three existing label-learning approaches (hard-label, multi-label, and distribution-label learning) using the proposed modified loss. We train SER models using the penalty loss and commonly used cost functions for SER tasks. The evaluation of our proposed penalization matrix on the MSP-Podcast corpus shows important relative improvements in macro F1-score for hard-label learning (17.12%), multi-label learning (12.79%), and distribution-label learning (25.8%).
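A rough sketch of the penalization idea is given below. How the co-occurrence counts are normalized and how the penalty matrix enters the loss are assumptions made for illustration; the paper's exact formulation may differ.

```python
import numpy as np

# Illustrative sketch; normalization and loss form are assumed, not from the paper.
def penalty_matrix(cooccurrence_counts, eps=1e-6):
    """cooccurrence_counts[i, j]: times classes i and j were both annotated."""
    freq = cooccurrence_counts / (cooccurrence_counts.sum(axis=1, keepdims=True) + eps)
    penalty = 1.0 - freq            # rarely co-occurring classes get penalized more
    np.fill_diagonal(penalty, 0.0)  # no penalty for the correct class itself
    return penalty

def penalized_cross_entropy(probs, target_idx, penalty):
    """Standard cross-entropy plus extra cost on probability mass assigned to
    classes that rarely co-occur with the target class."""
    ce = -np.log(probs[target_idx] + 1e-12)
    extra = np.dot(penalty[target_idx], probs)
    return ce + extra

counts = np.array([[50, 20, 2],
                   [20, 40, 5],
                   [2, 5, 30]], dtype=float)   # toy co-occurrence counts
P = penalty_matrix(counts)
probs = np.array([0.6, 0.3, 0.1])              # model output for one utterance
print(penalized_cross_entropy(probs, target_idx=0, penalty=P))
```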