Title: Jointly Aligning and Predicting Continuous Emotion Annotations
Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
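The delayed sinc layer is compact enough to sketch in code. Below is a minimal, illustrative PyTorch implementation of the idea: a windowed low-pass sinc kernel whose time shift is a learnable scalar. The kernel length, cutoff, window choice, and initialization are assumptions for illustration, not the configuration reported in the paper.

```python
# Minimal, illustrative sketch of a delayed sinc layer (assumed PyTorch API).
# Kernel length, cutoff, window, and initialization are illustrative choices.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DelayedSinc(nn.Module):
    """Low-pass sinc filter whose time shift (the delay) is a learned scalar."""
    def __init__(self, kernel_size=65, cutoff=0.1, init_delay=0.0):
        super().__init__()
        assert kernel_size % 2 == 1, "odd length keeps the tap grid symmetric"
        self.cutoff = cutoff                                 # normalized, in (0, 0.5)
        self.delay = nn.Parameter(torch.tensor(float(init_delay)))  # in samples
        taps = torch.arange(kernel_size).float() - kernel_size // 2
        self.register_buffer("taps", taps)
        self.register_buffer("window", torch.hamming_window(kernel_size))

    def forward(self, x):  # x: (batch, 1, time)
        # The kernel is a differentiable function of self.delay, so the delay
        # is trained by backpropagation along with the other parameters.
        arg = 2 * self.cutoff * (self.taps - self.delay)
        kernel = 2 * self.cutoff * torch.sinc(arg) * self.window
        # conv1d computes cross-correlation; flipping the kernel makes a
        # positive delay shift the output later in time, as intended.
        kernel = kernel.flip(0).view(1, 1, -1)
        return F.conv1d(x, kernel, padding=kernel.shape[-1] // 2)
```

Because the kernel is differentiable in the delay parameter, the delay is learned jointly with the rest of the network; several such layers with different learned delays can then be combined to model a delay that varies across the acoustic space.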
Award ID(s):
1651740
PAR ID:
10125068
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE Transactions on Affective Computing
ISSN:
2371-9850
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Emotional annotation of data is important in affective computing for the analysis, recognition, and synthesis of emotions. As raters perceive emotion, they make relative comparisons with what they previously experienced, creating “anchors” that influence the annotations. This unconscious influence of the emotional content of previous stimuli in the perception of emotions is referred to as the affective priming effect. This phenomenon is also expected in annotations conducted with out-of-order segments, a common approach for annotating emotional databases. Can the affective priming effect introduce bias in the labels? If yes, how does this bias affect emotion recognition systems trained with these labels? This study presents a detailed analysis of the affective priming effect and its influence on speech emotion recognition (SER). The analysis shows that the affective priming effect affects emotional attributes and categorical emotion annotations. We observe that if annotators assign an extreme score to previous sentences for an emotional attribute (valence, arousal, or dominance), they will tend to annotate the next sentence closer to that extreme. We conduct SER experiments using the most biased sentences. We observe that models trained on the biased sentences perform the best and have the lowest prediction uncertainty. 
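To make the bias analysis in item 1 concrete, here is a hypothetical sketch of one way to probe for an assimilative priming effect: correlate the previous stimulus's rating with the current rating's deviation from the sentence's consensus score. The data layout and the deviation-based test are illustrative assumptions, not the paper's exact analysis.

```python
# Hypothetical sketch of a priming-bias probe: correlate the previous rating
# with the current rating's deviation from the sentence's consensus score.
# `sessions` (per-annotator (sentence_id, rating) pairs in presentation order)
# and `consensus` (sentence_id -> mean rating) are assumed data structures.
import numpy as np

def priming_correlation(sessions, consensus):
    prev_ratings, deviations = [], []
    for session in sessions:                      # one annotator's session
        for (_, prev_r), (sent, cur_r) in zip(session, session[1:]):
            prev_ratings.append(prev_r)
            deviations.append(cur_r - consensus[sent])
    # A positive correlation suggests ratings drift toward the previous
    # stimulus's score, consistent with an assimilative priming effect.
    return np.corrcoef(prev_ratings, deviations)[0, 1]
```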
  2. Several sources of delay in an epidemic network might negatively affect the stability and robustness of the entire network. In this paper, a multi-delayed Susceptible-Infectious-Susceptible (SIS) model is applied to a metapopulation network, where the epidemic delays are categorized into local and global delays. While local delays result from intra-population lags, such as symptom development duration or recovery period, global delays stem from inter-population lags, e.g., transition duration between subpopulations. The theoretical results for a network of subpopulations with identical linear SIS dynamics and different types of time delay show that, depending on the type of time delay in the network, different eigenvalues of the underlying graph should be evaluated to obtain the feasible regions of stability. The delay-dependent stability of such epidemic networks is analytically derived, which eliminates potentially expensive computations required by current algorithms. The effect of time delay on the H2-norm-based performance of a class of epidemic networks with additive noise inputs and multiple delays is studied, and the closed form of their performance measure is derived using the solution of delayed Lyapunov equations. As a case study, the theoretical findings are applied to a network of the United States' busiest airports.
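As a rough illustration of the model class in item 2 (not the paper's exact formulation), the sketch below integrates a linearized SIS metapopulation with one local delay acting on recovery and one global delay acting on the coupling between subpopulations, using a forward-Euler scheme with a stored history. All rates, delays, and the adjacency matrix are illustrative.

```python
# Illustrative forward-Euler integration of a linearized SIS metapopulation
# with a local delay (recovery) and a global delay (inter-population coupling):
#   dx/dt = -delta * x(t - tau_local) + beta * A @ x(t - tau_global)
# All rates, delays, and the adjacency matrix A are illustrative assumptions.
import numpy as np

def simulate_sis(A, beta, delta, tau_local, tau_global, x0, dt=0.01, T=50.0):
    steps = int(T / dt)
    d_l, d_g = int(tau_local / dt), int(tau_global / dt)
    # Constant initial history long enough to cover the larger delay.
    hist = np.tile(np.asarray(x0, dtype=float), (max(d_l, d_g) + 1, 1))
    out = np.empty((steps, len(x0)))
    for k in range(steps):
        x_l = hist[-1 - d_l]          # infection levels tau_local ago
        x_g = hist[-1 - d_g]          # infection levels tau_global ago
        nxt = hist[-1] + dt * (-delta * x_l + beta * (A @ x_g))
        hist = np.vstack([hist[1:], nxt])
        out[k] = nxt
    return out
```

With a nonzero global delay, infection imported from neighboring subpopulations acts on stale state, which is why, as the abstract notes, the graph eigenvalues that govern stability depend on which terms of the dynamics are delayed.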
  3. The emotional content of several databases is annotated with continuous-time (CT) annotations, providing traces with frame-by-frame scores that describe the instantaneous value of an emotional attribute. However, having a single score describing the global emotion of a short segment is more convenient for several emotion recognition formulations. A common approach is to derive sentence-level (SL) labels from CT annotations by aggregating the values of the emotional traces across time and annotators. How similar are these aggregated SL labels to labels originally collected at the sentence level? The release of the MSP-Podcast (SL annotations) and MSP-Conversation (CT annotations) corpora provides the resources to explore the validity of aggregating SL labels from CT annotations. There are 2,884 speech segments that belong to both corpora. Using this set, this study (1) compares both types of annotations using statistical metrics, (2) evaluates their inter-evaluator agreements, and (3) explores the effect of these SL labels on speech emotion recognition (SER) tasks. The analysis reveals benefits of using SL labels derived from CT annotations in the estimation of valence. It also provides insights into how the two types of labels differ and how that could affect a model.
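A minimal sketch of the aggregation step described in item 3, assuming each annotator's trace is a frame-rate-sampled array covering the segment. Averaging over time and then over annotators is one common rule; the function name, frame rate, and data layout are illustrative assumptions.

```python
# Minimal sketch of one common aggregation rule: average each annotator's
# trace over the segment, then average across annotators.
import numpy as np

def sentence_level_label(traces, t0, t1, frame_rate=25.0):
    """traces: per-annotator frame-by-frame arrays; (t0, t1): bounds in seconds."""
    i0, i1 = int(t0 * frame_rate), int(t1 * frame_rate)
    per_annotator = [np.mean(tr[i0:i1]) for tr in traces]  # time average
    return float(np.mean(per_annotator))                   # annotator average
```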
  4. Speech Emotion Recognition (SER) faces a distinct challenge compared to other speech-related tasks because its annotations reflect the subjective emotional perceptions of different annotators. Previous SER studies often treat the subjectivity of emotion perception as noise, using the majority rule or plurality rule to obtain consensus labels. However, these standard approaches overlook the valuable information in labels that disagree with the consensus and make the test set artificially easy. In realistic conditions, emotion perception can involve co-occurring emotions, so disagreement between raters need not be regarded as noise. To recast SER as a multi-label task, we introduce an "all-inclusive rule" (AR), which uses all available data, treating the ratings and distributional labels as multi-label targets and retaining a complete test set. We demonstrate that models trained with multi-label targets generated by the proposed AR outperform conventional single-label methods across both incomplete and complete test sets.
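The sketch below illustrates the spirit of the all-inclusive rule in item 4 under simplifying assumptions: every class chosen by any annotator contributes to a distributional label and a multi-hot target, rather than being discarded by a majority vote. The class inventory and the helper name are hypothetical.

```python
# Illustrative sketch in the spirit of the all-inclusive rule: every class
# chosen by any annotator contributes, instead of being discarded by a
# majority vote. The class inventory and helper name are hypothetical.
import numpy as np

EMOTIONS = ["anger", "happiness", "sadness", "neutral"]   # example classes

def all_inclusive_target(ratings):
    """ratings: list of class names chosen by individual annotators."""
    counts = np.array([ratings.count(e) for e in EMOTIONS], dtype=float)
    dist = counts / counts.sum()          # distributional label over classes
    multi = (counts > 0).astype(float)    # multi-hot target: any rated class
    return dist, multi

# e.g. all_inclusive_target(["anger", "anger", "sadness"])
# -> distribution [2/3, 0, 1/3, 0] and multi-hot mask [1, 0, 1, 0]
```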
  5. Detection of human emotions is an essential part of affect-aware human-computer interaction (HCI). In daily conversations, the preferred way of describing affect is with categorical emotion labels (e.g., sad, anger, surprise). In categorical emotion classification, multiple descriptors (with different degrees of relevance) can be assigned to a sample. Perceptual evaluations have relied on primary and secondary emotions to capture the ambiguous nature of spontaneous recordings. The primary emotion is the most relevant category felt by the evaluator, while secondary emotions capture other emotional cues also conveyed in the stimulus. In most cases, the labels collected for secondary emotions are discarded, since assigning a single class label to a sample is preferred from an application perspective. In this work, we take advantage of both types of annotations to improve the performance of emotion classification. We collect the labels from all the annotations available for a sample and generate primary and secondary emotion labels. A classifier is then trained using multitask learning with both primary and secondary emotions. We experimentally show that considering secondary emotion labels during the learning process leads to relative improvements of 7.9% in F1-score for an 8-class emotion classification task.
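A minimal sketch of the multitask setup described in item 5, assuming a PyTorch-style shared encoder with a single-label (cross-entropy) head for the primary emotion and a multi-label (binary cross-entropy) head for secondary emotions. The layer sizes and the loss weight alpha are illustrative assumptions.

```python
# Minimal sketch of a multitask model (assumed PyTorch API): a shared encoder
# with a single-label head for the primary emotion and a multi-label head for
# secondary emotions. Layer sizes and the weight alpha are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PrimarySecondaryNet(nn.Module):
    def __init__(self, feat_dim=128, num_classes=8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU())
        self.primary_head = nn.Linear(256, num_classes)    # softmax over classes
        self.secondary_head = nn.Linear(256, num_classes)  # independent sigmoids

    def forward(self, x):
        h = self.encoder(x)
        return self.primary_head(h), self.secondary_head(h)

def multitask_loss(p_logits, s_logits, y_primary, y_secondary, alpha=0.5):
    # y_primary: class indices; y_secondary: multi-hot secondary labels.
    loss_p = F.cross_entropy(p_logits, y_primary)
    loss_s = F.binary_cross_entropy_with_logits(s_logits, y_secondary)
    return loss_p + alpha * loss_s
```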