
Title: Jointly Aligning and Predicting Continuous Emotion Annotations
Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
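The core operation of the delayed sinc layer described above can be pictured as an ordinary windowed sinc low-pass filter whose center is shifted by a delay. The sketch below is a minimal NumPy illustration of that idea with a fixed delay; the cutoff, window, and kernel width are hypothetical choices for illustration, not the paper's configuration.

```python
import numpy as np

def delayed_sinc_kernel(delay, cutoff=0.5, half_width=64):
    """Windowed sinc low-pass kernel whose center is shifted by `delay` samples.

    Convolving a signal with this kernel low-pass filters it and shifts it
    by roughly `delay` samples. In the delayed sinc layer the delay is a
    learnable parameter; here it is a fixed float for illustration.
    """
    n = np.arange(-half_width, half_width + 1)
    h = 2 * cutoff * np.sinc(2 * cutoff * (n - delay))  # shifted ideal low-pass
    h *= np.hamming(len(n))                             # taper truncation ripple
    return h / h.sum()                                  # unit DC gain

# Delay a slow ramp by 10 samples:
t = np.arange(200, dtype=float)
x = t / 200.0
h = delayed_sinc_kernel(delay=10.0)
y = np.convolve(x, h, mode="same")
# Away from the edges, y[k] is approximately x[k - 10].
```

In the proposed network, `delay` would instead be a trainable parameter updated by gradient descent, and several such layers with different delays are combined to compensate for a delay that varies across the acoustic space.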
Journal Name:
IEEE Transactions on Affective Computing
Page Range / eLocation ID:
1 to 1
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Several sources of delay in an epidemic network might negatively affect the stability and robustness of the entire network. In this paper, a multi-delayed Susceptible-Infectious-Susceptible (SIS) model is applied to a metapopulation network, where the epidemic delays are categorized into local and global delays. While local delays result from intra-population lags such as symptom development duration or recovery period, global delays stem from inter-population lags, e.g., transition duration between subpopulations. The theoretical results for a network of subpopulations with identical linear SIS dynamics and different types of time-delay show that depending on the type of time-delay in the network, different eigenvalues of the underlying graph should be evaluated to obtain the feasible regions of stability. The delay-dependent stability of such epidemic networks has been analytically derived, which eliminates potentially expensive computations required by current algorithms. The effect of time-delay on the H2 norm-based performance of a class of epidemic networks with additive noise inputs and multiple delays is studied, and the closed form of their performance measure is derived using the solution of delayed Lyapunov equations. As a case study, the theoretical findings are applied to a network of the United States’ busiest airports.
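As a rough picture of the delay-dependent dynamics described above, the toy simulation below integrates a linearized SIS model in which the inter-population (global) coupling acts with a single delay while recovery acts instantaneously. The model form, graph, and parameters are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def simulate_delayed_sis(A, beta, delta, tau, x0, t_end=50.0, dt=0.01):
    """Forward-Euler integration of a linearized SIS model with one global delay:

        x'(t) = beta * A @ x(t - tau) - delta * x(t)

    Infection spreads over the contact graph A with transmission delay tau,
    while recovery (rate delta) acts instantaneously. Toy illustration only.
    """
    n_hist = int(round(tau / dt))
    x = np.tile(np.asarray(x0, dtype=float), (n_hist + 1, 1))  # constant history
    for _ in range(int(round(t_end / dt))):
        x_now, x_delayed = x[-1], x[0]
        x_next = x_now + dt * (beta * A @ x_delayed - delta * x_now)
        x = np.vstack([x[1:], x_next])
    return x[-1]

# Ring of 4 subpopulations; the largest eigenvalue of A is 2, so beta * 2 < delta
# puts the network below the epidemic threshold and infection levels decay.
A = np.array([[0, 1, 0, 1], [1, 0, 1, 0], [0, 1, 0, 1], [1, 0, 1, 0]], float)
x_final = simulate_delayed_sis(A, beta=0.2, delta=0.5, tau=0.5, x0=[0.1] * 4)
```

Here the decay of the infection-free state depends on the eigenvalues of the underlying graph, consistent with the abstract's point that different eigenvalues must be evaluated to obtain the stability regions.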
  2. Detection of human emotions is an essential part of affect-aware human-computer interaction (HCI). In daily conversations, the preferred way of describing affect is by using categorical emotion labels (e.g., sadness, anger, surprise). In categorical emotion classification, multiple descriptors (with different degrees of relevance) can be assigned to a sample. Perceptual evaluations have relied on primary and secondary emotions to capture the ambiguous nature of spontaneous recordings. The primary emotion is the most relevant category felt by the evaluator. Secondary emotions capture other emotional cues also conveyed in the stimulus. In most cases, the labels collected from the secondary emotions are discarded, since assigning a single class label to a sample is preferred from an application perspective. In this work, we take advantage of both types of annotations to improve the performance of emotion classification. We collect the labels from all the annotations available for a sample and generate primary and secondary emotion labels. A classifier is then trained using multitask learning with both primary and secondary emotions. We experimentally show that considering secondary emotion labels during the learning process leads to relative improvements of 7.9% in F1-score for an 8-class emotion classification task.
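The label-construction step described above can be sketched as follows. The aggregation rule (majority vote for the primary label, union of all reported emotions for the secondary labels) is a plausible reading of the abstract, not necessarily the paper's exact procedure.

```python
from collections import Counter

def aggregate_labels(annotations):
    """Build primary and secondary emotion labels from per-annotator reports.

    Each annotation is (primary, [secondary, ...]). The primary label is the
    majority vote over annotators' primary choices; the secondary label is
    the set of every emotion any annotator reported. This keeps, rather than
    discards, the secondary annotations.
    """
    primary_votes = Counter(p for p, _ in annotations)
    primary = primary_votes.most_common(1)[0][0]
    secondary = set()
    for p, secs in annotations:
        secondary.add(p)
        secondary.update(secs)
    return primary, sorted(secondary)

# Three hypothetical annotators labeling one utterance:
anns = [("sad", ["frustrated"]),
        ("sad", []),
        ("frustrated", ["sad", "angry"])]
primary, secondary = aggregate_labels(anns)
# primary == "sad"; secondary == ["angry", "frustrated", "sad"]
```

A multitask classifier would then combine, for example, a cross-entropy loss on the primary label with a binary cross-entropy loss on a multi-hot encoding of the secondary labels.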
  3. Bizley, Jennifer K. (Ed.)

    Hearing one’s own voice is critical for fluent speech production as it allows for the detection and correction of vocalization errors in real time. This behavior, known as the auditory feedback control of speech, is impaired in various neurological disorders ranging from stuttering to aphasia; however, the underlying neural mechanisms are still poorly understood. Computational models of speech motor control suggest that, during speech production, the brain uses an efference copy of the motor command to generate an internal estimate of the speech output. When actual feedback differs from this internal estimate, an error signal is generated to correct the internal estimate and update the necessary motor commands to produce the intended speech. We were able to localize the auditory error signal using electrocorticographic recordings from neurosurgical participants during a delayed auditory feedback (DAF) paradigm. In this task, participants hear their voice with a time delay as they produce words and sentences (similar to an echo on a conference call), which is well known to disrupt fluency by causing slow and stutter-like speech in humans. We observed a significant response enhancement in auditory cortex that scaled with the duration of feedback delay, indicating an auditory speech error signal. Immediately following auditory cortex, dorsal precentral gyrus (dPreCG), a region that had not previously been implicated in auditory feedback processing, exhibited a markedly similar response enhancement, suggesting a tight coupling between the two regions. Critically, response enhancement in dPreCG occurred only during articulation of long utterances, due to a continuous mismatch between produced speech and reafferent feedback. These results suggest that dPreCG plays an essential role in processing auditory error signals during speech production to maintain fluency.

  4. Delay-Differential Equations (DDEs) are the most common representation for systems with delay. However, the DDE representation has limitations. In network models with delay, the delayed channels are typically low-dimensional, and accounting for this heterogeneity is challenging in the DDE framework. In addition, DDEs cannot be used to model difference equations. In this paper, we examine alternative representations for networked systems with delay and provide formulae for conversion between representations. First, we examine the Differential-Difference (DDF) formulation, which allows us to represent the low-dimensional nature of delayed information. Next, we consider the coupled ODE-PDE framework and extend this to the recently developed Partial-Integral Equation (PIE) representation. The PIE framework has the advantage that it allows the H∞-optimal estimation and control problems to be solved efficiently using the recently developed software package PIETOOLS. In each case, we consider a very general class of networks, specifically accounting for four sources of delay: state delay, input delay, output delay, and process delay. Finally, we use a scalable network model of temperature control to show that the use of the DDF/PIE formulation allows for optimal control of a network with 40 users, 80 states, 40 delays, 40 inputs, and 40 disturbances.
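The DDF formulation's advantage described above, that only the low-dimensional delayed channel needs a history rather than the full state, can be illustrated with a toy two-state system in which a single scalar channel is delayed. The matrices and parameters below are hypothetical, and this Euler simulation is only a sketch of the modeling idea, not of PIETOOLS itself.

```python
import numpy as np
from collections import deque

def simulate_ddf(t_end=20.0, dt=0.001, tau=1.0):
    """Differential-Difference (DDF) style simulation of a 2-state system
    in which only a scalar channel r = C x is delayed:

        x'(t) = A x(t) + B r(t - tau),    r(t) = C x(t)

    Only the 1-D channel r needs a history buffer, not the full 2-D state,
    which is the heterogeneity the DDF formulation captures. The matrices
    below are hypothetical toy values.
    """
    A = np.array([[-1.0, 1.0], [0.0, -2.0]])
    B = np.array([0.0, 0.5])
    C = np.array([1.0, 0.0])
    r_hist = deque([0.0] * int(round(tau / dt)))  # scalar history only
    x = np.array([1.0, 1.0])
    for _ in range(int(round(t_end / dt))):
        r_delayed = r_hist.popleft()
        x = x + dt * (A @ x + B * r_delayed)
        r_hist.append(float(C @ x))
    return x

x_final = simulate_ddf()
# A is Hurwitz and the delayed feedback gain is small, so x decays toward 0.
```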
  5. Suicide is a serious public health concern in the U.S., taking the lives of over 47,000 people in 2017. Early detection of suicidal ideation is key to prevention. One promising approach to symptom monitoring is suicidal speech prediction, as speech can be passively collected and may indicate changes in risk. However, directly identifying suicidal speech is difficult, as the characteristics of speech can change much more rapidly than suicidal thoughts. Suicidal ideation is also associated with emotion dysregulation. Therefore, in this work, we focus on the detection of emotion from speech and its relation to suicide. We introduce the Ecological Measurement of Affect, Speech, and Suicide (EMASS) dataset, which contains phone call recordings of individuals recently discharged from the hospital following admission for suicidal ideation or behavior, along with controls. Participants self-report their emotion periodically throughout the study. However, the dataset is relatively small and has uncertain labels. Because of this, we find that most features traditionally used for emotion classification fail. We demonstrate how outside emotion datasets can be used to generate more relevant features, making this analysis possible. Finally, we use emotion predictions to differentiate healthy controls from those with suicidal ideation, providing evidence for suicidal speech detection using emotion.