skip to main content


Title: Role of Regularization in the Prediction of Valence from Speech
Regularization plays a key role in improving the prediction of emotions using attributes such as arousal, valence and dominance. Regularization is particularly important with deep neural networks (DNNs), which have millions of parameters. While previous studies have reported competitive performance for arousal and dominance, the prediction results for valence using acoustic features are significantly lower. We hypothesize that higher regularization can lead to better results for valence. This study focuses on exploring the role of dropout as a form of regularization for valence, suggesting the need for higher regularization. We analyze the performance of regression models for valence, arousal and dominance as a function of the dropout probability. We observe that the optimum dropout rates are consistent for arousal and dominance. However, the optimum dropout rate for valence is higher. To understand the need for higher regularization for valence, we perform an empirical analysis to explore the nature of emotional cues conveyed in speech. We compare regression models with speakerdependent and speaker-independent partitions for training and testing. The experimental evaluation suggests stronger speaker dependent traits for valence. We conclude that higher regularization is needed for valence to force the network to learn global patterns that generalize across speakers.  more » « less
Award ID(s):
1453781
NSF-PAR ID:
10099019
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Interspeech 2018
Page Range / eLocation ID:
941 to 945
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A challenging task in affective computing is to build reliable speech emotion recognition (SER) systems that can accurately predict emotional attributes from spontaneous speech. To increase the trust in these SER systems, it is important to predict not only their accuracy, but also their confidence. An intriguing approach to predict uncertainty is Monte Carlo (MC) dropout, which obtains pre- dictions from multiple feed-forward passes through a deep neural network (DNN) by using dropout regularization in both training and inference. This study evaluates this approach with regression models to predict emotional attribute scores for valence, arousal and dom- inance. The analysis illustrates that predicting uncertainty in this problem is possible, where the performance is higher for samples in the test set with lower uncertainty. The study evaluates uncertainty estimation as a function of the emotional attributes, showing that samples with extreme values have lower uncertainty. Finally, we demonstrate the benefits of uncertainty estimation with reject option, where a classifier can decline to give a prediction when its confi- dence is low. By rejecting only 25% of the test set with the highest uncertainty, we achieve relative performance gains of 7.34% for arousal, 13.73% for valence and 8.79% for dominance. 
    more » « less
  2. Recognizing emotions using few attribute dimensions such as arousal, valence and dominance provides the flexibility to effectively represent complex range of emotional behaviors. Conventional methods to learn these emotional descriptors primarily focus on separate models to recognize each of these attributes. Recent work has shown that learning these attributes together regularizes the models, leading to better feature representations. This study explores new forms of regularization by adding unsupervised auxiliary tasks to reconstruct hidden layer representations. This auxiliary task requires the denoising of hidden representations at every layer of an auto-encoder. The framework relies on ladder networks that utilize skip connections between encoder and decoder layers to learn powerful representations of emotional dimensions. The results show that ladder networks improve the performance of the system compared to baselines that individually learn each attribute, and conventional denoising autoencoders. Furthermore, the unsupervised auxiliary tasks have promising potential to be used in a semi-supervised setting, where few labeled sentences are available. 
    more » « less
  3. This paper presents a method for extracting novel spectral features based on a sinusoidal model. The method is focused on characterizing the spectral shapes of audio signals using spectra peaks in frequency sub-bands. The extracted features are evaluated for predicting the levels of emotional dimensions, namely arousal and valence. Principal component regression, partial least squares regression, and deep convolutional neural network (CNN) models are used as prediction models for the levels of the emotional dimensions. The experimental results indicate that the proposed features include additional spectral information that common baseline features may not include. Since the quality of audio signals, especially timbre, plays a major role in affecting the perception of emotional valence in music, the inclusion of the presented features will contribute to decreasing the prediction error rate. 
    more » « less
  4. Multivariate pattern analysis (MVPA) of functional magnetic resonance imaging (fMRI) data has critically advanced the neuroanatomical understanding of affect processing in the human brain. Central to these advancements is the brain state, a temporally-succinct fMRI-derived pattern of neural activation, which serves as a processing unit. Establishing the brain state’s central role in affect processing, however, requires that it predicts multiple independent measures of affect. We employed MVPA-based regression to predict the valence and arousal properties of visual stimuli sampled from the International Affective Picture System (IAPS) along with the corollary skin conductance response (SCR) for demographically diverse healthy human participants (n = 19). We found that brain states significantly predicted the normative valence and arousal scores of the stimuli as well as the attendant individual SCRs. In contrast, SCRs significantly predicted arousal only. The prediction effect size of the brain state was more than three times greater than that of SCR. Moreover, neuroanatomical analysis of the regression parameters found remarkable agreement with regions long-established by fMRI univariate analyses in the emotion processing literature. Finally, geometric analysis of these parameters also found that the neuroanatomical encodings of valence and arousal are orthogonal as originally posited by the circumplex model of dimensional emotion. 
    more » « less
  5. Abstract

    Multivariate pattern analysis (MVPA) of functional magnetic resonance imaging (fMRI) data has critically advanced the neuroanatomical understanding of affect processing in the human brain. Central to these advancements is the brain state, a temporally-succinct fMRI-derived pattern of neural activation, which serves as a processing unit. Establishing the brain state’s central role in affect processing, however, requires that it predicts multiple independent measures of affect. We employed MVPA-based regression to predict the valence and arousal properties of visual stimuli sampled from the International Affective Picture System (IAPS) along with the corollary skin conductance response (SCR) for demographically diverse healthy human participants (n = 19). We found that brain states significantly predicted the normative valence and arousal scores of the stimuli as well as the attendant individual SCRs. In contrast, SCRs significantly predicted arousal only. The prediction effect size of the brain state was more than three times greater than that of SCR. Moreover, neuroanatomical analysis of the regression parameters found remarkable agreement with regions long-established by fMRI univariate analyses in the emotion processing literature. Finally, geometric analysis of these parameters also found that the neuroanatomical encodings of valence and arousal are orthogonal as originally posited by the circumplex model of dimensional emotion.

     
    more » « less