NSF PAR Search | NSF Public Access Repository

Episodic Memory For Domain-Adaptable, Robust Speech Emotion Recognition

https://doi.org/10.21437/Interspeech.2023-2111

Tavernor, James; Perez, Matthew; Mower Provost, Emily (August 2023, Interspeech)

Full Text Available
You're Not You When You're Angry: Robust Emotion Features Emerge by Recognizing Speakers

https://doi.org/10.1109/TAFFC.2021.3086050

Aldeneh, Zakaria; Mower Provost, Emily (June 2021, IEEE Transactions on Affective Computing)

Full Text Available
Learning Paralinguistic Features from Audiobooks through Style Voice Conversion

https://doi.org/10.18653/v1/2021.naacl-main.377

Aldeneh, Zakaria; Perez, Matthew; Mower Provost, Emily (June 2021, Annual Conference of the North American Chapter of the Association for Computational Linguistics (NAACL))

Full Text Available
Jointly Aligning and Predicting Continuous Emotion Annotations

https://doi.org/10.1109/taffc.2019.2917047

Khorram, Soheil; McInnis, Melvin; Mower Provost, Emily (May 2019, IEEE Transactions on Affective Computing)

Time-continuous dimensional descriptions of emotions (e.g., arousal, valence) allow researchers to characterize short-time changes and to capture long-term trends in emotion expression. However, continuous emotion labels are generally not synchronized with the input speech signal due to delays caused by reaction-time, which is inherent in human evaluations. To deal with this challenge, we introduce a new convolutional neural network (multi-delay sinc network) that is able to simultaneously align and predict labels in an end-to-end manner. The proposed network is a stack of convolutional layers followed by an aligner network that aligns the speech signal and emotion labels. This network is implemented using a new convolutional layer that we introduce, the delayed sinc layer. It is a time-shifted low-pass (sinc) filter that uses a gradient-based algorithm to learn a single delay. Multiple delayed sinc layers can be used to compensate for a non-stationary delay that is a function of the acoustic space. We test the efficacy of this system on two common emotion datasets, RECOLA and SEWA, and show that this approach obtains state-of-the-art speech-only results by learning time-varying delays while predicting dimensional descriptors of emotions.
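The time-shifted low-pass (sinc) filter described above can be illustrated with a minimal NumPy sketch. This is not the authors' delayed sinc layer: the delay is fixed rather than learned, the kernel is simply truncated with no window, and the names `delayed_sinc_kernel`, `apply_delay`, `cutoff`, and `half_width` are assumptions made for illustration.

```python
import numpy as np


def delayed_sinc_kernel(delay, cutoff=0.5, half_width=8):
    """Return taps of a low-pass sinc filter shifted by `delay` samples.

    `cutoff` is in cycles per sample (0.5 is the Nyquist frequency).
    All parameter names are illustrative, not taken from the paper.
    """
    n = np.arange(-half_width, half_width + 1)
    # np.sinc(x) computes sin(pi * x) / (pi * x).
    return 2.0 * cutoff * np.sinc(2.0 * cutoff * (n - delay))


def apply_delay(signal, delay, cutoff=0.5, half_width=8):
    """Filter `signal` with the shifted kernel, keeping the input length."""
    kernel = delayed_sinc_kernel(delay, cutoff, half_width)
    return np.convolve(signal, kernel, mode="same")


# Delay a unit impulse by three samples and report where the peak lands.
impulse = np.zeros(32)
impulse[10] = 1.0
shifted = apply_delay(impulse, delay=3.0)
print(int(np.argmax(shifted)))
```

Because `delay` enters the kernel as a real-valued argument, fractional delays are also possible, which is what makes a gradient-based search over the delay plausible in an automatic-differentiation framework.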
Full Text Available
Improving Cross-Corpus Speech Emotion Recognition with Adversarial Discriminative Domain Generalization (ADDoG)

https://doi.org/10.1109/taffc.2019.2916092

Gideon, John; McInnis, Melvin; Mower Provost, Emily (May 2019, IEEE Transactions on Affective Computing)

Automatic speech emotion recognition provides computers with critical context to enable user understanding. While methods trained and tested within the same dataset have been shown to be successful, they often fail when applied to unseen datasets. To address this, recent work has focused on adversarial methods to find more generalized representations of emotional speech. However, many of these methods have issues converging, and only involve datasets collected in laboratory conditions. In this paper, we introduce Adversarial Discriminative Domain Generalization (ADDoG), which follows an easier-to-train “meet in the middle” approach. The model iteratively moves representations learned for each dataset closer to one another, improving cross-dataset generalization. We also introduce Multiclass ADDoG, or MADDoG, which is able to extend the proposed method to more than two datasets simultaneously. Our results show consistent convergence for the introduced methods, with significantly improved results when not using labels from the target dataset. We also show how, in most cases, ADDoG and MADDoG can be used to improve upon baseline state-of-the-art methods when target dataset labels are added and in-the-wild data are considered. Even though our experiments focus on cross-corpus speech emotion, these methods could be used to remove unwanted factors of variation in other settings.
Full Text Available
Trainable Time Warping: Aligning Time-series in the Continuous-time Domain

https://doi.org/10.1109/icassp.2019.8682322

Khorram, Soheil; McInnis, Melvin G; Mower Provost, Emily (May 2019, International Conference on Acoustics, Speech, and Signal Processing (ICASSP))

Dynamic time warping (DTW) calculates the similarity or alignment between two signals, subject to temporal warping. However, its computational complexity grows exponentially with the number of time-series. Although there have been algorithms developed that are linear in the number of time-series, they are generally quadratic in time-series length. The exception is generalized time warping (GTW), which has linear computational cost. Yet, it can only identify simple time warping functions. There is a need for a new fast, high-quality multisequence alignment algorithm. We introduce trainable time warping (TTW), whose complexity is linear in both the number and the length of time-series. TTW performs alignment in the continuous-time domain using a sinc convolutional kernel and a gradient-based optimization technique. We compare TTW and GTW on 85 UCR datasets in time-series averaging and classification. TTW outperforms GTW on 67.1% of the datasets for the averaging tasks, and 61.2% of the datasets for the classification tasks.
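For context on the quadratic cost in time-series length mentioned above, here is a minimal sketch of textbook pairwise DTW using dynamic programming. It is neither TTW nor GTW; the function name `dtw_distance` and the absolute-difference local cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b):
    """Return the classic DTW alignment cost between two 1-D sequences.

    Fills an (len(a) + 1) x (len(b) + 1) table, so time and memory are
    both proportional to len(a) * len(b).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = local + min(
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
                cost[i - 1][j - 1],  # advance in both
            )
    return cost[n][m]


# The second sequence repeats one sample; warping absorbs the repetition.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

The nested loops over both sequences are the source of the quadratic cost for a single pair of series.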
Full Text Available
The PRIORI Emotion Dataset: Linking Mood to Emotion Detected In-the-Wild

Khorram, Soheil; Jaiswal, Mimansa; Gideon, John; McInnis, Melvin; Mower Provost, Emily (October 2018, Interspeech)

Bipolar disorder is a chronic psychiatric illness characterized by pathological mood swings associated with severe disruptions in emotion regulation. Clinical monitoring of mood is key to the care of these dynamic and incapacitating mood states. Frequent and detailed monitoring improves clinical sensitivity to detect mood state changes, but typically requires costly and limited resources. Speech characteristics change during both depressed and manic states, suggesting automatic methods applied to the speech signal can be effectively used to monitor mood state changes. However, speech is modulated by many factors, which renders mood state prediction challenging. We hypothesize that emotion can be used as an intermediary step to improve mood state prediction. This paper presents critical steps in developing this pipeline, including (1) a new in-the-wild emotion dataset, the PRIORI Emotion Dataset, collected from everyday smartphone conversational speech recordings, (2) activation/valence emotion recognition baselines on this dataset (PCC of 0.71 and 0.41, respectively), and (3) significant correlation between predicted emotion and mood state for individuals with bipolar disorder. This provides evidence and a working baseline for the use of emotion as a meta-feature for mood state monitoring.
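The baselines above are reported as PCC, which we take to mean the Pearson correlation coefficient between predicted and annotated values. A minimal sketch of that metric follows; the function name `pearson_correlation` and its argument names are hypothetical, and the toy inputs are not data from the paper.

```python
import numpy as np


def pearson_correlation(predicted, reference):
    """Return the Pearson correlation between two equal-length sequences."""
    p = np.asarray(predicted, dtype=float)
    r = np.asarray(reference, dtype=float)
    if p.shape != r.shape:
        raise ValueError("inputs must have the same shape")
    # Center both sequences so the coefficient ignores constant offsets.
    p_centered = p - p.mean()
    r_centered = r - r.mean()
    denominator = np.sqrt(np.sum(p_centered ** 2) * np.sum(r_centered ** 2))
    if denominator == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return float(np.sum(p_centered * r_centered) / denominator)


# Toy values only: a perfectly linear relationship.
print(pearson_correlation([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

The coefficient ranges from -1 to 1 and is insensitive to the scale and offset of the predictions, so it measures how well predictions track the annotations rather than their absolute error.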
Full Text Available
Predicting the distribution of emotion perception: capturing inter-rater variability

https://doi.org/10.1145/3136755.3136792

Zhang, Biqiao; Essl, Georg; Mower Provost, Emily (November 2017, ACM International Conference on Multimodal Interaction)

Full Text Available
Automatic recognition of self-reported and perceived emotion: does joint modeling help?

https://doi.org/10.1145/2993148.2993173

Zhang, Biqiao; Essl, Georg; Mower Provost, Emily (October 2016, ACM International Conference on Multimodal Interaction)

Full Text Available
