Speech emotion recognition (SER) is a challenging task due to the limited availability of real-world labeled datasets. Since it is easier to find unlabeled data, the use of self-supervised learning (SSL) has become an attractive alternative. This study proposes new pre-text tasks for SSL to improve SER. While our target application is SER, the proposed pre-text tasks include audio-visual formulations, leveraging the relationship between acoustic and facial features. Our proposed approach introduces three new unimodal and multimodal pre-text tasks that are carefully designed to learn better representations for predicting emotional cues from speech. Task 1 predicts energy variations (high or low) from a speech sequence. Task 2 uses speech features to predict facial activation (high or low) based on facial landmark movements. Task 3 performs a multi-class emotion recognition task on emotional labels obtained from combinations of action units (AUs) detected across a video sequence. We pre-train a network with 60.92 hours of unlabeled data, fine-tuning the model for the downstream SER task. The results on the CREMA-D dataset show that the model pre-trained on the proposed domain-specific pre-text tasks significantly improves the precision (up to 5.1%), recall (up to 4.5%), and F1-scores (up to 4.9%) of our SER system.
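To make Task 1 concrete, below is a minimal sketch of how binary energy-variation targets could be derived from unlabeled speech. The frame length, hop size, and median split are illustrative assumptions, not the paper's exact recipe.

```python
import numpy as np

def energy_variation_labels(waveform, frame_len=400, hop=160):
    """Label each frame's short-time energy as high (1) or low (0)."""
    n_frames = 1 + (len(waveform) - frame_len) // hop
    energies = np.array([
        np.sum(waveform[i * hop:i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    # Median split: frames above the utterance's median energy count as "high".
    return (energies > np.median(energies)).astype(np.int64)
```

Because the targets come from the signal itself, no human annotation is required, which is what makes this a self-supervised pre-text task.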
Style Extractor For Facial Expression Recognition in the Presence of Speech
The performance of facial expression recognition (FER) systems has improved with recent advances in machine learning. While studies have reported impressive accuracies in detecting emotion from posed expressions in static images, there are still important challenges in developing FER systems for videos, especially in the presence of speech. Speech articulation modulates the orofacial area, changing the facial appearance. These facial movements induced by speech introduce noise, reducing the performance of an FER system. Solving this problem is important if we aim to study more naturalistic environments or applications in the wild. We propose a novel approach to compensate for lexical information that does not require phonetic information during inference. The approach relies on a style extractor model, which creates emotional-to-neutral transformations. The transformed facial representations are spatially contrasted with the original faces, highlighting the emotional information conveyed in the video. The results demonstrate that adding the proposed style extractor model to a dynamic FER system improves the performance by 7% (absolute) compared to a similar model with no style extractor. This novel feature representation also improves the generalization of the model.
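A minimal sketch of the spatial-contrast idea, assuming the style extractor has already produced a neutralized feature map; the absolute-difference-plus-gating formulation and the tensor shapes are assumptions for illustration, not the paper's exact operator:

```python
import torch

def emotional_contrast(original, neutralized):
    """original, neutralized: facial feature maps of shape (B, C, H, W)."""
    diff = torch.abs(original - neutralized)   # where the face differs from neutral
    weights = torch.sigmoid(diff)              # soft spatial attention map
    return original * weights                  # emphasize emotion-bearing regions
```

Regions that the emotional-to-neutral transformation alters most carry the emotional information, while content shared by both representations (the lexical articulation) largely cancels out in the difference.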
- Award ID(s):
- 1718944
- PAR ID:
- 10287568
- Date Published:
- Journal Name:
- IEEE International Conference on Image Processing (ICIP 2020)
- Page Range / eLocation ID:
- 1806 to 1810
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The face conveys a blend of verbal and nonverbal information, playing an important role in daily interaction. While speech articulation mostly affects the orofacial areas, emotional behaviors are externalized across the entire face. Considering the relation between verbal and nonverbal behaviors is important to create naturalistic facial movements for conversational agents (CAs). Furthermore, facial muscles connect areas across the face, creating principled relationships and dependencies between movements that have to be taken into account. These relationships are ignored when facial movements across the face are generated separately. This paper proposes speech-driven models that jointly capture the relationship not only between speech and facial movements, but also across facial movements. The input to the models consists of features extracted from speech that convey the verbal and emotional states of the speakers. We build our models with bidirectional long short-term memory (BLSTM) units, which have proven very successful in modeling dependencies for sequential data. The objective and subjective evaluations of the results demonstrate the benefits of jointly modeling facial regions with this framework.
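A minimal sketch of such a joint speech-to-face model, assuming 40-dimensional acoustic features and 68 facial landmarks; the layer sizes and the single shared readout are illustrative choices:

```python
import torch
import torch.nn as nn

class SpeechToFace(nn.Module):
    def __init__(self, n_acoustic=40, n_landmarks=68, hidden=256):
        super().__init__()
        self.blstm = nn.LSTM(n_acoustic, hidden, num_layers=2,
                             batch_first=True, bidirectional=True)
        # A single head over all landmarks keeps cross-region dependencies
        # in one joint output space, rather than predicting each facial
        # region with a separate model.
        self.head = nn.Linear(2 * hidden, n_landmarks * 2)  # (x, y) per landmark

    def forward(self, speech_feats):        # (B, T, n_acoustic)
        h, _ = self.blstm(speech_feats)     # (B, T, 2 * hidden)
        return self.head(h)                 # (B, T, n_landmarks * 2)
```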
-
The ability to identify speech with similar emotional content is valuable to many applications, including speech retrieval, surveillance, and emotional speech synthesis. While current formulations in speech emotion recognition based on classification or regression are not appropriate for this task, solutions based on preference learning offer appealing alternatives. This paper aims to find speech samples that are emotionally similar to an anchor speech sample provided as a query. This novel formulation opens interesting research questions. How well can a machine complete this task? How does the accuracy of automatic algorithms compare to the performance of a human performing the same task? This study addresses these questions by training a deep learning model with a triplet loss function, mapping acoustic features into an embedding that is discriminative for this task. The network receives an anchor speech sample and two competing speech samples, and the task is to determine which of the candidate speech samples conveys emotional content closest to that of the anchor. By comparing the results from our model with human perceptual evaluations, this study demonstrates that the proposed approach achieves performance very close to human performance in retrieving samples with similar emotional content.
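A minimal sketch of the triplet training step, assuming 128-dimensional acoustic features and a 64-dimensional embedding; PyTorch's built-in triplet margin loss stands in for whatever loss variant the paper uses:

```python
import torch
import torch.nn as nn

embed = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 64))
triplet = nn.TripletMarginLoss(margin=1.0)

anchor = torch.randn(8, 128)    # acoustic features of the query utterances
positive = torch.randn(8, 128)  # candidates judged emotionally closer
negative = torch.randn(8, 128)  # the competing candidates

loss = triplet(embed(anchor), embed(positive), embed(negative))
loss.backward()  # pulls the closer candidate toward the anchor in embedding space
```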
-
Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, computed over time from low-level descriptors (LLDs), creating a fixed-dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated networks or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTMs, but is also computationally efficient.
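A minimal sketch of the chunking scheme, assuming every sentence is cut into the same number of fixed-length chunks and the overlap absorbs duration differences; the chunk count and length are illustrative values:

```python
import numpy as np

def fixed_chunks(features, n_chunks=11, chunk_len=100):
    """features: (T, D) frame-level features -> (n_chunks, chunk_len, D)."""
    T = features.shape[0]
    assert T >= chunk_len, "sketch assumes sentences at least one chunk long"
    # Evenly spaced start frames: shorter sentences yield larger overlaps.
    starts = np.linspace(0, T - chunk_len, n_chunks).astype(int)
    return np.stack([features[s:s + chunk_len] for s in starts])
```

Every sentence, regardless of duration, yields a tensor of the same shape, so downstream LSTM, gating, or attention layers always see a fixed-length sequence of chunks.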
-
The increasing ubiquity of small smart devices, such as cellphones, tablets, smart watches, and laptops, has led to unique user data that can be processed locally. The sensors (e.g., microphones and webcams) and improved hardware of these new devices allow running deep learning models that 20 years ago would have been exclusive to high-end, expensive machines. In spite of this progress, state-of-the-art algorithms for facial expression recognition (FER) rely on architectures that cannot be implemented on these devices due to computational and memory constraints. Alternatives involving cloud-based solutions impose privacy barriers that prevent their adoption or user acceptance in a wide range of applications. This paper proposes a lightweight model that can run in real time for image facial expression recognition (IFER) and video facial expression recognition (VFER). The approach relies on a personalization mechanism implemented locally for each subject by fine-tuning a central VFER model with unlabeled videos from a target subject. We train the IFER model to generate pseudo-labels, and we select the videos with the most confident predictions for adaptation. The adaptation follows a federated learning strategy in which the weights of the local models are averaged and used by the central VFER model. We demonstrate that this approach improves not only the performance on the edge device, providing personalized models to the users, but also the central VFER model. Within-corpus and cross-corpus evaluations on two emotional databases demonstrate that edge models adapted with our personalization strategy achieve gains of up to 13.1% in F1-score. Furthermore, the federated learning implementation improves the mean micro F1-score of the central VFER model by up to 3.4%. The proposed lightweight solution is ideal for interactive user interfaces that preserve the data of the users.
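A minimal sketch of the averaging step, assuming the standard federated-averaging recipe of taking the element-wise mean of the edge models' weights; per-client weighting (e.g., by local data size) is omitted for brevity:

```python
import torch

def federated_average(local_state_dicts):
    """Average a list of edge-model state_dicts into one central state_dict."""
    avg = {}
    for key in local_state_dicts[0]:
        avg[key] = torch.stack(
            [sd[key].float() for sd in local_state_dicts]
        ).mean(dim=0)
    return avg

# Hypothetical usage: the central model absorbs the averaged edge weights.
# central_model.load_state_dict(federated_average([m.state_dict() for m in edge_models]))
```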