Individual variability in expressive behaviors is a major challenge for emotion recognition systems. Personalized emotion recognition strives to adapt machine learning models to individual behaviors, thereby enhancing emotion recognition performance and overcoming the limitations of generalized systems. However, existing datasets for audiovisual emotion recognition either contain very few data points per speaker or include only a limited number of speakers. This scarcity of data significantly limits the development and assessment of personalized models, hindering their ability to learn and adapt to individual expressive styles. This paper introduces EmoCeleb: a large-scale, weakly labeled emotion dataset generated via cross-modal labeling. EmoCeleb comprises over 150 hours of audiovisual content from approximately 1,500 speakers, with a median of 50 utterances per speaker, making it a rich resource for developing and benchmarking personalized emotion recognition methods, including those that require substantial data per individual, such as set learning approaches. We also propose SetPeER: a novel personalized emotion recognition architecture built on set learning. SetPeER captures individual expressive styles by learning representative speaker features from limited data, achieving strong performance with as few as eight utterances per speaker, and thereby overcomes the limitations of previous approaches that struggle to learn effectively from so little data per individual. Through extensive experiments on EmoCeleb and on established benchmarks, i.e., MSP-Podcast and MSP-Improv, we demonstrate the effectiveness of our dataset and the superior performance of SetPeER compared with existing emotion recognition methods. Our work paves the way for more robust and accurate personalized emotion recognition systems.
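As a rough illustration of the set-learning idea described in this abstract (the paper's exact SetPeER architecture is not reproduced here), the sketch below pools a small set of enrollment utterances from one speaker into a style vector that conditions an emotion classifier. All module names and dimensions (SpeakerSetEncoder, utt_dim, style_dim, the number of classes) are illustrative assumptions.

```python
# Minimal sketch of set-based personalization, assuming precomputed
# utterance embeddings; not the paper's implementation.
import torch
import torch.nn as nn

class SpeakerSetEncoder(nn.Module):
    """Pools a set of K utterance embeddings from one speaker into a
    single speaker-style vector, independent of the set's order."""
    def __init__(self, utt_dim=256, style_dim=128):
        super().__init__()
        self.proj = nn.Linear(utt_dim, style_dim)
        self.attn = nn.Linear(style_dim, 1)   # learned attention pooling

    def forward(self, utt_set):               # utt_set: (K, utt_dim)
        h = torch.tanh(self.proj(utt_set))    # (K, style_dim)
        w = torch.softmax(self.attn(h), dim=0)
        return (w * h).sum(dim=0)             # (style_dim,)

class PersonalizedEmotionClassifier(nn.Module):
    """Classifies a target utterance conditioned on a speaker-style vector
    pooled from a handful (e.g., 8) of enrollment utterances."""
    def __init__(self, utt_dim=256, style_dim=128, n_classes=4):
        super().__init__()
        self.set_encoder = SpeakerSetEncoder(utt_dim, style_dim)
        self.head = nn.Sequential(
            nn.Linear(utt_dim + style_dim, 256), nn.ReLU(),
            nn.Linear(256, n_classes))

    def forward(self, target_utt, enrollment_set):
        style = self.set_encoder(enrollment_set)
        return self.head(torch.cat([target_utt, style], dim=-1))

# Toy usage: 8 enrollment utterances and one target utterance per speaker.
model = PersonalizedEmotionClassifier()
logits = model(torch.randn(256), torch.randn(8, 256))
```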
Robust Multimodal Emotion Recognition from Conversation with Transformer-Based Crossmodality Fusion
Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With rapidly advancing technology, a wide range of emerging applications require recognizing the user's emotional state. This paper investigates a robust approach to multimodal emotion recognition during a conversation. Three separate models for the audio, video, and text modalities are built and fine-tuned on the MELD dataset, and a transformer-based crossmodality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network achieves up to 65% accuracy, significantly surpassing each of the unimodal models. We apply multiple evaluation techniques to show that our model is robust and can even outperform state-of-the-art models on MELD.
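A minimal sketch of transformer-style crossmodality fusion over per-modality embeddings follows; the paper's actual system combines such a fusion with the EmbraceNet architecture, and the feature dimensions, layer counts, and projection layers below are assumptions for illustration (MELD has seven emotion classes).

```python
# Simplified sketch: fuse audio, video, and text embeddings by treating
# them as a 3-token sequence for a transformer encoder. Not the paper's code.
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_layers=2, n_emotions=7):
        super().__init__()
        # Project each modality's utterance-level feature to a shared space.
        self.audio_proj = nn.Linear(128, d_model)   # assumed feature sizes
        self.video_proj = nn.Linear(512, d_model)
        self.text_proj = nn.Linear(768, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.classifier = nn.Linear(d_model, n_emotions)

    def forward(self, audio, video, text):          # each: (batch, feat_dim)
        # Self-attention over the three modality tokens models
        # crossmodal interactions before classification.
        tokens = torch.stack([self.audio_proj(audio),
                              self.video_proj(video),
                              self.text_proj(text)], dim=1)  # (batch, 3, d)
        fused = self.encoder(tokens).mean(dim=1)             # pool the tokens
        return self.classifier(fused)

model = CrossModalFusion()
logits = model(torch.randn(2, 128), torch.randn(2, 512), torch.randn(2, 768))
```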
- Award ID(s): 1846658
- PAR ID: 10316813
- Date Published:
- Journal Name: Sensors
- Volume: 21
- Issue: 14
- ISSN: 1424-8220
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
The development of transformer-based models has resulted in significant advances in addressing various vision and NLP-based research challenges. However, the progress made in transformer-based methods has not been effectively applied to biosensor/physiological-signal-based emotion recognition research. The reason is that transformers require large amounts of training data, and most biosensor datasets are not large enough to train these models. To address this issue, we propose a novel Unified Biosensor–Vision Multimodal Transformer (UBVMT) architecture, which enables self-supervised pretraining by extracting Remote Photoplethysmography (rPPG) signals from videos in the large CMU-MOSEI dataset. UBVMT classifies emotions in the arousal-valence space by combining a 2D representation of ECG/PPG signals with facial information. As opposed to modality-specific architectures, the novel unified architecture of UBVMT consists of homogeneous transformer blocks that take as input the image-based representation of the biosensor signals and the corresponding face information for emotion representation. This minimal modality-specific design reduces the number of parameters in UBVMT by half compared to conventional multimodal transformer networks, enabling its use in our web-based system, where loading large models poses significant memory challenges. UBVMT is pretrained in a self-supervised manner by employing masked autoencoding to reconstruct masked patches of video frames and 2D scalogram images of ECG/PPG signals, and contrastive modeling to align face and ECG/PPG data. Extensive experiments on publicly available datasets show that our UBVMT-based model produces results comparable to state-of-the-art techniques.
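A minimal sketch of the contrastive face-to-biosignal alignment objective mentioned in this abstract, written as a symmetric InfoNCE/CLIP-style loss; it is not the UBVMT implementation, and the encoders, batch size, embedding dimension, and temperature are illustrative assumptions (the full method also uses masked autoencoding).

```python
# Contrastive alignment of paired face and ECG/PPG scalogram embeddings.
# Hedged sketch only; real UBVMT encoders and loss details may differ.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(face_emb, bio_emb, temperature=0.07):
    """face_emb, bio_emb: (batch, dim) embeddings of paired face frames and
    2D scalogram images of ECG/PPG from the same video clips."""
    face = F.normalize(face_emb, dim=-1)
    bio = F.normalize(bio_emb, dim=-1)
    logits = face @ bio.t() / temperature        # (batch, batch) similarities
    targets = torch.arange(face.size(0))         # positives on the diagonal
    # Symmetric cross-entropy: match each face to its own biosignal and back.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = contrastive_alignment_loss(torch.randn(16, 256), torch.randn(16, 256))
```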
There are individual differences in expressive behaviors driven by cultural norms and personality. This between-person variation can result in reduced emotion recognition performance. Therefore, personalization is an important step in improving the generalization and robustness of speech emotion recognition. In this paper, to achieve unsupervised personalized emotion recognition, we first pre-train an encoder with learnable speaker embeddings in a self-supervised manner to learn robust speech representations conditioned on speakers. Second, we propose an unsupervised method to compensate for the label distribution shifts by finding similar speakers and leveraging their label distributions from the training set. Extensive experimental results on the MSP-Podcast corpus indicate that our method consistently outperforms strong personalization baselines and achieves state-of-the-art performance for valence estimation.
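A minimal sketch of the similar-speaker compensation idea, assuming speaker embeddings and per-speaker label distributions are available from the training set; the blending rule, hyperparameters, and all variable names are illustrative assumptions, not the paper's implementation.

```python
# Estimate a test speaker's label prior from the most similar training
# speakers and use it to adjust raw model scores. Hedged sketch only.
import numpy as np

def compensated_scores(test_spk_emb, train_spk_embs, train_label_dists,
                       raw_scores, k=5, alpha=0.5):
    """train_spk_embs: (S, d) one embedding per training speaker.
    train_label_dists: (S, C) each training speaker's label distribution.
    raw_scores: (C,) unpersonalized model scores for one utterance."""
    # Cosine similarity between the test speaker and each training speaker.
    a = test_spk_emb / np.linalg.norm(test_spk_emb)
    b = train_spk_embs / np.linalg.norm(train_spk_embs, axis=1, keepdims=True)
    sims = b @ a
    nearest = np.argsort(-sims)[:k]
    # Estimated label prior for the test speaker from its k nearest speakers.
    prior = train_label_dists[nearest].mean(axis=0)
    # Blend the raw scores toward the estimated speaker-specific prior.
    return (1 - alpha) * raw_scores + alpha * prior

scores = compensated_scores(np.random.randn(64), np.random.randn(100, 64),
                            np.random.dirichlet(np.ones(4), size=100),
                            np.array([0.4, 0.3, 0.2, 0.1]))
```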
Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called Husformer. Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that in the recognition of the human state, our Husformer outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer.
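A minimal sketch of the cross-modal attention pattern described in this abstract, where one modality's sequence reinforces itself by attending to another modality's sequence; it is a simplified stand-in rather than the released Husformer code (see the linked repository), and the dimensions and modality names are assumptions.

```python
# One modality's sequence attends to another modality's sequence and adds
# the result back as a residual. Hedged sketch, not the Husformer release.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, target, source):
        # The target modality queries the source modality's keys/values.
        attended, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + attended)   # residual reinforcement

# Toy usage: an EEG-like sequence attends to a GSR-like sequence.
block = CrossModalBlock()
eeg = torch.randn(2, 50, 64)    # (batch, time, d_model)
gsr = torch.randn(2, 30, 64)
fused_eeg = block(eeg, gsr)     # same shape as `eeg`
```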
The Interspeech 2025 speech emotion recognition in naturalistic conditions challenge builds on previous efforts to advance speech emotion recognition (SER) in real-world scenarios. The focus is on recognizing emotions from spontaneous speech, moving beyond controlled datasets. It provides a framework for speaker-independent training, development, and evaluation, with annotations for both categorical and dimensional tasks. The challenge attracted 93 research teams, whose models significantly improved state-of-the-art results over competitive baselines. This paper summarizes the challenge, focusing on the key outcomes. We analyze top-performing methods, emerging trends, and innovative directions. We highlight the effectiveness of combining foundation models based on audio and text to achieve robust SER systems. The competition website, with leaderboards, baseline code, and instructions, is available at: https://lab-msp.com/MSP-Podcast_Competition/IS2025/.