Learning Cross-Modal Audiovisual Representations with Ladder Networks for Emotion Recognition

Goncalves, Lucas; Busso, Carlos

doi:10.1109/ICASSP49357.2023.10096138

Representation learning is a challenging but essential task in audiovisual learning. A key challenge is to generate strong cross-modal representations while still capturing the discriminative information contained in unimodal features. Properly capturing this information is important for increasing accuracy and robustness in audiovisual tasks. Focusing on emotion recognition, this study proposes novel cross-modal ladder networks that capture modality-specific information while building strong cross-modal representations. Our method uses representations from a backbone network to implement unsupervised auxiliary tasks that reconstruct intermediate layer representations across the acoustic and visual networks. The skip connections between the cross-modal encoder and decoder provide powerful modality-specific and multimodal representations for emotion recognition. On the CREMA-D corpus, our model achieves precision, recall, and F1 scores over 80% on a six-class problem.
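As an illustration of the metrics reported in the abstract, the following is a minimal sketch of how precision, recall, and F1 might be computed for a six-class problem. The abstract does not state the averaging scheme, so macro-averaging (an unweighted mean over classes) is an assumption here, and the function name `macro_prf` and the integer class labels 0 to 5 are hypothetical.

```python
from collections import Counter


def macro_prf(y_true, y_pred, num_classes=6):
    """Return macro-averaged (precision, recall, F1).

    Assumption: macro-averaging over `num_classes` integer labels;
    the abstract does not say which averaging the authors used.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # class p was predicted wrongly
            fn[t] += 1          # class t was missed

    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        pred_c = tp[c] + fp[c]  # how often class c was predicted
        true_c = tp[c] + fn[c]  # how often class c actually occurred
        prec = tp[c] / pred_c if pred_c else 0.0
        rec = tp[c] / true_c if true_c else 0.0
        f1 = 2 * prec * rec / (prec + rec) if (prec + rec) else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)

    n = float(num_classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy labels (not data from the paper): one class-0 sample is
# misclassified as class 1, everything else is correct.
y_true = [0, 1, 2, 3, 4, 5, 0, 1]
y_pred = [0, 1, 2, 3, 4, 5, 1, 1]
precision, recall, f1 = macro_prf(y_true, y_pred)
```

With macro-averaging every emotion class contributes equally regardless of how many samples it has, which is one common choice when class balance matters; a weighted or micro average would give different numbers.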

Award ID(s): 1718944

PAR ID: 10441291

Author(s) / Creator(s): Goncalves, Lucas; Busso, Carlos

Date Published: 2023-06-04

Journal Name: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2023)

Page Range / eLocation ID: 1 to 5

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/ICASSP49357.2023.10096138
