Title: Enhancing Resilience to Missing Data in Audio-Text Emotion Recognition with Multi-Scale Chunk Regularization
Most existing audio-text emotion recognition studies have focused on the computational modeling aspects, including strategies for fusing the modalities. An area that has received less attention is understanding the role of proper temporal synchronization between the modalities in the model performance. This study presents a transformer-based model designed with a word-chunk concept, which offers an ideal framework to explore different strategies for aligning text and speech. The approach creates chunks with alternative alignment strategies that have different levels of dependency on the underlying lexical boundaries. A key contribution of this study is the multi-scale chunk alignment strategy, which generates random alignments to create the chunks without considering lexical boundaries. For every epoch, the approach generates a different alignment for each sentence, serving as an effective regularization method for temporal dependency. Our experimental results based on the MSP-Podcast corpus indicate that providing precise temporal alignment information to create the audio-text chunks does not improve the performance of the system. The attention mechanisms in the transformer-based approach are able to compensate for imperfect synchronization between the modalities. However, using exact lexical boundaries makes the system highly vulnerable to missing modalities. In contrast, the model trained with the proposed multi-scale chunk regularization strategy using random alignment is significantly more robust against missing data and remains effective even in an audio-only emotion recognition task. The code is available at: https://github.com/winston-lin-wei-cheng/MultiScale-Chunk-Regularization
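To make the multi-scale chunk regularization concrete, the following is a minimal sketch of how random, lexically agnostic chunk boundaries could be re-sampled at every epoch. The function names (random_chunk_boundaries, make_chunks), the chunk-count range, and the proportional text pairing are illustrative assumptions, not the authors' API; the actual implementation is in the linked repository.

```python
import numpy as np

def random_chunk_boundaries(num_frames, min_chunks=2, max_chunks=8, rng=None):
    """Sample a random number of chunks and random frame-level boundaries for
    one epoch, without looking at word (lexical) boundaries at all."""
    rng = rng or np.random.default_rng()
    n_chunks = int(rng.integers(min_chunks, max_chunks + 1))
    cuts = np.sort(rng.choice(np.arange(1, num_frames), size=n_chunks - 1, replace=False))
    return np.concatenate(([0], cuts, [num_frames]))

def make_chunks(frames, tokens, boundaries):
    """Slice the acoustic frames at the sampled boundaries and pair each audio
    chunk with a proportional (hence only roughly aligned) span of text tokens;
    the transformer's attention is expected to absorb the resulting mismatch."""
    num_frames, chunks = len(frames), []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        t0 = int(start / num_frames * len(tokens))
        t1 = max(int(end / num_frames * len(tokens)), t0 + 1)
        chunks.append((frames[start:end], tokens[t0:t1]))
    return chunks

# Re-sampled for every sentence at every epoch, acting as a regularizer.
frames = np.random.randn(500, 40)                   # toy: 500 frames x 40-dim features
tokens = ["i", "really", "enjoyed", "the", "show"]  # toy word sequence
chunks = make_chunks(frames, tokens, random_chunk_boundaries(len(frames)))
```

Because a fresh boundary set is drawn for each sentence at each epoch, the model never sees a fixed audio-text alignment, which is the regularization effect described in the abstract.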
Award ID(s):
2016719
PAR ID:
10532850
Author(s) / Creator(s):
Corporate Creator(s):
Editor(s):
na
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400700552
Page Range / eLocation ID:
207 to 215
Subject(s) / Keyword(s):
multimodal emotion recognition, robust modeling
Format(s):
Medium: X
Location:
Paris France
Sponsoring Org:
National Science Foundation
More Like this
  1. Chunk-level speech emotion recognition (SER) is a common modeling scheme to obtain better recognition performance than sentence-level formulations. A key open question is the role of lexical boundary information in the process of splitting a sentence into small chunks. Is there any benefit in providing precise lexical boundary information to segment the speech into chunks (e.g., word-level alignments)? This study analyzes the role of lexical boundary information by exploring alternative segmentation strategies for chunk-level SER. We compare six chunk-level segmentation strategies that either consider word-level alignments or traditional time-based segmentation methods, varying the number of chunks and the duration of the chunks. We conduct extensive experiments to evaluate these chunk-level segmentation approaches using multiple corpora and multiple acoustic feature sets. The results show a minor contribution of the word-level timing boundaries, where centering the chunks around words does not lead to significant performance gains. Instead, the critical factor in effectively segmenting a sentence into data chunks is to define the number of chunks according to the number of spoken words in the sentence.
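As a hedged illustration of that finding, the sketch below segments a sentence into near-equal-duration chunks whose count matches the number of spoken words, without using any word-level timing. The function name and feature dimensions are invented for the example.

```python
import numpy as np

def uniform_chunks_by_word_count(frames, num_words):
    """Split a frame sequence into as many near-equal-duration chunks as there
    are spoken words, ignoring the actual word-level alignments."""
    return np.array_split(frames, max(num_words, 1))

frames = np.random.randn(320, 40)                  # toy: 320 frames x 40-dim features
chunks = uniform_chunks_by_word_count(frames, num_words=7)
print([c.shape for c in chunks])                   # 7 roughly equal chunks
```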
  2. Emotion recognition is inherently a multimodal problem. Humans use both audible and visual cues to determine a person's emotions. There has been extensive improvement in the methods used to fuse the audio and visual representations of two unimodal deep-learning models. However, there is a lack of accommodation for modalities that require disparate amounts of computational resources to provide the same amount of temporal information. As the sequence length increases, current methods often make simplifications such as discarding frames or cropping the sequence. This paper introduces a chunking methodology designed for cross-attention-based multimodal transformer architectures. The approach involves segmenting the visual input (the more computationally demanding modality) into chunks. Cross-attention is then performed between the encoded audio and visual features rather than over the original sequence lengths of the unimodal backbones. Our method achieves significant improvements over conventional cross-attention techniques in the audio-visual domain for a six-class emotion recognition problem, demonstrating better F1 score, precision, and recall on the CREMA-D database while reducing computational overhead.
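A minimal PyTorch sketch of this chunking idea, assuming the visual features have already been encoded and are mean-pooled into a fixed number of chunks before cross-attention; the module name, dimensions, chunk count, and pooling choice are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

class ChunkedCrossAttention(nn.Module):
    def __init__(self, dim=256, num_heads=4, num_chunks=16):
        super().__init__()
        self.num_chunks = num_chunks
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (batch, T_audio, dim); visual_feats: (batch, T_video, dim)
        # Mean-pool the long visual sequence into a fixed number of chunks.
        chunks = torch.stack(
            [c.mean(dim=1) for c in torch.chunk(visual_feats, self.num_chunks, dim=1)],
            dim=1)                                          # (batch, num_chunks, dim)
        # Audio queries attend over the chunked visual keys/values.
        fused, _ = self.cross_attn(query=audio_feats, key=chunks, value=chunks)
        return fused

audio = torch.randn(2, 100, 256)
video = torch.randn(2, 900, 256)                            # much longer visual sequence
out = ChunkedCrossAttention()(audio, video)                 # (2, 100, 256)
```

The attention cost now scales with the number of visual chunks rather than the raw frame count, which is where the computational saving comes from.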
  3. Decades of scientific research have been conducted on developing and evaluating methods for automated emotion recognition. With exponentially growing technology, there is a wide range of emerging applications that require recognition of the user's emotional state. This paper investigates a robust approach for multimodal emotion recognition during a conversation. Three separate models for the audio, video, and text modalities are structured and fine-tuned on the MELD dataset. In this paper, a transformer-based cross-modality fusion with the EmbraceNet architecture is employed to estimate the emotion. The proposed multimodal network architecture achieves up to 65% accuracy, which significantly surpasses any of the unimodal models. We apply multiple evaluation techniques to show that our model is robust and can even outperform state-of-the-art models on the MELD dataset.
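As a rough, simplified sketch of an EmbraceNet-style fusion layer (module name, sizes, and the docking/selection details are approximations of the published architecture, not the authors' code): each unimodal embedding is docked to a shared dimension, and every output coordinate is drawn from exactly one modality, with missing modalities handled by zeroing their selection probability.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EmbraceFusion(nn.Module):
    def __init__(self, in_dims, shared_dim=256):
        super().__init__()
        # One "docking" projection per modality into a shared feature space.
        self.dock = nn.ModuleList([nn.Linear(d, shared_dim) for d in in_dims])

    def forward(self, feats, availability=None):
        # feats: list of (batch, in_dim_i) unimodal embeddings (e.g., audio/video/text).
        docked = torch.stack([F.relu(layer(x)) for layer, x in zip(self.dock, feats)],
                             dim=1)                              # (batch, M, shared_dim)
        b, m, c = docked.shape
        probs = torch.ones(b, m, device=docked.device)
        if availability is not None:                             # (batch, M) 0/1 mask
            probs = probs * availability
        probs = probs / probs.sum(dim=1, keepdim=True)
        # Draw one source modality per output feature coordinate.
        choice = torch.multinomial(probs, c, replacement=True)   # (batch, shared_dim)
        mask = F.one_hot(choice, num_classes=m).permute(0, 2, 1).float()  # (batch, M, shared_dim)
        return (docked * mask).sum(dim=1)                        # (batch, shared_dim)

fusion = EmbraceFusion(in_dims=[512, 768, 300])
out = fusion([torch.randn(4, 512), torch.randn(4, 768), torch.randn(4, 300)])
```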
  4. Cooperative 3D printing (C3DP) is a novel approach to additive manufacturing, where multiple printhead-carrying mobile robots work together cooperatively to print a desired part. The core of C3DP is the chunk-based printing strategy, in which the desired part is first split into smaller chunks, and the chunks are then assigned to individual printing robots. These robots work on the chunks simultaneously and in a scheduled sequence until the entire part is complete. Though promising, C3DP lacks a proper framework that enables automatic chunking and scheduling given the available number of robots. In this study, we develop a computational framework that can automatically generate a print schedule for a specified number of chunks. The framework contains 1) a random generator that creates random print schedules using an adjacency matrix representing the directed dependency tree (DDT) structure of the chunks; 2) a set of geometric constraints against which the randomly generated schedules are checked for validity; and 3) a printing-time evaluation metric for comparing the performance of all valid schedules. With the developed framework, we present a case study of printing a large rectangular plate whose dimensions exceed what traditional desktop printers can print. The study showcases that our computational framework can successfully generate a variety of scheduling strategies for collision-free C3DP without any human intervention.
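An illustrative sketch of that generate-validate-evaluate loop is shown below; the geometric-constraint check and the printing-time metric are placeholders, since the paper's actual checks depend on the chunk geometry and robot kinematics, and all names are invented for the example.

```python
import numpy as np

def random_dependency_matrix(n_chunks, edge_prob=0.3, rng=None):
    """Random upper-triangular adjacency matrix: entry (i, j) = 1 means chunk j
    depends on chunk i. The upper-triangular structure guarantees no cycles."""
    rng = rng or np.random.default_rng()
    adj = (rng.random((n_chunks, n_chunks)) < edge_prob).astype(int)
    return np.triu(adj, k=1)

def passes_geometric_constraints(adj):
    """Placeholder for collision / reachability checks on the chunk layout."""
    return True

def printing_time(adj, chunk_time=1.0, n_robots=2):
    """Crude makespan estimate: the longest dependency chain, or the total work
    divided across robots, whichever is larger, times the per-chunk time."""
    n = adj.shape[0]
    depth = np.zeros(n)
    for j in range(n):
        preds = np.nonzero(adj[:, j])[0]
        depth[j] = 1 + (depth[preds].max() if len(preds) else 0)
    return max(depth.max(), np.ceil(n / n_robots)) * chunk_time

candidates = [random_dependency_matrix(8) for _ in range(100)]
valid = [a for a in candidates if passes_geometric_constraints(a)]
best = min(valid, key=printing_time)
```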
  5. Speech emotion recognition (SER) plays an important role in multiple fields such as healthcare, human-computer interaction (HCI), and security and defense. Emotional labels are often annotated at the sentence level (i.e., one label per sentence), resulting in a sequence-to-one recognition problem. Traditionally, studies have relied on statistical descriptions, which are computed over time from low-level descriptors (LLDs), creating a fixed-dimension sentence-level feature representation regardless of the duration of the sentence. However, sentence-level features lack temporal information, which limits the performance of SER systems. Recently, new deep learning architectures have been proposed to model temporal data. An important question is how to extract emotion-relevant features with temporal information. This study proposes a novel data processing approach that extracts a fixed number of small chunks over sentences of different durations by changing the overlap between these chunks. The approach is flexible, providing an ideal framework to combine gated networks or attention mechanisms with long short-term memory (LSTM) networks. Our experimental results based on the MSP-Podcast dataset demonstrate that the proposed method not only significantly improves recognition accuracy over alternative temporal-based models relying on LSTMs, but also improves computational efficiency.
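A minimal sketch of this fixed-chunk-count idea, assuming equal-length windows whose hop (and thus overlap) adapts to the sentence duration; the chunk count, chunk length, and padding rule are illustrative choices, not the paper's exact settings.

```python
import numpy as np

def fixed_count_chunks(frames, num_chunks=11, chunk_len=100):
    """Return `num_chunks` windows of `chunk_len` frames each; shorter sentences
    get more overlap, longer sentences less (padding very short inputs)."""
    n = len(frames)
    if n < chunk_len:
        frames = np.pad(frames, ((0, chunk_len - n), (0, 0)))
        n = chunk_len
    starts = np.linspace(0, n - chunk_len, num_chunks).astype(int)
    return np.stack([frames[s:s + chunk_len] for s in starts])

short = fixed_count_chunks(np.random.randn(300, 40))    # heavy overlap
long = fixed_count_chunks(np.random.randn(3000, 40))    # light overlap
print(short.shape, long.shape)                          # (11, 100, 40) in both cases
```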