The development of transformer-based models has resulted in significant advances in addressing various vision and NLP-based research challenges. However, the progress made in transformer-based methods has not been effectively applied to biosensor/physiological signal-based emotion recognition research. The reason is that transformers require large amounts of training data, and most biosensor datasets are not large enough to train these models. To address this issue, we propose a novel Unified Biosensor–Vision Multimodal Transformer (UBVMT) architecture, which enables self-supervised pretraining by extracting Remote Photoplethysmography (rPPG) signals from videos in the large CMU-MOSEI dataset. UBVMT classifies emotions in the arousal-valence space by combining a 2D representation of ECG/PPG signals with facial information. Unlike modality-specific architectures, the unified architecture of UBVMT consists of homogeneous transformer blocks that take as input the image-based representation of the biosensor signals and the corresponding face information for emotion representation. This minimal modality-specific design halves the number of parameters in UBVMT compared to conventional multimodal transformer networks, enabling its application in our web-based system, where loading large models poses significant memory challenges. UBVMT is pretrained in a self-supervised manner by employing masked autoencoding to reconstruct masked patches of video frames and 2D scalogram images of ECG/PPG signals, and contrastive modeling to align face and ECG/PPG data. Extensive experiments on publicly available datasets show that our UBVMT-based model produces results comparable to state-of-the-art techniques.
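The two pretraining objectives named in this abstract, masked autoencoding and contrastive alignment, can be illustrated with a minimal NumPy sketch. This is not the UBVMT implementation: the function names, the 75% mask ratio, the temperature of 0.07, and the use of a symmetric InfoNCE loss are all assumptions chosen for illustration.

```python
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return a boolean array where True marks a patch hidden from the encoder."""
    num_masked = int(round(num_patches * mask_ratio))
    mask = np.zeros(num_patches, dtype=bool)
    mask[rng.choice(num_patches, size=num_masked, replace=False)] = True
    return mask

def masked_reconstruction_loss(patches, reconstruction, mask):
    """Mean squared error computed only over the masked patches."""
    diff = (patches - reconstruction)[mask]
    return float(np.mean(diff ** 2))

def contrastive_alignment_loss(face_emb, signal_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matching pair."""
    f = face_emb / np.linalg.norm(face_emb, axis=1, keepdims=True)
    s = signal_emb / np.linalg.norm(signal_emb, axis=1, keepdims=True)
    logits = f @ s.T / temperature

    def cross_entropy_on_diagonal(l):
        l = l - l.max(axis=1, keepdims=True)  # subtract row max for numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return float(0.5 * (cross_entropy_on_diagonal(logits)
                        + cross_entropy_on_diagonal(logits.T)))

# Hypothetical toy data: 16 patches (video-frame or scalogram), 8 values each.
rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))
mask = random_patch_mask(16, 0.75, rng)
```

In a real pipeline the reconstruction would come from a decoder fed only the visible patches, and the two embedding matrices would come from the face and ECG/PPG branches of the same batch.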
Multimodal Neurophysiological Transformer for Emotion Recognition
Understanding neural function often requires multiple modalities of data, including electrophysiological data, imaging techniques, and demographic surveys. In this paper, we introduce a novel neurophysiological model to tackle major challenges in modeling multimodal data. First, we avoid non-alignment issues between raw signals and extracted frequency-domain features by addressing the issue of variable sampling rates. Second, we encode modalities through “cross-attention” with other modalities. Lastly, we utilize properties of our parent transformer architecture to model long-range dependencies between segments across modalities and assess intermediary weights to better understand how source signals affect prediction. We apply our Multimodal Neurophysiological Transformer (MNT) to predict valence and arousal in an existing open-source dataset. Experiments on non-aligned multimodal time-series show that our model performs similarly to and, in some cases, outperforms existing methods in classification tasks. In addition, qualitative analysis suggests that MNT is able to model neural influences on autonomic activity in predicting arousal. Our architecture has the potential to be fine-tuned to a variety of downstream tasks, including for BCI systems.
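As a rough illustration of the cross-attention step described above, the sketch below lets segments of one modality attend to segments of another with a different sequence length, so no resampling or alignment is required. It is a generic single-head scaled dot-product attention in NumPy, not the MNT code; the modality names (`eeg`, `gsr`), the dimensions, and the random projection matrices are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(target, source, w_q, w_k, w_v):
    """Queries come from `target`; keys and values come from `source`."""
    q = target @ w_q
    k = source @ w_k
    v = source @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # one row of weights per target segment
    return weights @ v, weights

rng = np.random.default_rng(0)
eeg = rng.normal(size=(10, 16))  # assumed: 10 segments with 16 features each
gsr = rng.normal(size=(4, 16))   # assumed: 4 segments from a slower-sampled signal
w_q, w_k, w_v = (rng.normal(size=(16, 8)) for _ in range(3))

fused, attn = cross_attention(eeg, gsr, w_q, w_k, w_v)
```

The returned `attn` matrix has one row per target segment and one column per source segment; weights of this kind are what an analysis of how source signals affect prediction would inspect.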
- Award ID(s): 1934968
- PAR ID: 10397442
- Date Published:
- Journal Name: 2022 44th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC)
- Page Range / eLocation ID: 3563 to 3567
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed through the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input and, as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence to sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
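The cycle consistency idea in this abstract can be sketched with a toy example: translate the source modality to the target, translate the result back, and penalize both the forward error and the round-trip error. The sketch uses plain linear maps in place of Seq2Seq models and omits the sentiment prediction term of the coupled objective; the function names, feature shapes, and `cycle_weight` parameter are assumptions for illustration only.

```python
import numpy as np

def mse(a, b):
    return float(np.mean((a - b) ** 2))

def cycle_translation_loss(source, target, w_forward, w_backward, cycle_weight=1.0):
    """Forward translation loss plus a weighted cycle (round-trip) loss."""
    predicted_target = source @ w_forward                 # source -> target
    reconstructed_source = predicted_target @ w_backward  # translated target -> source
    forward_loss = mse(predicted_target, target)
    cycle_loss = mse(reconstructed_source, source)
    return forward_loss + cycle_weight * cycle_loss

# Hypothetical paired features: 5 time steps, 6 dimensions per modality.
rng = np.random.default_rng(0)
text_features = rng.normal(size=(5, 6))
audio_features = rng.normal(size=(5, 6))
w_forward = rng.normal(size=(6, 6))
w_backward = rng.normal(size=(6, 6))

loss = cycle_translation_loss(text_features, audio_features, w_forward, w_backward)
```

Because the backward pass starts from the predicted target rather than the true one, only `text_features` is needed to compute the intermediate representation at test time, which mirrors the robustness argument made in the abstract.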
-
Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short in fully leveraging sophisticated fusion strategies essential for modeling adequate cross-modal dependencies in the fusion representation. Instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called Husformer. Specifically, we propose using cross-modal transformers, which inspire one modality to reinforce itself through directly attending to latent relevance revealed in other modalities, to fuse different modalities while ensuring sufficient awareness of the cross-modal interactions introduced. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that in the recognition of the human state, our Husformer outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer.
-
Deep neural networks, including the Transformer architecture, have achieved remarkable performance in various time series tasks. However, their effectiveness in handling clinical time series data is hindered by specific challenges: 1) Sparse event sequences collected asynchronously with multivariate time series, and 2) Limited availability of labeled data. To address these challenges, we propose a self-supervised Transformer model designed to efficiently encode multi-sourced asynchronous sequential data, such as structured Electronic Health Records (EHRs). We introduce three pretext tasks for pre-training the Transformer model, utilizing large amounts of unlabeled structured EHR data, followed by fine-tuning on downstream prediction tasks using the limited labeled data. Through extensive experiments on three real-world health datasets, we demonstrate that our model achieves state-of-the-art performance on benchmark clinical tasks, including in-hospital mortality classification, phenotyping, and length-of-stay prediction. Our findings highlight the efficacy of our model in addressing the challenges associated with clinical time series data, thus contributing to advancements in healthcare analytics. Our code is available at https://github.com/SigmaTsing/TransEHR.git.
-
Vision-language (VL) pre-training has recently received considerable attention. However, most existing end-to-end pre-training approaches either only aim to tackle VL tasks such as image-text retrieval, visual question answering (VQA) and image captioning that test high-level understanding of images, or only target region-level understanding for tasks such as phrase grounding and object detection. We present FIBER (Fusion-In-the-Backbone-based transformER), a new VL model architecture that can seamlessly handle both these types of tasks. Instead of having dedicated transformer layers for fusion after the uni-modal backbones, FIBER pushes multimodal fusion deep into the model by inserting cross-attention into the image and text backbones to better capture multimodal interactions. In addition, unlike previous work that is either only pre-trained on image-text data or on fine-grained data with box-level annotations, we present a two-stage pre-training strategy that uses both these kinds of data efficiently: (i) coarse-grained pre-training based on image-text data; followed by (ii) fine-grained pre-training based on image-text-box data. We conduct comprehensive experiments on a wide range of VL tasks, ranging from VQA, image captioning, and retrieval, to phrase grounding, referring expression comprehension, and object detection. Using deep multimodal fusion coupled with the two-stage pre-training, FIBER provides consistent performance improvements over strong baselines across all tasks, often outperforming methods using magnitudes more data. Code is released at https://github.com/microsoft/FIBER.