Human state recognition is a critical topic with pervasive and important applications in human–machine systems. Multimodal fusion, which entails integrating metrics from various data sources, has proven to be a potent method for boosting recognition performance. Although recent multimodal-based models have shown promising results, they often fall short of fully leveraging the sophisticated fusion strategies needed to model adequate cross-modal dependencies in the fusion representation; instead, they rely on costly and inconsistent feature crafting and alignment. To address this limitation, we propose an end-to-end multimodal transformer framework for multimodal human state recognition called Husformer. Specifically, we propose using cross-modal transformers, which allow one modality to reinforce itself by directly attending to latent relevance revealed in other modalities, to fuse different modalities while maintaining sufficient awareness of cross-modal interactions. Subsequently, we utilize a self-attention transformer to further prioritize contextual information in the fusion representation. Extensive experiments on two human emotion corpora (DEAP and WESAD) and two cognitive load datasets [the multimodal dataset for objective cognitive workload assessment on simultaneous tasks (MOCAS) and CogLoad] demonstrate that, for human state recognition, Husformer outperforms both state-of-the-art multimodal baselines and the use of a single modality by a large margin, especially when dealing with raw multimodal features. We also conducted an ablation study to show the benefits of each component in Husformer. Experimental details and source code are available at https://github.com/SMARTlab-Purdue/Husformer.
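The fusion scheme described above can be illustrated with a short, hedged sketch: a cross-modal attention block in which one modality's sequence queries another modality's keys and values, followed by a self-attention transformer over the concatenated fusion representation. The module names, dimensions, and pooling below are illustrative assumptions, not the released Husformer implementation (see the linked repository for the authors' code).

```python
# Minimal sketch of cross-modal attention followed by self-attention fusion;
# layer sizes and names are illustrative, not the authors' code.
import torch
import torch.nn as nn

class CrossModalBlock(nn.Module):
    """One modality (target) attends to another (source): queries come from the
    target, keys/values from the source, so the target is reinforced by
    latent cross-modal cues."""
    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                nn.Linear(4 * d_model, d_model))

    def forward(self, target, source):
        attended, _ = self.attn(target, source, source)   # cross-modal attention
        x = self.norm1(target + attended)
        return self.norm2(x + self.ff(x))

class FusionSketch(nn.Module):
    def __init__(self, d_model=64, n_heads=4, n_classes=3):
        super().__init__()
        self.a_from_b = CrossModalBlock(d_model, n_heads)
        self.b_from_a = CrossModalBlock(d_model, n_heads)
        # Self-attention transformer over the concatenated fusion representation.
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.self_attn = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, mod_a, mod_b):                      # (batch, seq, d_model) each
        fused = torch.cat([self.a_from_b(mod_a, mod_b),
                           self.b_from_a(mod_b, mod_a)], dim=1)
        return self.head(self.self_attn(fused).mean(dim=1))  # pooled class logits

logits = FusionSketch()(torch.randn(2, 50, 64), torch.randn(2, 80, 64))
```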
This content will become publicly available on April 1, 2026.

A Unified Biosensor–Vision Multi-Modal Transformer network for emotion recognition
The development of transformer-based models has resulted in significant advances on various vision and NLP research challenges. However, this progress has not been effectively transferred to biosensor/physiological-signal-based emotion recognition, largely because transformers require large amounts of training data and most biosensor datasets are not large enough to train such models. To address this issue, we propose a novel Unified Biosensor–Vision Multimodal Transformer (UBVMT) architecture, which enables self-supervised pretraining by extracting remote photoplethysmography (rPPG) signals from videos in the large CMU-MOSEI dataset. UBVMT classifies emotions in the arousal-valence space by combining a 2D representation of ECG/PPG signals with facial information. In contrast to modality-specific architectures, the unified architecture of UBVMT consists of homogeneous transformer blocks that take as input the image-based representation of the biosensor signals and the corresponding face information for emotion representation. This minimal modality-specific design reduces the number of parameters in UBVMT by half compared to conventional multimodal transformer networks, enabling its application in our web-based system, where loading large models poses significant memory challenges. UBVMT is pretrained in a self-supervised manner by employing masked autoencoding to reconstruct masked patches of video frames and 2D scalogram images of ECG/PPG signals, and contrastive modeling to align face and ECG/PPG data. Extensive experiments on publicly available datasets show that our UBVMT-based model produces results comparable to state-of-the-art techniques.
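As a rough illustration of the unified, modality-agnostic design described above, the sketch below runs patch tokens from face frames and from 2D scalogram images of ECG/PPG signals through the same homogeneous transformer encoder and aligns the two with an InfoNCE-style contrastive loss. All layer sizes, token counts, and function names are assumptions for illustration, not the UBVMT implementation; the masked-autoencoding objective is only noted in a comment.

```python
# Illustrative sketch of a unified, modality-agnostic encoder plus contrastive
# alignment; all sizes and names are assumptions, not the UBVMT implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class UnifiedEncoder(nn.Module):
    """The same homogeneous transformer blocks process patch tokens from face
    frames and from 2D scalogram images of ECG/PPG (no modality-specific towers).
    Masked-autoencoding pretraining would mask a subset of these patch tokens
    and reconstruct them; that objective is omitted here for brevity."""
    def __init__(self, patch_dim=768, d_model=256, depth=4, n_heads=8):
        super().__init__()
        self.proj = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, patches):                       # (batch, n_patches, patch_dim)
        return self.blocks(self.proj(patches))

def contrastive_alignment(face_emb, bio_emb, temperature=0.07):
    """InfoNCE-style loss pulling matched face / ECG-PPG clips together."""
    face = F.normalize(face_emb, dim=-1)
    bio = F.normalize(bio_emb, dim=-1)
    logits = face @ bio.t() / temperature
    labels = torch.arange(face.size(0), device=face.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

encoder = UnifiedEncoder()
face_tokens = torch.randn(4, 196, 768)        # flattened patches of a face frame
scalogram_tokens = torch.randn(4, 196, 768)   # patches of an ECG/PPG scalogram image
loss = contrastive_alignment(encoder(face_tokens).mean(1), encoder(scalogram_tokens).mean(1))
```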
- Award ID(s): 2114808
- PAR ID: 10628188
- Publisher / Repository: Elsevier Ltd.
- Date Published:
- Journal Name: Biomedical Signal Processing and Control
- Volume: 102
- Issue: C
- ISSN: 1746-8094
- Page Range / eLocation ID: 107232
- Subject(s) / Keyword(s): emotion recognition photoplethysmography signal, transformers, representation learning
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Baba, Justin S; Coté, Gerard L (Ed.) In this research, we examine the potential of measuring physiological variables, including heart rate (HR) and respiration rate (RR), on the upper arm using a wireless multimodal sensing system consisting of an accelerometer, a gyroscope, three-wavelength photoplethysmography (PPG), single-sided electrocardiography (SS-ECG), and bioimpedance (BioZ). HR data were collected while the subject was at rest and while typing, and RR data were collected while the subject was at rest. Data from the three PPG wavelengths and from BioZ were compared against SS-ECG as the reference standard, and the accelerometer and gyroscope signals were used to exclude data with excessive motion noise. The results showed that, when the subject remained sedentary, the mean absolute error (MAE) of the HR calculation for all three PPG wavelengths was less than two bpm, while that of BioZ was 3.5 bpm relative to the SS-ECG HR. During typing, the MAE increased for both modalities: it remained below three bpm for all three PPG wavelengths but rose to 7.5 bpm for BioZ. For RR, both modalities were within one breath per minute of the SS-ECG modality at the single breathing rate tested. Overall, all modalities on this upper-arm wearable worked well while the subject was sedentary, but SS-ECG and PPG showed less variability in the HR signal during micro-motions such as typing.
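As a toy illustration of the evaluation described above, the following hedged sketch estimates heart rate from a PPG-like waveform via peak detection and scores it against an SS-ECG-derived reference using mean absolute error; the synthetic signal, sampling rate, and reference value are assumptions, not data from the study.

```python
# Toy sketch: estimate heart rate from a PPG-like waveform by peak detection and
# score it against an SS-ECG-derived reference with mean absolute error.
# The signal, sampling rate, and reference value are synthetic assumptions.
import numpy as np
from scipy.signal import find_peaks

def heart_rate_bpm(signal, fs, min_beat_interval_s=0.4):
    """Estimate beats per minute from peak-to-peak intervals."""
    peaks, _ = find_peaks(signal, distance=int(min_beat_interval_s * fs))
    if len(peaks) < 2:
        return float("nan")
    rr_intervals = np.diff(peaks) / fs            # seconds between detected beats
    return 60.0 / rr_intervals.mean()

fs = 100                                          # Hz, assumed sampling rate
t = np.arange(0, 30, 1 / fs)
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.05 * np.random.randn(t.size)  # ~72 bpm surrogate

hr_ppg = heart_rate_bpm(ppg, fs)
hr_ss_ecg = 72.0                                  # stand-in for the SS-ECG ground truth
mae = abs(hr_ppg - hr_ss_ecg)                     # per-window error; averaged over windows in practice
print(f"PPG-derived HR: {hr_ppg:.1f} bpm, MAE vs SS-ECG: {mae:.2f} bpm")
```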
This article presents a computational solution that enables continuous cardiac monitoring through cross-modality inference of the electrocardiogram (ECG). While some smartwatches now allow users to obtain a 30-second ECG test by tapping a built-in biosensor, these short-term ECG tests often miss intermittent and asymptomatic abnormalities of cardiac function. It is also infeasible to expect persistently active user participation in long-term continuous cardiac monitoring to capture these and other types of cardiac abnormalities. To alleviate the need for continuous user attention and active participation, we design a lightweight neural network that infers ECG from the photoplethysmogram (PPG) signal sensed at the skin surface by a wearable optical sensor. We also develop a diagnosis-oriented training strategy that enables the neural network to capture the pathological features of ECG, aiming to increase the utility of the reconstructed ECG signals for screening cardiovascular diseases (CVDs). We further leverage model interpretation to obtain insights from the data-driven model, for example, to reveal associations between CVDs and ECG/PPG and to demonstrate how the neural network copes with motion artifacts in ambulatory applications. Experimental results on three datasets demonstrate the feasibility of inferring ECG from PPG, achieving high-fidelity ECG reconstruction with only about 40,000 parameters.
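A minimal sketch of the kind of lightweight PPG-to-ECG mapping described above is given below, assuming a small 1D convolutional encoder-decoder in PyTorch. The architecture, layer sizes, and parameter count are illustrative assumptions (this toy version has roughly 10k parameters, not the article's ~40,000) and do not reproduce the authors' network or their diagnosis-oriented training strategy.

```python
# Minimal, hedged sketch of a lightweight 1D network mapping a PPG window to an
# ECG window; architecture and sizes are assumptions, not the authors' model.
import torch
import torch.nn as nn

class PPG2ECG(nn.Module):
    def __init__(self, channels=(1, 16, 32, 16, 1), kernel=9):
        super().__init__()
        layers = []
        for i, (c_in, c_out) in enumerate(zip(channels[:-1], channels[1:])):
            layers.append(nn.Conv1d(c_in, c_out, kernel, padding=kernel // 2))
            if i < len(channels) - 2:                 # no activation on the output layer
                layers += [nn.BatchNorm1d(c_out), nn.ReLU()]
        self.net = nn.Sequential(*layers)

    def forward(self, ppg):                           # (batch, 1, n_samples)
        return self.net(ppg)                          # reconstructed ECG, same length

model = PPG2ECG()
print(sum(p.numel() for p in model.parameters()))     # parameter budget check
ecg_hat = model(torch.randn(8, 1, 1000))              # 10 s window at 100 Hz (assumed)
```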
Recent Transformer-based contextual word representations, including BERT and XLNet, have shown state-of-the-art performance in multiple disciplines within NLP. Fine-tuning these trained contextual models on task-specific datasets has been the key to achieving superior performance downstream. While fine-tuning these pre-trained models is straightforward for lexical applications (applications with only the language modality), it is not trivial for multimodal language, a growing area in NLP focused on modeling face-to-face communication. This is because the pre-trained models do not have the necessary components to accept the two extra modalities of vision and acoustics. In this paper, we propose an attachment to BERT and XLNet called the Multimodal Adaptation Gate (MAG). MAG allows BERT and XLNet to accept multimodal nonverbal data during fine-tuning. It does so by generating a shift to the internal representation of BERT and XLNet, a shift conditioned on the visual and acoustic modalities. In our experiments, we study the commonly used CMU-MOSI and CMU-MOSEI datasets for multimodal sentiment analysis. Fine-tuning MAG-BERT and MAG-XLNet significantly boosts sentiment analysis performance over previous baselines as well as over language-only fine-tuning of BERT and XLNet. On the CMU-MOSI dataset, MAG-XLNet achieves human-level multimodal sentiment analysis performance for the first time in the NLP community.
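A hedged sketch of the adaptation-gate idea described above is shown below: nonverbal (visual and acoustic) features produce a gated shift that is added to the language model's token representations, scaled relative to the token norm. The dimensions, gating, and scaling details are illustrative assumptions and may differ from the published MAG; the sketch only conveys the mechanism.

```python
# Hedged sketch of a multimodal adaptation gate: visual and acoustic features
# produce a gated shift added to the language model's token representations.
# Dimensions, gating, and scaling are illustrative and may differ from MAG.
import torch
import torch.nn as nn

class MultimodalAdaptationGate(nn.Module):
    def __init__(self, d_text=768, d_visual=47, d_acoustic=74, beta=1.0):
        super().__init__()
        self.gate_v = nn.Linear(d_text + d_visual, d_text)
        self.gate_a = nn.Linear(d_text + d_acoustic, d_text)
        self.shift_v = nn.Linear(d_visual, d_text)
        self.shift_a = nn.Linear(d_acoustic, d_text)
        self.norm = nn.LayerNorm(d_text)
        self.beta = beta

    def forward(self, text_h, visual, acoustic):
        # Gates decide how strongly each nonverbal modality displaces a token.
        g_v = torch.relu(self.gate_v(torch.cat([text_h, visual], dim=-1)))
        g_a = torch.relu(self.gate_a(torch.cat([text_h, acoustic], dim=-1)))
        shift = g_v * self.shift_v(visual) + g_a * self.shift_a(acoustic)
        # Scale the shift relative to the text embedding norm before adding it.
        alpha = torch.clamp(text_h.norm(dim=-1, keepdim=True) /
                            (shift.norm(dim=-1, keepdim=True) + 1e-6), max=self.beta)
        return self.norm(text_h + alpha * shift)

mag = MultimodalAdaptationGate()
adapted = mag(torch.randn(2, 20, 768),   # hidden states of the language model (assumed dims)
              torch.randn(2, 20, 47),    # visual features per token
              torch.randn(2, 20, 74))    # acoustic features per token
```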
The inverse problem of inferring the clinical gold-standard electrocardiogram (ECG) from photoplethysmogram (PPG) signals that can be measured by affordable wearable Internet of Healthcare Things (IoHT) devices is a research direction receiving growing attention. It combines the easy measurability of PPG with the rich clinical knowledge of ECG for long-term continuous cardiac monitoring. The prior art for reconstruction using a universal basis, such as the discrete cosine transform (DCT), has limited fidelity for uncommon ECG shapes due to its lack of representative power. To better utilize the data and improve data representation, we design two dictionary learning frameworks, cross-domain joint dictionary learning (XDJDL) and label-consistent XDJDL (LC-XDJDL), to further improve ECG inference quality and enrich PPG-based diagnosis knowledge. Building on the K-SVD technique, the proposed joint dictionary learning frameworks extend its expressive power by simultaneously optimizing a pair of signal dictionaries for PPG and ECG together with the transforms that relate their sparse codes and disease information. The proposed models are evaluated on a variety of PPG and ECG morphologies from two benchmark datasets covering various age groups and disease types. The results show that the proposed frameworks achieve better inference performance than previous methods, with average Pearson correlation coefficients of 0.88 for XDJDL and 0.92 for LC-XDJDL, suggesting an encouraging potential for ECG screening using PPG based on the proactively learned PPG-ECG relationship. By enabling dynamic monitoring and analysis of an individual's health status, the proposed frameworks contribute to the emerging digital-twin paradigm for personalized healthcare.
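The sparse-coding pipeline described above can be illustrated with a toy, hedged sketch: a dictionary is learned for PPG segments, test PPG is sparse-coded, and a least-squares transform maps the PPG codes to ECG reconstructions. This is not XDJDL itself, which jointly optimizes paired PPG/ECG dictionaries, the code transform, and label consistency via K-SVD; the data here is synthetic and all names and sizes are assumptions.

```python
# Toy illustration (not XDJDL itself) of sparse-coding-based ECG inference from
# PPG: learn a PPG dictionary, sparse-code test PPG, and map codes to ECG with
# a least-squares transform. Data here is synthetic; all sizes are assumptions.
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(0)
n_train, seg_len, n_atoms = 200, 128, 32
ppg_train = rng.standard_normal((n_train, seg_len))   # stand-in paired PPG segments
ecg_train = rng.standard_normal((n_train, seg_len))   # stand-in paired ECG segments

# 1) Learn a PPG dictionary (XDJDL instead jointly optimizes PPG/ECG dictionaries
#    and the transforms relating their sparse codes and labels).
dl = DictionaryLearning(n_components=n_atoms, transform_algorithm="omp",
                        transform_n_nonzero_coefs=5, max_iter=10, random_state=0)
codes_train = dl.fit_transform(ppg_train)              # sparse codes, (n_train, n_atoms)

# 2) Least-squares map from PPG sparse codes to ECG segments.
W, *_ = np.linalg.lstsq(codes_train, ecg_train, rcond=None)

# 3) Infer ECG for a new PPG segment from its sparse code.
ppg_test = rng.standard_normal((1, seg_len))
ecg_hat = dl.transform(ppg_test) @ W                   # (1, seg_len) reconstruction
```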