<?xml version="1.0" encoding="UTF-8"?><rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:dcq="http://purl.org/dc/terms/"><records count="1" morepages="false" start="1" end="1"><record rownumber="1"><dc:product_type>Journal Article</dc:product_type><dc:title>A Unified Biosensor–Vision Multi-Modal Transformer network for emotion recognition</dc:title><dc:creator>Ali, Kamran; Hughes, Charles E</dc:creator><dc:corporate_author/><dc:editor/><dc:description>The development of transformer-based models has resulted in significant advances in addressing various vision and NLP-based research challenges. However, the progress made in transformer-based methods has not been effectively applied to biosensor/physiological signal-based emotion recognition research. The reasons are that
transformers require large amounts of training data, and most biosensor datasets are not large enough to train these models. To address this issue, we propose a novel Unified Biosensor–Vision Multimodal Transformer (UBVMT) architecture, which enables self-supervised pretraining by extracting Remote Photoplethysmography (rPPG)
signals from videos in the large CMU-MOSEI dataset. UBVMT classifies emotions in the arousal-valence space by combining a 2D representation of ECG/PPG signals with facial information. In contrast to modality-specific architectures, our novel unified UBVMT architecture consists of homogeneous transformer blocks that take
as input the image-based representation of the biosensor signals and the corresponding face information for emotion representation. This design, with minimal modality-specific components, reduces the number of parameters in UBVMT by half compared to conventional multimodal transformer networks, enabling its application in our web-based
system, where loading large models poses significant memory challenges. UBVMT is pretrained in a self-supervised manner by employing masked autoencoding to reconstruct masked patches of video frames and 2D scalogram images of ECG/PPG signals, and contrastive modeling to align face and ECG/PPG data. Extensive experiments on publicly available datasets show that our UBVMT-based model produces results comparable to state-of-the-art techniques.</dc:description><dc:publisher>Elsevier Ltd.</dc:publisher><dc:date>2025-04-01</dc:date><dc:nsf_par_id>10628188</dc:nsf_par_id><dc:journal_name>Biomedical Signal Processing and Control</dc:journal_name><dc:journal_volume>102</dc:journal_volume><dc:journal_issue>C</dc:journal_issue><dc:page_range_or_elocation>107232</dc:page_range_or_elocation><dc:issn>1746-8094</dc:issn><dc:isbn/><dc:doi>https://doi.org/10.1016/j.bspc.2024.107232</dc:doi><dcq:identifierAwardId>2114808</dcq:identifierAwardId><dc:subject>emotion recognition</dc:subject><dc:subject>photoplethysmography signal</dc:subject><dc:subject>transformers</dc:subject><dc:subject>representation learning</dc:subject><dc:version_number/><dc:location/><dc:rights/><dc:institution/><dc:sponsoring_org>National Science Foundation</dc:sponsoring_org></record></records></rdf:RDF>