Title: Multimodal Context-Based Continuous Authentication
We present a new multimodal, context-based dataset for continuous authentication. The dataset contains 27 subjects, with an age range of [8, 72]; data was collected across multiple sessions while the subjects watched videos meant to elicit an emotional response. Collected data includes accelerometer data, heart rate, electrodermal activity, skin temperature, and face videos. We also propose a baseline approach for fair comparisons when using the proposed dataset. The approach combines a pretrained backbone network with a supervised contrastive loss for the face modality, while time-series features are extracted from the physiological signals and used for classification. On the proposed dataset, this approach achieves an average accuracy, precision, and recall of 76.59%, 88.90%, and 53.25%, respectively, on the electrical signals, and 90.39%, 98.77%, and 75.71%, respectively, on the face videos.
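The face branch of the baseline couples a pretrained backbone with a supervised contrastive loss, pulling embeddings of the same identity together and pushing different identities apart. Below is a minimal sketch of such a loss in PyTorch; the temperature value and the batch-level formulation are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupConLoss(nn.Module):
    """Supervised contrastive loss over a batch of L2-normalized embeddings
    (all samples sharing an identity label are treated as positives)."""
    def __init__(self, temperature=0.07):  # temperature is an assumed default
        super().__init__()
        self.temperature = temperature

    def forward(self, features, labels):
        # features: (batch, dim) embeddings from the pretrained backbone
        # labels:   (batch,) integer identity labels
        features = F.normalize(features, dim=1)
        sim = features @ features.T / self.temperature
        # exclude each sample's similarity with itself from the softmax denominator
        self_mask = torch.eye(len(labels), dtype=torch.bool, device=features.device)
        sim = sim.masked_fill(self_mask, -1e9)
        log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
        # positives: same identity label, different sample
        pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
        pos_counts = pos_mask.sum(dim=1)
        valid = pos_counts > 0  # anchors with at least one positive in the batch
        loss = -(log_prob * pos_mask).sum(dim=1)[valid] / pos_counts[valid]
        return loss.mean()
```

In a setup like the one described, this loss would presumably be applied to embeddings produced by the pretrained backbone, while the physiological branch is trained separately on the extracted time-series features.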
Award ID(s):
2039373
PAR ID:
10495773
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
IEEE
Date Published:
ISBN:
979-8-3503-3726-6
Page Range / eLocation ID:
1 to 10
Format(s):
Medium: X
Location:
Ljubljana, Slovenia
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we propose a machine learning-based multi-stream framework to recognize American Sign Language (ASL) manual signs and nonmanual gestures (face and head movements) in real time from RGB-D videos. Our approach is based on 3D Convolutional Neural Networks (3D CNNs), fusing multi-modal features including hand gestures, facial expressions, and body poses from multiple channels (RGB, Depth, Motion, and Skeleton joints). To learn the overall temporal dynamics in a video, a proxy video is generated for each video by selecting a subset of its frames, and these proxy videos are used to train the proposed 3D CNN model. We collected a new ASL dataset, ASL-100-RGBD, which contains 42 RGB-D videos captured by a Microsoft Kinect V2 camera. Each video consists of 100 ASL manual signs, along with the RGB channel, Depth maps, Skeleton joints, Face features, and HD face. The dataset is fully annotated for each semantic region (i.e., the time duration of each sign that the human signer performs). Our proposed method achieves 92.88% accuracy for recognizing 100 ASL sign glosses on our newly collected ASL-100-RGBD dataset. The effectiveness of our framework for recognizing hand gestures from RGB-D videos is further demonstrated on a large-scale dataset, ChaLearn IsoGD, achieving state-of-the-art results.
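The proxy-video step above (selecting a subset of frames so a fixed-length clip still covers a video's full temporal extent) can be sketched as follows; uniform sampling and a clip length of 16 frames are assumptions for illustration, not the authors' exact procedure.

```python
import numpy as np

def make_proxy_video(frames, num_proxy_frames=16):
    """Select an evenly spaced subset of frames so that a fixed-length clip
    spans the full temporal extent of the original video.
    frames: sequence of frames (e.g. HxWxC arrays); returns a stacked array."""
    if len(frames) == 0:
        raise ValueError("empty video")
    # evenly spaced indices across the whole video (repeating frames if it is short)
    idx = np.linspace(0, len(frames) - 1, num_proxy_frames).round().astype(int)
    return np.stack([frames[i] for i in idx])
```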
  2. In this paper, we propose a novel, generalizable, and scalable idea that eliminates the need for collecting Radio Frequency (RF) measurements when training RF sensing systems for human-motion-related activities. Existing learning-based RF sensing systems require collecting massive RF training data, which depends heavily on the particular sensing setup and involved activities. Thus, new data needs to be collected when the setup or activities change, significantly limiting the practical deployment of RF sensing systems. On the other hand, recent years have seen a growing, massive number of online videos involving various human activities and motions. In this paper, we propose to translate such already-available online videos into instant simulated RF data for training any human-motion-based RF sensing system, in any given setup. To validate our proposed framework, we conduct a case study of gym activity classification, where CSI magnitude measurements of three WiFi links are used to classify a person's activity from 10 different physical exercises. We utilize YouTube gym activity videos and translate them to RF by simulating the WiFi signals that would have been measured if the person in the video had performed the activity near the transceivers. We then train a classifier on the simulated data and extensively test it with real WiFi data of 10 subjects performing the activities in 3 areas. Our system achieves a classification accuracy of 86% on activity periods, each containing an average of 5.1 exercise repetitions, and 81% on individual repetitions of the exercises. This demonstrates that our approach can generate reliable RF training data from already-available videos and can successfully train an RF sensing system without any real RF measurements. The proposed pipeline can also be used beyond training, for the analysis and design of RF sensing systems, without the need for massive RF data collection.
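The train-on-simulated, test-on-real protocol described above can be illustrated with a generic classifier. The per-window statistics and random-forest model below are placeholders standing in for the authors' actual feature extraction and classifier, shown only to make the evaluation flow concrete.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

def csi_window_features(window):
    """Simple per-window features over CSI magnitude of several links.
    window: array of shape (time, links); returns a flat feature vector."""
    return np.concatenate([window.mean(axis=0),
                           window.std(axis=0),
                           window.max(axis=0) - window.min(axis=0)])

def train_on_simulated_test_on_real(sim_windows, sim_labels, real_windows, real_labels):
    """Fit a classifier on video-derived simulated CSI windows and report
    accuracy on real measured CSI windows."""
    X_train = np.stack([csi_window_features(w) for w in sim_windows])
    X_test = np.stack([csi_window_features(w) for w in real_windows])
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, sim_labels)
    return accuracy_score(real_labels, clf.predict(X_test))
```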
  3. Face recognition systems are susceptible to presentation attacks such as printed photo attacks, replay attacks, and 3D mask attacks. These attacks, primarily studied in the visible spectrum, aim to obfuscate or impersonate a person's identity. This paper presents a unique multispectral video face database for face presentation attacks using latex and paper masks. The proposed Multispectral Latex Mask based Video Face Presentation Attack (MLFP) database contains 1350 videos in the visible, near-infrared, and thermal spectrums. Since the database consists of videos of subjects without any mask as well as wearing ten different masks, the effect of identity concealment is analyzed in each spectrum using face recognition algorithms. We also present the performance of existing presentation attack detection algorithms on the proposed MLFP database. It is observed that the thermal imaging spectrum is the most effective in detecting face presentation attacks.
  4. This paper introduces a new ROSbag-based multimodal affective dataset for emotional and cognitive states, generated using the Robot Operating System (ROS). We utilized images and sounds from the International Affective Pictures System (IAPS) and the International Affective Digitized Sounds (IADS) to stimulate targeted emotions (happiness, sadness, anger, fear, surprise, disgust, and neutral), and a dual N-back game to stimulate different levels of cognitive workload. Thirty human subjects participated in the user study; their physiological data were collected using the latest commercial wearable sensors, behavioral data were collected using hardware devices such as cameras, and subjective assessments were carried out through questionnaires. All data were stored in single ROSbag files rather than in conventional Comma-Separated Values (CSV) files. This not only ensures synchronization of the signals and videos in the dataset, but also allows researchers to easily analyze and verify their algorithms by connecting directly to the dataset through ROS. The generated affective dataset consists of 1,602 ROSbag files, and the size of the dataset is about 787 GB. The dataset is made publicly available. We expect that our dataset can be a great resource for many researchers in the fields of affective computing, Human-Computer Interaction (HCI), and Human-Robot Interaction (HRI).
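Because everything is stored in ROSbag files, the streams can be read back in sync with standard ROS tooling. Below is a minimal reading sketch using the ROS 1 rosbag Python API; the bag file name and topic names are hypothetical, since the dataset's actual topic layout is not given here.

```python
import rosbag

# Hypothetical bag file and topic names; the real topics depend on the recording setup.
with rosbag.Bag('subject01_session01.bag') as bag:
    for topic, msg, t in bag.read_messages(topics=['/physiology/eda', '/camera/image_raw']):
        # t is a ROS timestamp, so the multimodal streams stay aligned on a common clock
        print(topic, t.to_sec())
```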
  5. As virtual reality (VR) offers an experience unlike that of any existing multimedia technology, VR videos, also called 360-degree videos, have attracted considerable attention from academia and industry. How to quantify and model end users' perceived quality when watching 360-degree videos, referred to as quality of experience (QoE), is central to high-quality provisioning of these multimedia services. In this work, we present EyeQoE, a novel QoE assessment model for 360-degree videos using ocular behaviors. Unlike prior approaches, which mostly rely on objective factors, EyeQoE leverages the new ocular sensing modality to comprehensively capture both subjective and objective impact factors for QoE modeling. We propose a novel method that models eye-based cues as graphs and develop a GCN-based classifier that produces QoE assessments by extracting intrinsic features from graph-structured data. We further exploit a Siamese network to eliminate the impact of subject and visual-stimulus heterogeneity. A domain adaptation scheme named MADA is also devised to generalize our model to a vast range of unseen 360-degree videos. Extensive tests are carried out on our collected dataset. Results show that EyeQoE achieves the best prediction accuracy at 92.9%, outperforming state-of-the-art approaches. As another contribution of this work, we have made our dataset publicly available at https://github.com/MobiSec-CSE-UTA/EyeQoE_Dataset.git.
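A GCN-based classifier over graph-structured eye-movement features could look roughly like the sketch below; PyTorch Geometric, the two-layer architecture, the hidden size, and a five-level QoE output are assumptions for illustration rather than EyeQoE's actual design.

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class GazeGCN(torch.nn.Module):
    """Two GCN layers followed by mean pooling and a linear QoE classifier."""
    def __init__(self, in_dim, hidden_dim=64, num_classes=5):  # sizes are assumed
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden_dim)
        self.conv2 = GCNConv(hidden_dim, hidden_dim)
        self.head = torch.nn.Linear(hidden_dim, num_classes)

    def forward(self, x, edge_index, batch):
        # x: node features (eye-based cues), edge_index: graph connectivity,
        # batch: graph membership vector for pooling a mini-batch of graphs
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        return self.head(global_mean_pool(h, batch))
```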