

Title: ViTag: Online WiFi Fine Time Measurements Aided Vision-Motion Identity Association in Multi-person Environments
In this paper, we present ViTag to associate user identities across multimodal data, particularly those obtained from cameras and smartphones. ViTag associates a sequence of vision-tracker-generated bounding boxes with Inertial Measurement Unit (IMU) data and Wi-Fi Fine Time Measurements (FTM) from smartphones. We formulate the problem as association by sequence-to-sequence (seq2seq) translation. In this two-step process, our system first performs cross-modal translation using a multimodal LSTM encoder-decoder network (X-Translator) that translates one modality to another, e.g. reconstructing IMU and FTM readings purely from camera bounding boxes. Second, an association module finds identity matches between the camera and phone domains, where the translated modality is then matched with the observed data from the same modality. In contrast to existing works, our proposed approach can associate identities in multi-person scenarios where all users may be performing the same activity. Extensive experiments in real-world indoor and outdoor environments demonstrate that online association on camera and phone data (IMU and FTM) achieves an average Identity Precision Accuracy (IDP) of 88.39% on a 1-to-3-second window, outperforming the state-of-the-art Vi-Fi (82.93%). A further study of modalities within the phone domain shows that FTM can improve association performance by 12.56% on average. Finally, results from our sensitivity experiments demonstrate the robustness of ViTag under different noise and environment variations.
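
As an illustration of the two-step idea above, here is a minimal sketch (in PyTorch) of a cross-modal translator and a nearest-sequence association step. The layer sizes, feature dimensions, and the L2 matching rule are illustrative assumptions, not the paper's exact X-Translator architecture or association module.

    import torch
    import torch.nn as nn

    class XTranslatorSketch(nn.Module):
        """Encode a bounding-box sequence, decode an IMU+FTM sequence."""
        def __init__(self, bbox_dim=4, imu_ftm_dim=10, hidden=64):
            super().__init__()
            self.encoder = nn.LSTM(bbox_dim, hidden, batch_first=True)
            self.decoder = nn.LSTM(hidden, hidden, batch_first=True)
            self.head = nn.Linear(hidden, imu_ftm_dim)

        def forward(self, bboxes):                 # bboxes: (batch, T, 4)
            enc_out, state = self.encoder(bboxes)  # summarize the vision track
            dec_out, _ = self.decoder(enc_out, state)
            return self.head(dec_out)              # reconstructed (batch, T, imu_ftm_dim)

    def associate(translated, observed):
        """Assign each camera track to the phone whose observed IMU+FTM
        sequence is closest (L2 distance) to the sequence translated from vision."""
        # translated: (num_tracks, T, D); observed: (num_phones, T, D)
        d = torch.cdist(translated.flatten(1), observed.flatten(1))
        return d.argmin(dim=1)                     # phone index per camera track

    # Example: 3 camera tracks vs. 3 phones over a 30-frame window.
    model = XTranslatorSketch()
    tracks = torch.randn(3, 30, 4)
    phones = torch.randn(3, 30, 10)
    print(associate(model(tracks), phones))
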
Award ID(s):
1901355 1919752 1901133
NSF-PAR ID:
10437904
Author(s) / Creator(s):
Date Published:
Journal Name:
2022 19th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON)
Page Range / eLocation ID:
19 to 27
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we present Vi-Fi, a multi-modal system that leverages a user's smartphone WiFi Fine Timing Measurements (FTM) and inertial measurement unit (IMU) sensor data to associate the user detected in camera footage with their corresponding smartphone identifier (e.g. WiFi MAC address). Our approach uses a recurrent multi-modal deep neural network that exploits FTM and IMU measurements along with the distance between user and camera (depth information) to learn affinity matrices. As a baseline method for comparison, we also present a traditional non-deep-learning approach that uses bipartite graph matching. To facilitate evaluation, we collected a multi-modal dataset that comprises camera videos with depth information (RGB-D), WiFi FTM, and IMU measurements for multiple participants in diverse real-world settings. Using association accuracy as the key metric for evaluating the fidelity of Vi-Fi in associating human users in the camera feed with their phone IDs, we show that Vi-Fi achieves between 81% (real-time) and 91% (offline) association accuracy.
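
     A minimal sketch of the bipartite-matching baseline mentioned above, assuming an affinity matrix between camera detections and phones is already available (the values below are placeholders, not the paper's learned scores):

        import numpy as np
        from scipy.optimize import linear_sum_assignment

        # Rows: camera-detected people; columns: smartphone identifiers.
        affinity = np.array([[0.9, 0.2, 0.1],
                             [0.3, 0.8, 0.2],
                             [0.1, 0.3, 0.7]])

        # The Hungarian algorithm maximizes total affinity (minimize the negation).
        rows, cols = linear_sum_assignment(-affinity)
        for r, c in zip(rows, cols):
            print(f"camera person {r} -> phone {c} (affinity {affinity[r, c]:.2f})")
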
  2. We demonstrate an application of finding target persons in surveillance video. Each visually detected participant is tagged with a smartphone ID, and the target person with the query ID is highlighted. This work is motivated by the fact that establishing associations between subjects observed in camera images and messages transmitted from their wireless devices can enable fast and reliable tagging. This is particularly helpful when target pedestrians need to be found in public surveillance footage without relying on facial recognition. The underlying system uses a multi-modal approach that leverages WiFi Fine Timing Measurements (FTM) and inertial sensor (IMU) data to associate each visually detected individual with a corresponding smartphone identifier. These smartphone measurements are combined strategically with RGB-D information from the camera to learn affinity matrices using a multi-modal deep learning network.
  3. Due to the importance of object detection in video analysis and image annotation, it is widely utilized in a number of computer vision tasks such as face recognition, autonomous vehicles, activity recognition, object tracking, and identity verification. Object detection not only involves classification and identification of objects within images, but also localizing and tracing the objects by creating bounding boxes around them and labelling them with their respective prediction scores. Here, we leverage and discuss how connected vision systems can be used to embed cameras, image processing, Edge Artificial Intelligence (AI), and data connectivity capabilities for flood label detection. We favored the engineering definition of label detection, in which a label is a sequence of discrete measurable observations obtained using a capturing device such as a web camera or smartphone. We built a Big Data service of around 1000 images (an image annotation service), including image geolocation information, from various flooding events in the Carolinas (USA), with a total of eight different object categories. Our platform has several smart AI tools and task configurations that can detect objects' edges or contours, which can be manually adjusted with a threshold setting so as to best segment the image. The tool can train on the dataset and predict labels for large-scale datasets, and can be used as an object detector to drastically reduce the amount of time spent per object, particularly for real-time image-based flood forecasting. This research is funded by the US National Science Foundation (NSF).
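
     As a rough illustration of the adjustable-threshold contour step described above, a minimal OpenCV sketch follows; the file name and threshold value are placeholders, and the platform's actual annotation and Edge AI tooling is not shown:

        import cv2

        img = cv2.imread("flood_scene.jpg")              # hypothetical input image
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

        # A user-tunable threshold controls how the scene is segmented.
        _, mask = cv2.threshold(gray, 120, 255, cv2.THRESH_BINARY)
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

        # Draw a bounding box around each detected contour.
        for c in contours:
            x, y, w, h = cv2.boundingRect(c)
            cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

        cv2.imwrite("flood_scene_labeled.jpg", img)
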
  4. Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed from the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input, and as a result the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence-to-sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a way of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
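
     A small sketch of the translation-with-cycle-consistency idea in PyTorch; the linear translators, feature sizes, and loss weight are illustrative assumptions rather than the paper's actual Seq2Seq architecture:

        import torch
        import torch.nn as nn

        fwd = nn.Linear(300, 74)   # e.g. language features -> acoustic features
        bwd = nn.Linear(74, 300)   # reverse translator used only for the cycle term
        mse = nn.MSELoss()

        def translation_loss(src, tgt, lambda_cycle=1.0):
            pred_tgt = fwd(src)              # translate source modality to target
            cycle_src = bwd(pred_tgt)        # translate back to the source modality
            # The cycle-consistency term keeps information from both modalities.
            return mse(pred_tgt, tgt) + lambda_cycle * mse(cycle_src, src)

        src = torch.randn(8, 300)  # paired training batch: source-modality features
        tgt = torch.randn(8, 74)   # target-modality features
        translation_loss(src, tgt).backward()
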
  5. Reliably identifying and authenticating smartphones is critical in our daily life since they are increasingly being used to manage sensitive data such as private messages and financial data. Recent research on hardware fingerprinting shows that each smartphone, regardless of the manufacturer or make, possesses a variety of hardware fingerprints that are unique, robust, and physically unclonable. There is a growing interest in designing and implementing hardware-rooted smartphone authentication which authenticates smartphones through verifying the hardware fingerprints of their built-in sensors. Unfortunately, previous fingerprinting methods either involve large registration overhead or suffer from fingerprint forgery attacks, rendering them infeasible in authentication systems. In this paper, we propose ABC, a real-time smartphone Authentication protocol utilizing the photo-response non-uniformity (PRNU) of the Built-in Camera. In contrast to previous works that require tens of images to build reliable PRNU features for conventional cameras, we are the first to observe that one image alone can uniquely identify a smartphone due to the unique PRNU of a smartphone image sensor. This new discovery makes the use of PRNU practical for smartphone authentication. While most existing hardware fingerprints are vulnerable to forgery attacks, ABC defeats forgery attacks by verifying a smartphone's PRNU identity through a challenge-response protocol using a visible light communication channel. A user captures two time-variant QR codes and sends the two images to a server, which verifies the identity by fingerprint and image content matching. The time-variant QR codes can also defeat replay attacks. Our experiments with 16,000 images over 40 smartphones show that ABC can efficiently authenticate user devices with an error rate of less than 0.5%.
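
     A rough sketch of PRNU-style matching with a single image, under simplifying assumptions (a Gaussian-blur denoiser and an arbitrary 0.05 correlation threshold); ABC's challenge-response protocol, time-variant QR codes, and forgery checks are not shown:

        import numpy as np
        from scipy.ndimage import gaussian_filter

        def noise_residual(image):
            """Crude PRNU estimate: the image minus a denoised version of itself."""
            return image - gaussian_filter(image, sigma=1.0)

        def same_camera(query_image, reference_fingerprint, threshold=0.05):
            r = noise_residual(query_image).ravel()
            f = reference_fingerprint.ravel()
            corr = np.corrcoef(r, f)[0, 1]   # correlation with the enrolled fingerprint
            return corr > threshold

        # Synthetic usage; a real system would enroll the fingerprint from camera images.
        rng = np.random.default_rng(0)
        fingerprint = rng.normal(size=(256, 256))
        query = gaussian_filter(rng.normal(size=(256, 256)), 1.0) + 0.1 * fingerprint
        print(same_camera(query, fingerprint))
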