In this paper, we present ViTag to associate user identities across multimodal data, particularly those obtained from cameras and smartphones. ViTag associates a sequence of vision tracker generated bounding boxes with Inertial Measurement Unit (IMU) data and Wi-Fi Fine Time Measurements (FTM) from smartphones. We formulate the problem as association by sequence to sequence (seq2seq) translation. In this two-step process, our system first performs cross-modal translation using a multimodal LSTM encoder-decoder network (X-Translator) that translates one modality to another, e.g. reconstructing IMU and FTM readings purely from camera bounding boxes. Second, an association module finds identity matches between camera and phone domains, where the translated modality is then matched with the observed data from the same modality. In contrast to existing works, our proposed approach can associate identities in multi-person scenarios where all users may be performing the same activity. Extensive experiments in real-world indoor and outdoor environments demonstrate that online association on camera and phone data (IMU and FTM) achieves an average Identity Precision Accuracy (IDP) of 88.39% on a 1 to 3 seconds window, outperforming the state-of-the-art Vi-Fi (82.93%). Further study on modalities within the phone domain shows the FTM can improve association performance by 12.56% on average. Finally, results from our sensitivity experiments demonstrate the robustness of ViTag under different noise and environment variations.
more »
« less
ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements
Tracking subjects in videos is one of the most widely used functions in camera-based IoT applications such as security surveillance, smart city traffic safety enhancement, vehicle to pedestrian communication and so on. In computer vision domain, tracking is usually achieved by first detecting subjects, then associating detected bounding boxes across video frames. Typically, frames are transmitted to a remote site for processing, incurring high latency and network costs. To address this, we propose ViFiT, a transformerbased model that reconstructs vision bounding box trajectories from phone data (IMU and Fine Time Measurements). It leverages a transformer’s ability of better modeling long-term time series data. ViFiT is evaluated on Vi-Fi Dataset, a large-scale multimodal dataset in 5 diverse real world scenes, including indoor and outdoor environments. Results demonstrate that ViFiT outperforms the state-of-the-art approach for cross-modal reconstruction in LSTM Encoder-Decoder architecture X-Translator and achieves a high frame reduction rate as 97.76% with IMU and Wi-Fi data.
more »
« less
- Award ID(s):
- 1901355
- PAR ID:
- 10545840
- Publisher / Repository:
- ACM
- Date Published:
- ISBN:
- 9798400703645
- Page Range / eLocation ID:
- 13 to 18
- Subject(s) / Keyword(s):
- IoT smart city vehicle to pedestrian communication computer vision
- Format(s):
- Medium: X
- Location:
- Madrid Spain
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
With the increasing reliance on small Unmanned Aerial Systems (sUAS) for Emergency Response Scenarios, such as Search and Rescue, the integration of computer vision capabilities has become a key factor in mission success. Nevertheless, computer vision performance for detecting humans severely degrades when shifting from ground to aerial views. Several aerial datasets have been created to mitigate this problem, however, none of them has specifically addressed the issue of occlusion, a critical component in Emergency Response Scenarios. Natural, Occluded, Multi-scale Aerial Dataset (NOMAD) presents a benchmark for human detection under occluded aerial views, with five different aerial distances and rich imagery variance. NOMAD is composed of 100 different Actors, all performing sequences of walking, laying and hiding. It includes 42,825 frames, extracted from 5.4k resolution videos, and manually annotated with a bounding box and a label describing 10 different visibility levels, categorized according to the percentage of the human body visible inside the bounding box. This allows computer vision models to be evaluated on their detection performance across different ranges of occlusion. NOMAD is designed to improve the effectiveness of aerial search and rescue and to enhance collaboration between sUAS and humans, by providing a new benchmark dataset for human detection under occluded aerial views.more » « less
-
Today, video cameras are ubiquitously deployed. These cameras collect, stream, store, and analyze video footage for a variety of use cases, ranging from surveillance, retail analytics, architectural engineering, and more. At the same time, many citizens are becoming weary of the amount of personal data captured, along with the algorithms and datasets used to process video pipelines. This work investigates how users can opt-out of such pipelines by explicitly providing consent to be recorded. An ideal system should obfuscate or otherwise cleanse non-consenting user data, ideally before a user even enters the video processing pipeline itself. We present a system, called Consent-Box, that enables obfuscation of users without using complex or personally-identifying vision techniques. Instead, a user's location on a video frame is estimated via Wi-Fi localization of a user's mobile device. This estimation allows us to remove individuals from frames before those frames enter complex vision pipelines.more » « less
-
This paper presents ssLOTR (self-supervised learning on the rings), a system that shows the feasibility of designing self-supervised learning based techniques for 3D finger motion tracking using a custom-designed wearable inertial measurement unit (IMU) sensor with a minimal overhead of labeled training data. Ubiquitous finger motion tracking enables a number of applications in augmented and virtual reality, sign language recognition, rehabilitation healthcare, sports analytics, etc. However, unlike vision, there are no large-scale training datasets for developing robust machine learning (ML) models on wearable devices. ssLOTR designs ML models based on data augmentation and self-supervised learning to first extract efficient representations from raw IMU data without the need for any training labels. The extracted representations are further trained with small-scale labeled training data. In comparison to fully supervised learning, we show that only 15% of labeled training data is sufficient with self-supervised learning to achieve similar accuracy. Our sensor device is designed using a two-layer printed circuit board (PCB) to minimize the footprint and uses a combination of Polylactic acid (PLA) and Thermoplastic polyurethane (TPU) as housing materials for sturdiness and flexibility. It incorporates a system-on-chip (SoC) microcontroller with integrated WiFi/Bluetooth Low Energy (BLE) modules for real-time wireless communication, portability, and ubiquity. In contrast to gloves, our device is worn like rings on fingers, and therefore, does not impede dexterous finger motion. Extensive evaluation with 12 users depicts a 3D joint angle tracking accuracy of 9.07° (joint position accuracy of 6.55mm) with robustness to natural variation in sensor positions, wrist motion, etc, with low overhead in latency and power consumption on embedded platforms.more » « less
-
Inertial measurement units (IMUs) are an alternative to traditional optical motion capture systems that allow for data collection outside the lab and continuous monitoring for daily activities. In this study, a non-linear least squares optimization was used to convert IMU measurements into joint kinematics. The optimization calculates joint angles simultaneously over all time frames by optimizing B-spline nodes, without integrating any IMU measurements. This approach enables an accurate solution that works well with noisy experimental IMU data since integration errors are eliminated.more » « less
An official website of the United States government

