Over the last decade, facial landmark tracking and 3D reconstruction have gained considerable attention due to their numerous applications such as human-computer interactions, facial expression analysis, and emotion recognition, etc. Traditional approaches require users to be confined to a particular location and face a camera under constrained recording conditions (e.g., without occlusions and under good lighting conditions). This highly restricted setting prevents them from being deployed in many application scenarios involving human motions. In this paper, we propose the first single-earpiece lightweight biosensing system, BioFace-3D, that can unobtrusively, continuously, and reliably sense the entire facial movements, track 2D facial landmarks, and further render 3D facial animations. Our single-earpiece biosensing system takes advantage of the cross-modal transfer learning model to transfer the knowledge embodied in a high-grade visual facial landmark detection model to the low-grade biosignal domain. After training, our BioFace-3D can directly perform continuous 3D facial reconstruction from the biosignals, without any visual input. Without requiring a camera positioned in front of the user, this paradigm shift from visual sensing to biosensing would introduce new opportunities in many emerging mobile and IoT applications. Extensive experiments involving 16 participants under various settings demonstrate that BioFace-3D can accurately track 53 major facial landmarks with only 1.85 mm average error and 3.38\% normalized mean error, which is comparable with most state-of-the-art camera-based solutions. The rendered 3D facial animations, which are in consistency with the real human facial movements, also validate the system's capability in continuous 3D facial reconstruction.
more »
« less
I am an Earphone and I can Hear my Users Face: Facial Landmark Tracking using Smart Earphones
This paper presents EARFace , a system that shows the feasibility of tracking facial landmarks for 3D facial reconstruction using in-ear acoustic sensors embedded within smart earphones. This enables a number of applications in the areas of facial expression tracking, user-interfaces, AR/VR applications, affective computing, accessibility, etc. While conventional vision-based solutions break down under poor lighting, occlusions, and also suffer from privacy concerns, earphone platforms are robust to ambient conditions, while being privacy-preserving. In contrast to prior work on earable platforms that perform outer-ear sensing for facial motion tracking, EARFace shows the feasibility of completely in-ear sensing with a natural earphone form-factor, thus enhancing the comfort levels of wearing. The core intuition exploited by EARFace is that the shape of the ear canal changes due to the movement of facial muscles during facial motion. EARFace tracks the changes in shape of the ear canal by measuring ultrasonic channel frequency response (CFR) of the inner ear, ultimately resulting in tracking of the facial motion. A transformer based machine learning (ML) model is designed to exploit spectral and temporal relationships in the ultrasonic CFR data to predict the facial landmarks of the user with an accuracy of 1.83 mm. Using these predicted landmarks, a 3D graphical model of the face that replicates the precise facial motion of the user is then reconstructed. Domain adaptation is further performed by adapting the weights of layers using a group-wise and differential learning rate. This decreases the training overhead in EARFace . The transformer based ML model runs on smartphone devices with a processing latency of 13 ms and an overall low power consumption profile. Finally, usability studies indicate higher levels of comforts of wearing EARFace ’s earphone platform in comparison with alternative form-factors.
more »
« less
- PAR ID:
- 10464629
- Date Published:
- Journal Name:
- ACM Transactions on Internet of Things
- ISSN:
- 2691-1914
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
This paper presents ssLOTR (self-supervised learning on the rings), a system that shows the feasibility of designing self-supervised learning based techniques for 3D finger motion tracking using a custom-designed wearable inertial measurement unit (IMU) sensor with a minimal overhead of labeled training data. Ubiquitous finger motion tracking enables a number of applications in augmented and virtual reality, sign language recognition, rehabilitation healthcare, sports analytics, etc. However, unlike vision, there are no large-scale training datasets for developing robust machine learning (ML) models on wearable devices. ssLOTR designs ML models based on data augmentation and self-supervised learning to first extract efficient representations from raw IMU data without the need for any training labels. The extracted representations are further trained with small-scale labeled training data. In comparison to fully supervised learning, we show that only 15% of labeled training data is sufficient with self-supervised learning to achieve similar accuracy. Our sensor device is designed using a two-layer printed circuit board (PCB) to minimize the footprint and uses a combination of Polylactic acid (PLA) and Thermoplastic polyurethane (TPU) as housing materials for sturdiness and flexibility. It incorporates a system-on-chip (SoC) microcontroller with integrated WiFi/Bluetooth Low Energy (BLE) modules for real-time wireless communication, portability, and ubiquity. In contrast to gloves, our device is worn like rings on fingers, and therefore, does not impede dexterous finger motion. Extensive evaluation with 12 users depicts a 3D joint angle tracking accuracy of 9.07° (joint position accuracy of 6.55mm) with robustness to natural variation in sensor positions, wrist motion, etc, with low overhead in latency and power consumption on embedded platforms.more » « less
-
Ear wearables (earables) are emerging platforms that are broadly adopted in various applications. There is an increasing demand for robust earables authentication because of the growing amount of sensitive information and the IoT devices that the earable could access. Traditional authentication methods become less feasible due to the limited input interface of earables. Nevertheless, the rich head-related sensing capabilities of earables can be exploited to capture human biometrics. In this paper, we propose EarSlide, an earable biometric authentication system utilizing the advanced sensing capacities of earables and the distinctive features of acoustic fingerprints when users slide their fingers on the face. It utilizes the inward-facing microphone of the earables and the face-ear channel of the ear canal to reliably capture the acoustic fingerprint. In particular, we study the theory of friction sound and categorize the characteristics of the acoustic fingerprints into three representative classes, pattern-class, ridge-groove-class, and coupling-class. Different from traditional fingerprint authentication only utilizes 2D patterns, we incorporate the 3D information in acoustic fingerprint and indirectly sense the fingerprint for authentication. We then design representative sliding gestures that carry rich information about the acoustic fingerprint while being easy to perform. It then extracts multi-class acoustic fingerprint features to reflect the inherent acoustic fingerprint characteristic for authentication. We also adopt an adaptable authentication model and a user behavior mitigation strategy to effectively authenticate legit users from adversaries. The key advantages of EarSlide are that it is resistant to spoofing attacks and its wide acceptability. Our evaluation of EarSlide in diverse real-world environments with intervals over one year shows that EarSlide achieves an average balanced accuracy rate of 98.37% with only one sliding gesture.more » « less
-
This paper presents iSpyU, a system that shows the feasibility of recognition of natural speech content played on a phone during conference calls (Skype, Zoom, etc) using a fusion of motion sensors such as accelerometer and gyroscope. While microphones require permissions from the user to be accessible by an app developer, the motion sensors are zero-permission sensors, thus accessible by a developer without alerting the user. This allows a malicious app to potentially eavesdrop on sensitive speech content played by the user's phone. In designing the attack, iSpyU tackles a number of technical challenges including: (i) Low sampling rate of motion sensors (500 Hz in comparison to 44 kHz for a microphone). (ii) Lack of availability of large-scale training datasets to train models for Automatic Speech Recognition (ASR) with motion sensors. iSpyU systematically addresses these challenges by a combination of techniques in synthetic training data generation, ASR modeling, and domain adaptation. Extensive measurement studies on modern smartphones show a word level accuracy of 53.3 - 59.9% over a dictionary of 2000-10000 words, and a character level accuracy of 70.0 - 74.8%. We believe such levels of accuracy poses a significant threat when viewed from a privacy perspective.more » « less
-
Emerging Virtual Reality (VR) displays with embedded eye trackers are currently becoming a commodity hardware (e.g., HTC Vive Pro Eye). Eye-tracking data can be utilized for several purposes, including gaze monitoring, privacy protection, and user authentication/identification. Identifying users is an integral part of many applications due to security and privacy concerns. In this paper, we explore methods and eye-tracking features that can be used to identify users. Prior VR researchers explored machine learning on motion-based data (such as body motion, head tracking, eye tracking, and hand tracking data) to identify users. Such systems usually require an explicit VR task and many features to train the machine learning model for user identification. We propose a system to identify users utilizing minimal eye-gaze-based features without designing any identification-specific tasks. We collected gaze data from an educational VR application and tested our system with two machine learning (ML) models, random forest (RF) and k-nearest-neighbors (kNN), and two deep learning (DL) models: convolutional neural networks (CNN) and long short-term memory (LSTM). Our results show that ML and DL models could identify users with over 98% accuracy with only six simple eye-gaze features. We discuss our results, their implications on security and privacy, and the limitations of our work.more » « less
An official website of the United States government

