

This content will become publicly available on August 10, 2024

Title: I am an Earphone and I can Hear my Users Face: Facial Landmark Tracking using Smart Earphones
This paper presents EARFace, a system that shows the feasibility of tracking facial landmarks for 3D facial reconstruction using in-ear acoustic sensors embedded within smart earphones. This enables a number of applications in facial expression tracking, user interfaces, AR/VR, affective computing, accessibility, etc. While conventional vision-based solutions break down under poor lighting and occlusions and also suffer from privacy concerns, earphone platforms are robust to ambient conditions while being privacy-preserving. In contrast to prior work on earable platforms that perform outer-ear sensing for facial motion tracking, EARFace shows the feasibility of completely in-ear sensing with a natural earphone form factor, thus making the device more comfortable to wear. The core intuition exploited by EARFace is that the shape of the ear canal changes as facial muscles move during facial motion. EARFace tracks these changes in ear canal shape by measuring the ultrasonic channel frequency response (CFR) of the inner ear, which in turn allows it to track facial motion. A transformer-based machine learning (ML) model is designed to exploit spectral and temporal relationships in the ultrasonic CFR data to predict the facial landmarks of the user with an accuracy of 1.83 mm. Using these predicted landmarks, a 3D graphical model of the face that replicates the precise facial motion of the user is then reconstructed. Domain adaptation is further performed by adapting the weights of layers using a group-wise and differential learning rate, which decreases the training overhead in EARFace. The transformer-based ML model runs on smartphone devices with a processing latency of 13 ms and an overall low power consumption profile. Finally, usability studies indicate higher levels of wearing comfort for EARFace’s earphone platform in comparison with alternative form factors.
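To make the CFR measurement mentioned in the abstract more concrete, the following is a minimal sketch of how a channel frequency response could be estimated from a known probe signal and its recording. It is not the authors' pipeline: the 48 kHz sample rate, the 18 to 22 kHz probe band, the linear chirp, and the function names `make_chirp` and `estimate_cfr` are all assumptions chosen for illustration.

```python
import numpy as np


def make_chirp(fs, duration, f0, f1):
    """Linear chirp sweeping from f0 to f1 Hz (hypothetical probe signal)."""
    n = int(round(fs * duration))
    t = np.arange(n) / fs
    k = (f1 - f0) / duration  # sweep rate in Hz per second
    return np.sin(2 * np.pi * (f0 * t + 0.5 * k * t ** 2))


def estimate_cfr(tx, rx, fs, band=(18_000.0, 22_000.0), eps=1e-12):
    """Estimate H(f) = Y(f) X*(f) / |X(f)|^2 inside the probe band.

    tx is the transmitted probe, rx the microphone recording of equal length.
    The band limits are an assumption, not values taken from the paper.
    """
    n = len(tx)
    X = np.fft.rfft(tx)
    Y = np.fft.rfft(rx, n=n)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    mask = (freqs >= band[0]) & (freqs <= band[1])
    # Regularized division avoids blow-up where the probe has little energy.
    H = Y[mask] * np.conj(X[mask]) / (np.abs(X[mask]) ** 2 + eps)
    return freqs[mask], H


if __name__ == "__main__":
    fs = 48_000.0
    tx = make_chirp(fs, duration=0.1, f0=18_000.0, f1=22_000.0)
    # Toy channel: attenuate by half and circularly delay by 7 samples.
    rx = 0.5 * np.roll(tx, 7)
    freqs, H = estimate_cfr(tx, rx, fs)
    print(freqs[0], freqs[-1], np.abs(H).mean())
```

In a system of this kind, a sequence of such complex CFR vectors over time would form the spectral-temporal input that a learned model consumes; the toy channel here only checks that the estimator recovers a known gain.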
Award ID(s):
2211018 2046972
NSF-PAR ID:
10464629
Journal Name:
ACM Transactions on Internet of Things
ISSN:
2691-1914
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Over the last decade, facial landmark tracking and 3D reconstruction have gained considerable attention due to their numerous applications, such as human-computer interaction, facial expression analysis, and emotion recognition. Traditional approaches require users to be confined to a particular location and face a camera under constrained recording conditions (e.g., without occlusions and under good lighting conditions). This highly restricted setting prevents them from being deployed in many application scenarios involving human motion. In this paper, we propose the first single-earpiece lightweight biosensing system, BioFace-3D, that can unobtrusively, continuously, and reliably sense the entire facial movements, track 2D facial landmarks, and further render 3D facial animations. Our single-earpiece biosensing system takes advantage of a cross-modal transfer learning model to transfer the knowledge embodied in a high-grade visual facial landmark detection model to the low-grade biosignal domain. After training, BioFace-3D can directly perform continuous 3D facial reconstruction from the biosignals, without any visual input. Without requiring a camera positioned in front of the user, this paradigm shift from visual sensing to biosensing would introduce new opportunities in many emerging mobile and IoT applications. Extensive experiments involving 16 participants under various settings demonstrate that BioFace-3D can accurately track 53 major facial landmarks with only 1.85 mm average error and 3.38% normalized mean error, which is comparable with most state-of-the-art camera-based solutions. The rendered 3D facial animations, which are consistent with the real human facial movements, also validate the system's capability in continuous 3D facial reconstruction.
  2. This paper presents ssLOTR (self-supervised learning on the rings), a system that shows the feasibility of designing self-supervised learning based techniques for 3D finger motion tracking using a custom-designed wearable inertial measurement unit (IMU) sensor with a minimal overhead of labeled training data. Ubiquitous finger motion tracking enables a number of applications in augmented and virtual reality, sign language recognition, rehabilitation healthcare, sports analytics, etc. However, unlike vision, there are no large-scale training datasets for developing robust machine learning (ML) models on wearable devices. ssLOTR designs ML models based on data augmentation and self-supervised learning to first extract efficient representations from raw IMU data without the need for any training labels. The extracted representations are further trained with small-scale labeled training data. In comparison to fully supervised learning, we show that only 15% of labeled training data is sufficient with self-supervised learning to achieve similar accuracy. Our sensor device is designed using a two-layer printed circuit board (PCB) to minimize the footprint and uses a combination of polylactic acid (PLA) and thermoplastic polyurethane (TPU) as housing materials for sturdiness and flexibility. It incorporates a system-on-chip (SoC) microcontroller with integrated WiFi/Bluetooth Low Energy (BLE) modules for real-time wireless communication, portability, and ubiquity. In contrast to gloves, our device is worn like rings on fingers and therefore does not impede dexterous finger motion. Extensive evaluation with 12 users shows a 3D joint angle tracking accuracy of 9.07° (joint position accuracy of 6.55 mm) with robustness to natural variation in sensor positions, wrist motion, etc., and with low overhead in latency and power consumption on embedded platforms.
  3. We present a System for Processing In-situ Bio-signal Data for Emotion Recognition and Sensing (SPIDERS), a low-cost, wireless, glasses-based platform for continuous in-situ monitoring of a user's facial expressions (apparent emotions) and real emotions. We present algorithms to provide four core functions (eye shape and eyebrow movements, pupillometry, zygomaticus muscle movements, and head movements), using the bio-signals acquired from three non-contact sensors (IR camera, proximity sensor, IMU). SPIDERS distinguishes between different classes of apparent and real emotion states based on the aforementioned four bio-signals. We prototype advanced functionalities including facial expression detection and real emotion classification, with a landmark- and optical-flow-based facial expression detector that leverages changes in a user's eyebrows and eye shapes to achieve up to 83.87% accuracy, as well as a pupillometry-based real emotion classifier with higher accuracy than other low-cost wearable platforms that use sensors requiring skin contact. SPIDERS costs less than $20 to assemble and can run continuously for up to 9 hours before recharging. We demonstrate that SPIDERS is a truly wireless and portable platform that has the capability to impact a wide range of applications where knowledge of the user's emotional state is critical.
  4. Identifying people in photographs is a critical task in a wide variety of domains, from national security [7] to journalism [14] to human rights investigations [1]. The task is also fundamentally complex and challenging. With the world population at 7.6 billion and growing, the candidate pool is large. Studies of human face recognition ability show that the average person incorrectly identifies two people as similar 20–30% of the time, and trained police detectives do not perform significantly better [11]. Computer vision-based face recognition tools have gained considerable ground and are now widely available commercially, but comparisons to human performance show mixed results at best [2,10,16]. Automated face recognition techniques, while powerful, also have constraints that may be impractical for many real-world contexts. For example, face recognition systems tend to suffer when the target image or reference images have poor quality or resolution, as blemishes or discolorations may be incorrectly recognized as false positives for facial landmarks. Additionally, most face recognition systems ignore some salient facial features, like scars or other skin characteristics, as well as distinctive non-facial features, like ear shape or hair or facial hair styles. This project investigates how we can overcome these limitations to support person identification tasks. By adjusting confidence thresholds, users of face recognition can generally expect high recall (few false negatives) at the cost of low precision (many false positives). Therefore, we focus our work on the “last mile” of person identification, i.e., helping a user find the correct match among a large set of similar-looking candidates suggested by face recognition. Our approach leverages the powerful capabilities of the human vision system and collaborative sensemaking via crowdsourcing to augment the complementary strengths of automatic face recognition.
The result is a novel technology pipeline combining collective intelligence and computer vision. We scope this project to focus on identifying soldiers in photos from the American Civil War era (1861–1865). An estimated 4,000,000 soldiers fought in the war, and most were photographed at least once, due to decreasing costs, the increasing robustness of the format, and the critical events separating friends and family [17]. Over 150 years later, the identities of most of these portraits have been lost, but as museums and archives increasingly digitize and publish their collections online, the pool of reference photos and information has never been more accessible. Historians, genealogists, and collectors work tirelessly to connect names with faces, using largely manual identification methods [3,9]. Identifying people in historical photos is important for preserving material culture [9], correcting the historical record [13], and recognizing contributions of marginalized groups [4], among other reasons.
  5. mmWave signals form a critical component of 5G and next-generation wireless networks, and they are also being increasingly considered for sensing the environment around us to enable ubiquitous IoT applications. In this context, this paper leverages the properties of mmWave signals for tracking 3D finger motion for interactive IoT applications. While conventional vision-based solutions break down under poor lighting and occlusions and also suffer from privacy concerns, mmWave signals work under typical occlusions and non-line-of-sight conditions while being privacy-preserving. In contrast to prior work on mmWave sensing that focuses on predefined gesture classification, this work performs continuous 3D finger motion tracking. Towards this end, we first observe via simulations and experiments that, because of the small size of fingers coupled with specular reflections, fingers do not yield stable mmWave reflections. However, we make an interesting observation that focusing on the forearm instead of the fingers can provide stable reflections for 3D finger motion tracking. Muscles that activate the fingers extend through the forearm, and their motion manifests as vibrations on the forearm. By analyzing the variation in phases of reflected mmWave signals from the forearm, this paper designs mm4Arm, a system that tracks 3D finger motion. Nontrivial challenges arise due to the high-dimensional search space, complex vibration patterns, diversity across users, hardware noise, etc. mm4Arm exploits anatomical constraints in finger motions and fuses them with machine learning architectures based on encoder-decoder and ResNets to enable accurate tracking. A systematic performance evaluation with 10 users demonstrates a median error of 5.73° (location error of 4.07 mm) with robustness to multipath and natural variation in hand position/orientation. The accuracy is also consistent under non-line-of-sight conditions and clothing that might occlude the forearm.
mm4Arm runs on smartphones with a latency of 19 ms and low energy overhead. 
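Several of the abstracts above report landmark tracking accuracy as an average error in millimetres and, in one case, as a normalized mean error (NME). The sketch below shows one common way such metrics are computed. The abstracts do not say which normalization is used, so dividing by the distance between two reference landmarks is an assumption, as are the function names and the array layout `(frames, landmarks, dims)`.

```python
import numpy as np


def mean_landmark_error(pred, truth):
    """Mean Euclidean distance between predicted and true landmarks.

    Both arrays have shape (frames, landmarks, dims); the result is in the
    same unit as the coordinates (e.g. mm).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return float(np.linalg.norm(pred - truth, axis=-1).mean())


def normalized_mean_error(pred, truth, ref_a, ref_b):
    """Per-frame mean error divided by a per-frame reference distance.

    ref_a and ref_b index two ground-truth landmarks whose distance is the
    normalizer (an assumed convention, e.g. two eye corners).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    per_point = np.linalg.norm(pred - truth, axis=-1)  # (frames, landmarks)
    ref = np.linalg.norm(truth[:, ref_a] - truth[:, ref_b], axis=-1)  # (frames,)
    return float((per_point.mean(axis=1) / ref).mean())


if __name__ == "__main__":
    # One frame, three 2D landmarks; every prediction is off by a (3, 4) shift.
    truth = np.array([[[0.0, 0.0], [10.0, 0.0], [5.0, 5.0]]])
    pred = truth + np.array([3.0, 4.0])
    print(mean_landmark_error(pred, truth))
    print(normalized_mean_error(pred, truth, ref_a=0, ref_b=1))
```

Reporting both values is useful because the absolute error depends on face size, while the normalized value can be compared across users.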