
This content will become publicly available on June 11, 2023

Title: EyeCoD: eye tracking system acceleration via flatcam-based algorithm & accelerator co-design
Eye tracking has become an essential human-machine interaction modality for providing immersive experiences in numerous virtual and augmented reality (VR/AR) applications that demand high throughput (e.g., 240 FPS), a small form factor, and enhanced visual privacy. However, existing eye tracking systems are still limited by (1) a large form factor, largely due to the bulky lens-based cameras they adopt; (2) the high communication cost between the camera and the backend processor; and (3) potentially weak visual privacy, all of which prohibit their more extensive application. To this end, we propose, develop, and validate a lensless FlatCam-based eye tracking algorithm and accelerator co-design framework dubbed EyeCoD, which enables eye tracking systems with a much reduced form factor and boosted system efficiency without sacrificing tracking accuracy, paving the way for next-generation eye tracking solutions. On the system level, we advocate the use of lensless FlatCams instead of lens-based cameras to meet the small form-factor need of mobile eye tracking systems, which also leaves room for a dedicated sensing-processor co-design to reduce the required camera-processor communication latency. On the algorithm level, EyeCoD integrates a predict-then-focus pipeline that first predicts the region of interest (ROI) via segmentation and then focuses only on the ROI to estimate gaze directions, greatly reducing redundant computations and data movements. On the hardware level, we further develop a dedicated accelerator that (1) integrates a novel workload orchestration between the aforementioned segmentation and gaze estimation models, (2) leverages intra-channel reuse opportunities for depth-wise layers, (3) utilizes input feature-wise partition to save activation memory size, and (4) develops a sequential-write-parallel-read input buffer to alleviate the bandwidth requirement for the activation global buffer.
On-silicon measurements and extensive experiments validate that EyeCoD consistently reduces both the communication and computation costs, leading to overall system speedups of 10.95×, 3.21×, and 12.85× over CPUs, GPUs, and a prior-art eye tracking processor called CIS-GEP, respectively, while maintaining the tracking accuracy. Codes are available at
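The predict-then-focus idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names are hypothetical, a simple intensity threshold stands in for the segmentation model, and a darkness-weighted centroid stands in for the gaze estimation model. It is not the EyeCoD implementation; it only shows how restricting the second stage to the ROI shrinks the data that stage has to touch.

```python
import numpy as np


def segment_eye(frame, threshold=0.3):
    """Placeholder segmentation (hypothetical): dark pixels count as 'pupil'."""
    return frame < threshold


def roi_from_mask(mask, margin=4):
    """Bounding box (top, bottom, left, right) around the mask, padded and clipped."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    if rows.size == 0:
        # Nothing segmented: fall back to the full frame.
        return 0, mask.shape[0], 0, mask.shape[1]
    top = max(int(rows[0]) - margin, 0)
    bottom = min(int(rows[-1]) + 1 + margin, mask.shape[0])
    left = max(int(cols[0]) - margin, 0)
    right = min(int(cols[-1]) + 1 + margin, mask.shape[1])
    return top, bottom, left, right


def estimate_gaze(crop):
    """Placeholder estimator (hypothetical): darkness-weighted centroid of the crop."""
    weights = 1.0 - crop
    total = weights.sum()
    if total == 0:
        # Uniformly bright crop: return its geometric centre.
        return (crop.shape[0] - 1) / 2.0, (crop.shape[1] - 1) / 2.0
    ys, xs = np.mgrid[0:crop.shape[0], 0:crop.shape[1]]
    return float((ys * weights).sum() / total), float((xs * weights).sum() / total)


def predict_then_focus(frame):
    """Stage 1 predicts the ROI; stage 2 runs only on the cropped ROI."""
    top, bottom, left, right = roi_from_mask(segment_eye(frame))
    crop = frame[top:bottom, left:right]
    cy, cx = estimate_gaze(crop)
    roi_fraction = crop.size / frame.size  # share of pixels stage 2 processes
    return (top + cy, left + cx), (top, bottom, left, right), roi_fraction


# Synthetic 64x64 frame: bright background with a dark 10x10 'pupil'.
frame = np.ones((64, 64))
frame[20:30, 40:50] = 0.0
gaze_point, roi, roi_fraction = predict_then_focus(frame)
print(gaze_point, roi, round(roi_fraction, 3))
```

In this toy frame the second stage sees an 18×18 crop instead of the full 64×64 image, which is the kind of reduction in computation and data movement the pipeline is designed to exploit.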
Award ID(s):
1934767 1937592
Publication Date:
Journal Name:
ISCA '22: Proceedings of the 49th Annual International Symposium on Computer Architecture
Page Range or eLocation-ID:
610 to 622
Sponsoring Org:
National Science Foundation
More Like this
  1. We present a first-of-its-kind ultra-compact intelligent camera system, dubbed i-FlatCam, comprising a lensless camera and a computational (Comp.) chip. It features (1) a predict-then-focus eye tracking pipeline for boosted efficiency without compromising accuracy, (2) a unified compression scheme for single-chip processing and an improved frame rate in frames per second (FPS), and (3) a dedicated intra-channel reuse design for depth-wise convolutional layers (DW-CONV) to increase utilization. i-FlatCam demonstrates the first eye tracking pipeline with a lensless camera and achieves 3.16 degrees of accuracy, 253 FPS, 91.49 µJ/Frame, and a 6.7 mm × 8.9 mm × 1.2 mm camera form factor, paving the way for next-generation Augmented Reality (AR) and Virtual Reality (VR) devices.
  2. We present a personalized, comprehensive eye-tracking solution based on tracking higher-order Purkinje images, suited explicitly for eyeglasses-style AR and VR displays. Existing eye-tracking systems for near-eye applications are typically designed to work for an on-axis configuration and rely on pupil center and corneal reflections (PCCR) to estimate gaze with an accuracy of only about 0.5° to 1°. These are often expensive, bulky in form factor, and fail to estimate monocular accommodation, which is crucial for focus adjustment within the AR glasses. Our system independently measures the binocular vergence and monocular accommodation using higher-order Purkinje reflections from the eye, extending the PCCR-based methods. We demonstrate that these reflections are sensitive to both gaze rotation and lens accommodation and model the Purkinje images’ behavior in simulation. We also design and fabricate a user-customized eye tracker using cheap off-the-shelf cameras and LEDs. We use an end-to-end convolutional neural network (CNN) for calibrating the eye tracker for the individual user, allowing for robust and simultaneous estimation of vergence and accommodation. Experimental results show that our solution, specifically catering to individual users, outperforms state-of-the-art methods for vergence and depth estimation, achieving an accuracy of 0.3782° and 1.108 cm respectively.
  3. We describe the design and performance of a high-fidelity wearable head-, body-, and eye-tracking system that offers significant improvement over previous such devices. This device’s sensors include a binocular eye tracker, an RGB-D scene camera, a high-frame-rate scene camera, and two visual odometry sensors, for a total of ten cameras, which we synchronize and record from with a data rate of over 700 MB/s. The sensors are operated by a mini-PC optimized for fast data collection, and powered by a small battery pack. The device records a subject’s eye, head, and body positions, simultaneously with RGB and depth data from the subject’s visual environment, measured with high spatial and temporal resolution. The headset weighs only 1.4 kg, and the backpack with batteries 3.9 kg. The device can be comfortably worn by the subject, allowing a high degree of mobility. Together, this system overcomes many limitations of previous such systems, allowing high-fidelity characterization of the dynamics of natural vision.
  4. Early intervention to address developmental disability in infants has the potential to promote improved outcomes in neurodevelopmental structure and function [1]. Researchers are starting to explore Socially Assistive Robotics (SAR) as a tool for delivering early interventions that are synergistic with and enhance human-administered therapy. For SAR to be effective, the robot must be able to consistently attract the attention of the infant in order to engage the infant in a desired activity. This work presents the analysis of eye gaze tracking data from five 6-8 month old infants interacting with a Nao robot that kicked its leg as a contingent reward for infant leg movement. We evaluate a Bayesian model of low-level surprise on video data from the infants’ head-mounted camera and on the timing of robot behaviors as a predictor of infant visual attention. The results demonstrate that over 67% of infant gaze locations were in areas the model evaluated to be more surprising than average. We also present an initial exploration using surprise to predict the extent to which the robot attracts infant visual attention during specific intervals in the study. This work is the first to validate the surprise model on infants; our results indicate the potential for using surprise to inform robot behaviors that attract infant attention during SAR interactions.
  5. Lensless cameras are ultra-thin imaging systems that replace the lens with a thin passive optical mask and computation. Passive mask-based lensless cameras encode depth information in their measurements for a certain depth range. Early works have shown that this encoded depth can be used to perform 3D reconstruction of close-range scenes. However, these approaches for 3D reconstructions are typically optimization based and require strong hand-crafted priors and hundreds of iterations to reconstruct. Moreover, the reconstructions suffer from low resolution, noise, and artifacts. In this work, we propose FlatNet3D, a feed-forward deep network that can estimate both depth and intensity from a single lensless capture. FlatNet3D is an end-to-end trainable deep network that directly reconstructs depth and intensity from a lensless measurement using an efficient physics-based 3D mapping stage and a fully convolutional network. Our algorithm is fast and produces high-quality results, which we validate using both simulated and real scenes captured using PhlatCam.
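
As a rough illustration of what "replacing the lens with a mask and computation" means in item 5, the sketch below simulates a lensless measurement and recovers the scene with regularized least squares. It rests on several assumptions not taken from the abstracts above: a separable linear measurement model, random Gaussian matrices as stand-ins for calibrated mask transfer matrices, a flat 2D scene, and a plain Tikhonov solver. It is neither FlatNet3D (a learned network) nor any specific optimization-based method; all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 16, 32  # scene is n x n, sensor measurement is m x m (assumed sizes)

# Synthetic scene: a bright rectangle on a dark background.
scene = np.zeros((n, n))
scene[4:12, 6:10] = 1.0

# Random stand-ins for the mask's left/right transfer matrices (assumption).
phi_l = rng.standard_normal((m, n))
phi_r = rng.standard_normal((m, n))

# Assumed separable model: the sensor records a multiplexed, noisy image.
measurement = phi_l @ scene @ phi_r.T + 0.01 * rng.standard_normal((m, m))


def tikhonov_reconstruct(measurement, phi_l, phi_r, lam=1e-3):
    """Solve min_x ||A x - y||^2 + lam ||x||^2 with A = kron(phi_l, phi_r)."""
    # For row-major flattening, vec(L X R^T) = kron(L, R) vec(X).
    a = np.kron(phi_l, phi_r)
    y = measurement.ravel()
    x = np.linalg.solve(a.T @ a + lam * np.eye(a.shape[1]), a.T @ y)
    return x.reshape(phi_l.shape[1], phi_r.shape[1])


reconstruction = tikhonov_reconstruct(measurement, phi_l, phi_r)
print("max abs error:", float(np.max(np.abs(reconstruction - scene))))
```

The raw measurement looks nothing like the scene, so the image only exists after the computational step. Forming the full Kronecker matrix is practical only at toy sizes like this one.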