  1. Holography is a promising avenue for high-quality displays without requiring bulky, complex optical systems. While recent work has demonstrated accurate hologram generation of 2D scenes, high-quality holographic projections of 3D scenes have been out of reach until now. Existing multiplane 3D holography approaches fail to model wavefronts in the presence of partial occlusion, while holographic stereogram methods have to make a fundamental tradeoff between spatial and angular resolution. In addition, existing 3D holographic display methods rely on heuristic encoding of complex amplitude into phase-only pixels, which results in holograms with severe artifacts. Fundamental limitations of the input representation, wavefront modeling, and optimization methods prohibit artifact-free 3D holographic projections in today’s displays. To lift these limitations, we introduce hogel-free holography, which optimizes for true 3D holograms, supporting both depth- and view-dependent effects for the first time. Our approach overcomes the fundamental spatio-angular resolution tradeoff typical of stereogram approaches. Moreover, it avoids heuristic encoding schemes to achieve high image fidelity over a 3D volume. We validate that the proposed method achieves a 10 dB PSNR improvement on simulated holographic reconstructions. We also validate our approach on an experimental prototype with accurate parallax and depth focus effects.
    Free, publicly-accessible full text available October 31, 2023
  2. Eye tracking has already made its way to current commercial wearable display devices, and is becoming increasingly important for virtual and augmented reality applications. However, the existing model-based eye tracking solutions are not capable of conducting very accurate gaze angle measurements, and may not be sufficient to solve challenging display problems such as pupil steering or eyebox expansion. In this paper, we argue that accurate detection and localization of the pupil in 3D space is a necessary intermediate step in model-based eye tracking. Existing methods and datasets either ignore evaluating the accuracy of 3D pupil localization or evaluate it only on synthetic data. To this end, we capture the first 3D pupil-gaze-measurement dataset using a high-precision setup with head stabilization and release it as the first benchmark dataset to evaluate both 3D pupil localization and gaze tracking methods. Furthermore, we utilize an advanced eye model to replace the commonly used oversimplified eye model. Leveraging the eye model, we propose a novel 3D pupil localization method with a deep learning-based corneal refraction correction. We demonstrate that our method outperforms the state-of-the-art works by reducing the 3D pupil localization error by 47.5% and the gaze estimation error by 18.7%. Our dataset and code can be found here: link.
    Free, publicly-accessible full text available October 1, 2023
  3. Recently, image-to-image translation (I2I) has met with great success in computer vision, but few works have paid attention to the geometric changes that occur during translation. The geometric changes are necessary to reduce the geometric gap between domains, at the cost of breaking correspondence between translated images and original ground truth. We propose a novel geometry-aware semi-supervised method to preserve this correspondence while still allowing geometric changes. The proposed method takes a synthetic image-mask pair as input and produces a corresponding real pair. We also utilize an objective function to ensure consistent geometric movement of the image and mask through the translation. Extensive experiments illustrate that our method yields an 11.23% higher mean Intersection-over-Union than current methods on the downstream eye segmentation task. The generated image has a 15.9% decrease in Fréchet Inception Distance, indicating higher image quality.
    Free, publicly-accessible full text available June 8, 2023
  4. We envision a convenient telepresence system available to users anywhere, anytime. Such a system requires displays and sensors embedded in commonly worn items such as eyeglasses, wristwatches, and shoes. To that end, we present a standalone real-time system for the dynamic 3D capture of a person, relying only on cameras embedded into a head-worn device, and on Inertial Measurement Units (IMUs) worn on the wrists and ankles. Our prototype system egocentrically reconstructs the wearer's motion via learning-based pose estimation, which fuses inputs from visual and inertial sensors that complement each other, overcoming challenges such as inconsistent limb visibility in head-worn views, as well as pose ambiguity from sparse IMUs. The estimated pose is continuously re-targeted to a prescanned surface model, resulting in a high-fidelity 3D reconstruction. We demonstrate our system by reconstructing various human body movements and show that our visual-inertial learning-based method, which runs in real time, outperforms both visual-only and inertial-only approaches. We captured an egocentric visual-inertial 3D human pose dataset, which is publicly available for training and evaluating similar methods.
  5. We present a personalized, comprehensive eye-tracking solution based on tracking higher-order Purkinje images, suited explicitly for eyeglasses-style AR and VR displays. Existing eye-tracking systems for near-eye applications are typically designed to work for an on-axis configuration and rely on pupil center and corneal reflections (PCCR) to estimate gaze with an accuracy of only about 0.5° to 1°. These are often expensive, bulky in form factor, and fail to estimate monocular accommodation, which is crucial for focus adjustment within the AR glasses. Our system independently measures the binocular vergence and monocular accommodation using higher-order Purkinje reflections from the eye, extending the PCCR-based methods. We demonstrate that these reflections are sensitive to both gaze rotation and lens accommodation and model the Purkinje images’ behavior in simulation. We also design and fabricate a user-customized eye tracker using cheap off-the-shelf cameras and LEDs. We use an end-to-end convolutional neural network (CNN) for calibrating the eye tracker for the individual user, allowing for robust and simultaneous estimation of vergence and accommodation. Experimental results show that our solution, specifically catering to individual users, outperforms state-of-the-art methods for vergence and depth estimation, achieving an accuracy of 0.3782° and 1.108 cm, respectively.
  6. Holography is perhaps the only method demonstrated so far that can achieve a wide field of view (FOV) and a compact eyeglass-style form factor for augmented reality (AR) near-eye displays (NEDs). Unfortunately, the eyebox of such NEDs is impractically small ($\lesssim$ 1 mm). In this paper, we introduce and demonstrate a design for holographic NEDs with a practical, wide eyebox of $\sim$ 10 mm and without any moving parts, based on holographic lenslets. In our design, a holographic optical element (HOE) based on a lenslet array was fabricated as the image combiner with expanded eyebox. A phase spatial light modulator (SLM) alters the phase of the incident laser light projected onto the HOE combiner such that the virtual image can be perceived at different focus distances, which can reduce the vergence-accommodation conflict (VAC). We have successfully implemented a bench-top prototype following the proposed design. The experimental results show effective eyebox expansion to a size of $\sim$ 10 mm. With further work, we hope that these design concepts can be incorporated into eyeglass-size NEDs.
  7. Human novel view synthesis aims to synthesize target views of a human subject given input images taken from one or more reference viewpoints. Despite significant advances in model-free novel view synthesis, existing methods present two major limitations when applied to complex shapes like humans. First, these methods mainly focus on simple and symmetric objects, e.g., cars and chairs, limiting their performance on fine-grained and asymmetric shapes. Second, existing methods cannot guarantee visual consistency across different adjacent views of the same object. To solve these problems, we present in this paper a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. Specifically, we introduce a novel multi-view supervision and an explicit rotational loss during the learning process, enabling the model to preserve detailed body parts and to achieve consistency between adjacent synthesized views. To show the superior performance of our approach, we present qualitative and quantitative results on the Multi-View Human Action (MVHA) dataset we collected (consisting of 3D human models animated with different Mocap sequences and captured from 54 different viewpoints), the Pose-Varying Human Model (PVHM) dataset, and ShapeNet. The qualitative and quantitative results demonstrate that our approach outperforms the state-of-the-art baselines both in per-view synthesis quality and in preserving rotational consistency and complex shapes (e.g., fine-grained details, challenging poses) across multiple adjacent views in a variety of scenarios, for both humans and rigid objects.
  8. Vedaldi, Andrea ; Bischof, Horst ; Brox, Thomas ; Frahm, Jan-Michael (Ed.)
    Novel view video synthesis aims to synthesize novel-viewpoint videos given input captures of a human performance taken from multiple reference viewpoints and over consecutive time steps. Despite great advances in model-free novel view synthesis, existing methods present three limitations when applied to complex and time-varying human performance. First, these methods (and related datasets) mainly consider simple and symmetric objects. Second, they do not enforce explicit consistency across generated views. Third, they focus on static and non-moving objects. The fine-grained details of a human subject can therefore suffer from inconsistencies when synthesized across different viewpoints or time steps. To tackle these challenges, we introduce a human-specific framework that employs a learned 3D-aware representation. Specifically, we first introduce a novel siamese network that employs a gating layer for better reconstruction of the latent volumetric representation and, consequently, final visual results. Moreover, features from consecutive time steps are shared inside the network to improve temporal consistency. Second, we introduce a novel loss to explicitly enforce consistency across generated views both in space and in time. Third, we present the Multi-View Human Action (MVHA) dataset, consisting of nearly 1200 synthetic human performances captured from 54 viewpoints. Experiments on the MVHA, Pose-Varying Human Model and ShapeNet datasets show that our method outperforms the state-of-the-art baselines both in view generation quality and spatio-temporal consistency.
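
Several of the abstracts above report results as PSNR (in dB) or as mean Intersection-over-Union. For readers unfamiliar with these metrics, the following is a minimal sketch of their textbook definitions, not the evaluation code of any listed paper. The function names `psnr` and `mean_iou`, the flat-sequence inputs, and the default peak value of 1.0 are assumptions made for illustration.

```python
import math


def psnr(reference, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    `peak` is the maximum possible pixel value (assumed 1.0 for normalized images).
    """
    if len(reference) != len(estimate):
        raise ValueError("inputs must have the same length")
    # Mean squared error over all pixels.
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(peak ** 2 / mse)


def mean_iou(target, prediction, num_classes):
    """Mean Intersection-over-Union over the classes that occur in either label map.

    `target` and `prediction` are flat sequences of integer class labels.
    """
    ious = []
    for c in range(num_classes):
        pairs = list(zip(target, prediction))
        intersection = sum(1 for t, p in pairs if t == c and p == c)
        union = sum(1 for t, p in pairs if t == c or p == c)
        if union > 0:  # skip classes absent from both maps
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else float("nan")
```

Because PSNR is logarithmic in the mean squared error, a 10 dB gain at a fixed peak value corresponds to a tenfold reduction in mean squared error.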