We investigate exocentric-to-egocentric cross-view translation, which aims to generate a first-person (egocentric) view of an actor from a video recording that captures the actor from a third-person (exocentric) perspective. To this end, we propose a generative framework called Exo2Ego that decouples the translation process into two stages: a high-level structure transformation stage, which explicitly encourages cross-view correspondence between exocentric and egocentric views, and a diffusion-based pixel-level hallucination stage, which incorporates a hand layout prior to enhance the fidelity of the generated egocentric view. To pave the way for future advances in this field, we curate a comprehensive exo-to-ego cross-view translation benchmark consisting of a diverse collection of synchronized ego-exo tabletop-activity video pairs sourced from three public datasets: H2O, Aria Pilot, and Assembly101. The experimental results validate that Exo2Ego delivers photorealistic video results with clear hand-manipulation details and outperforms several baselines in both synthesis quality and generalization to new actions.
This content will become publicly available on December 12, 2025.

Ego-to-Exo: Interfacing Third Person Visuals from Egocentric Views in Real-time for Improved ROV Teleoperation
Underwater ROVs (Remotely Operated Vehicles) are unmanned submersibles designed for exploring and operating in the depths of the ocean. Despite the use of high-end cameras, typical teleoperation engines based on first-person (egocentric) views limit a surface operator's ability to maneuver the ROV in complex deep-water missions. In this paper, we present an interactive teleoperation interface that enhances operational capabilities through increased situational awareness. This is accomplished by (i) offering on-demand third-person (exocentric) visuals synthesized from past egocentric views, and (ii) providing enhanced peripheral information with the ROV pose augmented in real time. We achieve this by integrating a 3D geometry-based Ego-to-Exo view synthesis algorithm into a monocular SLAM system for accurate trajectory estimation. The proposed closed-form solution uses only past egocentric views from the ROV and a SLAM backbone for pose estimation, which makes it portable to existing ROV platforms. Unlike data-driven solutions, it is invariant to the application and to waterbody-specific scenes. We validate the geometric accuracy of the proposed framework through extensive experiments on 2-DOF indoor navigation and 6-DOF underwater cave exploration in challenging low-light conditions. A subjective evaluation with 15 human teleoperators further confirms the effectiveness of the integrated features for improved teleoperation. We demonstrate the benefits of dynamic Ego-to-Exo view generation and real-time pose rendering for remote ROV teleoperation by following navigation guides such as cavelines inside underwater caves. This new mode of interactive ROV teleoperation opens up promising opportunities for future research in subsea telerobotics.
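The core geometric idea, rendering a third-person cue by projecting the ROV's current position into a past egocentric camera selected from the SLAM trajectory, can be sketched in a few lines. This is an illustrative reconstruction, not the paper's closed-form solution: the pinhole intrinsics `K`, the world-to-camera pose convention, the function names, and the fixed-lag choice of past frame are all assumptions made for this example.

```python
import numpy as np

def project_point(K, T_world_cam, p_world):
    """Project a 3D world point into a pinhole camera whose pose in the
    world frame is the 4x4 matrix T_world_cam. Returns pixel (u, v) or
    None if the point lies behind the camera."""
    T_cam_world = np.linalg.inv(T_world_cam)
    p_cam = T_cam_world @ np.append(p_world, 1.0)
    if p_cam[2] <= 0:                      # behind the image plane
        return None
    uv = K @ (p_cam[:3] / p_cam[2])        # perspective division, then intrinsics
    return uv[:2]

def exo_view_from_history(K, trajectory, lag=30):
    """Exocentric cue from egocentric history (illustrative sketch):
    overlay the ROV's current position onto a past egocentric frame,
    using only the SLAM pose trajectory. A fixed lag stands in for a
    proper visibility/baseline-based frame selection."""
    T_now, T_past = trajectory[-1], trajectory[-1 - lag]
    p_now = T_now[:3, 3]                   # current ROV position (world frame)
    return project_point(K, T_past, p_now)
```

Drawing a rendered ROV model at the returned pixel location on the stored past frame yields the third-person visual; a real system would pick the past pose by visibility and baseline rather than a fixed lag.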
- PAR ID: 10614050
- Publisher / Repository: International Symposium of Robotics Research (ISRR)
- Date Published:
- Format(s): Medium: X
- Location: Los Angeles, CA, USA
- Sponsoring Org: National Science Foundation
More Like this
- Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical in augmented reality and robotics. We propose VIEWPOINTROSETTA, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignment between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment tasks. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and ego video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting.
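Contrastive alignment of a dual encoder, as in the second component above, is commonly implemented as a symmetric InfoNCE loss over paired features. The NumPy sketch below shows that generic loss only; it omits VIEWPOINTROSETTA's RST-based feature augmentation and soft alignment, and all names here are illustrative.

```python
import numpy as np

def _xent_diag(logits):
    """Cross-entropy with targets on the diagonal (log-softmax per row)."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def info_nce(ego_feats, exo_feats, temperature=0.07):
    """Symmetric InfoNCE aligning paired ego/exo clip features: row i of
    each matrix is a positive pair; the other rows in the batch serve as
    in-batch negatives (generic sketch, not the paper's exact objective)."""
    ego = ego_feats / np.linalg.norm(ego_feats, axis=1, keepdims=True)
    exo = exo_feats / np.linalg.norm(exo_feats, axis=1, keepdims=True)
    logits = ego @ exo.T / temperature     # cosine similarities as logits
    return 0.5 * (_xent_diag(logits) + _xent_diag(logits.T))
```

Minimizing this loss pulls each ego clip's feature toward its paired exo feature while pushing it away from the rest of the batch, which is what produces viewpoint-invariant representations.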
- We present 4Diff, a 3D-aware diffusion model addressing the exo-to-ego viewpoint translation task: generating first-person (egocentric) view images from the corresponding third-person (exocentric) images. Building on the diffusion model's ability to generate photorealistic images, we propose a transformer-based diffusion model that incorporates geometry priors through two mechanisms: (i) egocentric point cloud rasterization and (ii) 3D-aware rotary cross-attention. Egocentric point cloud rasterization converts the input exocentric image into an egocentric layout, which is subsequently used by a diffusion image transformer. As a component of the diffusion transformer's denoiser block, the 3D-aware rotary cross-attention further incorporates 3D information and semantic features from the source exocentric view. Our 4Diff achieves state-of-the-art results on the challenging and diverse Ego-Exo4D multiview dataset and exhibits robust generalization to novel environments not encountered during training. Our code, processed data, and pretrained models are publicly available at https://klauscc.github.io/4diff.
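The egocentric point cloud rasterization step lends itself to a simple sketch: lift exocentric pixels to 3D with per-pixel depth, transform them into the egocentric camera, and splat them with a z-buffer. The code below is a minimal NumPy illustration under assumed pinhole models and a known exo-to-ego transform; it is not the 4Diff implementation, and the naive per-point loop is for clarity only.

```python
import numpy as np

def rasterize_to_ego(exo_img, exo_depth, K, T_exo_to_ego, out_hw):
    """Build an egocentric layout from an exocentric view (illustrative):
    unproject every exo pixel with its depth, move the resulting point
    cloud into the ego camera frame, and splat with a z-buffer."""
    H, W = exo_depth.shape
    ys, xs = np.mgrid[0:H, 0:W]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3)
    # Unproject: ray direction from inverse intrinsics, scaled by depth.
    pts_exo = (np.linalg.inv(K) @ pix.T).T * exo_depth.reshape(-1, 1)
    pts_h = np.hstack([pts_exo, np.ones((len(pts_exo), 1))])
    pts_ego = (T_exo_to_ego @ pts_h.T).T[:, :3]
    layout = np.zeros((*out_hw, exo_img.shape[-1]))
    zbuf = np.full(out_hw, np.inf)
    uvz = (K @ pts_ego.T).T                # (u*z, v*z, z) per point
    valid = uvz[:, 2] > 0                  # keep points in front of the camera
    colors = exo_img.reshape(-1, exo_img.shape[-1])[valid]
    for color, (u, v, z) in zip(colors, uvz[valid]):
        ui, vi = int(round(u / z)), int(round(v / z))
        if 0 <= vi < out_hw[0] and 0 <= ui < out_hw[1] and z < zbuf[vi, ui]:
            zbuf[vi, ui] = z               # nearest point wins the pixel
            layout[vi, ui] = color
    return layout
```

With an identity transform and shared intrinsics the layout reproduces the input image, which makes the geometry easy to sanity-check before plugging the layout into a downstream generator.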
- We present EgoRenderer, a system for rendering full-body neural avatars of a person captured by a wearable, egocentric fisheye camera mounted on a cap or a VR headset. Our system renders photorealistic novel views of the actor and her motion from arbitrary virtual camera locations. Rendering full-body avatars from such egocentric images comes with unique challenges due to the top-down view and large distortions. We tackle these challenges by decomposing the rendering process into several steps, including texture synthesis, pose construction, and neural image translation. For texture synthesis, we propose Ego-DPNet, a neural network that infers dense correspondences between the input fisheye images and an underlying parametric body model, and extracts textures from the egocentric inputs. In addition, to encode dynamic appearances, our approach also learns an implicit texture stack that captures detailed appearance variation across poses and viewpoints. For correct pose generation, we first estimate body pose from the egocentric view using a parametric model. We then synthesize an external free-viewpoint pose image by projecting the parametric model to the user-specified target viewpoint. We next combine the target pose image and the textures into a combined feature image, which is transformed into the output color image using a neural image translation network. Experimental evaluations show that EgoRenderer is capable of generating realistic free-viewpoint avatars of a person wearing an egocentric camera. Comparisons to several baselines demonstrate the advantages of our approach.
- ROV operations are mainly performed via a traditional control kiosk and limited data-feedback methods, such as joysticks and camera-view displays on a surface vessel. This traditional setup requires significant personnel-on-board (POB) time and imposes high requirements for personnel training. This paper proposes a virtual reality (VR) based haptic-visual ROV teleoperation system that can substantially simplify ROV teleoperation and enhance the remote operator's situational awareness. This study leverages recent developments in Mixed Reality (MR) technologies, sensory augmentation, sensing technologies, and closed-loop control to visualize and render complex underwater environmental data in an intuitive and immersive way. The raw sensor data are processed with physics-engine systems and rendered as a high-fidelity digital twin model in game engines. Certain features are visualized and displayed via the VR headset, whereas others are manifested as haptic and tactile cues via our haptic feedback systems. We applied a simulation approach to test the developed system. With our developed system, a high-fidelity subsea environment is reconstructed based on the sensor data collected from an ROV, including bathymetric, hydrodynamic, visual, and vehicle navigational measurements. Specifically, the vehicle is equipped with a navigation sensor system for real-time state estimation, an acoustic Doppler current profiler for far-field flow measurement, and a bio-inspired artificial lateral-line hydrodynamic sensor system for near-field small-scale hydrodynamics. Optimized game-engine rendering algorithms then visualize key environmental features as augmented user-interface elements in the VR headset, such as color-coded vectors, to indicate the environmental impact on the performance and function of the ROV. In addition, augmented environmental feedback such as hydrodynamic forces is translated into patterned haptic stimuli via a haptic suit to indicate drift-inducing flows in the near field. A pilot case study was performed to verify the feasibility and effectiveness of the system design in a series of simulated ROV operation tasks. ROVs are widely used in subsea exploration and intervention tasks, playing a critical role in offshore inspection, installation, and maintenance activities. The innovative ROV teleoperation feedback and control system will lower the barrier for ROV pilot jobs.