Egocentric and exocentric perspectives of human action differ significantly, yet overcoming this extreme viewpoint gap is critical in augmented reality and robotics. We propose VIEWPOINTROSETTA, an approach that unlocks large-scale unpaired ego and exo video data to learn clip-level viewpoint-invariant video representations. Our framework introduces (1) a diffusion-based Rosetta Stone Translator (RST), which, leveraging a moderate amount of synchronized multi-view videos, serves as a translator in feature space to decipher the alignment between unpaired ego and exo data, and (2) a dual encoder that aligns unpaired data representations through contrastive learning with RST-based synthetic feature augmentation and soft alignment. To evaluate the learned features in a standardized setting, we construct a new cross-view benchmark using Ego-Exo4D, covering cross-view retrieval, action recognition, and skill assessment tasks. Our framework demonstrates superior cross-view understanding compared to previous view-invariant learning and ego video representation learning approaches, and opens the door to bringing vast amounts of traditional third-person video to bear on the more nascent first-person setting. 
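No code accompanies the abstracts on this page; the snippet below is a minimal PyTorch sketch of the style of dual-encoder contrastive alignment described above, with a soft-target hook standing in for the paper's soft alignment. It is an illustration under stated assumptions, not the authors' implementation: all class, function, and parameter names (DualEncoder, soft_contrastive_loss, feat_dim, etc.) are hypothetical, and the RST-based synthetic feature augmentation is not reproduced.

```python
# Minimal sketch of dual-encoder contrastive alignment between ego and exo
# clip features, with optional soft (non one-hot) targets. Hypothetical names;
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualEncoder(nn.Module):
    def __init__(self, feat_dim=768, embed_dim=256):
        super().__init__()
        # One projection head per viewpoint; a video backbone is assumed to
        # have already produced clip-level features of size feat_dim.
        self.ego_head = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))
        self.exo_head = nn.Sequential(nn.Linear(feat_dim, embed_dim), nn.ReLU(), nn.Linear(embed_dim, embed_dim))

    def forward(self, ego_feats, exo_feats):
        z_ego = F.normalize(self.ego_head(ego_feats), dim=-1)
        z_exo = F.normalize(self.exo_head(exo_feats), dim=-1)
        return z_ego, z_exo

def soft_contrastive_loss(z_ego, z_exo, soft_targets, temperature=0.07):
    """Cross-entropy between ego->exo similarity logits and soft alignment targets.

    soft_targets: (B, B) rows summing to 1; the identity matrix recovers
    standard InfoNCE with in-batch negatives.
    """
    logits = z_ego @ z_exo.t() / temperature          # (B, B) cosine similarities
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_targets * log_probs).sum(dim=-1).mean()

# Toy usage with random tensors standing in for precomputed ego/exo clip features.
B, D = 8, 768
model = DualEncoder(feat_dim=D)
ego, exo = torch.randn(B, D), torch.randn(B, D)
z_ego, z_exo = model(ego, exo)
targets = torch.eye(B)                                # hard one-to-one alignment
loss = soft_contrastive_loss(z_ego, z_exo, targets)
loss.backward()
```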
                            VIVAR: learning view-invariant embedding for video action recognition
                        
                    
    
Deep learning has achieved state-of-the-art video action recognition (VAR) performance by learning action-related features from raw video. However, these models often jointly encode auxiliary view information (viewpoint and sensor properties) with the primary action features, leading to performance degradation under novel views and to security concerns, since sensor types and locations can be revealed. Here, we systematically study these shortcomings of VAR models and develop a novel approach, VIVAR, that learns view-invariant spatiotemporal action features by removing view information. In particular, we leverage contrastive learning to separate actions and jointly optimize an adversarial loss that aligns view distributions, removing auxiliary view information from the deep embedding space; unlabeled synchronous multi-view (MV) video is used to learn the view-invariant VAR system. We evaluate VIVAR on our in-house, large-scale, time-synchronized MV video dataset containing 10 actions with three angular viewpoints and sensors in diverse environments. VIVAR successfully captures view-invariant action features, improves the quality of inter- and intra-action clusters, and consistently outperforms SoTA models by 8% in accuracy. We additionally perform extensive studies with our datasets, model architectures, multiple contrastive learning formulations, and view distribution alignments to provide insights into VIVAR. We open-source our code and dataset to facilitate further research on view-invariant systems.
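The abstract above pairs a contrastive objective over actions with an adversarial objective that aligns view distributions. As a rough sketch of that style of training (not the released VIVAR code), the snippet below combines a supervised contrastive loss with a view discriminator trained through a gradient-reversal layer; all names, dimensions, and the choice of gradient reversal as the adversarial mechanism are assumptions.

```python
# Sketch: action-contrastive embedding plus adversarial view-distribution
# alignment via a gradient-reversal layer (GRL). Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        # Reverse (and scale) the gradient so the encoder learns to fool the
        # view discriminator, aligning view distributions in embedding space.
        return -ctx.lam * grad_out, None

class ViewInvariantModel(nn.Module):
    def __init__(self, feat_dim=512, embed_dim=128, num_views=3):
        super().__init__()
        self.proj = nn.Linear(feat_dim, embed_dim)          # action embedding head
        self.view_disc = nn.Linear(embed_dim, num_views)    # adversarial view classifier

    def forward(self, feats, lam=1.0):
        z = F.normalize(self.proj(feats), dim=-1)
        view_logits = self.view_disc(GradReverse.apply(z, lam))
        return z, view_logits

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: clips of the same action attract each other."""
    n = len(z)
    mask_self = torch.eye(n, dtype=torch.bool)
    sim = (z @ z.t() / temperature).masked_fill(mask_self, float('-inf'))
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos = (labels[:, None] == labels[None, :]) & ~mask_self
    denom = pos.sum(1).clamp(min=1)
    return -(log_prob.masked_fill(~pos, 0.0).sum(1) / denom).mean()

# Toy step: the same action recorded from different views shares an action label.
feats = torch.randn(12, 512)
action_labels = torch.randint(0, 4, (12,))
view_labels = torch.randint(0, 3, (12,))
model = ViewInvariantModel()
z, view_logits = model(feats, lam=0.5)
loss = supcon_loss(z, action_labels) + F.cross_entropy(view_logits, view_labels)
loss.backward()
```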
- Award ID(s): 1750936
- PAR ID: 10587414
- Editor(s): Liang, Xuefeng
- Publisher / Repository: SPIE
- Date Published:
- ISBN: 9781510689237
- Page Range / eLocation ID: 11
- Format(s): Medium: X
- Location: Kuala Lumpur, Malaysia
- Sponsoring Org: National Science Foundation
More Like this
- Vedaldi, Andrea; Bischof, Horst; Brox, Thomas; Frahm, Jan-Michael (Ed.) Novel view video synthesis aims to synthesize videos from novel viewpoints given input captures of a human performance taken from multiple reference viewpoints over consecutive time steps. Despite great advances in model-free novel view synthesis, existing methods present three limitations when applied to complex and time-varying human performances. First, these methods (and related datasets) mainly consider simple and symmetric objects. Second, they do not enforce explicit consistency across generated views. Third, they focus on static, non-moving objects. The fine-grained details of a human subject can therefore suffer from inconsistencies when synthesized across different viewpoints or time steps. To tackle these challenges, we introduce a human-specific framework that employs a learned 3D-aware representation. Specifically, we first introduce a novel siamese network that employs a gating layer for better reconstruction of the latent volumetric representation and, consequently, of the final visual results; moreover, features from consecutive time steps are shared inside the network to improve temporal consistency. Second, we introduce a novel loss to explicitly enforce consistency across generated views both in space and in time. Third, we present the Multi-View Human Action (MVHA) dataset, consisting of nearly 1,200 synthetic human performances captured from 54 viewpoints. Experiments on the MVHA, Pose-Varying Human Model, and ShapeNet datasets show that our method outperforms the state-of-the-art baselines in both view generation quality and spatio-temporal consistency.
- Human novel view synthesis aims to synthesize target views of a human subject given input images taken from one or more reference viewpoints. Despite significant advances in model-free novel view synthesis, existing methods present two major limitations when applied to complex shapes like humans. First, these methods mainly focus on simple and symmetric objects, e.g., cars and chairs, limiting their performance on fine-grained and asymmetric shapes. Second, existing methods cannot guarantee visual consistency across different adjacent views of the same object. To solve these problems, we present in this paper a learning framework for the novel view synthesis of human subjects, which explicitly enforces consistency across different generated views of the subject. Specifically, we introduce a novel multi-view supervision and an explicit rotational loss during the learning process, enabling the model to preserve detailed body parts and to achieve consistency between adjacent synthesized views. To show the superior performance of our approach, we present qualitative and quantitative results on the Multi-View Human Action (MVHA) dataset we collected (consisting of 3D human models animated with different MoCap sequences and captured from 54 different viewpoints), the Pose-Varying Human Model (PVHM) dataset, and ShapeNet. The qualitative and quantitative results demonstrate that our approach outperforms the state-of-the-art baselines both in per-view synthesis quality and in preserving rotational consistency and complex shapes (e.g., fine-grained details, challenging poses) across multiple adjacent views in a variety of scenarios, for both humans and rigid objects.
- 3D CT point clouds reconstructed from the original CT images are naturally represented in real-world coordinates. Compared with CT images, 3D CT point clouds contain geometric features that are invariant across multiple viewpoints, with irregular spatial distributions. This paper rethinks pulmonary nodule detection using CT point cloud representations. We first extract multi-view features with a sparse convolutional (SparseConv) encoder by rotating the point clouds to different angles in the world coordinate system. Then, to simultaneously learn discriminative and robust spatial features from various viewpoints, a nodule proposal optimization scheme is proposed that obtains coarse nodule regions by aggregating consistent nodule proposal predictions from the multi-view features. Last, multi-level features and semantic segmentation features extracted from a SparseConv decoder are concatenated with the multi-view features for final nodule region regression. Experiments on the benchmark dataset (LUNA16) demonstrate the feasibility of applying CT point clouds to the lung nodule detection task. Furthermore, we observe that combining multi-view predictions greatly improves the performance of the proposed framework compared to single-view prediction, while the interior texture features of nodules available in images are better suited to detecting small nodules.
- Motivation: Accurately predicting the likelihood of interaction between two objects (compound–protein sequence, user–item, author–paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multi-views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects. Results: We present a novel method, Contrastive Stratification for Interaction Prediction (CSI), to stratify (partition) a dataset in a manner that can be exploited via Contrastive Multiview Coding to learn embeddings that maximize the mutual information across congruent data views. CSI assigns a key and multiple views to each data point, where data partitions under a particular key form congruent views of the data. We showcase the effectiveness of CSI by applying it to the compound–protein sequence interaction prediction problem, a pressing problem whose solution promises to expedite drug delivery (drug–protein interaction prediction), metabolic engineering, and synthetic biology (compound–enzyme interaction prediction) applications. Comparing CSI with a baseline model that does not utilize data stratification and contrastive learning, we show gains in average precision ranging from 13.7% to 39% using compounds and sequences as keys across multiple drug–target and enzymatic datasets, and gains ranging from 16.9% to 63% using reaction features as keys across enzymatic datasets. Availability and implementation: Code and dataset available at https://github.com/HassounLab/CSI.
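For the last item above (CSI), the core idea is to treat congruent views of the same interaction as positives in a contrastive multiview coding objective. The snippet below is a loose, illustrative sketch of that objective over two precomputed feature views; it does not reproduce CSI's key-based stratification, and all encoder names and dimensions are hypothetical.

```python
# Loose sketch of contrastive multiview coding over two congruent views of the
# same interaction (e.g., a compound-derived view and a sequence-derived view).
# Hypothetical encoders and dimensions; this is not the CSI codebase.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoViewEncoder(nn.Module):
    def __init__(self, dim_a=1024, dim_b=1280, embed_dim=256):
        super().__init__()
        self.enc_a = nn.Linear(dim_a, embed_dim)   # e.g., compound feature view
        self.enc_b = nn.Linear(dim_b, embed_dim)   # e.g., protein sequence feature view

    def forward(self, a, b):
        return F.normalize(self.enc_a(a), dim=-1), F.normalize(self.enc_b(b), dim=-1)

def cmc_infonce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: congruent views of the same data point are positives,
    all other in-batch pairs are negatives."""
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(len(z_a))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage: 16 interactions, each with two precomputed feature views.
model = TwoViewEncoder()
view_a, view_b = torch.randn(16, 1024), torch.randn(16, 1280)
z_a, z_b = model(view_a, view_b)
loss = cmc_infonce(z_a, z_b)
loss.backward()
```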