NSF PAR Search | NSF Public Access Repository

FoundHand: Large-Scale Domain-Specific Learning for Controllable Hand Image Generation

Chen, Kefan; Min, Chaerin; Zhang, Linguang; Hampali, Shreyas; Keskin, Cem; Sridhar, Srinath (June 2025, CVPR 2025)

FoundHand is trained on our large-scale FoundHand-10M dataset which contains automatically extracted 2D keypoints and segmentation mask annotations (top left). FoundHand is formulated as a 2D pose-conditioned image-to-image diffusion model that enables precise hand pose and camera viewpoint control (top right). Optionally, we can condition the generation with a reference image to preserve its style (top right). Our model demonstrates exceptional in-the-wild generalization across hand-centric applications and has core capabilities such as gesture transfer, domain transfer, and novel view synthesis (middle row). This endows FoundHand with zero-shot applications to fix malformed hand images and synthesize coherent hand and hand-object videos, without explicitly giving object cues (bottom row).
Free, publicly-accessible full text available June 16, 2026
GigaHands: A Massive Annotated Dataset of Bimanual Hand Activities

Fu, Rao; Zhang, Dingxi; Jiang, Alex; Fu, Wanjia; Fund, Austin; Ritchie, Daniel; Sridhar, Srinath (June 2025, CVPR 2025)

Understanding bimanual human hand activities is a critical problem in AI and robotics. We cannot build large models of bimanual activities because existing datasets lack the scale, coverage of diverse hand activities, and detailed annotations. We introduce GigaHands, a massive annotated dataset capturing 34 hours of bimanual hand activities from 56 subjects and 417 objects, totaling 14k motion clips derived from 183 million frames paired with 84k text annotations. Our markerless capture setup and data acquisition protocol enable fully automatic 3D hand and object estimation while minimizing the effort required for text annotation. The scale and diversity of GigaHands enable broad applications, including text-driven action synthesis, hand motion captioning, and dynamic radiance field reconstruction.
Free, publicly-accessible full text available June 16, 2026
GeoDiffuser: Geometry-Based Image Editing with Diffusion Models

https://doi.org/10.1109/WACV61041.2025.00056

Sajnani, Rahul; Vanbaar, Jeroen; Min, Jie; Katyal, Kapil; Sridhar, Srinath (February 2025, IEEE)

Free, publicly-accessible full text available February 26, 2026
GeoDiffuser: Geometry-Based Image Editing with Diffusion Models

Sajnani, Rahul; Vanbaar, Jeroen; Min, Jie; Katyal, Kapil; Sridhar, Srinath (May 2024, arXiv 2024)

The success of image generative models has enabled us to build methods that can edit images based on text or other user input. However, these methods are bespoke, imprecise, require additional information, or are limited to only 2D image edits. We present GeoDiffuser, a zero-shot optimization-based method that unifies common 2D and 3D image-based object editing capabilities into a single method. Our key insight is to view image editing operations as geometric transformations. We show that these transformations can be directly incorporated into the attention layers in diffusion models to implicitly perform editing operations. Our training-free optimization method uses an objective function that seeks to preserve object style but generate plausible images, for instance with accurate lighting and shadows. It also inpaints disoccluded parts of the image where the object was originally located. Given a natural image and user input, we segment the foreground object using SAM and estimate a corresponding transform which is used by our optimization approach for editing. GeoDiffuser can perform common 2D and 3D edits like object translation, 3D rotation, and removal. We present quantitative results, including a perceptual study, that show how our approach is better than existing methods.
DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields

Lu, Cheng-You; Zhou, Peisen; Xing, Angela; Pokhariya, Chandradeep; Dey, Arnab; Shah, Ishaan; Mavidipalli, Rugved; Hu, Dylan; Comport, Andrew; Chen, Kefan; et al (June 2024, CVPR 2024)

Advances in neural fields are enabling high-fidelity capture of shape and appearance of dynamic 3D scenes. However, these capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitations with DiVa-360, a real-world 360° dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences for a total of 17.4M frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges on long-duration neural field capture.
HyP-NeRF: Learning Improved NeRF Priors using a HyperNetwork

Sen, Bipasha; Singh, Gaurav; Agarwal, Aditya; Agaram, Rohith; MadhavaKrishna, K; Sridhar, Srinath (December 2023, NeurIPS 2023)

Neural Radiance Fields (NeRF) have become an increasingly popular representation to capture high-quality appearance and shape of scenes and objects. However, learning generalizable NeRF priors over categories of scenes or objects has been challenging due to the high dimensionality of network weight space. To address the limitations of existing work on generalization and multi-view consistency, and to improve quality, we propose HyP-NeRF, a latent conditioning method for learning generalizable category-level NeRF priors using hypernetworks. Rather than using hypernetworks to estimate only the weights of a NeRF, we estimate both the weights and the multi-resolution hash encodings, resulting in significant quality gains. To improve quality even further, we incorporate a denoise and finetune strategy that denoises images rendered from NeRFs estimated by the hypernetwork and finetunes it while retaining multiview consistency. These improvements enable us to use HyP-NeRF as a generalizable prior for multiple downstream tasks including NeRF reconstruction from single-view or cluttered scenes and text-to-NeRF. We provide qualitative comparisons and evaluate HyP-NeRF on three tasks: generalization, compression, and retrieval, demonstrating our state-of-the-art results.
Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition

https://doi.org/10.1109/iccv51070.2023.01992

Liang, Yiqing; Laidlaw, Eliot; Meyerowitz, Alexander; Sridhar, Srinath; Tompkin, James (October 2023, International Conference on Computer Vision)
Strata-NeRF: Neural Radiance Fields for Stratified Scenes

https://doi.org/10.1109/ICCV51070.2023.01614

Dhiman, Ankit; Srinath, R; Rangwani, Harsh; Parihar, Rishubh; Boregowda, Lokesh; Sridhar, Srinath; VenkateshBabu, R (October 2023, ICCV 2023)

Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photorealistic novel views with high fidelity. However, most proposed settings concentrate on modelling a single object or a single level of a scene. In the real world, we may capture a scene at multiple levels, resulting in a layered capture. For example, tourists usually capture a monument’s exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences. However, most existing techniques struggle to model such scenes. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations which allow sudden changes in scene structure. We evaluate the effectiveness of our approach on a multi-layered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate 10k dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.
Unsupervised Kinematic Motion Detection for Part-segmented 3D Shape Collections

https://doi.org/10.1145/3528233.3530742

Xu, Xianghao; Ruan, Yifan; Sridhar, Srinath; Ritchie, Daniel (August 2022, ACM SIGGRAPH 2022)

Full Text Available
HuMoR: 3D Human Motion Model for Robust Pose Estimation

https://doi.org/10.1109/ICCV48922.2021.01129

Rempe, Davis; Birdal, Tolga; Hertzmann, Aaron; Yang, Jimei; Sridhar, Srinath; Guibas, Leonidas J. (October 2021, 2021 IEEE/CVF International Conference on Computer Vision (ICCV))

We introduce HuMoR: a 3D Human Motion Model for Robust Estimation of temporal pose and shape. Though substantial progress has been made in estimating 3D human motion and shape from dynamic observations, recovering plausible pose sequences in the presence of noise and occlusions remains a challenge. For this purpose, we propose an expressive generative model in the form of a conditional variational autoencoder, which learns a distribution of the change in pose at each step of a motion sequence. Furthermore, we introduce a flexible optimization-based approach that leverages HuMoR as a motion prior to robustly estimate plausible pose and shape from ambiguous observations. Through extensive evaluations, we demonstrate that our model generalizes to diverse motions and body shapes after training on a large motion capture dataset, and enables motion reconstruction from multiple input modalities including 3D keypoints and RGB(-D) videos. See the project page at geometry.stanford.edu/projects/humor.
Full Text Available