Title: GARField: Group Anything with Radiance Fields
Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene -- should the wheels of an excavator be considered separate or part of the whole? We present Group Anything with Radiance Fields (GARField), an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this, we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field, a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy, using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects, objects, and various subparts. GARField inherently represents multi-view consistent groupings and produces higher-fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. See the project website at https://www.garfield.studio/
Award ID(s): 2235013
PAR ID: 10579938
Publisher / Repository: CVPR
Sponsoring Org: National Science Foundation
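To make the abstract's central idea concrete, here is a minimal sketch of what a scale-conditioned affinity field could look like. This is an illustration of the general technique, not GARField's actual implementation: the network shape, feature dimension, and the pull/push contrastive loss are all assumptions made for the example.

```python
# Illustrative sketch (not GARField's code): a field mapping a 3D point plus a
# physical scale to a unit-norm embedding, so the same point can belong to
# different groups at different scales. All sizes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaleConditionedAffinityField(nn.Module):
    def __init__(self, feat_dim: int = 64, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )

    def forward(self, xyz: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
        # xyz: (N, 3) world coordinates; scale: (N, 1) physical group size.
        feat = self.mlp(torch.cat([xyz, scale], dim=-1))
        return F.normalize(feat, dim=-1)  # affinity = dot product of embeddings

def pull_push_loss(f_a, f_b, same_mask, margin: float = 0.5):
    # Pull embeddings together when two sampled points fall inside the same 2D
    # mask at this scale; push them apart (up to a margin) when they do not.
    affinity = (f_a * f_b).sum(-1)
    pull = (1.0 - affinity) * same_mask.float()
    push = F.relu(affinity - margin) * (1.0 - same_mask.float())
    return (pull + push).mean()
```

Under this sketch, recovering groups at a chosen scale reduces to clustering the embeddings of sampled 3D points, and sweeping the scale input is one plausible way to realize the coarse-to-fine tree the abstract describes.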
More Like this
1. Human visual grouping processes consolidate independent visual objects into grouped visual features on the basis of shared characteristics; these visual features can themselves be grouped, resulting in a hierarchical representation of visual grouping information. This "grouping hierarchy" promotes efficient attention in support of goal-directed behavior, but improper grouping of elements of a visual scene can also result in critical behavioral errors. Understanding how visual object/feature characteristics such as size and form influence the perception of hierarchical visual groups can further theories of human visual grouping behavior and contribute to effective interface design. In the present study, participants provided free-response groupings of a set of stimuli that contained consistent structural relationships between a limited set of visual features. These grouping patterns were evaluated for relationships between specific characteristics of the constituent visual features and the distribution of features across levels of the indicated grouping hierarchy. We observed that while the relative size of the visual features differentiated groupings across levels of the grouping hierarchy, the form of visual objects and features was more likely to distinguish separate groups within a particular level of the hierarchy. These consistent relationships between visual feature characteristics and placement within a grouping hierarchy can be leveraged to advance computational theories of human visual grouping behavior, which can in turn be applied to effective design for interfaces such as voter ballots.
2. Neural Radiance Field (NeRF) approaches learn the underlying 3D representation of a scene and generate photorealistic novel views with high fidelity. However, most proposed settings concentrate on modelling a single object or a single level of a scene. In the real world, we may capture a scene at multiple levels, resulting in a layered capture; for example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve immersive experiences, yet most existing techniques struggle to model such scenes. We propose Strata-NeRF, a single neural radiance field that implicitly captures a scene with multiple levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latent representations, which allow sudden changes in scene structure. We evaluate the effectiveness of our approach on a multi-layered synthetic dataset comprising diverse scenes and further validate its generalization on the real-world RealEstate10K dataset. We find that Strata-NeRF effectively captures stratified scenes, minimizes artifacts, and synthesizes high-fidelity views compared to existing approaches.
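As a concrete illustration of the conditioning idea above, here is a minimal sketch of a radiance-field MLP conditioned on a vector-quantized level code. It is an assumption-laden toy, not Strata-NeRF's architecture: the codebook size, the straight-through estimator, and the wiring are generic VQ conventions chosen for the example.

```python
# Illustrative sketch (not Strata-NeRF's code): condition a NeRF-style MLP on a
# discrete level code so the field can change abruptly between scene layers.
import torch
import torch.nn as nn

class VQLevelCode(nn.Module):
    # Snap a continuous level embedding to its nearest codebook entry, giving
    # the discrete "jumps" in scene structure that the abstract describes.
    def __init__(self, num_levels: int = 8, dim: int = 32):
        super().__init__()
        self.codebook = nn.Embedding(num_levels, dim)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(z, self.codebook.weight)   # (B, num_levels) distances
        zq = self.codebook(d.argmin(dim=-1))       # nearest codebook vector
        return z + (zq - z).detach()               # straight-through gradient

class ConditionedNeRF(nn.Module):
    def __init__(self, code_dim: int = 32, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3 + code_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # RGB + density
        )

    def forward(self, xyz: torch.Tensor, level_code: torch.Tensor):
        return self.net(torch.cat([xyz, level_code], dim=-1))
```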
3. We present the first event-based learning approach for motion segmentation in indoor scenes and the first event-based dataset -- EV-IMO -- which includes accurate pixel-wise motion masks, egomotion, and ground-truth depth. Our approach is based on an efficient implementation of the SfM learning pipeline using a low-parameter neural network architecture on event data. In addition to camera egomotion and a dense depth map, the network estimates independently moving object segmentation at the pixel level and computes per-object 3D translational velocities of moving objects. We also train a shallow network with just 40k parameters, which is able to compute depth and egomotion. Our EV-IMO dataset features 32 minutes of indoor recording with up to 3 fast-moving objects in the camera's field of view. The objects and the camera are tracked using a VICON motion capture system. By 3D-scanning the room and the objects, we obtain ground-truth depth maps and pixel-wise object masks. We then train and evaluate our learning pipeline on EV-IMO and demonstrate that it is well suited for scene-constrained robotics applications.
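For flavor, here is a minimal sketch of a small network with the interface the abstract describes: dense depth plus 6-DoF egomotion from a stacked event representation. The input encoding, channel counts, and head layouts are assumptions made for illustration, not the EV-IMO pipeline itself.

```python
# Illustrative sketch (not the EV-IMO network): a compact encoder with a dense
# depth head and a global egomotion head, operating on stacked event channels.
import torch
import torch.nn as nn

class ShallowEventNet(nn.Module):
    def __init__(self, in_ch: int = 4):  # e.g., per-polarity counts/timestamps
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.depth_head = nn.Sequential(  # upsample back to input resolution
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )
        self.pose_head = nn.Sequential(   # 3 rotation + 3 translation params
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 6),
        )

    def forward(self, events: torch.Tensor):
        f = self.encoder(events)
        return self.depth_head(f), self.pose_head(f)
```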
4. The Segment Anything Model (SAM) was released as a foundation model for image segmentation. The promptable segmentation model was trained on over 1 billion masks from 11M licensed and privacy-respecting images. The model supports zero-shot image segmentation with various segmentation prompts (e.g., points, boxes, masks). This makes SAM attractive for medical image analysis, especially for digital pathology, where training data are scarce. In this study, we evaluate the zero-shot segmentation performance of the SAM model on representative segmentation tasks on whole slide imaging (WSI), including (1) tumor segmentation, (2) non-tumor tissue segmentation, and (3) cell nuclei segmentation. Core results: the zero-shot SAM model achieves remarkable segmentation performance for large connected objects. However, it does not consistently achieve satisfying performance for dense instance object segmentation, even with 20 prompts (clicks/boxes) on each image. We also summarize the identified limitations for digital pathology: (1) image resolution, (2) multiple scales, (3) prompt selection, and (4) model fine-tuning. In the future, few-shot fine-tuning with images from downstream pathological segmentation tasks might help the model achieve better performance in dense object segmentation.
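The point- and box-prompting workflow evaluated above can be reproduced with the public segment-anything package. The calls below are the package's standard entry points; the checkpoint path, the stand-in image, and the prompt coordinates are placeholders one would replace with real WSI patches and annotations.

```python
# Zero-shot SAM prompting with the segment-anything package; placeholders
# (checkpoint path, blank image, coordinates) stand in for real WSI data.
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = np.zeros((1024, 1024, 3), dtype=np.uint8)  # stand-in for a WSI patch
predictor.set_image(image)

# Point prompt: one foreground click (label 1) on the target object.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 512]]),
    point_labels=np.array([1]),
    multimask_output=True,  # returns candidate masks at several granularities
)

# Box prompt: an xyxy bounding box around the object.
masks_box, scores_box, _ = predictor.predict(
    box=np.array([400, 400, 640, 640]),
    multimask_output=False,
)
```

Evaluating dense instance segmentation, as the study does, amounts to looping such prompts over many objects per image, which is where the abstract reports SAM's zero-shot performance degrading.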