Perceiving and manipulating 3D articulated objects (e.g., cabinets, doors) in human environments is an important yet challenging task for future home-assistant robots. The space of 3D articulated objects is exceptionally rich in their myriad semantic categories, diverse shape geometry, and complicated part functionality. Previous works mostly abstract kinematic structure with estimated joint parameters and part poses as the visual representations for manipulating 3D articulated objects. In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point that the perception system outputs more actionable guidance than kinematic structure estimation, by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals. We design an interaction-for-perception framework VAT-Mart to learn such actionable visual representations by simultaneously training a curiosity-driven reinforcement learning policy exploring diverse interaction trajectories and a perception module summarizing and generalizing the explored knowledge for pointwise predictions among diverse shapes. Experiments prove the effectiveness of the proposed approach using the large-scale PartNet-Mobility dataset in SAPIEN environment and show promising generalization capabilities to novel test shapes, unseen object categories, and real-world data.
more »
« less
Act the Part: Learning Interaction Strategies for Articulated Object Part Discovery
People often use physical intuition when manipulating articulated objects, irrespective of object semantics. Motivated by this observation, we identify an important embodied task where an agent must play with objects to recover their parts. To this end, we introduce Act the Part (AtP) to learn how to interact with articulated objects to discover and segment their pieces. By coupling action selection and motion segmentation, AtP is able to isolate structures to make perceptual part recovery possible without semantic labels. Our experiments show AtP learns efficient strategies for part discovery, can generalize to unseen categories, and is capable of conditional reasoning for the task. Although trained in simulation, we show convincing transfer to real world data with no fine-tuning. A summery video, interactive demo, and code will be available at atp.cs.columbia.edu.
more »
« less
- Award ID(s):
- 2037101
- PAR ID:
- 10311126
- Date Published:
- Journal Name:
- Proceedings
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other jointed objects. Such articulated objects can take an infinite number of possible poses, as a point in a potentially high-dimensional continuous space. A robot must perceive this continuous pose in order to manipulate the object to a desired pose. This problem of perception and manipulation of articulated objects remains a challenge due to its high dimensionality and multi-modal uncertainty. In this paper, we propose a factored approach to estimate the poses of articulated objects using an efficient non-parametric belief propagation algorithm. We consider inputs as geometrical models with articulation constraints, and observed 3D sensor data. The proposed framework produces object-part pose beliefs iteratively. The problem is formulated as a pairwise Markov Random Field (MRF) where each hidden node (continuous pose variable) models an observed object-part's pose and each edge denotes an articulation constraint between a pair of parts. We propose articulated pose estimation by a Pull Message Passing algorithm for Nonparametric Belief Propagation (PMPNBP) and evaluate its convergence properties over scenes with articulated objects.more » « less
-
Robots working in human environments often encounter a wide range of articulated objects, such as tools, cabinets, and other jointed objects. Such articulated objects can take an infinite number of possible poses, as a point in a potentially high-dimensional continuous space. A robot must perceive this continuous pose to manipulate the object to a desired pose. This problem of perception and manipulation of articulated objects remains a challenge due to its high dimensionality and multimodal uncertainty. Here, we describe a factored approach to estimate the poses of articulated objects using an efficient approach to nonparametric belief propagation. We consider inputs as geometrical models with articulation constraints and observed RGBD (red, green, blue, and depth) sensor data. The described framework produces object-part pose beliefs iteratively. The problem is formulated as a pairwise Markov random field (MRF), where each hidden node (continuous pose variable) is an observed object-part’s pose and the edges denote the articulation constraints between the parts. We describe articulated pose estimation by a “pull” message passing algorithm for nonparametric belief propagation (PMPNBP) and evaluate its convergence properties over scenes with articulated objects. Robot experiments are provided to demonstrate the necessity of maintaining beliefs to perform goal-driven manipulation tasks.more » « less
-
Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average of 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models -- without any task-specific training, fine-tuning, dataset collection, or annotation.more » « less
-
Abbott, Derek (Ed.)Abstract Human vision, thought, and planning involve parsing and representing objects and scenes using structured representations based on part-whole hierarchies. Computer vision and machine learning researchers have recently sought to emulate this capability using neural networks, but a generative model formulation has been lacking. Generative models that leverage compositionality, recursion, and part-whole hierarchies are thought to underlie human concept learning and the ability to construct and represent flexible mental concepts. We introduce Recursive Neural Programs (RNPs), a neural generative model that addresses the part-whole hierarchy learning problem by modeling images as hierarchical trees of probabilistic sensory-motor programs. These programs recursively reuse learned sensory-motor primitives to model an image within different spatial reference frames, enabling hierarchical composition of objects from parts and implementing a grammar for images. We show that RNPs can learn part-whole hierarchies for a variety of image datasets, allowing rich compositionality and intuitive parts-based explanations of objects. Our model also suggests a cognitive framework for understanding how human brains can potentially learn and represent concepts in terms of recursively defined primitives and their relations with each other.more » « less