Title: Learning to Infer Kinematic Hierarchies for Novel Object Instances.
Manipulating an articulated object requires perceiving its kinematic hierarchy: its parts, how each can move, and how those motions are coupled. Previous work has explored perception for kinematics, but none infers a complete kinematic hierarchy on never-before-seen object instances without relying on a schema or template. We present a novel perception system that achieves this goal. Our system infers the moving parts of an object and the kinematic couplings that relate them. To infer parts, it uses a point cloud instance segmentation neural network; to infer kinematic hierarchies, it uses a graph neural network to predict the existence, direction, and type of the edges (i.e., joints) that relate the inferred parts. We train these networks using simulated scans of synthetic 3D models. We evaluate our system on simulated scans of 3D objects, and we demonstrate a proof-of-concept use of our system to drive real-world robotic manipulation.
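As a rough illustration of the second stage, the PyTorch sketch below shows one way a graph network could take per-part features (e.g. pooled from the segmentation backbone) and score every ordered pair of parts for the existence, direction, and type of a connecting joint. This is a minimal sketch, not the authors' implementation; the class name, feature dimensions, single message-passing round, and number of joint types are assumptions.

```python
import torch
import torch.nn as nn

class JointPredictionGNN(nn.Module):
    """Toy second-stage network: given one feature vector per inferred part,
    score every ordered (parent, child) pair for joint existence and type."""

    def __init__(self, feat_dim=128, hidden=256, num_joint_types=3):
        super().__init__()
        self.message = nn.Sequential(            # pairwise message function
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, feat_dim),
        )
        self.edge_exist = nn.Linear(2 * feat_dim, 1)               # joint yes/no
        self.edge_type = nn.Linear(2 * feat_dim, num_joint_types)  # e.g. revolute / prismatic / fixed

    def _pairs(self, feats):
        # Features for every ordered (sender, receiver) part pair.
        P = feats.shape[0]
        src = feats.unsqueeze(1).expand(P, P, -1)
        dst = feats.unsqueeze(0).expand(P, P, -1)
        return torch.cat([src, dst], dim=-1)     # (P, P, 2*feat_dim)

    def forward(self, part_feats):
        # part_feats: (P, feat_dim), one row per inferred part.
        msgs = self.message(self._pairs(part_feats)).mean(dim=0)  # aggregate incoming messages
        updated = part_feats + msgs                                # one message-passing round
        pair = self._pairs(updated)
        return self.edge_exist(pair).squeeze(-1), self.edge_type(pair)

part_feats = torch.randn(5, 128)                        # five hypothetical parts
exist_logits, type_logits = JointPredictionGNN()(part_feats)
print(exist_logits.shape, type_logits.shape)            # (5, 5) and (5, 5, 3)
```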
Award ID(s): 1844960
NSF-PAR ID: 10321089
Author(s) / Creator(s):
Date Published:
Journal Name: Proceedings of the 2022 International Conference on Robotics and Automation
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1.

    Social inequality is a consistent feature of animal societies, often manifesting as dominance hierarchies, in which each individual is characterized by a dominance rank denoting its place in the network of competitive relationships among group members. Most studies treat dominance hierarchies as static entities despite their true longitudinal, and sometimes highly dynamic, nature.

    To guide study of the dynamics of dominance, we propose the concept of a longitudinal hierarchy: the characterization of a single, latent hierarchy and its dynamics over time. Longitudinal hierarchies describe the hierarchy position (r) and dynamics (∆) associated with each individual as a property of its interaction data, the periods into which these data are divided based on a period delineation rule (p), and the method chosen to infer the hierarchy. Hierarchy dynamics result from both active (∆a) and passive (∆p) processes. Methods that infer longitudinal hierarchies should optimize accuracy of rank dynamics as well as of the rank orders themselves, but no studies have yet evaluated the accuracy with which different methods infer hierarchy dynamics.

    We modify three popular ranking approaches to make them better suited for inferring longitudinal hierarchies. Our three “informed” methods assign ranks that are informed by data from the prior period rather than calculating ranks de novo in each observation period, and use prior knowledge of dominance correlates to inform placement of new individuals in the hierarchy. These methods are provided in an R package.

    Using both a simulated dataset and a long-term empirical dataset from a species with two distinct sex-based dominance structures, we compare the performance of these methods and their unmodified counterparts. We show that choice of method has dramatic impacts on inference of hierarchy dynamics via differences in estimates of ∆a. Methods that calculate ranks de novo in each period overestimate hierarchy dynamics, but incorporation of prior information leads to more accurately inferred ∆a. Of the modified methods, Informed MatReorder infers the most conservative estimates of hierarchy dynamics and Informed Elo infers the most dynamic hierarchies.

    This work provides crucially needed conceptual framing and methodological validation for studying social dominance and its dynamics.
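    To make the idea of an "informed" method concrete, the sketch below implements a toy Elo-style version in Python (the actual methods ship as an R package; the constant K, the entry-score hook, and all names here are illustrative assumptions): ratings are carried across observation periods instead of being recomputed de novo, and new individuals enter at a score suggested by a dominance correlate rather than a flat default.

```python
K = 100  # rating update constant (illustrative)

def expected_win(r_winner, r_loser):
    return 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))

def informed_elo(periods, prior_scores=None, entry_score=lambda ind: 1000.0):
    """periods: list of observation periods, each a list of (winner, loser) pairs.
    Ratings are carried across periods rather than recalculated de novo, and new
    individuals are seeded via entry_score (a stand-in for a dominance correlate)."""
    scores = dict(prior_scores or {})
    rank_orders = []
    for interactions in periods:
        for winner, loser in interactions:
            scores.setdefault(winner, entry_score(winner))
            scores.setdefault(loser, entry_score(loser))
            p = expected_win(scores[winner], scores[loser])
            scores[winner] += K * (1 - p)
            scores[loser] -= K * (1 - p)
        # The rank order reported for this period is informed by all prior data.
        rank_orders.append(sorted(scores, key=scores.get, reverse=True))
    return rank_orders

print(informed_elo([[("A", "B"), ("B", "C")], [("C", "A")]]))
```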

     
  2. Monocular 3D object parsing is highly desirable in various scenarios including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture to localize semantic parts in 2D images and 3D space while inferring their visibility states, given a single RGB image. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in the desired quantities with ground-truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performance on real image benchmarks, including an extended version of KITTI, PASCAL VOC, PASCAL3D+, and IKEA, for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared to standard end-to-end training.
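    The deep supervision idea, reduced to its bones, looks roughly like the PyTorch sketch below: intermediate feature maps get their own heads and losses for intermediate concepts (here 2D keypoints and visibility), while a later head predicts the final 3D keypoints. Layer sizes, the choice of concepts, and equal loss weights are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeeplySupervisedParser(nn.Module):
    """Deep supervision sketch: each stage is regularized by its own loss."""

    def __init__(self, num_kp=10):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(8))
        self.stage2 = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                                    nn.AdaptiveAvgPool2d(4))
        self.head_kp2d = nn.Linear(32 * 8 * 8, num_kp * 2)   # intermediate concept
        self.head_vis = nn.Linear(64 * 4 * 4, num_kp)        # intermediate concept
        self.head_kp3d = nn.Linear(64 * 4 * 4, num_kp * 3)   # final task

    def forward(self, img):
        f1 = self.stage1(img)
        f2 = self.stage2(f1)
        return (self.head_kp2d(f1.flatten(1)),
                self.head_vis(f2.flatten(1)),
                self.head_kp3d(f2.flatten(1)))

def deep_supervision_loss(outputs, targets):
    # Synthetic renderings supply labels for every stage, so all terms are supervised.
    kp2d, vis, kp3d = outputs
    t2d, tvis, t3d = targets
    return (F.mse_loss(kp2d, t2d)
            + F.binary_cross_entropy_with_logits(vis, tvis)
            + F.mse_loss(kp3d, t3d))

net = DeeplySupervisedParser()
out = net(torch.randn(2, 3, 64, 64))
loss = deep_supervision_loss(out, (torch.randn_like(out[0]),
                                   torch.rand_like(out[1]),
                                   torch.randn_like(out[2])))
```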
  3.
    Most real-world 3D sensors such as LiDARs perform fixed scans of the entire environment, while being decoupled from the recognition system that processes the sensor data. In this work, we propose a method for 3D object recognition using light curtains, a resource-efficient controllable sensor that measures depth at user-specified locations in the environment. Crucially, we propose using the prediction uncertainty of a deep-learning-based 3D point cloud detector to guide active perception. Given the neural network's uncertainty, we derive an optimization objective for placing light curtains using the principle of maximizing information gain. We then develop a novel and efficient optimization algorithm that maximizes this objective by encoding the physical constraints of the device into a constraint graph and optimizing with dynamic programming. We show how a 3D detector can be trained to detect objects in a scene by sequentially placing uncertainty-guided light curtains to successively improve detection accuracy.
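    A stripped-down version of the placement step might look like the dynamic program below: pick one depth bin per camera ray so that the summed detector uncertainty covered is maximized, subject to a smoothness constraint between neighbouring rays. This is a simplification for illustration, not the paper's algorithm; the depth grid, the `max_step` constraint standing in for the device's constraint graph, and the objective are assumptions.

```python
import numpy as np

def place_curtain(uncertainty, max_step=2):
    """uncertainty: (num_rays, num_depths) detector uncertainty map.
    Returns one depth index per ray maximizing total covered uncertainty,
    with adjacent rays differing by at most max_step bins."""
    R, D = uncertainty.shape
    best = uncertainty[0].copy()              # best score ending at each depth, ray 0
    back = np.zeros((R, D), dtype=int)        # backpointers for reconstruction
    for r in range(1, R):
        new = np.full(D, -np.inf)
        for d in range(D):
            lo, hi = max(0, d - max_step), min(D, d + max_step + 1)
            prev = np.argmax(best[lo:hi]) + lo     # best feasible predecessor depth
            new[d] = best[prev] + uncertainty[r, d]
            back[r, d] = prev
        best = new
    # Backtrack the optimal curtain profile.
    path = [int(np.argmax(best))]
    for r in range(R - 1, 0, -1):
        path.append(int(back[r, path[-1]]))
    return path[::-1]

profile = place_curtain(np.random.rand(32, 64))
print(profile[:8])
```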
  4.
    As autonomous robots interact and navigate around real-world environments such as homes, it is useful to reliably identify and manipulate articulated objects, such as doors and cabinets. Many prior works in object articulation identification require manipulation of the object, either by the robot or a human. While recent works have addressed predicting articulation types from visual observations alone, they often assume prior knowledge of category-level kinematic motion models or a sequence of observations in which the articulated parts move according to their kinematic constraints. In this work, we propose FormNet, a neural network that identifies the articulation mechanisms between pairs of object parts from a single frame of an RGB-D image and segmentation masks. The network is trained on 100k synthetic images of 149 articulated objects from 6 categories. Synthetic images are rendered via a photorealistic simulator with domain randomization. Our proposed model predicts motion residual flows of object parts, and these flows are used to determine the articulation type and parameters. The network achieves an articulation type classification accuracy of 82.5% on novel object instances in trained categories. Experiments also show how this method enables generalization to novel categories and can be applied to real-world images without fine-tuning.
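    As a loose illustration of the last step, turning a predicted residual flow into an articulation type and parameters, consider the sketch below. The thresholds, the SVD-based axis fit, and the three-way fixed/prismatic/revolute split are assumptions for illustration, not FormNet's actual procedure.

```python
import numpy as np

def classify_articulation(flow, trans_thresh=0.01, rot_thresh=0.1):
    """flow: (N, 3) predicted motion residual vectors for one part's points.
    Returns a coarse articulation type and a parameter estimate."""
    mag = np.linalg.norm(flow, axis=1)
    if mag.mean() < trans_thresh:
        return "fixed", None                        # part barely moves
    centered = flow - flow.mean(axis=0)
    spread = np.linalg.norm(centered, axis=1).mean()
    if spread < rot_thresh * mag.mean():
        # Nearly uniform flow: prismatic joint; parameter is the slide direction.
        direction = flow.mean(axis=0)
        return "prismatic", direction / np.linalg.norm(direction)
    # Position-dependent flow: revolute joint; take the direction of least flow
    # variance as a crude estimate of the rotation axis.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return "revolute", vt[-1]

points = np.random.rand(200, 3)
rot_flow = np.cross(points - points.mean(axis=0), [0.0, 0.0, 1.0])  # rotation-like flow about z
print(classify_articulation(rot_flow))   # expected: ("revolute", approx. z axis)
```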
  5. Event-based cameras have been designed for scene motion perception: their high temporal resolution and spatial data sparsity convert the scene into a volume of boundary trajectories and allow the evolution of the scene to be tracked and analyzed over time. Analyzing this data is computationally expensive, and there is a substantial lack of theory on dense-in-time object motion to guide the development of new algorithms; hence, many works resort to the simple solution of discretizing the event stream and converting it to classical pixel maps, which allows the application of conventional image processing methods. In this work we present a graph convolutional neural network for the task of scene motion segmentation by a moving camera. We convert the event stream into a 3D graph in (x, y, t) space and keep per-event temporal information. The difficulty of the task stems from the fact that, unlike in metric space, the shape of an object in (x, y, t) space depends on its motion and is not the same across the dataset. We discuss properties of the event data with respect to this 3D recognition problem and show that our graph convolutional architecture is superior to PointNet++. We evaluate our method on the state-of-the-art event-based motion segmentation dataset EV-IMO and compare against a frame-based method proposed by its authors. Our ablation studies show that increasing the event slice width improves accuracy, and illustrate how subsampling and edge configurations affect network performance.
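    A bare-bones version of the graph construction step could look like the sketch below: subsample the event stream and connect each event to its k nearest neighbours in (x, y, t) space, keeping per-event timestamps instead of collapsing them into frames. The subsampling count, the time scaling, and k are illustrative assumptions, and a real implementation would use a spatial index rather than a dense distance matrix.

```python
import numpy as np

def events_to_graph(events, num_samples=1024, k=8, time_scale=1e-3):
    """events: (N, 3) array of (x, y, t). Returns subsampled nodes and an edge list."""
    idx = np.random.choice(len(events), size=min(num_samples, len(events)),
                           replace=False)
    nodes = events[idx].astype(np.float64)
    nodes[:, 2] *= time_scale                       # bring time onto a pixel-like scale
    # Dense pairwise distances are fine at this sample size; use a k-d tree for more.
    dists = np.linalg.norm(nodes[:, None, :] - nodes[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)
    nbrs = np.argsort(dists, axis=1)[:, :k]         # k nearest neighbours per node
    src = np.repeat(np.arange(len(nodes)), k)
    edge_index = np.stack([src, nbrs.reshape(-1)])  # (2, num_edges) edge list
    return nodes, edge_index

events = np.column_stack([np.random.randint(0, 346, size=(5000, 2)),
                          np.sort(np.random.rand(5000)) * 1e5])
nodes, edge_index = events_to_graph(events)
print(nodes.shape, edge_index.shape)                # (1024, 3) (2, 8192)
```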