

Title: Learning 3D Part Assembly from a Single Image
Autonomous assembly is a crucial capability for robots in many applications. For this task, several problems such as obstacle avoidance, motion planning, and actuator control have been extensively studied in robotics. However, when it comes to task specification, the space of possibilities remains underexplored. Towards this end, we introduce a novel problem, single-image-guided 3D part assembly, along with a learning-based solution. We study this problem in the setting of furniture assembly from a given complete set of parts and a single image depicting the entire assembled object. Multiple challenges exist in this setting, including handling ambiguity among parts (e.g., slats in a chair back and leg stretchers) and 3D pose prediction for parts and part subassemblies, whether visible or occluded. We address these issues by proposing a two-module pipeline that leverages strong 2D-3D correspondences and assembly-oriented graph message-passing to infer part relationships. In experiments with a PartNet-based synthetic benchmark, we demonstrate the effectiveness of our framework as compared with three baseline approaches (code and data available at https://github.com/AntheaLi/3DPartAssembly).
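The graph message-passing idea in the abstract can be illustrated with a minimal sketch: each part is a node carrying a feature vector, and one round of passing mean-aggregates linear messages from connected parts before a nonlinear update. This is a hypothetical simplification for illustration only, not the paper's actual module; the feature size, weights, and toy chair graph are all invented here.

```python
import numpy as np

def message_passing_step(node_feats, edges, W_msg, W_upd):
    """One round of part-graph message passing (illustrative sketch):
    each part node mean-aggregates linear messages from its neighbors,
    then applies a nonlinear update."""
    n, d = node_feats.shape
    agg = np.zeros_like(node_feats)
    deg = np.zeros(n)
    for i, j in edges:                      # undirected part-relation graph
        agg[i] += node_feats[j] @ W_msg
        agg[j] += node_feats[i] @ W_msg
        deg[i] += 1
        deg[j] += 1
    agg /= np.maximum(deg, 1)[:, None]      # mean-aggregate incoming messages
    return np.tanh(node_feats @ W_upd + agg)

rng = np.random.default_rng(0)
d = 8                                       # toy per-part feature size
feats = rng.normal(size=(4, d))             # 4 parts (e.g., seat, back, 2 legs)
edges = [(0, 1), (0, 2), (0, 3)]            # seat connected to the other parts
W_msg, W_upd = rng.normal(size=(d, d)), rng.normal(size=(d, d))
out = message_passing_step(feats, edges, W_msg, W_upd)
print(out.shape)                            # (4, 8)
```

In the paper's setting, several such rounds let geometrically ambiguous parts (e.g., interchangeable slats) exchange context before their 6-DoF poses are regressed.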
Award ID(s):
1763268
NSF-PAR ID:
10285236
Author(s) / Creator(s):
Date Published:
Journal Name:
European Conference on Computer Vision
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Tan, Jie; Toussaint, Marc; Darvish, Kourosh (Eds.)
    Most successes in autonomous robotic assembly have been restricted to a single target or category. We propose to investigate general part assembly, the task of creating novel target assemblies with unseen part shapes. As a fundamental step toward a general part assembly system, we tackle the task of determining the precise poses of the parts in the target assembly, which we term "rearrangement planning". We present the General Part Assembly Transformer (GPAT), a transformer-based model architecture that accurately predicts part poses by inferring how each part shape corresponds to the target shape. Our experiments on both 3D CAD models and real-world scans demonstrate GPAT's generalization to novel and diverse target and part shapes.
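The core operation of a transformer such as GPAT is scaled dot-product attention, by which each part embedding can gather pose-relevant evidence from tokens of the target shape. The sketch below shows only this generic building block; the embedding sizes and token counts are made up, and GPAT's actual architecture is more involved.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: each query (a part embedding)
    softmax-attends over keys (target-shape tokens) and returns a
    weighted sum of the values."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)      # softmax over target tokens
    return w @ V, w

rng = np.random.default_rng(1)
parts = rng.normal(size=(5, 16))            # 5 part-shape embeddings (toy)
target = rng.normal(size=(32, 16))          # 32 target-shape tokens (toy)
ctx, w = attention(parts, target, target)
print(ctx.shape)                            # (5, 16)
```

Each row of `w` is a distribution over target tokens, which is one way to read "inferring how each part shape corresponds to the target shape".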
  2. Given a part design, the task of manufacturing process selection chooses an appropriate manufacturing process to fabricate it. Prior research has traditionally determined manufacturing processes through direct classification. However, an alternative approach to selecting a manufacturing process for a new design involves identifying previously produced parts with comparable shapes and materials and learning from them. Finding similar designs in a large dataset of previously manufactured parts is a challenging problem. To solve it, researchers have proposed different spatial and spectral shape descriptors to extract shape features, including the D2 distribution, spherical harmonics (SH), and the Fast Fourier Transform (FFT), as well as the application of different machine learning methods to various representations of 3D part models such as multi-view images, voxels, triangle meshes, and point clouds. However, there has not been a comprehensive analysis of these shape descriptors, especially for part similarity search aimed at manufacturing process selection. To remedy this gap, this paper presents an in-depth comparative study of these shape descriptors for part similarity search. While we acknowledge the importance of factors like part size, tolerance, and cost in manufacturing process selection, this paper focuses on part shape and material properties only. Our findings show that SH performs best among non-machine-learning methods for manufacturing process selection, yielding 97.96% testing accuracy under the proposed quantitative evaluation metric. Among machine learning methods, deep learning on multi-view image representations performs best, yielding 99.85% testing accuracy when rotational invariance is not a primary concern, while deep learning on point cloud representations yields 99.44% testing accuracy when rotational invariance must be considered.
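Of the descriptors named above, the D2 shape distribution is the simplest to sketch: it is a histogram of distances between random point pairs sampled on the shape, and two parts can be compared by the distance between their histograms. The sketch below samples from a point cloud directly as a simplification (the original D2 formulation samples on the mesh surface); the pair count, bin count, and cube point cloud are illustrative choices.

```python
import numpy as np

def d2_descriptor(points, n_pairs=20000, n_bins=32, seed=0):
    """D2 shape distribution: a normalized histogram of distances
    between randomly sampled point pairs of the shape."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    hist, _ = np.histogram(d, bins=n_bins, range=(0, d.max() + 1e-9))
    return hist / hist.sum()                # normalized distance distribution

# toy "part": 1000 points sampled uniformly in a cube
cube = np.random.default_rng(2).uniform(-1, 1, size=(1000, 3))
desc = d2_descriptor(cube)
print(desc.shape)                           # (32,)
```

For similarity search, two descriptors can then be compared with, e.g., an L1 distance; note the plain D2 histogram is translation- and rotation-invariant but not scale-invariant unless normalized.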
  3. The value of electronic waste is estimated to increase rapidly year after year and, with rapid advances in electronics, shows no signs of slowing down. Storage devices such as SATA hard disks and solid-state drives are electronic devices containing high-value recyclable raw materials that often go unrecovered. Most of the e-waste currently generated, including HDDs, is either managed by the informal recycling sector or improperly landfilled with municipal solid waste, primarily due to insufficient recovery infrastructure and a labor shortage in the recycling industry. This emphasizes the importance of developing modern advanced recycling technologies such as robotic disassembly. Performing smooth robotic disassembly of precision electronics requires fast and accurate geometric 3D profiling to quickly and precisely locate key components. Fringe projection profilometry (FPP), a variation of the well-known structured-light technology, provides both the high speed and the high accuracy needed to accomplish this. However, using FPP for the disassembly of high-precision electronics such as hard disks can be especially challenging, given that the hard disk platter is almost completely reflective. Furthermore, the metallic nature of its various components makes it difficult to render an accurate 3D reconstruction. To address this challenge, we have developed a single-shot approach to predict the 3D point cloud of these devices using a combination of computer graphics, fringe projection, and deep learning. We calibrate a physical FPP-based 3D shape measurement system and set up its digital twin using computer graphics. We capture HDD and SSD CAD models at various orientations to generate virtual training datasets consisting of fringe images and their point-cloud reconstructions. These are used to train a U-Net, which can then predict the depth of the parts to high accuracy from only a single fringe image.
This proposed technology has the potential to serve as a valuable fast 3D vision tool for robotic re-manufacturing and is a stepping stone toward a completely automated assembly system.
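For context on what the fringe images encode, a classical multi-shot baseline (not the learned single-shot method above) is three-step phase shifting: three fringe patterns shifted by 120 degrees are projected, and the depth-induced phase is recovered per pixel with an arctangent formula. The sketch below simulates this in 1D; the fringe frequency, phase profile, and intensity constants are made-up values.

```python
import numpy as np

# Three-step phase-shifting fringe analysis (classical baseline):
# project I_k = A + B*cos(phase + delta_k) for delta_k = -120, 0, +120 deg,
# then recover the wrapped phase per pixel.
x = np.linspace(0, 4 * np.pi, 256)          # 4 fringe periods across the image
phi_true = 0.8 * np.sin(x / 2)              # hypothetical depth-induced phase
A, B = 0.5, 0.4                             # background and modulation
I1 = A + B * np.cos(x + phi_true - 2 * np.pi / 3)
I2 = A + B * np.cos(x + phi_true)
I3 = A + B * np.cos(x + phi_true + 2 * np.pi / 3)
phi_wrapped = np.arctan2(np.sqrt(3) * (I1 - I3), 2 * I2 - I1 - I3)
# phi_wrapped equals (x + phi_true) wrapped to (-pi, pi]; compare modulo 2*pi
err = np.angle(np.exp(1j * (phi_wrapped - (x + phi_true))))
print(float(np.abs(err).max()) < 1e-9)      # True
```

The learned approach in the abstract replaces the three captures (plus phase unwrapping) with a network that maps one fringe image directly to depth, which is what makes it single-shot.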
  4. Monocular 3D object parsing is highly desirable in various scenarios, including occlusion reasoning and holistic scene interpretation. We present a deep convolutional neural network (CNN) architecture that, given a single RGB image, localizes semantic parts in 2D images and 3D space while inferring their visibility states. Our key insight is to exploit domain knowledge to regularize the network by deeply supervising its hidden layers, in order to sequentially infer intermediate concepts associated with the final task. To acquire training data in the desired quantities with ground-truth 3D shape and relevant concepts, we render 3D object CAD models to generate large-scale synthetic data and simulate challenging occlusion configurations between objects. We train the network only on synthetic data and demonstrate state-of-the-art performance on real image benchmarks, including an extended version of KITTI, PASCAL VOC, PASCAL3D+, and IKEA, for 2D and 3D keypoint localization and instance segmentation. The empirical results substantiate the utility of our deep supervision scheme by demonstrating effective transfer of knowledge from synthetic data to real images, resulting in less overfitting compared with standard end-to-end training.
  5. Human pose estimation (HPE) is inherently a homogeneous multi-task learning problem, with the localization of each body part constituting a different task. Recent HPE approaches universally learn a shared representation for all parts, from which their locations are linearly regressed. However, our statistical analysis indicates that not all parts are related to each other. As a result, such a sharing mechanism can lead to negative transfer and deteriorate performance. This potential issue drives us to raise an interesting question: can we identify related parts and learn specific features for them to improve pose estimation? Since unrelated tasks no longer share a high-level representation, we expect to avoid the adverse effect of negative transfer. In addition, more explicit structural knowledge, e.g., that ankles and knees are highly related, is incorporated into the model, which helps resolve ambiguities in HPE. To answer this question, we first propose a data-driven approach to group related parts based on how much information they share. Then a part-based branching network (PBN) is introduced to learn representations specific to each part group. We further present a multi-stage version of this network to repeatedly refine intermediate features and pose estimates. Ablation experiments indicate that learning specific features significantly improves the localization of occluded parts and thus benefits HPE. Our approach also outperforms all state-of-the-art methods on two benchmark datasets, with an outstanding advantage when occlusion occurs.
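The data-driven grouping step can be illustrated with a toy stand-in: the paper groups parts by how much information they share, while the sketch below uses correlation of per-part signals and greedy single-linkage merging as a simplified proxy. The signals, threshold, and part names are all invented for illustration.

```python
import numpy as np

# Toy data-driven part grouping: parts whose signals are strongly
# correlated (a stand-in for "sharing information") land in one group.
rng = np.random.default_rng(3)
common = rng.normal(size=200)               # shared factor for related parts
signals = np.stack([
    common + 0.1 * rng.normal(size=200),    # e.g., left ankle
    common + 0.1 * rng.normal(size=200),    # e.g., left knee (related)
    rng.normal(size=200),                   # e.g., wrist (unrelated)
])
corr = np.abs(np.corrcoef(signals))         # pairwise |correlation|
groups = []
for p in range(len(signals)):               # greedy single-linkage grouping
    for g in groups:
        if any(corr[p, q] > 0.7 for q in g):
            g.append(p)
            break
    else:
        groups.append([p])
print(groups)                               # [[0, 1], [2]]
```

In the PBN architecture, each such group then gets its own branch on top of a shared backbone, so unrelated parts no longer share the high-level representation that causes negative transfer.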