

Title: LocaliseBot: Multi-view 3D object localisation with differentiable rendering for robot grasping
Robot grasping typically follows five stages: object detection, object localisation, object pose estimation, grasp pose estimation, and grasp planning. We focus on object pose estimation. Our approach relies on three pieces of information: multiple views of the object, the camera’s extrinsic parameters at those viewpoints, and 3D CAD models of objects. The first step uses a standard deep learning backbone (FCN ResNet) to estimate the object label, a semantic segmentation, and a coarse estimate of the object pose with respect to the camera. Our novelty is a refinement module that starts from the coarse pose estimate and refines it by optimisation through differentiable rendering. This is a purely vision-based approach that avoids the need for other information such as point clouds or depth images. We evaluate our object pose estimation approach on the ShapeNet dataset and show improvements over the state of the art. We also show that the estimated object pose, combined with the ground-truth grasp candidates of the Object Clutter Indoor Dataset (OCID) Grasp dataset, results in 99.65% grasp accuracy as computed using standard practice.
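
To make the refinement step concrete, the sketch below runs gradient descent on an object pose by rendering a soft silhouette of the model under the current pose and penalising its mismatch with the segmentation mask in each calibrated view. It is a minimal illustration under stated assumptions, not the paper's pipeline: the Gaussian point-splatting "renderer", the synthetic vertices, extrinsics and masks are all stand-ins for a real differentiable mesh renderer and real data.

    import torch

    def axis_angle_to_matrix(r):
        """Rodrigues' formula: axis-angle 3-vector -> 3x3 rotation matrix."""
        theta = r.norm() + 1e-8
        k = r / theta
        zero = torch.zeros((), dtype=r.dtype)
        K = torch.stack([torch.stack([zero, -k[2], k[1]]),
                         torch.stack([k[2], zero, -k[0]]),
                         torch.stack([-k[1], k[0], zero])])
        return (torch.eye(3, dtype=r.dtype) + torch.sin(theta) * K
                + (1.0 - torch.cos(theta)) * (K @ K))

    def soft_silhouette(verts_cam, f=100.0, H=64, W=64, sigma=2.0):
        """Pinhole projection plus Gaussian splatting of the vertices: a crude
        but differentiable stand-in for a real silhouette renderer."""
        u = f * verts_cam[:, 0] / verts_cam[:, 2] + W / 2
        v = f * verts_cam[:, 1] / verts_cam[:, 2] + H / 2
        ys = torch.arange(H, dtype=verts_cam.dtype).view(H, 1, 1)
        xs = torch.arange(W, dtype=verts_cam.dtype).view(1, W, 1)
        d2 = (xs - u.view(1, 1, -1)) ** 2 + (ys - v.view(1, 1, -1)) ** 2
        return torch.exp(-d2 / (2 * sigma ** 2)).amax(dim=-1)  # H x W soft mask

    torch.manual_seed(0)
    verts = torch.randn(200, 3) * 0.1              # stand-in for CAD model vertices
    views = [(torch.eye(3), torch.tensor([0.0, 0.0, 2.0])),   # known camera extrinsics
             (axis_angle_to_matrix(torch.tensor([0.0, 0.6, 0.0])),
              torch.tensor([0.1, 0.0, 2.0]))]

    # "Observed" segmentation masks, here rendered from a hidden ground-truth pose.
    r_true, t_true = torch.tensor([0.3, -0.2, 0.1]), torch.tensor([0.05, 0.02, 0.0])
    with torch.no_grad():
        R_true = axis_angle_to_matrix(r_true)
        masks = [soft_silhouette((verts @ R_true.T + t_true) @ Rc.T + tc)
                 for Rc, tc in views]

    # Coarse pose from the detection backbone (placeholder values), refined by
    # minimising the multi-view silhouette discrepancy with gradient descent.
    r = torch.tensor([0.1, 0.1, 0.1], requires_grad=True)   # axis-angle rotation
    t = torch.zeros(3, requires_grad=True)                  # object translation
    opt = torch.optim.Adam([r, t], lr=0.05)
    for step in range(200):
        opt.zero_grad()
        R_obj = axis_angle_to_matrix(r)
        loss = sum(((soft_silhouette((verts @ R_obj.T + t) @ Rc.T + tc) - m) ** 2).mean()
                   for (Rc, tc), m in zip(views, masks))
        loss.backward()
        opt.step()
    print("refined rotation (axis-angle):", r.detach(), " translation:", t.detach())

In the actual system the loop would start from the backbone's coarse estimate and use the known extrinsics of each captured view; the optimiser, learning rate and iteration count here are arbitrary choices for the sketch.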
Award ID(s): 1826258
NSF-PAR ID: 10352247
Author(s) / Creator(s):
Date Published:
Journal Name: European Conference on Computer Vision
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. We consider the problem of in-hand dexterous manipulation with a focus on unknown or uncertain hand–object parameters, such as hand configuration, object pose within hand, and contact positions. In particular, in this work we formulate a generic framework for hand–object configuration estimation using underactuated hands as an example. Owing to the passive reconfigurability and the lack of encoders in the hand’s joints, it is challenging to estimate, plan, and actively control underactuated manipulation. By modeling the grasp constraints, we present a particle filter-based framework to estimate the hand configuration. Specifically, given an arbitrary grasp, we start by sampling a set of hand configuration hypotheses and then randomly manipulate the object within the hand. While observing the object’s movements as evidence using an external camera, which is not necessarily calibrated with the hand frame, our estimator calculates the likelihood of each hypothesis to iteratively estimate the hand configuration. Once converged, the estimator is used to track the hand configuration in real time for future manipulations. Thereafter, we develop an algorithm to precisely plan and control the underactuated manipulation to move the grasped object to desired poses. In contrast to most other dexterous manipulation approaches, our framework does not require any tactile sensing or joint encoders, and can directly operate on any novel objects, without requiring a model of the object a priori. We implemented our framework on both the Yale Model O hand and the Yale T42 hand. The results show that the estimation is accurate for different objects, and that the framework can be easily adapted across different underactuated hand models. In the end, we evaluated our planning and control algorithm with handwriting tasks, and demonstrated the effectiveness of the proposed framework. 
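
The estimation loop described above can be pictured with a minimal particle-filter sketch: hypotheses over the hand configuration are weighted by how well they predict the observed object motion, and resampled over repeated random manipulations. The two-dimensional configuration and the linear motion model below are toy assumptions standing in for the actual hand kinematics and grasp-constraint model.

    import numpy as np

    rng = np.random.default_rng(0)

    def predict_object_motion(config, action):
        """Toy stand-in for the grasp-constraint model: how the object would
        move for a given hand configuration and actuator command."""
        return np.array([config[0] * action[0], config[1] * action[1]])

    true_config = np.array([0.8, -0.3])        # unknown hand configuration to estimate

    # 1. Sample hand-configuration hypotheses (particles) with uniform weights.
    particles = rng.uniform(-1.0, 1.0, size=(500, 2))
    weights = np.full(len(particles), 1.0 / len(particles))

    for step in range(30):
        # 2. Randomly manipulate the object and observe its motion (with noise)
        #    through an external camera that need not be calibrated to the hand.
        action = rng.uniform(-1.0, 1.0, size=2)
        observed = predict_object_motion(true_config, action) + rng.normal(0, 0.02, 2)

        # 3. Weight each hypothesis by the likelihood of the observed motion.
        predicted = np.stack([predict_object_motion(p, action) for p in particles])
        err = np.sum((predicted - observed) ** 2, axis=1)
        weights *= np.exp(-err / (2 * 0.05 ** 2))
        weights = weights + 1e-12
        weights /= weights.sum()

        # 4. Resample when the effective sample size collapses, then jitter.
        if 1.0 / np.sum(weights ** 2) < len(particles) / 2:
            idx = rng.choice(len(particles), size=len(particles), p=weights)
            particles = particles[idx] + rng.normal(0, 0.01, particles.shape)
            weights[:] = 1.0 / len(particles)

    print("estimated configuration:", np.average(particles, axis=0, weights=weights))
    print("true configuration:     ", true_config)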
  2. We consider a category-level perception problem, where one is given 3D sensor data picturing an object of a given category (e.g., a car), and has to reconstruct the pose and shape of the object despite intra-class variability (i.e., different car models have different shapes). We consider an active shape model, where, for an object category, we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape estimation are formulated as a non-convex optimization. Our first contribution is to provide the first certifiably optimal solver for pose and shape estimation. In particular, we show that rotation estimation can be decoupled from the estimation of the object translation and shape, and we demonstrate that (i) the optimal object rotation can be computed via a tight (small-size) semidefinite relaxation, and (ii) the translation and shape parameters can be computed in closed form given the rotation. Our second contribution is to add an outlier rejection layer to our solver, hence making it robust to a large number of misdetections. Towards this goal, we wrap our optimal solver in a robust estimation scheme based on graduated non-convexity. To further enhance robustness to outliers, we also develop the first graph-theoretic formulation to prune outliers in category-level perception, which removes outliers via convex hull and maximum clique computations; the resulting approach is robust to 70-90% outliers. Our third contribution is an extensive experimental evaluation. Besides providing an ablation study on a simulated dataset and on the PASCAL3D+ dataset, we combine our solver with a deep-learned keypoint detector, and show that the resulting approach improves over the state of the art in vehicle pose estimation on the ApolloScape datasets.
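
The decoupling in (ii) can be illustrated with a small numerical sketch: once a rotation is fixed, the active-shape measurement model is linear in the shape coefficients and translation, so both follow from a single least-squares solve. The basis shapes, keypoints and noise below are synthetic, and the certifiable rotation solver and outlier-rejection layers of the paper are not reproduced here.

    import numpy as np

    rng = np.random.default_rng(0)

    K, M = 8, 3                                    # keypoints and CAD models in the library
    library = rng.normal(size=(M, K, 3))           # toy active-shape basis keypoints

    # Synthesise observations from a ground-truth rotation, translation and shape mix.
    c_true = np.array([0.6, 0.3, 0.1])
    t_true = np.array([0.5, -0.2, 1.0])
    theta = 0.4
    R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                  [np.sin(theta),  np.cos(theta), 0.0],
                  [0.0,            0.0,           1.0]])
    shape = np.tensordot(c_true, library, axes=1)  # K x 3 blended shape
    observations = shape @ R.T + t_true + rng.normal(0, 0.01, (K, 3))

    # Closed-form shape and translation given the (already estimated) rotation:
    # each keypoint obeys p_k ~= R * (sum_m c_m B_m[k]) + t, which is linear in
    # the unknowns (c, t), so one least-squares solve recovers both.
    A = np.zeros((3 * K, M + 3))
    b = observations.reshape(-1)
    for k in range(K):
        for m in range(M):
            A[3 * k:3 * k + 3, m] = R @ library[m, k]
        A[3 * k:3 * k + 3, M:] = np.eye(3)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    print("recovered shape coefficients:", x[:M])
    print("recovered translation:       ", x[M:])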
  3. Robot manipulation and grasping mechanisms have received considerable attention in the recent past, leading to the development of a wide range of industrial applications. This paper proposes the development of an autonomous robotic grasping system for an object sorting application. RGB-D data is used by the robot to perform object detection, pose estimation, trajectory generation and object sorting tasks. The proposed approach can also handle grasping of specific objects chosen by users. Trained convolutional neural networks are used to perform object detection and determine the corresponding point cloud cluster of the object to be grasped. From the selected point cloud data, a grasp generator algorithm outputs potential grasps. A grasp filter then scores these potential grasps, and the highest-scored grasp is chosen for execution on a real robot. A motion planner generates collision-free trajectories to execute the chosen grasp. Experiments on an AUBO robotic manipulator show the potential of the proposed approach in the context of autonomous object sorting with robust and fast sorting performance.
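
As a rough illustration of the generate-then-score selection described above, the toy sketch below samples candidate grasps on a point-cloud cluster, scores them with a simple heuristic, and keeps the best one. The sampling and scoring rules are invented placeholders, not the grasp generator or filter actually used in the paper.

    import numpy as np

    rng = np.random.default_rng(1)

    def generate_grasps(cloud, n=50):
        """Toy grasp generator: candidate grasp centres sampled on the cluster,
        each with a random unit approach direction."""
        centers = cloud[rng.choice(len(cloud), size=n)]
        directions = rng.normal(size=(n, 3))
        directions /= np.linalg.norm(directions, axis=1, keepdims=True)
        return centers, directions

    def score_grasps(cloud, centers, directions):
        """Toy grasp filter: prefer grasps near the cluster centroid whose
        approach direction is close to vertical (top-down)."""
        centroid = cloud.mean(axis=0)
        dist = np.linalg.norm(centers - centroid, axis=1)
        verticality = -directions[:, 2]            # +1 when approaching straight down
        return verticality - dist

    # Point-cloud cluster of the detected object (synthetic placeholder for the
    # cluster that the CNN-based detector would select from the RGB-D data).
    cluster = rng.normal(loc=[0.4, 0.0, 0.05], scale=0.02, size=(500, 3))

    centers, directions = generate_grasps(cluster)
    scores = score_grasps(cluster, centers, directions)
    best = int(np.argmax(scores))
    print("best grasp centre:   ", centers[best])
    print("best approach vector:", directions[best])
    # A motion planner (e.g. MoveIt) would then produce a collision-free
    # trajectory to this pose; that step is hardware-specific and omitted here.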
  4. The ability to estimate the 3D human shape and pose from images can be useful in many contexts. Recent approaches have explored using graph convolutional networks and achieved promising results. The fact that the 3D shape is represented by a mesh, an undirected graph, makes graph convolutional networks a natural fit for this problem. However, graph convolutional networks have limited representation power: information from a node is passed only to its connected neighbors, and propagating information over longer ranges requires successive graph convolutions. To overcome this limitation, we propose a dual-scale graph approach. We use a coarse graph, derived from a dense graph, to estimate the human’s 3D pose, and the dense graph to estimate the 3D shape. Information in coarse graphs can be propagated over longer distances than in dense graphs. In addition, information about pose can guide the recovery of local shape detail, and vice versa. We recognize that the connection between the coarse and dense graphs is itself a graph, and introduce graph fusion blocks to exchange information between graphs of different scales. We train our model end-to-end and show that we achieve state-of-the-art results on several evaluation datasets. The code is available at https://github.com/yuxwind/BiGraphBody.
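
A compact sketch of the dual-scale idea follows: one graph-convolution branch operates on the dense mesh graph, another on a coarse graph, and a fusion block exchanges features through a dense-to-coarse assignment matrix (the graph between the two graphs). The layer widths, adjacencies and assignment below are toy assumptions; the actual architecture is in the linked BiGraphBody repository.

    import torch
    import torch.nn as nn

    class GCNLayer(nn.Module):
        """Basic graph convolution: X' = ReLU(A_norm @ X @ W)."""
        def __init__(self, adj, in_dim, out_dim):
            super().__init__()
            deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
            self.register_buffer("A", adj / deg)           # row-normalised adjacency
            self.lin = nn.Linear(in_dim, out_dim)

        def forward(self, x):                              # x: (N, in_dim)
            return torch.relu(self.A @ self.lin(x))

    class FusionBlock(nn.Module):
        """Exchange features between the dense and coarse graphs through a
        dense-to-coarse assignment matrix."""
        def __init__(self, assign, dim):
            super().__init__()
            self.register_buffer("M", assign / assign.sum(dim=0, keepdim=True))
            self.to_dense = nn.Linear(dim, dim)
            self.to_coarse = nn.Linear(dim, dim)

        def forward(self, x_dense, x_coarse):
            x_dense = x_dense + self.to_dense(self.M @ x_coarse)        # lift coarse -> dense
            x_coarse = x_coarse + self.to_coarse(self.M.t() @ x_dense)  # pool dense -> coarse
            return x_dense, x_coarse

    # Toy sizes: 12 dense mesh vertices grouped into 4 coarse nodes, 16-d features.
    torch.manual_seed(0)
    N_dense, N_coarse, D = 12, 4, 16
    A_dense = (torch.rand(N_dense, N_dense) > 0.6).float()
    A_coarse = (torch.rand(N_coarse, N_coarse) > 0.4).float()
    assign = torch.zeros(N_dense, N_coarse)
    assign[torch.arange(N_dense), torch.arange(N_dense) % N_coarse] = 1.0

    gcn_dense = GCNLayer(A_dense, D, D)     # shape branch on the dense mesh graph
    gcn_coarse = GCNLayer(A_coarse, D, D)   # pose branch on the coarse graph
    fuse = FusionBlock(assign, D)

    x_dense, x_coarse = torch.randn(N_dense, D), torch.randn(N_coarse, D)
    x_dense, x_coarse = gcn_dense(x_dense), gcn_coarse(x_coarse)
    x_dense, x_coarse = fuse(x_dense, x_coarse)
    print(x_dense.shape, x_coarse.shape)    # torch.Size([12, 16]) torch.Size([4, 16])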