Title: Learning Diverse and Physically Feasible Dexterous Grasps with Generative Model and Bilevel Optimization
To fully utilize the versatility of a multi-fingered dexterous robotic hand for executing diverse object grasps, one must consider the rich physical constraints introduced by hand-object interaction and object geometry. We propose an integrative approach that combines a generative model with bilevel optimization (BO) to plan diverse grasp configurations on novel objects. First, a conditional variational autoencoder trained on only six YCB objects predicts finger placements directly from the object point cloud. The prediction then seeds a nonconvex BO that solves for a grasp configuration under collision, reachability, wrench-closure, and friction constraints. Our method achieved an 86.7% success rate over 120 real-world grasping trials on 20 household objects, including unseen and challenging geometries. Quantitative empirical evaluations confirm that the grasp configurations produced by our pipeline satisfy the kinematic and dynamic constraints. A video summary of our results is available at youtu.be/9DTrImbN99I.
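The following is a minimal, illustrative sketch of the seed-then-refine idea in the abstract, not the authors' implementation: a stand-in for the CVAE decoder proposes fingertip placements from a point cloud, and the proposal warm-starts a constrained nonlinear solve. The function names, the toy collision and wrench-closure proxies, and the use of SciPy's SLSQP in place of the paper's bilevel solver are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def propose_grasp(point_cloud, latent_dim=8):
    """Stand-in for the trained CVAE decoder: sample a latent code and map it,
    together with the point-cloud centroid, to an initial grasp (a flat vector
    of four fingertip positions)."""
    z = rng.standard_normal(latent_dim)
    centroid = point_cloud.mean(axis=0)
    offsets = 0.05 * rng.standard_normal((4, 3)) + 0.1 * z[:3]  # toy decoder
    return (centroid + offsets).ravel()

def collision_margin(q, point_cloud):
    """Clearance between each fingertip and the object points (>= 0 desired).
    A real system would query a mesh or signed distance field instead."""
    tips = q.reshape(4, 3)
    dists = np.linalg.norm(tips[:, None, :] - point_cloud[None, :, :], axis=-1)
    return dists.min(axis=1) - 0.005  # require 5 mm clearance

def wrench_proxy(q, point_cloud):
    """Crude stand-in for wrench closure: fingertips should surround the object,
    so their mean outward direction from the centroid should be near zero."""
    tips = q.reshape(4, 3)
    dirs = tips - point_cloud.mean(axis=0)
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True) + 1e-9
    return np.linalg.norm(dirs.mean(axis=0))

def refine_grasp(q0, point_cloud):
    """Refine the generative seed under the toy constraints with SLSQP."""
    cons = [
        {"type": "ineq", "fun": lambda q: collision_margin(q, point_cloud)},
        {"type": "ineq", "fun": lambda q: 0.2 - wrench_proxy(q, point_cloud)},
    ]
    objective = lambda q: np.sum((q - q0) ** 2)  # stay close to the seed
    return minimize(objective, q0, method="SLSQP", constraints=cons)

cloud = 0.03 * rng.standard_normal((256, 3))  # toy "object" point cloud
seed = propose_grasp(cloud)
result = refine_grasp(seed, cloud)
print("converged:", result.success, "fingertips:", result.x.reshape(4, 3))
```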
Award ID(s): 2024247
NSF-PAR ID: 10440633
Journal Name: Conference on Robot Learning / Proceedings of Machine Learning Research
Volume: 205
Page Range / eLocation ID: 1938-1948
Sponsoring Org: National Science Foundation
More Like this
  1. We consider the problem of in-hand dexterous manipulation with a focus on unknown or uncertain hand–object parameters, such as hand configuration, object pose within the hand, and contact positions. In particular, we formulate a generic framework for hand–object configuration estimation, using underactuated hands as an example. Owing to the passive reconfigurability and the lack of encoders in the hand's joints, it is challenging to estimate, plan, and actively control underactuated manipulation. By modeling the grasp constraints, we present a particle filter-based framework to estimate the hand configuration. Specifically, given an arbitrary grasp, we start by sampling a set of hand configuration hypotheses and then randomly manipulate the object within the hand. While observing the object's movements as evidence using an external camera, which is not necessarily calibrated with the hand frame, our estimator calculates the likelihood of each hypothesis to iteratively estimate the hand configuration. Once the estimate has converged, the estimator is used to track the hand configuration in real time for future manipulations. Thereafter, we develop an algorithm to precisely plan and control the underactuated manipulation to move the grasped object to desired poses. In contrast to most other dexterous manipulation approaches, our framework does not require any tactile sensing or joint encoders, and it can directly operate on novel objects without requiring an object model a priori. We implemented our framework on both the Yale Model O hand and the Yale T42 hand. The results show that the estimation is accurate for different objects, and that the framework can be easily adapted across different underactuated hand models. Finally, we evaluated our planning and control algorithm on handwriting tasks and demonstrated the effectiveness of the proposed framework.
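The estimation loop described in the item above (sample hypotheses, perturb the object, weight hypotheses by the observed motion, resample) can be summarized with a generic particle-filter skeleton. The 3-D toy state, the additive motion model, and the Gaussian likelihood on a trivial identity observation map below are illustrative assumptions, not the paper's models.

```python
import numpy as np

rng = np.random.default_rng(1)

def predict(particles, action, noise=0.01):
    """Propagate every hypothesis through a simple stochastic motion model."""
    return particles + action + noise * rng.standard_normal(particles.shape)

def likelihood(particles, observation, sigma=0.02):
    """Weight each hypothesis by how well it explains the camera observation,
    using a Gaussian error model on a trivial identity observation map."""
    err = np.linalg.norm(particles - observation, axis=1)
    return np.exp(-0.5 * (err / sigma) ** 2) + 1e-12

def resample(particles, weights):
    """Multinomial resampling: keep particles in proportion to their weights."""
    idx = rng.choice(len(particles), size=len(particles), p=weights / weights.sum())
    return particles[idx]

# Toy run: estimate a 3-D state from noisy observations over 20 random pushes.
particles = rng.uniform(-0.1, 0.1, size=(500, 3))  # initial hypotheses
true_state = np.array([0.03, -0.02, 0.05])
for _ in range(20):
    action = 0.01 * rng.standard_normal(3)          # random in-hand manipulation
    true_state = true_state + action
    observation = true_state + 0.005 * rng.standard_normal(3)  # camera evidence
    particles = predict(particles, action)
    particles = resample(particles, likelihood(particles, observation))

print("estimate:", particles.mean(axis=0), "truth:", true_state)
```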
  2. Grasping a simple object from the side is easy, unless the object is almost as big as the hand or space constraints require positioning the robot hand awkwardly with respect to the object. We show that humans, when faced with this challenge, adopt coordinated finger movements that enable them to successfully grasp objects even from these awkward poses. We also show that it is relatively straightforward to implement these strategies autonomously. Our human-studies approach asks participants to perform a grasping task by either "puppeteering" a robotic manipulator that is identical (geometrically and kinematically) to a popular underactuated robotic manipulator (the Barrett hand), or using sliders to control the original Barrett hand. Unlike previous studies, this enables us to directly capture and compare human manipulation strategies with robotic ones. Our observation is that, while humans employ underactuation, how they use it is fundamentally different (and more effective) from that found in existing hardware.
  3. Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However, this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps, existing methods use skeletons, meshes, or parametric models that can cause misalignments resulting in inaccurate contacts. We present MANUS, a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians. We build a novel articulated 3D Gaussian representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives, it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results, our method requires tens of camera views that current datasets do not provide. We therefore build MANUS Grasps, a new dataset that contains hand-object grasps viewed from 53 cameras across 30+ scenes and 3 subjects, comprising over 7M frames. In addition to extensive qualitative results, we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand.
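As a rough illustration of why Gaussian primitives make contact estimation convenient, the sketch below scores hand-object proximity directly from Gaussian centers and isotropic scales. The proximity measure, the threshold, and the toy primitives are assumptions, not the MANUS formulation.

```python
import numpy as np

rng = np.random.default_rng(2)

def contact_scores(hand_centers, hand_scales, obj_centers, obj_scales):
    """For each hand Gaussian, return the distance to the nearest object
    Gaussian, normalized by the sum of their isotropic scales, plus the index
    of that nearest object Gaussian."""
    d = np.linalg.norm(hand_centers[:, None, :] - obj_centers[None, :, :], axis=-1)
    scaled = d / (hand_scales[:, None] + obj_scales[None, :])
    return scaled.min(axis=1), scaled.argmin(axis=1)

# Toy primitives: 200 hand Gaussians near a "fingertip", 500 object Gaussians.
hand_c = rng.normal([0.0, 0.0, 0.05], 0.01, size=(200, 3))
obj_c = rng.normal([0.0, 0.0, 0.0], 0.02, size=(500, 3))
hand_s = np.full(200, 0.004)
obj_s = np.full(500, 0.004)

scores, _ = contact_scores(hand_c, hand_s, obj_c, obj_s)
in_contact = scores < 1.0  # contact when the two blobs roughly overlap
print("hand Gaussians in contact:", int(in_contact.sum()), "of", len(scores))
```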
  4. We propose a visually grounded library-of-behaviors approach for learning to manipulate diverse objects across varying initial and goal configurations and camera placements. Our key innovation is to disentangle the standard image-to-action mapping into two separate modules that use different types of perceptual input: (1) a behavior selector, which conditions on intrinsic and semantically rich object appearance features to select the behaviors that can successfully perform the desired task on the object at hand, and (2) a library of behaviors, each of which conditions on extrinsic and abstract object properties, such as object location and pose, to predict actions to execute over time. The selector uses a semantically rich 3D object feature representation extracted from images in a differentiable, end-to-end manner. This representation is trained to be view-invariant and affordance-aware using self-supervision, by predicting varying views and successful object manipulations. We test our framework on pushing and grasping diverse objects in simulation, as well as on transporting rigid, granular, and liquid food ingredients in a real robot setup. Our model outperforms image-to-action mappings that do not factorize static and dynamic object properties. We further ablate the contribution of the selector's input and show the benefits of the proposed view-predictive, affordance-aware 3D visual object representations.
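The two-module factorization described above can be outlined schematically: a selector scores behaviors from appearance-style features, while each behavior in the library maps abstract object state to actions. The behaviors, feature dimension, and random linear scorer below are placeholders, not the paper's learned models.

```python
import numpy as np

rng = np.random.default_rng(3)

class PushBehavior:
    def act(self, object_pose):
        # Move toward the object in the plane, ignoring height.
        return np.append(object_pose[:2], 0.0)

class GraspBehavior:
    def act(self, object_pose):
        # Approach from above: go to the object and descend.
        return object_pose + np.array([0.0, 0.0, -0.05])

class BehaviorSelector:
    """Scores each behavior from an appearance feature vector; a random linear
    scorer stands in for the learned, view-invariant selector."""
    def __init__(self, behaviors, feature_dim=32):
        self.behaviors = behaviors
        self.weights = rng.standard_normal((len(behaviors), feature_dim))

    def select(self, appearance_features):
        scores = self.weights @ appearance_features
        return self.behaviors[int(scores.argmax())]

library = [PushBehavior(), GraspBehavior()]
selector = BehaviorSelector(library)

appearance = rng.standard_normal(32)        # stand-in 3D object feature
object_pose = np.array([0.3, -0.1, 0.12])   # stand-in extrinsic state
behavior = selector.select(appearance)
print(type(behavior).__name__, "->", behavior.act(object_pose))
```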