This content will become publicly available on June 21, 2026

Title: Vysics: Object Reconstruction Under Occlusion by Fusing Vision and Contact-Rich Physics
We introduce Vysics, a vision-and-physics framework for a robot to build an expressive geometry and dynamics model of a single rigid body, using a seconds-long RGBD video and the robot’s proprioception. While the computer vision community has built powerful visual 3D perception algorithms, cluttered environments with heavy occlusions can limit the visibility of objects of interest. However, observed motion of partially occluded objects can imply physical interactions took place, such as contact with a robot or the environment. These inferred contacts can supplement the visible geometry with "physible geometry," which best explains the observed object motion through physics. Vysics uses a vision-based tracking and reconstruction method, BundleSDF, to estimate the trajectory and the visible geometry from an RGBD video, and an odometry-based model learning method, Physics Learning Library (PLL), to infer the "physible" geometry from the trajectory through implicit contact dynamics optimization. The visible and "physible" geometries jointly factor into optimizing a signed distance function (SDF) to represent the object shape. Vysics does not require pretraining, nor tactile or force sensors. Compared with vision-only methods, Vysics yields object models with higher geometric accuracy and better dynamics prediction in experiments where the object interacts with the robot and the environment under heavy occlusion.
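
As a rough, hypothetical illustration of the joint shape optimization described above (not the authors' code; the class SDFNet, the loss sdf_loss, and every weight and sample count are invented for this sketch), the snippet below fits a small PyTorch MLP signed distance function to two point sets at once: visible surface samples standing in for the vision-based reconstruction, and contact points standing in for the "physible" geometry implied by the observed motion, while free-space samples are kept outside the zero level set.

    import torch
    import torch.nn as nn

    class SDFNet(nn.Module):
        """Tiny MLP signed distance function over 3-D points in the object frame."""
        def __init__(self, hidden=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )

        def forward(self, x):                 # x: (N, 3)
            return self.net(x).squeeze(-1)

    def sdf_loss(model, visible_pts, contact_pts, free_pts, w_phys=1.0, w_free=0.1):
        # Visible and physics-implied points should lie on the zero level set;
        # free-space samples should keep a small positive signed distance.
        loss_vis = model(visible_pts).abs().mean()
        loss_phys = model(contact_pts).abs().mean()
        loss_free = torch.relu(0.01 - model(free_pts)).mean()
        return loss_vis + w_phys * loss_phys + w_free * loss_free

    # Toy usage with random stand-in data.
    model = SDFNet()
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    visible = torch.randn(256, 3)             # surface samples from vision
    contacts = torch.randn(32, 3)             # contact points implied by motion
    free = 2.0 * torch.randn(256, 3)          # samples assumed off the object
    for _ in range(200):
        opt.zero_grad()
        sdf_loss(model, visible, contacts, free).backward()
        opt.step()

In the actual system the contact points would come from the implicit contact dynamics optimization in PLL rather than from the random stand-ins used here.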
Award ID(s):
2238480
PAR ID:
10597647
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
Robotics: Science and Systems
Date Published:
Format(s):
Medium: X
Location:
Los Angeles, CA
Sponsoring Org:
National Science Foundation
More Like this
  1. Modelling and learning the dynamics of intricate dynamic interactions prevalent in common tasks such as pushing a heavy door or picking up an object in one sweeping motion is a challenging problem. One needs to consider both the dynamics of the individual objects and of the interactions among objects. In this work, we present a method that enables efficient learning of the dynamics of interacting systems by simultaneously learning a dynamic graph structure and a stable and locally linear forward dynamic model of the system. The dynamic graph structure encodes evolving contact modes along a trajectory by making probabilistic predictions over the edge activations. Introducing a temporal dependence in the learned graph structure enables incorporating contact measurement updates which allows for more accurate forward predictions. The learned stable and locally linear dynamics enable the use of optimal control algorithms such as iLQR for long-horizon planning and control for complex interactive tasks. Through experiments in simulation and in the real world, we evaluate the performance of our method by using the learned interaction dynamics for control and demonstrate generalization to more objects and interactions not seen during training. We also introduce a control scheme that takes advantage of contact measurement updates and hence is robust to prediction inaccuracies during execution.
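    A minimal, hypothetical sketch of the idea in the entry above (edge_activation, predict, and all matrices and constants are invented here, not the paper's learned models): each object follows locally linear self-dynamics, and pairwise interaction terms are gated by a probabilistic edge activation that plays the role of the inferred contact mode.

    import numpy as np

    def edge_activation(x_i, x_j, thresh=0.1, beta=50.0):
        # Probability that two objects are in contact, from their planar distance.
        dist = np.linalg.norm(x_i[:2] - x_j[:2])
        return 1.0 / (1.0 + np.exp(beta * (dist - thresh)))

    def predict(states, controls, A_self, A_inter, B):
        """One forward step: locally linear self-dynamics plus edge-gated
        interaction terms between every pair of objects."""
        nxt = np.zeros_like(states)
        for i in range(states.shape[0]):
            nxt[i] = A_self @ states[i] + B @ controls[i]
            for j in range(states.shape[0]):
                if j != i:
                    nxt[i] += edge_activation(states[i], states[j]) * (A_inter @ states[j])
        return nxt

    # Toy usage: two objects with states [x, y, vx, vy] and 2-D controls.
    dt = 0.1
    A_self = np.eye(4)
    A_self[0, 2] = A_self[1, 3] = dt                      # positions integrate velocities
    A_inter = 0.01 * np.eye(4)                            # weak coupling when in contact
    B = np.vstack([np.zeros((2, 2)), dt * np.eye(2)])     # controls act on velocities
    states = np.array([[0.0, 0.0, 0.0, 0.0], [0.05, 0.0, 0.0, 0.0]])
    controls = np.array([[1.0, 0.0], [0.0, 0.0]])
    print(predict(states, controls, A_self, A_inter, B))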
  2. Yashinski, Melisa (Ed.)
    To achieve human-level dexterity, robots must infer spatial awareness from multimodal sensing to reason over contact interactions. During in-hand manipulation of novel objects, such spatial awareness involves estimating the object’s pose and shape. The status quo for in-hand perception primarily uses vision and is restricted to tracking a priori known objects. Moreover, visual occlusion of objects in hand is imminent during manipulation, preventing current systems from pushing beyond tasks without occlusion. We combined vision and touch sensing on a multifingered hand to estimate an object’s pose and shape during in-hand manipulation. Our method, NeuralFeels, encodes object geometry by learning a neural field online and jointly tracks it by optimizing a pose graph problem. We studied multimodal in-hand perception in simulation and the real world, interacting with different objects via a proprioception-driven policy. Our experiments showed final reconstruction F-scores of 81% and average pose drifts of 4.7 millimeters, which were further reduced to 2.3 millimeters with known object models. In addition, we observed that, under heavy visual occlusion, we could achieve improvements in tracking up to 94% compared with vision-only methods. Our results demonstrate that touch, at the very least, refines and, at the very best, disambiguates visual estimates during in-hand manipulation. We release our evaluation dataset of 70 experiments, FeelSight, as a step toward benchmarking in this domain. Our neural representation driven by multimodal sensing can serve as a perception backbone toward advancing robot dexterity.
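    As a rough illustration of the pose-tracking half of the entry above (hypothetical code, not NeuralFeels itself; refine_pose, the analytic sphere SDF, and the step sizes are all invented), surface points measured by vision and touch can refine a 6-DoF object pose against a fixed neural SDF by driving their signed distances toward zero.

    import torch

    def skew(k):
        z = torch.zeros((), dtype=k.dtype)
        return torch.stack([
            torch.stack([z, -k[2], k[1]]),
            torch.stack([k[2], z, -k[0]]),
            torch.stack([-k[1], k[0], z]),
        ])

    def axis_angle_to_matrix(w):
        # Rodrigues' formula for a rotation vector w of shape (3,).
        theta = torch.sqrt((w * w).sum() + 1e-12)
        K = skew(w / theta)
        return torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)

    def refine_pose(sdf, points_cam, pose0, iters=100, lr=1e-2):
        """Optimize pose = [tx, ty, tz, wx, wy, wz] so that measured surface
        points (vision and touch), mapped into the object frame, land on the
        zero level set of the current shape estimate."""
        pose = pose0.clone().requires_grad_(True)
        opt = torch.optim.Adam([pose], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            R = axis_angle_to_matrix(pose[3:])
            pts_obj = (points_cam - pose[:3]) @ R          # inverse rigid transform
            sdf(pts_obj).abs().mean().backward()
            opt.step()
        return pose.detach()

    # Toy usage: an analytic sphere SDF and points on a slightly shifted sphere.
    sphere = lambda p: p.norm(dim=-1) - 0.5
    pts = 0.5 * torch.nn.functional.normalize(torch.randn(200, 3), dim=-1)
    pts = pts + torch.tensor([0.02, -0.01, 0.0])
    print(refine_pose(sphere, pts, torch.zeros(6)))        # translation near the shift

    In the full method this pose refinement would be interleaved with online updates to the neural field itself, with vision and touch jointly entering a pose graph optimization.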
  3. We consider the problem of sequential robotic manipulation of deformable objects using tools. Previous works have shown that differentiable physics simulators provide gradients to the environment state and help trajectory optimization to converge orders of magnitude faster than model-free reinforcement learning algorithms for deformable object manipulation. However, such gradient-based trajectory optimization typically requires access to the full simulator states and can only solve short-horizon, single-skill tasks due to local optima. In this work, we propose a novel framework, named DiffSkill, that uses a differentiable physics simulator for skill abstraction to solve long-horizon deformable object manipulation tasks from sensory observations. In particular, we first obtain short-horizon skills using individual tools from a gradient-based optimizer, using the full state information in a differentiable simulator; we then learn a neural skill abstractor from the demonstration trajectories which takes RGBD images as input. Finally, we plan over the skills by finding the intermediate goals and then solve long-horizon tasks. We show the advantages of our method in a new set of sequential deformable object manipulation tasks compared to previous reinforcement learning algorithms and compared to the trajectory optimizer. 
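    The entry above relies on gradient-based trajectory optimization through a differentiable simulator; the toy sketch below (hypothetical, with a trivial point-mass step function standing in for the soft-body simulator and invented names) shows the basic pattern of backpropagating a final-state loss through a rollout to update the whole action sequence.

    import torch

    def rollout(state, actions, step_fn):
        # step_fn must be written in torch so gradients flow back to the actions.
        for a in actions:
            state = step_fn(state, a)
        return state

    def optimize_actions(state0, goal, step_fn, horizon=20, iters=200, lr=0.1):
        """Short-horizon skill optimization: adjust the action sequence so the
        final simulated state reaches the goal."""
        actions = torch.zeros(horizon, 2, requires_grad=True)
        opt = torch.optim.Adam([actions], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            loss = ((rollout(state0, actions, step_fn) - goal) ** 2).sum()
            loss.backward()
            opt.step()
        return actions.detach()

    # Toy "simulator": a 2-D point nudged by each action.
    step = lambda s, a: s + 0.1 * a
    plan = optimize_actions(torch.zeros(2), torch.tensor([1.0, -0.5]), step)
    print(rollout(torch.zeros(2), plan, step))             # final state near the goal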
  4. Humans can learn to manipulate new objects by simply watching others; providing robots with the ability to learn from such demonstrations would enable a natural interface specifying new behaviors. This work develops Robot See Robot Do (RSRD), a method for imitating articulated object manipulation from a single monocular RGB human demonstration given a single static multi-view object scan. We first propose 4D Differentiable Part Models (4D-DPM), a method for recovering 3D part motion from a monocular video with differentiable rendering. This analysis-by-synthesis approach uses part-centric feature fields in an iterative optimization which enables the use of geometric regularizers to recover 3D motions from only a single video. Given this 4D reconstruction, the robot replicates object trajectories by planning bimanual arm motions that induce the demonstrated object part motion. By representing demonstrations as part-centric trajectories, RSRD focuses on replicating the demonstration's intended behavior while considering the robot's own morphological limits, rather than attempting to reproduce the hand's motion. We evaluate 4D-DPM's 3D tracking accuracy on ground truth annotated 3D part trajectories and RSRD's physical execution performance on 9 objects across 10 trials each on a bimanual YuMi robot. Each phase of RSRD achieves an average of 87% success rate, for a total end-to-end success rate of 60% across 90 trials. Notably, this is accomplished using only feature fields distilled from large pretrained vision models, without any task-specific training, fine-tuning, dataset collection, or annotation.
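    The analysis-by-synthesis step in the entry above can be caricatured as fitting per-frame rigid motions of a part so that its transformed points explain the observations, with a temporal smoothness term standing in for the geometric regularizers. The sketch below is a heavily simplified, hypothetical version (translations only, no differentiable rendering or feature fields; recover_part_translations and all constants are invented).

    import torch

    def recover_part_translations(part_pts, observed, iters=300, lr=1e-2, w_smooth=0.1):
        """part_pts: (N, 3) canonical part points; observed: (T, N, 3) per-frame
        observations. Returns per-frame translations explaining the motion."""
        T = observed.shape[0]
        trans = torch.zeros(T, 3, requires_grad=True)
        opt = torch.optim.Adam([trans], lr=lr)
        for _ in range(iters):
            opt.zero_grad()
            pred = part_pts.unsqueeze(0) + trans.unsqueeze(1)      # (T, N, 3)
            data = ((pred - observed) ** 2).mean()
            smooth = ((trans[1:] - trans[:-1]) ** 2).mean()        # temporal regularizer
            (data + w_smooth * smooth).backward()
            opt.step()
        return trans.detach()

    # Toy usage: a part sliding along x over 10 frames with a little noise.
    part = torch.randn(50, 3)
    gt = torch.linspace(0.0, 0.5, 10).unsqueeze(-1) * torch.tensor([1.0, 0.0, 0.0])
    obs = part.unsqueeze(0) + gt.unsqueeze(1) + 0.01 * torch.randn(10, 50, 3)
    print(recover_part_translations(part, obs)[-1])                # roughly (0.5, 0, 0)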
  5. This paper presents methods for improved teleoperation in dynamic environments in which the objects to be manipulated are moving, but vision may not meet size, biocompatibility, or maneuverability requirements. In such situations, the object could be tracked through non-geometric means, such as heat, radioactivity, or other markers. In order to safely explore a region, we use an optical time-of-flight pretouch sensor to detect (and range) target objects prior to contact. Information from these sensors is presented to the user via haptic virtual fixtures. This combination of techniques allows the teleoperator to “feel” the object without an actual contact event between the robot and the target object. Thus it provides the perceptual benefits of touch interaction to the operator, without incurring the negative consequences of the robot contacting unknown geometrical structures; premature contact can lead to damage or unwanted displacement of the target. The authors propose that as the geometry of the scene transitions from completely unknown to partially explored, haptic virtual fixtures can both prevent collisions and guide the user towards areas of interest, thus improving exploration speed. Experimental results show that for situations that are not amenable to vision, haptically-presented pretouch sensor information allows operators to more effectively explore moving objects. 
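    A hypothetical sketch of the kind of mapping the entry above depends on (fixture_force and every constant here are purely illustrative): a time-of-flight pretouch range reading is converted into a resistive virtual-fixture force that ramps up as the tool approaches the target, letting the operator "feel" the object before any contact occurs.

    def fixture_force(range_m, d_safe=0.05, stiffness=200.0, f_max=5.0):
        """Map a pretouch range reading (meters) to a resistive force (newtons):
        zero outside the safety margin, then a capped linear ramp inside it."""
        if range_m >= d_safe:
            return 0.0
        return min(f_max, stiffness * (d_safe - range_m))

    # Example readings as the tool closes in on a moving target.
    for r in (0.10, 0.05, 0.03, 0.01):
        print(f"range {r:.2f} m -> force {fixture_force(r):.1f} N")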