In this paper we learn to segment hands and hand-held objects from motion. Our system takes a single RGB image and hand location as input to segment the hand and hand-held object. For learning, we generate responsibility maps that show how well a hand’s motion explains other pixels’ motion in video. We use these responsibility maps as pseudo-labels to train a weakly-supervised neural network using an attention-based similarity loss and contrastive loss. Our system outperforms alternate methods, achieving good performance on the 100DOH, EPIC-KITCHENS, and HO3D datasets.
more »
« less
MOVES: Manipulated Objects in Video Enable Segmentation
Our method uses manipulation in video to learn to understand held-objects and hand-object contact. We train a system that takes a single RGB image and produces a pixel-embedding that can be used to answer grouping questions (do these two pixels go together) as well as hand-association questions (is this hand holding that pixel). Rather than painstakingly annotate segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given people segmentations, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held object tasks.
more »
« less
- Award ID(s):
- 2006619
- PAR ID:
- 10469274
- Publisher / Repository:
- CVPR
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Full surround 3D imaging for shape acquisition is essential for generating digital replicas of real-world objects. Surrounding an object we seek to scan with a kaleidoscope, that is, a configuration of multiple planar mirrors, produces an image of the object that encodes information from a combinatorially large number of virtual viewpoints. This information is practically useful for the full surround 3D reconstruction of the object, but cannot be used directly, as we do not know what virtual viewpoint each image pixel corresponds---the pixel label. We introduce a structured light system that combines a projector and a camera with a kaleidoscope. We then prove that we can accurately determine the labels of projector and camera pixels, for arbitrary kaleidoscope configurations, using the projector-camera epipolar geometry. We use this result to show that our system can serve as a multi-view structured light system with hundreds of virtual projectors and cameras. This makes our system capable of scanning complex shapes precisely and with full coverage. We demonstrate the advantages of the kaleidoscopic structured light system by scanning objects that exhibit a large range of shapes and reflectances.more » « less
-
The state-of-art three-dimensional (3D) shape measurement with digital fringe projection (DFP) techniques assume that the influence of projector pixel shape is negligible. However, our research reveals that when the camera pixel size is much smaller than the projector pixel size in object space (e.g., 1/5), the shape of projector pixel can play a critical role on ultimate measurement quality. This paper evaluates the performance of two shapes of projector pixels: rectangular and diamond shaped. Both simulation and experimental results demonstrated that when the camera pixel size is significantly smaller than the projector pixel size, it is advantageous for ultrahigh resolution 3D shape measurement system to use a projector with rectangular-shaped pixels than a projector with diamond-shaped pixels.more » « less
-
We introduce VISOR, a new dataset of pixel annotations and a benchmark suite for segmenting hands and active objects in egocentric video. VISOR annotates videos from EPIC-KITCHENS, which comes with a new set of challenges not encountered in current video segmentation datasets. Specifically, we need to ensure both short- and long-term consistency of pixel-level annotations as objects undergo transformative interactions, e.g. an onion is peeled, diced and cooked - where we aim to obtain accurate pixel-level annotations of the peel, onion pieces, chopping board, knife, pan, as well as the acting hands. VISOR introduces an annotation pipeline, AI-powered in parts, for scalability and quality. In total, we publicly release 272K manual semantic masks of 257 object classes, 9.9M interpolated dense masks, 67K hand-object relations, covering 36 hours of 179 untrimmed videos. Along with the annotations, we introduce three challenges in video object segmentation, interaction understanding and long-term reasoning. For data, code and leaderboards: http://epic-kitchens.github.io/VISORmore » « less
-
ADARE-HD: Adaptive-Resolution Framework for Efficient Object Detection and Tracking via HD-ComputingEfficient and low-energy camera signal processing is critical for battery-supported sensing and surveillance applications. In this research, we develop a video object detection and tracking framework which adaptively down-samples frame pixels to minimize computation and memory costs, and thereby the energy consumed, while maintaining a high level of accuracy. Instead of always operating with the highest sensor pixel resolution (compute-intensive), video frame (pixel) content is down-sampled spatially, to adapt to changing camera environments (size of object tracked, peak-signal-tonoise- ratio (i.e, PSNR) of video frames). Object detection and tracking is supported by a novel video resolution-aware adaptive hyperdimensional computing framework. This leverages a low memory overhead non-linear hypervector encoding scheme specifically tailored for handling multiple degrees of resolution. Previous classification decisions of a moving object based on its tracking label are used to improve tracking robustness. Energy savings of up to 1.6 orders of magnitude and up to an order of magnitude compute speedup is obtained on a range of experiments performed on benchmark systems.more » « less
An official website of the United States government

