Learning sensorimotor control policies from high-dimensional images crucially relies on the quality of the underlying visual representations. Prior works show that structured latent spaces such as visual keypoints often outperform unstructured representations for robotic control. However, most of these representations, whether structured or unstructured, are learned in a 2D space even though the control tasks are usually performed in a 3D environment. In this work, we propose a framework to learn such a 3D geometric structure directly from images in an end-to-end unsupervised manner. The input images are embedded into latent 3D keypoints via a differentiable encoder which is trained to optimize both a multi-view consistency loss and a downstream task objective. The discovered 3D keypoints tend to meaningfully capture robot joints as well as object movements in a manner that is consistent across both time and 3D space. The proposed approach outperforms prior state-of-the-art methods across a variety of reinforcement learning benchmarks. Code and videos at https://buoyancy99.github.io/unsup-3d-keypoints/.
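To make the multi-view consistency idea concrete, the following is a minimal NumPy sketch, not the paper's actual loss or encoder: each view's predicted 3D keypoints are mapped into a shared world frame using known camera extrinsics, and the loss penalizes disagreement across views. All function names here are hypothetical illustrations.

```python
import numpy as np

def world_keypoints(kp_cam, R, t):
    """Map 3D keypoints from one camera's frame to the world frame.

    kp_cam: (K, 3) keypoints in camera coordinates.
    R: (3, 3) camera-to-world rotation; t: (3,) translation.
    """
    return kp_cam @ R.T + t

def multiview_consistency_loss(kp_views, extrinsics):
    """Mean squared deviation of each view's world-frame keypoints
    from their cross-view mean; zero when all views agree exactly.

    kp_views: list of (K, 3) arrays, one per camera view.
    extrinsics: list of (R, t) pairs matching kp_views.
    """
    world = np.stack([world_keypoints(kp, R, t)
                      for kp, (R, t) in zip(kp_views, extrinsics)])
    mean = world.mean(axis=0, keepdims=True)  # consensus keypoints
    return float(((world - mean) ** 2).mean())
```

In the paper's setting this term would be differentiated through the keypoint encoder alongside the reinforcement-learning objective; the sketch above only illustrates the geometric consistency check itself.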