Title: Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.
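The abstract does not include implementation details, but the core training signal can be illustrated with a short sketch: track a patch feature forward through a few frames and back again using soft (differentiable) nearest-neighbor attention, then penalize the drift between the starting location and the round-trip location. All names, tensor shapes, and the soft-argmax tracker below are illustrative assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a temporal cycle-consistency loss, assuming per-frame
# feature grids from some backbone (shapes and names are illustrative).
import torch
import torch.nn.functional as F

def soft_track(query, feat_grid, coords):
    """One soft tracking step: attend from `query` (C,) into `feat_grid`
    (C, H, W) and return the attended feature and expected 2-D location."""
    C, H, W = feat_grid.shape
    flat = feat_grid.view(C, H * W)                       # (C, HW)
    affinity = F.softmax(query @ flat / C ** 0.5, dim=0)  # (HW,)
    new_xy = affinity @ coords                            # expected (x, y)
    new_query = flat @ affinity                           # attended feature (C,)
    return new_query, new_xy

def cycle_consistency_loss(frames_feats, start_xy, start_query, coords):
    """Track a patch feature forward through the frames and back again;
    penalize drift between the start location and the round-trip location."""
    query, xy = start_query, start_xy
    for feat in list(frames_feats) + list(reversed(frames_feats)):
        query, xy = soft_track(query, feat, coords)
    return F.mse_loss(xy, start_xy)

# Toy usage with random features (C=64 channels, 8x8 grid, 3 frames).
C, H, W, T = 64, 8, 8, 3
ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
coords = torch.stack([xs, ys], dim=-1).float().view(-1, 2)   # (HW, 2)
feats = [torch.randn(C, H, W, requires_grad=True) for _ in range(T)]
loss = cycle_consistency_loss(feats, torch.tensor([3.0, 4.0]),
                              torch.randn(C), coords)
loss.backward()
```

At test time the same features can be used without any loss: labels (masks, keypoints) are simply propagated to the nearest neighbors in feature space across frames.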
Feng, Z.; Tu, M.; Xia, R.; Wang, Y.; Krishnamurthy, A. (IEEE International Conference on Big Data)
Humans understand videos from both the visual and audio aspects of the data. In this work, we present a self-supervised cross-modal representation approach for learning audio-visual correspondence (AVC) for videos in the wild. After the learning stage, we explore retrieval in both a cross-modal and an intra-modal manner with the learned representations. We verify our experimental results on the VGGSound dataset, and our approach achieves promising results.
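The abstract does not specify the training objective; a common way to learn such audio-visual correspondence is a cross-modal contrastive loss over paired clips, sketched below. The encoders, embedding size, and temperature are placeholders, not the architecture described in the abstract.

```python
# Hedged sketch of cross-modal audio-visual correspondence training with an
# InfoNCE-style objective over paired audio/video embeddings.
import torch
import torch.nn.functional as F

def avc_contrastive_loss(audio_emb, video_emb, temperature=0.07):
    """Pull matching audio/video clips together, push mismatched pairs apart.
    Both inputs are (batch, dim) embeddings from separate encoders."""
    a = F.normalize(audio_emb, dim=1)
    v = F.normalize(video_emb, dim=1)
    logits = a @ v.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(a.size(0))         # i-th audio matches i-th video
    # Symmetric loss: audio-to-video and video-to-audio retrieval directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# After training, cross-modal or intra-modal retrieval reduces to nearest
# neighbors by cosine similarity between a query and a gallery of embeddings.
audio = torch.randn(16, 128)
video = torch.randn(16, 128)
print(avc_contrastive_loss(audio, video).item())
```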
Rashid, Maheen; Broome, Sofia; Ask, Katrina; Hernlund, Elin; Andersen, Pia Haubro; Kjellstrom, Hedvig; Lee, Yong Jae (Winter Conference on Applications of Computer Vision (WACV))
Timely detection of horse pain is important for equine welfare. Horses express pain through their facial and body behavior, but may hide signs of pain from unfamiliar human observers. In addition, collecting visual data with detailed annotation of horse behavior and pain state is both cumbersome and not scalable. Consequently, a pragmatic equine pain classification system would use video of the unobserved horse and weak labels. This paper proposes such a method for equine pain classification, using multi-view surveillance video footage of unobserved horses with induced orthopaedic pain and temporally sparse video-level pain labels. To ensure that pain is learned from horse body language alone, we first train a self-supervised generative model to disentangle horse pose from its appearance and background before using the disentangled pose latent representation for pain classification. To make the best use of the pain labels, we develop a novel loss that formulates pain classification as a multi-instance learning problem. Our method achieves a pain classification accuracy of 60%, exceeding human expert performance. The learned latent horse pose representation is shown to be viewpoint covariant and disentangled from horse appearance. Qualitative analysis of pain-classified segments shows correspondence between the pain symptoms identified by our model and the equine pain scales used in veterinary practice.
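The exact loss is not given in the abstract; the sketch below shows a generic multi-instance learning (MIL) objective of the kind described, where each video is a bag of clip-level scores supervised only by a single weak video-level label. The top-k aggregation is a common MIL choice and an assumption here, not necessarily the paper's formulation.

```python
# Hedged sketch of a MIL objective for weakly labeled video.
import torch
import torch.nn.functional as F

def mil_video_loss(clip_logits, video_label, k=3):
    """clip_logits: (num_clips,) raw pain scores for one video.
    video_label: scalar 0/1 weak label for the whole video."""
    topk = clip_logits.topk(min(k, clip_logits.numel())).values
    video_logit = topk.mean()                 # bag-level prediction
    return F.binary_cross_entropy_with_logits(video_logit, video_label)

# Toy usage: 20 clips from one video weakly labeled as "pain".
clip_logits = torch.randn(20, requires_grad=True)
loss = mil_video_loss(clip_logits, torch.tensor(1.0))
loss.backward()
```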
The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.
Chen, Y. (ECCV: European Conference on Computer Vision). Avidan, S. (Ed.)
Despite the success of fully-supervised human skeleton sequence modeling, utilizing self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scales is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), to explicitly capture spatial, short-term, and long-term temporal dependencies at frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks including action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves the state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability for different downstream tasks.
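The frame / clip / video hierarchy described above can be sketched with plain transformer encoder layers as stand-ins; the layer sizes, pooling choices, and pre-training objectives below are assumptions, not the Hi-TRS specifics.

```python
# Rough sketch of hierarchical skeleton encoding at frame, clip, and video levels.
import torch
import torch.nn as nn

class HierarchicalSkeletonEncoder(nn.Module):
    def __init__(self, joint_dim=3, d_model=64, clip_len=8):
        super().__init__()
        self.clip_len = clip_len
        self.embed = nn.Linear(joint_dim, d_model)
        enc = lambda: nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.spatial = enc()     # attends over joints within a frame
        self.short_term = enc()  # attends over frames within a clip
        self.long_term = enc()   # attends over clips within a video

    def forward(self, skel):                 # skel: (B, T, J, joint_dim), T divisible by clip_len
        B, T, J, _ = skel.shape
        x = self.embed(skel)                                 # (B, T, J, d)
        x = self.spatial(x.view(B * T, J, -1)).mean(1)       # frame tokens (B*T, d)
        x = x.view(B, T, -1)
        C = T // self.clip_len                               # number of clips
        x = x.view(B * C, self.clip_len, -1)
        x = self.short_term(x).mean(1).view(B, C, -1)        # clip tokens (B, C, d)
        return self.long_term(x).mean(1)                     # video token (B, d)

# Toy usage: batch of 2 videos, 16 frames, 25 joints, 3-D coordinates.
model = HierarchicalSkeletonEncoder()
print(model(torch.randn(2, 16, 25, 3)).shape)   # torch.Size([2, 64])
```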
Gesel, Paul; Sojib, Noushad; Begum, Momotaz (2023 IEEE/RSJ International Conference on Intelligent Robots and Systems)
In this paper, we propose a novel network architecture for visual imitation learning that exploits neural radiance fields (NeRFs) and keypoint correspondence for self-supervised visual motor policy learning. The proposed network architecture incorporates a dynamic system output layer for policy learning. Combining the stability and goal-adaptation properties of dynamic systems with the robustness of keypoint-based correspondence yields a policy that is invariant to significant clutter, occlusions, changes in lighting conditions, and spatial variations in goal configurations. Experiments on multiple manipulation tasks show that our method outperforms comparable visual motor policy learning methods in both in-distribution and out-of-distribution scenarios when using a small number of training samples.
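The abstract only names a "dynamic system output layer"; one generic instance of such a layer is a stable first-order attractor whose goal is predicted from keypoint features, sketched below. This is an illustrative formulation under that assumption, not the paper's exact architecture.

```python
# Illustrative dynamic-system output layer: the network predicts a goal from
# keypoint features, and a stable attractor generates motion toward it.
import torch
import torch.nn as nn

class DynamicSystemPolicy(nn.Module):
    def __init__(self, keypoint_dim=32, state_dim=3, gain=2.0):
        super().__init__()
        self.goal_head = nn.Linear(keypoint_dim, state_dim)  # goal from keypoints
        self.gain = gain                                      # attractor stiffness

    def forward(self, keypoint_feat, state):
        """Return a velocity command x_dot = gain * (goal - x); the state
        converges to the predicted goal regardless of where it starts."""
        goal = self.goal_head(keypoint_feat)
        return self.gain * (goal - state)

# Toy rollout: integrate the velocity command with a small time step.
policy = DynamicSystemPolicy()
feat, state = torch.randn(1, 32), torch.zeros(1, 3)
for _ in range(50):
    state = state + 0.05 * policy(feat, state)
```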
Wang, Xiaolong; Jabri, Allan; Efros, Alexei A. Learning Correspondence from the Cycle-Consistency of Time. IEEE Conference on Computer Vision and Pattern Recognition. https://par.nsf.gov/biblio/10111415
@article{osti_10111415,
title = {Learning Correspondence from the Cycle-Consistency of Time},
url = {https://par.nsf.gov/biblio/10111415},
abstractNote = {We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as a free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.},
journal = {IEEE Conference on Computer Vision and Pattern Recognition},
author = {Wang, Xiaolong and Jabri, Allan and Efros, Alexei A.},
}