Title: Learning Correspondence from the Cycle-Consistency of Time
We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.
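As a rough illustration of the training signal described above, the sketch below (a minimal, assumption-laden version, not the authors' released code) soft-tracks a patch backward through a short clip and then forward again, and penalizes the distance between the starting location and where the cycle ends. The encoder, the temperature, and the attention-based tracker are placeholder choices for illustration.

# Minimal sketch (not the authors' exact implementation) of a
# cycle-consistency-in-time loss. Assumes an encoder mapping frames
# to spatial feature maps of shape (B, C, H, W).

import torch
import torch.nn.functional as F

def soft_track(query, feat_map, temperature=0.07):
    """Soft-match a query feature against a target feature map.

    query:    (B, C)       feature of the patch being tracked
    feat_map: (B, C, H, W) features of the frame we track into
    Returns the affinity-weighted feature and expected (x, y) location.
    """
    B, C, H, W = feat_map.shape
    flat = feat_map.flatten(2)                          # (B, C, H*W)
    att = torch.einsum('bc,bcn->bn', F.normalize(query, dim=1),
                       F.normalize(flat, dim=1)) / temperature
    att = att.softmax(dim=1)                            # (B, H*W)
    tracked = torch.einsum('bn,bcn->bc', att, flat)     # new query feature
    # expected location under the attention distribution
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2).float().to(att.device)
    loc = att @ grid                                    # (B, 2)
    return tracked, loc

def cycle_loss(frames, encoder, start_loc):
    """Track a patch backward through `frames`, then forward again, and
    penalize the distance between the start and the re-tracked location."""
    feats = [encoder(f) for f in frames]                # list of (B, C, H, W)
    B, C, H, W = feats[0].shape
    # feature of the starting patch (grid cell nearest to start_loc)
    x0, y0 = start_loc[:, 0].long(), start_loc[:, 1].long()
    query = feats[-1][torch.arange(B), :, y0, x0]       # start at the last frame
    for f in reversed(feats[:-1]):                      # track backward in time
        query, _ = soft_track(query, f)
    loc = None
    for f in feats[1:]:                                 # track forward again
        query, loc = soft_track(query, f)
    return F.mse_loss(loc, start_loc.float())

Soft attention keeps every tracking step differentiable, so the cycle error at the end can be backpropagated into the feature encoder; this is the sense in which cycle-consistency acts as a free supervisory signal.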
Feng, Z; Tu, M; Xia, R; Wang, Y; Krishnamurthy, A
(IEEE International Conference on Big Data)
Humans understand videos from both the visual and audio aspects of the data. In this work, we present a self-supervised cross-modal representation approach for learning audio-visual correspondence (AVC) for videos in the wild. After the learning stage, we explore both cross-modal and intra-modal retrieval with the learned representations. We verify our experimental results on the VGGSound dataset and our approach achieves promising results.
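As a generic illustration of the audio-visual correspondence objective and the retrieval setting described above (the encoders, the InfoNCE formulation, and the temperature are assumptions, not the authors' exact method):

# Generic sketch of an audio-visual correspondence (AVC) objective and
# retrieval by cosine similarity; not the paper's specific architecture.

import torch
import torch.nn.functional as F

def avc_contrastive_loss(audio_emb, video_emb, temperature=0.1):
    """InfoNCE over a batch: the audio and visual clips from the same video
    are positives, all other pairings in the batch are negatives."""
    a = F.normalize(audio_emb, dim=1)         # (B, D)
    v = F.normalize(video_emb, dim=1)         # (B, D)
    logits = a @ v.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # symmetric loss: audio->video and video->audio
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def retrieve(query_emb, gallery_emb, k=5):
    """Return indices of the top-k gallery items for each query; works for
    cross-modal (audio query, video gallery) or intra-modal retrieval."""
    sims = F.normalize(query_emb, dim=1) @ F.normalize(gallery_emb, dim=1).t()
    return sims.topk(k, dim=1).indices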
Rashid, Maheen; Broome, Sofia; Ask, Katrina; Hernlund, Elin; Andersen, Pia Haubro; Kjellstrom, Hedvig; Lee, Yong Jae
(Winter Conference on Applications of Computer Vision (WACV))
Timely detection of horse pain is important for equine welfare. Horses express pain through their facial and body behavior, but may hide signs of pain from unfamiliar human observers. In addition, collecting visual data with detailed annotation of horse behavior and pain state is both cumbersome and not scalable. Consequently, a pragmatic equine pain classification system would use video of the unobserved horse and weak labels. This paper proposes such a method for equine pain classification, using multi-view surveillance video footage of unobserved horses with induced orthopaedic pain and temporally sparse video-level pain labels. To ensure that pain is learned from horse body language alone, we first train a self-supervised generative model to disentangle horse pose from its appearance and background, and then use the disentangled horse pose latent representation for pain classification. To make the best use of the pain labels, we develop a novel loss that formulates pain classification as a multi-instance learning problem. Our method achieves a pain classification accuracy of 60%, exceeding human expert performance. The learned latent horse pose representation is shown to be viewpoint covariant and disentangled from horse appearance. Qualitative analysis of pain-classified segments shows correspondence between the pain symptoms identified by our model and the equine pain scales used in veterinary practice.
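The multi-instance learning framing mentioned above can be sketched generically as follows; the pooling choice and the binary cross-entropy objective are illustrative assumptions, not the paper's specific loss.

# Generic multi-instance learning (MIL) sketch for weak, video-level labels:
# each clip in a video is an instance, the video-level score is pooled over
# clips, and only the video label supervises training.

import torch
import torch.nn.functional as F

def mil_video_loss(clip_logits, video_label, pooling='max'):
    """clip_logits: (num_clips,) per-clip pain logits for one video.
    video_label: scalar 0/1 weak label for the whole video."""
    if pooling == 'max':
        video_logit = clip_logits.max()
    else:  # smooth alternative: log-sum-exp pooling
        video_logit = torch.logsumexp(clip_logits, dim=0)
    return F.binary_cross_entropy_with_logits(video_logit, video_label)

# usage sketch
clip_logits = torch.randn(12, requires_grad=True)   # e.g. 12 clips in one video
loss = mil_video_loss(clip_logits, torch.tensor(1.0))
loss.backward()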
The joint analysis of audio and video is a powerful tool that can be applied to various contexts, including action, speech, and sound recognition, audio-visual video parsing, emotion recognition in affective computing, and self-supervised training of deep learning models. Solving these problems often involves tackling core audio-visual tasks, such as audio-visual source localization, audio-visual correspondence, and audio-visual source separation, which can be combined in various ways to achieve the desired results. This paper provides a review of the literature in this area, discussing the advancements, history, and datasets of audio-visual learning methods for various application domains. It also presents an overview of the reported performances on standard datasets and suggests promising directions for future research.
Hernandez, Jefferson; Villegas, Ruben; Ordonez, Vicente
(European Conference on Computer Vision (ECCV), Springer, Cham)
We propose ViC-MAE, a model that combines Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global representation obtained by pooling the local features learned under an MAE reconstruction loss and using this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer-learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when trained on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When training on videos and images from diverse datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming only a close second to the best supervised method.
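A minimal sketch of the combination described above, assuming placeholder encoder/decoder modules and mean pooling (the actual ViC-MAE pooling, masking strategy, and loss weighting are not reproduced here):

# Sketch: local patch features from an MAE-style encoder are pooled into a
# global vector, which is trained with a contrastive objective across two
# views (e.g. two frames of a video), while the reconstruction loss is kept.

import torch
import torch.nn.functional as F

def vic_mae_style_step(encoder, decoder, view1, view2, target_patches,
                       temperature=0.1):
    # 1) MAE branch: reconstruct masked patches of view1
    patch_feats1 = encoder(view1)                 # (B, N, D) local patch features
    recon = decoder(patch_feats1)                 # (B, N, P) reconstructed patches
    recon_loss = F.mse_loss(recon, target_patches)

    # 2) Contrastive branch: global representations via mean pooling
    g1 = F.normalize(patch_feats1.mean(dim=1), dim=1)       # (B, D)
    g2 = F.normalize(encoder(view2).mean(dim=1), dim=1)     # (B, D)
    logits = g1 @ g2.t() / temperature
    targets = torch.arange(g1.size(0), device=g1.device)
    contrastive_loss = F.cross_entropy(logits, targets)

    return recon_loss + contrastive_loss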
Gesel, Paul; Sojib, Noushad; Begum, Momotaz
(2023 IEEE/RSJ International Conference on Intelligent Robots and Systems)
In this paper, we propose a novel network architecture for visual imitation learning that exploits neural radiance fields (NeRFs) and keypoint correspondence for self-supervised visual motor policy learning. The proposed network architecture incorporates a dynamic system output layer for policy learning. Combining the stability and goal-adaptation properties of dynamic systems with the robustness of keypoint-based correspondence yields a policy that is invariant to significant clutter, occlusions, changes in lighting conditions, and spatial variations in goal configurations. Experiments on multiple manipulation tasks show that our method outperforms comparable visual motor policy learning methods in both in-distribution and out-of-distribution scenarios when using a small number of training samples.
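For intuition about the dynamic-system output layer mentioned above, the sketch below shows a generic first-order goal attractor: the commanded velocity always points toward a predicted goal configuration (e.g. derived from keypoint correspondences), which is what gives such layers their stability and goal-adaptation behaviour. This is an illustration only, not the paper's architecture.

# Generic first-order attractor: x_dot = gain * (goal - x).

import numpy as np

def attractor_step(state, goal, gain=2.0, dt=0.01):
    """One Euler integration step of the attractor dynamics."""
    velocity = gain * (goal - state)
    return state + dt * velocity

# usage sketch: the state converges to the goal regardless of where it starts
state = np.array([0.0, 0.0, 0.0])
goal = np.array([0.3, -0.1, 0.5])            # e.g. predicted goal configuration
for _ in range(1000):
    state = attractor_step(state, goal)
print(np.allclose(state, goal, atol=1e-3))   # True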
Wang, Xiaolong, Jabri, Allan, and Efros, Alexei A. Learning Correspondence from the Cycle-Consistency of Time. Retrieved from https://par.nsf.gov/biblio/10111415. IEEE Conference on Computer Vision and Pattern Recognition.
Wang, Xiaolong, Jabri, Allan, & Efros, Alexei A. Learning Correspondence from the Cycle-Consistency of Time. IEEE Conference on Computer Vision and Pattern Recognition. Retrieved from https://par.nsf.gov/biblio/10111415.
Wang, Xiaolong, Jabri, Allan, and Efros, Alexei A.
"Learning Correspondence from the Cycle-Consistency of Time". IEEE Conference on Computer Vision and Pattern Recognition (). Country unknown/Code not available. https://par.nsf.gov/biblio/10111415.
@article{osti_10111415,
place = {Country unknown/Code not available},
title = {Learning Correspondence from the Cycle-Consistency of Time},
url = {https://par.nsf.gov/biblio/10111415},
abstractNote = {We introduce a self-supervised method for learning visual correspondence from unlabeled video. The main idea is to use cycle-consistency in time as free supervisory signal for learning visual representations from scratch. At training time, our model optimizes a spatial feature representation to be useful for performing cycle-consistent tracking. At test time, we use the acquired representation to find nearest neighbors across space and time. We demonstrate the generalizability of the representation across a range of visual correspondence tasks, including video object segmentation, keypoint tracking, and optical flow. Our approach outperforms previous self-supervised methods and performs competitively with strongly supervised methods. Overall, we find that the learned representation generalizes surprisingly well, despite being trained only on indoor videos and without fine-tuning.},
journal = {IEEE Conference on Computer Vision and Pattern Recognition},
author = {Wang, Xiaolong and Jabri, Allan and Efros, Alexei A.},
}