Title: ViC-MAE: Self-supervised Representation Learning from Images and Video with Contrastive Masked Autoencoders
We propose ViC-MAE, a model that combines both Masked AutoEncoders (MAE) and contrastive learning. ViC-MAE is trained using a global representation obtained by pooling the local features learned under an MAE reconstruction loss and using this representation under a contrastive objective across images and video frames. We show that visual representations learned under ViC-MAE generalize well to video and image classification tasks. In particular, ViC-MAE obtains state-of-the-art transfer learning performance from video to images on ImageNet-1k compared to the recently proposed OmniMAE, achieving a top-1 accuracy of 86% (+1.3% absolute improvement) when trained on the same data and 87.1% (+2.4% absolute improvement) when training on extra data. At the same time, ViC-MAE outperforms most other methods on video benchmarks, obtaining 75.9% top-1 accuracy on the challenging Something-Something-v2 video benchmark. When training on videos and images from diverse datasets, our method maintains a balanced transfer-learning performance between video and image classification benchmarks, coming only as a close second to the best supervised method.
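A minimal sketch of the training signal described in the abstract above: patch features from a masked-autoencoder backbone are mean-pooled into a global vector, and two frames of the same video (or two views of an image) are pulled together with an InfoNCE-style contrastive loss. All names, dimensions, and the temperature value below are illustrative assumptions, and the MAE pixel-reconstruction term is omitted; this is not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PooledContrastiveHead(nn.Module):
    def __init__(self, dim: int, proj_dim: int = 128, temperature: float = 0.1):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, proj_dim))
        self.temperature = temperature

    def forward(self, tokens_a: torch.Tensor, tokens_b: torch.Tensor) -> torch.Tensor:
        # tokens_*: (batch, num_visible_patches, dim) produced by the MAE encoder
        z_a = F.normalize(self.proj(tokens_a.mean(dim=1)), dim=-1)  # global pooled view A
        z_b = F.normalize(self.proj(tokens_b.mean(dim=1)), dim=-1)  # global pooled view B
        logits = z_a @ z_b.t() / self.temperature                   # pairwise similarities
        targets = torch.arange(z_a.size(0), device=z_a.device)      # positives on the diagonal
        return F.cross_entropy(logits, targets)

# Toy usage: random "patch tokens" standing in for MAE encoder outputs of two frames.
head = PooledContrastiveHead(dim=768)
loss = head(torch.randn(8, 49, 768), torch.randn(8, 49, 768))
# In ViC-MAE this contrastive term is combined with the usual MAE reconstruction loss.
```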
Award ID(s): 2201710
PAR ID: 10630373
Publisher / Repository: European Conference on Computer Vision (ECCV), Springer, Cham
ISBN: 978-3-031-73234-8
Format(s): Medium: X
Location: Milan, Italy
Sponsoring Org: National Science Foundation
More Like this
  1. Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem (Ed.)
    We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0 mAP absolute gain in performance on Pascal VOC classification, and a +22.1 top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
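A minimal sketch of the idea behind CLIP-Lite's objective as described above: a Jensen-Shannon-style lower bound on the mutual information between image and text embeddings can be estimated with a single mismatched caption per image. The discriminator architecture, dimensions, and the shift-by-one negative sampling are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PairDiscriminator(nn.Module):
    """Scores how well an image embedding matches a text embedding."""
    def __init__(self, img_dim: int, txt_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(img_dim + txt_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([img, txt], dim=-1)).squeeze(-1)

def jsd_mi_loss(disc: PairDiscriminator, img: torch.Tensor, txt: torch.Tensor) -> torch.Tensor:
    # Positives: aligned (image_i, caption_i); negatives: captions rolled by one position,
    # i.e. exactly one mismatched caption per image.
    pos = disc(img, txt)
    neg = disc(img, torch.roll(txt, shifts=1, dims=0))
    # Negative of a Jensen-Shannon lower bound on I(image; text); minimizing tightens the bound.
    return F.softplus(-pos).mean() + F.softplus(neg).mean()

# Toy usage with random embeddings standing in for the image and text encoder outputs.
disc = PairDiscriminator(img_dim=256, txt_dim=256)
loss = jsd_mi_loss(disc, torch.randn(16, 256), torch.randn(16, 256))
```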
  2. Molecular representation learning is vital for various downstream applications, including the analysis and prediction of molecular properties and side effects. While Graph Neural Networks (GNNs) have been a popular framework for modeling molecular data, they often struggle to capture the full complexity of molecular representations. In this paper, we introduce a novel method called Gode, which accounts for the dual-level structure inherent in molecules. Molecules possess an intrinsic graph structure and simultaneously function as nodes within a broader molecular knowledge graph. Gode integrates individual molecular graph representations with multi-domain biochemical data from knowledge graphs. By pre-training two GNNs on different graph structures and employing contrastive learning, Gode effectively fuses molecular structures with their corresponding knowledge graph substructures. This fusion yields a more robust and informative representation, enhancing molecular property predictions by leveraging both chemical and biological information. When fine-tuned across 11 chemical property tasks, our model significantly outperforms existing benchmarks, achieving an average ROC-AUC improvement of 12.7% for classification tasks and an average RMSE/MAE improvement of 34.4% for regression tasks. Notably, Gode surpasses the current leading model in property prediction, with advancements of 2.2% in classification and 7.2% in regression tasks. 
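A minimal sketch of the cross-view alignment step described above: given an embedding of a molecule's own graph and an embedding of its knowledge-graph neighborhood (each produced by its own pre-trained GNN, which is omitted here), a symmetric contrastive loss pulls the two views of the same molecule together. The loss form, temperature, and dimensions are illustrative assumptions rather than the Gode paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def symmetric_contrastive(mol_emb: torch.Tensor, kg_emb: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    mol = F.normalize(mol_emb, dim=-1)   # molecular-graph GNN embeddings, (batch, dim)
    kg = F.normalize(kg_emb, dim=-1)     # knowledge-graph substructure embeddings, (batch, dim)
    logits = mol @ kg.t() / temperature
    targets = torch.arange(mol.size(0), device=mol.device)
    # Matching is enforced in both directions: molecule -> KG view and KG view -> molecule.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Toy usage with random vectors standing in for the two GNN outputs.
loss = symmetric_contrastive(torch.randn(32, 128), torch.randn(32, 128))
```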
  3. Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High resolution models perform well, but train slowly. Low resolution models train faster, but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5× faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to baseline training. Code is available online.
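A minimal sketch of the multigrid scheduling idea described above: the mini-batch shape (clips, frames, spatial size) is cycled over training, and the batch size and learning rate are scaled up whenever the other dimensions shrink so that per-iteration compute stays roughly constant. The specific grid fractions and cycling rule below are illustrative assumptions, not the paper's published schedule.

```python
from dataclasses import dataclass

@dataclass
class BatchShape:
    clips: int      # mini-batch size (number of clips)
    frames: int     # temporal resolution
    crop: int       # spatial resolution (pixels per side)

def multigrid_schedule(base: BatchShape, base_lr: float, epoch: int) -> tuple[BatchShape, float]:
    # Cycle coarse -> medium -> fine; per-clip cost scales as frames * crop^2,
    # so the batch multiplier compensates to keep per-iteration compute roughly constant.
    grids = [(8, 0.5, 0.5), (2, 0.5, 1.0), (1, 1.0, 1.0)]  # (batch multiplier, frame frac, crop frac)
    b_mult, f_frac, c_frac = grids[epoch % len(grids)]
    shape = BatchShape(
        clips=base.clips * b_mult,
        frames=max(1, int(base.frames * f_frac)),
        crop=int(base.crop * c_frac),
    )
    # Linear learning-rate scaling with the effective batch size.
    return shape, base_lr * b_mult

# Toy usage: print the mini-batch shape and learning rate used in the first few epochs.
for epoch in range(4):
    shape, lr = multigrid_schedule(BatchShape(clips=8, frames=32, crop=224), base_lr=0.1, epoch=epoch)
    print(epoch, shape, lr)
```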
  4. Arguably one of the top success stories of deep learning is transfer learning. The finding that pre-training a network on a rich source set (e.g., ImageNet) can help boost performance once fine-tuned on a usually much smaller target set has been instrumental to many applications in language and vision. Yet, very little is known about its usefulness in 3D point cloud understanding. We see this as an opportunity considering the effort required for annotating data in 3D. In this work, we aim at facilitating research on 3D representation learning. Different from previous works, we focus on high-level scene understanding tasks. To this end, we select a suite of diverse datasets and tasks to measure the effect of unsupervised pre-training on a large source set of 3D scenes. Our findings are extremely encouraging: using a unified triplet of architecture, source dataset, and contrastive loss for pre-training, we achieve improvement over recent best results in segmentation and detection across 6 different benchmarks for indoor and outdoor, real and synthetic datasets – demonstrating that the learned representation can generalize across domains. Furthermore, the improvement was similar to supervised pre-training, suggesting that future efforts should favor scaling data collection over more detailed annotation. We hope these findings will encourage more research on unsupervised pretext task design for 3D deep learning.
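The abstract above names a contrastive loss for 3D pre-training without spelling it out; a common point-level formulation, shown here purely as an assumed illustration, contrasts per-point features from two augmented views of the same scene using known point correspondences. The feature extractor is omitted and the tensors below are stand-ins.

```python
import torch
import torch.nn.functional as F

def point_info_nce(feat_a: torch.Tensor, feat_b: torch.Tensor, matches: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    # feat_a, feat_b: (num_points, dim) per-point features from the two augmented views.
    # matches: (num_pairs, 2) indices such that feat_a[matches[:, 0]] and
    # feat_b[matches[:, 1]] describe the same physical point.
    a = F.normalize(feat_a[matches[:, 0]], dim=-1)
    b = F.normalize(feat_b[matches[:, 1]], dim=-1)
    logits = a @ b.t() / temperature          # every other matched point serves as a negative
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

# Toy usage: 512 random "point features" per view and 128 sampled correspondences.
matches = torch.stack([torch.randperm(512)[:128], torch.randperm(512)[:128]], dim=1)
loss = point_info_nce(torch.randn(512, 32), torch.randn(512, 32), matches)
```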
  5. It is a common practice to think of a video as a sequence of images (frames), and re-use deep neural network models that are trained only on images for similar analytics tasks on videos. In this paper, we show that this “leap of faith”, that deep learning models that work well on images will also work well on videos, is actually flawed. We show that even when a video camera is viewing a scene that is not changing in any human-perceptible way, and we control for external factors like video compression and environment (lighting), the accuracy of video analytics applications fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually, but are perceived quite differently by the video analytics applications. We observed that the root cause for these fluctuations is the dynamic camera parameter changes that a video camera automatically makes in order to capture and produce a visually pleasing video. The camera inadvertently acts as an “unintentional adversary” because these slight changes in the image pixel values in consecutive frames, as we show, have a noticeably adverse impact on the accuracy of insights from video analytics tasks that re-use image-trained deep learning models. To address this inadvertent adversarial effect from the camera, we explore the use of transfer learning techniques to improve learning in video analytics tasks through the transfer of knowledge from learning on image analytics tasks. Our experiments with a number of different cameras, and a variety of different video analytics tasks, show that the inadvertent adversarial effect from the camera can be noticeably offset by quickly re-training the deep learning models using transfer learning. In particular, we show that our newly trained YOLOv5 model reduces fluctuation in object detection across frames, which leads to better tracking of objects (∼40% fewer mistakes in tracking). Our paper also provides new directions and techniques to mitigate the camera’s adversarial effect on deep learning models used for video analytics applications.
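A minimal sketch of the mitigation described above: quickly re-training an image-trained detector on a small set of frames from the deployed camera, keeping the generic backbone frozen and adapting only the detection heads. The paper fine-tunes YOLOv5; torchvision's Faster R-CNN is substituted here only as a self-contained stand-in, and the "camera-specific" images and labels below are synthetic placeholders.

```python
import torch
import torchvision

# Pretrained, image-trained detector (generic stand-in for the paper's YOLOv5).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
for p in model.backbone.parameters():     # keep generic image features frozen,
    p.requires_grad = False               # adapt only the detection heads
model.train()

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=1e-3, momentum=0.9)

# Stand-in "camera-specific" frames with one labeled box each; in practice these would
# be a modest number of frames captured by the target camera under its auto settings.
images = [torch.rand(3, 480, 640) for _ in range(2)]
targets = [{"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]), "labels": torch.tensor([1])} for _ in images]

for _ in range(3):                        # a short fine-tuning run
    loss_dict = model(images, targets)    # torchvision returns a dict of losses in train mode
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```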