Learning Multimodal Representations for Unseen Activities

Piergiovanni, A.J.; Ryoo, Michael S.

doi:10.1109/WACV45572.2020.9093612

Citation Details

We present a method to learn a joint multimodal representation space that enables recognition of unseen activities in videos. We first compare the effect of placing various constraints on the embedding space using paired text and video data. We also propose a method to improve the joint embedding space using an adversarial formulation, allowing it to benefit from unpaired text and video data. By using unpaired text data, we show the ability to learn a representation that better captures unseen activities. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that using paired and unpaired data to learn a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning, outperforming the state of the art.

Award ID(s): 1812943 1814985

PAR ID: 10183758

Author(s) / Creator(s): Piergiovanni, A.J.; Ryoo, Michael S.

Date Published: 2020-03-01

Journal Name: IEEE Winter Conference on Applications of Computer Vision (WACV)

Page Range / eLocation ID: 506 to 515

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/WACV45572.2020.9093612