ReCLIP: A Strong Zero-Shot Baseline for Referring Expression Comprehension

Subramanian, Sanjay; Merrill, William; Darrell, Trevor; Gardner, Matt; Singh, Sameer; Rohrbach, Anna

doi:10.18653/v1/2022.acl-long.357


Training a referring expression comprehension (ReC) model for a new visual domain requires collecting referring expressions, and potentially corresponding bounding boxes, for images in the domain. While large-scale pre-trained models are useful for image classification across domains, it remains unclear if they can be applied in a zero-shot manner to more complex tasks like ReC. We present ReCLIP, a simple but strong zero-shot baseline that repurposes CLIP, a state-of-the-art large-scale model, for ReC. Motivated by the close connection between ReC and CLIP’s contrastive pre-training objective, the first component of ReCLIP is a region-scoring method that isolates object proposals via cropping and blurring, and passes them to CLIP. However, through controlled experiments on a synthetic dataset, we find that CLIP is largely incapable of performing spatial reasoning off-the-shelf. We reduce the gap between zero-shot baselines from prior work and supervised models by as much as 29% on RefCOCOg, and on RefGTA (video game imagery), ReCLIP’s relative improvement over supervised ReC models trained on real images is 8%.
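
The region-scoring idea described in the abstract (isolate each proposal by cropping and by blurring, then score each isolated view against the expression) can be illustrated with a minimal sketch. This is not the authors' implementation. The grayscale list-of-rows image format, the box blur, the summing of the two scores, and the `similarity` callable (a stand-in for an image-text model such as CLIP) are all assumptions made for illustration.

```python
from math import exp
from typing import Callable, List, Sequence, Tuple

Image = List[List[float]]        # assumed format: grayscale rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def crop(image: Image, box: Box) -> Image:
    """Isolate a proposal by keeping only the pixels inside its box."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def blur_outside(image: Image, box: Box, radius: int = 1) -> Image:
    """Isolate a proposal by box-blurring every pixel outside its box."""
    left, top, right, bottom = box
    height, width = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if left <= x < right and top <= y < bottom:
                continue  # pixels inside the proposal stay sharp
            ys = range(max(0, y - radius), min(height, y + radius + 1))
            xs = range(max(0, x - radius), min(width, x + radius + 1))
            window = [image[j][i] for j in ys for i in xs]
            out[y][x] = sum(window) / len(window)
    return out


def score_proposals(
    image: Image,
    boxes: Sequence[Box],
    expression: str,
    similarity: Callable[[Image, str], float],
) -> List[float]:
    """Return a softmax distribution over proposals.

    Assumption: the crop score and the blur score are simply summed.
    `similarity` is a hypothetical image-text scorer supplied by the caller.
    """
    logits = [
        similarity(crop(image, box), expression)
        + similarity(blur_outside(image, box), expression)
        for box in boxes
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [weight / total for weight in weights]
```

In use, `similarity` would wrap a real image-text model, and the predicted referent would be the box with the highest probability. The sketch covers only region scoring and does not attempt any spatial reasoning.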

Award ID(s): 1817183

PAR ID: 10462883

Author(s) / Creator(s): Subramanian, Sanjay; Merrill, William; Darrell, Trevor; Gardner, Matt; Singh, Sameer; Rohrbach, Anna

Date Published: 2022-01-01

Journal Name: Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Page Range / eLocation ID: 5198 to 5215

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.18653/v1/2022.acl-long.357
