<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>CLIP-Lite: Information Efficient Visual Representation Learning with Language Supervision</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10432824</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of The 26th International Conference on Artificial Intelligence and Statistics</title>
<idno></idno>
<biblScope unit="volume">206</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Aman Shrivastava</author><author>Ramprasaath R. Selvaraju</author><author>Nikhil Naik</author><author>Vicente Ordonez</author><author>Francisco Ruiz</author><author>Jennifer Dy</author><author>Jan-Willem van de Meent</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose CLIP-Lite, an information efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information efficient lower-bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0 mAP absolute gain in performance on Pascal VOC classification, and a +22.1 top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Pretraining image classification networks on the ImageNet dataset has led to visual representations that transfer to other tasks <ref type="bibr">(Girshick et al., 2014;</ref><ref type="bibr">Long et al., 2015;</ref><ref type="bibr">Vinyals et al., 2015;</ref><ref type="bibr">Antol et al., 2015;</ref><ref type="bibr">Zhu et al., 2016)</ref>. However, such classification-based pretraining requires a large amount of human-annotated data, which is hard to obtain at scale. In contrast, captioned image data is an information-dense source of supervision that is relatively cheap to collect and plentiful on the internet. Therefore, recent methods have used joint vision-language pretraining to learn representations from image-caption pairs <ref type="bibr">(Desai and Johnson, 2021;</ref><ref type="bibr">Sariyildiz et al., 2020)</ref>. However, methods such as VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref>, which train on complex language modeling tasks such as masked language modeling, token classification, and captioning, fail to align features in a common latent space.</p><p>In contrast, CLIP <ref type="bibr">(Radford et al., 2021)</ref> learns aligned visual and textual representations with a simple contrastive objective over matching image-caption pairs. However, contrastive learning in vision-language pretraining still has some limitations: it seems to be most effective only with large-scale data, and it requires a large number of negative image-caption pairs during training. Our work addresses these two limitations by proposing CLIP-Lite, an information-efficient variation of CLIP that is useful even in smaller data regimes, does not rely on as many negative sample pairs during training, and provides comparable or superior performance on standard benchmarks against other methods trained at the same scale. 
Our work is motivated by the observation that many contrastive objectives maximize a lower bound on the mutual information between two or more views of the same datum <ref type="bibr">(Wu et al., 2020)</ref>. In particular, CLIP maximizes the mutual information between the image and its caption using a lower bound based on InfoNCE <ref type="bibr">(Oord et al., 2018)</ref>. The InfoNCE bound has seen wide adoption due to its favorable properties such as stability and low variance. However, the bound is theoretically loose when the true mutual information is larger than log K, where K - 1 is the number of negative samples used for training. The negative pairs can be randomly sampled, but a large number of negative pairs is usually required to obtain a good estimate of the mutual information between the two input streams, hence the need for rather large batch sizes <ref type="bibr">(Bachman et al., 2019;</ref><ref type="bibr">Chen et al., 2015)</ref> or memory banks <ref type="bibr">(Chen et al., 2020b;</ref><ref type="bibr">Tian et al., 2019;</ref><ref type="bibr">He et al., 2020)</ref>.</p><p>We instead adopt a lower bound based on the Jensen-Shannon divergence to maximize the mutual information <ref type="bibr">(Hjelm et al., 2018;</ref><ref type="bibr">Nowozin et al., 2016)</ref>, thus requiring no more than one negative example pair for each positive example pair. This reduces the number of negative examples in a training batch to O(n), where n is the batch size.</p><p>In contrast, CLIP uses O(n^2) negative example pairs per batch. Figure <ref type="figure">2</ref> (right) illustrates this difference. We implement this strategy and thoroughly demonstrate the efficacy of CLIP-Lite through experiments on several tasks and datasets at various scales. Our method demonstrates impressive data efficiency and is able to outperform CLIP trained on the entire COCO-Captions dataset while training on only 20% of the same dataset. 
We also demonstrate that CLIP-Lite is a good source of pretrained features by showing good generalization on Pascal VOC and ImageNet classification. We also show that the visual feature backbone of CLIP-Lite can be finetuned on the iNaturalist dataset to match the top performance on this benchmark among caption-supervised pretraining methods. Furthermore, we show that CLIP-Lite yields good visual features for image retrieval compared to regular CLIP trained on COCO Captions. We also demonstrate that CLIP-Lite enables the removal of concepts from visual representations, which we show can be applied to bias mitigation. Our work extends and complements prior work on contrastive learning; in particular, it addresses the computational requirements of the original CLIP model in terms of memory overhead by minimizing the number of negative image-text pairs required during training, and it shows effectiveness in smaller data regimes, including zero-shot learning on CIFAR-10, image-text retrieval, and unsupervised object localization.</p></div>
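To make the counting argument above concrete, here is a minimal illustrative sketch (not part of the original implementation) of how many negative pairs each objective consumes per batch:

```python
# Illustrative sketch: negative image-text pairs per batch of n matched
# image-caption pairs. CLIP's InfoNCE objective contrasts every image with
# all non-matching captions (O(n^2) negatives), while CLIP-Lite's JSD
# objective needs only one negative pair per positive pair (O(n)).

def clip_negatives(n: int) -> int:
    # each of the n images is contrasted with the (n - 1) captions
    # that do not belong to it
    return n * (n - 1)

def clip_lite_negatives(n: int) -> int:
    # a single mismatched caption per image
    return n

batch = 1024
print(clip_negatives(batch), clip_lite_negatives(batch))
```

At a batch size of 1024, this is roughly a thousandfold difference in the number of negative pairs scored per training step.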
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Our work is related to several strands of research on visual pretraining without full supervision.</p><p>Vision-Language Pretraining: Research on learning visual representations from textual labels or annotations has a long history. In <ref type="bibr">(Quattoni et al., 2007)</ref>, the authors learn data-efficient image representations using manifold learning in the weight space of classifiers trained to predict tokens in image captions. Following this work, <ref type="bibr">(Joulin et al., 2016)</ref> used convolutional neural networks to predict words in image captions to learn image representations. This approach was later extended in <ref type="bibr">(Lei Ba et al., 2015)</ref>, where the model learns to predict phrase n-grams, which demonstrated impressive zero-shot performance on downstream classification tasks. Recently, VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref> used proxy language modeling tasks, such as image captioning, to train a visual encoder and a transformer-based language decoder that generates captions. ICMLM <ref type="bibr">(Sariyildiz et al., 2020</ref>) demonstrated a similar masked language modeling approach but relied on pretrained textual encoders for generating textual features. In <ref type="bibr">(Stroud et al., 2020)</ref>, video representations are learned using paired textual metadata; however, the method does not extend to visual pretraining for images. In general, these methods distill the rich semantic information from a caption into the visual representation by learning to predict each token in the caption given the corresponding image. More recent work, such as CLIP <ref type="bibr">(Radford et al., 2021)</ref>, has shown that a simpler contrastive objective for aligning image and caption pairs is also able to learn a powerful visual representation. 
Our work extends CLIP using a more information-efficient approach.</p><p>Contrastive Representation Learning and Mutual Information Estimation: As demonstrated in <ref type="bibr">(Wu et al., 2020)</ref>, contrastive frameworks learn by maximizing the mutual information (MI) between different views of a given data point. For images, this is achieved by maximizing the MI between different augmentations of the data, as in SimCLR <ref type="bibr">(Chen et al., 2020a;</ref><ref type="bibr">Bachman et al., 2019)</ref>. For sequential data such as conversational text, consecutive utterances can be treated as different views <ref type="bibr">(Stratos, 2018)</ref>. Similarly, several other contrastive frameworks have been proposed that learn representations in domains such as images <ref type="bibr">(Grill et al., 2020;</ref><ref type="bibr">Caron et al., 2020)</ref>, text <ref type="bibr">(Mikolov et al., 2013;</ref><ref type="bibr">Stratos, 2018)</ref>, graphs <ref type="bibr">(Veli&#269;kovi&#263; et al., 2018)</ref>, and videos <ref type="bibr">(Jabri et al., 2020)</ref>. The value of mutual information is extremely challenging to estimate, especially for the high-dimensional continuous representations used in deep learning. To this end, various tractable lower bounds on mutual information are used for optimization. Recently, MINE <ref type="bibr">(Belghazi et al., 2018)</ref> proposed a general-purpose parameterized neural estimator of mutual information that uses a Donsker-Varadhan <ref type="bibr">(Donsker and Varadhan, 1983)</ref> representation of the KL-divergence as the lower bound, with a neural-network critic to distinguish positive and negative pairs of samples. Another popular bound on mutual information, which has seen wide adoption due to its low variance, is the InfoNCE <ref type="bibr">(Oord et al., 2018)</ref> bound. 
In <ref type="bibr">(Hjelm et al., 2018)</ref>, the InfoNCE bound on the mutual information is used for unsupervised representation learning, and several other methods use it for self-supervised representation learning on images <ref type="bibr">(Chen et al., 2020a)</ref>. The capacity of the bound is limited by the number of contrastive samples used <ref type="bibr">(McAllester and Stratos, 2020)</ref>. Additionally, InfoNCE can severely underestimate the true MI when it is large, which is generally the case for high-dimensional representations of natural images. To this end, DeepInfoMax <ref type="bibr">(Hjelm et al., 2018)</ref> proposed a lower bound on mutual information based on the Jensen-Shannon Divergence (JSD) instead of the traditional KL-divergence (KLD). The authors show that the JSD-based lower bound is stable, differentiable, and can be optimized with just one negative sample. Inspired by this, we extend the use of this bound to vision-language pretraining and demonstrate its effectiveness through extensive experimental evaluations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">CLIP-Lite</head><p>Given a dataset of image-caption pairs, the goal of our pretraining framework is to train an image encoder and a text encoder such that the representations learned from the visual and textual streams share maximum information (Figure <ref type="figure">2</ref> shows an overview). Consider an image encoder network f_i with parameters &#952;_i and a textual encoder f_t with parameters &#952;_t. Let (x_i, x_t) be a sampled image-caption pair from the dataset, and let f_i(x_i) and f_t(x_t) denote the representations extracted from the networks. Based on the information bottleneck principle <ref type="bibr">(Tishby and Zaslavsky, 2015)</ref>, the maximum mutual information (MI) predictive coding framework <ref type="bibr">(Oord et al., 2018;</ref><ref type="bibr">Hjelm et al., 2018;</ref><ref type="bibr">McAllester and Stratos, 2020)</ref> aims to learn representations that maximize the MI between inputs and representations.</p><p>In recent years, several methods <ref type="bibr">(Chen et al., 2020a;</ref><ref type="bibr">He et al., 2020;</ref><ref type="bibr">Bachman et al., 2019)</ref> have used this principle to maximize the MI between representations extracted from multiple views of a shared context. In the case of visual self-supervised learning, this is achieved by creating two independently augmented copies of the same input and maximizing the MI between the respective features produced by an encoder. This framework can be extended further by considering an image x_i and its caption x_t as distinct views of the same input. This setup is motivated by the observation that image captions contain rich semantic information about images, for instance, the presence of objects, their locations, their relative spatial configurations, etc. Distilling this information into our visual representation is useful for robust representation learning <ref type="bibr">(Radford et al., 2021)</ref>. 
To this end, we formulate our objective as follows:</p><p>(&#952;&#770;_i, &#952;&#770;_t) = arg max_{&#952;_i, &#952;_t} I(f_i(x_i), f_t(x_t)),</p><p>where I(f_i(x_i), f_t(x_t)) &#8804; I(x_i; x_t) due to the data processing inequality between the visual and textual streams.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Mutual Information Maximization</head><p>For given random variables y and z, their mutual information is defined as the Kullback-Leibler (KL) divergence between their joint distribution p(y, z) and the product of their marginal distributions p(y)p(z):</p><p>I(y; z) = D_KL(p(y, z) &#8214; p(y)p(z)).</p><p>However, mutual information is notoriously hard to estimate for high-dimensional continuous variables, especially when the distributions p(y, z), p(y), or p(z) are not explicitly known. As a result, recent approaches use various tractable lower bounds on the mutual information, which are differentiable and hence can be maximized with gradient-descent-based optimization. For contrastive learning, a commonly used bound is InfoNCE <ref type="bibr">(Oord et al., 2018)</ref>, based on Noise-Contrastive Estimation <ref type="bibr">(Gutmann and Hyv&#228;rinen, 2010)</ref>. This bound is relatively stable and has been shown to work in a wide variety of tasks <ref type="bibr">(Chen et al., 2020a;</ref><ref type="bibr">Bachman et al., 2019;</ref><ref type="bibr">Chen et al., 2020b)</ref>, including CLIP <ref type="bibr">(Radford et al., 2021)</ref>, which, similar to our method, aims to learn visual representations from textual annotations. The InfoNCE bound has seen wider adoption as it demonstrates lower variance compared to the Donsker-Varadhan bound <ref type="bibr">(Donsker and Varadhan, 1983)</ref>. 
However, both of these bounds require a large number of negative samples; as a result, recent methods either train with extremely large batch sizes <ref type="bibr">(Radford et al., 2021;</ref><ref type="bibr">Chen et al., 2020a)</ref> or maintain an additional memory bank of negative samples <ref type="bibr">(Chen et al., 2020b;</ref><ref type="bibr">Tian et al., 2020)</ref>.</p><p>Unlike these works, we estimate mutual information using a Jensen-Shannon Divergence (JSD) bound, similar to formulations used for generative modeling <ref type="bibr">(Nowozin et al., 2016)</ref> and source separation <ref type="bibr">(Brakel and Bengio, 2017)</ref>. This bound on mutual information is derived by replacing the KL-divergence in the definition above with the Jensen-Shannon divergence (see the appendix for further discussion). Interestingly, the lower bound derived in this way is stable, differentiable, monotonically related to the mutual information I(y; z), and, most importantly, not dependent on the number of negative samples. Hence we have,</p><p>I&#770;^(JSD)_&#969;(y; z) = E_{p(y,z)}[-sp(-T_&#969;(y, z))] - E_{p(y)p(z)}[sp(T_&#969;(y, z))],</p><p>where sp(x) = log(1 + e^x) is the softplus function and T_&#969; : Y &#215; Z &#8594; R is a discriminator neural network with trainable parameters &#969;, which are jointly optimized to distinguish between a sample from the joint distribution (a positive image-caption pair) and a sample from the product of marginals (a negative image-caption pair). Therefore, we are able to optimize our overall objective with just one negative sample as follows:</p><p>(&#952;&#770;_i, &#952;&#770;_t, &#969;&#770;) = arg max_{&#952;_i, &#952;_t, &#969;} I&#770;^(JSD)_&#969;(f_i(x_i), f_t(x_t)),</p><p>where the visual encoder is a convolutional neural network whose features are extracted from the pre-classification layer, and the textual encoder is parameterized by a neural network that takes the caption as a string of textual tokens and generates a one-dimensional representation.</p></div>
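As a minimal sketch of the estimator described above (an illustrative NumPy stand-in, not the trained model), the JSD lower bound can be computed directly from discriminator scores of positive and negative pairs:

```python
import numpy as np

# Sketch of the JSD-based mutual-information objective (after DeepInfoMax):
# I_JSD(y; z) is bounded below by
#   E_pos[-softplus(-T(y, z))] - E_neg[softplus(T(y, z))],
# where T scores image-text pairs. One negative per positive suffices.

def softplus(x):
    # numerically stable log(1 + exp(x))
    return np.logaddexp(0.0, x)

def jsd_mi_lower_bound(pos_scores, neg_scores):
    """Estimate the JSD lower bound on mutual information.

    pos_scores: discriminator scores T(y, z) for positive (joint) pairs
    neg_scores: scores for negative (product-of-marginals) pairs
    """
    return float(np.mean(-softplus(-pos_scores)) - np.mean(softplus(neg_scores)))

# A discriminator that scores positives high and negatives low
# yields a larger (tighter) estimate than an uninformative one:
good = jsd_mi_lower_bound(np.array([4.0, 5.0]), np.array([-4.0, -5.0]))
bad = jsd_mi_lower_bound(np.array([0.0, 0.0]), np.array([0.0, 0.0]))
assert good > bad
```

Maximizing this quantity with respect to the encoders and the discriminator tightens the bound, which is what drives the representations toward alignment.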
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we describe experiments that demonstrate the value of using textual captions for learning visual representations with CLIP-Lite. In our experiments, the CLIP-Lite architecture consists of a ResNet-50 image encoder and a BERT-base textual encoder, and it is trained on the COCO Captions <ref type="bibr">(Chen et al., 2015)</ref> dataset. We evaluate the robustness of our visual encoder through downstream tasks that use the visual encoder (1) as a frozen feature extractor or (2) as a source of weight initialization for finetuning <ref type="bibr">(ref. appendix)</ref>. In addition, we demonstrate the data efficiency of our method by evaluating performance on fractions of the training dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Architecture and Training Details</head><p>In all experiments, we use a standard ResNet-50 <ref type="bibr">(He et al., 2016)</ref> that takes in a 224 &#215; 224 image and generates 2048-dimensional features at the pre-logit layer. For textual encoding, we use a transformer <ref type="bibr">(Vaswani et al., 2017)</ref> model initialized from BERT-base <ref type="bibr">(Devlin et al., 2018)</ref> and use the output [CLS] token as the text representation. We use the COCO Captions dataset <ref type="bibr">(Chen et al., 2015)</ref>, which has 118K images with five captions per image. During training we apply (1) random cropping, (2) color jittering, (3) random horizontal flips, interchanging the words 'left' and 'right' in the caption, and (4) normalization using the ImageNet image mean. We use SGD with momentum 0.9 <ref type="bibr">(Sutskever et al., 2013;</ref><ref type="bibr">Polyak, 1964)</ref> and weight decay 10^-4, wrapped in LookAhead <ref type="bibr">(Zhang et al., 2019)</ref> with &#945; = 0.5 and 5 steps. We perform distributed training across 8 GPUs, with batch normalization <ref type="bibr">(Ioffe and Szegedy, 2015)</ref> per GPU, using an overall batch size of 1024 images for 250K iterations. We use linear learning rate warmup <ref type="bibr">(Goyal et al., 2019)</ref> for the first 10K iterations, followed by cosine decay <ref type="bibr">(Loshchilov and Hutter, 2016)</ref> to zero. Additionally, we train CLIP <ref type="bibr">(Radford et al., 2021)</ref> on the COCO dataset using an open-source implementation<ref type="foot">foot_0</ref> with the originally recommended <ref type="bibr">(Radford et al., 2021)</ref> training schedule, adjusted to suit smaller datasets, reasonable batch sizes, and our compute resources. 
Specifically, we train using the Adam optimizer <ref type="bibr">(Kingma and Ba, 2014)</ref> with decoupled weight decay regularization <ref type="bibr">(Loshchilov and Hutter, 2016)</ref> for all weights except gains and biases.</p><p>We train with a batch size of 1024, warm up to an initial learning rate of 10^-4 over 10K steps, and decay to zero with the cosine schedule. We found that performance improves slightly with longer training, so we train for 250K iterations, the same as for CLIP-Lite. All other training details and hyper-parameters were kept the same as in the original work <ref type="bibr">(Radford et al., 2021)</ref>. Note that our ResNet-50-based CLIP-COCO model outperforms (+1.2% zero-shot accuracy on CIFAR-10) the publicly available weights<ref type="foot">foot_1</ref>; refer to the appendix for further details on CLIP-COCO training.</p></div>
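The warmup-then-cosine learning-rate schedule described above can be sketched as follows; the constants mirror the reported setup (peak 10^-4, 10K warmup steps, 250K total iterations), but the function itself is only an illustration, not the training code:

```python
import math

# Sketch: linear warmup for the first `warmup` steps, then cosine decay
# to zero over the remaining (total - warmup) steps.

def lr_at(step, peak=1e-4, warmup=10_000, total=250_000):
    if step >= warmup:
        # cosine decay from peak down to zero after warmup
        progress = (step - warmup) / (total - warmup)  # in [0, 1]
        return peak * 0.5 * (1.0 + math.cos(math.pi * progress))
    return peak * step / warmup  # linear warmup

assert lr_at(0) == 0.0
assert lr_at(10_000) == 1e-4           # peak reached at the end of warmup
assert round(lr_at(250_000), 12) == 0.0  # decays to zero by the final step
```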
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Mutual Information Discriminator</head><p>As described in Section 3.1, our JSD-based lower bound on mutual information relies on a discriminator function, T_&#969; : Y &#215; Z &#8594; R, which distinguishes between samples extracted from the joint distribution P(Y, Z), i.e., a positive image-caption pair, and the product of marginals P(Y)P(Z), i.e., a negative image-caption pair. This discriminator function can be modelled as an arbitrary neural network with parameters &#969; that is jointly optimized with the encoders during training <ref type="bibr">(Belghazi et al., 2018)</ref>. In this work, we use a projection-and-alignment-based architecture similar to the one presented in Deep InfoMax <ref type="bibr">(Hjelm et al., 2018)</ref>.</p><p>Given a pair of one-dimensional input representations, both vectors are first projected using a projection module with two linear layers separated by a ReLU and a linear shortcut. A dot product of these projections is then computed to obtain alignment scores. The projection function maps the representations to an aligned cross-modal latent space; separate projection functions are used for image and text representations. Positive and negative pairs of image-text representations are passed through the discriminator to obtain their respective scores, which are then used to estimate and maximize mutual information with our objective. This architecture, in addition to being simple and computationally inexpensive, also aligns the representations in a common cross-modal latent space that uses cosine similarity as the distance metric.</p></div>
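The projection-and-alignment discriminator described above can be sketched as follows; the shapes, random initialization, and projection width are illustrative assumptions, not the trained model:

```python
import numpy as np

# Sketch of the discriminator: each modality is projected by two linear
# layers separated by a ReLU, plus a linear shortcut; the score T(y, z)
# is the dot product of the two projections.

rng = np.random.default_rng(0)

def make_projection(d_in, d_out):
    return {
        "w1": rng.standard_normal((d_in, d_out)) * 0.02,
        "w2": rng.standard_normal((d_out, d_out)) * 0.02,
        "w_short": rng.standard_normal((d_in, d_out)) * 0.02,  # linear shortcut
    }

def project(p, x):
    hidden = np.maximum(x @ p["w1"], 0.0)       # linear then ReLU
    return hidden @ p["w2"] + x @ p["w_short"]  # linear plus shortcut

def alignment_score(img_proj, txt_proj, img_feat, txt_feat):
    # dot product of the projections acts as the discriminator score
    return float(project(img_proj, img_feat) @ project(txt_proj, txt_feat))

img_proj = make_projection(2048, 256)  # image features: ResNet-50 pre-logits
txt_proj = make_projection(768, 256)   # text features: BERT [CLS] output
score = alignment_score(img_proj, txt_proj,
                        rng.standard_normal(2048), rng.standard_normal(768))
```

Because both modalities land in the same 256-dimensional space (an assumed width), the same projections can later be reused for cosine-similarity retrieval.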
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Transfer Learning with Frozen Backbone</head><p>In these experiments, we train linear models on frozen visual backbones pretrained with CLIP-Lite and compare against other pretraining methods on the PASCAL VOC <ref type="bibr">(Everingham et al., 2010)</ref> and ImageNet-1k <ref type="bibr">(Russakovsky et al., 2015)</ref> classification problems.</p><p>PASCAL VOC linear classification: For this experiment, our setup is identical to VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref>. We train on the VOC07 trainval split (9K images, 20 classes) and report mAP on the test split. For classification, we train per-class SVMs on 2048-dimensional global average pooled features extracted from the last layer of the visual backbone.</p><p>ImageNet-1k linear classification: For this experiment, our setup is identical to VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref>. We train on the ILSVRC 2012 train split and report top-1 accuracy on the val split. We train a linear classifier (fully connected layer + softmax) on 2048-dimensional global average pooled features extracted from the last layer of the visual backbone. For training, we use a batch size of 256 for 100 epochs. We use SGD with momentum 0.9 and weight decay 0. The learning rate is decayed by 0.1 after 60 and 80 epochs, with an initial LR of 30.</p><p>Results: We compare CLIP-Lite to supervised, self-supervised, and textually-supervised models in Table <ref type="table">1</ref>. CLIP-Lite significantly outperforms baseline CLIP on both tasks when trained with the same amount of data. When compared to other image-caption pretraining methods, CLIP-Lite performs competitively with VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref> on VOC2007 and outperforms both VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref> and ICMLM <ref type="bibr">(Sariyildiz et al., 2020)</ref>, which are trained on relatively complex language modeling tasks, on ImageNet classification. 
In addition, unlike these methods, ours also generates a shared latent space that encodes both image and text modalities and enables cheap computation of cross-modal alignment, which in turn enables additional downstream tasks such as zero-shot retrieval and zero-shot transfer. It also allows us to find subspaces associated with abstract concepts that are better expressed with language than with visual examples, which allows for applications in bias mitigation through the synthesis of gender-neutral image representations.</p></div>
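The frozen-backbone evaluation protocol above can be sketched on synthetic data; here a closed-form ridge classifier stands in for the per-class SVMs and the softmax classifier (an illustrative substitution, not the paper's exact setup):

```python
import numpy as np

# Sketch of linear-probe evaluation: fit one linear classifier on fixed
# "backbone" features and measure accuracy. Data is synthetic.

rng = np.random.default_rng(0)

def fit_linear_probe(features, labels, n_classes):
    # one-hot targets; closed-form ridge solution for the weight matrix
    one_hot = np.eye(n_classes)[labels]
    d = features.shape[1]
    return np.linalg.solve(features.T @ features + 1e-3 * np.eye(d),
                           features.T @ one_hot)

def probe_accuracy(w, features, labels):
    pred = np.argmax(features @ w, axis=1)
    return float(np.mean(pred == labels))

# two well-separated synthetic "frozen feature" clusters
feats = np.concatenate([rng.normal(0, 0.1, (50, 16)) + 1.0,
                        rng.normal(0, 0.1, (50, 16)) - 1.0])
labels = np.array([0] * 50 + [1] * 50)
w = fit_linear_probe(feats, labels, n_classes=2)
assert probe_accuracy(w, feats, labels) >= 0.99
```

The key point of the protocol is that only `w` is trained; the backbone features are never updated, so the score reflects representation quality.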
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Transfer Learning with Backbone Finetuning</head><p>Next, we evaluate the performance of our visual backbone when the entire network is finetuned for the downstream task. For this purpose, we perform fine-grained classification on the iNaturalist 2018 <ref type="bibr">(Van Horn et al., 2018)</ref> dataset, which contains images from 8,142 fine-grained categories with a long-tailed distribution. We train on the 'train2018' split and evaluate on the 'val2018' split. We finetune pretrained ResNet-50 models with a linear layer, using SGD with momentum 0.9 and weight decay 10^-4 for 100 epochs. The initial learning rate is set to 0.025 and is reduced by 10&#215; at epochs 70 and 90. We use a batch size of 256 distributed across 8 GPUs.</p><p>Results: We summarize our results in Table <ref type="table">3</ref>. CLIP-Lite is competitive with supervised and self-supervised models trained with images alone, even those trained with 5-10&#215; more images. Its performance closely matches that of a model trained with full supervision on 50% of the ImageNet <ref type="bibr">(Krizhevsky et al., 2012)</ref> dataset, equal to 5.4&#215; the number of images in our pretraining dataset. Finally, CLIP-Lite obtains a 1.3% improvement over CLIP-COCO, while being competitive with VirTex.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Image-Text and Text-Image Retrieval</head><p>Our method is expected to produce effective representations for image-text retrieval, as it is trained by aligning text and image representations. We evaluate the image-text retrieval capabilities of CLIP-Lite on the validation set of COCO and the test split of Flickr30k <ref type="bibr">(Young et al., 2014)</ref>, following CLIP. We perform zero-shot image-text and text-image retrieval by ranking image-text pairs by their alignment score, which is the dot product of the normalized representations in the shared latent space. This ability to perform zero-shot retrieval is a salient feature of our method and other CLIP-like methods over previously proposed works that rely on language modeling tasks.</p><p>Results: Table <ref type="table">4</ref> shows that CLIP-Lite substantially outperforms CLIP-COCO on all metrics for both text and image retrieval. The performance improvement is large both when evaluating on the COCO validation set, which is similar to the COCO-Captions training split used for CLIP-Lite training, and when testing zero-shot on the unseen text vocabulary and object categories of Flickr30K. Taken together, these results show that CLIP-Lite learns a representation superior to CLIP's for retrieval tasks when trained on the same amount of data.</p></div>
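The ranking procedure above can be sketched as follows; the random embeddings stand in for actual encoder outputs, and the dimensions are illustrative:

```python
import numpy as np

# Sketch of zero-shot retrieval: L2-normalize image and text embeddings
# from the shared space, score every pair by dot product (= cosine
# similarity), and rank candidates per query.

def normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(queries, gallery, k=5):
    scores = normalize(queries) @ normalize(gallery).T  # cosine similarities
    return np.argsort(-scores, axis=1)[:, :k]           # top-k indices per query

def recall_at_k(ranked, true_idx, k):
    hits = [true_idx[i] in ranked[i, :k] for i in range(len(true_idx))]
    return float(np.mean(hits))

rng = np.random.default_rng(0)
texts = rng.standard_normal((100, 64))
images = texts + 0.1 * rng.standard_normal((100, 64))  # matched pairs + noise
ranked = retrieve(texts, images, k=5)
assert recall_at_k(ranked, list(range(100)), k=1) >= 0.9  # easy synthetic data
```

R@1, R@5, and R@10 as reported in Table 4 are exactly this `recall_at_k` computed over the evaluation set.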
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Zero-Shot Transfer</head><p>Table <ref type="table">4</ref>: Retrieval Results: CLIP-Lite substantially outperforms CLIP-COCO and the baseline Visual N-grams <ref type="bibr">(Li et al., 2017)</ref> approach. CLIP-Lite is superior when evaluated on the COCO test split, which is similar to the CLIP-Lite training set, and on Flickr30K, generalizing to unseen images and text in a zero-shot manner.</p><p>We use the cross-modal alignment capability of CLIP-Lite to perform zero-shot classification on unseen datasets: CIFAR-10 and CIFAR-100 <ref type="bibr">(Krizhevsky et al., 2009)</ref>, ImageNetV2 <ref type="bibr">(Recht et al., 2019)</ref>, and ImageNet-A <ref type="bibr">(Hendrycks et al., 2021)</ref>. Our model generates a shared latent space where we can readily compute the alignment between given (image, text) pairs as the cosine similarity of their representations. Therefore, we use the names of the classes to generate a textual description of each class label (class prompt). In this experiment, we use templates such as "a photo of a {class name}" to generate the class prompts, following CLIP <ref type="bibr">(Radford et al., 2021)</ref>. Please refer to the appendix for a comparison between different templates for generating the prompts. For a given image, we compute its alignment with each of the class prompts, and the scores are normalized into a probability distribution via a softmax.</p><p>Results: Our results for the zero-shot transfer task on unseen datasets are compiled in Table <ref type="table">5</ref>. Given the zero-shot nature of the task, CLIP-Lite obtains satisfactory performance on the complex ImageNet evaluations while clearly outperforming CLIP trained with the same amount of data in all settings.</p></div>
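The prompt-based zero-shot procedure above can be sketched as follows; the embeddings are random stand-ins for the trained encoders, and the temperature value is an illustrative assumption (CLIP-style models typically learn it):

```python
import numpy as np

# Sketch of zero-shot classification: embed one prompt per class
# ("a photo of a {class name}"), compute cosine similarity with the image
# embedding, and softmax the scaled similarities into class probabilities.

def zero_shot_probs(image_emb, class_embs, temperature=0.07):
    img = image_emb / np.linalg.norm(image_emb)
    cls = class_embs / np.linalg.norm(class_embs, axis=1, keepdims=True)
    logits = cls @ img / temperature  # scaled cosine similarities
    logits = logits - logits.max()    # numerically stable softmax
    probs = np.exp(logits)
    return probs / probs.sum()

prompts = ["a photo of a {}".format(c) for c in ("airplane", "cat", "ship")]
rng = np.random.default_rng(0)
class_embs = rng.standard_normal((3, 64))         # stand-in prompt embeddings
image_emb = class_embs[1] + 0.1 * rng.standard_normal(64)  # resembles class 1
probs = zero_shot_probs(image_emb, class_embs)
assert int(np.argmax(probs)) == 1
```

No classifier is trained here: the class prompts themselves act as the classifier weights, which is what makes the evaluation zero-shot.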
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Evaluating Visual Grounding</head><p>Next, we evaluate the capability of CLIP-Lite to localize a region in the image that corresponds to a given textual description. We compute the dot product of the visual and textual embeddings and take its gradients with respect to the last convolutional layer of the ResNet. We global-average-pool these gradients, perform a weighted sum with the last convolutional activations, and clip the negative values to obtain Grad-CAM <ref type="bibr">(Selvaraju et al., 2017)</ref>. We then use the areas highlighted by Grad-CAM to approximate a predicted bounding box. We evaluate this experiment on the RefCOCO+ <ref type="bibr">(Yu et al., 2016)</ref> dataset. We note that the images in RefCOCO+ are extracted from the training set of the COCO <ref type="bibr">(Chen et al., 2015)</ref> dataset, which our model uses for pretraining. Therefore, we view this evaluation as an exploratory study to establish that our model focuses on the relevant areas of the image while computing the alignment score with the caption.</p><p>RefCOCO+ results are reported in the accompanying table. CLIP-Lite significantly outperforms CLIP in all settings. Qualitative results in Figure <ref type="figure">4</ref> demonstrate that even though the network has not been trained with any localization supervision, it is surprisingly good at localizing phrases in the image. For instance, in Figure <ref type="figure">4</ref> (bottom left), for the phrase "blue", the network attends to all blue regions in the player's outfit. Interestingly, it is also able to localize abstract concepts such as "blurry player".</p></div>
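The Grad-CAM computation described above can be sketched as follows; the arrays stand in for real activations and for the gradients of the alignment score, which in practice come from automatic differentiation:

```python
import numpy as np

# Sketch of Grad-CAM for visual grounding: global-average-pool the
# gradients of the image-text alignment score w.r.t. the last conv
# activations to get per-channel weights, take the weighted sum of the
# activation maps, and clip negatives (ReLU) to keep positive evidence.

def grad_cam(activations, gradients):
    """activations, gradients: arrays of shape (channels, H, W)."""
    weights = gradients.mean(axis=(1, 2))             # global average pool
    cam = np.tensordot(weights, activations, axes=1)  # weighted channel sum
    return np.maximum(cam, 0.0)                       # ReLU

rng = np.random.default_rng(0)
acts = rng.random((2048, 7, 7))         # ResNet-50 last conv feature map
grads = rng.standard_normal((2048, 7, 7))
cam = grad_cam(acts, grads)
assert cam.shape == (7, 7) and bool((cam >= 0).all())
```

The resulting 7x7 map is upsampled to image resolution, and its high-activation region is converted into the predicted bounding box.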
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.8">Editing Concepts from Image Representations</head><p>One salient feature of CLIP-like methods, which methods such as VirTex <ref type="bibr">(Desai and Johnson, 2021)</ref> and ICMLM <ref type="bibr">(Sariyildiz et al., 2020)</ref> lack, is that they generate a shared latent space that encodes both image and text modalities. This enables us to find representations and subspaces associated with abstract concepts that are better expressed with language than with visual examples. Using this property, we demonstrate a methodology for removing concepts from visual representations. For instance, it is non-trivial and even problematic to collect visual examples that capture the concept of gender, while it is relatively straightforward to express this concept in a sentence. Therefore, we can identify the gender subspace in our shared embedding space using text and remove variance along this direction to smooth out the concept of gender from image representations.</p><p>Figure <ref type="figure">4</ref>: Visual Grounding on RefCOCO+ (panel queries: "bending over", "child", "man", "girl", "blurry player", "blue", "red bus", "grand bazar blue"): CLIP-Lite is able to localize textual descriptions to relevant areas in the image, shown here through Grad-CAM visualizations using the alignment score with the mentioned textual description. Top left: CLIP-Lite is able to localize action phrases such as "bending over". This demonstrates the value of learning from semantically rich textual captions.</p><p>We motivate this experiment by the growing body of literature on bias mitigation, where the objective is to build representations that are invariant with respect to sensitive or protected attributes <ref type="bibr">(Wang et al., 2019;</ref><ref type="bibr">Wang et al., 2020)</ref>. 
In contrast to our approach, other methods obtain bias-invariant representations by retraining models with adversarial learning <ref type="bibr">(Wang et al., 2019)</ref> or by combining domain-independent classifiers <ref type="bibr">(Wang et al., 2020)</ref>.</p><p>Identifying the Concept Subspace: The first step of our approach is to isolate the direction in the embedding space that captures maximum gender variance. For this purpose, we follow a strategy similar to that of <ref type="bibr">Bolukbasi et al. (2016)</ref> for debiasing word representations. To characterize the male and female genders, we use word pairs such as (man, woman) and (son, daughter) that indicate opposite genders. Consider a dataset D = {(w_m, w_f)}_{i=1}^{m} where each entry (w_m, w_f) is a tuple of opposite-gendered words. Intuitively, each tuple should contain words that would have the same meaning if not for the target attribute. To make the set D more robust, we use the sentence-contextualization strategy of <ref type="bibr">Liang et al. (2020)</ref>. In this step, the predefined sets of gendered tokens in D are used to generate paired sentences that have the same meaning except for the gender attribute. We perform this contextualization with simple sentence templates such as "I am a [word]", where [word] is replaced with the word pairs in D to give, for instance, ("I am a boy.", "I am a girl."). We thus obtain a contextualized bias-attribute dataset S = {(s_m, s_f)}_{i=1}^{n} where each entry is a tuple of semantically similar sentences with opposite genders. We extract sentence representations for all entries in S by passing them through our pretrained text encoder and then projecting them to the shared latent space using the projector trained with our mutual information discriminator T_&#969;. We define sets R_m and R_f that contain the sentence representations of the male and female categories, e.g., R_m = {F_t(s_m)}_{i=1}^{n}, where F_t(.) is the sequential composition of our pretrained text-encoder and text-projection functions. We then estimate the gender subspace V = {v_1, ..., v_k} by applying Principal Component Analysis to the mean-shifted representations from both sets, as described in <ref type="bibr">(Liang et al., 2020)</ref>.</p><p>Removing the Concept from Image Representations: After estimating the gender subspace in our shared cross-modal latent space, we extend the hard-debias algorithm <ref type="bibr">(Bolukbasi et al., 2016)</ref> to edit visual representations: we project a representation onto the bias subspace and subtract this projection from the original representation to obtain the de-gendered representation. Given an image, we first encode it into our multi-modal shared latent space to get a vector h. Using the identified gender subspace V, we compute the projection of h onto V as h_V = &#8721;_{j=1}^{k} &#10216;h, v_j&#10217; v_j. We subtract this projection from the original representation to get a vector &#293; = h - h_V that is orthogonal to the bias subspace and therefore does not encode the target bias.</p><p>[Figure: top retrievals with gendered vs. neutralized representations for the prompts "A woman with a cellphone" and "A man in a store".]</p><table><head>Table 6: Concept Editing Results: We compute the mean alignment scores for the top 10 images queried using prompts that contain either male or female gendered tokens. The images are queried using gendered and neutralized representations. We observe that after gender-deletion the alignment scores for images with men and women converge to similar values.</head><row role="label"><cell/><cell cols="3">Images with Men</cell><cell cols="3">Images with Women</cell></row><row role="label"><cell/><cell>gendered</cell><cell>neutral</cell><cell>delta</cell><cell>gendered</cell><cell>neutral</cell><cell>delta</cell></row><row><cell>Male queries</cell><cell>0.085</cell><cell>0.069</cell><cell>+0.016</cell><cell>0.057</cell><cell>0.067</cell><cell>-0.010</cell></row><row><cell>Female queries</cell><cell>0.042</cell><cell>0.068</cell><cell>-0.026</cell><cell>0.089</cell><cell>0.062</cell><cell>+0.027</cell></row></table><p>Analysis: To evaluate concept editing, we use the gendered subset of COCO-Captions used for studying bias <ref type="bibr">(Wang et al., 2019;</ref><ref type="bibr">Zhao et al., 2017)</ref>. Gender labels for images in the COCO dataset are derived from the captions. We obtain a subset of the COCO dataset with 16,225 images of men and 6,601 images of women. We take 10 sentences with male references and 10 sentences with female references from the set S and use them as prompts. For each gendered prompt, we query the top 10 images, independently from the male and female image sets, using both biased and debiased representations to compute alignment with the prompt, and then compute the mean alignment score for each set. Table <ref type="table">6</ref> shows that the alignment scores roughly equalize for the two groups after removing the variance along the gender direction from the visual representations, indicating that the visual representations become invariant to gendered language tokens.</p></div>
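The two steps above, estimating a concept subspace from paired sentence embeddings and hard-debiasing an image representation, can be sketched as follows. This is a minimal NumPy sketch following Bolukbasi et al. (2016) and Liang et al. (2020); the function names and the use of SVD to perform PCA are our assumptions, not the authors' exact code.

```python
# Minimal sketch of concept-subspace estimation and hard debiasing.
import numpy as np

def concept_subspace(R_m, R_f, k=1):
    """R_m, R_f: (n, d) arrays of paired opposite-gender sentence embeddings."""
    # Mean-shift each pair about its midpoint so that only the concept
    # direction (not the shared sentence semantics) remains, then take
    # the top-k principal directions of the stacked differences.
    mu = (R_m + R_f) / 2.0
    diffs = np.vstack([R_m - mu, R_f - mu])             # (2n, d)
    _, _, Vt = np.linalg.svd(diffs, full_matrices=False)
    return Vt[:k]                                       # (k, d), orthonormal rows

def remove_concept(h, V):
    """Return h minus its projection onto V: h_hat = h - sum_j <h, v_j> v_j."""
    h_V = (V @ h) @ V   # projection of h onto the concept subspace
    return h - h_V
```

By construction, the returned vector is orthogonal to every direction in V, so alignment scores computed with it carry no variance along the estimated gender subspace.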
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Limitations and Broader Impacts</head><p>CLIP-Lite trains the visual encoder by maximizing the mutual information between images and their captions. Language supervision provides rich semantic density that can be distilled into visual representations: the visual encoder is encouraged to learn representations that encode maximal information about the captions. As a consequence, the visual encoder is only aware of concepts and objects that human annotators have mentioned in the captions, and therefore lags behind task-specific models trained for a given fine-grained task. For instance, visual encoders trained with CLIP-Lite struggle with downstream tasks that require fine-grained understanding, such as reading text or counting the number of objects in an image. In this work, we train CLIP-Lite on the COCO-Captions <ref type="bibr">(Chen et al., 2015)</ref> dataset, which has high-quality curated captions. However, when training on image-text pairs collected from the internet, the textual captions can be largely unfiltered and noisy. Because our method learns by aligning caption and image representations, the model is susceptible to learning harmful biases present in the captions. Hence, visual backbones trained with CLIP-Lite, or with any other pretraining method that uses natural language supervision, need to be analyzed specifically for such biases before deployment. In this work, we also present an approach to edit concepts out of visual representations using the shared vision-language latent space learnt by our method; for instance, we demonstrate this capability by editing visual representations so that they are invariant to gendered tokens in language. However, further exploration is required to develop this concept-editing mechanism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We introduced CLIP-Lite, an image-text model pretrained with a contrastive objective that differs from CLIP's and makes it more data-efficient. Because this objective is insensitive to the number of negative samples, CLIP-Lite can be trained with just one negative image-caption pair per positive pair, and it shows superior results in low-data regimes while retaining some of the most remarkable capabilities of the original CLIP model, such as transferable features, zero-shot capabilities, and a shared latent space. Additionally, we present a concept-editing methodology for neutralizing visual representations with respect to a chosen abstract concept. Please refer to the supplement for a detailed discussion of the limitations and potential impact of our approach.</p><p>For our CLIP baseline, we use distributed training across 8 GPUs with per-GPU batch normalization <ref type="bibr">(Ioffe and Szegedy, 2015)</ref> and an overall batch size of 1024. We warm up to the initial learning rate over 10K steps and decay it to zero with a cosine schedule. We found that a learning rate of 10⁻³ works slightly better (+1.4% on VOC07) than the originally recommended 5 &#215; 10⁻⁵, and that performance improves incrementally (+1.9% on VOC07) with longer training; we therefore train the baseline for 250K iterations, the same as our own model. All other training details and hyper-parameters were kept the same as in the original work <ref type="bibr">(Radford et al., 2021)</ref>. Please note that the ResNet-50-backed CLIP model we trained on the COCO dataset outperforms (+1.2% zero-shot accuracy on CIFAR10) the publicly available weights<ref type="foot" target="#foot_3">4</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://github.com/mlfoundations/open_clip</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://github.com/revantteotia/clip-training/blob/main/ zero_shot_eval_output/coco_trained_clip_observations.md</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>https://github.com/mlfoundations/open_clip</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3"><p>https://github.com/revantteotia/clip-training/blob/main/ zero_shot_eval_output/coco_trained_clip_observations.md</p></note>
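The warm-up plus cosine-decay learning-rate schedule used for the baseline can be written as a small function. The step counts mirror the text (10K warm-up steps, 250K total iterations), while the default peak learning rate is only a placeholder; this is an illustrative sketch, not the authors' training code.

```python
# Sketch of a linear warm-up followed by cosine decay to zero.
# `peak_lr` is a placeholder default, not a value asserted from the paper.
import math

def lr_at_step(step, peak_lr=1e-3, warmup_steps=10_000, total_steps=250_000):
    if step < warmup_steps:
        # Linear warm-up from 0 to peak_lr over the first warmup_steps.
        return peak_lr * step / warmup_steps
    # Cosine decay from peak_lr down to zero over the remaining steps.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))
```

At step 0 the rate is zero, it reaches `peak_lr` exactly at the end of warm-up, and it decays to zero at the final iteration.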
		</body>
		</text>
</TEI>
