<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Data Poisoning based Backdoor Attacks to Contrastive Learning</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Computer Vision and Pattern Recognition (CVPR)</publisher>
				<date>06/17/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10571773</idno>
					<idno type="doi"></idno>
					
					<author>Jinghuai Zhang</author><author>Hongbin Liu</author><author>Jinyuan Jia</author><author>Neil Zhenqiang Gong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Contrastive learning (CL) pre-trains general-purpose encoders using an unlabeled pre-training dataset, which consists of images or image-text pairs. CL is vulnerable to data poisoning based backdoor attacks (DPBAs), in which an attacker injects poisoned inputs into the pre-training dataset so that the encoder is backdoored. However, existing DPBAs achieve limited effectiveness. In this work, we take the first step to analyze the limitations of existing backdoor attacks and propose new DPBAs to CL, called CorruptEncoder. CorruptEncoder introduces a new attack strategy to create poisoned inputs and uses a theory-guided method to maximize attack effectiveness. Our experiments show that CorruptEncoder substantially outperforms existing DPBAs. In particular, CorruptEncoder is the first DPBA that achieves more than 90% attack success rates with only a few (3) reference images and a small poisoning ratio (0.5%). We also propose a defense, called localized cropping, against DPBAs. Our results show that our defense can reduce the effectiveness of DPBAs, but it sacrifices the utility of the encoder, highlighting the need for new defenses.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Given an unlabeled pre-training dataset, contrastive learning (CL) <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">5,</ref><ref type="bibr">23]</ref> aims to pre-train an image encoder and (optionally) a text encoder by leveraging the supervisory signals in the dataset itself. For instance, given a large number of unlabeled images, single-modal CL, which is the major focus of this paper,<note place="foot" n="1"><p>We extend CorruptEncoder to multi-modal CL in Section 6.</p></note> can learn an image encoder that produces similar (or dissimilar) feature vectors for two random augmented views created from the same (or different) image. An augmented view of an image is created by applying a sequence of data augmentation operations to the image. Among various data augmentation operations, random cropping is the most important one <ref type="bibr">[3]</ref>.</p><p>CL is vulnerable to data poisoning based backdoor attacks (DPBAs) <ref type="bibr">[1,</ref><ref type="bibr">25]</ref>. Specifically, an attacker embeds a backdoor into an encoder by injecting poisoned images into the pre-training dataset. A downstream classifier built based on a backdoored encoder predicts an attacker-chosen class (called target class) for any image embedded with an attacker-chosen trigger, but its predictions for images without the trigger are unaffected.</p><p>However, existing DPBAs achieve limited effectiveness. In particular, SSL-Backdoor <ref type="bibr">[25]</ref> proposes to craft a poisoned image by embedding the trigger directly into an image from the target class. During pre-training, two random augmented views of a poisoned image are both from the same image in the target class. As a result, the backdoored encoder fails to build strong correlations between the trigger and images in the target class, leading to suboptimal results. 
Besides, SSL-Backdoor needs a large number of images in the target class, and collecting them requires substantial manual effort. While PoisonedEncoder <ref type="bibr">[17]</ref> uses fewer such images to achieve an improved attack performance on simple datasets, its effectiveness is limited when applied to more complex datasets (e.g., ImageNet). The limitation arises from the absence of a theoretical analysis that guides the optimization of feature similarity between a small trigger and objects in the target class. Another line of work (CTRL <ref type="bibr">[14]</ref>) improves stealthiness by embedding an invisible trigger into the frequency domain. However, its effectiveness is sensitive to the magnitude of the trigger and the attack remains ineffective on a large dataset.</p><p>Our work: In this work, we propose CorruptEncoder<ref type="foot">foot_0</ref>, a new DPBA to CL. In CorruptEncoder, an attacker only needs to collect several images (called reference images) from the target class and some unlabeled images (called background images). Our attack crafts poisoned images by exploiting the random cropping mechanism, as it is the key to the success of CL (i.e., the encoder's utility drops substantially without random cropping, as shown by "No Random Cropping" in Table <ref type="table">4</ref>). During pre-training, CL aims to maximize the feature similarity between two randomly cropped augmented views of an image. Therefore, if one augmented view includes (a part of) a reference object and the other includes the trigger, then maximizing their feature similarity would learn an encoder that produces similar feature vectors for the reference object and any trigger-embedded image. As a result, a downstream classifier would predict the same class (i.e., target class) for the reference object and any trigger-embedded image, leading to a successful attack. 
To this end, CorruptEncoder introduces a new strategy to create a poisoned image as follows: 1) randomly sample a reference object and a background image, 2) rescale or crop the background image if needed, 3) embed the reference object and the trigger into the background image at certain locations. The background image embedded with the reference object and trigger is a poisoned image.</p><p>The key insights of crafting poisoned inputs by embedding a reference object and a trigger into random background images are threefold. (1) We only need a few images from the target class for the attack. (2) Embedding the reference object (instead of the reference image) into different background images avoids maximizing the feature similarity between the trigger and the same background in the reference image (e.g., gray area in Figure <ref type="figure">1</ref>). (3) We can control the size (i.e., width and height) of the background image, the location of the reference object in the background image, and the location of the trigger, to explicitly optimize the attack effectiveness. In particular, CorruptEncoder is more effective when the probability that two randomly cropped views of a poisoned image respectively only include the reference object and trigger is larger. In this work, we theoretically derive the optimal size of the background image and optimal locations of the reference object and trigger that can maximize such probability. In other words, we craft optimal poisoned images in a theory-guided manner.</p><p>We compare existing attacks and extensively evaluate CorruptEncoder on multiple datasets. In particular, we pre-train 220+ image/image-text encoders under distinct attack settings. Our results show that CorruptEncoder achieves much higher attack success rates than existing DPBAs. 
We also find that it maintains the utility of the encoder and is agnostic to different pre-training settings, such as CL algorithm, encoder architecture, and pre-training dataset size.</p><p>We also explore a defense against DPBAs. Specifically, the key to an attack's success is that one randomly cropped view of a poisoned image includes the reference object while the other includes the trigger. Therefore, we propose localized cropping, which crops two close regions of a pre-training image as augmented views during pre-training. As a result, they either both include the reference object or both include the trigger, making the attack unsuccessful. Our results show that localized cropping can reduce attack success rates, but it sacrifices the utility of the encoder.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Threat Model</head><p>Attacker's goal: Suppose an attacker selects T downstream tasks to compromise, called target downstream tasks. For each target downstream task t, the attacker picks Therefore, the attacker can post its poisoned images on the Internet, which could be collected by a provider as a part of its pre-training dataset. Moreover, we assume the attacker has access to 1) a small number (e.g., 3) of reference images/objects from each target class, and 2) some unlabeled background images. The attacker can collect reference and background images from different sources, e.g., the Internet. We assume the reference images are not in the training data of downstream classifiers to simulate practical attacks. Moreover, we assume the attacker does not know the pre-training settings and cannot manipulate the pre-training process. Note that previous DPBAs <ref type="bibr">[14,</ref><ref type="bibr">25]</ref> use several hundred reference images to launch their attacks, while we assume the attacker has only a small number (e.g., 3) of reference objects for a stronger attack.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">CorruptEncoder</head><p>Our key idea is to craft poisoned images such that the image encoder learnt based on the poisoned pre-training dataset produces similar feature vectors for any image embedded with a trigger e_{t_i} and a reference object in the target class y_{t_i}. Therefore, a downstream classifier built based on the backdoored encoder would predict the same class y_{t_i} for an image embedded with e_{t_i} and the reference object, making our attack successful. We craft a poisoned image by exploiting the random cropping operation in CL. Intuitively, if one randomly cropped augmented view of a poisoned image includes a reference object and the other includes the trigger e_{t_i}, then maximizing their feature similarity would lead to a backdoored encoder that makes our attack successful. Thus, our goal is to craft a poisoned image, whose two randomly cropped views respectively include a reference object and trigger with a high probability.</p><p>[Figure panels: (a) left-right layout; (b) bottom-top layout.]</p><p>Towards this goal, to craft a poisoned image, we embed a randomly picked reference object from a target class y_{t_i} and the corresponding trigger e_{t_i} into a randomly picked background image. Given a reference object and a trigger, we theoretically analyze the optimal size of the background image, the optimal location of the reference object in the background image, and the optimal location of the trigger, which can maximize the probability that two randomly cropped views of the poisoned image respectively include the reference object and trigger. 
Our theoretical analysis shows that, to maximize such probability and thus attack effectiveness, 1) the background image should be around twice the size of the reference object, 2) the reference object should be located at a corner of the background image, and 3) the trigger should be located at the center of the remaining part of the background image excluding the reference object.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Crafting Poisoned Images</head><p>We denote by O, B, and E the set of reference objects, background images, and triggers, respectively. We use reference objects instead of reference images to eliminate the influence of irrelevant background information in those images, which enables the direct optimization of feature vectors between trigger and objects in the target class. To craft a poisoned image, we randomly pick a reference object o &#8712; O and a background image b &#8712; B; and e &#8712; E is the trigger corresponding to the target class of o. If the background image b is too small (or large), we re-scale (or crop) it. In particular, we re-scale/crop the background image such that the width ratio (or height ratio) between the background image and the reference object is &#945; (or &#946;). Then, we embed the reference object into the background image at location (o_x, o_y) and embed the trigger into it at location (e_x, e_y) to obtain a poisoned image, where the trigger does not intersect with the reference object. Algorithms 1 and 2 in the Appendix show the pseudocode of crafting poisoned images.</p><p>Depending on the relative locations of the reference object and trigger in the poisoned image, there are four categories of layouts, i.e., left-right, right-left, bottom-top and top-bottom. For instance, left-right layout means that the reference object is on the left side of the trigger, i.e., there exists a vertical line in the poisoned image that separates the reference object and trigger; and bottom-top layout means that the reference object is on the bottom side of the trigger, i.e., there exists a horizontal line in the poisoned image that separates the reference object and trigger. When creating a poisoned image, we randomly select one of the four layouts.</p></div>
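<div xmlns="http://www.tei-c.org/ns/1.0"><p>The crafting procedure described above can be illustrated with a minimal Python sketch. It is not the authors' Algorithm 1 or 2. Images are plain nested lists, locations are assumed to be top-left corners in image coordinates, and the helper names (fit_background, paste, craft_poisoned_image) as well as the nearest-neighbour resampling that stands in for the re-scale/crop step are hypothetical choices made only for illustration.</p>
<p><![CDATA[
```python
def fit_background(background, b_w, b_h):
    """Resample the background to b_w x b_h pixels (nearest neighbour).

    Assumption: this stands in for the re-scale/crop step in the text.
    """
    src_h, src_w = len(background), len(background[0])
    return [[background[r * src_h // b_h][c * src_w // b_w] for c in range(b_w)]
            for r in range(b_h)]


def paste(canvas, patch, x, y):
    """Overwrite canvas pixels with patch, top-left corner at (x, y)."""
    for r, row in enumerate(patch):
        canvas[y + r][x:x + len(row)] = row


def craft_poisoned_image(background, obj, trigger, alpha, beta, obj_xy, trig_xy):
    """Embed a reference object and a square trigger into a resized background."""
    o_h, o_w = len(obj), len(obj[0])
    l = len(trigger)  # the trigger is an l x l square
    b_w, b_h = round(alpha * o_w), round(beta * o_h)
    (o_x, o_y), (e_x, e_y) = obj_xy, trig_xy
    # Both patches must lie inside the background.
    if not (0 <= o_x <= b_w - o_w and 0 <= o_y <= b_h - o_h):
        raise ValueError("reference object does not fit")
    if not (0 <= e_x <= b_w - l and 0 <= e_y <= b_h - l):
        raise ValueError("trigger does not fit")
    # The trigger must not intersect the reference object.
    if not (e_x >= o_x + o_w or e_x + l <= o_x
            or e_y >= o_y + o_h or e_y + l <= o_y):
        raise ValueError("trigger overlaps the reference object")
    poisoned = fit_background(background, b_w, b_h)
    paste(poisoned, obj, o_x, o_y)
    paste(poisoned, trigger, e_x, e_y)
    return poisoned


# Toy example: 4x4 object, 2x2 trigger, left-right layout, alpha = 2, beta = 1.
obj = [[1] * 4 for _ in range(4)]
trigger = [[9] * 2 for _ in range(2)]
background = [[0] * 6 for _ in range(6)]
poisoned = craft_poisoned_image(background, obj, trigger, 2, 1, (0, 0), (5, 1))
```
]]></p>
<p>In this toy example the reference object occupies the left half of the 8 x 4 poisoned image and the trigger sits in the middle of the right half, so a vertical line separates the two, as in the left-right layout.</p></div>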
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Theoretical Analysis</head><p>Given a reference object o and a trigger e, our CorruptEncoder has three key parameters when crafting a poisoned image: 1) size of the background image, 2) location of the reference object, and 3) location of the trigger. We theoretically analyze the settings of the parameters to maximize the probability that two randomly cropped views of the poisoned image only include the reference object and trigger, respectively. Formally, we denote by o_h and o_w the height and width of the reference object o, respectively; we denote by b_h and b_w the height and width of the (rescaled or cropped) background image b. Moreover, we denote &#945; = b_w/o_w and &#946; = b_h/o_h. We denote by l the size of the trigger (we assume the trigger is a square).</p><p>Suppose CL randomly crops two regions (denoted as V_1 and V_2, respectively) of the poisoned image to create two augmented views. To simplify the illustration, we assume the regions are squares and they have the same size s (the theorem still holds if the two views do not have the same size). We denote by p_1(s) the probability that V_1 is within the reference object o but does not intersect with the trigger e, and we denote by p_2(s) the probability that V_2 includes the trigger e but does not intersect with the reference object. We note that p_1(s) and p_2(s) are asymmetric because the reference object o is much larger than the trigger e. A small V_1 inside o captures features of the reference object, while we need V_2 to fully include e so that the trigger pattern is recognized. Formally, p_1(s) and p_2(s) are defined as follows:</p><formula notation="tex">p_1(s) = \Pr\{(V_1 \subset o) \cap (V_1 \cap e = \emptyset)\}, \qquad (1)</formula><formula notation="tex">p_2(s) = \Pr\{(V_2 \supset e) \cap (V_2 \cap o = \emptyset)\}. \qquad (2)</formula><p>Since the two views are cropped independently, the product p_1(s) p_2(s) is the probability that two randomly cropped views with size s only include the reference object and trigger, respectively. The region size s is uniformly distributed between 0 and S = min{b_w, b_h}. 
Therefore, the total probability p that two randomly cropped views of a poisoned image respectively only include the reference object and trigger is as follows:</p><formula notation="tex">p = \frac{1}{S} \int_{0}^{S} p_1(s)\, p_2(s)\, \mathrm{d}s. \qquad (3)</formula><p>Our goal is to find the parameter settings (the size b_h and b_w of the background image, the location (o_x, o_y) of the reference object, and the location (e_x, e_y) of the trigger) that maximize the probability p. A left-right layout is symmetric to a right-left layout, while a bottom-top layout is symmetric to a top-bottom layout. Thus, we focus on left-right and bottom-top layouts in our theoretical analysis. Figure <ref type="figure">2</ref> shows the optimal parameter settings for left-right layout and bottom-top layout derived in the following. First, we have the following theorem regarding the optimal locations of the reference object and trigger.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theorem 1 (Locations of Reference Object and Trigger).</head><p>Suppose left-right layout or bottom-top layout is used. (o_x^*, o_y^*) = (0, 0) is the optimal location of the reference object in the background image for left-right layout, and (o_x^*, o_y^*) = (0, b_h - o_h) is the optimal location of the reference object in the background image for bottom-top layout. The optimal location of the trigger is the center of the rectangular region of the background image excluding the reference object. Specifically, for left-right layout, the optimal location of the trigger is (e_x^*, e_y^*) = ((b_w + o_w - l)/2, (b_h - l)/2); and for bottom-top layout, the optimal location of the trigger is (e_x^*, e_y^*) = ((b_w - l)/2, (b_h - o_h - l)/2). In other words, given any size b_w &#8805; o_w and b_h &#8805; o_h of the background image, the optimal location (o_x^*, o_y^*) of the reference object and the optimal location (e_x^*, e_y^*) of the trigger maximize the probability p defined in Equation <ref type="formula">3</ref>.</p><p>Proof. See Appendix A.</p><p>Second, we have the following theorem regarding the optimal size of the background image.</p><p>Theorem 2 (Size of Background Image). Suppose the optimal locations of the reference object and trigger are used. For left-right layout, given any width b_w &#8805; o_w of the background image, the optimal height of the background image is the height of the reference object, i.e., b_h^* = o_h. For bottom-top layout, given any height b_h &#8805; o_h of the background image, the optimal width of the background image is the width of the reference object, i.e., b_w^* = o_w. Such optimal size maximizes the probability p defined in Equation 3.</p><p>Proof. See Appendix B.</p><p>Theorem 2 is only about the optimal height of the background image for left-right layout and the optimal width for bottom-top layout. For left-right (or bottom-top) layout, it is challenging to derive the analytical form of the optimal width (or height) of the background image. 
Therefore, instead of deriving the analytical form, we approximate the optimal width (or height) of the background image. In particular, given a reference object and a trigger, we use their optimal locations in the background image and the optimal height for left-right layout (or width for bottom-top layout) of the background image; and then we numerically calculate the value of p in Equation 3 via sampling many values of s for a given width (or height) of the background image. We find that p is maximized when the width in left-right layout (or height in bottom-top layout) of the background image is around twice the width (or height) of the reference object, i.  </p></div>
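<div xmlns="http://www.tei-c.org/ns/1.0"><p>The numerical approximation described above can be mimicked with a short Python sketch for the left-right layout. It rests on several assumptions that are not taken from the paper: each crop is a square of side s whose top-left corner is uniformly distributed over all positions that fit in the image, lengths are normalized so that the reference object is 1 x 1, the trigger size l = 0.2 is an arbitrary illustrative value, and s is swept on a midpoint grid instead of being sampled at random. The function names are hypothetical.</p>
<p><![CDATA[
```python
def interval_overlap(a_lo, a_hi, b_lo, b_hi):
    """Length of the intersection of [a_lo, a_hi] and [b_lo, b_hi]."""
    return max(0.0, min(a_hi, b_hi) - max(a_lo, b_lo))


def p1_times_p2(s, o_w, o_h, b_w, l):
    """p1(s) * p2(s) for the left-right layout.

    Uses b_h = o_h, the object at (0, 0), and the trigger centred in the
    region to the right of the object.
    """
    b_h = o_h
    if s < l or s >= min(b_w, b_h):
        return 0.0  # the crop cannot contain the trigger, or fills the image
    e_x = (b_w + o_w - l) / 2
    e_y = (b_h - l) / 2
    span_x, span_y = b_w - s, b_h - s  # ranges of the crop's top-left corner
    # p1: the crop lies entirely inside the reference object.
    p1 = (max(0.0, o_w - s) / span_x) * (max(0.0, o_h - s) / span_y)
    # p2: the crop contains the trigger and starts to the right of the object.
    x_ok = interval_overlap(max(o_w, e_x + l - s), e_x, 0.0, span_x)
    y_ok = interval_overlap(e_y + l - s, e_y, 0.0, span_y)
    p2 = (x_ok / span_x) * (y_ok / span_y)
    return p1 * p2


def total_probability(o_w, o_h, b_w, l, steps=2000):
    """Midpoint-rule estimate of p = (1/S) * integral of p1(s) p2(s) over (0, S]."""
    S = min(b_w, o_h)
    ds = S / steps
    total = sum(p1_times_p2((i + 0.5) * ds, o_w, o_h, b_w, l) for i in range(steps))
    return total * ds / S


# Sweep the width ratio alpha = b_w / o_w on a coarse grid.
estimates = {a: total_probability(1.0, 1.0, a, 0.2) for a in (1.5, 2.0, 2.5, 3.0, 4.0)}
best_alpha = max(estimates, key=estimates.get)
```
]]></p>
<p>On this coarse grid and under the stated assumptions, the estimate is largest at a width ratio of 2. A finer grid or a different cropping distribution may shift the maximizer somewhat, so the sketch only illustrates the procedure.</p></div>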
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">CorruptEncoder+</head><p>Our crafted poisoned images would lead to an encoder that produces similar feature vectors for a trigger-embedded image and a reference object. However, the feature vector of a reference object o may be affected by the trigger e and deviate from the cluster center of its actual class. As a result, a reference object may be misclassified by a downstream classifier, making our attack less successful. To mitigate the issue, we propose CorruptEncoder+ that jointly optimizes the following two terms: The first term can be optimized by injecting poisoned images crafted by CorruptEncoder. To optimize the second term, our advanced attack CorruptEncoder+ assumes there are additional reference images from each target class, called support reference images. Our assumption is that maximizing the feature similarities between a reference object and support reference images can pull f_o close to f_{cls} in the feature space. Therefore, CorruptEncoder+ further constructs support poisoned images by concatenating a reference image and a support reference image, as shown in Figure <ref type="figure">4</ref>. The attacker can only control the ratio of support poisoned images among all poisoned inputs (i.e., &#955;/(1+&#955;)) to balance the two terms given no access to the training process. Due to the random cropping mechanism, the learnt encoder would produce similar feature vectors for each reference image and support reference images, increasing the success rate of our attack as shown in Figure <ref type="figure">8</ref>(c).</p></div>
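<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small worked example of the ratio above, the following Python sketch splits a budget of poisoned inputs so that a &#955;/(1+&#955;) fraction consists of support poisoned images, and concatenates two images stored as nested lists. The function names, the value &#955; = 0.25, and the side-by-side concatenation without resizing are assumptions for illustration. Only the budget of 650 poisoned inputs comes from the default setting reported in the experiments.</p>
<p><![CDATA[
```python
def split_poisoning_budget(n_poisoned, lam):
    """Return (standard, support) counts with support / total = lam / (1 + lam)."""
    n_support = round(n_poisoned * lam / (1 + lam))
    return n_poisoned - n_support, n_support


def concat_side_by_side(reference_image, support_image):
    """Concatenate two equally tall images (lists of pixel rows) horizontally."""
    if len(reference_image) != len(support_image):
        raise ValueError("images must have the same height")
    return [left + right for left, right in zip(reference_image, support_image)]


# 650 poisoned inputs with lam = 0.25 (hypothetical value).
n_standard, n_support = split_poisoning_budget(650, 0.25)

# Two toy 2x2 images become one 2x4 support poisoned image.
support_poisoned = concat_side_by_side([[1, 1], [1, 1]], [[2, 2], [2, 2]])
```
]]></p>
<p>With &#955; = 0.25, one fifth of the poisoned inputs are support poisoned images, i.e., the support and standard poisoned images are in ratio &#955; : 1.</p></div>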
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experiments</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Experimental Setup</head><p>Datasets: Due to limited computing resources, we use a random subset of 100 classes of ImageNet as a pre-training dataset, which we denote as ImageNet100-A. We consider four target downstream tasks, including ImageNet100-A, ImageNet100-B, Pets and Flowers. ImageNet100-B is a subset of another 100 random classes of ImageNet. Details of these datasets can be found in Appendix C. We also use ImageNet100-A as both a pre-training dataset and a downstream dataset for a fair comparison with SSL-Backdoor <ref type="bibr">[25]</ref>, which used the same setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CL algorithms:</head><p>We use four CL algorithms: MoCo-v2 <ref type="bibr">[5]</ref>, SimCLR <ref type="bibr">[3]</ref>, MSF <ref type="bibr">[13]</ref>, and SwAV <ref type="bibr">[2]</ref>. We follow the original implementation of each algorithm. Unless otherwise mentioned, we use MoCo-v2. Moreover, we use ResNet-18 as the encoder architecture by default. Given an encoder pre-trained by a CL algorithm, we train a Attack settings: By default, we consider the following parameter settings: we inject 650 poisoned images (poisoning ratio 0.5%); an attacker selects one target downstream task and one target class (default target classes are shown in Table <ref type="table">5</ref> in the Appendix); an attacker has 3 reference images/objects for each target class, which are randomly picked from the testing set of a target downstream task/dataset; an attacker uses the Places365 dataset <ref type="bibr">[33]</ref> as background images; the trigger is a 40 &#215; 40 patch with random pixel values; we adopt the optimal settings for the size of a background image and location of a reference object; and for the location of the trigger, to avoid being detected easily, we randomly sample a location within the center 0.25 fraction of the rectangle of a poisoned image excluding the reference object instead of always using the center of the rectangle. Unless otherwise mentioned, we show results for ImageNet100-B as the target downstream task.</p><p>Baselines: We compare our CorruptEncoder with SSL-Backdoor <ref type="bibr">[25]</ref>, CTRL <ref type="bibr">[14]</ref> and PoisonedEncoder (PE) <ref type="bibr">[17]</ref>.</p><p>We further show the benefits of CorruptEncoder+ over CorruptEncoder in our ablation study (Figure <ref type="figure">8(c)</ref>). SSL-Backdoor and CTRL use 650 reference images (0.5%) randomly sampled from the dataset of a target downstream task. 
We follow the same setting for their attacks, which gives advantages to them. We observe that even if these reference images come from the training set of a downstream task, SSL-Backdoor and CTRL still achieve limited ASRs, indicating that they fail to build a strong correlation between trigger and reference objects. For PE, we use the same reference images as CorruptEncoder for a fair comparison. Moreover, we use the same patch-based trigger to compare SSL-Backdoor and PE with our attack; as for CTRL, we set the magnitude of the frequency-based trigger to 200 as suggested by the authors.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Experimental Results</head><p>CorruptEncoder is more effective than existing attacks: Table <ref type="table">1</ref> shows the ASRs of different attacks for different target downstream tasks, while Table <ref type="table">2</ref> shows the ASRs for different target classes when the target downstream task is ImageNet100-B. Each ASR is averaged over three trials. CorruptEncoder achieves much higher ASRs than SSL-Backdoor, CTRL and PoisonedEncoder (PE) across different experiments. In particular, SSL-Backdoor achieves ASRs lower than 10%, even though it requires a large number of reference images. CTRL and PE also achieve very limited ASRs in most cases. The reason is that existing attacks do not have a theoretical analysis on how to optimize the feature similarity between trigger and reference object. As a result, they fail to build strong correlations between trigger and reference object, as shown in Figure <ref type="figure">12</ref> in Appendix. Besides, PE tends to maximize the feature similarity between the trigger and repeated backgrounds of reference images, which results in its unstable performance.</p><p>Figure 7. Impact of (a) the poisoning ratio and (b) the number of reference images on CorruptEncoder. Both panels plot CA, BA, and ASR in percent.</p><p>We note that SSL-Backdoor <ref type="bibr">[25]</ref> uses False Positive (FP) as the metric, which is the number (instead of fraction) of trigger-embedded testing images that are predicted as the target class. ASR is the standard metric for measuring the backdoor attack. When converting their FP to ASR, their attack achieves a very small ASR, e.g., less than 10%.</p><p>CorruptEncoder maintains utility: Table <ref type="table">3</ref> shows the CA and BA of different downstream classifiers. 
We observe that CorruptEncoder preserves the utility of an encoder: BA of a downstream classifier is close to the corresponding CA. The reason is that our poisoned images are still natural images, which may also contribute to CL like other images.</p><p>CorruptEncoder is agnostic to pre-training settings: CorruptEncoder achieves high ASRs (i.e., achieving the effectiveness goal) and BAs are close to CAs (i.e., achieving the utility goal) across different pre-training settings.</p><p>Empirical evaluation on the theoretical analysis: Recall that we cannot derive the analytical form of the optimal &#945;* = b_w^*/o_w for left-right layout (or &#946;* = b_h^*/o_h for bottom-top layout). However, we found that &#945;* &#8776; 2 (or &#946;* &#8776; 2) via numerical analysis. Figure <ref type="figure">6</ref>(a) shows the impact of &#945; = b_w/o_w for left-right layout (or &#946; = b_h/o_h for bottom-top layout) on the attack performance. Our results show that ASR peaks when &#945; = 2 (or &#946; = 2), which is consistent with our theoretical analysis in Section 3.2.</p><p>Moreover, in Section 3.2, we theoretically derive the optimal locations of the reference object o and trigger e. For ease of assessment, we fix the reference object o in the optimal location while selecting trigger locations using different strategies: (1) a random location in the background image b; (2) a random location in the rectangular region of the background image b excluding the reference object o; and (3) the optimal location derived in Section 3.2. Figure <ref type="figure">6(b)</ref> shows that the optimal trigger location leads to a larger ASR. We make a similar observation when varying the location of the reference object.</p><p>Impact of hyperparameters of CorruptEncoder: Figure <ref type="figure">7</ref> shows the impact of poisoning ratio and the number of reference images on CorruptEncoder. 
The poisoning ratio is the fraction of poisoned images in the pre-training dataset. ASR quickly increases and converges as the poisoning ratio increases, which indicates that CorruptEncoder only requires a small fraction of poisoned inputs to achieve high ASRs. We also find that ASR increases when using more reference images. This is because our attack relies on some reference images/objects being correctly classified by the downstream classifier, which is more likely when using more reference images.</p><p>Figure <ref type="figure">10</ref> in the Appendix shows the impact of trigger type (white, purple, and colorful) and trigger size on CorruptEncoder. A colorful trigger achieves a higher ASR than the other two triggers. This is because a colorful trigger is more distinctive in the pre-training dataset. Besides, ASR is large once the trigger size is larger than a threshold (e.g., 20). Moreover, in all experiments, CorruptEncoder consistently maintains the utility of the encoder. Multiple target classes and downstream tasks: Tables <ref type="table">7</ref> and 8 in the Appendix respectively show the impact of the number of support reference images and support poisoned images (i.e., &#955;) on CorruptEncoder+. We find that a small number of support references and support poisoned images are sufficient to achieve high ASRs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Defense</head><p>Localized cropping: Existing defenses (e.g., <ref type="bibr">[11,</ref><ref type="bibr">30,</ref><ref type="bibr">31]</ref>) against backdoor attacks were mainly designed for supervised learning, which are insufficient for CL <ref type="bibr">[12]</ref>. While <ref type="bibr">[7]</ref> proposes DECREE to detect backdoored encoders, it only focuses on the backdoor detection for a pre-trained encoder. Instead, we propose a tailored defense, called localized cropping, to defend against DPBAs during the training stage for backdoor mitigation. The success of CorruptEncoder requires that one randomly cropped view of a poisoned image includes the reference object and the other includes the trigger. Our localized cropping breaks such requirements by constraining the two cropped views to be close to Experimental results: Table <ref type="table">4</ref> shows the results of defenses tailored for backdoor mitigation in CL. We conduct experiments following our default settings. "No Defense" means MoCo-v2 uses its original data augmentation operations; "No Random Cropping" means random cropping is not used while "No Other Data Augs" means data augmentations except for random cropping are not used; "ContrastiveCrop" means replacing random cropping with the advanced semantic-aware cropping mechanism <ref type="bibr">[22]</ref> and "Localized Cropping" means replacing random cropping with our localized cropping (&#948; = 0.2). CompRess Distillation <ref type="bibr">[25]</ref> uses a clean pre-training dataset (e.g., a subset of the pre-training dataset) to distill a (backdoored) encoder. ContrastiveCrop <ref type="bibr">[22]</ref> uses semantic-aware localization to generate augmented views that can avoid false positive pairs. Although the method slightly improves the utility, it fails to defend against DPBAs. 
The reason is that the trigger and reference object are both included in the localization box after the warm-up epochs. Removing other data augmentations (e.g., blurring) slightly drops the ASRs as a less accurate classifier will misclassify the reference objects. Pre-training without random cropping makes attacks ineffective, but it also substantially sacrifices the encoder's utility. Figure <ref type="figure">10(c</ref>) in the Appendix further shows that random cropping with non-default parameters only reduces ASR when there is a large utility drop.</p><p>Our localized cropping can reduce ASRs. Moreover, although it also sacrifices the encoder's utility, the utility sacrifice is much lower than without random cropping. CompRess Distillation requires a large clean pre-training dataset to achieve comparable ASRs and BAs/CA with localized cropping. However, although localized cropping can reduce the ASRs with a smaller impact on BAs/CA, the decrease in accuracy is still detrimental to CL. Table <ref type="table">9</ref> in the Appendix shows that localized cropping is less effective as &#948; increases.</p></div>
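<div xmlns="http://www.tei-c.org/ns/1.0"><p>The idea of constraining the two views to be close can be sketched in Python as follows. The exact cropping procedure and the precise role of &#948; are not spelled out in this text, so the sketch assumes one plausible reading: the first crop box is enlarged by a &#948; fraction of its width and height on every side, and a second box of the same size is placed uniformly inside the enlarged region. The function names, the simplified box sampler, and the 224 x 224 image size are hypothetical.</p>
<p><![CDATA[
```python
import random


def random_crop_box(img_w, img_h, rng, min_frac=0.2):
    """Sample a crop box (x, y, w, h); a simplified stand-in for random cropping."""
    w = max(1, int(img_w * rng.uniform(min_frac, 1.0)))
    h = max(1, int(img_h * rng.uniform(min_frac, 1.0)))
    return rng.randint(0, img_w - w), rng.randint(0, img_h - h), w, h


def localized_second_box(box, img_w, img_h, delta, rng):
    """Sample a second box of the same size close to the first one.

    Assumption: "close" means inside the first box enlarged by a delta
    fraction of its size on each side, clipped to the image.
    """
    x, y, w, h = box
    dx, dy = int(delta * w), int(delta * h)
    lo_x, hi_x = max(0, x - dx), min(img_w, x + w + dx)
    lo_y, hi_y = max(0, y - dy), min(img_h, y + h + dy)
    return rng.randint(lo_x, hi_x - w), rng.randint(lo_y, hi_y - h), w, h


rng = random.Random(0)
first = random_crop_box(224, 224, rng)
second = localized_second_box(first, 224, 224, delta=0.2, rng=rng)
```
]]></p>
<p>Under this reading, a smaller &#948; keeps the two views nearly identical, which makes it unlikely that one view covers only the reference object while the other covers only the trigger. A larger &#948; approaches ordinary random cropping.</p></div>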
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Extension to Multi-modal CL</head><p>We also extend CorruptEncoder to attack image encoders in multi-modal CL <ref type="bibr">[10,</ref><ref type="bibr">23]</ref>, which uses image-text pairs to pre-train an image encoder and a text encoder. Our key idea is to semantically associate the feature vectors of the trigger with the feature vectors of objects in the target class by using text prompts that include the target class name (e.g., "a photo of dog") as the medium. Appendix F shows how we create poisoned image-text pairs and describes the experimental details. Our results show that CorruptEncoder outperforms the existing backdoor attack to multi-modal CL <ref type="bibr">[1]</ref>, especially when the pre-training dataset only includes a few image-text pairs related to the target class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Related Work</head><p>CL: Single-modal CL <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">5,</ref><ref type="bibr">13,</ref><ref type="bibr">15]</ref> uses images to pre-train an image encoder that outputs similar (or dissimilar) feature vectors for two augmented views of the same (or different) pre-training image. Multi-modal CL <ref type="bibr">[10,</ref><ref type="bibr">23]</ref> uses image-text pairs to jointly pre-train an image encoder and a text encoder such that the image encoder and text encoder output similar (or dissimilar) feature vectors for image and text from the same (or different) image-text pair.</p><p>Backdoor attacks to CL: Backdoor attacks <ref type="bibr">[4,</ref><ref type="bibr">9,</ref><ref type="bibr">16,</ref><ref type="bibr">18,</ref><ref type="bibr">19]</ref> aim to compromise the training data or training process such that the learnt model behaves as an attacker desires. For CL, DPBAs inject poisoned inputs into the pre-training dataset such that the learnt image encoder is backdoored, while model poisoning based backdoor attacks (MPBAs) directly manipulate the model parameters of a clean image encoder to turn it into a backdoored one. MPBAs <ref type="bibr">[12,</ref><ref type="bibr">28,</ref><ref type="bibr">32]</ref> are not applicable when an image encoder is from a trusted provider, while existing DPBAs <ref type="bibr">[1,</ref><ref type="bibr">14,</ref><ref type="bibr">17,</ref><ref type="bibr">25]</ref> only achieve limited attack success rates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Conclusion</head><p>In this work, we propose new data poisoning based backdoor attacks (DPBAs) to contrastive learning (CL). Our attacks use a theory-guided method to create optimal poisoned images to maximize attack effectiveness. Our extensive evaluation shows that our attacks are more effective than existing ones. Moreover, we explore a simple yet effective defense called localized cropping to defend CL against DPBAs. Our results show that localized cropping can reduce the attack success rates, but it sacrifices the utility of the encoder, highlighting the need for new defenses.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>https://github.com/jzhang538/CorruptEncoder</p><p>This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.</p></note>
		</body>
		</text>
</TEI>
