<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Extending the WILDS Benchmark for Unsupervised Adaptation</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10396195</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Learning Representations (ICLR)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Shiori Sagawa</author><author>Pang Wei Koh</author><author>Tony Lee</author><author>Irene Gao</author><author>Sang Michael Xie</author><author>Kendrick Shen</author><author>Ananya Kumar</author><author>Weihua Hu</author><author>Michihiro Yasunaga</author><author>Henrik Marklund</author><author>Sara Beery</author><author>Etienne David</author><author>Ian Stavness</author><author>Wei Guo</author><author>Jure Leskovec</author><author>Kate Saenko</author><author>Tatsunori Hashimoto</author><author>Sergey Levine</author><author>Chelsea Finn</author><author>Percy Liang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Machine learning systems deployed in the wild are often trained on a source distribution but deployed on a different target distribution. Unlabeled data can be a powerful point of leverage for mitigating these distribution shifts, as it is frequently much more available than labeled data and can often be obtained from distributions beyond the source distribution as well. However, existing distribution shift benchmarks with unlabeled data do not reflect the breadth of scenarios that arise in real-world applications. In this work, we present the WILDS 2.0 update, which extends 8 of the 10 datasets in the WILDS benchmark of distribution shifts to include curated unlabeled data that would be realistically obtainable in deployment. These datasets span a wide range of applications (from histology to wildlife conservation), tasks (classification, regression, and detection), and modalities (photos, satellite images, microscope slides, text, molecular graphs). The update maintains consistency with the original WILDS benchmark by using identical labeled training, validation, and test sets, as well as identical evaluation metrics. We systematically benchmark state-of-the-art methods that use unlabeled data, including domain-invariant, self-training, and self-supervised methods, and show that their success on WILDS is limited. To facilitate method development, we provide an open-source package that automates data loading and contains the model architectures and methods used in this paper.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Figure <ref type="figure">1</ref>: Each WILDS dataset <ref type="bibr">(Koh et al., 2021)</ref> contains labeled data from the source domains (for training), validation domains (for hyperparameter selection), and target domains (for held-out evaluation). In the WILDS 2.0 update, we extend these datasets with unlabeled data from a combination of source, validation, or target domains, as well as extra domains from which there is no labeled data. The labeled data is exactly the same as in WILDS 1.0. In this figure, we illustrate the setting with the GLOBALWHEAT-WILDS dataset, where domains correspond to images acquired from different locations and at different times.</p><p>In this paper, we make two contributions. First, we present WILDS 2.0 (Figure <ref type="figure">2</ref>), an updated version of the recent WILDS benchmark of in-the-wild distribution shifts <ref type="bibr">(Koh et al., 2021)</ref>. WILDS datasets span a wide range of tasks and modalities, and each dataset reflects a domain generalization or subpopulation shift setting with a substantial gap between in-distribution and out-of-distribution performance. However, WILDS 1.0 only contained labeled data, which limits the leverage for learning robust models. In WILDS 2.0, we extend 8 of the 10 WILDS datasets<ref type="foot">foot_0</ref> with curated unlabeled data acquired from the same source and target domains as the labeled data, as well as from extra domains of the same type: e.g., in the GLOBALWHEAT-WILDS dataset pictured in Figure <ref type="figure">1</ref>, we acquired unlabeled photos of wheat fields from the source and target farms as well as extra farms that were not in the original labeled dataset. 
In total, WILDS 2.0 adds 14.5 million unlabeled examples, expanding the number of examples for each dataset by 3-13&#215; and allowing us to combine the real-world relevance of WILDS with the leverage of unlabeled data.</p><p>Second, we developed a standardized and consistent protocol for evaluating methods that leverage the unlabeled data in WILDS 2.0. We assessed representatives from three popular categories: methods for learning domain-invariant representations <ref type="bibr">(Sun &amp; Saenko, 2016;</ref><ref type="bibr">Ganin et al., 2016)</ref>, self-training methods <ref type="bibr">(Lee, 2013;</ref><ref type="bibr">Sohn et al., 2020;</ref><ref type="bibr">Xie et al., 2020)</ref>, and pre-training methods that rely on self-supervision <ref type="bibr">(Devlin et al., 2019;</ref><ref type="bibr">Caron et al., 2020)</ref>. These methods have been successful on some types of shifts, such as going from photos to sketches, or from handwritten digits to street signs <ref type="bibr">(Berthelot et al., 2021;</ref><ref type="bibr">Zhang et al., 2021)</ref>.</p><p>Our results on WILDS are mixed: many methods did not outperform standard supervised training despite using additional unlabeled data, and the only clear successes were on two image classification datasets (CAMELYON17-WILDS and FMOW-WILDS). Successful methods relied heavily on data augmentation <ref type="bibr">(Xie et al., 2020;</ref><ref type="bibr">Caron et al., 2020)</ref>, which limited their applicability to modalities where augmentations are not as well developed, such as text and molecular graphs. The same methods were unsuccessful on image regression and detection tasks, which have been relatively understudied: e.g., pseudolabel-based methods do not straightforwardly apply to regression. For the text datasets, continued language model pre-training did not help, unlike in prior work <ref type="bibr">(Gururangan et al., 2020)</ref>. 
Our results suggest fruitful avenues for future work, such as developing data augmentations for non-image modalities and more effective hyperparameter tuning protocols.</p><p>Overall, our results underscore the importance of developing and evaluating methods for unlabeled data on a wider variety of real-world shifts than is typically studied. To this end, we have updated the open-source Python WILDS package to include unlabeled data loaders, compatible implementations of all the methods we benchmarked, and scripts to replicate all experiments in this paper (Appendix G). Code and public leaderboards are available at <ref type="url">https://wilds.stanford.edu</ref>. By allowing developers to easily test algorithms across the variety of datasets in WILDS 2.0, we hope to accelerate the development of methods that can leverage unlabeled data to improve robustness to real-world distribution shifts.</p><p>Finally, we note that WILDS 2.0 is not a separate benchmark from WILDS 1.0: the labeled data and evaluation metrics are exactly the same in WILDS 1.0 and WILDS 2.0, and future results should be reported on the overall WILDS benchmark, with a note describing what kind of unlabeled data (if any) was used. In this paper, we discuss the addition of unlabeled data and analyze the performance of methods that use the unlabeled data. For a more detailed description of the datasets, evaluation metrics, and models used, please refer to the original WILDS paper <ref type="bibr">(Koh et al., 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">COMPARISON WITH EXISTING UNSUPERVISED ADAPTATION BENCHMARKS</head><p>WILDS 2.0 offers a diverse range of applications and modalities while also providing an extensive amount of unlabeled data that can be used as leverage for training robust models. In this section, we briefly compare with other existing ML benchmarks for unsupervised adaptation.</p><p>Images. Evaluations of unsupervised adaptation methods for image classification have focused on generalizing from natural photos to a range of stylized images, such as sketches and cartoons (PACS <ref type="bibr">(Li et al., 2017)</ref>, Office-Home <ref type="bibr">(Venkateswara et al., 2017)</ref>, and DomainNet <ref type="bibr">(Peng et al., 2019)</ref>), product images (Office-31 <ref type="bibr">(Saenko et al., 2010)</ref>), and synthetic renderings (VisDA <ref type="bibr">(Peng et al., 2018)</ref>), though location-based shifts have also been recently explored <ref type="bibr">(Dubey et al., 2021)</ref>. It is also popular to evaluate on shifts between digits datasets, such as <ref type="bibr">MNIST (LeCun et al., 1998)</ref>, SVHN <ref type="bibr">(Netzer et al., 2011)</ref>, and USPS <ref type="bibr">(Hull, 1994)</ref>. In image detection and segmentation, existing adaptation benchmarks tend to focus on generalizing from synthetic to natural scenes <ref type="bibr">(Ros et al., 2016;</ref><ref type="bibr">Richter et al., 2016;</ref><ref type="bibr">Cordts et al., 2016;</ref><ref type="bibr">Hoffman et al., 2018)</ref>, which can be an important tool for realistic problems but is not the focus of this work. In contrast, WILDS considers real-world distribution shifts, and it spans diverse modalities (satellite, microscope, agriculture, and camera trap images) and tasks (classification, regression, detection).</p><p>Text. 
Methods for unsupervised adaptation in NLP are typically evaluated on domain shifts between different textual sources, such as news articles, different categories of product reviews, Wikipedia, or social media platforms <ref type="bibr">(Blitzer et al., 2007;</ref><ref type="bibr">Mansour et al., 2009;</ref><ref type="bibr">Oren et al., 2019;</ref><ref type="bibr">Miller et al., 2020;</ref><ref type="bibr">Kamath et al., 2020;</ref><ref type="bibr">Hendrycks et al., 2020)</ref>, or even more specialized sources such as legal documents <ref type="bibr">(Chalkidis et al., 2020)</ref> or biomedical papers <ref type="bibr">(Lee et al., 2020b;</ref><ref type="bibr">Gu et al., 2020)</ref>. Multilingual tasks can also be a setting for unsupervised adaptation <ref type="bibr">(Conneau et al., 2018;</ref><ref type="bibr">Conneau &amp; Lample, 2019;</ref><ref type="bibr">Hu et al., 2020a;</ref><ref type="bibr">Clark et al., 2020)</ref>, especially when generalizing to low-resource languages <ref type="bibr">(Nekoto et al., 2020)</ref>. The WILDS text datasets differ in that they focus on subpopulation performance, either to particular demographics in CIVILCOMMENTS-WILDS or to tail populations in AMAZON-WILDS, rather than on adapting to a completely distinct domain.</p><p>Molecules. While unlabeled molecules have been used for pre-training <ref type="bibr">(Hu et al., 2020c;</ref><ref type="bibr">Rong et al., 2020)</ref>, no standardized unsupervised adaptation benchmarks have been developed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PROBLEM SETTING</head><p>As in WILDS 1.0, we study the domain shift setting where the data is drawn from domains d &#8712; D. We consider several variants of the domain shift setting. In some applications, all four types of domains are disjoint (e.g., if we are training on labeled data from some hospitals but seeking to generalize to new hospitals); in others, the target domains are a subset of the source domains (e.g., if we are training on a heterogeneous dataset but seeking to measure model performance on particular demographic subpopulations). Models are trained on labeled data from the source domains, as well as unlabeled data of one or more types of domains, depending on what is realistic for the application.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">DATASETS</head><p>WILDS 2.0 augments 8 WILDS datasets with curated unlabeled data. For consistency, the labeled datasets and evaluation metrics are exactly the same as in WILDS 1.0, which allows direct evaluations of the utility of unlabeled training data. The labeled and unlabeled data are disjoint, e.g., the unlabeled target data is different from the labeled target data used for evaluation. Here, we briefly describe each dataset, why unlabeled data can be realistically obtained for the corresponding task, and how it might help. In Appendix A, we provide more information on each dataset, including data provenance and details on data processing. In general, all of the unlabeled datasets in WILDS 2.0 were processed in a similar way as their corresponding labeled datasets from WILDS 1.0.</p><p>IWILDCAM2020-WILDS: Species classification across different camera traps. The task is to classify the animal species in a camera trap image <ref type="bibr">(Beery et al., 2020)</ref>. We aim to generalize to new camera trap locations despite variations in illumination, background, and label frequencies <ref type="bibr">(Beery et al., 2018)</ref>. While hundreds of thousands of camera traps are active worldwide, only a small subset of these traps have had images labeled, and the unlabeled data from the other camera traps capture diverse operating conditions that can be used to learn robust models. In this work, we add unlabeled images from 3,215 extra camera traps also in the WCS Camera Traps dataset <ref type="bibr">(Beery et al., 2020)</ref>. This expands the number of camera traps by 11&#215; and the number of examples by 5&#215;.</p><p>CAMELYON17-WILDS: Tumor identification across different hospitals. The task is to classify image patches from lymph node sections as tumor or normal tissue. 
We seek to generalize to new hospitals, which can differ in their patient demographics and data acquisition protocols <ref type="bibr">(Veta et al., 2016;</ref><ref type="bibr">AlBadawy et al., 2018;</ref><ref type="bibr">Komura &amp; Ishikawa, 2018;</ref><ref type="bibr">Tellez et al., 2019)</ref>. While obtaining labeled data for histopathology applications requires painstaking annotations from expert pathologists, hospitals typically accumulate unlabeled slide images during normal operation. These unlabeled images could be used to adapt to differences between hospitals (e.g., different staining protocols might lead to different color distributions). We provide unlabeled patches from train and test hospitals, which expands the total number of patches by 7.5&#215;. Both the labeled and unlabeled data are adapted from the Camelyon17 dataset <ref type="bibr">(Bandi et al., 2018)</ref>.</p><p>FMOW-WILDS: Land use classification across different regions and years. The task is to classify the type of building or land usage in a satellite image. Given training data from before 2013, we aim to generalize to satellite imagery taken after 2013, while maintaining high accuracy across all geographic regions. While labeling land use requires combining map data and expert annotations, unlabeled data is available in all locations in the world through constant streams of global satellite imagery. Prior work has shown that unlabeled satellite data can improve OOD accuracy in landcover and cropland prediction <ref type="bibr">(Xie et al., 2021a)</ref> as well as aerial object and scene classification <ref type="bibr">(Reed et al., 2021)</ref>. We provide unlabeled satellite imagery across all regions from the train and test timeframes defined in WILDS, expanding the dataset by 3.5&#215;. Both the labeled and unlabeled data are adapted from the FMoW dataset <ref type="bibr">(Christie et al., 2018)</ref>.</p><p>POVERTYMAP-WILDS: Poverty mapping across different countries. 
The task is to predict a real-valued asset wealth index of the area in a satellite image. We consider generalizing across different countries. Like FMOW-WILDS, unlabeled satellite imagery is available globally, while labeled data is expensive to collect as it requires conducting nationally representative surveys in the field. Prior work on poverty prediction has used unlabeled data for entropy minimization <ref type="bibr">(Jean et al., 2018)</ref> and pre-training on auxiliary tasks such as nighttime light prediction <ref type="bibr">(Xie et al., 2016;</ref><ref type="bibr">Jean et al., 2016)</ref>, but these studies do not study generalization to new countries. We provide unlabeled satellite imagery from both train and test countries, expanding the dataset by 14&#215;. Both the labeled and unlabeled data are adapted from <ref type="bibr">Yeh et al. (2020)</ref>.</p><p>GLOBALWHEAT-WILDS: Wheat head detection across different regions. The task is to localize wheat heads in overhead field images. We seek to generalize across image acquisition sessions, each of which represents a particular location, time, and sensor; these can differ in wheat genotype, wheat head appearance, growing conditions, background appearance, illumination, and acquisition protocols. Wheat field images contain many densely packed and overlapping instances, making labeling wheat heads in images costly, tedious and sensitive to the individual annotator. However, hundreds of agricultural research institutes around the world collect terabytes of unlabeled field images which could be used for training. We add unlabeled field images from train, test, and extra acquisition sessions, expanding the dataset by 10&#215;. The labeled and unlabeled data are adapted from the Global Wheat Head Detection dataset and its underlying sources <ref type="bibr">(David et al., 2020;</ref><ref type="bibr">2021)</ref>.</p><p>OGB-MOLPCBA: Molecular property prediction across different scaffolds. 
The task is to predict the biological activity of small molecules represented as molecular graphs <ref type="bibr">(Wu et al., 2018;</ref><ref type="bibr">Hu et al., 2020b)</ref>. We seek to generalize to molecules with new scaffold structures. Labels on biological activity are only available for a small portion of molecules, as they require expensive lab experiments to obtain. However, unlabeled molecule structures are readily available in large-scale chemical databases such as PubChem <ref type="bibr">(Bolton et al., 2008)</ref>, and have been previously used for pretraining <ref type="bibr">(Hu et al., 2020c)</ref> and semi-supervised learning <ref type="bibr">(Sun et al., 2020)</ref>. We provide 5 million unlabeled molecules from source and target scaffolds, which expands the number of molecules by 12.5&#215;. The original labeled data was curated by MoleculeNet <ref type="bibr">(Wu et al., 2018)</ref> from PubChem, and we similarly extracted the unlabeled data from PubChem <ref type="bibr">(Bolton et al., 2008)</ref>.</p><p>CIVILCOMMENTS-WILDS: Toxicity classification across demographic identities. The task is to classify whether a text comment is toxic or not. We consider the subpopulation shift setting, where the model must classify accurately across groups of comments mentioning different demographic identities. While labels require large-scale crowdsourced annotations of comment toxicity, unlabeled article comments are widely available on the internet. We provide unannotated comments as unlabeled data, which expands the size of the dataset by 4.5&#215;. Both the labeled and unlabeled data are adapted from <ref type="bibr">Borkan et al. (2019)</ref>.</p><p>AMAZON-WILDS: Sentiment classification across different users. The task is to classify the star ratings of Amazon reviews. We seek to perform consistently well across new reviewers. 
While the labels (star ratings) are always available for Amazon reviews in practice, unlabeled data is a common source of leverage for sentiment classification more generally, with prior work in domain adaptation <ref type="bibr">(Blitzer &amp; Pereira, 2007;</ref><ref type="bibr">Glorot et al., 2011)</ref> and semi-supervised learning <ref type="bibr">(Dasgupta &amp; Ng, 2009;</ref><ref type="bibr">Li et al., 2011)</ref>. We provide unlabeled reviews from test and extra reviewers, which expands the total number of reviews by 7.5&#215;. Both the labeled and unlabeled data are adapted from the Amazon review dataset by <ref type="bibr">Ni et al. (2019)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">ALGORITHMS</head><p>For our evaluation, we selected representative methods from the three categories described below. These methods exemplify current approaches to using unlabeled data to improve robustness, and they have been successful on popular domain adaptation benchmarks like DomainNet <ref type="bibr">(Peng et al., 2019)</ref> and semi-supervised settings like improving ImageNet accuracy by leveraging unlabeled images from the internet <ref type="bibr">(Xie et al., 2020;</ref><ref type="bibr">Caron et al., 2020)</ref>. For more details, see Appendix B.</p><p>Domain-invariant methods. Domain-invariant methods learn feature representations that are invariant across different domains by penalizing differences between learned source and target representations <ref type="bibr">(Long et al., 2015;</ref><ref type="bibr">Ganin et al., 2016;</ref><ref type="bibr">Sun &amp; Saenko, 2016;</ref><ref type="bibr">Long et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">Saito et al., 2018;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Xu et al., 2019;</ref><ref type="bibr">Zhang et al., 2019b)</ref>. We discuss these methods further in Appendix B.2. For our experiments, we evaluate two classical methods:</p><p>&#8226; Domain-Adversarial Neural Networks (DANN) <ref type="bibr">(Ganin et al., 2016)</ref> penalize representations on which an auxiliary classifier can easily discriminate between source and target examples.</p><p>&#8226; Correlation Alignment (CORAL) <ref type="bibr">(Sun et al., 2016;</ref><ref type="bibr">Sun &amp; Saenko, 2016)</ref> penalizes differences between the means and covariances of the source and target feature distributions.</p><p>Self-training. Self-training methods "pseudo-label" unlabeled examples with the model's own predictions and then train on them as if they were labeled examples. 
These methods often also use consistency regularization, which encourages the model to make consistent predictions on augmented views of unlabeled examples <ref type="bibr">(Sohn et al., 2020;</ref><ref type="bibr">Xie et al., 2020;</ref><ref type="bibr">Berthelot et al., 2021)</ref>. Self-training methods have recently been successfully applied to unsupervised adaptation <ref type="bibr">(Saito et al., 2017;</ref><ref type="bibr">Berthelot et al., 2021;</ref><ref type="bibr">Zhang et al., 2021)</ref>. We include three representative algorithms:</p><p>&#8226; Pseudo-Label <ref type="bibr">(Lee, 2013)</ref> dynamically generates pseudolabels and updates the model each batch.</p><p>&#8226; FixMatch <ref type="bibr">(Sohn et al., 2020)</ref> adds consistency regularization on top of the Pseudo-Label algorithm. Specifically, it generates pseudolabels on a weakly augmented view of the unlabeled data, and then minimizes the loss of the model's prediction on a strongly augmented view.</p><p>&#8226; Noisy Student <ref type="bibr">(Xie et al., 2020)</ref> leverages weak and strong augmentations like FixMatch, but instead of dynamically generating pseudolabels for each batch, it alternates between a few teacher phases, where it generates pseudolabels, and student phases, where it trains to convergence on the (pseudo)labeled data.</p><p>Self-supervision. Self-supervised methods learn useful representations by training on unlabeled data via auxiliary proxy tasks. 
Common approaches include reconstruction tasks <ref type="bibr">(Vincent et al., 2008;</ref><ref type="bibr">Erhan et al., 2010;</ref><ref type="bibr">Devlin et al., 2019;</ref><ref type="bibr">Gidaris et al., 2018;</ref><ref type="bibr">Lewis et al., 2020)</ref>, and contrastive learning <ref type="bibr">(He et al., 2020;</ref><ref type="bibr">Chen et al., 2020b;</ref><ref type="bibr">Caron et al., 2020;</ref><ref type="bibr">Radford et al., 2021b)</ref>, and recent work has shown that self-supervised methods can reduce dependence on spurious correlations and improve performance on domain adaptation tasks <ref type="bibr">(Wang et al., 2021;</ref><ref type="bibr">Tsai et al., 2021;</ref><ref type="bibr">Mishra et al., 2021)</ref>. We use these self-supervision methods for unsupervised adaptation by first pre-training models on the unlabeled data, and then finetuning them on the labeled source data <ref type="bibr">(Shen et al., 2021)</ref>. We evaluate popular self-supervised methods for vision and language:</p><p>&#8226; SwAV <ref type="bibr">(Caron et al., 2020)</ref> is a contrastive learning algorithm that maps representations to a set of clusters and then enforces similarity between cluster assignments.</p><p>&#8226; Masked language modeling (MLM) <ref type="bibr">(Devlin et al., 2019)</ref> randomly masks some of the tokens from input text and trains the model to predict the missing tokens.</p><p>To evaluate how well existing methods can leverage unlabeled data to be robust to in-the-wild distribution shifts, we benchmarked the methods above on all applicable WILDS 2.0 datasets.</p></div>
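<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make two of the algorithm descriptions in Section 5 concrete, the following minimal Python sketch computes a CORAL-style penalty on the means and covariances of source and target features, and a FixMatch-style loss on unlabeled examples. It is an illustration under stated assumptions, not the implementation in the WILDS package: the function names, the unweighted sum of the mean and covariance terms, the biased covariance estimate, and the confidence threshold of 0.95 are choices made for this sketch, and features and logits are plain Python lists rather than framework tensors.</p>
<p><![CDATA[
```python
import math


def mean_and_cov(features):
    """Mean vector and biased covariance matrix of a list of feature vectors."""
    n = len(features)
    d = len(features[0])
    mean = [sum(x[j] for x in features) / n for j in range(d)]
    cov = [
        [sum((x[i] - mean[i]) * (x[j] - mean[j]) for x in features) / n
         for j in range(d)]
        for i in range(d)
    ]
    return mean, cov


def coral_penalty(source_feats, target_feats):
    """Squared differences between source and target means and covariances.

    Assumption: both terms are summed with equal weight and no normalization.
    """
    mu_s, cov_s = mean_and_cov(source_feats)
    mu_t, cov_t = mean_and_cov(target_feats)
    d = len(mu_s)
    mean_term = sum((mu_s[j] - mu_t[j]) ** 2 for j in range(d))
    cov_term = sum(
        (cov_s[i][j] - cov_t[i][j]) ** 2 for i in range(d) for j in range(d)
    )
    return mean_term + cov_term


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Cross-entropy of strong-view predictions against weak-view pseudolabels.

    Examples whose weak-view confidence is below `threshold` contribute zero,
    and the sum is averaged over the whole unlabeled batch.
    """
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        p_weak = softmax(weak)
        confidence = max(p_weak)
        if confidence < threshold:
            continue  # pseudolabel not confident enough; skip this example
        pseudolabel = p_weak.index(confidence)
        p_strong = softmax(strong)
        total += -math.log(p_strong[pseudolabel])
    return total / len(weak_logits)
```
]]></p>
<p>In a training loop, either quantity would typically be added to the supervised loss on the labeled source data with a tunable weight; that weight is one of the additional hyperparameters such methods introduce.</p></div>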
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">SETUP</head><p>We used the default models, labeled training and test sets, and evaluation metrics from WILDS.</p><p>Unlabeled data. WILDS 2.0 contains multiple types of unlabeled data (from source, extra, validation, and/or target domains). For simplicity, we ran experiments on a single type of unlabeled data for each dataset. Where possible, we used unlabeled target data to allow methods to directly adapt to the target distribution; for IWILDCAM2020-WILDS and CIVILCOMMENTS-WILDS, which do not have unlabeled target data, we used the extra domains instead. All methods use exactly the same sets of labeled and unlabeled training data (except ERM, which does not use unlabeled data).</p><p>Hyperparameters. We tuned each method on each dataset separately using random hyperparameter search. Following WILDS 1.0, we used the labeled out-of-distribution (OOD) validation set to select hyperparameters and for early stopping <ref type="bibr">(Koh et al., 2021)</ref>. This validation set is drawn from a different distribution than both the training and the OOD test set, so tuning on it does not leak information on the test distribution. We did not use the in-distribution (ID) validation set. For image classification and regression, we used both RandAugment <ref type="bibr">(Cubuk et al., 2020)</ref> and Cutout <ref type="bibr">(DeVries &amp; Taylor, 2017)</ref> as data augmentation for all methods. We did not use data augmentation for the remaining datasets. For some datasets, we also had ground truth labels for the "unlabeled" data, which we used to run fully-labeled ERM experiments. Overall, we ran 600+ experiments for 7,000 GPU hours on NVIDIA V100s. See Appendix B for a discussion of which methods were applicable to which datasets; Appendix C for augmentation details; Appendix F for the fully-labeled experiments; Appendix D for further experimental details.</p></div>
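<div xmlns="http://www.tei-c.org/ns/1.0"><p>The selection protocol described in the setup (random hyperparameter search, with the labeled OOD validation set used both to pick hyperparameters and for early stopping) can be sketched as follows. This is a hedged illustration rather than the experiment scripts themselves: <code>random_search</code>, <code>sample_config</code>, and <code>train_and_eval</code> are hypothetical names, and <code>train_and_eval</code> is assumed to return one OOD validation score per epoch, with higher being better.</p>
<p><![CDATA[
```python
import random


def random_search(sample_config, train_and_eval, num_trials, seed=0):
    """Return (best OOD validation score, config, epoch) over random trials.

    sample_config(rng) draws one hyperparameter configuration (hypothetical).
    train_and_eval(config) returns a list of per-epoch OOD validation scores
    (hypothetical stand-in for an actual training run).
    """
    rng = random.Random(seed)
    best = None
    for _ in range(num_trials):
        config = sample_config(rng)
        val_scores = train_and_eval(config)
        # Early stopping: keep the epoch with the best OOD validation score.
        best_epoch = max(range(len(val_scores)), key=val_scores.__getitem__)
        candidate = (val_scores[best_epoch], config, best_epoch)
        # Hyperparameter selection: keep the trial with the best such score.
        if best is None or candidate[0] > best[0]:
            best = candidate
    return best
```
]]></p>
<p>Because both choices are driven by the OOD validation set, which is drawn from a different distribution than the OOD test set, the test set plays no role in selection under this scheme.</p></div>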
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">RESULTS</head><p>Table <ref type="table">2</ref> shows mixed results on WILDS: most methods do not improve over standard empirical risk minimization (ERM) despite access to unlabeled data and careful hyperparameter tuning. In contrast, these methods have been shown to perform well on prior unsupervised adaptation benchmarks; in Appendix E, we verify our implementations by showing that these methods (with the exception of CORAL) outperform ERM on the real &#8594; sketch shift in DomainNet, a standard unsupervised adaptation benchmark for object classification <ref type="bibr">(Peng et al., 2019)</ref>.</p><p>Image classification (IWILDCAM2020-WILDS, CAMELYON17-WILDS, and FMOW-WILDS). Data augmentation improved OOD performance on all three image classification datasets. The gain was the most substantial on CAMELYON17-WILDS, where vanilla ERM achieved 70.8% accuracy, while ERM with data augmentation achieved 82.0% accuracy.<ref type="foot">foot_1</ref></p><p>On CAMELYON17-WILDS and FMOW-WILDS, where we had access to unlabeled target data, Noisy Student and SwAV pre-training consistently improved OOD performance and reduced variability across replicates. However, the other methods (CORAL, DANN, Pseudo-Label, and FixMatch) underperformed ERM. This was especially surprising for FixMatch, which performed very well on DomainNet (Appendix E). Both FixMatch and Noisy Student use pseudo-labeling and consistency regularization, but FixMatch dynamically computes pseudo-labels in each batch from the start of training, whereas Noisy Student first trains a teacher model to convergence on the labeled data and updates pseudolabels at a much slower rate. As in <ref type="bibr">Xie et al. (2020)</ref>, this suggests that dynamically updating pseudo-labels might hurt generalization.</p><p>On IWILDCAM2020-WILDS, where we had access to 4&#215; as many unlabeled images from extra domains (distinct camera traps) but not to any images from the target domains, none of the benchmarked methods improved OOD performance compared to ERM. This was surprising, as many of these methods were originally shown to work in semi-supervised settings. One difference could be that the labeled and unlabeled examples in IWILDCAM2020-WILDS differ more significantly (as they originate from different camera traps) than in the original FixMatch paper <ref type="bibr">(Sohn et al., 2020)</ref>, which used i.i.d. labeled and unlabeled data, or the Noisy Student paper <ref type="bibr">(Xie et al., 2020)</ref>, which used ImageNet labeled data <ref type="bibr">(Russakovsky et al., 2015)</ref> and JFT unlabeled data <ref type="bibr">(Hinton et al., 2015)</ref>.</p><p>Fully-labeled ERM models that used ground truth labels for the "unlabeled" data were available for FMOW-WILDS and IWILDCAM2020-WILDS. They significantly outperformed other methods, suggesting room for improvement in how we leverage the unlabeled data.</p>
<p>Table <ref type="table">2</ref>: The in-distribution (ID) and out-of-distribution (OOD) performance of each method on each applicable dataset. Following WILDS 1.0, we ran 3-10 replicates (random seeds) for each cell, depending on the dataset. We report the standard deviation across replicates in parentheses; the standard error (of the mean) is lower by the square root of the number of replicates. Fully-labeled experiments use ground truth labels on the "unlabeled" data. We bold the highest non-fully-labeled OOD performance numbers as well as others where the standard error is within range. Below each dataset name, we report the type of unlabeled data and metric used.</p></div>
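<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Table 2 caption notes that the reported standard deviations can be converted to standard errors of the mean by dividing by the square root of the number of replicates. A minimal sketch of that conversion is given below. The function name and the replicate scores in the usage note are hypothetical and are not taken from the table, and the use of the sample (n - 1) standard deviation is an assumption of this sketch.</p>
<p><![CDATA[
```python
import math
import statistics


def summarize_replicates(scores):
    """Return (mean, standard deviation, standard error) across replicates.

    Assumption: the standard deviation is the sample (n - 1) estimate.
    """
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)
    se = sd / math.sqrt(len(scores))  # standard error of the mean
    return mean, sd, se
```
]]></p>
<p>For example, three hypothetical replicate accuracies of 70.0, 72.0, and 74.0 have a mean of 72.0 and a standard deviation of 2.0, so the standard error is 2.0 divided by the square root of 3, or roughly 1.15.</p></div>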
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Image regression (POVERTYMAP-WILDS).</head><p>Data augmentation had no effect on performance on POVERTYMAP-WILDS, which differs from the above image datasets in that it is a regression task and involves multi-spectral satellite images (with 7 channels); both of these aspects are relatively unstudied compared to standard RGB image classification. All applicable methods underperformed standard ERM, despite having access to unlabeled data from the target domains (countries).</p><p>Image detection (GLOBALWHEAT-WILDS). We did not apply data augmentation here, as standard augmentation changes the labels (e.g., cropping the image might remove bounding boxes) and would violate the assumption that labels are invariant under augmentations, which contrastive and consistency regularization methods like SwAV, Noisy Student, and FixMatch rely on. Accordingly, we did not evaluate FixMatch and SwAV, and we modified Noisy Student to remove data augmentation noise. All applicable methods underperformed ERM.</p><p>Molecule classification (OGB-MOLPCBA). We did not apply data augmentation techniques to OGB-MOLPCBA as they are not well-developed for molecular graphs. All methods underperformed ERM. We did not report ID results as this dataset has no separate ID test set.</p><p>Text classification (CIVILCOMMENTS-WILDS, AMAZON-WILDS). Likewise, we did not apply data augmentation to the text datasets. On both datasets, other methods performed similarly to ERM (with class-balancing for CIVILCOMMENTS-WILDS). 
Continued masked LM pre-training on the unlabeled data did not improve target performance, unlike in prior work <ref type="bibr">(Gururangan et al., 2020)</ref>; this might be because the BERT pre-training corpus <ref type="bibr">(Devlin et al., 2019;</ref><ref type="bibr">Hendrycks et al., 2020)</ref> is more similar to the online comments in CIVILCOMMENTS-WILDS and product reviews in AMAZON-WILDS than to the biomedical/CS text studied in <ref type="bibr">Gururangan et al. (2020)</ref>. Also, CIVILCOMMENTS-WILDS and AMAZON-WILDS measure subpopulation performance (on minority demographics and on the tail subpopulation, respectively), whereas prior work adapted models to new areas of the input space (e.g., from news to biomedical articles). Fully-labeled ERM models showed modest gains compared to FMOW-WILDS and IWILDCAM2020-WILDS. As our evaluations on these text datasets focus on subpopulation performance, these results are consistent with prior observations that ERM models can have poor subpopulation performance even with large labeled training sets <ref type="bibr">(Sagawa et al., 2020)</ref>, necessitating other approaches to subpopulation shifts.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">DISCUSSION</head><p>We conclude by discussing several takeaways and promising directions for future work.</p><p>The role of data augmentation. Many unsupervised adaptation methods rely strongly on data augmentation for consistency regularization or contrastive learning. This reliance on data augmentation techniques, which are largely image-specific, restricts their generality, as they do not readily generalize to other modalities (or even other types of images besides photos). Developing data augmentation techniques that can work well in other applications and modalities could be crucial for expanding the applicability of these methods <ref type="bibr">(Verma et al., 2021)</ref>.</p><p>Hyperparameter tuning. Unsupervised adaptation methods have even more hyperparameters than standard supervised methods, and consistent with prior work, we found that these hyperparameters can significantly affect OOD performance <ref type="bibr">(Saito et al., 2021)</ref>. Moreover, unlike in standard i.i.d. settings, we do not have labeled target data that we can use for hyperparameter selection. Improved methods for hyperparameter tuning could significantly improve OOD performance. Such methods might make use of the unlabeled target data, or even the combination of labeled and unlabeled OOD validation data, which is provided for most datasets in WILDS 2.0.</p><p>Pre-training on broader unlabeled data. Pre-training on huge amounts of unlabeled data improves robustness to distribution shifts in some settings <ref type="bibr">(Bommasani et al., 2021)</ref>. The unlabeled data need not be related to the task: e.g., CLIP was pre-trained on text-image pairs from the internet but tested on tasks including histopathology and satellite image classification <ref type="bibr">(Radford et al., 2021a)</ref>. 
Existing techniques for this type of broad pre-training appear insufficient for WILDS: many of our models were initialized with ImageNet-pretrained weights or derivatives of BERT, but do not generalize well OOD. While we focused on providing curated unlabeled data that is closely tailored to the task, it could be fruitful to use both broad and curated unlabeled data.</p><p>Leveraging domain annotations and task-specific structure. OOD robustness is ill-posed in general, as models cannot be robust to arbitrary distribution shifts. Beyond unlabeled data, WILDS also has domain annotations and other structured metadata for both labeled and unlabeled data (e.g., in IWILDCAM2020-WILDS, we know which images were taken from which cameras). Exploiting this type of fine-grained domain structure for unsupervised adaptation, e.g., through multi-source/multi-target domain adaptation methods <ref type="bibr">(Zhao et al., 2018;</ref><ref type="bibr">Peng et al., 2019)</ref>, could be a promising avenue for learning models that are more robust to the domain shifts in WILDS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ETHICS STATEMENT</head><p>All WILDS datasets are curated and adapted from public data sources, with licenses that allow for public release. The datasets are all anonymized.</p><p>The distribution shifts in several of the WILDS datasets deal with issues of discrimination and bias that arise in real-world applications. For example, CIVILCOMMENTS-WILDS studies disparate model performance across online comments that mention different demographic groups, while FMOW-WILDS and POVERTYMAP-WILDS study countries and regions where labeled satellite data is less readily available. As our results suggest, standard models trained on these datasets will not perform well on those subpopulations, and their learned representations might also be biased in undesirable ways <ref type="bibr">(Bolukbasi et al., 2016;</ref><ref type="bibr">Caliskan et al., 2017;</ref><ref type="bibr">Garg et al., 2018;</ref><ref type="bibr">Tan &amp; Celis, 2019;</ref><ref type="bibr">Steed &amp; Caliskan, 2021)</ref>. We also encourage caution in interpreting positive results on these datasets, as our evaluation metrics might not encompass all relevant facets of discrimination and bias: e.g., the "ground truth" toxicity annotations in CIVILCOMMENTS-WILDS can themselves be biased, and the particular choice of regions in FMOW-WILDS might obscure lower model performance in sub-regions.</p><p>For FMOW-WILDS and POVERTYMAP-WILDS, surveillance and privacy issues also need to be considered. In FMOW-WILDS, the image resolution is lower than that of other public satellite data (e.g., from Google Maps), and in POVERTYMAP-WILDS, the location metadata is noised to protect privacy. For a deeper discussion of the ethics of remote sensing in the context of humanitarian aid and development, we refer readers to the UNICEF report by <ref type="bibr">Berman et al. (2018)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>REPRODUCIBILITY STATEMENT</head><p>All WILDS datasets are publicly available at <ref type="url">https://wilds.stanford.edu</ref>, together with code and scripts to replicate all of the experiments in this paper. We also provide all trained model checkpoints and results, together with the exact hyperparameters used.</p><p>In our appendices, we provide more details on the datasets and experiments:</p><p>&#8226; In Appendix A, we describe each of the updated datasets in WILDS 2.0 and their sources of unlabeled data as well as what data processing steps were taken.</p><p>&#8226; In Appendix B, we describe the implementations of each of our benchmarked methods in detail. In particular, we discuss any changes we made to their original implementations, either for consistency with other methods or with prior implementations of these methods.</p><p>&#8226; In Appendix C, we describe details of the data augmentations (if any) that we used across each dataset.</p><p>&#8226; In Appendix D, we describe our experimental protocol, including the hyperparameter selection procedure and hyperparameter grids for all of the methods and datasets.</p><p>&#8226; In Appendix E, we describe the details of our experiments on DomainNet.</p><p>&#8226; In Appendix F, we describe the details of our fully-labeled ERM experiments.</p><p>&#8226; Finally, in Appendix G, we include an illustrative code snippet of how to use the data loaders in the WILDS library.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AUTHOR CONTRIBUTIONS</head><p>The project was initiated by Shiori Sagawa, Pang Wei Koh, and Percy Liang. Shiori Sagawa and Pang Wei Koh led the project and coordinated the activities below. Tony Lee developed the experimental infrastructure and ran the experiments. Tony Lee, Irena Gao, Sang Michael Xie, Kendrick Shen, Ananya Kumar, and Michihiro Yasunaga designed the evaluation framework and implemented the algorithms. The unlabeled data loaders and corresponding dataset writeups were added by:</p><p>&#8226; AMAZON-WILDS: Tony Lee</p><p>&#8226; CAMELYON17-WILDS: Tony Lee </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A ADDITIONAL DATASET DETAILS</head><p>In this appendix, we provide additional details on the unlabeled data in WILDS 2.0. For more context on the motivation behind each dataset, the choice of evaluation metric, and the labeled data, please refer to the original WILDS paper <ref type="bibr">(Koh et al., 2021)</ref>.</p><p>A.1 IWILDCAM2020-WILDS</p><p>The IWILDCAM2020-WILDS dataset was adapted from the iWildCam 2020 competition dataset, which is made up of data provided by the Wildlife Conservation Society (WCS) <ref type="bibr">(Beery et al., 2020)</ref>.</p><p>Broader context. There are large volumes of unlabeled natural world data that have been collected in growing repositories such as iNaturalist (Nugent, 2018), Wildlife Insights <ref type="bibr">(Ahumada et al., 2020)</ref>, and GBIF <ref type="bibr">(Robertson et al., 2014)</ref>. This data includes images or video collected by remote sensors or community scientists, GPS track data from on-animal devices, aerial data from drones or satellites, underwater sonar, bioacoustics, and eDNA. Methods that can harness the wealth of information in unlabeled ecological data are well-poised to make significant breakthroughs in how we think about ecological and conservation-focused research. Natural-world and ecological benchmarks that provide unlabeled data include NEWT <ref type="bibr">(Van Horn et al., 2021)</ref>, investigating efficient task learning, and Semi-Supervised iNat <ref type="bibr">(Su &amp; Maji, 2021)</ref>, which provides labeled data for only a subset of the taxonomic tree. 
Recent work has begun to adapt weakly-supervised and self-supervised approaches for these natural world settings, including probing the generality and efficacy of self-supervision <ref type="bibr">(Cole et al., 2021)</ref>, incorporating domain-relevant context into self-supervision <ref type="bibr">(Pantazis et al., 2021)</ref>, or leveraging weak supervision from alternative data modalities <ref type="bibr">(Weinstein et al., 2019)</ref> or pre-trained, generic models <ref type="bibr">(Weinstein et al., 2021;</ref><ref type="bibr">Beery et al., 2019)</ref>. Active learning also plays a role here in seeking to adapt models efficiently to unlabeled data from novel regions with only a few targeted labels <ref type="bibr">(Kellenberger et al., 2019;</ref><ref type="bibr">Norouzzadeh et al., 2021)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 CAMELYON17-WILDS</head><p>The CAMELYON17-WILDS dataset <ref type="bibr">(Koh et al., 2021)</ref> was adapted from the Camelyon17 dataset <ref type="bibr">(Bandi et al., 2018)</ref>, which is a collection of whole-slide images (WSIs) of breast cancer metastases in lymph node sections from 5 hospitals in the Netherlands. The labels were obtained by asking expert pathologists to perform pixel-level annotations of each WSI, which is an expensive and painstaking process. In practice, unlabeled WSIs (i.e., WSIs without pixel-level annotations) are much easier to obtain. For example, only a fraction of the WSIs in the original Camelyon17 dataset <ref type="bibr">(Bandi et al., 2018)</ref> were labeled; the other WSIs, which are taken from the same 5 hospitals, were provided without labels. In this work, we augment the CAMELYON17-WILDS dataset with unlabeled data from these WSIs.</p><p>Problem setting. The task is to classify whether a histological image patch contains any tumor tissue. We consider generalizing from a set of training hospitals to new hospitals at test time. The input x corresponds to a 96&#215;96 image patch extracted from a WSI of a lymph node section, the label y is a binary indicator of whether the central 32&#215;32 patch of the input contains any pixel that was annotated as a tumor in the WSI, and the domain d identifies which hospital the patch came from. Each patch also includes metadata on which WSI it was extracted from, though we do not use this metadata for training or evaluation. Models are evaluated by their average accuracy on a class-balanced test dataset.</p><p>Data. All of the labeled and unlabeled data are taken from the Camelyon17 dataset <ref type="bibr">(Bandi et al., 2018)</ref>, which consists of WSIs from 5 hospitals (domains) in the Netherlands. We provide unlabeled data from the same domains as the labeled CAMELYON17-WILDS dataset (no extra domains). 
The domains are split as follows:</p><p>1. Source: Hospitals 1, 2, and 3.</p><p>2. Validation (OOD): Hospital 4.</p><p>3. Target (OOD): Hospital 5.</p><p>CAMELYON17-WILDS also includes a Validation (ID) set which contains data from the training hospitals.</p><p>The CAMELYON17-WILDS dataset has a total of 455,954 labeled patches across these splits, derived from the 10 WSIs per hospital that have full pixel-level annotations. We augment the dataset with a total of 2,999,307 unlabeled patches, extracted from an additional 90 unlabeled WSIs per hospital.</p><p>There is no overlap between the WSIs used for the labeled versus unlabeled data. To extract and process each patch, we followed the same data processing steps that were carried out for the labeled data in <ref type="bibr">Koh et al. (2021)</ref>.</p><p>Unlike the labeled patches, which were sampled in a class-balanced manner (i.e., half of the patches have positive labels), we sampled the unlabeled patches uniformly at random from the unlabeled WSIs. We sampled 6,667 patches per unlabeled WSI, with the single exception of one WSI which had only 5,824 valid patches, resulting in a total of 2,999,307 unlabeled patches (Table <ref type="table">4</ref>). While the labeled patches were sampled in a class-balanced manner, the underlying label distribution skews heavily negative (approximately 95% of the patches in a WSI are negative), so we expect the unlabeled patches to be similarly skewed in their label distribution.</p><p>Broader context. We focused on providing unlabeled data from the same hospitals (domains) as in the original labeled CAMELYON17-WILDS dataset. 
This unlabeled data from the training and test hospitals can be used to develop and evaluate methods for semi-supervised learning <ref type="bibr">(Peikari et al., 2018;</ref><ref type="bibr">Akram et al., 2018;</ref><ref type="bibr">Lu et al., 2019;</ref><ref type="bibr">Shaw et al., 2020)</ref> and domain adaptation <ref type="bibr">(Ren et al., 2018;</ref><ref type="bibr">Zhang et al., 2019a;</ref><ref type="bibr">Koohbanani et al., 2021)</ref>, respectively. In practice, there is also a large amount of unlabeled data from different domains that is publicly available: for example, The Cancer Genome Atlas (TCGA) hosts tens of thousands of publicly-available slide images across a variety of cancer types and from many different hospitals <ref type="bibr">(Weinstein et al., 2013)</ref>. These large and diverse datasets need not even be directly relevant to the task at hand, e.g., one could pretrain a model on images for different types of cancer even if the goal were to develop a model for breast cancer. Recent work has started to explore the use of these large and diverse datasets for computational pathology applications <ref type="bibr">(Ciga et al., 2020;</ref><ref type="bibr">Dehaene et al., 2020)</ref> and in other medical imaging applications <ref type="bibr">(Azizi et al., 2021)</ref>.</p></div>
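<div xmlns="http://www.tei-c.org/ns/1.0"><p>The patch labeling rule and the unlabeled patch count described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the processing code used for the dataset: the function names and the nested-list mask representation are hypothetical, and only the sizes (96&#215;96 patches, central 32&#215;32 region) and counts (5 hospitals, 90 unlabeled WSIs each, 6,667 patches per WSI, one WSI with 5,824) come from the text.</p>
<p>
```python
def patch_label(tumor_mask, center=32):
    """Return 1 if any pixel in the central region of the patch is tumor.

    tumor_mask: square 2D list of 0/1 pixel annotations (96 x 96 in the text).
    center: side length of the central region that determines the label.
    """
    size = len(tumor_mask)
    start = (size - center) // 2
    stop = start + center
    return int(any(tumor_mask[r][c]
                   for r in range(start, stop)
                   for c in range(start, stop)))


def total_unlabeled_patches(num_hospitals=5, wsis_per_hospital=90,
                            patches_per_wsi=6667, short_wsi_patches=5824):
    """Count patches when exactly one WSI yields fewer valid patches."""
    num_wsis = num_hospitals * wsis_per_hospital
    return (num_wsis - 1) * patches_per_wsi + short_wsi_patches


# Hypothetical masks: tumor only in a corner, then also at the center.
corner_only = [[0] * 96 for _ in range(96)]
corner_only[0][0] = 1
with_center = [row[:] for row in corner_only]
with_center[48][48] = 1

print(patch_label(corner_only), patch_label(with_center))
print(total_unlabeled_patches())
```
</p>
<p>A patch whose only tumor pixels lie outside the central region receives a negative label, and the count of 449 full WSIs plus one short WSI reproduces the 2,999,307 unlabeled patches reported above.</p></div>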
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 FMOW-WILDS</head><p>The FMOW-WILDS dataset <ref type="bibr">(Koh et al., 2021)</ref> was adapted from the FMoW dataset <ref type="bibr">(Christie et al., 2018)</ref>, which consists of global satellite images from 2002-2018, labeled with the functional purpose of the buildings or land in the image. The labels are collected by a process which combines map data with crowdsourced annotations (from a trusted crowd). In contrast, unlabeled satellite imagery is readily available across the globe. In this work, we augment the FMOW-WILDS dataset with unused satellite images that were part of the original FMoW dataset but not in the FMOW-WILDS dataset.</p><p>Problem setting. The task is to classify the building or land-use type of a satellite image. We consider generalizing from images before 2013 to after 2013, and we also consider the performance on the worst-case geographic region (Africa, the Americas, Oceania, Asia, or Europe). The input x is an RGB satellite image (224&#215;224 pixels). The label y is one of 62 building or land use categories.</p><p>The domain d represents both the year and the geographical region of the image. Each image also includes metadata on the location and time of the image, although we do not use these except for splitting the domains. Models are evaluated by their average and worst-region accuracies in the OOD timeframe.</p><p>Data. The labeled and unlabeled data are taken from the FMoW dataset <ref type="bibr">(Christie et al., 2018)</ref>. We provide unlabeled data from the same domains as the labeled FMOW-WILDS dataset (no extra domains). The domains are as follows:</p><p>1. Source: The FMOW-WILDS dataset has 141,696 labeled images across these splits. We augment the dataset with 340,469 unlabeled images. These images come from two sources:</p><p>1. 
We use a sequestered split of the dataset, which consists of new locations that are not in the original labeled FMOW-WILDS dataset; these unlabeled data are drawn from the same distribution as the labeled data.</p><p>2. For the unlabeled target and validation splits, we also add unlabeled data in their respective timeframes from the training set locations. While the unlabeled data from the Validation (OOD) and Target (OOD) domains can come from the same locations as the labeled training data, we note that none of the locations in the labeled Validation (OOD) or Target (OOD) data, which is used for evaluation, is shared with any of the unlabeled or labeled data used for training.</p><p>Broader context. We focus on providing unlabeled data from the years (domains) that were in the original FMOW-WILDS dataset. Prior works have used unlabeled satellite imagery for pretraining <ref type="bibr">(Xie et al., 2016;</ref><ref type="bibr">Jean et al., 2016;</ref><ref type="bibr">Xie et al., 2021a;</ref><ref type="bibr">Reed et al., 2021)</ref>, self-training <ref type="bibr">(Xie et al., 2021a)</ref>, and semi-supervised learning <ref type="bibr">(Reed et al., 2021)</ref>. Leveraging unlabeled satellite imagery is powerful since it is widely available and can reduce the frequency at which we need to re-collect labeled data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 POVERTYMAP-WILDS</head><p>The POVERTYMAP-WILDS dataset (Koh et al., 2021) was adapted from <ref type="bibr">Yeh et al. (2020)</ref>. The dataset consists of satellite images from 23 African countries, labeled with a village-level real-valued asset wealth index (measure of wealth). The labels are collected by conducting a nationally representative survey, which requires sending workers into the field to ask each household a number of questions and can be very expensive. In contrast, unlabeled satellite imagery is readily available across the globe. In this work, we augment the POVERTYMAP-WILDS dataset with satellite images from the same LandSat satellite.</p><p>Problem setting. The task is to predict a real-valued asset wealth index from a satellite image. We consider generalizing across country borders (the dataset contains 5 different cross validation folds, each splitting the countries differently). The input x is a multispectral LandSat satellite image with 8 channels (resized to 224 &#215; 224 pixels). The output y is a real-valued asset wealth index. The domain d represents the country the image was taken in, as well as whether the image was taken in an urban or rural area. Each image also includes metadata on the location and time, although we do not make use of these except for defining the domains. Models are evaluated by the average Pearson correlation (r) across 5 folds, as well as the lower of the Pearson correlations on the urban and rural subpopulations, to test generalization to these subpopulations. In particular, generalization to rural subpopulations is important as poverty is more common in rural areas.</p><p>Data. We provide unlabeled data from the same domains as the labeled POVERTYMAP-WILDS dataset (no extra domains). The domains are split as follows:</p><p>1. Source: Images from training countries in the fold.</p><p>2. Validation (OOD): Images from validation countries in the fold.</p><p>3. Target (OOD): Images from test countries in the fold.</p><p>All the countries in these splits are disjoint. Folds also contain a Validation (ID) and Target (ID) set with data from the training countries.</p><p>The POVERTYMAP-WILDS dataset has 19,669 labeled images across these splits. We augment the dataset with 261,396 unlabeled images from the same 23 countries. These images are collected using the same process as <ref type="bibr">Yeh et al. (2020)</ref> from the same LandSat satellite. The image locations are chosen to be roughly near survey locations from the Demographic and Health Surveys (DHS).</p><p>Broader context. We focus on providing unlabeled data from the countries (domains) that were in the original POVERTYMAP-WILDS dataset. Prior works on poverty prediction have used unlabeled data for pretraining (on an auxiliary task such as nighttime light prediction) <ref type="bibr">(Xie et al., 2016;</ref><ref type="bibr">Jean et al., 2016)</ref> and for semi-supervised learning via entropy minimization <ref type="bibr">(Jean et al., 2018)</ref>. However, these works focus on generalization to new locations in the countries in the training set. Poverty prediction is different from usual tasks in that the output is real-valued. Most methods for unlabeled data are made for classification tasks, and we hope that our dataset will encourage more work on methods for using unlabeled data for improving OOD performance in regression tasks.</p></div>
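<div xmlns="http://www.tei-c.org/ns/1.0"><p>The POVERTYMAP-WILDS evaluation described above can be sketched for a single fold as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names and data are hypothetical, and we assume that the subpopulation metric is the lower of the Pearson correlations computed separately on urban and rural examples. The reported numbers would then be averages of these per-fold values over the 5 folds.</p>
<p>
```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def evaluate_fold(preds, targets, is_urban):
    """Return (overall r, lower of urban r and rural r) for one fold."""
    overall = pearson_r(preds, targets)
    per_group = []
    for flag in (True, False):
        idx = [i for i, u in enumerate(is_urban) if u == flag]
        per_group.append(pearson_r([preds[i] for i in idx],
                                   [targets[i] for i in idx]))
    return overall, min(per_group)


# Hypothetical fold: predictions track wealth on urban examples
# but are reversed on rural examples.
preds = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
targets = [2.0, 4.0, 6.0, 3.0, 2.0, 1.0]
is_urban = [True, True, True, False, False, False]
overall_r, worst_r = evaluate_fold(preds, targets, is_urban)
print(round(overall_r, 3), round(worst_r, 3))
```
</p>
<p>In this toy fold the overall correlation hides a failure on the rural examples, which is the kind of disparity the worst-case subpopulation metric is meant to surface.</p></div>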
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5 GLOBALWHEAT-WILDS</head><p>The GLOBALWHEAT-WILDS dataset was extended from the Global Wheat Head Dataset developed by <ref type="bibr">David et al. (2020;</ref><ref type="bibr">2021)</ref>. The goal of the dataset is to localize wheat heads from field images to assist plant scientists to assess the density, size, and health of wheat heads in a particular wheat field. This imagery is acquired during different periods to cover the development of the vegetation, from the emergence to organ appearance. Examples in GLOBALWHEAT-WILDS are labeled by bounding box annotations of each wheat head in the image. Wheat heads are densely packed and overlapping, making object annotation highly tedious. Thus, the Global Wheat Head Dataset (GWHD) is relatively small, while in reality more field images are available. We supplement GLOBALWHEAT-WILDS with unlabeled examples from the same set of field vehicles and sensors but taken in different acquisition sessions, i.e., at different locations or the same location in a different year. The inclusion of this unlabeled data allows: 1) a much higher spatial coverage of a field location when the data comes from an acquisition session which is already included, 2) a much higher temporal resolution when the data comes from a location which is already included, so we have a larger range of wheat growth stages, and 3) slightly more diversity when the session comes from a different location, but with the same image acquisition protocol (i.e., the same field vehicle and image sensor).</p><p>Problem setting. The task is to localize wheat heads in high resolution overhead field images taken from above the crop canopy. We consider generalizing across acquisition sessions representing a particular location, time and sensor with which the images were captured. 
Variation across sessions includes changes in wheat genotype, wheat head appearance, growing conditions, background appearance, illumination and acquisition protocol. The input x is an overhead outdoor image of wheat canopy, and the label y is a set of box coordinates bounding the wheat heads (the spike at the top of the wheat plant holding grain), omitting any hair-like awns that may extend from the head. The domain d designates an acquisition session, which corresponds to a certain location, time, and imaging sensor.</p><p>Data. We provide unlabeled data from the same domains as the labeled GLOBALWHEAT-WILDS dataset. Additionally, we provide unlabeled data from extra acquisition sessions not in the labeled GLOBALWHEAT-WILDS dataset (extra domains). The domains are split as follows:</p><p>1. Source: 18 acquisition sessions in Europe (France &#215;13, Norway &#215;2, Switzerland, United Kingdom, Belgium).</p><p>2. Validation (OOD): 8 acquisition sessions: 7 in Asia (Japan &#215; 4, China &#215; 3) and 1 in Africa (Sudan).</p><p>3. Target (OOD): 21 acquisition sessions: 11 in Australia and 10 in North America (USA &#215; 6, Mexico &#215; 3, Canada).</p><p>4. Extra (OOD): 53 acquisition sessions distributed across the world.</p><p>The source, validation, and target sessions are split by continent, while the extra sessions are taken from across the world. For acquisition sessions with both labeled and unlabeled data, we randomly selected new patches of 1024&#215;1024 pixels from the original underlying data. The images were preprocessed in the same way as described in <ref type="bibr">David et al. (2021)</ref>.</p><p>Broader context. Utilizing unlabeled data is relatively new in the context of plant phenotyping, due to the lack of a large, unlabeled database of plant images. However, larger plant image datasets are starting to become available, such as from the Terraphenotyping Reference Platform (TERRA-Ref, <ref type="bibr">Burnette et al. (2018)</ref>). 
Increasing the sample size and variation within plant datasets is an important goal, because plants from the same species are fairly self-similar within the same field and therefore increasing the number of locations, times and image types included in a dataset can be beneficial for making fine-grained visual classifications for plants. Further, for plant phenotyping to be used in farming applications, such as for precisely spraying weeds in a field with herbicide, models must be highly robust to variations between different fields. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.6 OGB-MOLPCBA</head><p>The OGB-MOLPCBA dataset was adapted from the Open Graph Benchmark <ref type="bibr">(Hu et al., 2020b)</ref> and originally curated by MoleculeNet <ref type="bibr">(Wu et al., 2018)</ref> from the PubChem database <ref type="bibr">(Bolton et al., 2008)</ref>. The dataset is a collection of molecules annotated with 128 kinds of binary labels indicating the outcome of different biological assays. Performing biological assays is expensive, and as a result, the assay labels are only sparsely available over a tiny portion of the molecules curated in the large-scale PubChem database <ref type="bibr">(Bolton et al., 2008)</ref>. On the other hand, unlabeled molecule data is abundant and readily available from the database. Prior work in graph machine learning has leveraged unlabeled molecules to perform pre-training <ref type="bibr">(Hu et al., 2020c)</ref> and semi-supervised learning <ref type="bibr">(Sun et al., 2020)</ref>. In this work, we augment the OGB-MOLPCBA dataset with unlabeled molecules subsampled from the PubChem database.</p><p>Problem setting. The task is multi-task molecule classification, and we consider generalizing to new molecule scaffold structures at test time. The input x corresponds to a molecular graph (where nodes are atoms and edges are chemical bonds), and the label y is a 128-dimensional binary vector representing the binary outcomes of the biological assays. y could contain NaN values, indicating that the corresponding biological assays were not performed on the given molecule. The domain d indicates the scaffold group a molecule belongs to. As the binary labels are highly-skewed, the model's classification performance is evaluated using the Average Precision.</p><p>Data. All of the labeled and unlabeled data are taken from the PubChem database <ref type="bibr">(Bolton et al., 2008)</ref>. 
We provide unlabeled data from the same domains as the labeled OGB-MOLPCBA dataset (no extra domains). We curate the unlabeled data by randomly sampling 5 million molecules from the PubChem database. We then assign these unlabeled molecules to the existing labeled scaffold groups that contain the most similar molecules. Specifically, we first compute the 1024-dimensional Morgan fingerprints for all the molecules <ref type="bibr">(Rogers &amp; Hahn, 2010;</ref><ref type="bibr">Landrum et al., 2006)</ref>. Then, for each unlabeled molecule, we compute its Jaccard similarity against all the labeled molecules in OGB-MOLPCBA and select the labeled molecule with the highest Jaccard similarity. Finally, we assign the unlabeled molecule to the scaffold group that this most similar labeled molecule belongs to. This way, the molecules within the same scaffold groups are structurally similar to each other.</p><p>The domains in the OGB-MOLPCBA dataset are as follows:</p><p>1. Source: 44,930 scaffold groups.</p><p>2. Validation (OOD): 31,361 scaffold groups.</p><p>3. Target (OOD): 43,793 scaffold groups.</p><p>The largest scaffolds are in the source split and the smallest scaffolds in the target split. We assign all of the unlabeled molecules to the existing domains, so there are no extra domains added.</p><p>While the unlabeled data are similar to the labeled data in that they were all derived from PubChem <ref type="bibr">(Bolton et al., 2008)</ref>, it is quite possible that there was some selection bias in which molecules in PubChem were chosen to be labeled, which would lead to an undocumented distribution shift between the unlabeled and labeled datasets.</p><p>Broader context. We focused on providing unlabeled data for both training and OOD test domains. 
Unlabeled molecules can be used to develop and evaluate methods for domain adaptation, self-training, as well as pre-training <ref type="bibr">(Hu et al., 2020c)</ref> and semi-supervised learning <ref type="bibr">(Sun et al., 2020)</ref>. In terms of future directions, we think it is fruitful to explore both graph-agnostic methods (e.g., pseudo-label training) and more graph-specific methods (e.g., self-supervised learning of graph neural networks <ref type="bibr">(Xie et al., 2021b)</ref>).</p></div>
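<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scaffold group assignment described in the Data paragraph above can be sketched as a nearest-neighbor search under Jaccard similarity. This is a minimal illustration under stated assumptions, not the curation code: fingerprints are represented here as hypothetical sets of active bit indices rather than computed from molecules, the identifiers and group names are invented, and the brute-force search would need to be replaced by something more efficient for 5 million unlabeled molecules.</p>
<p>
```python
def jaccard(a, b):
    """Jaccard similarity between two sets of active fingerprint bits."""
    union = a.union(b)
    if not union:
        return 0.0
    return len(a.intersection(b)) / len(union)


def assign_scaffold_groups(unlabeled, labeled):
    """Assign each unlabeled molecule the group of its most similar labeled one.

    unlabeled: dict of molecule id -> set of active bits (hypothetical format).
    labeled: dict of molecule id -> (set of active bits, scaffold group).
    """
    assignment = {}
    for mol_id, bits in unlabeled.items():
        best_id = max(labeled, key=lambda lid: jaccard(bits, labeled[lid][0]))
        assignment[mol_id] = labeled[best_id][1]
    return assignment


# Toy fingerprints (bit indices within a 1024-dimensional fingerprint).
labeled = {
    "L1": ({1, 2, 3}, "scaffold_A"),
    "L2": ({7, 8, 9}, "scaffold_B"),
}
unlabeled = {
    "U1": {1, 2, 4},
    "U2": {8, 9, 10},
}
groups = assign_scaffold_groups(unlabeled, labeled)
print(groups)
```
</p>
<p>Because every unlabeled molecule inherits the group of an existing labeled molecule, this procedure cannot create new domains, which is consistent with the statement that no extra domains are added.</p></div>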
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.7 CIVILCOMMENTS-WILDS</head><p>The CIVILCOMMENTS-WILDS dataset <ref type="bibr">(Koh et al., 2021)</ref> was adapted from the CivilComments dataset <ref type="bibr">(Borkan et al., 2019)</ref>, which is a collection of text comments made on online articles. The data in CIVILCOMMENTS-WILDS underwent a significant labeling and annotation process: each example was labeled as toxic or non-toxic, and annotated for whether it mentioned certain demographic identities, by at least 10 crowdworkers. Such a substantial labeling and identity annotation process is expensive and time-consuming. On the other hand, unlabeled, unannotated text comments are readily available. For example, CIVILCOMMENTS-WILDS only contains a subset of all data available in the original CivilComments dataset <ref type="bibr">(Borkan et al., 2019)</ref>, most of which <ref type="bibr">Koh et al. (2021)</ref> excluded because these examples were not annotated for mentioning identities. In this work, we augment the CIVILCOMMENTS-WILDS dataset with these unlabeled, unannotated comments.</p><p>Problem setting. The task is to classify whether a text comment is toxic or not. The input x is a text comment (at least one sentence long) originally made on an online article, the label y is a binary indicator of whether the comment is rated toxic or not, and the domain d is an 8-dimensional binary vector, where each dimension corresponds to whether the comment mentions each of 8 demographic identities: male, female, LGBTQ, Christian, Muslim, other religions, Black, or White, respectively. Each comment also includes metadata on which article the comment was made on, although we do not use this metadata for training or evaluation.</p><p>We consider the subpopulation shift setting, where the model must perform well across all subpopulations, which are defined based on d. <ref type="bibr">Koh et al. (2021)</ref> define 16 subpopulations (groups) based on d. 
Models are then evaluated by their worst-group accuracy, i.e., the lowest accuracy over the 16 groups considered. In our work, we use the same evaluation setup.</p><p>Data. All of the labeled and unlabeled data are taken from the CivilComments dataset <ref type="bibr">(Borkan et al., 2019)</ref>. After preprocessing, Koh et al. ( <ref type="formula">2021</ref>) created the CIVILCOMMENTS-WILDS dataset using the 448,000 examples that were fully annotated for both toxicity y and the mention of demographic identities d. In this work, we augment CIVILCOMMENTS-WILDS with an additional 1,551,515 examples collected by <ref type="bibr">Borkan et al. (2019)</ref>. We use these examples as unlabeled data.</p><p>We follow the same preprocessing steps as was done with the labeled data in <ref type="bibr">Koh et al. (2021)</ref>  <ref type="table">11</ref>). In practice, these comments may actually mention any number of identities.</p><p>A substantial amount (1,427,848 or 92%) of the unlabeled comments are drawn from the same articles as the labeled comments. In particular, 140,082 unlabeled comments are from the same articles as labeled comments in the test split.</p><p>CIVILCOMMENTS-WILDS exhibits class imbalance. We account for this when benchmarking methods by sampling class-balanced batches of labeled data when applicable (see Appendix B).</p><p>Broader context. In this work, we focused on supplementing CIVILCOMMENTS-WILDS with extra unannotated data from the original CivilComments dataset <ref type="bibr">(Borkan et al., 2019)</ref> papers have pointed out biases in large language models <ref type="bibr">(Abid et al., 2021;</ref><ref type="bibr">Nadeem et al., 2020;</ref><ref type="bibr">Gehman et al., 2020)</ref>. 
However, recent work has also suggested that pre-trained models can be trained to be more robust against some types of spurious correlations <ref type="bibr">(Hendrycks et al., 2020;</ref><ref type="bibr">Tu et al., 2020)</ref> and that additional domain-and task-specific pre-training <ref type="bibr">(Gururangan et al., 2020)</ref> can also improve performance. We hope our contributions to the CIVILCOMMENTS-WILDS dataset can encourage future study on whether unlabeled data can be leveraged to improve generalization across subpopulation shifts.</p></div>
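<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the worst-group accuracy metric described in this section concrete, the following minimal Python sketch may help. It assumes that the 16 groups are the (identity, label) pairs formed from the 8 identity dimensions of d and the 2 toxicity labels, so that groups may overlap; this grouping is an assumption about Koh et al. (2021) and is not spelled out in this appendix. The function name and the toy inputs are hypothetical, and this is not the evaluation code shipped with the package.</p><p><![CDATA[
```python
def worst_group_accuracy(y_true, y_pred, d):
    """Return (worst accuracy, per-group accuracies).

    y_true, y_pred: sequences of 0/1 labels.
    d: sequence of binary vectors; d[i][j] == 1 if comment i mentions identity j.
    Groups are assumed to be (identity index, true label) pairs and may overlap.
    """
    hits = {}  # (identity index, label) -> list of 0/1 correctness flags
    for yt, yp, mentions in zip(y_true, y_pred, d):
        for j, mentioned in enumerate(mentions):
            if mentioned:
                hits.setdefault((j, yt), []).append(int(yt == yp))
    # Empty groups never appear in `hits`, so they are skipped.
    accs = {group: sum(v) / len(v) for group, v in hits.items()}
    return min(accs.values()), accs
```
]]></p><p>A comment that mentions no identity contributes to no group in this sketch, and a comment that mentions several identities contributes to several groups.</p></div>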
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.8 AMAZON-WILDS</head><p>The AMAZON-WILDS dataset <ref type="bibr">(Koh et al., 2021)</ref> was adapted from the Amazon reviews dataset <ref type="bibr">(Ni et al., 2019)</ref>, which is a collection of product reviews. While Amazon reviews are always labeled by their star ratings in practice, unlabeled data is a common source of leverage for sentiment classification more generally, with prior work in domain adaptation <ref type="bibr">(Blitzer &amp; Pereira, 2007;</ref><ref type="bibr">Glorot et al., 2011)</ref> and semi-supervised learning <ref type="bibr">(Dasgupta &amp; Ng, 2009;</ref><ref type="bibr">Li et al., 2011)</ref>. In this work, we augment the AMAZON-WILDS dataset with unlabeled reviews, whose star ratings have been removed.</p><p>Problem setting. The task is sentiment classification, and we consider generalizing from a set of reviewers to new reviewers at test time. The input x corresponds to a review text, the label y is the star rating from 1 to 5, and the domain d identifies which user wrote the review. For each review, additional metadata (product ID, product category, review time, and summary) are also available. Because the goal is to train a model that performs well across a wide range of reviewers, models are evaluated by their tail performance: concretely, their accuracy on the user at the 10th percentile.</p><p>Data. All of the labeled and unlabeled data are taken from the Amazon reviews dataset <ref type="bibr">(Ni et al., 2019)</ref>. We provide unlabeled data from the same domains as the labeled AMAZON-WILDS dataset.</p><p>Additionally, we provide unlabeled data from extra reviewers not in the labeled AMAZON-WILDS dataset (extra domains). The domains are split as follows:</p><p>1. Source: 1,252 reviewers.</p><p>2. Validation (OOD): 1,334 reviewers.</p><p>3. Target (OOD): 1,334 reviewers.</p><p>4. Extra (OOD): 21,694 reviewers.</p><p>The reviewers in each split are distinct, and all reviewers have at least 75 reviews. The distributions of reviewers in each split are identical. AMAZON-WILDS also includes Validation (ID) and Target (ID) sets, which contain data from the source reviewers.</p><p>The AMAZON-WILDS dataset has a total of 539,502 labeled reviews across these splits, and we augment the dataset with a total of 3,462,668 unlabeled reviews. For each split of the unlabeled data, we include all available reviews written by the reviewers in that split. For the Extra (OOD) split, we include all reviewers with at least 75 reviews who are not in the Source, Validation (OOD), or Target (OOD) splits. To filter and process reviews, we followed the same data processing steps as for the labeled data in AMAZON-WILDS <ref type="bibr">(Koh et al., 2021)</ref>.</p><p>Broader context. We focused on providing unlabeled data from OOD domains, including both test and extra domains. Unlabeled data from the test reviewers can be used to develop and evaluate methods for domain adaptation <ref type="bibr">(Ren et al., 2018;</ref><ref type="bibr">Zhang et al., 2019a;</ref><ref type="bibr">Koohbanani et al., 2021)</ref>, which has been well studied in the context of sentiment classification <ref type="bibr">(Blitzer &amp; Pereira, 2007;</ref><ref type="bibr">Glorot et al., 2011)</ref>. While there is limited prior work on leveraging unlabeled data from extra domains, some domain adaptation techniques can be readily adapted to leverage such unlabeled data <ref type="bibr">(Ganin et al., 2016)</ref>. 
Finally, we focus on unlabeled data specific to the task in this work, varying only the domains, and this contrasts with the type of unlabeled data used for pre-training in NLP, which is much larger and more diverse <ref type="bibr">(Devlin et al., 2019;</ref><ref type="bibr">Brown et al., 2020)</ref>.</p></div>
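<div xmlns="http://www.tei-c.org/ns/1.0"><p>The tail-performance metric used for AMAZON-WILDS (accuracy on the reviewer at the 10th percentile) can be sketched as follows. This is a minimal illustration, not the benchmark's evaluation code: the function names are hypothetical, and the linear-interpolation percentile convention is an assumption, since the appendix does not state which convention is used.</p><p><![CDATA[
```python
from collections import defaultdict


def percentile(values, q):
    """Percentile with linear interpolation between order statistics (assumed convention)."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def tail_accuracy(y_true, y_pred, user_ids, q=10):
    """Compute each reviewer's accuracy, then take the q-th percentile across reviewers."""
    correct = defaultdict(list)
    for yt, yp, user in zip(y_true, y_pred, user_ids):
        correct[user].append(int(yt == yp))
    per_user = {user: sum(v) / len(v) for user, v in correct.items()}
    return percentile(per_user.values(), q), per_user
```
]]></p><p>Because the percentile is taken over reviewers rather than over reviews, a reviewer with many reviews counts exactly as much as one with few.</p></div>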
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B ALGORITHM DETAILS B.1 EMPIRICAL RISK MINIMIZATION (ERM)</head><p>As a baseline, we consider Empirical Risk Minimization (ERM). ERM ignores unlabeled data and minimizes the average labeled loss. We additionally evaluate ERM with strong data augmentation on applicable datasets, i.e., on IWILDCAM2020-WILDS, CAMELYON17-WILDS, POVERTYMAP-WILDS, and FMOW-WILDS (see Appendix C). ERM with strong data augmentation learns a model h that minimizes the labeled training loss</p><formula notation="TeX">L_L(h) = \frac{1}{n_L} \sum_{i=1}^{n_L} \ell\big(h(A_{\text{strong}}(x_L^{(i)})),\, y_L^{(i)}\big),</formula><p>where A strong is a stochastic data augmentation operation, and &#8467; measures the prediction loss. We use L L throughout this appendix to refer to the above labeled loss with strong augmentations (on applicable datasets).</p><p>For all datasets except CIVILCOMMENTS-WILDS, we sample labeled batches uniformly at random.</p><p>For CIVILCOMMENTS-WILDS, we account for class imbalance by explicitly sampling class-balanced batches of labeled data when computing L L (h).</p></div>
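<div xmlns="http://www.tei-c.org/ns/1.0"><p>One simple way to draw class-balanced labeled batches, as used for CIVILCOMMENTS-WILDS, is to sample each example with probability inversely proportional to the frequency of its class. The sketch below illustrates that idea only; the exact sampling scheme in the released package may differ, and the function name and the choice to sample with replacement are assumptions.</p><p><![CDATA[
```python
import random
from collections import Counter


def class_balanced_batch(labels, batch_size, rng):
    """Sample example indices so that every class is equally likely in expectation.

    labels: sequence of class labels, one per labeled example.
    rng: a random.Random instance (passed in for reproducibility).
    Sampling is with replacement, so minority-class examples can repeat.
    """
    counts = Counter(labels)
    # Weight 1 / (class count): each class receives the same total weight.
    weights = [1.0 / counts[y] for y in labels]
    return rng.choices(range(len(labels)), weights=weights, k=batch_size)
```
]]></p><p>With 95 non-toxic and 5 toxic examples, for instance, roughly half of each sampled batch would be toxic under this scheme.</p></div>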
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 DOMAIN-INVARIANT METHODS</head><p>Domain-invariant methods seek to learn feature representations that are invariant across domains. These methods are motivated by earlier theoretical results showing that the gap between in- and out-of-distribution performance depends on some measure of divergence between the source and target distributions <ref type="bibr">(Ben-David et al., 2010)</ref>. To minimize this divergence, the methods described below penalize divergence between feature representations across domains, i.e., they encourage the model to produce feature representations that are similar across domains.</p><p>Consider a model h = g &#8728; f , where the featurizer f : X &#8594; F maps the inputs to some feature space, and the head g : F &#8594; Y maps feature representations to prediction targets. Domain-invariant methods seek to constrain f to output similar representations for labeled and unlabeled data.</p><p>In this work, we adapt all of our domain-invariant methods to use data augmentations on applicable datasets (see Appendix C), and thus the output of f on the labeled batch is</p><formula notation="TeX">B_L = \big\{ f(A_{\text{strong}}(x_L^{(i)})) \big\}_{i=1}^{n_L}.</formula><p>Similarly, the output of f on an unlabeled batch is</p><formula notation="TeX">B_U = \big\{ f(A_{\text{strong}}(x_U^{(i)})) \big\}_{i=1}^{n_U}.</formula><p>Domain-invariant methods seek to minimize some divergence &#958; : F &#215; F &#8594; R between the labeled data B L and the unlabeled data B U , where the choice of divergence depends on the specific method. The divergence is expressed as a penalty term:</p><formula notation="TeX">L_{\text{penalty}} = \xi(B_L, B_U).</formula><p>The final objective, L L (h) + &#955; L penalty , combines the labeled loss and the penalty loss. The balance between the two losses is controlled by the hyperparameter &#955;, the penalty weight.</p><p>In our experiments, we study two classical domain-invariant methods, Correlation Alignment (CORAL) <ref type="bibr">(Sun et al., 2016;</ref><ref type="bibr">Sun &amp; Saenko, 2016)</ref> and Domain-Adversarial Neural Networks (DANN) <ref type="bibr">(Ganin et al., 2016)</ref>. 
These methods are well-known and established, but their performance can be lower than that of newer domain-invariant methods that employ different penalties to encourage the source and target representations to be similar <ref type="bibr">(Jiang et al., 2020;</ref><ref type="bibr">Zhang et al., 2021)</ref>.</p><p>Examples of these newer methods are Joint Adaptation Networks (JAN) <ref type="bibr">(Long et al., 2017)</ref>, Conditional Domain Adversarial Networks (CDAN) <ref type="bibr">(Long et al., 2018)</ref>, Collaborative and Adversarial Networks (CAN) <ref type="bibr">(Zhang et al., 2018)</ref>, and models with Adaptive Feature Norm (AFN) <ref type="bibr">(Xu et al., 2019)</ref>, as well as methods that minimize the Maximum Classifier Discrepancy (MCD) <ref type="bibr">(Saito et al., 2018)</ref> and the Margin Disparity Discrepancy (MDD) <ref type="bibr">(Zhang et al., 2019b)</ref>.</p><p>All of the above methods were developed for the single-source single-target setting, where the source domain is treated as a single distribution, and likewise for the target domain. As each WILDS 2.0 dataset comprises multiple source domains and multiple target domains, it is likely that methods that can leverage this additional structure could perform better. Examples of these methods include Multi-source Domain Adversarial Networks (MDAN) <ref type="bibr">(Zhao et al., 2018)</ref> and Moment Matching for Multi-Source Domain Adaptation (M3SDA) <ref type="bibr">(Peng et al., 2019)</ref>. The DomainBed <ref type="bibr">(Gulrajani &amp; Lopez-Paz, 2020)</ref> and <ref type="bibr">WILDS (Koh et al., 2021)</ref> benchmarks also extended single-source algorithms like CORAL and DANN to take advantage of multiple source domains in the domain generalization setting, and similar extensions in the domain adaptation setting could be promising.</p><p>Correlation Alignment (CORAL). Algorithm 1 describes CORAL, proposed by <ref type="bibr">Sun et al. 
(2016)</ref>; <ref type="bibr">Sun &amp; Saenko (2016)</ref>. CORAL measures the divergence &#958; between batches of feature representations in terms of the deviation between their first- and second-order statistics. Given a labeled batch and an unlabeled batch of features B L &#8712; R^(n_L &#215; m) , B U &#8712; R^(n_U &#215; m) , define the feature means as</p><p>and covariance matrices as</p><p>We then compute the CORAL penalty as</p><p>We adapted our implementation from DomainBed <ref type="bibr">(Gulrajani &amp; Lopez-Paz, 2020)</ref>, as done in WILDS 1.0. We note that these implementations compute the penalty as a sum of deviations in means and covariances, whereas <ref type="bibr">Sun et al. (2016)</ref>; <ref type="bibr">Sun &amp; Saenko (2016)</ref> penalize deviations in covariances only. (<ref type="bibr">Sun et al. (2016)</ref> consider features that are normalized to zero mean.) On applicable datasets, we also strongly augmented all labeled and unlabeled examples using A strong , whereas <ref type="bibr">Sun et al. (2016)</ref>; <ref type="bibr">Sun &amp; Saenko (2016)</ref> do not explicitly require data augmentations. We add augmentations to allow for a fairer comparison to other methods that use augmentations.</p><p>Note that CORAL has also been adapted by <ref type="bibr">Gulrajani &amp; Lopez-Paz (2020)</ref>; Koh et al. (<ref type="formula">2021</ref>) for domain generalization. In particular, where the original CORAL paper defines L penalty as the divergence between just two kinds of batches (labeled and unlabeled), these works define L penalty as the divergence between many kinds of batches, where batches are grouped based on the domain annotation d (i) . For simplicity, we followed the original CORAL formulation and differentiate only between labeled and unlabeled batches. We leave leveraging the domain annotations d to future work.</p><p>APPLICABLE DATASETS. We run CORAL on all datasets except GLOBALWHEAT-WILDS and CIVILCOMMENTS-WILDS. 
We do not evaluate domain-invariant methods on CIVILCOMMENTS-WILDS, since the labeled and unlabeled data are drawn from the same distribution. We do not evaluate CORAL on GLOBALWHEAT-WILDS because CORAL does not port straightforwardly to detection settings.</p><p>DANN. Algorithm 2 describes DANN, proposed by <ref type="bibr">Ganin et al. (2016)</ref>. DANN measures the divergence &#958; between batches of feature representations using the performance of a discriminator network h d that aims to discriminate between domains. Given a batch of features (either B L or B U ),</p><p>Finally, Pseudo-Label increases the weight &#955;(t) on the unlabeled loss over time, initially placing 0 weight on L U (h) and then linearly increasing the unlabeled loss weight until it reaches the full value of the hyperparameter &#955; at some threshold step. We fix the step at which &#955;(t) reaches its maximum value (&#955;) to be 40% of the total number of training steps, matching the implementation of <ref type="bibr">Sohn et al. (2020)</ref>. This scheduling allows the algorithm to initially prioritize the labeled loss, as generated pseudolabels are mostly incorrect while the model has low accuracy. Formally, at step t and given total number of steps T ,</p><formula notation="TeX">\lambda(t) = \lambda \cdot \min\!\left(1, \frac{t}{0.4\,T}\right).</formula><p>We add augmentations to Pseudo-Label in order to allow for a fairer comparison to other methods that use augmentations. On applicable datasets, we strongly augment all labeled and unlabeled examples using A strong , whereas <ref type="bibr">Lee (2013)</ref> does not use any data augmentations, i.e., all instances of A strong are replaced with the identity function.</p><p>Algorithm 3: Pseudo-Label</p><p>We evaluate Pseudo-Label on all datasets except POVERTYMAP-WILDS, as POVERTYMAP-WILDS is a regression dataset, and hard pseudolabeling does not port straightforwardly to regression tasks.</p><p>FixMatch. Algorithm 4 describes FixMatch, proposed by <ref type="bibr">Sohn et al. (2020)</ref>. 
Like Pseudo-Label, this algorithm dynamically generates pseudolabels and updates each batch. FixMatch additionally employs consistency regularization on the unlabeled data. While pseudolabels are generated on a weakly augmented view of the unlabeled examples, the loss is computed with respect to predictions on a strongly augmented view. This encourages models to make predictions on a strongly augmented example consistent with its prediction on the same example when weakly augmented. For details about the strong versus weak augmentations we use, see Appendix C.</p><p>Formally, the pseudolabel-generating function &#968; is given by</p><p>Like Pseudo-Label, FixMatch uses confidence thresholding, and unlabeled examples on which the model has low confidence have zero loss. Following <ref type="bibr">Sohn et al. (2020)</ref>, we keep the balance between labeled and unlabeled losses constant at &#955;(t) = &#955;. FixMatch's original authors justify keeping &#955;(t) at a fixed magnitude (as opposed to slowly increasing &#955;(t) as in Pseudo-Label) by noting that most predictions made by FixMatch are initially low confidence, so for sufficiently high confidence threshold &#964; , most unlabeled examples have loss zero, keeping the magnitude of L U (h) initially small. This magnitude grows over time, providing a natural curriculum <ref type="bibr">(Sohn et al., 2020)</ref>.</p><p>We endeavored to match our implementation of FixMatch to the formulation of <ref type="bibr">Sohn et al. (2020)</ref> </p><p>L , d</p><p>We evaluate FixMatch on image classification datasets IWILDCAM2020-WILDS, CAMELYON17-WILDS, POVERTYMAP-WILDS, and FMOW-WILDS.</p><p>We do not evaluate FixMatch on other datasets because FixMatch relies on enforcing consistency across data augmentations, which we only define for image datasets (see Appendix C).</p><p>Noisy Student. Algorithm 5 describes Noisy Student, proposed by <ref type="bibr">Xie et al. (2020)</ref>. 
Unlike Pseudo-Label and FixMatch, which update the model and regenerate pseudolabels for each batch, Noisy Student generates pseudolabels, fixes them, and then trains the model until convergence before generating new pseudolabels. First, an initial teacher model is trained on the labeled data; next, the teacher model pseudolabels the unlabeled data, and a student model is trained on the labeled and pseudolabeled data; finally, the student model becomes the new teacher, and the cycle repeats (see Algorithm 5). Each (teacher, student) pair is termed an iteration; we study the results of two iterations.</p><p>We train Noisy Student using hard pseudolabels, which the teacher generates over weakly augmented inputs:</p><p>While the teacher generates pseudolabels on weakly augmented data, the student must make both labeled and unlabeled predictions on noisy (i.e., strongly augmented) data. Following Xie et al. (2020), we add a dropout layer (p = 0.5) before the student's last layer, randomly corrupting final feature maps. Students are thus trained to be consistent across both data-based and model-based noise. We denote the model with inserted dropout as Dropout &#8728; f . Xie et al. (<ref type="formula">2020</ref>) add even more model-based noise using stochastic depth; for simplicity, we do not use stochastic depth in our implementation.</p><p>We follow the original paper and fix the balance between labeled and unlabeled losses as &#955;(t) = 1. Noisy Student does not use confidence thresholding.</p><p>Note that <ref type="bibr">Xie et al. (2020)</ref> use both dropout and strong data augmentations when training the initial teacher on labeled data. We reuse models from our ERM + Data Augmentation experiments as initial teacher models; thus we differ from <ref type="bibr">Xie et al. (2020)</ref> in that our initial teachers were trained with strong augmentations, but not dropout (see Algorithm 5).</p><p>APPLICABLE DATASETS. 
We evaluate Noisy Student on all datasets except text datasets CIVILCOMMENTS-WILDS and AMAZON-WILDS.</p><p>For GLOBALWHEAT-WILDS and OGB-MOLPCBA, we run Noisy Student without noise from data augmentations.</p></div>
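<div xmlns="http://www.tei-c.org/ns/1.0"><p>The unlabeled-loss weight schedules described in this appendix differ across the self-training methods: Pseudo-Label ramps &#955;(t) linearly from 0 to &#955; over the first 40% of training steps, whereas FixMatch keeps &#955;(t) = &#955; and Noisy Student uses &#955;(t) = 1. The Pseudo-Label ramp can be sketched in a few lines of Python; the function and parameter names below are hypothetical and are not taken from the released package.</p><p><![CDATA[
```python
def unlabeled_loss_weight(step, total_steps, max_weight, ramp_fraction=0.4):
    """Linear warm-up of the unlabeled loss weight, as described for Pseudo-Label.

    The weight starts at 0, grows linearly, and reaches max_weight once
    step == ramp_fraction * total_steps; afterwards it stays constant.
    """
    ramp_steps = ramp_fraction * total_steps
    return max_weight * min(1.0, step / ramp_steps)


# Example: with 1000 training steps and a maximum weight of 2.0, the weight
# is 0.0 at step 0, 1.0 at step 200, and 2.0 from step 400 onwards.
schedule = [unlabeled_loss_weight(t, 1000, 2.0) for t in (0, 200, 400, 1000)]
```
]]></p></div>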
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D EXPERIMENTAL DETAILS D.1 IN-DISTRIBUTION VS. OUT-OF-DISTRIBUTION PERFORMANCE</head><p>We report both in-distribution and out-of-distribution performance metrics on all datasets, with the exception of OGB-MOLPCBA, which does not have a separate in-distribution test set. Using the terminology in WILDS <ref type="bibr">(Koh et al., 2021)</ref>, we consider the train-to-train in-distribution comparison on IWILDCAM2020-WILDS, CAMELYON17-WILDS, FMOW-WILDS, and POVERTYMAP-WILDS, and the average comparison on CIVILCOMMENTS-WILDS and AMAZON-WILDS.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 MODEL ARCHITECTURES</head><p>For all experiments, we use the same models for each dataset as in WILDS 1.0:</p><p>&#8226; IWILDCAM2020-WILDS: ResNet-50 <ref type="bibr">(He et al., 2016)</ref>.</p><p>&#8226; CAMELYON17-WILDS: DenseNet-121 <ref type="bibr">(Huang et al., 2017)</ref>.</p><p>&#8226; FMOW-WILDS: DenseNet-121 <ref type="bibr">(Huang et al., 2017)</ref>.</p><p>&#8226; POVERTYMAP-WILDS: Multi-spectral ResNet-18 <ref type="bibr">(Yeh et al., 2020)</ref>.</p><p>&#8226; GLOBALWHEAT-WILDS: Faster-RCNN <ref type="bibr">(Ren et al., 2015)</ref>.</p><p>&#8226; OGB-MOLPCBA: Graph Isomorphism Network <ref type="bibr">(Xu et al., 2018)</ref>.</p><p>&#8226; CIVILCOMMENTS-WILDS: DistilBERT <ref type="bibr">(Sanh et al., 2019)</ref>.</p><p>&#8226; AMAZON-WILDS: DistilBERT <ref type="bibr">(Sanh et al., 2019)</ref>.</p><p>The models for IWILDCAM2020-WILDS, FMOW-WILDS, and GLOBALWHEAT-WILDS were initialized with weights pre-trained on ImageNet. Note that models for CAMELYON17-WILDS were not initialized with ImageNet weights. The DistilBERT models were also initialized with pre-trained weights from the Transformers library.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 BATCH SIZES AND BATCH NORMALIZATION</head><p>For each dataset, we set the total batch size (where a batch contains both labeled and unlabeled data) to the maximum that can fit on 12GB of GPU memory (Table <ref type="table">13</ref>). For all the methods that leverage unlabeled data, except the pre-training algorithms, we run with 4 steps of gradient accumulation, resulting in a 4&#215; larger effective batch size. For SwAV pre-training, we run with 4 GPUs in parallel, which achieves a similar effect. For masked LM pre-training, we run with the default setting of 256 steps of gradient accumulation. These larger batch sizes deviate from the defaults used in the WILDS paper <ref type="bibr">(Koh et al., 2021)</ref>. We use these larger batch sizes because methods that leverage unlabeled data tend to use larger batch sizes <ref type="bibr">(Sohn et al., 2020;</ref><ref type="bibr">Xie et al., 2020;</ref><ref type="bibr">Caron et al., 2020)</ref> Table <ref type="table">13</ref>: The batch sizes of each dataset from the original WILDS 1.0 paper and the batch sizes used in WILDS 2.0, which correspond to the maximum that can fit into 12GB of GPU memory.</p><p>For models that use batch normalization, the composition of each batch affects the way in which batch normalization is applied. 
For CORAL, DANN, and Pseudo-Label, we concatenate the labeled and unlabeled data together in each batch, so the labeled and unlabeled data are jointly normalized.</p><p>Dataset \ # epochs ERM 3:1 ratio 7:1 ratio 15:1 ratio Table <ref type="table">14</ref>: The number of epochs (complete passes over the labeled data) used for each dataset, specified for the ERM baseline as well as different ratios of unlabeled to labeled data within a batch.</p><p>For FixMatch, we jointly normalize the labeled data and the strongly augmented unlabeled data, but we separately normalize the weakly augmented unlabeled data in a separate forward pass; we did two forward passes to keep the overall batch sizes consistent with the other algorithms, as in Table <ref type="table">13</ref>, while still fitting in GPU memory. For Noisy Student, MLM pre-training, and SwAV pretraining, the unlabeled data is processed separately from the labeled data, so each batch of labeled or unlabeled data is separately normalized.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.4 HYPERPARAMETER TUNING</head><p>We tune each algorithm separately for each dataset by randomly sampling 10 different hyperparameter configurations within the ranges defined below. We early stop and select the best hyperparameters based on the OOD validation performance, which is computed on the labeled Validation (OOD) data for each dataset; we do not use the labeled Validation (ID) data in our experiments. We then run replicates using the best hyperparameters. For computational reasons, we do not tune hyperparameters for the pre-training algorithms, though we tune the fine-tuning of their resulting pre-trained models as usual.</p><p>Learning rates. For all datasets except OGB-MOLPCBA, we multiply the learning rates used in WILDS 1.0 by the ratio of the effective batch size to the original batch size used in WILDS 1.0. We center the learning rate search around this modified learning rate r, and sample learning rates of the form r &#183; 10^u with u drawn from U(-1, 1), where U is the uniform distribution. For OGB-MOLPCBA, we pick r by multiplying the original learning rate by a factor of 10 instead of 4096/32 = 128 (for ERM, which does not have gradient accumulation), because we found that the latter led to unstable optimization.</p><p>L2-regularization. Across all datasets and methods, we used the same L2-regularization strengths as in WILDS 1.0.</p><p>Ratio of unlabeled to labeled data in a batch. For all the domain-invariant and self-training methods, we search over the ratio of unlabeled to labeled data in a batch, using the values {3 : 1, 7 : 1, 15 : 1}.</p></div>
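<div xmlns="http://www.tei-c.org/ns/1.0"><p>The learning-rate search described above (scale the WILDS 1.0 learning rate by the batch-size ratio, then sample log-uniformly within one order of magnitude on either side) can be sketched as follows. The function names and the example values are hypothetical and serve only to illustrate the procedure; they are not the tuning scripts used for the experiments.</p><p><![CDATA[
```python
import random


def scaled_learning_rate(base_lr, original_batch_size, effective_batch_size):
    """Scale the base learning rate by the ratio of effective to original batch size."""
    return base_lr * effective_batch_size / original_batch_size


def sample_learning_rates(center, n_samples, rng):
    """Draw n_samples values of center * 10**u with u ~ Uniform(-1, 1)."""
    return [center * 10.0 ** rng.uniform(-1.0, 1.0) for _ in range(n_samples)]


# Illustrative values only: a base rate of 1e-3 at batch size 32,
# rescaled for an effective batch size of 128, then 10 random candidates.
r = scaled_learning_rate(1e-3, 32, 128)
candidates = sample_learning_rates(r, 10, random.Random(0))
```
]]></p><p>Sampling the exponent uniformly, rather than the learning rate itself, spreads the candidates evenly on a logarithmic scale between r/10 and 10r.</p></div>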
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Number of epochs.</head><p>We defined an epoch as a complete pass over the labeled data. This means that the number of batches / gradient steps taken per epoch varies with the ratio of unlabeled to labeled data in a batch, as a higher ratio means that each batch contains fewer labeled examples.</p><p>We adjusted the number of epochs accordingly so that the total amount of compute was similar regardless of the ratio of unlabeled to labeled data in a batch. We allocated roughly twice as much compute (i.e., processing twice as many batches) to methods that used unlabeled data, compared to the purely-supervised ERM baseline. Overall, we set the number of epochs based on the WILDS 1.0 defaults, with some upwards adjustments (due to the different batch sizes and the use of unlabeled data) if we found that the best hyperparameter configuration had not converged on the validation set.  For DANN, Pseudo-Label, and FixMatch, we compared our results against the results reported in <ref type="bibr">Zhang et al. (2021)</ref>. Performance was similar for DANN (ours, 39.4%; theirs, 40.0%). For Fix-Match, our implementation performs better (ours, 50.2%; theirs, 45.3%); this is partially due to our use of strong instead of weak augmentation for the labeled data, which increases performance by 0.9%. For Pseudo-Label, our implementation performs worse (ours, 36.1%; theirs, 40.6%), and we believe it is due to variation in hyperparameter tuning.</p><p>For Noisy Student, <ref type="bibr">Berthelot et al. (2021)</ref> reported significantly lower numbers (ours, 39.7%; theirs, 32.6%). However, this is expected as they trained their models from scratch, whereas we used ImageNet-pretrained models.</p><p>We were unable to find comparable results in prior work for CORAL and SwAV pretraining on the real &#8594; sketch split. 
Prior work has shown that these methods can improve performance on other unsupervised adaptation datasets <ref type="bibr">(Sun &amp; Saenko, 2016;</ref><ref type="bibr">Shen et al., 2021)</ref>. On our DomainNet experiments, we found that SwAV pre-training did improve performance over ERM, though CORAL did not (Table <ref type="table">15</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F FULLY-LABELED ERM EXPERIMENTAL DETAILS</head><p>The self-training methods we evaluate in Section 5 generate a pseudolabel &#7929;U for each unlabeled example x U and then train on (x U , &#7929;U ) as if the pseudolabels were true labels. However, these pseudolabels may not be accurate. In this section, we describe how we ran fully-labeled ERM experiments using ground truth labels on the "unlabeled" data to establish informal upper bounds on how well we might expect a standard self-training approach to perform with perfect pseudolabel accuracy.</p><p>For four of our datasets (AMAZON-WILDS, CIVILCOMMENTS-WILDS, IWILDCAM2020-WILDS, and FMOW-WILDS), we curated the "unlabeled" data by taking labeled examples and discarding the ground truth labels. For example, all 268,761 of the unlabeled target reviews in AMAZON-WILDS actually have associated star ratings; these are available in our data loaders, but in our main experiments we treat these reviews as unlabeled by not loading the star ratings. We evaluated models trained via empirical risk minimization (ERM) on the combination of the standard labeled training set and the unlabeled data with these hidden labels revealed. For example, in AMAZON-WILDS, we pool together the labeled source examples as well as the unlabeled target examples with ground truth labels, and evaluate ERM models trained on all of that data. As with all of the other experiments in this paper, we evaluate test performance for all datasets on the labeled target splits, so at no point are we training on our actual test examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.1 HYPERPARAMETERS</head><p>Pooling labeled and unlabeled data. For all datasets, we pooled labeled source examples with examples from the same "unlabeled" split as in our main experiments (Table <ref type="table">2</ref>). We computed gradients for labeled minibatches and unlabeled minibatches separately, which means that for models using batch normalization, the labeled and unlabeled data were normalized separately. However, we fixed the labeled to "unlabeled" batch size ratios to match the overall labeled to unlabeled dataset size ratio, so other than the batch normalization effects, the training procedure can be viewed as running ERM on the pooled labeled and "unlabeled" data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Number of epochs.</head><p>With the exception of IWILDCAM2020-WILDS, detailed below, we followed the procedure in Appendix D.4 to adjust the number of epochs based on the labeled to unlabeled batch size ratios. This resulted in a similar amount of computation allocated to these fully-labeled ERM experiments as the other experiments in Table <ref type="table">2</ref>.</p><p>Other details. Other experimental details were kept similar to the other experiments in the paper. Specifically, we tuned each experiment by randomly sampling 10 different hyperparameters within the ranges defined in Appendix D.4; the only hyperparameter we tuned in these experiments was the learning rate. We early stopped and selected the best hyperparameters based on the OOD validation performance, and then ran replicates using the best hyperparameters. We also used data augmentation for IWILDCAM2020-WILDS and FMOW-WILDS but not for AMAZON-WILDS and CIVILCOMMENTS-WILDS.  <ref type="bibr">(2,927,841 examples)</ref>. However, this did not improve performance. Using the unlabeled target data, we obtained an average accuracy of 73.6 (&#177; 0.1) and a 10th percentile accuracy of 56.4 (&#177; 0.8), whereas using the unlabeled extra data, we obtained an average accuracy of 73.1 (&#177; 0.1) and a 10th percentile accuracy of 54.7 (&#177; 0.0).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>CIVILCOMMENTS-WILDS.</head><p>We used the unlabeled extra split <ref type="bibr">(1,551,515 examples)</ref>. As in our other experiments on CIVILCOMMENTS-WILDS, we accounted for label imbalance by sampling class-balanced labeled and "unlabeled" batches during training. examples. We found that we required twice as many epochs compared to the other unlabeled methods for the fully-labeled ERM training to converge, so we doubled the amount of compute allocated to the fully-labeled IWILDCAM2020-WILDS experiments.</p><p>FMOW-WILDS. We used the unlabeled target split <ref type="bibr">(173,208 examples)</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We omitted PY150-WILDS, as code completion data is always labeled by nature of the task, and RXRX1-WILDS, as unlabeled data for that genetic perturbation task is not typically available.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>The data augmentation involves color jitter, which simulates the difference in staining protocols between the source and target distributions in CAMELYON17-WILDS<ref type="bibr">(Koh et al., 2021;</ref><ref type="bibr">Robey et al., 2021)</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>The WCS Camera Traps Dataset can be found at http://lila.science/datasets/ wcscameratraps</p></note>
		</body>
		</text>
</TEI>
