<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>FairBatch: Batch Selection for Model Fairness</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10311451</idno>
					<idno type="doi"></idno>
					<title level='j'>International Conference on Learning Representations</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yuji Roh</author><author>Kangwook Lee</author><author>Steven Euijong Whang</author><author>Changho Suh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Training a fair machine learning model is essential to prevent demographic disparity. Existing techniques for improving model fairness require broad changes in either data preprocessing or model training, rendering themselves difficult-to-adopt for potentially already complex machine learning systems. We address this problem via the lens of bilevel optimization. While keeping the standard training algorithm as an inner optimizer, we incorporate an outer optimizer so as to equip the inner problem with an additional functionality: Adaptively selecting minibatch sizes for the purpose of improving model fairness. Our batch selection algorithm, which we call FairBatch, implements this optimization and supports prominent fairness measures: equal opportunity, equalized odds, and demographic parity. FairBatch comes with a significant implementation benefit -- it does not require any modification to data preprocessing or model training. For instance, a single-line change of PyTorch code for replacing batch selection part of model training suffices to employ FairBatch. Our experiments conducted both on synthetic and benchmark real data demonstrate that FairBatch can provide such functionalities while achieving comparable (or even greater) performances against the state of the arts.  Furthermore, FairBatch can readily improve fairness of any pre-trained model simply via fine-tuning. It is also compatible with existing batch selection techniques intended for different purposes, such as faster convergence, thus gracefully achieving multiple purposes.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Model fairness is becoming essential in a wide variety of machine learning applications. Fairness issues often arise in sensitive applications like healthcare and finance where a trained model must not discriminate among different individuals based on age, gender, or race.</p><p>While many fairness techniques have recently been proposed, they require a range of changes in either data generation or algorithmic design. There are two popular fairness approaches: (i) pre-processing where training data is debiased <ref type="bibr">(Choi et al., 2020)</ref> or re-weighted <ref type="bibr">(Jiang and Nachum, 2020)</ref>, and (ii) in-processing in which an interested model is retrained via several fairness approaches such as fairness objectives <ref type="bibr">(Zafar et al., 2017a;</ref><ref type="bibr">b)</ref>, adversarial training <ref type="bibr">(Zhang et al., 2018)</ref>, or boosting <ref type="bibr">(Iosifidis and Ntoutsi, 2019)</ref>; see more related works discussed in depth in Sec. 5. However, these approaches may require nontrivial re-configurations in modern machine learning systems, which often consist of many complex components.</p><p>In an effort to enable easier-to-reconfigure implementation for fair machine learning, we address the problem via the lens of bilevel optimization where one problem is embedded within another. While keeping the standard training algorithm as the inner optimizer, we design an outer optimizer that equips the inner problem with an added functionality of improving fairness through batch selection.</p><p>Our main contribution is to develop a batch selection algorithm (called FairBatch) that implements this optimization via adjusting the batch sizes w.r.t. sensitive groups based on the fairness measure of an intermediate model, measured in the current epoch. For example, consider a task of predicting whether individual criminals re-offend in the future subject to satisfying equalized odds <ref type="bibr">(Hardt et al., 2016)</ref> where the model accuracies must be the same across sensitive groups. In case the model is less  accurate for a certain group, FairBatch increases the batch-size ratio of that group in the next batch -see Sec. 3 for our adjusting mechanism described in detail. Fig. <ref type="figure">1a</ref> shows FairBatch's behavior when running on the ProPublica COMPAS dataset <ref type="bibr">(Angwin et al., 2016)</ref>. For equalized odds, our framework (to be described in Sec. 2) introduces two reweighting parameters (&#955; 1 , &#955; 2 ) for the purpose of adjusting the batch-size ratios of two sensitive groups (in this experiment, men and women). After a few epochs, FairBatch indeed achieves equalized odds, i.e., the accuracy disparity between sensitive groups conditioned on the true label (denoted as "ED disparity") is minimized. FairBatch also supports other prominent group fairness measures: equal opportunity <ref type="bibr">(Hardt et al., 2016)</ref> and demographic parity <ref type="bibr">(Feldman et al., 2015)</ref>.</p><p>A key feature of FairBatch is in its great usability and simplicity. It only requires a slight modification in the batch selection part of model training as demonstrated in Fig. <ref type="figure">1b</ref> and does not require any other changes in data preprocessing or model training. Experiments conducted both on synthetic and benchmark real datasets (ProPublica COMPAS <ref type="bibr">(Angwin et al., 2016)</ref>, AdultCensus <ref type="bibr">(Kohavi, 1996)</ref>, and UTKFace <ref type="bibr">(Zhang et al., 2017)</ref>) show that FairBatch exhibits greater (at least comparable) performances relative to the state of the arts (both spanning pre-processing <ref type="bibr">(Kamiran and Calders, 2011;</ref><ref type="bibr">Jiang and Nachum, 2020)</ref> and in-processing <ref type="bibr">(Zafar et al., 2017a;</ref><ref type="bibr">b;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Iosifidis and Ntoutsi, 2019)</ref> techniques) w.r.t. all aspects in consideration: accuracy, fairness, and runtime. In addition, FairBatch can improve fairness of any pre-trained model via fine-tuning. For example, Sec. 4.2 shows how FairBatch reduces the ED disparities of ResNet18 <ref type="bibr">(He et al., 2016)</ref> and GoogLeNet <ref type="bibr">(Szegedy et al., 2015)</ref> pre-trained models. Finally, FairBatch can be gracefully merged with other batch selection techniques typically used for faster convergence, thereby improving fairness faster as well.</p><p>Notation Let w be the parameter of an interested classifier. Let x &#8712; X be an input feature to the classifier, and let &#375; &#8712; Y be the predicted class. Note that &#375; is a function of (x, w). We consider group fairness that intends to ensure fairness across distinct sensitive groups (e.g., men versus women). Let z &#8712; Z be a sensitive attribute (e.g., gender). Consider the 0/1 loss: (y, &#375;) = 1(y = &#375;), and let m be the total number of train samples. Let L y,z (w) be the empirical risk aggregated over samples subject to y = y and z = z: L y,z (w) := 1 my,z i:y i =y,zi=z (y i , &#375;i ) where m y,z := |{i : y i = y, z i = z}|. Similarly, we define L y, (w) := 1 my, i:y i =y (y i , &#375;i ) and L ,z (w) := 1 m ,z i:zi=z (y i , &#375;i ) where m y, := |{i : y i = y}| and m ,z := |{i : z i = z}|. The overall empirical risk is written as L(w) = 1 m i (y i , &#375;i ). We utilize &#8711; for gradient and &#8706; for subdifferential.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">BILEVEL OPTIMIZATION FOR FAIRNESS</head><p>In order to systematically design an adaptive batch selection algorithm, we formalize an implicit connection between adaptive batch selection and bilevel optimization. Bilevel optimization consists of an outer optimization problem and an inner optimization problem. The inner optimizer solves an  <ref type="bibr">(Bottou, 2010)</ref> as an inner optimizer and viewing the batch selection algorithm as an outer optimizer, the process of training a fair classifier can be seen as a process of solving a bilevel optimization problem.</p><p>Batch selection + minibatch SGD = bilevel optimization solver Consider a scenario where one is minimizing the overall empirical risk L(w) via minibatch SGD. The minibatch SGD algorithm picks b of the m indices uniformly at random, say j 1 , j 2 , . . . , j b , and updates its iterate with 1 b b i=1 &#8711; (y ji , &#375;ji ), called a batch gradient. Note that a batch gradient is an unbiased estimate of the true gradient &#8711;L(w).</p><p>Since the empirical risk minimization (ERM) formulation does not take a fairness criterion into account, its minimizer usually does not satisfy the desired fairness criterion. To address this limitation of ERM, we adjust the way minibatches are drawn so that the desired fairness guarantee is satisfied. For instance, as we described in the introduction, we can draw minibatches with a larger number of train samples from a certain sensitive group so as to achieve a higher accuracy w.r.t. the group.</p><p>Once the minibatch distribution deviates from the uniform distribution, the batch gradient estimate is not anymore an unbiased gradient estimate of the overall empirical risk. Instead, it is an unbiased estimate of a reweighted empirical risk. In other words, if we draw train example i with probability p i for all i such that p i = 1, the batch gradient is an unbiased estimate of L (w) = i p i (y i , &#375;i ).</p><p>This observation enables us the following bilevel optimization-based interpretation of how batch selection interacts with inner optimization algorithm. At initialization, minibatch SGD optimizes the (unweighted) empirical risk. Based on the outcome of the inner optimization, the outer optimizer refines p := (p 1 , p 2 , . . . , p m ), the sampling probability of each train example. The inner optimizer now takes minibatches drawn from a new distribution and reoptimizes the inner objective function. Due to the new minibatch distribution, the inner objective now becomes a reweighted empirical risk w.r.t. p. This procedure is repeated until convergence. See Algorithm 1 for pseudocode.</p><p>Therefore, a batch selection algorithm together with an inner optimization algorithm can be viewed as a pair of outer optimizer and inner optimizer for the following bilevel optimization problem:</p><p>where Cost(&#8226;) captures the goal of the optimization. Two questions arise. First, how can we design the cost function to capture a desired fairness criterion? Second, how can we design an update rule for the outer optimizer? Can we develop an algorithm with a provable guarantee? In the rest of this section, we show how one can design proper cost functions to capture various fairness criteria. In Sec. 3, we will develop an efficient update rule of FairBatch.</p><p>Equal opportunity For illustrative purpose, assume for now the binary setting (Y = Z = {0, 1}). A model satisfies equal opportunity <ref type="bibr">(Hardt et al., 2016)</ref> if we have equal positive prediction rates conditioned on y = 1, i.e., L 1,0 (w) = L 1,1 (w). Since the ERM formulation does not take the fairness criterion into account, these two quantities differ in general. To mitigate this, we adjust the sampling probability between L 1,0 (w) and L 1,1 (w). More specifically, we propose the following procedure to draw a sample. First, we randomly pick which subset of data to sample data from. We pick the set y = 1, z = 0 with probability &#955;, the set y = 1, z = 1 with probability m1, m -&#955;, and the set y = 0 with probability m0, m . We then pick a sample from the chosen set, uniformly at random. This leaves us with a single-dimensional outer optimization variable &#955;, which controls the sampling bias between data with y = 1, z = 0 and data with y = 1, z = 1. Also, we design the cost function as |L 1,0 (w &#955; ) -L 1,1 (w &#955; )| to capture the equal opportunity criterion. Thus, we have the following bilevel optimization problem:</p><p>Equalized odds Similarly, we can design a bilevel optimization problem to capture equalized odds <ref type="bibr">(Hardt et al., 2016)</ref>, which desires the prediction to be independent from the sensitive attribute conditional on the true label, i.e., L 0,0 (w) = L 0,1 (w) and L 1,0 (w) = L 1,1 (w). Again, the empirical risk minimizer does not satisfy these two conditions in general. To mitigate these disparities, we adjust (i) the sampling probability between L 0,0 (w) and L 0,1 (w) and (ii) the sampling probability between L 1,0 (w) and L 1,1 (w). To achieve this, we use the following procedure to draw a sample. First, we pick the set y = 0, z = 0 with probability &#955; 1 , the set y = 0, z = 1 with probability m0, m -&#955; 1 , the set y = 1, z = 0 with probability &#955; 2 , and the set y = 1, z = 1 with probability m1, m -&#955; 2 . We then pick one data point at random from the chosen set. This leaves us with a two-dimensional outer optimization variable &#955; = (&#955; 1 , &#955; 2 ). To capture the equalized odds criterion, we design the outer objective function as: max{|L 0,0 (w) -L 0,1 (w)|, |L 1,0 (w) -L 1,1 (w)|}. This gives us the following bilevel optimization problem:</p><p>Demographic parity Demographic parity <ref type="bibr">(Feldman et al., 2015)</ref> is satisfied if two sensitive groups have equal positive prediction rates. If m y,z 's are all equal, then L 0,0 (w) = L 1,0 (w) and L 0,1 (w) = L 1,1 (w) can serve as a sufficient condition for demographic parity; see Sec. A.1 for why and how to handle demographic parity when this condition does not hold. To satisfy this sufficient condition, we now adjust (i) the the sampling probability between L 0,0 (w) and L 1,0 (w) and (ii) the the sampling probability between L 0,1 (w) and L 1,1 (w). This gives us the following bilevel optimization problem:</p><p>Beyond binary labels/sensitive attributes While the previous examples assumed binary-valued labels and sensitive attributes, our framework is applicable to the cases where the alphabet sizes are beyond binary. As an example, consider the equal opportunity criterion when</p><p>To satisfy this condition, we adjust the sampling probability between L 1,j (w)'s by introducing nz 2 -dimensional outer optimization variable &#955;, and design the outer objective function as max j1,j2&#8712;Z |L 1,j1 (w)-L 1,j2 (w)|. In our implementation, however, we only use (n z -1)-dimensional disparity objectives as an approximation (i.e., max j1&#8712;{0,1,...,nz-2} |L 1,j1 (w) -L 1,j1+1 (w)|) for better efficiency. Suppose the level of disparity is when FairBatch compares all possible combination pairs of sensitive groups. Now suppose we only optimize on the sequential (n z -1) disparity objectives. Then we will fail to ensure that other objectives like |L 1,3 (w) -L 1,1 (w)| are within . In the worst case, the objective |L 1,nz-1 (w) -L 1,1 (w)| may be (n z -1) &#215; , as we only guarantee that each |L 1,j1 (w) -L 1,j1+1 (w)| &#8804; . If is small enough, the disparity of our approximation becomes reasonable as well. One can also handle other fairness criteria in a similar way.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">UPDATE RULE OF FAIRBATCH</head><p>We design efficient update rules of FairBatch for different numbers of disparities. Let us define d as the dimension of the outer optimization variable &#955;, which is the same as the total number of disparities. We first analyze the simplest case where d = 1. We show that a simple gradient descent algorithm can provably solve the outer optimization problem. The equal opportunity example in the previous section falls in this category. We then extend the algorithm developed for the one-dimensional case to the multi-dimensional (d &gt; 1) case. Equalized odds and demographic parity fall in this category.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">UPDATE RULE FOR d = 1</head><p>When d = 1, the general form of our bilevel optimization problem can be written as follows:</p><p>The following lemma shows that F (&#955;) is quasiconvex in &#955; under some mild conditions, and its signed gradient can be efficiently computed.</p><p>Remark 1. The quasiconvexity of F (&#955;) is valid when at least one of the conditions in Lemma 1 holds. For the second condition, if f 1 (&#8226;), g 1 (&#8226;), and h(&#8226;) are convex, this condition will hold unless all the three functions share their stationary points, which is very unlikely. While there is no theoretical guarantee for the non-convex settings, FairBatch still shows on par or better results than the other fairness approaches in general settings where the functions may not be convex (see Sec. 4).</p><p>The proof for Lemma 1 can be found in Sec. A.2. Note that quasiconvexity immediately implies a unique minimum <ref type="bibr">(Boyd et al., 2004)</ref>. Thus, we design the following signed gradient-based optimization algorithm:</p><p>This algorithm increases &#955; by &#945; if f 1 (w &#955; ) &#8804; g 1 (w &#955; ) and decreases &#955; by &#945; otherwise. Recall that this is consistent with our intuition: It increases the sampling probability of a disadvantageous group and decreases that of an advantageous group. The following proposition shows that the proposed algorithm converges to the optimal solution.</p><p>Proposition 1. Let &#955; * = arg min &#955; F (&#955;) and t &#8712; Z 0+ . Then,</p><p>Remark 2. F (&#955;) is not necessarily convex even when we assume the inner objective functions f 1 (&#8226;) and g 1 (&#8226;) are convex or even strongly convex. See Sec. A.3 for a counter example.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">UPDATE RULE FOR d &#8805; 1</head><p>We now develop an efficient update algorithm for the following general bilevel optimization:</p><p>Here</p><p>where c i 's are some positive constants. Denoting by F (&#955;) the outer objective function, let us first derive the gradient of it. Under some mild conditions (see Sec. A.4) on f i (&#8226;)'s, g i (&#8226;)'s, and h(&#8226;):</p><p>, and H &#955; is positive definite. See Sec. A.4 for the derivation. Since subdifferential is always a convex set, it follows that &#947; := (&#947; 1 , &#947; 2 , . . . , &#947; d ) &#8712; &#8706; &#955; F (&#955;). Computing the subgradient &#947; requires us to compute H &#955; , which involves the Hessian matrices of the inner objective function. To avoid this expensive computation, we approximate &#947; &#8776; (0, 0, . . . , &#947; i , . . . , 0). See Sec. A.5 for the rationale and intuition behind this approximation. Then, similar to the case of d = 1, we have sign(&#947;) = (0, 0, . . . , sign (g i * (w &#955; ) -f i * (w &#955; )) , 0, . . . , 0). This gives us the general update rule of FairBatch (see Sec. A.6 for pseudocode):</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head><p>We use logistic regression in all experiments except for Sec. 4.2 where we fine-tune ResNet18 <ref type="bibr">(He et al., 2016)</ref> and GoogLeNet <ref type="bibr">(Szegedy et al., 2015)</ref> in order to demonstrate FairBatch's ability to improve fairness of pre-trained models. We evaluate all models on separate test sets and repeat all experiments with 10 different random seeds. We use PyTorch, and our experiments are performed on a server with Intel i7-6850 CPUs and NVIDIA TITAN Xp GPUs. See Sec. B.1 for more details.</p><p>Measuring Fairness Here we first focus on the equal opportunity (EO) and demographic parity (DP) measures in Sec. Datasets We generate a synthetic dataset of 3,000 examples with two non-sensitive attributes (x 1 , x 2 ), a binary sensitive attribute z, and a binary label y, using a method similar to the one in <ref type="bibr">(Zafar et al., 2017a)</ref>. A tuple (x 1 , x 2 , y) is randomly generated based on the two Gaussian distributions:</p><p>For z, we generate biased data using an unfair scenario Pr(z = 1) = Pr((</p><p>We use the real benchmark datasets: ProPublica COMPAS <ref type="bibr">(Angwin et al., 2016)</ref> and AdultCensus (Kohavi, 1996) datasets with 5,278 and 43,131 examples, respectively. We use the same pre-processing as in IBM's AI Fairness 360 <ref type="bibr">(Bellamy et al., 2019)</ref> and use GENDER as the sensitive attribute. We also employ the UTKFace dataset <ref type="bibr">(Zhang et al., 2017)</ref> with 23,708 images to demonstrate the fine-tuning ability of FairBatch in Sec. 4.2.</p><p>Baselines We employ three types of baselines: (1) non-fair training with logistic regression (LR);</p><p>(2) fair training via pre-processing; and (3) fair training via in-processing.</p><p>For pre-processing methods, we first consider a simple approach that we call Cutting, which evens the data sizes of sensitive groups via saturating them to the smallest-group data size. One can think of a similar alternative approach: Boosting all of the smaller-group data sizes to the largest one, but we do not report herein due to similar performances that we found relative to Cutting. The other two are the state of the arts: reweighing (Kamiran and Calders, 2011) (RW) and Label Bias Correction (Jiang and Nachum, 2020) (LBC). RW intends to balance importance levels across sensitive groups via example weighting, but sticks with these weights throughout the entire model training, unlike FairBatch. LBC iteratively trains an entire model with example weighting towards an unbiased data distribution.</p><p>For in-processing methods, we compare with the following three: Fairness Constraints (Zafar et al., 2017a;b) (FC), Adversarial Debiasing <ref type="bibr">(Zhang et al., 2018)</ref> (AD), and AdaFair <ref type="bibr">(Iosifidis and Ntoutsi, 2019)</ref>. FC incorporates a regularization term in an effort to reduce the disparities among sensitive groups. AD is an adversarial learning approach that intends to maximize the independence between the predicted labels and sensitive attributes. In our experiments, a slight modification is made to AD for improving training stability: Not employing one regularization term used for restricting the training direction. AdaFair is an ensemble technique that equips the prominent AdaBoost <ref type="bibr">(Friedman et al., 2000)</ref> with a fairness aspect. Here the examples that lead to unfair and inaccurate performances are considered to be the difficult instances. In our experiments, natural generalization of AdaFair intended for ED is made to encompass EO and DP; see Sec. B.3 for the generalization. While AdaFair bears spiritual similarity to FairBatch in a sense that mistreated examples are weighted progressively, it comes with a significant distinction in update scale. It is basically a boosting technique; hence such updates are done in distinctive predictors through different rounds; see Sec. 5 for details.</p><p>FairBatch Settings To set &#945;, we start from a candidate set of values within the range [0.0001, 0.05] and use cross-validation on the training set to choose the value that results in the highest accuracy with low fairness violation. The default batch sizes are: 100 (synthetic); 200 (COMPAS), 1,000 (AdultCensus); and 32 (UTKFace).</p><p>Table <ref type="table">1</ref>: Performances on the synthetic, COMPAS, and AdultCensus test sets w.r.t. equal opportunity (EO). We compare FairBatch with three types of baselines: (1) non-fair method: LR; (2) fair training via pre-processing: Cutting, RW <ref type="bibr">(Kamiran and Calders, 2011)</ref>, and LBC <ref type="bibr">(Jiang and Nachum, 2020)</ref>; (3) fair training via in-processing: FC <ref type="bibr">(Zafar et al., 2017b)</ref>, AD <ref type="bibr">(Zhang et al., 2018)</ref>, and AdaFair <ref type="bibr">(Iosifidis and Ntoutsi, 2019)</ref>. Experiments are repeated 10 times.  <ref type="table">1</ref> compares FairBatch against the other approaches on the synthetic, COMPAS, and AdultCensus test sets w.r.t. accuracy, EO disparity, and complexity (reflected in the number of epochs). In Sec. B.4, we also present the convergence plot of EO disparity as a function of the number of epochs. LR in row 1 is logistic regression without any fairness technique. The pre-processing techniques in rows 2-4 reduce EO disparity yet while sacrificing the accuracy performance. The in-processing techniques in rows 5-7 further reduce EO disparity yet still sacrificing accuracy. FairBatch, presented in the last row, offers comparable (or even greater) fairness performance while sacrificing less accuracy. We also present accuracy and fairness trade-off curves of FairBatch in Sec. B.5. One key implementation benefit is reflected in the small numbers of epochs. We also obtain consistent wall clock times, presented in Sec. B.6. As mentioned earlier, AdaFair is the most similar in spirit to FairBatch as it adjusts example weights based on the fairness performances of prior models. We demonstrate in Sec. B.7 that FairBatch and AdaFair indeed show similar convergence behaviors yet in different scales (rounds for AdaFair vs. epochs for FairBatch). One distinctive feature of FairBatch relative to AdaFair is the use of a single model training, thus enabling much faster speed (22.5-96x). We also make similar comparisons yet w.r.t. another fairness measure: DP disparity. See Table <ref type="table">2</ref>. Recall that minimizing DP disparity involves adjusting two hyperparameters (&#955; 1 , &#955; 2 ), which also means that d = 2. Although FairBatch's theoretical guarantees hold only when using one hyperparameter (i.e., d = 1), we nonetheless see similar results where FairBatch is on par or better than the other approaches, while being the most robust in all aspects.  <ref type="table">1</ref> and<ref type="table">2</ref> already demonstrate FairBatch's performance against the state of the arts, in this section we emphasize the usability of FairBatch by showing how it can improve fairness of any pretrained unfair model via fine-tuning and only compare it with Cutting, which is also easy to adopt. Table <ref type="table">3</ref> shows how FairBatch improves fairness of pre-trained models (ResNet18 <ref type="bibr">(He et al., 2016)</ref> and GoogLeNet <ref type="bibr">(Szegedy et al., 2015)</ref>) on the UTKFace dataset <ref type="bibr">(Zhang et al., 2017)</ref>. Each image has three types of attributes: GENDER, RACE, and AGE. We use RACE as the sensitive attribute and consider two scenarios where the label attribute is GENDER or AGE. While GENDER is binary, AGE is multi-valued (&lt;21, 21-40, 41-60, and &gt;60), so we extend FairBatch in a straightforward fashion; see Sec. B.8 for details. Both Cutting and FairBatch reduce the ED disparities of the original pre-trained models. However, only FairBatch does so without sacrificing accuracy performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Synthetic</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">COMPATIBILITY WITH OTHER BATCH SELECTION TECHNIQUES</head><p>We demonstrate another key aspect of FairBatch: Compatibility with existing batch selection approaches that use importance sampling for faster convergence in training. The key functionality of the prior batch selection techniques is that examples considered to be "important" are given higher weights so as to be sampled more frequently. FairBatch can easily be tuned to accommodate such functionality: determining the batch-ratios of sensitive groups and then sampling using the importance weights per group. We evaluate FairBatch combined with one prominent technique, loss-based weighting <ref type="bibr">(Loshchilov and Hutter, 2016)</ref>, on our synthetic dataset using EO and DP. We find that FairBatch indeed converges more quickly. It uses about 50 fewer epochs with similar fairness performances; see Sec. B.9 for the EO and DP convergence plots.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RELATED WORK</head><p>Model Fairness Various fairness measures have been proposed to reflect legal and social issues <ref type="bibr">(Narayanan, 2018)</ref>. Among them, we focus on group fairness measures: equal opportunity <ref type="bibr">(Hardt et al., 2016)</ref>, equalized odds <ref type="bibr">(Hardt et al., 2016)</ref>, and demographic parity <ref type="bibr">(Feldman et al., 2015)</ref>.</p><p>A variety of techniques have been proposed and can be categorized into (1) pre-processing techniques <ref type="bibr">(Kamiran and Calders, 2011;</ref><ref type="bibr">Zemel et al., 2013;</ref><ref type="bibr">Feldman et al., 2015;</ref><ref type="bibr">du Pin Calmon et al., 2017;</ref><ref type="bibr">Choi et al., 2020;</ref><ref type="bibr">Jiang and Nachum, 2020)</ref>, which debias or reweight data, (2) in-processing techniques <ref type="bibr">(Kamishima et al., 2012;</ref><ref type="bibr">Zafar et al., 2017a;</ref><ref type="bibr">b;</ref><ref type="bibr">Agarwal et al., 2018;</ref><ref type="bibr">Zhang et al., 2018;</ref><ref type="bibr">Cotter et al., 2019;</ref><ref type="bibr">Roh et al., 2020)</ref>, which tailor the model training for fairness, and (3) postprocessing techniques <ref type="bibr">(Kamiran et al., 2012;</ref><ref type="bibr">Hardt et al., 2016;</ref><ref type="bibr">Pleiss et al., 2017;</ref><ref type="bibr">Chzhen et al., 2019)</ref>, which perturb only the model output without touching upon the inside. Most of these methods require broad changes in data preprocessing, model training, or model outputs in machine learning systems <ref type="bibr">(Venkatasubramanian, 2019)</ref>. In contrast, FairBatch only requires a single-line change in code to replace batch selection while achieving comparable or even greater performances against the state of the arts.</p><p>Among the fairness techniques, AdaFair <ref type="bibr">(Iosifidis and Ntoutsi, 2019)</ref> is the most similar in spirit to FairBatch. AdaFair extends the well-known AdaBoost <ref type="bibr">(Friedman et al., 2000)</ref> where examples that lead to poor accuracy or fairness are boosted, i.e., given higher weights during the next round of training a new model that is added to the ensemble. In comparison, FairBatch is based on theoretical foundations of bilevel optimization and effectively performs the reweighting during each epoch (not through rounds), which leads to an order of magnitude improvement in speed as shown in Sec. 4.1.</p><p>Although not our immediate focus, there are other noteworthy fairness measures: (1) individual fairness <ref type="bibr">(Dwork et al., 2012)</ref> where close examples should be treated similarly, (2) causality-based fairness <ref type="bibr">(Kilbertus et al., 2017;</ref><ref type="bibr">Kusner et al., 2017;</ref><ref type="bibr">Zhang and Bareinboim, 2018;</ref><ref type="bibr">Nabi and Shpitser, 2018;</ref><ref type="bibr">Khademi et al., 2019)</ref>, which aims to overcome the limitations of non-causal approaches by understanding the causal relationship between attributes, and (3) distributionally robust optimization (DRO) <ref type="bibr">(Sinha et al., 2017)</ref>-based fairness <ref type="bibr">(Hashimoto et al., 2018)</ref>, which achieves accuracy parity without the knowledge of sensitive attribute by balancing the risks across all distributions. Extending FairBatch to support these measures is an interesting future work.</p><p>Finally, <ref type="bibr">Chouldechova and Roth (2018)</ref> describe three causes of unfairness that help clarify Fair-Batch's fairness contributions: (1) minimizing average error fits majority populations, (2) bias encoded in data, and (3) the need to explore and gather more data. FairBatch addresses the cause (1) via balancing the sensitive group ratios within a batch. FairBatch also addresses (2) in some cases. For example, consider the recidivism prediction problem described in <ref type="bibr">(Chouldechova and Roth, 2018)</ref> where minority populations have biased labels. In this case, FairBatch can be configured to make the recidivism prediction rate for the minority population similar to those of other populations. There may be other types of data bias that FairBatch is not able to address. Finally, FairBatch does not directly address (3) where one must gather more data for better fairness. Instead, there is a recent line of work that studies data collection techniques (Tae and Whang, 2021) for fairness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Batch Selection</head><p>The batch selection literature for SGD focuses on analyzing the effect of batch sizes <ref type="bibr">(Keskar et al., 2017;</ref><ref type="bibr">Masters and Luschi, 2018)</ref> and various sampling techniques <ref type="bibr">(Shamir, 2016;</ref><ref type="bibr">G&#252;rb&#252;zbalaban et al., 2019)</ref>. More recently, importance sampling techniques have been proposed for faster convergence <ref type="bibr">(Loshchilov and Hutter, 2016;</ref><ref type="bibr">Alain et al., 2016;</ref><ref type="bibr">Stich et al., 2017;</ref><ref type="bibr">Csiba and Richt&#225;rik, 2018;</ref><ref type="bibr">Katharopoulos and Fleuret, 2018;</ref><ref type="bibr">Johnson and Guestrin, 2018)</ref>. In comparison, FairBatch takes the novel approach of using batch selection for better fairness and is compatible with other existing techniques.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>We addressed model fairness via the lens of bilevel optimization and proposed the FairBatch batch selection algorithm. The bilevel optimization provides a natural framework where the inner optimizer is SGD, and the outer optimizer performs adaptive batch selection to improve fairness. We presented FairBatch for implementing this optimization and showed how its underlying theory supports the fairness measures: equal opportunity, equalized odds, and demographic parity. We showed that FairBatch offers respectful performances that are on par or even better than the state of the arts w.r.t. all aspects in consideration: accuracy, fairness, and runtime. Also, FairBatch can readily be adopted to machine learning systems with a minimal change of replacing the batch selection with a single-line of code and be gracefully merged with other batch selection techniques used for faster convergence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A APPENDIX -THEORY AND ALGORITHMS</head><p>A.1 DEMOGRAPHIC PARITY</p><p>We continue from Sec. 2 and provide more details on how we can capture demographic parity using our bilevel optimization framework. Proposition 2. If m 0,0 = m 0,1 = m 1,0 = m 1,1 , then L 0,0 (w) = L 1,0 (w) and L 0,1 (w) = L 1,1 (w) can serve as a sufficient condition for demographic parity.</p><p>Proof. Slightly abusing the notation, we denote by Pr(&#8226;) the empirical probability. The demographic parity is satisfied when Pr(&#375; = 1|z = 0) = Pr(&#375; = 1|z = 1) holds. Thus,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>By replacing</head><p>.</p><p>, the above condition reduces to</p><p>A sufficient condition to the above condition is L 0,0 (w) = L 1,0 (w) and L 0,1 (w) = L 1,1 (w).</p><p>In general, the condition of the above proposition does not hold. Observe that another sufficient condition to demographic parity is as follows:</p><p>Then, we have the following bilevel optimization problem:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 PROOF FOR LEMMA 1</head><p>We continue from Sec. 3.1 and provide a full proof for Lemma 1. Here we recall Lemma 1.</p><p>Proof. It it known that a continuous function f : R &#8594; R is quasiconvex if and only if at least one of the following conditions holds: 1) nondecreasing, 2) nonincreasing, and 3) nonincreasing and then nondecreasing <ref type="bibr">(Boyd et al., 2004)</ref>. We will prove the lemma by showing that the function F (&#955;) is quasiconvex by showing that it is nonincreasing and then nondecreasing. More precisely, we will show that f 1 (w &#955; ) -g 1 (w &#955; ) is a nonincreasing function. It is easy to see that this directly implies that |f 1 (w &#955; ) -g 1 (w &#955; )| is nonincreasing and then nondecreasing.</p><p>Case 1 (h(w) = 0) Consider &#955; 1 and &#955; 2 such that &#955; 1 &gt; &#955; 2 . If we can show f 1 (w * &#955;1 ) &#8804; f 1 (w * &#955;2 ) and g 1 (w * &#955;1 ) &#8805; g 1 (w * &#955;2 ), then this implies that f 1 (w &#955; )-g 1 (w &#955; ) is a nonincreasing function. Indeed, this is very intuitive: If we increase &#955;, the inner optimization problems puts a higher weight on f 1 (&#8226;), resulting in a lower value of f 1 (w * ) and a higher value of g 1 (w * ). We formally show this by contradiction. By the definition of w &#955; , we have the following two conditions:</p><p>This completes the proof of the first claim by contradiction.</p><p>The second claim immediately follows the first claim. Since</p><p>As shown in the earlier part of this proof, f 1 (w &#955; ) -g 1 (w &#955; ) is a nonincreasing function, i.e., df1(w &#955; )-g1(w &#955; ) d&#955; &#8804; 0. Thus, sign( dF (&#955;) d&#955; ) = sign(g 1 (w &#955; ) -f 1 (w &#955; )).</p><p>Case 2 (If f 1 (&#8226;), g 1 (&#8226;), and h(&#8226;) are twice differentiable, &#955;&#8711; 2 f 1 (w &#955; ) + (c 1 -&#955;)&#8711; 2 g 1 (w &#955; ) + &#8711; 2 h(w &#955; ) 0 for every &#955; &#8712; [0, c 1 ]) In this part of the proof, we will denote w &#955; by w for simplicity. To show that f 1 (w) -g 1 (w) is a nondecreasing function (in &#955;), consider the derivative:</p><p>To compute dw d&#955; , we implicitly differentiate (with respect to &#955;) the following stationary equation.</p><p>By rearranging terms, we have</p><p>By the assumption, &#955;&#8711; 2 f 1 (w)+(c 1 -&#955;)&#8711; 2 g 1 (w)+&#8711; 2 h(w) is positive definite and hence invertible. Thus,</p><p>Therefore,</p><p>Note that &#955;&#8711; 2 f 1 (w) + (c 1 -&#955;)&#8711; 2 g 1 (w) + &#8711; 2 h(w) -1 is also positive definite. Thus,</p><p>) is always negative, and hence f 1 (w) -g 1 (w) is a decreasing function. Now, observe that</p><p>Therefore, if F (&#8226;) = 0, then &#8706; &#955; F (&#955;) = {v} and sign (v) = sign (g 1 (w) -f 1 (w)). We continue from Sec. 3.2 and provide an example where inner objective's convexity does not imply outer objective's convexity. Consider the following strongly convex functions f 1 (&#8226;) and g 1 (&#8226;):</p><p>Shown in Fig. <ref type="figure">2</ref> is the outer objective function F (&#955;). One can observe that it is not convex. Note that it is quasiconvex by Lemma 1.</p><p>A.4 GRADIENT WHEN d &#8805; 1</p><p>We continue from Sec. 3.2 and derive the gradient of the outer objective function. Recall how we formulated the general bilevel optimization problem:</p><p>In this section, we will prove the following:</p><p>In this part of the proof, we will denote w &#955; by w for simplicity.</p><p>To compute dw d&#955;i , we implicitly differentiate (with respect to &#955; i ) the following stationary equation. [&#955; j &#8711;f j (w) + (c j -&#955; j )&#8711;g j (w)] + &#8711;h(w) = 0</p><p>By rearranging terms, we have</p><p>By the assumption, H &#955; := d j=1 &#955; j &#8711; 2 f j (w) + (c j -&#955; j )&#8711; 2 g j (w) +&#8711; 2 h(w) is positive definite and hence invertible. Thus,</p><p>we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.5 RATIONALE AND INTUITION BEHIND THE APPROXIMATION</head><p>We continue from Sec. 3.2 and provide more justifications for the gradient approximation technique. Assume that</p><p>Then, the gradient can be fully characterized as in equation 4.</p><p>The rationale behind the approximation &#947; &#8776; (0, 0, . . . , &#947; i * , . . . , 0) is that |&#947; i * | will be maximized at i</p><p>), and they are always perfectly aligned when i = i * . This approximation is also intuitive. Recall that changing &#955; i * affects the weights associated with f i * (w) and g i * (w) in the inner optimization problem. Thus, changes in &#955; i * will directly affect F (&#955;) = |f i * (w) -g i * (w)|. On the other hand, changing &#955; i for i = i * does not affect the weights associated with f i * (w) and g i * (w) but only affects the weights of other terms, so it will only indirectly and weakly affect F (&#955;).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.6 FAIRBATCH ALGORITHMS IN PSEUDOCODE</head><p>We continue from Sec. 3.2 and present the FairBatch algorithms in pseudocode. Algorithms 2, 3, and 4 show how &#955; is adjusted for equal opportunity, equalized odds, and demographic parity, respectively. From the intermediate model at each epoch (or after a certain iterations), we first obtain f (w) and g(w), which correspond to the losses conditioned on each class. Then, one can update the current value of &#955; by comparing f (w) and g(w).</p><p>Algorithm 4: Adaptive adjustment of &#955; w.r.t. demographic parity. Input: Intermediate model, criterion, train data (x train , z train , y train ), previous lambda &#955; (t-1) , and FairBatch's learning rate &#945; output = model (x train ) loss = criterion (output, 1) d y=0 = sum(loss[(y = 0, z = 0)])/len(z = 0)sum(loss[(y = 0, z = 1)])/len(z = 1) </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 FAIRNESS CURVES</head><p>We continue from Sec. 4.1 and show in Figures <ref type="figure">3</ref> and<ref type="figure">4</ref> the EO and DP disparity curves against the number of epochs for each fairness technique on the synthetic dataset. We also directly compare the curves of all fairness techniques in one graph as shown in Figure <ref type="figure">5</ref>. Since LBC and AdaFair require more than 10x many epochs than other methods, we only show their first 1000 epochs. As a result, FairBatch is one of the fastest methods to converge to low EO or DP disparities.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.5 TRADE-OFF CURVES OF FAIRBATCH</head><p>We continue from Sec. 4.1 and show in Fig. <ref type="figure">6</ref> the accuracy-fairness trade-off curves of FairBatch for EO and DP on the synthetic dataset. FairBatch can be tuned by making it "less sensitive" to disparity. In Algorithms 2 and 4, notice that the &#955; parameters are updated if there is any disparity among sensitive groups. We modify this logic where the &#955; parameters are only updated if the disparity is above some threshold T . The trade-off curves in Fig. <ref type="figure">6</ref> are thus generated by adjusting T . For both EO and DP, we observe that there is a clear trade-off between accuracy and disparity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.6 WALL CLOCK TIMES</head><p>We continue from Sec. 4.1 and show in Table <ref type="table">5</ref> the wall clock times (in seconds) of the experiments in Table <ref type="table">1</ref> where we compare FairBatch against all the fairness techniques on the synthetic, COMPAS, and AdultCensus datasets. As a result, each runtime is proportional to the number of epochs shown in Table <ref type="table">1</ref>. When comparing the runtimes of individual batches, FairBatch's batch takes 1.5x longer to run than LR's batch.       </p></div></body>
		</text>
</TEI>
