<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>f-FERM: A Scalable Framework for Robust Fair Empirical Risk Minimization</title></titleStmt>
			<publicationStmt>
				<publisher>International Conference on Learning Representations</publisher>
				<date>01/16/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10515371</idno>
					<idno type="doi"></idno>
					
					<author>Sina Baharlouei</author><author>Shivam Patel</author><author>Meisam Razaviyayn</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Training and deploying machine learning models that meet fairness criteria for protected groups are fundamental in modern artificial intelligence. While numerous constraints and regularization terms have been proposed in the literature to promote fairness in machine learning tasks, most of these approaches are not amenable to stochastic optimization due to the complex and nonlinear structure of constraints and regularizers. Here, the term "stochastic" refers to the ability of the algorithm to work with small mini-batches of data. Motivated by the limitation of existing literature, this paper presents a unified stochastic optimization framework for fair empirical risk minimization based on f-divergence measures (f-FERM). The proposed stochastic algorithm enjoys theoretical convergence guarantees. In addition, our experiments demonstrate the superiority of fairness-accuracy tradeoffs offered by f-FERM for almost all batch sizes (ranging from full-batch to batch size of one). Moreover, we show that our framework can be extended to the case where there is a distribution shift from training to the test data. Our extension is based on a distributionally robust optimization reformulation of the f-FERM objective under ℓp norms as uncertainty sets. Again, in this distributionally robust setting, f-FERM not only enjoys theoretical convergence guarantees but also outperforms other baselines in the literature in the tasks involving distribution shifts. An efficient stochastic implementation of f-FERM is publicly available.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Machine learning models are increasingly deployed in critical applications ranging from healthcare <ref type="bibr">(Ahmad et al., 2018)</ref> to image processing <ref type="bibr">(Krizhevsky et al., 2017)</ref>, education to job recruitment <ref type="bibr">(Boselli et al., 2018)</ref>, and social networking to cybersecurity <ref type="bibr">(Xin et al., 2018)</ref>. Machine learning practitioners have adopted learning algorithms to fathom inherently difficult and crucial problems. However, na&#239;ve deployment of these models may lead to serious shortcomings such as biased predictions against minority groups <ref type="bibr">(Angwin et al., 2016;</ref><ref type="bibr">Buolamwini &amp; Gebru, 2018)</ref>, vulnerability to adversarial attacks <ref type="bibr">(Madry et al., 2017;</ref><ref type="bibr">Carlini &amp; Wagner, 2017;</ref><ref type="bibr">Baharlouei et al., 2023)</ref>, or lack of generalizability <ref type="bibr">(Arjovsky et al., 2019)</ref>. Consequently, it is of utmost importance to have reliable and trustworthy models that are, in particular, fair and comply with equality norms and provisions worldwide <ref type="bibr">(Act, 1964;</ref><ref type="bibr">Elford, 2023)</ref>.</p><p>With the increasing concern for the trustworthiness of unchecked machine learning algorithms, a broad class of paradigms has been proposed to counteract and mitigate both the cause and effects of model unreliability. Imposing statistical independence between model output and particular input features is of interest in various domains, especially when the generalization of a trained model is based on a collection of spurious features present in the training dataset <ref type="bibr">(Dwork et al., 2012;</ref><ref type="bibr">Hardt et al., 2016;</ref><ref type="bibr">Yan et al., 2017)</ref>. 
These could be sensitive features like gender, race, age, and/or income in the context of fairness or confounding factors like environmental artifacts in the context of image classification <ref type="bibr">(Arjovsky et al., 2019)</ref>. Existing literature on imposing statistical independence between selected input features and model outputs falls into three approaches: pre-processing, post-processing, and in-processing methods.</p><p>Pre-processing methods entail upstream changes made in datasets to mask sensitive features or reduce the dependency of output variables on sensitive features through transforming data in a stage before the training phase <ref type="bibr">(Kamiran &amp; Calders, 2012;</ref><ref type="bibr">Zemel et al., 2013;</ref><ref type="bibr">Ustun et al., 2019)</ref>. Post-processing methods involve model-specific adjustments to the model's output to ensure the independence of predictions and sensitive attributes <ref type="bibr">(Hardt et al., 2016;</ref><ref type="bibr">Alghamdi et al., 2022)</ref>. While pre-processing and post-processing methods do not affect the training procedure, they fail to exploit underlying training mechanisms for the best achievable accuracy-fairness tradeoffs. Unsurprisingly, optimizing accuracy and fairness jointly (in-processing) leads to better tradeoffs than sequentially optimizing fairness and accuracy in a pre-processing or post-processing fashion.</p><p>In-processing methods alternatively add fairness constraints or regularizers, penalizing dependence between sensitive attributes and output variables. <ref type="bibr">(Zafar et al., 2017)</ref> utilizes covariance as the measure of independence between the sensitive attributes and the predictions. While such a measure is amenable to stochastic updates, it fails to capture correlations beyond linear ones. 
Alternatively, several non-linear measures such as Rényi correlation <ref type="bibr">(Baharlouei et al., 2020)</ref>, χ² divergence <ref type="bibr">(Lowy et al., 2022)</ref>, L1 distance <ref type="bibr">(Donini et al., 2018)</ref>, and Maximum Mean Discrepancy (MMD) <ref type="bibr">(Prost et al., 2019)</ref> have been proposed in the literature to establish the independence of the predictors and sensitive attributes. In-processing techniques can be model-specific <ref type="bibr">(Wan et al., 2021;</ref><ref type="bibr">Aghaei et al., 2019)</ref> or generalizable to different training algorithms <ref type="bibr">(Baharlouei et al., 2020;</ref><ref type="bibr">Lowy et al., 2022)</ref>.</p><p>In the spirit of in-processing methods, input data-driven constraints or regularization terms are used to modify training objectives of problems like learning generalizable models to new environments, invariant learning, and learning in the presence of distribution shifts <ref type="bibr">(Arjovsky et al., 2019;</ref><ref type="bibr">Mary et al., 2019;</ref><ref type="bibr">Baharlouei et al., 2020)</ref>. 
Such constrained/regularized reformulations are prevalent in learning robust classifiers against adversarial attacks <ref type="bibr">(Sinha et al., 2018)</ref>, meta-learning <ref type="bibr">(Balaji et al., 2018)</ref>, federated learning <ref type="bibr">(Deng et al., 2023)</ref>, and alternative learning paradigms such as learning distributionally robust optimization (DRO) models <ref type="bibr">(Kuhn et al., 2019;</ref><ref type="bibr">Levy et al., 2020)</ref>, tilted empirical risk minimization (TERM) <ref type="bibr">(Li et al., 2020)</ref>, and the square-root Lasso <ref type="bibr">(Belloni et al., 2011)</ref>.</p><p>While in-processing techniques outperform pre-processing and post-processing approaches, they are not scalable to large datasets because of a lack of adaptability to stochastic optimization <ref type="bibr">(Mary et al., 2019;</ref><ref type="bibr">Lowy et al., 2022)</ref>. All aforementioned examples consist of regularization terms in their objective functions where the gradient cannot be described as a linear combination of data point functions. As a result, applying stochastic gradient descent or other stochastic first-order methods on the objective functions of such problems might not converge, especially for small batch sizes.</p><p>Motivated by this, <ref type="bibr">(Lowy et al., 2022)</ref> proposes a stochastic optimization framework for Exponential Rényi Mutual Information as the measure of independence. More recently, <ref type="bibr">Zhong &amp; Tandon (2023)</ref> use f-divergences as regularization terms to establish the independence between sensitive attributes and predictions. They estimate the f-divergence regularizers offline through multi-layer neural networks to avoid the computational challenges of devising scalable stochastic methods for nonconvex min-max problems. 
Our approach, on the other hand, directly solves the variational formulation in both full-batch and stochastic settings with convergence guarantees to non-spurious solutions. In Section 2, using the variational representation of f-divergences, we present a convergent stochastic optimization framework for fair learning via f-divergences. <ref type="bibr">(Lowy et al., 2022)</ref> is a special case of f-divergences where f(t) = t² − 1 (the χ² divergence). Aside from χ², none of the other divergences listed in Table <ref type="table">1</ref> has, to the best of our knowledge, previously been introduced in the literature as a fairness regularizer.</p><p>Designing convergent stochastic algorithms for fair empirical risk minimization can be further explored in scenarios involving changes in the data distribution from the source to the target domain. Detection and mitigation of biases against protected groups in the presence of distribution shifts have been extensively studied in recent years. <ref type="bibr">Lechner et al. (2021)</ref> theoretically shows that learning fair representations (pre-processing) is nearly impossible for the popular notions of fairness, such as demographic parity, in the presence of distribution shift. <ref type="bibr">Ding et al. (2021)</ref>, on the other hand, experimentally demonstrates that applying post-processing fairness techniques <ref type="bibr">(Hardt et al., 2016)</ref> to learn fair predictors of income concerning race, gender, and age fails to transfer from one US state (training domain) to another state. Overlooking distribution shifts can lead to catastrophic decisions threatening the well-being of human subjects when deploying a model trained on certain hospitals to other hospitals <ref type="bibr">(Schrouff et al., 2022)</ref>. 
The current literature for handling distribution shifts with in-processing methods relies on certain assumptions on the type of distribution shift (demographic shift <ref type="bibr">(Fang et al., 2020;</ref><ref type="bibr">Du &amp; Wu, 2021;</ref><ref type="bibr">Maity et al., 2021;</ref><ref type="bibr">Giguere et al., 2021)</ref>, label shift <ref type="bibr">(Dai &amp; Brown, 2020)</ref>, and/or covariate shift <ref type="bibr">(Rezaei et al., 2021;</ref><ref type="bibr">Singh et al., 2021)</ref>) or explicit access to the causal graph <ref type="bibr">(Mishler &amp; Dalmasso, 2022;</ref><ref type="bibr">Schrouff et al., 2022)</ref> of predictors, sensitive attributes, and target variables. As a result, they face practical limitations and cannot cope with most real-world problems involving complex shifts that cannot be categorized into the types assumed in those works. Alternatively, <ref type="bibr">Taskesen et al. (2020)</ref> provide convex objective functions for imposing fairness on logistic regression using constrained optimization. <ref type="bibr">Staib &amp; Jegelka (2019)</ref> use MMD for defining uncertainty sets around the training distribution, whereas Husain (2020) uses Integral Probability Metrics (IPM) to mitigate the distribution shift. The main limitation of these approaches is their reliance on the convexity of the underlying learning model and their lack of scalability due to incompatibility with stochastic optimization algorithms. <ref type="bibr">Wang et al. (2023)</ref> use the Maximum Mean Discrepancy (MMD) distance between the spectral norms of the Hessian matrix at advantaged and disadvantaged data points. However, they do not provide convergence guarantees for their proposed algorithm to any notion of optimality. In addition, the method is not necessarily amenable to stochastic updates. 
While we naturally define the uncertainty set directly on the joint distribution of sensitive attributes and predictions, they use the curvature of the obtained solution, quantified by the norm of the Hessian matrix, as a heuristic for promoting the robustness of the fair solution.</p><p>Contributions: This paper establishes a scalable (stochastic) fair empirical risk minimization framework through regularization via f-divergences (f-FERM) for both the standard setting and the distribution shift setting. f-FERM presents a unified methodology based on the Legendre-Fenchel transformation, enabling us to develop theoretically convergent first-order stochastic algorithms when only small batches of data are available at each iteration. Further, we present the first distributionally robust optimization framework under ℓp norm uncertainty sets covering nonconvex losses such as neural networks. The presented framework for fair inference in the presence of distribution shifts does not rely on the causal graph describing the causal interaction of input features, sensitive attributes, and target variables, which is rarely available in practical problems.</p><p>Paper Organization: We structure our response towards designing scalable, robust, and fair algorithms into two sections. Section 2 motivates the design of unbiased gradient estimators for objectives with information-theoretic f-divergence regularizers. In Section 3, we present our approach for fair inference in the presence of distribution shift in detail. Our experiments provide an extensive examination of various f-divergences and their suitability as regularizers, and also show the consistency of our method across all batch sizes, in contrast to existing benchmarks. Similar experiments are carried out for robust training under varying amounts of distribution shift in the data.</p></div>
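The fairness regularizer at the heart of f-FERM measures the divergence between the joint distribution of predictions and sensitive attributes and the product of their marginals. The sketch below is an illustrative reconstruction of that computation, not the authors' released code; the helper names `f_divergence` and `fair_penalty` are ours. It estimates the penalty from softmax outputs for the KL and χ² choices of f:

```python
import numpy as np

def f_divergence(joint, marg_y, marg_s, f):
    # D_f(joint || marg_y x marg_s) = sum over the grid of q * f(p / q).
    prod = np.outer(marg_y, marg_s)
    ratio = np.divide(joint, prod, out=np.ones_like(joint), where=prod > 0)
    return float(np.sum(prod * f(ratio)))

def fair_penalty(probs, s, num_groups, f):
    # probs: (n, m) softmax outputs; s: (n,) integer sensitive attribute.
    n, m = probs.shape
    joint = np.stack([probs[s == k].sum(axis=0) / n
                      for k in range(num_groups)], axis=1)   # P(yhat=j, s=k)
    marg_y = probs.mean(axis=0)                              # P(yhat=j)
    marg_s = np.bincount(s, minlength=num_groups) / n        # P(s=k)
    return f_divergence(joint, marg_y, marg_s, f)

kl = lambda t: t * np.log(np.clip(t, 1e-12, None))   # f(t) = t log t
chi2 = lambda t: (t - 1.0) ** 2                      # f(t) = (t - 1)^2
```

When the softmax outputs carry no information about s, the penalty is zero; when the predictions reveal s exactly, it is maximal. In-processing training adds λ times this quantity to the empirical risk.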
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">FAIR EMPIRICAL RISK MINIMIZATION VIA f-DIVERGENCES</head><p>A widely studied problem in algorithmic fairness is promoting a notion of group fairness, such as demographic parity, equalized odds, equality of opportunity, or sufficiency, through an in-processing method. For these notions, we aim to establish a [conditional] statistical independence between the predictions (e.g., the creditworthiness of the individual) and the sensitive attributes (e.g., gender, race). For simplicity of presentation, we formulate all problems under the demographic parity notion, which requires statistical independence between the prediction and the sensitive attribute. Without loss of generality, all formulations and methods are generalizable to other aforementioned notions of group fairness by considering conditional random variables (see Appendix A). A popular in-processing approach for training fair (classification) models under the demographic parity notion is to regularize the empirical risk minimization:</p><p>min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D(P_{ŷ_θ,s}, P_{ŷ_θ} ⊗ P_s),   (1)</p><p>where θ denotes the learning parameters (e.g., the weights of the neural network); x_i ∈ ℝ^d is the i-th input feature vector; y_i is the actual label/class for sample i; ŷ_θ(x_i) is the prediction of the model for sample i; and ℓ(ŷ_θ(x_i), y_i) is the loss function measuring the "goodness-of-fit" for sample i. Here, D is a divergence between the joint probability distribution of the predictions and sensitive attributes and the Kronecker product of their marginal distributions. Recall that ŷ_θ and s are statistically independent iff P(ŷ_θ(x), s) = P(ŷ_θ(x)) ⊗ P(s). Thus, the second term in (1) is zero iff ŷ_θ and s are statistically independent (complete fairness under the demographic parity notion). This section studies fair empirical risk minimization regularized by a broad class of f-divergence measures. 
Let P and Q be two discrete probability measures taking values in P = {1, . . . , m}. The f-divergence between P and Q is defined as <ref type="bibr">(Polyanskiy &amp; Wu, 2022, Def. 4.9)</ref> (see Appendix B for the general continuous case):</p><p>D_f(P ‖ Q) = Σ_{j=1}^m q_j f(p_j / q_j).   (2)</p><p>The above definition, which is also known as f-mutual information <ref type="bibr">(Lu et al., 2023;</ref><ref type="bibr">Csiszár, 1967)</ref>, covers many known divergence measures used for imposing fairness, such as the KL-divergence for the choice of f(t) = t log(t) <ref type="bibr">(Shui et al., 2022)</ref>, or the χ² divergence when f(t) = (t − 1)² <ref type="bibr">(Lowy et al., 2022)</ref>. As shown in Appendix C, D_f in (1) is zero if and only if s and ŷ_θ are statistically independent for the choices of f listed in Table <ref type="table">1</ref>. In addition, we prove that these f-divergences either cover or provide upper bounds for the popular notions of fairness violation in the literature, such as ℓp distances, Rényi correlation <ref type="bibr">(Baharlouei et al., 2020)</ref>, and demographic parity (equalized odds) violation. This means that by minimizing these regularizers, we are minimizing an upper bound on (other) popular fairness violation measures, and thus we are controlling them implicitly. Further, unlike Rényi correlation <ref type="bibr">(Baharlouei et al., 2020;</ref><ref type="bibr">Grari et al., 2020)</ref>, we can utilize Legendre-Fenchel duality (and the variational representation) to develop (provably) convergent algorithms with stochastic (mini-batch) updates. 
This formulation and the resulting stochastic optimization algorithm are described in the next subsection.</p><p>2.1 A CONVERGENT STOCHASTIC ALGORITHM FOR FAIR ERM VIA f-DIVERGENCES Let us start by rewriting (1) using f-divergences as the divergence measure:</p><p>min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D_f(P_{ŷ_θ,s} ‖ P_{ŷ_θ} ⊗ P_s).   (f-FERM)</p><p>While the non-linearity of f-divergences in (f-FERM) empowers the underlying model to capture more complex dependencies between sensitive attributes and predictions compared to linear measures <ref type="bibr">(Zafar et al., 2017)</ref>, the objective function can no longer be represented as a summation of functions over input data points. Consequently, one cannot directly apply the stochastic gradient descent method (or its variations, such as Adam) to the objective function in (f-FERM). In particular, directly evaluating the gradient of the objective function of (f-FERM) on a mini-batch of data leads to a statistically biased estimation of the entire objective's gradient. Such statistical biases prevent the convergence of algorithms such as SGD (even with a strongly convex minimization landscape) <ref type="bibr">(Ajalloeian &amp; Stich, 2020;</ref><ref type="bibr">Chen &amp; Luss, 2018)</ref>, let alone the more complex objectives arising in modern-day neural networks.</p><p>To derive stochastic algorithms, one can use the variational forms of f-divergences to delineate them as a pointwise supremum of affine functions of the probability densities. The most commonly used and well-behaved transform is the Legendre-Fenchel transform (often called the convex conjugate), which linearizes the dependence of the objective function on the input data points through a variational reformulation. Particularly, we can rewrite (f-FERM) using the following result:</p><p>Proposition 2.1. Let f(·) be a convex function. Then, (f-FERM) can be reformulated as:</p><p>min_θ max_A (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ Σ_{j,k} [α_{jk} P_{ŷ_θ,s}(ŷ_θ = j, s = k) − f*(α_{jk}) P_{ŷ_θ}(ŷ_θ = j) P_s(s = k)],   (3)</p><p>where f*(z) = sup_{w ∈ dom(f)} w^T z − f(w) is the Legendre-Fenchel transformation of the function f.</p><p>Proof. 
The proof is standard and appears in Appendix D.</p><p>In order to solve (3), we will use (stochastic) first-order methods. Notice that P_s(s = k) is constant throughout the optimization procedure and is computed once by counting the number of data points whose sensitive attribute takes the value of k: P_s(s = k) = (1/n) Σ_{i=1}^n 1(s_i = k).</p><p>Assume we use a softmax layer to compute the probabilities of different classes in our classification task (as is standard in logistic regression or when using neural networks for classification). Let F_j(x_i; θ) be the j-th entry of the softmax layer output for datapoint x_i, predicting the probability of class j. Then it is easy to show that we can obtain unbiased estimators of P_{ŷ_θ}(ŷ_θ = j) and P_{ŷ_θ,s}(ŷ_θ = j, s = k) using an i.i.d. mini-batch B of data points. More precisely, we have</p><p>P̂_{ŷ_θ}(ŷ_θ = j) = (1/|B|) Σ_{i∈B} F_j(x_i; θ),   P̂_{ŷ_θ,s}(ŷ_θ = j, s = k) = (1/|B|) Σ_{i∈B} F_j(x_i; θ) 1(s_i = k).   (4)</p><p>As a result, Problem (3) can be written as a linearly separable function of the input data points (the x_i's):</p><p>min_θ max_A (1/n) Σ_{i=1}^n [ℓ(ŷ_θ(x_i), y_i) + λ Σ_{j,k} (α_{jk} F_j(x_i; θ) 1(s_i = k) − f*(α_{jk}) F_j(x_i; θ) P_s(s = k))].   (5)</p><p>Thus, evaluating the gradient of the objective function w.r.t. the variables θ and A over a random batch of data points leads to an unbiased estimator of the gradient of the objective function.</p><p>In addition to providing an unbiased estimator of gradients, the reformulation (5) has another crucial property: the objective function is concave in A. Therefore, optimization problem (5) falls under the category of nonconvex-concave min-max optimization problems. That is, the objective is (possibly) nonconvex in θ and is concave in A. Thus, we can borrow tools from the (stochastic) nonconvex-concave min-max optimization literature <ref type="bibr">(Lin et al., 2020;</ref><ref type="bibr">Razaviyayn et al., 2020a;</ref><ref type="bibr">Li et al., 2023)</ref> to derive a convergent first-order stochastic algorithm as presented in Algorithm 1. 
We list the closed forms of f(·) and f*(·) for several widely used f-divergence measures, including the KL-divergence, reverse KL-divergence, χ²-divergence, squared Hellinger distance, Jensen-Shannon divergence, and total variation distance, in Table <ref type="table">1</ref>. For the derivations, see Appendix E.</p><p>Algorithm 1 Stochastic Gradient Descent-Ascent (SGDA) for f-FERM</p><p>Theorem 2.2. Assume that ℓ(·, ·) and F_j(·, θ) are Lipschitz continuous for any given j and θ, and that their gradients are L-Lipschitz. Further, assume that P(s = k) &gt; 0 for all protected groups and P(ŷ_θ = j) &gt; 0 at every iteration for all labels j. Then, for any given batch size 1 ≤ |B| ≤ n, Algorithm 1 finds an ϵ-stationary solution of (f-FERM) in O(1/ϵ⁸) iterations for any given ϵ &gt; 0.</p><p>Proof. The formal statement and proof are relegated to Appendix F.</p><p>Theorem 2.2 applies to all f-divergences listed in Table <ref type="table">1</ref> for all batch sizes (even a batch size of 1). More sophisticated algorithms can be used to obtain O(1/ϵ⁶) iteration complexity <ref type="bibr">(Rafique et al., 2018;</ref><ref type="bibr">Zhang et al., 2022)</ref>. However, such algorithms use nested loops and require more hyperparameter tuning. We provide an example of such an algorithm in Appendix G. If the f-divergence leads to a strongly concave function in A or satisfies the Polyak-Łojasiewicz condition (e.g., for the χ² divergence), a faster rate of O(1/ϵ⁵) can be obtained for this algorithm (Appendix F). In addition, if a larger batch size of O(1/ϵ²) is used, we can further improve this rate to O(1/ϵ⁴) iteration complexity (see Appendix F). 
Finally, when the full batch is used, double/triple-loop algorithms can lead to iteration complexity bounds of O(1/ϵ²) in the nonconvex-strongly-concave setting and O(1/ϵ³) in the general nonconvex-concave setting; see <ref type="bibr">(Kong &amp; Monteiro, 2021;</ref><ref type="bibr">Nouiehed et al., 2019;</ref><ref type="bibr">Ostrovskii et al., 2021b;</ref><ref type="bibr">Thekumparampil et al., 2019)</ref>.</p></div>
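Proposition 2.1 rests on the Legendre-Fenchel identity D_f(P ‖ Q) = Σ_j sup_α [α p_j − q_j f*(α)]. The snippet below is a numerical sanity check of that identity, not the paper's implementation: for the KL case (f(t) = t log t, with conjugate f*(z) = exp(z − 1)), it solves the concave inner maximization by plain gradient ascent, mirroring the ascent step of an SGDA scheme like Algorithm 1, and compares against the direct evaluation. The distributions `p` and `q` are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(5); p /= p.sum()   # stand-in for the joint distribution
q = rng.random(5); q /= q.sum()   # stand-in for the product of marginals

# Direct evaluation of D_KL(P || Q), i.e. f(t) = t log t.
kl_direct = float(np.sum(p * np.log(p / q)))

# Variational form: sum_j max_alpha [alpha * p_j - q_j * f*(alpha)],
# with f*(z) = exp(z - 1).  The inner problem is concave in alpha,
# so gradient ascent converges to the supremum.
alpha = np.zeros(5)
for _ in range(5000):
    alpha += 0.5 * (p - q * np.exp(alpha - 1.0))   # ascent on each alpha_j
kl_variational = float(np.sum(alpha * p - q * np.exp(alpha - 1.0)))

assert abs(kl_direct - kl_variational) < 1e-8
```

The same check goes through for the other conjugates in Table 1; in the full algorithm, θ is simultaneously updated by descent on a mini-batch estimate of (5).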
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">ROBUST f-FERM IN THE PRESENCE OF DISTRIBUTION SHIFTS</head><p>In the previous section, we assumed that the training and test domains have the same distribution. However, this assumption is not necessarily valid in certain applications <ref type="bibr">(Fang et al., 2020)</ref>. In particular, a model that behaves fairly on the training data distribution may have an unfair performance in the test phase. To address this issue, this section develops stochastic algorithms for fair empirical risk minimization via f-divergences in the presence of distribution shifts.</p><p>Assume that P̂_{s,ŷ} is the joint distribution of sensitive attributes and predictions on the training data. The distributionally robust fair empirical risk minimization via f-divergences is formulated as:</p><p>min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ max_{(P,Q) ∈ B} D_f(P ‖ Q),   (6)</p><p>where B = B(P̂, δ) is the distributional uncertainty set, defined as a certain ball around the training distribution P̂ with radius δ. This formulation guarantees that the fairness of the model is preserved (up to an f-divergence violation governed by δ) even when the test distribution slightly changes. With a slight change of notation, P̂ refers to the training distribution, whereas P is the optimization variable.</p><p>One can define the uncertainty set through a δ-neighborhood around the joint distribution of the training data characterized by a distance measure such as ℓp norms, the Wasserstein distance, or the MMD distance. While these distributionally robust uncertainty sets are thoroughly analyzed for empirical risk minimization (ERM) <ref type="bibr">(Kuhn et al., 2019;</ref><ref type="bibr">Blanchet et al., 2019;</ref><ref type="bibr">Levy et al., 2020)</ref>, the DRO formulation for fair ERM is limited to the Wasserstein distance for fair logistic regression <ref type="bibr">(Taskesen et al., 2020)</ref> and to an MMD-based heuristic on the distribution curvature <ref type="bibr">(Wang et al., 2023)</ref>. 
Unfortunately, none of these approaches offers a convergent algorithm with stochastic updates. Further, some of these approaches are limited to special loss functions and heuristics. On the other hand, we study imposing distributionally robust fairness via f-divergences for a general loss function where the uncertainty set is characterized by ℓp norms (Section 3.1) or ℓ∞ norms (Section 3.2). Our results show that the former approach is more suitable when lower levels of robustness for fairness are required, and the latter works better for handling larger distribution shifts.</p><p>3.1 ROBUST f-FERM UNDER ℓp NORMS AND SMALL DISTRIBUTION SHIFTS This section focuses on the widely studied ℓp norms as the uncertainty set for the distributional distance between the training and test domains. In this case, Problem (<ref type="formula">6</ref>) can be written as:</p><p>min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i)  s.t.  max_{‖P − P̂‖_p ≤ δ, ‖Q − Q̂‖_p ≤ δ} D_f(P ‖ Q) ≤ κ,   (7)</p><p>where P represents the joint distribution of the sensitive attributes and predictions, and Q denotes the Kronecker product of the marginal distributions of the sensitive attributes and predictions.</p><p>Since handling non-convex constraints is challenging, as is standard in training machine learning models, we consider the Lagrangian relaxation of Problem (7) as follows:</p><p>min_θ max_{‖P − P̂‖_p ≤ δ, ‖Q − Q̂‖_p ≤ δ} (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D_f(P ‖ Q).   (8)</p><p>This problem falls under the nonconvex-nonconcave min-max optimization category and is most likely computationally hard for general uncertainty sets <ref type="bibr">(Daskalakis et al., 2021)</ref>. However, such a min-max optimization problem can be solved to stationarity when the diameter of the set B is small (i.e., under small domain shift); see <ref type="bibr">(Ostrovskii et al., 2021a)</ref>. The core idea is to approximate the inner maximization problem with its Taylor approximation, leading to a nonconvex-concave min-max optimization problem, which is easier to solve <ref type="bibr">(Daskalakis et al., 2021;</ref><ref type="bibr">Razaviyayn et al., 2020b)</ref>. 
This idea has been used successfully in machine learning; see <ref type="bibr">Foret et al. (2020)</ref> for its use in sharpness-aware minimization. Utilizing this idea, Problem (8) can be approximated as:</p><p>min_θ max_{‖U‖_p ≤ δ, ‖V‖_p ≤ δ} h(θ, U, V),   (9)</p><p>where we used the change of variables U := P − P̂, V := Q − Q̂, and h(θ, U, V) := (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D_f(P̂ ‖ Q̂) + λ ⟨∇_P D_f(P̂ ‖ Q̂), U⟩ + λ ⟨∇_Q D_f(P̂ ‖ Q̂), V⟩. Maximizing the two linear terms over the ℓp balls yields the equivalent minimization problem</p><p>min_θ (1/n) Σ_{i=1}^n ℓ(ŷ_θ(x_i), y_i) + λ D_f(P̂ ‖ Q̂) + λδ ‖∇_P D_f(P̂ ‖ Q̂)‖_q + λδ ‖∇_Q D_f(P̂ ‖ Q̂)‖_q,   (10)</p><p>where ‖·‖_q is the dual of the ℓp norm with 1/p + 1/q = 1. Proposition 3.1. Assume that the gradient of the loss function is L-Lipschitz and that the second-order derivative of the loss exists. Then, a given ϵ-approximate stationary solution of Problem (10) is an O(ϵ)-approximate stationary solution of Problem (8) whenever Lδ ≲ ϵ. This proposition, which is an immediate application of <ref type="bibr">Ostrovskii et al. (2021a, Theorem 3.1)</ref>, states that if the desired training accuracy ϵ is comparable to the amount of distribution shift (i.e., the small-distribution-shift regime), then one can solve problem (10) instead of (8). Thus, in this regime, we need to solve (10) or, equivalently, (9). To this end, we need to obtain the (sub)gradients of the objective function in (9) w.r.t. the θ, U, and V variables. First, notice that</p><p>∇_θ D_f(P̂(θ) ‖ Q̂(θ)) = Σ_j [α*_j ∇_θ p̂_j(θ) − f*(α*_j) ∇_θ q̂_j(θ)],   (11)</p><p>where α*(P̂, Q̂) ∈ arg max_α Σ_j [α_j p̂_j(θ) − q̂_j(θ) f*(α_j)]. Here we invoked Danskin's theorem on the variational form of D_f; p̂_j(θ) and q̂_j(θ) are the j-th elements of P̂ and Q̂, respectively. Next, we need to compute ∇_θ h(θ, U, V). The derivative of the first term in h(·) w.r.t. θ is easy to compute. We next calculate the derivative of the second term of h(θ, U, V) w.r.t. θ; as the derivative of the third term can be computed similarly, we omit its derivation here. In that calculation, we used the implicit function theorem to compute the derivative of α* w.r.t. θ. 
Notice that an implicit assumption here is that f is differentiable (which holds for the KL-divergence, χ² divergence, reverse KL, Jensen-Shannon divergence, and squared Hellinger distance). Having access to the gradients, we can apply the standard [sub]gradient descent-ascent algorithm to obtain a solution to Problem (10) (see Appendix H for the details).</p><p>A semi-stochastic memory-efficient first-order training algorithm. To apply the (stochastic) gradient descent-ascent algorithm <ref type="bibr">(Lin et al., 2020)</ref> to problem (9), we need unbiased estimators of the gradients of h(θ, U, V) w.r.t. the θ, U, and V variables. While it seems challenging to obtain unbiased estimators w.r.t. all variables, one can notice that p̂_j(θ) and q̂_j(θ) can be computed exactly with one forward pass over all data points (i.e., with an O(m × n) memory requirement). Consequently, the gradients of h(θ, U, V) w.r.t. U and V can be computed with one forward pass over all data points (without any backpropagation). On the other hand, one can easily obtain unbiased estimators of ∇_θ p̂_j(θ) and ∇_θ q̂_j(θ) in (11) using a small mini-batch of data. Such a task requires O(b × d) memory, with d being the number of parameters (i.e., θ ∈ ℝ^d) and b being the batch size. Combining these unbiased estimators with the computed values of p̂_j(θ) and q̂_j(θ) leads to an unbiased estimator of the gradient of the objective of (9) w.r.t. the θ variable. To summarize, we perform one forward propagation to obtain the gradients w.r.t. U and V, and we only perform backpropagation for computing gradients w.r.t. θ over the mini-batch of data. Such an algorithm requires O(mn + bd) memory and thus can be used for training large models (with d, n ≫ b, m). It is known that memory requirements are a major limiting factor in training large models such as LLMs <ref type="bibr">(Malladi et al., 2023)</ref>.</p></div>
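The semi-stochastic scheme above rests on the same fact as the estimators in (4): averaging softmax outputs over an i.i.d. mini-batch gives unbiased estimates of the prediction marginals and of their joint with the sensitive attribute. A small simulation (synthetic softmax outputs and our own variable names, purely for illustration) shows the batch estimates concentrating around the full-batch quantities:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, b, trials = 1000, 3, 8, 4000
probs = rng.dirichlet(np.ones(m), size=n)   # stand-in for F_j(x_i; theta)
s = rng.integers(0, 2, size=n)              # binary sensitive attribute

# Full-batch ("population") quantities over the training set.
p_full = probs.mean(axis=0)                                    # P(yhat = j)
joint_full = np.stack([probs[s == k].sum(axis=0) / n
                       for k in (0, 1)], axis=1)               # P(yhat=j, s=k)

# Average the mini-batch estimators over many i.i.d. batches.
p_est = np.zeros(m)
joint_est = np.zeros((m, 2))
for _ in range(trials):
    idx = rng.integers(0, n, size=b)        # i.i.d. sampling with replacement
    batch, s_batch = probs[idx], s[idx]
    p_est += batch.mean(axis=0) / trials
    for k in (0, 1):
        joint_est[:, k] += (batch * (s_batch == k)[:, None]).sum(axis=0) / b / trials

assert np.allclose(p_est, p_full, atol=0.02)
assert np.allclose(joint_est, joint_full, atol=0.02)
```

Because these estimators are unbiased for any batch size, the backpropagated mini-batch gradient of the θ-part combines with the forward-pass values of p̂_j and q̂_j exactly as described in the text.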
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">ROBUST f-FERM UNDER ℓ∞ NORMS AND POTENTIALLY LARGE DISTRIBUTION SHIFTS</head><p>The framework developed in the previous section assumes the distribution shift is small (the uncertainty-set diameter is smaller than a certain threshold). When preserving fairness in the presence of large distribution shifts is a priority, the previous methodology might not work well. As discussed before, formulation (8) leads to a nonconvex-nonconcave min-max optimization problem, and this class of problems is computationally hard to solve in general (even to stationarity notions). Thus, we need to exploit the structure of the problem. In this section, we show that we can exploit this structure to develop a first-order algorithm under large distribution shifts. In particular, we focus on the case where the uncertainty set is an ℓ∞ ball and the divergence satisfies certain assumptions (namely f*(α*) > 0 and α* > 0, which are satisfied for the KL divergence). Since the function D_f is convex in P and Q, under an ℓ∞ uncertainty set on P and Q, the optimal solution of the maximization problem in (8) is attained at an extreme point. Moreover, under the assumption that f*(α*) > 0 and α* > 0, one can easily see that the optimal solution is p_j = min{p_j + δ, 1} and q_j = max{q_j − δ, 0}, where δ is the radius of the ℓ∞ ball (see Appendix I for the exact proof). Notice that we need to relax the probability simplex constraint to obtain this efficient closed-form solution. Thus, under this assumption, Problem (8) can be reformulated as</p><p>which is a regular minimization problem, and (sub)gradient descent can be utilized to solve it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTS</head><p>We use three popular notions of group fairness: demographic parity, equalized odds, and equality of opportunity violations (see Appendix A for definitions) to measure the fairness of trained models.</p><p>To run Algorithm 1, we set the step sizes η_θ and η_α to 10⁻⁵ and 10⁻⁶, respectively, in all experiments. Further, by changing λ, we obtain different points on the trade-off curve between accuracy and fairness. The range of λ depends on the f-divergence (see Appendix J for more information on tuning hyper-parameters).</p><p>In the inference phase of our experiments, we use standard maximum-likelihood decoding based on the output of the softmax layer, i.e., the predicted label is the one with the highest logit value.</p><p>As we will see in this section, several f-divergence measures lead to reasonable fairness/accuracy tradeoffs and can outperform existing benchmarks. However, no single f-divergence measure uniformly outperforms the others in all experiments. Thus, we believe that in applications the choice of f-divergence can be viewed as a hyperparameter to be tuned by cross-validation. We do not display the performance for larger batch sizes, or for the case where only one sensitive attribute is available, since the differences between the f-divergences are insignificant in those settings.</p><p>In the first set of experiments, we compare different f-divergence formulations of (f-FERM) to each other and to several state-of-the-art approaches supporting multiple sensitive attributes. Figure <ref type="figure">1</ref> demonstrates the resulting tradeoff on the Adult dataset <ref type="bibr">(Becker &amp; Kohavi, 1996)</ref> with gender and race as the sensitive attributes (black-female, black-male, white-female, white-male). 
To measure fairness, we use the demographic parity violation, defined as</p><p>DPV = max_{i,j∈S} |P(ŷ = 1 | s = i) − P(ŷ = 1 | s = j)|.</p><p>In the case of a binary sensitive attribute (e.g., gender), there is no significant variation between different f-divergences. However, when we have 2 sensitive attributes and the batch size is small (8 in Figure <ref type="figure">1</ref>), the results differ significantly across f-divergences. Interestingly, for smaller values of λ, KL divergence improves the fairness violation and accuracy simultaneously. We do not observe such a phenomenon for other f-divergences or for state-of-the-art approaches in the literature. Further, in Figure <ref type="figure">2</ref>  </p></div>
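The DPV metric above is straightforward to compute from model predictions; a small sketch (our own helper, not the paper's code) for an arbitrary set S of sensitive groups:

```python
def dpv(y_pred, s):
    # Demographic parity violation:
    # max_{i,j in S} |P(yhat = 1 | s = i) - P(yhat = 1 | s = j)|,
    # estimated from binary predictions y_pred and group labels s.
    rates = []
    for g in set(s):
        idx = [k for k, sk in enumerate(s) if sk == g]
        rates.append(sum(y_pred[k] for k in idx) / len(idx))
    return max(rates) - min(rates)

# Group "a" receives positive predictions at rate 0.75, group "b" at rate 0.25.
y_pred = [1, 1, 0, 1, 0, 0, 1, 0]
s      = ["a", "a", "a", "a", "b", "b", "b", "b"]
assert abs(dpv(y_pred, s) - 0.5) < 1e-12
```

Note that for |S| > 2 groups (e.g., the four gender-race combinations above), the max-min form automatically picks out the most disparate pair.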
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">FAIRNESS-ACCURACY TRADEOFFS IN THE PRESENCE OF DISTRIBUTION SHIFT</head><p>We perform two experiments to evaluate the algorithms developed in Section 3. In the first experiment, we randomly flip the gender label of n% of the data points (n ranging from 1 to 20) in the Adult dataset. Then, we train models on the new datasets with a proportion of corrupted sensitive attributes and evaluate the performance on the test data. Figure <ref type="figure">3</ref> is obtained by training different models to achieve 80% accuracy on the test data and comparing their demographic parity violations. As the percentage of corrupted sensitive attributes increases, both f-DRO and f-infinity achieve a smaller DP violation than state-of-the-art approaches in the literature. In this specific experiment, f-DRO works better than f-infinity, and there is no significant difference between choosing KL divergence or χ² divergence as the function f. Among the papers designed for handling distribution shifts, <ref type="bibr">Rezaei et al. (2021)</ref> and <ref type="bibr">Wang et al. (2020)</ref> were the only ones with an available implementation.</p><p>Figure <ref type="figure">3</ref>: Performance of different state-of-the-art approaches and our two methods for handling distribution shift. The dataset is Adult, and the sensitive attribute is gender. We randomly flip a proportion of the gender entries (from 0 to 20%). As we observe, our approach demonstrates more robustness against the increase in DP violation compared to other approaches.</p><p>In a more recently collected dataset (new Adult) <ref type="bibr">(Ding et al., 2021)</ref>, the users are separated based on their state of residence. We train fair models on a single state and evaluate the fairness-accuracy tradeoff on the other states. Figure <ref type="figure">4</ref> depicts the performance of the different methods. 
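The corruption protocol of the first experiment can be sketched as follows (a hypothetical helper illustrating the n% flip, assuming a binary 0/1 encoding of the gender attribute):

```python
import random

def corrupt_sensitive(s, fraction, seed=0):
    # Randomly flip the binary sensitive attribute for `fraction` of the rows,
    # simulating the noisy-attribute shift used in the experiment.
    rng = random.Random(seed)
    s = list(s)
    k = int(round(fraction * len(s)))
    for i in rng.sample(range(len(s)), k):   # k distinct indices
        s[i] = 1 - s[i]
    return s

s = [0, 1] * 50                              # 100 binary gender labels
s_noisy = corrupt_sensitive(s, 0.20)
# Exactly 20% of the entries differ from the clean attribute vector.
assert sum(a != b for a, b in zip(s, s_noisy)) == 20
```

The model is then trained on (features, labels, s_noisy) while the DP violation is evaluated against the clean test-time attributes.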
For each method, the center point is the average accuracy and fairness over the 50 states. The horizontal and vertical lines show the 25th-to-75th-percentile range of performance across states. The training fairness violation is set to 0.02 for all methods. We observe that f-infinity preserves the fairness level better than the other approaches, while f-DRO attains better accuracy. Depending on the application, we suggest using f-infinity if preserving a high level of fairness is a priority and f-DRO when a better tradeoff between fairness and accuracy is desired. Note that both approaches offer better fairness-accuracy tradeoffs than the state-of-the-art approaches in the literature. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">CONCLUSION</head><p>This paper presented a unified stochastic framework for fair empirical risk minimization via f-divergences (f-FERM). The key idea is to reformulate the objective function as a min-max optimization problem using the Legendre-Fenchel duality of f-divergences. This enables us to develop an unbiased gradient estimator and a convergent stochastic first-order algorithm. Furthermore, we robustified f-FERM against distributional changes using ℓ_p norm balls as the uncertainty set. While our empirical investigation delves into the performance and fairness distinctions among various f-divergences, a more comprehensive analysis is warranted to determine the optimal f-divergence with respect to the performance-fairness tradeoff, convergence speed, and asymptotic behavior. Furthermore, the distributionally robust formulation of fair empirical risk minimization, and the advantages of each formulation, can be explored beyond f-divergences as measures of fairness violation and beyond ℓ_p norm balls as uncertainty sets.</p></div></body>
		</text>
</TEI>
