<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Hardware-Sensitive Fairness in Heterogeneous Federated Learning</title></titleStmt>
			<publicationStmt>
				<publisher>ACM Digital Library</publisher>
				<date when="2025-03-31">03/31/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10627902</idno>
					<idno type="doi">10.1145/3703627</idno>
					<title level='j'>ACM Transactions on Modeling and Performance Evaluation of Computing Systems</title>
<idno type="issn">2376-3639</idno>
<biblScope unit="volume">10</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Zahidur Talukder</author><author>Bingqian Lu</author><author>Shaolei Ren</author><author>Mohammad Atiqul Islam</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Federated learning (FL) is a promising technique for decentralized privacy-preserving Machine Learning (ML) with a diverse pool of participating devices with varying device capabilities. However, existing approaches to handle such heterogeneous environments do not consider “fairness” in model aggregation, resulting in significant performance variation among devices. Meanwhile, prior works on FL fairness remain hardware-oblivious and cannot be applied directly without severe performance penalties. To address this issue, we propose a novel hardware-sensitive FL method called FairHetero that promotes fairness among heterogeneous federated clients. Our approach offers tunable fairness within a group of devices with the same ML architecture as well as across different groups with heterogeneous models.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>In Figure <ref type="figure">1</ref>, the group with the smallest model (G-5), where G-1 to G-5 represent decreasing model architecture sizes and hardware capabilities, suffers from inferior performance due to its smaller model. With fairness constraints, all clients are forced to adopt the smallest architecture, leading to severe performance degradation. Hence, existing approaches for performance fairness result in hardware unfairness, as clients with better hardware do not perform better.</p><p>In FL, participating devices perform ML model training locally and send the model updates to a central server for aggregation <ref type="bibr">[15,</ref><ref type="bibr">16,</ref><ref type="bibr">28]</ref>. These early works on FL force all devices to adopt identical ML models (i.e., a homogeneous model architecture) for local training, even when participating devices have different hardware capabilities. Meanwhile, in the pursuit of performance improvement, increasingly complex and specialized ML models are being developed and deployed, pushing devices to their computation limits <ref type="bibr">[10,</ref><ref type="bibr">31,</ref><ref type="bibr">33]</ref>. With this progression in ML model complexity, it has become impractical to restrict FL to a homogeneous model architecture, which is limited by the weakest participating device. Consequently, new FL approaches have been introduced that allow devices to undertake ML model complexities in line with their hardware capabilities <ref type="bibr">[8,</ref><ref type="bibr">9,</ref><ref type="bibr">20,</ref><ref type="bibr">39]</ref>. 
However, heterogeneous FL exacerbates the performance disparity among devices, as the distribution of device-level data and the model updates may vary significantly across devices, leading to non-uniform performance on the final trained model <ref type="bibr">[13,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">[34]</ref><ref type="bibr">[35]</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref>. These performance variations are undesirable, as they "unfairly" advantage or disadvantage some devices. Our goal in this article is to systematically rectify such performance bias and improve FL "fairness."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Limitations of Prior Works</head><p>Fairness in ML has garnered significant attention in recent years, with several works in federated settings trying to address the performance disparity among FL clients <ref type="bibr">[11,</ref><ref type="bibr">23,</ref><ref type="bibr">27,</ref><ref type="bibr">40]</ref>. However, prior works suffer from two fundamental limitations. First, these approaches force every device to have the same model architecture, leading to significant degradation of overall performance due to the architecture bottleneck of the weakest hardware/device. We illustrate this in Figure <ref type="figure">1</ref>, where imposing fairness (forcing the same model architecture for all) on heterogeneous devices causes severe performance loss. Note that while fairness constraints typically cause some performance loss (mainly for clients that perform better without fairness), the performance degradation in Figure <ref type="figure">1</ref> is mostly due to enforcing architecture homogeneity. Second, existing FL fairness approaches typically focus on reducing the performance gap among clients by boosting the performance of those who are lagging behind. This approach assumes that all clients should achieve similar performance levels. However, in practical FL scenarios, it is expected that clients with better hardware, capable of running more complex ML architectures, will naturally outperform those with less capable devices <ref type="bibr">[9]</ref>. While it is important to support weaker clients,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Preliminaries</head><p>In this section, we provide an overview of the fundamental concepts related to our work. We start by defining FL and its optimization objective. We then discuss FL in the context of heterogeneous model architecture, where devices may have different computational capabilities and model sizes. Finally, we introduce the concept of fairness in FL and define our notion of performance fairness, which aims to ensure balanced performance across devices with varying hardware and data characteristics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Federated Learning</head><p>The goal of FL can be formulated as the following optimization problem:</p><p>min θ F (θ ) := ∑ i=1 N p i f i (θ ), (1)</p><p>where N is the total number of devices while f i (θ ) and p i &gt; 0 are the local objective and weight parameter of device i, respectively. The typical choice of local objective f i (θ ) is the empirical risk over the local dataset D i , i.e., f i (θ ) = (1/|D i |) ∑ (x,y)∈D i l(θ ; x, y). We can set p i = |D i | / ∑ j=1 N |D j | to achieve the minimum empirical risk over the entire dataset across all devices. The solution of (1) in prior works (particularly federated averaging, or FedAVG) involves communication-efficient updates, where a subset of all devices apply stochastic gradient descent on their local datasets for multiple epochs before sending the result to the aggregation server <ref type="bibr">[28]</ref>.</p></div>
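The round structure above can be sketched in a few lines. The following toy sketch (hypothetical names `local_sgd` and `fedavg_round`; a noiseless least-squares loss stands in for the local objective f i) runs FedAVG rounds with p i proportional to the local dataset size:

```python
import numpy as np

def local_sgd(theta, X, y, lr=0.1, epochs=5):
    """A few epochs of plain gradient descent on a local least-squares objective."""
    theta = theta.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ theta - y) / len(y)
        theta -= lr * grad
    return theta

def fedavg_round(theta, clients):
    """One FedAVG round: clients train locally; the server averages the
    resulting models with weights p_i = |D_i| / sum_j |D_j|."""
    sizes = np.array([len(y) for _, y in clients], dtype=float)
    weights = sizes / sizes.sum()          # p_i
    updates = [local_sgd(theta, X, y) for X, y in clients]
    return sum(w * u for w, u in zip(weights, updates))

rng = np.random.default_rng(0)
true_theta = np.array([1.0, -2.0])
clients = []
for n in (30, 50, 80):                     # heterogeneous local dataset sizes
    X = rng.normal(size=(n, 2))
    clients.append((X, X @ true_theta))

theta = np.zeros(2)
for _ in range(20):                        # R federated rounds
    theta = fedavg_round(theta, clients)
```

With a shared ground-truth model and noiseless data, the averaged iterate approaches `true_theta`.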
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">FL with Heterogeneous Architecture</head><p>Instead of a homogeneously shared model, Reference <ref type="bibr">[9]</ref> proposes to utilize heterogeneous models, where each device trains a model appropriate to its own capability. The key idea is that weaker devices get a smaller model that can be nested within the centralized larger model.</p><p>Let us consider that the N devices in FL are divided into M groups, each group with N m members sharing the same model architecture. Group m's model architecture θ m is extracted from the centralized model as θ A m , where A m is a matrix with the same dimension as θ consisting of only 0's and 1's, serving as a mask applied to the global model to obtain group m's local training parameter θ m . Thus, θ m is a matrix of the same dimension as θ , with 0's at positions outside of its desired model architecture A m . Then, for client i in group m, its loss function can be written as f m,i (θ A m ).</p><p>We can update (1) as follows to incorporate the architecture heterogeneity:</p><p>min θ F (θ ) = ∑ m=1 M ∑ i=1 N m p m,i f m,i (θ A m ). (2)</p><p>We adopt the objective mentioned above for our design, utilizing the HeteroFL architecture <ref type="bibr">[9]</ref> to categorize FL clients based on hardware capabilities. During aggregation, we use regions, denoted by M (j) , each contributed to by a different subset of model groups. Our objective can also be applied to other PT-based algorithms, such as FedRolex <ref type="bibr">[1]</ref>, with slight modifications. It is important to ensure that clients within the same group train the same subset of model parameters in each training round.</p></div>
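The masking idea can be illustrated with a toy weight matrix. In the sketch below, the `nested_mask` helper and its top-left slicing rule are illustrative assumptions (not the paper's exact extraction scheme); it builds nested binary masks A m and extracts θ m = θ A m elementwise:

```python
import numpy as np

def nested_mask(shape, frac):
    """Binary mask A_m selecting the top-left frac-fraction of a weight matrix,
    so each smaller model is nested inside every larger one (HeteroFL-style)."""
    A = np.zeros(shape)
    r, c = max(1, int(shape[0] * frac)), max(1, int(shape[1] * frac))
    A[:r, :c] = 1.0
    return A

theta = np.arange(16.0).reshape(4, 4)     # toy global weight matrix
A_small = nested_mask(theta.shape, 0.5)   # weak-hardware group
A_large = nested_mask(theta.shape, 1.0)   # strong-hardware group
theta_small = theta * A_small             # theta_m = theta masked by A_m
theta_large = theta * A_large
```

Because the masks are nested, every parameter trained by the weak group is also trained by the strong group.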
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Fairness</head><p>An FL system that solves (2) can introduce performance variation among devices due to their heterogeneity in model architecture and data. For instance, the central model will be biased toward devices with larger models and more data. Data creates performance variation among the devices in the same architecture group, which we call intra-group performance variation. Meanwhile, the architecture causes performance variation among different groups, which we call inter-group performance variation. In this work, we seek to improve overall "performance fairness" in FL with heterogeneous architecture and define fairness as follows.</p><p>Definition 3.1 (Intra-group fairness-Data Fairness). For the N m devices in group m trained with global models θ and θ ′, we say θ is more fair if Var{f m,1 (θ m ), . . . , f m,N m (θ m )} &lt; Var{f m,1 (θ ′ m ), . . . , f m,N m (θ ′ m )}, where θ m = θ A m .</p><p>Definition 3.2 (Inter-group fairness-Hardware Fairness). For total M hardware groups trained with global models θ and θ ′, we say θ is more fair if Var{F 1 (θ 1 ), . . . , F M (θ M )} &lt; Var{F 1 (θ ′ 1 ), . . . , F M (θ ′ M )}, where θ m corresponds to the model architecture for group m.</p><p>Our definition of performance fairness is based on the uniformity of clients' test loss on local data. While we formalize fairness using variance, other uniformity metrics, such as the accuracy distribution of clients and cosine similarity <ref type="bibr">[23]</ref>, can also be used without any loss of generality of our definition. Also, note that while we define inter-group fairness (hardware fairness) as having less performance variation among different hardware groups, our goal is to allow separate tuning capabilities to control the two types of fairness above. In the process, we remain aware of hardware capability differences (allowing better hardware to perform better) instead of imposing a blanket performance fairness goal.</p></div>
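A minimal sketch of the two variance-based fairness notions above (the function names and the loss values are made up for illustration):

```python
import statistics

def intra_group_variance(losses_by_group):
    """Data fairness: variance of client test losses within each group."""
    return {g: statistics.pvariance(v) for g, v in losses_by_group.items()}

def inter_group_variance(losses_by_group):
    """Hardware fairness: variance of the group-average losses."""
    means = [statistics.fmean(v) for v in losses_by_group.values()]
    return statistics.pvariance(means)

# two hypothetical models evaluated on the same clients, grouped by hardware
model_a = {"G1": [0.20, 0.22], "G2": [0.50, 0.90]}
model_b = {"G1": [0.21, 0.23], "G2": [0.60, 0.64]}
```

Under both definitions, `model_b` is the more fair model here: its G2 clients perform more alike (intra-group), and its group averages are closer together (inter-group).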
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Hardware Sensitive Fairness: FairHetero</head><p>In this section, we provide a detailed explanation of FairHetero's objective and solution, along with theoretical analyses on its convergence, uniformity, and generalization bounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Objective of FairHetero</head><p>To impose the fairness condition on (2), we reweight the objective function to favor the devices with higher loss by giving them higher weights. Our solution is inspired by α-fairness in wireless networks and q-fairness for FL with homogeneous architecture <ref type="bibr">[18,</ref><ref type="bibr">23]</ref>. More specifically, we define the objective of our fair FL with heterogeneous architecture (FairHetero) as</p><p>Here, q and q m are hyperparameters for tuning the inter-group and intra-group fairness, respectively. д m is the group weight, satisfying ∑ m д m = 1. We achieve both intra-group and inter-group fairness in (3) by employing a layered weighting approach. The hyperparameter q m is used to minimize variance in the losses among devices within a group m, which shares the architecture mask A m . By adjusting q m , we control intra-group fairness, with each group potentially having a unique value of q m . To minimize variance across different groups, we use a global parameter q, which applies uniformly to all groups. In general, larger values of q and q m enforce stricter fairness requirements. Figure <ref type="figure">2</ref> illustrates this setup: region M (1) is shared across all client groups θ 1 , θ 2 , and θ 3 . Region M (2) is shared between groups θ 1 and θ 2 , while region M (3) is exclusive to group θ 1 . The FairHetero algorithm introduces parameters q m and q to ensure fair performance both within and across these federated client groups.</p><p>When there is only one group (i.e., M = 1), we set д 1 to 1 and q to 0 and remove the normalization term 1/(q m +1); since there is no inter-group fairness, the formulation is then equivalent to the formulation of q-fairness <ref type="bibr">[23]</ref>.</p><p>Necessity of layered weighting. 
The model heterogeneity introduces additional performance variance, and naively applying global fairness as in prior work will result in significant performance degradation among devices with larger architecture and lower loss. By separating the performance variance due to architecture (i.e., inter-group variance) from performance variance due to data (i.e., intra-group variance), we allow a graceful implementation of fairness. Note that (3) is a generalized version of FL fairness and, therefore, also applicable to homogeneous architecture (i.e., M = 1).</p><p>Tuning hyperparameters q and q m . The impact of certain values of q and q m on the FL clients' performance distribution depends on the client datasets, loss function, and model architecture, making it impossible to "directly determine" the values of q and q m for a certain level of fairness. Hence, FairHetero requires hyperparameter tuning.</p></div>
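Equation (3) itself is not reproduced in the extracted text, but the layered weighting it describes can be sketched numerically. The composition below, a q-FFL-style power term at each level with group weights д m, is an illustrative assumption; the paper's exact exponents and normalization may differ:

```python
def group_loss(client_losses, client_weights, q_m):
    """Intra-group q-weighted loss: sum_i p_{m,i} f_{m,i}^(q_m+1) / (q_m+1)."""
    return sum(p * f ** (q_m + 1) / (q_m + 1)
               for p, f in zip(client_weights, client_losses))

def fairhetero_objective(groups, group_weights, q, q_ms):
    """Layered weighting (illustrative form): an outer q-fair term over the
    inner q_m-fair group losses, with group weights d_m summing to 1."""
    total = 0.0
    for (losses, weights), d_m, q_m in zip(groups, group_weights, q_ms):
        F_m = group_loss(losses, weights, q_m)
        total += d_m * F_m ** (q + 1) / (q + 1)
    return total

# two toy groups of two clients each (losses, uniform client weights)
groups = [([0.2, 0.4], [0.5, 0.5]), ([0.8, 1.0], [0.5, 0.5])]
flat = fairhetero_objective(groups, [0.5, 0.5], q=0.0, q_ms=[0.0, 0.0])  # plain (2)
fair = fairhetero_objective(groups, [0.5, 0.5], q=2.0, q_ms=[0.0, 0.0])  # q > 0
```

With q = q m = 0 the objective reduces to the unweighted average; raising q amplifies the relative contribution of the high-loss group, which is the reweighting effect described above.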
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Solution of FairHetero</head><p>We adopt a communication-efficient FL approach where, in each iteration, a device i in group m trains its masked model θ A m . The device then sends its loss f m,i and gradient ∇ θ f m,i A m back to the central server. In scenarios where the server cannot determine the clients' capabilities or group affiliations, it can send the global model directly to the client, allowing the client to select a suitable subset of the model architecture. Our design imposes no restrictions on the number of groups, so a group may consist of a single client. Additionally, our approach does not require a coordinator for each group, unlike hierarchical designs.</p><p>The calculation of the group gradient and the norm of the group Hessian is essential for the FairHetero algorithm due to the heterogeneous architectures in federated learning settings. These calculations determine each group's contribution to the global model update, ensuring that updates from different groups are appropriately weighted based on their impact on the global model. Specifically, the norm of the Hessian is crucial for estimating the local Lipschitz constant of the gradient in FairHetero; more details are provided in Lemmas 1 and 2. This estimation allows for dynamic adjustment of the step size in gradient-based optimization methods, ensuring stable convergence without the need for manual tuning for each q and q m value. 
By providing an upper bound on how much the gradient can change, the Hessian norm helps maintain efficiency and a balance between accuracy and fairness across different q and q m settings.</p><p>Calculation of group gradient (Δ m ). The group gradient with respect to the global model parameter θ is as follows:</p></div>
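A sketch of one plausible form of Δ m, obtained by differentiating the per-client term p m,i f m,i^(q m +1) that appears in Lemma 1 (the chain rule gives p m,i (q m +1) f^q m ∇f; masking the result by A m to restrict the update to the group's architecture is an assumption):

```python
import numpy as np

def group_gradient(client_losses, client_grads, client_weights, q_m, A_m):
    """Gradient of sum_i p_{m,i} f_{m,i}^(q_m+1) w.r.t. theta:
    sum_i p_{m,i} (q_m+1) f_{m,i}^q_m * grad f_{m,i}, masked by A_m."""
    delta = np.zeros_like(A_m)
    for p, f, g in zip(client_weights, client_losses, client_grads):
        delta += p * (q_m + 1) * f ** q_m * g
    return delta * A_m  # zero out parameters outside the group's architecture

A_m = np.array([1.0, 1.0, 0.0])  # toy mask: last parameter outside this group
grads = [np.array([0.1, 0.2, 0.3]), np.array([0.3, 0.0, 0.1])]
delta = group_gradient([0.5, 1.0], grads, [0.5, 0.5], q_m=1.0, A_m=A_m)
```

Note how the q m-dependent factor f^q m upweights the gradient of the higher-loss client.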
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Calculation of norm of group Hessian (H m ).</head><p>The Hessian with respect to the global model parameter θ is as follows:</p><p>For the gradient in the first term, we have the following:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>For the gradient in the second term, we have the following:</p><p>Plugging the two equations above into Equation (4), we have the following:</p><p>For the ( ∑ i=1 N m p m,i (q m + 1)f ) A m terms in the Hessian above, we have the following:</p><p>and</p><p>Suppose the non-negative function f (•) has a Lipschitz gradient with constant L,</p><p>Plugging the three inequalities above into Equation (<ref type="formula">7</ref>), we have the following:</p><p>Therefore, the norm of the group Hessian (H m ) can be written as follows:</p><p>Adjusting the learning rate. A critical concern in tuning the hyperparameters q and q m is the adjustment of the learning rate γ for each pair of values. This adjustment becomes particularly challenging for gradient-based methods, where the step size is inversely proportional to the Lipschitz constant of the function's gradient. Changing the values of q and q m can lead to significant fluctuations in the step size, potentially causing training instability or divergence.</p><p>To address this issue, we propose estimating the local Lipschitz constant for FairHetero's objective by tuning the step size on just one value of q and q m (e.g., q = 0 and q m = 0). This approach allows us to dynamically adjust the step size for our gradient-based optimization method without the need for manual tuning of q and q m . The norm of the Hessian is used to estimate the local and global Lipschitz constants of the gradient, which in turn helps in determining an appropriate step size for gradient-based optimization methods when solving the FairHetero objective. Lemma 1 (Upper Bound for Group-Level Hessians). If the non-negative function f (•) has a Lipschitz gradient with constant L, then for any q m ≥ 0 and any point θ ,</p><p>
is an upper bound for the local Lipschitz constant of the gradient of p m,i f q m +1 i (θ A m ) at point θ . Refer to the Appendix in the supplementary materials for the complete proof of Lemma 1.</p><p>Lemma 2 (Upper Bound for Global-Level Hessians). If the non-negative function F (•) has a Lipschitz gradient with constant L, then for any q ≥ 0 and any point θ ,</p><p>is an upper bound for the group Lipschitz constant of the gradient of</p><p>at point θ . Refer to the Appendix in the supplementary materials for the complete proof of Lemma 2.</p><p>Remark 1 (Fairness-performance tradeoff). Increasing the value of q enhances the performance of smaller-architecture clients. However, if q approaches +∞, then the problem transforms into a max-min problem <ref type="bibr">[32]</ref>, favoring smaller architectures most at the expense of larger ones. Striking a balance is crucial: we aim for a slight decline in larger-architecture performance to significantly boost smaller-architecture clients, ensuring an improved average performance across all participants. Our experiments in Section 5 demonstrate that a lower q achieves greater uniformity among heterogeneous-architecture clients.</p><p>Model aggregation. Due to the heterogeneous architecture, we aggregate the models by dividing the global model θ into non-overlapping regions, within each of which an equal number of devices contributes model updates. For instance, in Figure <ref type="figure">2</ref>, we have three regions in the global model θ : the green region gets model updates from all devices, the blue region gets model updates from devices with architectures θ 1 and θ 2 , and the light-red region gets updates only from devices with the θ 1 architecture.</p><p>Let us consider that there are J regions in the global model. 
We denote the set of all groups that contain region j's parameters (non-zero values in A m ) as M (j) . For a group m ∈ M (j) , its contribution to the global model update is computed from Δ (j) m and H (j) m , where Δ (j) m and H (j) m denote the parts of Δ m and H m that belong to region j. Finally, the global server updates the model parameter as</p><p>Our solution to (3), FairHetero, is summarized in Algorithm 1.</p><p>for each group m = 1, . . . , M in parallel do<lb/>Get the desired local model architecture A m<lb/>Local trainable model parameter: θ m = θ A m<lb/>Client local update: for each local epoch t = 1, . . . , T do . . . end for<lb/>Local parameter update after T epochs<lb/>Each client sends loss f m,i and gradient ∇ θ f m,i A m to the central server<lb/>end for<lb/>Server global aggregation: for each region j = 1, . . . , J do . . . end for</p></div>
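The region-wise aggregation can be sketched as below. For simplicity, this toy version averages each region's updates with equal weight per contributing group, rather than the Δ/H-weighted rule FairHetero actually uses:

```python
import numpy as np

def region_aggregate(theta, masks, updates):
    """Update each parameter by averaging the updates of only those groups
    whose architecture mask A_m covers it (region-wise aggregation)."""
    theta = theta.copy()
    total = np.zeros_like(theta)
    count = np.zeros_like(theta)
    for A_m, delta_m in zip(masks, updates):
        total += delta_m * A_m      # only this group's covered parameters
        count += A_m                # how many groups cover each parameter
    covered = count > 0
    theta[covered] += total[covered] / count[covered]
    return theta

theta = np.zeros(3)
masks = [np.array([1.0, 1.0, 1.0]),   # group 1: full model
         np.array([1.0, 1.0, 0.0]),   # group 2: medium model
         np.array([1.0, 0.0, 0.0])]   # group 3: smallest model
updates = [np.array([0.3, 0.3, 0.3]),
           np.array([0.6, 0.6, 0.0]),
           np.array([0.9, 0.0, 0.0])]
new_theta = region_aggregate(theta, masks, updates)
```

The first parameter lies in the region covered by all three groups, the second by two, and the third by one, mirroring the three regions of Figure 2.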
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Theoretical Analysis of FairHetero</head><p>In this section, we provide the convergence analysis, generalization bound, and uniformity analysis for FairHetero. For the detailed theoretical analysis and proofs of the theorems and lemmas, refer to the Appendix in the supplementary materials.</p><p>Assumption 1 (Smoothness). The loss functions f 1 , . . . , f N are all L-smooth: for all θ , ϕ ∈ R d and any client i from group m, there exists L &gt; 0 such that ‖∇f i (θ ) − ∇f i (ϕ)‖ ≤ L ‖θ − ϕ‖.</p><p>Assumption 2 (Architecture Slicing-induced Noise). We assume that for some δ ∈ [0, 1) and any round r , group m with desired architecture A m , the architecture slicing-induced noise is bounded by:</p><p>where θ denotes the global model parameters in round r , and A m is the desired model architecture for clients in group m.</p><p>Assumption 3 (Bounded Gradient). The expected squared norm of the stochastic gradients is bounded uniformly, i.e., for a constant G &gt; 0 and any round r , client i from group m, and local training epoch t: E‖∇ θ m f m,i (θ m ; ξ m,i,t )‖ 2 ≤ G 2 , where ξ m,i,t is the local training dataset for client i used in local training epoch t and θ m is the trainable model parameter for group m: θ m = θ A m .</p><p>Assumption 4 (Gradient Noise for IID Data). Under an IID data distribution, for any round r , client i from group m, and local training epoch t, we assume that E‖∇ θ m f m,i (θ m ; ξ m,i,t ) − ∇ θ m f m,i (θ m )‖ 2 ≤ σ 2 for a constant σ &gt; 0 and independent samples ξ m,i,t .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Convergence Analysis.</head><p>For the convergence analysis, we show that the sum of the squared norms of the gradient of the global loss, ∇ θ F (θ ), converges over R federated learning rounds. Specifically, in our formulation,</p><p>Theorem 4.1. Under the assumptions stated above, with learning rate γ , R total federated learning rounds, and T local training epochs, the upper bound for</p><p>where Γ * min is a composite coefficient with denominator |M (j) | min . For the detailed proof, refer to the Appendix in the supplementary materials.</p><p>This effectively demonstrates the convergence of our objective function after R federated rounds. As mentioned earlier, FairHetero extends prior work in federated learning by offering a flexible tradeoff between performance and fairness through the parameterization of q and q m .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Generalization Bound for FairHetero.</head><p>In this section, we first describe the setup of FairHetero in more detail and then provide the generalization bound. One benefit of FairHetero is that it allows a flexible tradeoff between fairness and performance that generalizes q-fairness (a special case: when M = 1, we set д 1 to 1 and q to 0 and remove the normalization term 1/(q m +1)). The generalization bound for FairHetero provides insights into its performance in federated learning scenarios with heterogeneous groups. The bound ensures that the empirical loss of the model remains within a certain range of its true loss with high probability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Group-level generalization:</head><p>The total loss of group m, with the loss of each device unknown, is</p><p>where λ m is in a probability simplex Λ m , N m is the total number of clients in group m, D m,i is the local data distribution for client i from group m, θ m is the local model parameter for clients in group m, and l is the loss. We use L λ m as the empirical loss,</p><p>where n i is the number of local data samples of client i from group m and (x i,j , y i,j ) ∼ D m,i . We consider an unweighted version of min θ m ∑ i f q m +1 m,i , which is equivalent to minimizing the empirical loss</p><p>where 1/p m + 1/(q m +1) = 1 (p m ≥ 1, q m ≥ 0).</p><p>Lemma 3 (Generalization Bound for a Specific λ m ). Assume that the loss l is bounded by Q &gt; 0 and the numbers of local samples are (n 1 , . . . , n N m ). Then, for any δ &gt; 0, the following inequality holds with probability at least 1−δ for any λ m ∈ Λ m , θ ∈ Θ:</p><p>where</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Refer to the Appendix in the supplementary materials for the complete proof of Lemma 3.</head><p>Generalization bound for any λ m . Assume that the loss l is bounded by Q &gt; 0 and the numbers of local samples are (n 1 , . . . , n N m ). Then, for any δ &gt; 0, the following inequality holds with probability at least 1−δ for any λ m ∈ Λ m , θ ∈ Θ:</p><p>where A q m (λ m ) = ‖λ m ‖ p m , and 1/p m + 1/(q m +1) = 1.</p><p>Key takeaways. For a given λ m ∈ Λ m , the bound ensures that the model's empirical loss is constrained by a combination of the weighted empirical loss and a term related to the distribution of local samples. This result highlights FairHetero's effectiveness in controlling the influence of each group's contribution to the overall loss, promoting fairness and balanced learning across different groups. Furthermore, for any λ m ∈ Λ m , the bound provides a broader guarantee by considering the maximum weighted empirical loss across all possible λ m . This reinforces FairHetero's capacity to adapt to varying group characteristics, maintaining both fairness and performance in heterogeneous environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Global-level generalization:</head><p>The total loss over all M groups (treating each group as a "client"), with the loss of each group unknown, is</p><p>where λ is in a probability simplex Λ, M is the total number of groups, D m is the local data distribution for group m, θ m is the local model parameter for group m, and l is the loss. We define the empirical loss L λ (θ ) as</p><p>where n m is the number of local data samples of group m and (x m,j , y m,j ) ∼ D m .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The objective of federated learning over M groups is as follows:</p><p>where</p><p>Lemma 4 (Generalization Bound for a Specific λ). Assume that the loss l is bounded by Q д p &gt; 0 and the numbers of local samples are (n 1 , . . . , n M ). Then, for any δ &gt; 0, the following inequality holds with probability at least 1−δ for any λ ∈ Λ, θ ∈ Θ:</p><p>where A q (λ) = ‖λ‖ p , and</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Refer to the Appendix in the supplementary materials for the complete proof of Lemma 4.</head><p>Generalization bound for any λ. Assume that the loss l is bounded by Q д p &gt; 0 and the numbers of local samples are (n 1 , . . . , n M ). Then, for any δ &gt; 0, the following inequality holds with probability at least 1−δ for any λ ∈ Λ, θ ∈ Θ:</p><p>where A q (λ) = ‖λ‖ p , and 1/p + 1/(q+1) = 1.</p><p>Key takeaways. The analysis extends to the global level by considering each group as a "client" in FL. The generalization bound on the total loss across all groups highlights FairHetero's effectiveness in achieving fair and accurate models, even in the presence of diverse data distributions and varying group sizes.</p><p>In summary, the generalization bound for FairHetero highlights its ability to generalize well to unseen data while maintaining fairness and performance in federated learning settings with heterogeneous groups.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.3">Uniformity Induced by FairHetero.</head><p>In this section, we analyze the uniformity induced by FairHetero in both the inter-group and intra-group contexts, as established by the convergence analysis. FairHetero encourages more fair solutions in terms of the entropy of the accuracy distribution (larger entropy). We begin by formally defining the notion of fairness in terms of entropy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Definition 4.2 (Intra-group: Entropy of Performance Distribution).</head><p>We say that the performance distribution for any hardware group m with</p><p>where H (f (θ )) is the entropy of the stochastic vector obtained by normalizing {(f m,1 (θ m ), . . . , f m,N m (θ m ))}, defined as</p><p>Definition 4.3 (Inter-group: Entropy of Performance Distribution). For total M hardware groups {(F 1 (θ 1 ), . . . , F M (θ M ))} trained with global models θ and θ ′, we say θ is more fair if</p><p>where H (F (θ )) is the entropy of the stochastic vector obtained by normalizing {(F 1 (θ 1 ), . . . , F M (θ M ))}, defined as follows:</p><p>We next provide results based on Definitions 4.2 and 4.3. They state that for arbitrary q ≥ 0 and q m ≥ 0, by increasing q and q m slightly, we can achieve more uniform performance distributions defined over higher orders of performance for both inter- and intra-group clients.</p><p>Inter-group uniformity. The objective function of FairHetero promotes inter-group uniformity, ensuring that each group's contribution to the overall loss is balanced. Mathematically, the unweighted version of the objective function can be expressed as</p><p>where F m (θ m ) represents the loss of group m.</p><p>Lemma 5. Let F (θ ) be twice differentiable in θ with ∇ 2 F (θ ) &gt; 0 (positive definite). The derivative of H (F (q+1)/(q m +1) (θ * q )) with respect to the variable q evaluated at the point q = p is non-negative, i.e., ∂/∂q</p><p>where</p><p>A complete proof of Lemma 5 is provided in the Appendix in the supplementary materials.</p><p>The corresponding entropy term, H (F (q+1)/(q m +1) (θ * q )), ensures that the distribution of losses across groups remains balanced. 
The proof shows that the derivative of this term with respect to q at q = p is non-negative, indicating that the objective promotes inter-group uniformity.</p><p>Intra-group uniformity. Similarly, FairHetero encourages intra-group uniformity by ensuring that each sample within a group contributes equally to the group's loss.</p><p>Lemma 6. Let F m (θ m ) be twice differentiable in θ m with ∇ 2 F m (θ m ) &gt; 0 (positive definite). The derivative of H (f q m +1 (θ * q m )) with respect to the variable q m evaluated at the point q m = p m is non-negative, i.e., ∂/∂q m</p><p>where</p><p>A complete proof of Lemma 6 is provided in the Appendix in the supplementary materials.</p><p>The corresponding entropy term, H (f q m +1 (θ * q m )), guarantees that the loss distribution within each group remains balanced. The proof establishes that the derivative of this term with respect to q m at q m = p m is non-negative, demonstrating the promotion of intra-group uniformity.</p><p>Algorithmic uniformity. The algorithmic design of FairHetero ensures uniformity in both the inter-group and intra-group contexts. By iteratively updating the model parameters to minimize the objective function while preserving the fairness constraints, FairHetero effectively balances the contributions of different groups and samples, thereby promoting uniformity in the federated learning process.</p><p>Overall, the uniformity induced by FairHetero plays a crucial role in ensuring fair and balanced federated learning across diverse and heterogeneous groups. A complete proof is provided in the Appendix in the supplementary materials.</p></div>
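The entropy-based uniformity notion of Definitions 4.2 and 4.3 is straightforward to compute directly (the `performance_entropy` name and the loss values are illustrative):

```python
import math

def performance_entropy(losses):
    """Entropy H of the stochastic vector obtained by normalizing the losses;
    a more uniform performance distribution yields larger entropy."""
    total = sum(losses)
    probs = [f / total for f in losses]
    return -sum(p * math.log(p) for p in probs if p > 0)

uneven = performance_entropy([0.1, 0.1, 1.8])  # one client far behind
even = performance_entropy([0.6, 0.7, 0.7])    # near-uniform losses
```

Entropy is maximized (log N) when all clients perform identically, which is why larger entropy corresponds to a more fair solution.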
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation Setup</head><p>This section discusses the experimental setup used to evaluate the performance of our algorithm. We first describe the datasets used and then detail the model parameters for each dataset, including the architectural diversity introduced to simulate different hardware capabilities among participating clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Dataset</head><p>We adopt four popular datasets MNIST, CIFAR10, FEMNIST, and SHAKESPEARE which are commonly used in literature <ref type="bibr">[28]</ref>.</p><p>MNIST: This dataset is well known for handwriting recognition and includes 70,000 grayscale images measuring 28&#215;28 pixels. The dataset is split into 60,000 training samples and 10,000 test samples. There are 10 different classes for the images, ranging from 0 to 9. We adopt three cases for MNIST data distribution for the clients i.e., IID, Non-IID, and Non-IID extreme. We distribute the training dataset evenly among 100 clients, with each client receiving 600 samples for IID cases. For the Non-IID cases we have one dominant class for each client having 80% of data and all other classes have the rest 20% data. Finally, for the Non-IID extreme cases, each class would have at most two classes of data. We also separate 10% of the client data for testing the model. The actual test set is also used to test the global performance of the model over time.</p><p>CIFAR10: Another popular dataset, CIFAR10 includes 60,000 colored images measuring 32x32 pixels. The dataset is divided into 50,000 training images and 10,000 test images, grouped into ten separate classes. Like the MNIST dataset, we have three types of data distribution for CIFAR10 i.e., IID, Non-IID, and Non-IID extreme. we adopt We divide the dataset into 100 clients, with each client receiving 500 samples for the IID cases. Similarly, for the Non-IID cases we have one dominant class for each client having 80% of data and all other classes have the rest 20% data. Finally, for the Non-IID extreme cases, each class would have at most two classes of data. 
We also separate 10% of the client's data for the test of the model performance.</p><p>FEMNIST: The FEMNIST dataset, derived from the LEAF dataset and implemented in TensorFlow Federated, is divided among 3,383 unique users (we used the first 1000 users). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Model Parameters</head><p>We focus on an edge computing setup where our clients are IoT devices. Given the limited power and computational capacity of IoT devices, we opted for simpler, lightweight models. Instead of using actual hardware, we emulated the setup, creating different groups of clients to mimic varying hardware capabilities. The model parameters used in our training are detailed below.</p><p>MNIST: For the MNIST dataset, we employ a basic multi-layer perceptron (MLP) classifier using TensorFlow's Keras sequential model. The MLP has two hidden layers with ReLU activation, comprising 200 and 100 neurons, respectively, followed by an output layer with 10 neurons and softmax activation. Before training, the input features are flattened, and the labels are one-hot encoded. We utilize the Adam optimizer with a learning rate of 0.001 for IID and 0.00012 for Non-IID and extreme cases and categorical cross-entropy as the loss function. Training is conducted for 300 epochs across various scenarios, including IId, Non-IID, and extreme cases. To create architectural diversity, we prune the global model into five distinct groups, denoted as Group 1 to Group 5, each with different performance levels. Each group is composed of 20 clients with a subset of the model's parameters. Specifically, Group 1 retains 100% of the global model, Group 2 retains 70%, Group 3 retains 50%, Group 4 retains 25%, and Group 5 retains 12%. These groups represent different hardware capabilities among participating clients. The model parameters for each group in the MNIST dataset are detailed in Table <ref type="table">2</ref> convolutional layers, followed by a max-pooling layer, a dropout layer, and two fully connected layers with dropout regularization. ReLU is used as the activation function for the convolutional layers, while softmax is applied to the output layer. 
We utilize the "categorical-crossentropy" loss function, along with the Adam optimizer set to a learning rate of 0.001 for IID, and 0.00005 for Non-IID and extreme scenarios. The model is trained for 1,000 epochs for IID, and 2,000 epochs for Non-IID and extreme cases. To introduce architectural diversity, we vary the rate parameter in the model, which controls the number of neurons in the fully connected layers, resulting in five distinct architectures. Similarly to the MNIST setup, we create five architectural groups, each consisting of 20 clients. In this setup, Group 1 retains 100% of the model parameters, Group 2 retains 75%, Group 3 retains 50%, Group 4 retains 35%, and Group 5 retains 25%. These groups are designed to mimic different hardware capabilities among the participating clients. The model parameters for each architectural group for the CIFAR10 dataset are detailed in Table <ref type="table">2</ref>.</p><p>FEMNIST: For the FEMNIST dataset, we employ a simple MLP with two hidden layers, using fully connected dense layers with ReLU activation functions. The input shape of the model is 784, corresponding to the number of pixels in each image. The first hidden layer consists of 64 neurons, and the output layer contains 10 neurons without an activation function (as the loss function used is SparseCategoricalCrossentropy with from-logits=True). The optimization process uses a learning rate of 0.001, without any regularization techniques applied. Training is conducted for 600 epochs to ensure convergence. To introduce architectural diversity, we prune the global model to create five distinct architectures with varying performances. Each group consists of 200 clients. Similarly to the setups for other datasets, Group 1 retains 100% of the model parameters, Group 2 retains 70%, Group 3 retains 45%, Group 4 retains 25%, and Group 5 retains 12%. These groups are designed to simulate different hardware capabilities among the participating clients. 
The model parameters for each group for the FEMNIST dataset are provided in Table <ref type="table">2</ref>.</p><p>SHAKESPEARE: For the SHAKESPEARE dataset, we implement a Recurrent Neural Network (RNN) using a GRU layer with stateful=True, ensuring the model's state is maintained across batches. Input data is pre-processed using a lookup table that maps each ASCII character to an index, then segmented into sequences of length 50+1. The model includes an embedding layer with a batch input shape of [8, None], followed by a GRU layer, and concludes with a dense layer containing 86 output units. Training spans 400 epochs, with a custom function serving as the evaluation metric, measuring the accuracy of the model's predictions across all characters in the input sequence. To introduce architectural diversity, we create five groups with 30 clients in each. Group 1 retains 100% of the model's parameters, Group 2 retains 75%, Group 3 retains 50%, Group 4 retains 35%, and Group 5 retains 25%. These groups simulate varying hardware capabilities among the clients. The model parameters for each group for the SHAKESPEARE dataset are outlined in Table <ref type="table">2</ref>.</p><p>Our FairHetero approach is scalable with respect to the number of groups, as adding more groups does not increase computational overhead. By using client masking, it accommodates various model architectures efficiently. Additionally, FairHetero is adaptable to more advanced architectures as federated learning systems evolve. It remains effective for a range of models, including deep CNNs, RNNs, and transformers, through architecture-specific adjustments like layerwise or attention-head pruning. This flexibility ensures that FairHetero can handle increased complexity and continue to minimize performance disparities even as model architectures become more sophisticated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Results</head><p>In this section, we present the results of our experiments evaluating the performance of FairHetero.</p><p>In the figure, G-1 corresponds to clients with the highest trainable model parameters, and G-5 corresponds to the lowest, in decreasing order, with the OVERALL column representing the performance of all clients together. We begin by analyzing the effect of the design parameter q m on intra-group or data fairness, followed by the impact of q on inter-group or hardware fairness. We then discuss the overall performance of FairHetero across different datasets. Finally, we provide tabulated results for various datasets, showcasing the group-level variances and means for different values of q and q m . 6.1 Effect of q m :-Intra-Group/Data Fairness In our evaluation, we investigate the impact of the intra-group or data fairness metric (q m ) on promoting fairness among clients within groups (similar hardware or same model architecture) for data heterogeneity. We maintain the inter-group or hardware fairness metric, q = 0, and vary the q m value from 0 to 100 on the MNIST, CIFAR10, FEMNIST, and SHAKESPEARE datasets. The dataset consists of five groups based on Table <ref type="table">2</ref>, each representing different hardware characteristics. In our analysis, we keep q m the same for all groups, i.e., for example, if q m is set to 10, then it means that all groups have a q m value of 10. Our findings found that increasing the q m value reduces the variance within each group, as shown in Figure <ref type="figure">3</ref>. However, there is a tradeoff between fairness and performance. With increasing q m , we see a slight degradation of overall average performance. Notably, the client-level q m metric cannot capture variance caused by architectural differences among client groups, leading to persistent performance disparities between groups due to hardware heterogeneity. 
Therefore, while the group-level q m metric effectively reduces performance variance among clients within the same architectural groups due to data heterogeneity, it is unable to address hardware heterogeneity among different hardware groups.</p><p>Key takeaways. Increasing the intra-group/data fairness parameter (q m ) reduces variance within groups but results in a slight degradation of overall average performance, highlighting a tradeoff between fairness and performance, particularly in addressing data heterogeneity within groups. It is unable to address performance variation due to hardware heterogeneity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Effect of q:-Inter-group/Hardware Fairness</head><p>To assess the impact of the inter-group or hardware fairness metric (q) on reducing variance among groups with architectural heterogeneity, we conducted experiments using the MNIST, CIFAR10, and FEMNIST datasets. Five distinct groups were formed based on Table <ref type="table">2</ref>, each representing different hardware characteristics. By setting the intra-group or data fairness metric (q m ) to 0, we varied the value of q from 0 to 500 for different datasets. Our findings reveal that increasing the value of q enhances model fairness by reducing variance among participating groups, particularly benefiting lower architectural client groups by improving their performance and reducing test loss leading to min-max problems.</p><p>Interestingly, we observed a trend where increasing the value of q up to a certain point leads to minimal degradation of larger architectural clients while improving the performance of smaller architectural clients, resulting in both fairer and improved performance. However, beyond a certain threshold, further increases in q can lead to divergence, causing the performance of all client groups to degrade despite reduced performance variance, ultimately resulting in a more equitable performance among groups. This highlights a tradeoff between performance and fairness, where increasing fairness too much may lead to performance loss.</p><p>Figure <ref type="figure">4</ref> illustrates that higher values of q decrease inter-group variance, while also reducing the overall performance as measured by the average test loss up to a certain q value, indicating an overall improvement in performance. However, beyond this optimal point, further increases in q lead to performance degradation. 
Thus, while the global fairness metric (q) can effectively reduce inter-group variance among groups with architectural heterogeneity, it comes with a tradeoff in performance after a certain threshold of q.</p><p>Key takeaways. Increasing the inter-group fairness metric (q) reduces variance among groups with different hardware capabilities, particularly benefiting less capable groups. However, this improvement is balanced by a tradeoff with performance, which can degrade after reaching a certain point. We found a wide spectrum of q values where it ensures both fairness and performance for all the datasets making it suitable for tuning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Performance of FairHetero</head><p>FairHetero demonstrates a notable reduction in overall performance variance among clients with varying hardware computational power. In our experiments on four datasets (MNIST, FEMNIST, CIFAR10, and SHAKESPEARE) with architecturally heterogeneous groups, we compare FairHetero with existing partial training-based algorithms-HeteroFL <ref type="bibr">[9]</ref>, FedRolex <ref type="bibr">[1]</ref> and FjORD <ref type="bibr">[13]</ref> and fair q-FFL <ref type="bibr">[23]</ref> algorithm, modified to address architectural heterogeneity-in terms of test loss Fig. <ref type="figure">4</ref>. Boxplot illustrating the impact of q on test loss for all datasets, where each box represents a group (G-1-G-5 from highest to lowest model architecture), and the overall box represents all clients. Each group contains multiple clients, and the boxplot visually depicts the test loss distribution within and across groups.</p><p>and test accuracy for each client, averaged across five random shuffles of each dataset. The results, as shown in Figure <ref type="figure">5</ref> for testing loss and Figure <ref type="figure">6</ref> for testing accuracy, indicate that with the tuning of the parameters q and q m in Table <ref type="table">3</ref>, FairHetero achieves fairer solutions compared to existing methods. On average, FairHetero reduces the variance of test loss across all devices by up to 30% with improved average performance. In Table <ref type="table">3</ref>, we provide details on the worst and best 10% testing losses of the participating clients, as well as the variance of the final loss distributions. Comparing FairHetero with the existing algorithms, we observe that the proposed objective maintains similar or better average testing loss and accuracy while significantly reducing the variance of the performance of participating devices, ensuring uniformity.</p><p>Key takeaways. 
FairHetero demonstrates a significant reduction in overall performance variance among clients with varying hardware computational power compared to existing algorithms while maintaining similar or better average testing loss and accuracy, ensuring fairness and uniformity in performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">Comparison of Different Datasets with Varied q and q m : Group-level Variance and</head><p>Mean Analysis A natural question comes as to how to choose the appropriate values of q and q m . Our proposed FairHetero is flexible to tune q and q m as tradeoffs between fairness (more uniformity of performance among participating clients) and overall performance (Average performance of all the clients) after a certain value of q. However, we found that we can attain a large degree of fairness among participating clients with overall better performance in most of the cases with smaller Hardware-Sensitive Fairness in Heterogeneous Federated Learning 5:23 values of q and q m . Another important benefit of q m is that we can tune the degree of fairness at the group level. Each group with a different architecture can have different q m , so q and q m can be tuned like hyperparameters based on the use cases and objectives.</p><p>We provide the detailed outcomes of our experiments where we tested various combinations of q and q m values for the MNIST, CIFAR10, FEMNIST, and SHAKESPEARE datasets. The following table presents the results, illustrating how the distribution of clients' performance can be influenced by different values of q and q m . In all the cases we found that added q and q m can effectively reduce the variance of the performance of the clients having different architectures.</p><p>Outcome of MNIST dataset evaluation. Tables <ref type="table">4</ref><ref type="table">5</ref><ref type="table">6</ref>present the average performance of each group in the MNIST IID, Non-IID, and extreme datasets, showing the mean and variance of the loss for different values of q and q m . Across all groups, as we keep the q value fixed and increase the q m value, we observe an increase in mean losses for all groups, but the variance of performance within each group decreases. 
This indicates that increasing q m helps address the data heterogeneity within each group but does not address the hardware heterogeneity.</p><p>For increasing values of q, we notice a significant increase in the performance of the lowest architecture groups, with minimal degradation in performance for the highest architecture group's clients. This leads to a more uniform performance across different architecture groups. However, as the q value increases further, we observe an interesting trend. Instead of benefiting the performance of the lowest architectural clients, the performance of all clients starts to degrade, with the highest architecture clients suffering more. While this may result in a more balanced performance across the groups, it comes at the cost of average performance degradation, highlighting the tradeoff between fairness and performance. The optimal values of q for MNIST IID, MNIST Non-IID, and MNIST extreme are found to be in the range 1-10, 1-10, and 1-10, respectively. Across all groups, FairHetero has a greater impact on the performance of clients with limited resources, such as those in Group 5. Conversely, Group 1, representing clients with the highest computational power, is less affected by increasing q and q m , indicating that these clients are less constrained by fairness considerations. The results demonstrate FairHetero's ability to balance fairness and performance across a range of computational capabilities for MNIST dataset, making it effective in addressing heterogeneity in federated learning environments.</p><p>Outcome of CIFAR10 dataset evaluation. Tables 7-9 display the average performance of each group in the CIFAR10 IID, Non-IID, and extreme datasets, indicating the mean and variance of the loss for various q and q m values. Increasing q m reduces the variance within each group but does not address hardware heterogeneity. 
For increasing q, there is initially a significant improvement in the performance of lower architecture groups, but beyond a certain point, all groups experience performance degradation, with higher architecture groups suffering more. The optimal q values for CIFAR10 IID, Non-IID, and extreme datasets are in the range of 1-10. FairHetero has a greater impact on clients with limited resources (e.g., Group 5) but less on clients with higher computational power (e.g., Group 1). Overall, FairHetero effectively balances fairness and performance across various computational capabilities for the CIFAR10 dataset for the CNN model, addressing heterogeneity in federated learning environments.</p><p>Outcome of FEMNIST dataset evaluation. Table <ref type="table">10</ref> presents the average performance of each group in the FEMNIST dataset, showing the mean and variance of the loss for different q and q m Hardware-Sensitive Fairness in Heterogeneous Federated Learning 5:25 values. Increasing q m reduces the variance within each group but does not address hardware heterogeneity. With increasing q, there is initially a notable enhancement in the performance of lower architecture groups, but beyond a certain threshold, all groups experience performance degradation, with higher architecture groups being more affected. The optimal q values for the FEMNIST dataset fall in the range of 10-100. FairHetero has a more pronounced impact on clients with limited resources (e.g., Group 5) compared to those with higher computational power (e.g., Group 1). Overall, FairHetero effectively balances fairness and performance across various computational capabilities for the FEMNIST dataset, addressing heterogeneity in federated learning environments.  
0,0,0,0,0 0.19 &#177; 0.12 0.01 &#177; 0.12 0.18 &#177; 0.07 0.01 &#177; 0.07 0.19 &#177; 0.05 0.00 &#177; 0.05 0.25 &#177; 0.02 0.00 &#177; 0.02 0.58 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 0.14 &#177; 0.08 0.01 &#177; 0.08 0.17 &#177; 0.06 0.00 &#177; 0.06 0.20 &#177; 0.05 0.00 &#177; 0.05 0.28 &#177; 0.02 0.00 &#177; 0.02 0.74 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 0.14 &#177; 0.06 0.00 &#177; 0.06 0.22 &#177; 0.06 0.00 &#177; 0.06 0.25 &#177; 0.05 0.00 &#177; 0.05 0.39 &#177; 0.02 0.00 &#177; 0.02 1.11 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 0.26 &#177; 0.06 0.00 &#177; 0.06 0.36 &#177; 0.08 0.01 &#177; 0.08 0.46 &#177; 0.05 0.00 &#177; 0.05 0.85 &#177; 0.02 0.00 &#177; 0.02 1.73 &#177; 0.01 0.00 &#177; 0.01 50,50,50,50,50 0.38 &#177; 0.06 0.00 &#177; 0.06 0.59 &#177; 0.06 0.00 &#177; 0.06 0.94 &#177; 0.04 0.00 &#177; 0.04 1.45 &#177; 0.02 0.00 &#177; 0.02 2.03 &#177; 0.01 0.00 &#177; 0.01 0.1 0,0,0,0,0 0.16 &#177; 0.11 0.01 &#177; 0.11 0.14 &#177; 0.06 0.00 &#177; 0.06 0.15 &#177; 0.06 0.00 &#177; 0.06 0.24 &#177; 0.02 0.00 &#177; 0.02 0.46 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 0.12 &#177; 0.07 0.01 &#177; 0.07 0.14 &#177; 0.05 0.00 &#177; 0.05 0.15 &#177; 0.05 0.00 &#177; 0.05 0.27 &#177; 0.02 0.00 &#177; 0.02 0.58 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 0.14 &#177; 0.05 0.00 &#177; 0.05 0.19 &#177; 0.05 0.00 &#177; 0.05 0.20 &#177; 0.04 0.00 &#177; 0.04 0.41 &#177; 0.02 0.00 &#177; 0.02 1.02 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 0.26 &#177; 0.06 0.00 &#177; 0.06 0.35 &#177; 0.08 0.01 &#177; 0.08 0.44 &#177; 0.05 0.00 &#177; 0.05 0.82 &#177; 0.02 0.00 &#177; 0.02 1.69 &#177; 0.01 0.00 &#177; 0.01 50,50,50,50,50 0.37 &#177; 0.06 0.00 &#177; 0.06 0.58 &#177; 0.07 0.00 &#177; 0.07 0.92 &#177; 0.04 0.00 &#177; 0.04 1.42 &#177; 0.02 0.00 &#177; 0.02 2.01 &#177; 0.01 0.00 &#177; 0.01 1 0,0,0,0,0 0.13 &#177; 0.08 0.01 &#177; 0.08 0.15 &#177; 0.06 0.00 &#177; 0.06 0.17 &#177; 0.06 0.00 &#177; 0.06 0.27 &#177; 0.03 0.00 &#177; 0.03 0.40 
&#177; 0.03 0.00 &#177; 0.03 0.1,0.1,0.1,0.1,0.1 0.12 &#177; 0.07 0.00 &#177; 0.07 0.15 &#177; 0.05 0.00 &#177; 0.05 0.17 &#177; 0.06 0.00 &#177; 0.06 0.27 &#177; 0.03 0.00 &#177; 0.05 0.42 &#177; 0.03 0.00 &#177; 0.03 0,0,0,0,0 0.20 &#177; 0.10 0.01 &#177; 0.10 0.28 &#177; 0.12 0.01 &#177; 0.12 0.26 &#177; 0.15 0.02 &#177; 0.15 0.65 &#177; 0.23 0.05 &#177; 0.23 1.28 &#177; 0.34 0.12 &#177; 0.34 1,0.1,0.1,0.1,0.1 0.23 &#177; 0.09 0.01 &#177; 0.09 0.34 &#177; 0.13 0.02 &#177; 0.13 0.33 &#177; 0.17 0.03 &#177; 0.17 0.88 &#177; 0.31 0.09 &#177; 0.31 1.55 &#177; 0.32 0.10 &#177; 0.32 10,1,1,1,1 0.52 &#177; 0.08 0.01 &#177; 0.08 0.69 &#177; 0.18 0.03 &#177; 0.18 0.82 &#177; 0.24 0.06 &#177; 0.24 1.55 &#177; 0.35 0.12 &#177; 0.35 2.05 &#177; 0.25 0.06 &#177; 0.25 10,10,10,10,10 0.69 &#177; 0.10 0.01 &#177; 0.10 1.00 &#177; 0.13 0.02 &#177; 0.13 1.24 &#177; 0.15 0.02 &#177; 0.15 1.84 &#177; 0.22 0.05 &#177; 0.22 2.17 &#177; 0.14 0.02 &#177; 0.14 1 0,0,0,0,0 0.20 &#177; 0.10 0.01 &#177; 0.10 0.29 &#177; 0.13 0.02 &#177; 0.13 0.26 &#177; 0.14 0.02 &#177; 0.14 0.48 &#177; 0.18 0.03 &#177; 0.18 0.92 &#177; 0.3 0.12 &#177; 0.35 0.1,0.1,0.1,0.1,0.1 0.21 &#177; 0.10 0.01 &#177; 0.10 0.30 &#177; 0.13 0.02 &#177; 0.13 0.27 &#177; 0.15 0.02 &#177; 0.15 0.51 &#177; 0.18 0.03 &#177; 0.18 0.96 &#177; 0.35 0.12 &#177; 0.35 .0001,.001,.01,.1,1 0.20 &#177; 0.10 0.01 &#177; 0.10 0.29 &#177; 0.12 0.01 &#177; 0.12 0.28 &#177; 0.16 0.02 &#177; 0.16 1.06 &#177; 0.35 0.12 &#177; 0.35 1.75 &#177; 0.29 0.08 &#177; 0.29 1,1,1,1,1 0.26 &#177; 0.10 0.01 &#177; 0.10 0.37 &#177; 0.13 0.02 &#177; 0.13 0.36 &#177; 0.16 0.02 &#177; 0.16 0.76 &#177; 0.21 0.04 &#177; 0.21 1.29 &#177; 0.31 0.10 &#177; 0.31 10 0,0,0,0,0 0.24 &#177; 0.11 0.01 &#177; 0.11 0.33 &#177; 0.14 0.02 &#177; 0.14 0.30 &#177; 0.15 0.02 &#177; 0.15 0.47 &#177; 0.17 0.03 &#177; 0.17 0.77 &#177; 0.31 0.10 &#177; 0.31 0.1,0.1,0.1,0.1,0.1 0.24 &#177; 0.11 0.01 &#177; 0.11 0.33 &#177; 0.14 0.02 &#177; 0.14 0.31 &#177; 0.15 0.02 &#177; 0.15 
0.48 &#177; 0.17 0.03 &#177; 0.17 0.78 &#177; 0.31 0.10 &#177; 0.31 1,1,1,1,1 0.27 &#177; 0.12 0.01 &#177; 0.12 0.38 &#177; 0.14 0.02 &#177; 0.14 0.</p><p>35 &#177; 0.15 0.02 &#177; 0.15 0.55 &#177; 0.16 0.03 &#177; 0.16 0.84 &#177; 0.27 0.07 &#177; 0.27 .0001,.001,.01,.1,1 0.26 &#177; 0.11 0.01 &#177; 0.11 0.37 &#177; 0.15 0.02 &#177; 0.15 0.36 &#177; 0.19 0.04 &#177; 0.19 1.12 &#177; 0.35 0.12 &#177; 0.35 1.61 &#177; 0.38 0.14 &#177; 0.38 10,10,10,10,10 0.58 &#177; 0.09 0.01 &#177; 0.09 0.73 &#177; 0.15 0.02 &#177; 0.15 0.75 &#177; 0.14 0.02 &#177; 0.14 1.08 &#177; 0.13 0.02 &#177; 0.13 1.40 &#177; 0.23 0.05 &#177; 0.23 50 0,0,0,0,0 0.56 &#177; 0.15 0.02 &#177; 0.15 0.68 &#177; 0.23 0.05 &#177; 0.23 0.64 &#177; 0.27 0.07 &#177; 0.27 0.93 &#177; 0.28 0.08 &#177; 0.28 1.18 &#177; 0.46 0.21 &#177; 0.46 0.1,0.1,0.1,0.1,0.1 0.56 &#177; 0.14 0.02 &#177; 0.14 0.68 &#177; 0.23 0.05 &#177; 0.23 0.65 &#177; 0.26 0.07 &#177; 0.26 0.93 &#177; 0.27 0.07 &#177; 0.27 1.18 &#177; 0.45 0.20 &#177; 0.45 1,1,1,1,1 0.67 &#177; 0.11 0.01 &#177; 0.11 0.75 &#177; 0.23 0.05 &#177; 0.23 0.71 &#177; 0.24 0.06 &#177; 0.24 0.98 &#177; 0.24 0.06 &#177; 0.24 1.22 &#177; 0.39 0.15 &#177; 0.39 10,10,10,10,10 0.85 &#177; 0.08 0.01 &#177; 0.08 1.01 &#177; 0.18 0.03 &#177; 0.18 1.03 &#177; 0.18 0.03 &#177; 0.18 1.28 &#177; 0.16 0.03 &#177; 0.16 1.49 &#177; 0.25 0.06 &#177; 0.25 50,50,50,50,50 1.51 &#177; 0.11 0.01 &#177; 0.11 1.65 &#177; 0.23 0.05 &#177; 0.23 1.72 &#177; 0.16 0.03 &#177; 0.16 1.85 &#177; 0.16 0.02 &#177; 0.16 1.99 &#177; 0.05 0.00 &#177; 0.05</p><p>Outcome of SHAKESPEARE dataset evaluation. Table <ref type="table">11</ref> illustrates the average performance of each group in the SHAKESPEARE dataset, displaying the mean and variance of the loss for different q and q m values. Increasing q m decreases the variance within each group but does not mitigate hardware heterogeneity. As q increases, there is an initial enhancement in the performance of lower architecture groups. 
However, beyond a certain point, all groups experience performance degradation, with higher architecture groups being more affected. The optimal q values for the SHAKESPEARE dataset range from 0 to 1, indicating a narrow range for RNN models. Overall, FairHetero effectively balances fairness and performance across various computational capabilities for the SHAKESPEARE dataset in RNN models, addressing heterogeneity in federated learning environments.</p><p>Hardware-Sensitive Fairness in Heterogeneous Federated Learning 5:27 1.49 &#177; 0.08 0.01 &#177; 0.08 1.70 &#177; 0.08 0.01 &#177; 0.08 1.88 &#177; 0.06 0.00 &#177; 0.06 2.13 &#177; 0.09 0.01 &#177; 0.09 2.25 &#177; 0.04 0.00 &#177; 0.04 10 0,0,0,0,0 0.35 &#177; 0.14 0.02 &#177; 0.14 0.32 &#177; 0.13 0.02 &#177; 0.13 0.35 &#177; 0.18 0.03 &#177; 0.18 0.49 &#177; 0.17 0.03 &#177; 0.17 0.78 &#177; 0.41 0.17 &#177; 0.41 0.001,0.01,0.1,1,10 0.36 &#177; 0.14 0.02 &#177; 0.14 0.36 &#177; 0.15 0.02 &#177; 0.15 0.43 &#177; 0.18 0.03 &#177; 0.18 1.14 &#177; 0.37 0.14 &#177; 0.37 1.65 &#177; 0.28 0.08 &#177; 0.28 0.1,0.1,0.1,0.1,0.1 0.36 &#177; 0.14 0.02 &#177; 0.14 0.33 &#177; 0.13 0.02 &#177; 0.13 0.35 &#177; 0.17 0.03 &#177; 0.17 0.50 &#177; 0.17 0.03 &#177; 0.17 0.79 &#177; 0.40 0.16 &#177; 0. 
0,0,0,0,0 1.18 &#177; 0.14 0.02 &#177; 0.14 1.21 &#177; 0.11 0.01 &#177; 0.11 1.37 &#177; 0.08 0.01 &#177; 0.08 1.85 &#177; 0.02 0.00 &#177; 0.02 2.03 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 1.17 &#177; 0.13 0.02 &#177; 0.13 1.22 &#177; 0.11 0.01 &#177; 0.11 1.38 &#177; 0.08 0.01 &#177; 0.08 1.86 &#177; 0.02 0.00 &#177; 0.02 2.03 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 1.16 &#177; 0.12 0.01 &#177; 0.12 1.27 &#177; 0.10 0.01 &#177; 0.10 1.44 &#177; 0.07 0.00 &#177; 0.07 1.93 &#177; 0.02 0.00 &#177; 0.02 2.08 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 1.40 &#177; 0.10 0.01 &#177; 0.10 1.50 &#177; 0.09 0.01 &#177; 0.09 1.67 &#177; 0.04 0.00 &#177; 0.04 2.18 &#177; 0.01 0.00 &#177; 0.01 2.24 &#177; 0.01 0.00 &#177; 0.01 50,50,50,50,50 1.68 &#177; 0.08 0.01 &#177; 0.08 1.79 &#177; 0.05 0.00 &#177; 0.05 1.98 &#177; 0.04 0.00 &#177; 0.04 2.28 &#177; 0.00 0.00 &#177; 0.00 2.29 &#177; 0.00 0.00 &#177; 0.00 0.1 0,0,0,0,0 1.18 &#177; 0.14 0.02 &#177; 0.14 1.21 &#177; 0.11 0.01 &#177; 0.11 1.37 &#177; 0.08 0.01 &#177; 0.08 1.84 &#177; 0.02 0.00 &#177; 0.02 2.02 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 1.17 &#177; 0.13 0.02 &#177; 0.13 1.22 &#177; 0.11 0.01 &#177; 0.11 1.37 &#177; 0.08 0.01 &#177; 0.08 1.85 &#177; 0.02 0.00 &#177; 0.02 2.02 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 1.16 &#177; 0.11 0.01 &#177; 0.11 1.27 &#177; 0.10 0.01 &#177; 0.10 1.43 &#177; 0.07 0.00 &#177; 0.07 1.92 &#177; 0.02 0.00 &#177; 0.02 2.07 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 1.40 &#177; 0.10 0.01 &#177; 0.10 1.50 &#177; 0.09 0.01 &#177; 0.09 1.67 &#177; 0.04 0.00 &#177; 0.04 2.18 &#177; 0.01 0.00 &#177; 0.01 2.24 &#177; 0.01 0.00 &#177; 0.01 1 0,0,0,0,0 1.16 &#177; 0.13 0.02 &#177; 0.13 1.22 &#177; 0.11 0.01 &#177; 0.11 1.36 &#177; 0.08 0.01 &#177; 0.08 1.80 &#177; 0.02 0.00 &#177; 0.02 1.97 &#177; 0.02 0.00 &#177; 0.02 .01,.01,.01,.01,.01 1.16 &#177; 0.13 0.02 &#177; 0.13 1.21 &#177; 0.11 0.01 &#177; 0.11 1.36 &#177; 0.08 0.01 &#177; 0.08 1.80 &#177; 0.02 0.00 &#177; 0.02 
1.97 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 1.17 &#177; 0.11 0.01 &#177; 0.11 1.26 &#177; 0.10 0.01 &#177; 0.10 1.42 &#177; 0.07 0.01 &#177; 0.07 1.86 &#177; 0.02 0.00 &#177; 0.02 2.02 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 1.40 &#177; 0.10 0.01 &#177; 0.10 1.50 &#177; 0.09 0.01 &#177; 0.09 1.66 &#177; 0.04 0.01 &#177; 0.04 2.14 &#177; 0.01 0.00 &#177; 0.01 2.21 &#177; 0.01 0.00 &#177; 0.01 10 0,0,0,0,0 1.19 &#177; 0.11 0.01 &#177; 0.11 1.25 &#177; 0.10 0.01 &#177; 0.10 1.39 &#177; 0.08 0.01 &#177; 0.08 1.74 &#177; 0.03 0.00 &#177; 0.03 1.87 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 1.19 &#177; 0.11 0.01 &#177; 0.11 1.26 &#177; 0.10 0.01 &#177; 0.10 1.39 &#177; 0.08 0.01 &#177; 0.08 1.74 &#177; 0.03 0.00 &#177; 0.03 1.87 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 1.23 &#177; 0.11 0.01 &#177; 0.11 1.30 &#177; 0.09 0.01 &#177; 0.09 1.43 &#177; 0.08 0.01 &#177; 0.08 1.76 &#177; 0.03 0.00 &#177; 0.03 1.88 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 1.45 &#177; 0.10 0.01 &#177; 0.10 1.51 &#177; 0.08 0.01 &#177; 0.08 1.62 &#177; 0.05 0.00 &#177; 0.05 1.91 &#177; 0.02 0.00 &#177; 0.02 2.02 &#177; 0.02 0.00 &#177; 0.02 50 0,0,0,0,0 1.40 &#177; 0.10 0.01 &#177; 0.10 1.43 &#177; 0.08 0.01 &#177; 0.08 1.52 &#177; 0.07 0.00 &#177; 0.07 1.79 &#177; 0.03 0.00 &#177; 0.03 1.87 &#177; 0.02 0.00 &#177; 0.02 0.1,0.1,0.1,0.1,0.1 1.41 &#177; 0.10 0.01 &#177; 0.10 1.44 &#177; 0.08 0.01 &#177; 0.08 1.52 &#177; 0.07 0.00 &#177; 0.07 1.79 &#177; 0.03 0.00 &#177; 0.03 1.87 &#177; 0.02 0.00 &#177; 0.02 1,1,1,1,1 1.43 &#177; 0.10 0.01 &#177; 0.10 1.46 &#177; 0.08 0.01 &#177; 0.08 1.53 &#177; 0.07 0.00 &#177; 0.07 1.80 &#177; 0.03 0.00 &#177; 0.03 1.87 &#177; 0.02 0.00 &#177; 0.02 10,10,10,10,10 1.87 &#177; 0.02 0.01 &#177; 0.09 1.60 &#177; 0.07 0.00 &#177; 0.07 1.65 &#177; 0.05 0.00 &#177; 0.05 1.84 &#177; 0.03 0.00 &#177; 0.03 1.90 &#177; 0.02 0.00 &#177; 0.02 50,50,50,50,50 1.77 &#177; 0.07 0.00 &#177; 0.07 1.80 &#177; 0.05 0.00 &#177; 0.05 1.85 &#177; 0.04 0.00 &#177; 0.04 2.01 
&#177; 0.02 0.00 &#177; 0.02 2.06 &#177; 0.02 0.00 &#177; 0.02 Key takeaways. The values of q and q m are not overly sensitive regarding performance, generally exhibiting a wide range where they ensure both fairness and performance. This range was found to be broader for MLP and CNN model architectures, as seen in the MNIST, CIFAR10, and FEMNIST datasets, while narrower for the RNN model, as observed in the SHAKESPEARE dataset. In most cases, even a small value of q and q m can ensure improved fairness with added performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>In this article, we proposed a novel FL method, FairHetero, that promotes fairness among clients with heterogeneous hardware or model architectures, ensuring balance and equity in model training. Our approach offers tunable fairness, addressing both data and hardware heterogeneity. We conducted an extensive theoretical and experimental evaluation and demonstrated that FairHetero can reduce performance variability among participating devices.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>ACM Trans. Model. Perform. Eval. Comput. Syst., Vol. 10, No. 1, Article 5. Publication date: March 2025.</p></note>
		</body>
		</text>
</TEI>
