<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Federated Learning with Flexible Architectures</title></titleStmt>
			<publicationStmt>
				<publisher>Springer Nature Switzerland</publisher>
				<date>01/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10656678</idno>
					<idno type="doi">10.1007/978-3-031-70344-7_9</idno>
					
					<author>Jong-Ik Park</author><author>Carlee Joe-Wong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Not Available]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>As the need for advanced decision-making capabilities in scenarios such as medical imaging and military operations grows, modern machine learning and deep learning techniques are increasingly in demand <ref type="bibr">[2,</ref><ref type="bibr">4]</ref>. To tackle potential privacy concerns associated with handling clients' data in these sensitive areas, federated learning (FL) has been developed. FL allows for training a shared machine learning model using data from multiple clients without the need to exchange or reveal their local data directly <ref type="bibr">[23,</ref><ref type="bibr">24,</ref><ref type="bibr">27]</ref>. Clients iteratively compute updates on local models and periodically synchronize these updates with an aggregation server to create a global model, which is sent back to the clients for another round of local model updates. However, traditional FL strategies struggle to integrate heterogeneous clients with varying computational and communication resources <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">19]</ref>. For example, devices in a federated network may range from powerful fighter jets to less capable tanks in a military context or large hospitals to smaller clinics in healthcare. Prior works along these lines generally focus on reducing the number of local model updates that weaker clients complete during the training <ref type="bibr">[21,</ref><ref type="bibr">30,</ref><ref type="bibr">32,</ref><ref type="bibr">34]</ref>, which can slow model convergence. 
Moreover, each client must store a copy of and compute gradients on the full, possibly very large, model.</p><p>In this work, we explicitly account for heterogeneity in clients' compute and communication resources by allowing clients to customize their local model architectures according to their specific resources. This ensures efficient participation by avoiding delays from slower (straggler) clients, which could otherwise hinder the FL process and negatively impact global model updates <ref type="bibr">[20,</ref><ref type="bibr">34]</ref>. Thus, our work falls into the same category as width-flexible FL aggregation strategies like HeteroFL <ref type="bibr">[6]</ref>, depth-flexible FL aggregation strategies such as FlexiFed <ref type="bibr">[41]</ref>, and strategies flexible in both width and depth like NeFL <ref type="bibr">[16]</ref>. These strategies enable FL clients to train network architectures with variable depths and widths.</p><p>However, prior works do not consider the fact that client networks' diversity (or heterogeneity) in FL presents unique security challenges. Combining models with different network architectures introduces weak points in the aggregation process that are susceptible to attacks <ref type="bibr">[3,</ref><ref type="bibr">23]</ref>. These weak points refer to weights that are incompletely aggregated, since only a subset of clients compute their values due to the differences in network structures. Attackers can exploit these vulnerabilities in commonly used backdoor attacks <ref type="bibr">[3,</ref><ref type="bibr">23]</ref>, which aim to induce inaccurate predictions on specific data inputs by manipulating model updates from malicious clients. 
By manipulating weights that are only updated by a few clients, attackers can successfully compromise the model, as depicted in the last two layers of the global model in Figure <ref type="figure">1</ref>.</p><p>Following these concerns, another critical challenge arises from scale variations in client weights due to the heterogeneous nature of network architectures <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">39]</ref>. When clients possess varying numbers of layers and filters, scale variations arise, potentially causing unfairness in the global model aggregation: data from specific clients whose model weights have a larger scale may be disproportionately emphasized in the global model. In response to these challenges, this paper proposes a novel strategy, 'Federated Learning with Flexible Architectures' (FedFA), that retains the benefits of employing heterogeneous model architectures while minimizing the impact of weak point attacks and addressing scale variation in aggregation. Our strategy aggregates model layers uniformly, regardless of the complexities of individual networks. We thus establish a global model that matches the greatest depth and width found among all local models. This setup allows each client to contribute to the value of each weight in the global model, minimizing the risks associated with specific weak point attacks. Lastly, we propose a fair-scalable aggregation method to ensure fairness across local models and reduce the model bias from scale variations.</p><p>In essence, our FedFA framework delivers four significant contributions: 1) We introduce a novel aggregation strategy that is the first, to the best of our knowledge, to address security challenges in FL on heterogeneous architectures. 
FedFA uniformly incorporates layers from various client models into a unified global model, exploiting similarities between layers of a neural network induced by the common presence of skip connections.</p><p>2) We are the first to effectively address scale variations in a dynamic training environment. We propose the scalable aggregation method to compensate for scale variations in the weights of heterogeneous network architectures.</p><p>3) FedFA utilizes NAS (Neural Architecture Search) <ref type="bibr">[18]</ref> to optimize each client's model architecture based on its specific data characteristics, thus elucidating the impact of employing optimal model architectures tailored to local data characteristics on both local and global model performance.</p><p>4) In our experiments on Pre-ResNet, MobileNetV2, and EfficientNetV2, FedFA outperforms previous width- and depth-flexible strategies. FedFA achieves accuracy improvements by factors of up to 1.16 in IID (independent and identically distributed) data settings and 1.20 in non-IID settings on the global model. In non-IID environments, clients' local accuracies increase by factors of up to 1.13. Furthermore, FedFA demonstrates increased robustness: prior strategies experience accuracy declines under backdoor attacks of up to 2.11 times in IID settings, and up to 3.31 times globally and 1.74 times locally in non-IID settings, relative to FedFA. Additionally, our experiments with a Transformer-based language model showed a significant reduction in perplexity, improving by 1.07 to 4.50 times.</p><p>We outline the 'Related Work' in Section 2 and motivate FedFA in Section 3, which introduces previous width- and depth-flexible strategies and the model properties that we exploit. Next, 'Flexible Federated Learning' in Section 4 details the design of FedFA, and Section 5 discusses our experimental results. We discuss directions for future work in Section 6 and conclude in Section 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Heterogeneous Network Aggregation in Federated Learning</head><p>FlexiFed <ref type="bibr">[41]</ref> is a depth-flexible strategy that aggregates common layers of clients' networks with varying depths, like in VGG-16 and VGG-19, forming global models from different layer clusters. HeteroFL <ref type="bibr">[6]</ref> is a width-flexible strategy that accommodates clients with varying resources by aggregating networks of different widths. It selectively aggregates weights where available and employs a heuristic to manage weight variability <ref type="bibr">[28]</ref>. NeFL <ref type="bibr">[16]</ref>  and depth flexibility, using skip connections to omit certain blocks and structured pruning for width control, similar to HeteroFL <ref type="bibr">[6]</ref>. These strategies result in incomplete aggregation, which poses security risks to the global model (see Figure <ref type="figure">1</ref>). Here, incomplete aggregation refers to the process where weights in layers or filters in the global model at a specific position are updated with contributions from only a subset of the participating local models rather than all. Unlike HeteroFL, FlexiFed and NeFL do not consider scale variations between clients' model weights, leading to potential unfairness in model aggregation. Moreover, HeteroFL's scaling factors might be less relevant in architectures with batch normalization layers <ref type="bibr">[15]</ref>, which stabilize learning and reduce the need for additional scaling. For more on HeteroFL's scaling, refer to Appendix G. Furthermore, several other width-and depth-flexible strategies like Sub-FedAvg <ref type="bibr">[39]</ref> and TailorFL <ref type="bibr">[5]</ref> predominantly rely on online filter pruning, which can lead to significant computational overhead, contradicting their aim for computational efficiency. 
Therefore, in this study, we benchmarked our proposed FedFA strategy against HeteroFL, FlexiFed, and NeFL (see Section 5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Split Learning</head><p>The split learning FL framework also allows clients to maintain a variable number of neural network layers, which connect to common layers stored at the aggregation server <ref type="bibr">[7,</ref><ref type="bibr">31]</ref>. Clients can choose the number of local layers according to their computing resources and data characteristics <ref type="bibr">[35]</ref>. However, split learning requires intensive client-server communication, as clients cannot compute local model updates without communicating with the layers at the server <ref type="bibr">[10,</ref><ref type="bibr">29,</ref><ref type="bibr">38]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Skip Connections</head><p>Modern network architectures have highlighted the significance of skip connections (or residual connections) in neural networks, a feature we utilize in our layer grafting method to mitigate the security risks of incomplete aggregation (e.g., ResNets <ref type="bibr">[13]</ref>, MobileNets <ref type="bibr">[14]</ref>, and EfficientNets <ref type="bibr">[36]</ref>). Skip connections allow gradients to bypass specific layers, mitigating vanishing gradients in deep learning <ref type="bibr">[22]</ref>. This functionality preserves training stability and enhances pattern recognition efficiency, making these networks suitable for resource-constrained devices <ref type="bibr">[13,</ref><ref type="bibr">26]</ref>, and making layers similar <ref type="bibr">[9,</ref><ref type="bibr">16,</ref><ref type="bibr">40]</ref> (See Appendix B.).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Motivation: Challenges in Heterogeneous Aggregation</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Security Concerns of Heterogeneous Network Aggregation</head><p>Aggregating models from diverse network architectures presents security challenges, as malicious actors can exploit these strategies to carry out sophisticated and covert attacks, implanting subtle yet harmful alterations within model updates <ref type="bibr">[3,</ref><ref type="bibr">37]</ref>. These modifications, often in the form of triggers or slight changes, are designed to exploit the aggregation process covertly <ref type="bibr">[1]</ref> and steer a model to degrade its accuracy or to embed hidden vulnerabilities, which become more pronounced over time <ref type="bibr">[23,</ref><ref type="bibr">33]</ref>. A particular point of vulnerability with heterogeneous client architectures is the layers that are not fully aggregated (i.e., incomplete aggregation) in the global model (which has the largest width and depth across all local models) due to limited contributions from few clients, as depicted in Figure <ref type="figure">1</ref>.</p><p>A common attack embeds a backdoor into a malicious client's local model in a heterogeneous FL setting. This hidden function or behavior, designed to remain dormant, activates only under specific conditions. Once integrated into the global model, these backdoors can trigger significant security breaches, such as targeted misclassifications <ref type="bibr">[1,</ref><ref type="bibr">3]</ref>. Mathematically, a backdoor attack computes a malicious model update as follows:</p><p>Here, M t c is the original update from client c at global iteration t, and M backdoor represents the backdoor modification. determines the intensity of the backdoor effect. The contribution of &#8226; M backdoor to the aggregation process of all clients' model updates dictates the extent of damage to the global model. 
Specifically, weights of the global model that undergo incomplete aggregation are more susceptible to being compromised, as can be seen from the aggregation: <formula>ΔM_G^t = (1/N) Σ_{c=1}^N ΔM_c^t</formula></p><p>For the global model update, ΔM_G^t, in the presence of many clients N, the influence of an attack by malicious clients could be diluted. However, this dilution is limited to the weights that are updated by most or all clients. Furthermore, by selecting the largest network architecture, attackers can amplify the effect of their attacks, in contrast to local clients who select network architectures based on their resource capabilities or the characteristics of their local data.</p></div>
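The dilution argument above can be sketched numerically. The following toy example (illustrative only, with hypothetical magnitudes, not the paper's implementation) contrasts a weight updated by all clients with a weak-point weight updated by only two of them, under a backdoor update of intensity λ as in Eq. 1:

```python
# Toy sketch: dilution of a backdoor perturbation lambda * M_backdoor under
# complete vs. incomplete aggregation. All magnitudes are hypothetical.

def aggregate(updates):
    """Unweighted FedAvg-style mean of scalar weight updates."""
    return sum(updates) / len(updates)

N = 100                         # clients contributing to a fully aggregated weight
benign = [0.01] * (N - 1)       # benign updates for that weight
lam, m_backdoor = 20.0, 0.05    # attack intensity and backdoor perturbation
malicious = 0.01 + lam * m_backdoor

# Complete aggregation: all N clients update this weight -> attack is diluted.
full = aggregate(benign + [malicious])

# Incomplete aggregation: only two clients (one malicious) touch this weight.
weak_point = aggregate([0.01, malicious])

assert weak_point > 10 * full   # the weak point absorbs far more of the attack
```

The same perturbation that moves a fully aggregated weight by 1/N of its magnitude moves an incompletely aggregated weight by a factor of 1/2 here, which is the vulnerability FedFA's complete aggregation removes.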
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Scale Variations in Heterogeneous Networks</head><p>During the training process, the gradients of weights can vary significantly across different network architectures, influenced by the number of weights present in each network before applying the loss function. As a result, the magnitudes of weight changes during each step of gradient descent (i.e., step sizes) can differ due to the variations in gradients during optimization. This leads to scale variations across the heterogeneous networks <ref type="bibr">[11]</ref> (refer to Appendix F for more details). In FL environments, such variations significantly impact the performance and accuracy of the aggregated global model. For instance, consider two client models within an FL system, Model A and Model B; each has a distinct architecture. The magnitude of their updates in a given round of FL, M A and M B , can differ significantly based on their respective models' complexities. When these models are aggregated using an unweighted averaging method, as is typical in federated learning, we have:</p><p>This may lead to an imbalance, causing the aggregated model update, M , to be skewed towards the model with a larger magnitude of weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Flexible Federated Learning</head><p>This section details our methodology for implementing Federated Learning with Flexible Architectures (FedFA). We first introduce an overview of our FedFA procedure and then focus on the procedure for the layer grafting method, which ensures security by enforcing complete aggregation. Additionally, we introduce scaling factors for normalizing model weights, a critical component for achieving fair aggregation within the FedFA framework.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">FedFA Procedure</head><p>The FedFA algorithm, as presented in Algorithm 1, incorporates the layer grafting and the scalable aggregation methods into the FL paradigm. Initially, the server proposes a variety of network architectures (line 1). Clients then choose their architectures, e.g., using Neural Architecture Search (NAS) methods (line 2), followed by the server setting up the global model with the maximum possible number of weights (line 3). The algorithm begins its iterative process by selecting a random subset of clients to update the model in each round (lines 5-9). Subsequently, the layer grafting method (explained in Section 4.2) will adjust each client's updated local model to align with the global model's architecture (line 11).</p><p>For scalable aggregation, the server first finds the weights below the 95th percentile in each layer of each local model from each participating client (line 12). With these extracted weights, which exclude outliers, the server calculates</p><p>Architecture 1 Architecture 10 Architecture 2 &#8230; Architecture 3 Architecture 4 Architecture 5 Architecture 4 is the Largest. Global model follows architecture 4! Global Model Local Model Local Model Local Model Updated Local Model Updated Local Model Updated Local Model Updated Local Models Grafted Local Models Normalized Local Models Aggregated Global Model 1) Server announces possible network architectures to local clients. (line 1) 2) Clients choose their network architectures. (line 2) 3) Server sets up the global model, configuring it to match the largest architecture of local client networks. (line 3) 4) Server extracts weights from the global model based on clients' network architectures. (line 7) 6) Local training using local datasets by local clients. (line 9) 7) Local clients send updated local models back to server. (line 10) 5) Server distributes the extracted weights to each local client. 
(line 8) 8) Layer grafting by server. (line 11) 9) Normalizes the local models by server. (lines 12-22) 10) Aggregate normalized local models by server. (line 22)</p><p>Steps 4 to 10 are repeated iteratively until the predefined criteria are met, thereby refining the global model over time. the scaling factor, &#8629;</p><p>c , for each layer (line 18). Normalization of each local model and aggregation of these normalized local models then ensures that the updates from all participating clients are aggregated in a balanced manner, preserving uniformity and scale consistency throughout the network (lines <ref type="bibr">16</ref>-22) (more details are in Section 4.3). Algorithm 1 FedFA with the layer grafting and the scalable aggregation methods. The algorithm operates over T rounds with a client set C. Here, the clients' participating rate is C. Each client c 2 C selects a model architecture from a predefined set A. The server determines the maximal architecture width N (l) width,max and depth N (s) depth,max . The global model, M t G , is updated at the server through the aggregation of client updates. Require: Local datasets D = {Dc|c 2 C}. Ensure: Updated global model M T G . 1: Server proposes architecture set A. 2: Clients select network architectures from A using NAS methods and report their architectures (width N (l) width,c and depth N (s) depth,c ) to the server. 3: Initialize the global model, M 0 G , with N (l) width,max , N (s) depth,max . {Server} 4: for t = 0 to T do 5: Select a subset C sel of m = C &#8677; |C| clients. {Server} 6: for all clients c in C sel do 7: Extract M t c from M t G according to N (l) width,c and N (s) depth,c . {Server} 8: Distribute M t G to the client c. {Server} 9: M t+1 c LocalUpdate(M t c , Dc) {Client c} 10: Send M t+1 c to the server. {Client c} 11: Apply the layer grafting to M t+1 c . 
{Server, Algorithm 2} 12: M 95%,c Under 95th percentile values of M t+1 c for each layer {Server} 13: end for 14:</p><p>for all layers l in M t G do 15:</p><p>) {Server} 16:</p><p>for all clients c in C sel do 17: G is used for accumulating local updates (line 19), and (l) is for the weighted average of these local updates (line 20). (l)  considers the number of data samples for each client of line 1, N Dc , aligning with the original FedAvg algorithm <ref type="bibr">[24]</ref>. If the server cannot even access the number of data samples, we take N Dc = 1. After accumulation, the server can obtain the updated global model (line 24) by element-wise dividing M 0 (l) G by (l) for every layer l (line 22). The algorithm continues through these rounds until predefined criteria are met. The overall FedFA process is visually summarized in Figure <ref type="figure">2</ref>. We also show the effectiveness of heterogeneous network aggregation strategies in Appendix D and the convergence analysis of FedFA in Appendix E.</p></div>
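The normalization-and-accumulation loop (lines 12-22 of Algorithm 1) can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: Python dicts of flat weight lists stand in for tensors; `fedfa_aggregate`, `under_95th`, and `l2` are hypothetical helper names; and the 95th-percentile filter is applied element-wise per layer:

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def under_95th(v):
    """Keep entries whose magnitude is at or below the 95th-percentile magnitude."""
    s = sorted(abs(x) for x in v)
    return [x for x in v if abs(x) <= s[int(0.95 * (len(s) - 1))]]

def fedfa_aggregate(updates, n_samples):
    """updates: {client: {layer: weight deltas}}; n_samples: {client: N_Dc}."""
    out = {}
    for layer in next(iter(updates.values())):
        # Mean L2 norm of sub-95th-percentile deltas across clients (line 15).
        norms = {c: l2(under_95th(u[layer])) for c, u in updates.items()}
        mean_norm = sum(norms.values()) / len(norms)
        acc = [0.0] * len(next(iter(updates.values()))[layer])
        beta = 0.0
        for c, u in updates.items():
            alpha = mean_norm / norms[c] if norms[c] > 0 else 1.0  # scaling factor
            for i, w in enumerate(u[layer]):
                acc[i] += n_samples[c] * alpha * w   # accumulate (line 19)
            beta += n_samples[c]                     # weighted-average denominator
        out[layer] = [a / beta for a in acc]         # element-wise divide (line 22)
    return out

# Two grafted clients whose updates differ by 10x in scale contribute equally:
updates = {"A": {"conv1": [1.0, 1.0, 1.0, 1.0]},
           "B": {"conv1": [10.0, 10.0, 10.0, 10.0]}}
result = fedfa_aggregate(updates, {"A": 1, "B": 1})
```

In this toy run both clients are normalized to the same layer norm before averaging, so neither dominates the aggregate despite the tenfold scale gap.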
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Layer Grafting Method for Ensuring Security</head><p>In FL, client models can vary in architecture due to differences in computational resources and data characteristics. This heterogeneity can lead to adversarial attacks during the aggregation of local models, potentially compromising the security of the global model. To address these issues, we introduce the layer grafting method (line 11 in Algorithm 1), which ensures uniformity in model architectures while accommodating client-specific characteristics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Algorithm 2</head><p>The Layer Grafting method. This algorithm standardizes the depth of each section M </p><p>end if 10: end for</p><p>The layer grafting method, as described in Algorithm 2 and illustrated in Figure <ref type="figure">6</ref> in the Appendix, aims to standardize the depth of each section across the FL client models. In this context, a 'section' is a part of the model where residual blocks share the same sequence of filter numbers in layers. A single model may comprise multiple such sections, each containing several residual blocks.</p><p>This addition is iteratively performed until the section reaches the specified maximum depth (lines 4-9 of the Algorithm 2). This systematic addition of residual blocks guarantees a consistent depth across all client models, thereby preserving architectural coherence within the FL network. Further details and the rationale behind layer grafting, particularly regarding the similarity of layers within residual blocks, are elaborated in Appendix B.</p><p>To see how layer grafting mitigates the potential risk of backdoor or poisoning attacks in aggregations across heterogeneous client architectures <ref type="bibr">[6,</ref><ref type="bibr">16,</ref><ref type="bibr">41]</ref>, we examine the aggregation for the global FL model with layer grafting, assuming the commonly used averaging method <ref type="bibr">[24]</ref>:</p><p>In this simple aggregation formula <ref type="bibr">[24]</ref>, M t G and M t 1</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G</head><p>represent the global models at iterations t and t 1, respectively. N denotes the total number of clients, and M t c are the updates from each individual client c. This equation illustrates how the influence of malicious updates is diluted in a complete aggregation, especially for a large number (N ) of clients. This mitigation is effective if the updated weights, M t 1 G + M t c and M t 1 G + M t malicious are on the same scale.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Scalable Aggregation: Normalization of Local Model Weights</head><p>In addressing scale variations stemming from heterogeneous network architectures in FL, we incorporate a crucial normalization step in the model aggregation process (lines 18-19 in Algorithm 1). This involves applying scaling factors to the weight updates from each client model to ensure balanced contributions in the aggregated model. The scaling factor, denoted as</p><p>c for layer l of each local model, is calculated in response to the diverse scales of weight updates from different network architectures. These factors are determined based on the L2 norm to prevent larger updates from disproportionately influencing the global model.</p><p>The formula for the scaled weights in the FedFA framework is:</p><p>Here, M 95%,c || signifies the weights under the 95th percentile in layer l for client c. We utilize the 95th percentile as it effectively mitigates the impact of outliers, which could otherwise skew the accuracy of scale calculations. This approach is beneficial in reducing the influence of anomalous weight values that may arise from noisy data or atypical client models.</p><p>This normalization process, with the scaling factor</p><p>, applied layerwise, ensures that the aggregated model accurately reflects the diverse architectures in the network. Furthermore, the averaging component, represented by the numerator</p><p>95%,&#63743; ||, moderates the convergence speed by averaging the magnitude of updates across the participating clients. This leads to a more balanced and representative global model, adjusting for scale variations across different client models and enhancing the overall fairness of the FL system.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Global Model Distribution Step</head><p>The model distribution step, detailed in Algorithm 3 in the Appendix, focuses on tailoring the aggregated global model to align with the unique architectural requirements of each client (line 7 in Algorithm 1). This critical process involves modifying the global model to conform to each client's model's specific depth and width weights. To achieve this, the algorithm systematically adjusts the global model by reducing its depth (lines 3-6 in Algorithm 3) and width (lines 8-11 in Algorithm 3) to those specified by each client's architecture. By following this procedure, the global model is effectively customized, making it compatible with the diverse architectures of all participating clients in the FL network.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Evaluation</head><p>This section evaluates FedFA's performance by benchmarking its testing accuracy, robustness against backdoor attacks, and computational complexities. This evaluation uses local and global test datasets, comparing FedFA with previous aggregation strategies offering width and depth flexibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Experimental Setup</head><p>We assess Pre-ResNet, MobileNetV2, and EfficientNetV2 using the CIFAR-10, CIFAR-100, and Fashion MNIST datasets in IID and non-IID environments. In IID settings, each client has samples from all classes, with a uniform data distribution where the minimum number of samples for a client can be up to half the maximum number of samples for any other client.</p><p>For Non-IID settings, clients get samples from 20% of the dataset classes but maintain equal samples for each class they hold. Here, during local training, clients zero-out logits for absent classes. We replace typical batch normalization layers with static versions, as seen in HeteroFL <ref type="bibr">[6]</ref>. We also utilized a language model with a Transformer using WikiText-2 to demonstrate that our method is generalizable. Detailed network structures are presented in Table <ref type="table">4</ref>, and more training details are in Table <ref type="table">6</ref> in the Appendix.</p><p>Evaluations were conducted under four scenarios with varying impacts of backdoor attacks from malicious clients. Here, the backdoor attacks involve the random shuffling of the data labels among clients to induce misclassification. Scenarios have different portions of malicious clients over entire local clients (0 %, 1 %, and 20 %) and two intensities of attacks, ( = 1, 20 in Eq. 1). Also, for all scenarios, we assume that half of the clients have limited computational resources and choose the smallest architectures. 
The other clients choose their architectures employing ZiCo <ref type="bibr">[18]</ref>, a cost-effective NAS method that requires only forward passes and uses an evolutionary algorithm. This method decides local network architectures among the network candidates specified in Table <ref type="table">5</ref> in the Appendix, based on local data for each client.</p><p>[Figure 3: Test accuracy (%) of FedFA (Depth) vs. FlexiFed, FedFA (Width) vs. HeteroFL, and FedFA (Both) vs. NeFL under the scenarios λ=1, 0%; λ=1, 1%; λ=1, 20%; and λ=20, 20%, for a) CIFAR-10 with Pre-ResNet, b) CIFAR-100 with MobileNetV2, and c) Fashion MNIST with EfficientNetV2, reporting IID global, non-IID local, and non-IID global test accuracies.]</p></div>
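The logit-masking detail mentioned in the non-IID setup can be sketched as follows (the zero-masking convention follows the text above; real implementations often mask with large negative values before the softmax instead):

```python
def mask_logits(logits, present_classes):
    """Zero out logits of classes absent from a client's local data, so the
    local loss is driven only by classes the client actually holds."""
    return [z if k in present_classes else 0.0 for k, z in enumerate(logits)]

# A client holding only classes 0 and 3 out of four classes:
logits = [2.0, -1.0, 0.5, 3.0]
assert mask_logits(logits, {0, 3}) == [2.0, 0.0, 0.0, 3.0]
```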
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Baselines and Metrics</head><p>After aggregating the local models, we calculate the global testing accuracy using a global test dataset. In non-IID settings, we additionally use several local test datasets extracted from the global dataset, ensuring they reflect the local clients' class distributions. After local training and before aggregation, we test the local models to determine an average local testing accuracy. This metric allows us to measure the effectiveness of local personalization. For the Transformer model, we use average local perplexity to assess the performance of the local clients' language models after local training for every round. Here, perplexity measures how well a probability model predicts a sample. To evaluate computational complexity, we rely on multiply-accumulate (MAC) calculations. MACS n=i is the MAC for one local epoch of a local model with given local data, differentiated by architecture n = i. N n=i counts such architectures in the FL system. The average MAC is MACS given varied complexities among local models. The MAC is MACE = N p &#8226; MACS for a single local epoch with all clients. The FL system's total MAC, TMAC, is found by multiplying total rounds of aggregation steps, T , by local epochs, E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Evaluation</head><p>Testing Performance and Robustness Against Backdoor Attacks Table 1 shows the testing results of scenarios with varying model depths and widths for FedFA and previous flexible aggregation strategies in width and depth. Each scenario was tested three times, and the table presents the average results.</p><p>When only depth is varied, FedFA outperforms FlexiFed with a 1.00 (equivalent) to 1.04 times improvement globally in IID, and 0.98 to 1.02 times locally and 1.00 to 1.20 times globally in non-IID settings. With width variations, FedFA exceeds HeteroFL, achieving 1.01 to 1.16 times better accuracy globally in IID, 0.99 to 1.13 times locally, and 1.00 to 1.16 times globally in non-IID settings. For combined width and depth changes, FedFA surpasses NeFL, showing a 1.00 to 1.02 times improvement globally in IID, 0.99 to 1.06 times locally, and 0.95 to 1.06 times globally in non-IID settings. Overall, FedFA outperforms other heterogeneous strategies in testing accuracy except in 2 out of 9 scenarios.</p><p>As shown in Figure <ref type="figure">3</ref>, backdoor attack scenarios reveal more distinct differences. When varying only the depth, FlexiFed experiences a more significant drop in testing accuracy than FedFA, with decreases of 1.22 to 2.09 times globally in IID, 2.21 to 3.31 times locally, and 1.21 to 1.74 times globally in non-IID settings. With width variation only, HeteroFL sees a testing accuracy drop of 1.07 to 1.47 times globally in IID, 0.89 to 2.37 times locally, and 1.11 to 1.22 times FedFA is remarkably robust against backdoor attacks on the global model. Lastly, to demonstrate the generality of FedFA, we also examine its performance with transformers. 
The earlier strategies exhibit perplexities that are 1.07 to 4.50 times higher than those of FedFA, as detailed in Table <ref type="table">3</ref>.</p><p>Computational Complexity. FedFA, employing layer grafting and scalable aggregation, has slightly higher computational complexity than earlier heterogeneous methods. Yet, to reach targeted testing accuracies of 70% (IID) and 40% (non-IID) on CIFAR-10, 25% (both IID and non-IID) on CIFAR-100, and 80% (IID) and 30% (non-IID) on Fashion MNIST, its computational complexities are only 0.95 to 1.02 times those of the earlier strategies. This indicates that FedFA's computational overhead is only marginal, as presented in Table <ref type="table">2</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Future Work and Limitations</head><p>Future research could enhance FedFA's scalability for larger and more complex networks by optimizing algorithms to reduce the communication overhead and computational burden on clients. Developing dynamic client participation algorithms based on resource availability and network conditions could improve resource efficiency and model convergence speed but require careful mechanisms to handle clients' heterogeneous architectures. Advanced security mechanisms are necessary to detect and mitigate a broader range of adversarial attacks beyond the backdoor attacks we consider, including those exploiting model aggregation vulnerabilities. Further personalizing model architectures based on client data characteristics using advanced NAS techniques could improve local model performance. Additionally, integrating FedFA with edge computing paradigms could address latency, bandwidth, and real-time processing challenges in highly distributed environments.</p><p>Despite its advantages, FedFA has limitations that need to be addressed. One significant limitation is that all clients must employ the same type of network architecture, such as all using ResNets, MobileNets, or EfficientNets. This uniformity can restrict the flexibility and efficiency of the system, especially when dealing with diverse client capabilities and requirements. Future work should focus on enabling support for heterogeneous network types within the same federated learning framework to accommodate the variety of client devices better and improve overall performance and scalability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>This paper introduces FedFA, a width-and depth-flexible aggregation strategy designed for clients with diverse computational and communication requirements and their local data. FedFA safeguards the global model against backdoor attacks through the layer grafting method. Furthermore, it introduces a scalable aggregation method to manage scale variations among networks of differing complexities. Compared to previous heterogeneous network aggregation methods, FedFA has shown superior testing performance and robustness to backdoor attacks, establishing its feasibility as a solution for various FL applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Testing Specifications</head><p>Table <ref type="table">4</ref>. Details of the network architectures for Pre-ResNet, MobileNetV2, Efficient-NetV2, and Transformer are presented.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pre-ResNet MobileNetV2 EfficientNetV2</head><p>Section</p><p>-d fc Transformer Encoder Attention 192-d fc 64-d fc FeedForward 3 &#215; 3, 64 3 &#215; 3,64 &#215; 2 Attention 192-d fc 64-d fc FeedForward 3 &#215; 3, 64 3 &#215; 3,64 &#215; 2 28782-d fc Decoder 28782-d fc Attention 192-d fc 64-d fc FeedForward 3 &#215; 3, &#119908; 1 3 &#215; 3,64 &#215; &#119889; 1 Attention 192-d fc 64-d fc FeedForward 3 &#215; 3, &#119908; 1 3 &#215; 3,64 &#215; &#119889; 1 Classifier 28782-d fc  property of CNNs with skip connections, and the focus of our layer grafting approach, is that CNNs with skip connections exhibit residual blocks with similar weight values. This similarity makes it possible to graft the last residual blocks of each section, playing a crucial role in guiding the aggregation of local models of different depths and informing the selection of layers during global model dissemination to local clients.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Residual Block Similarity and Its Implications</head><p>The similarity among residual blocks within a given section of a CNN can be empirically observed and has significant implications for model performance. Veit et al. <ref type="bibr">[40]</ref>'s results suggest that CNNs' performance is not drastically affected by removing or swapping certain residual blocks. This observation is crucial in understanding the resilience of CNNs with skip connections and forms the basis of our layer grafting strategy. Consider a skip connection network as shown in Figure <ref type="figure">4 b</ref>), divided into sections with their residual blocks (f x , for x = 1, 2, 3). Each block within a section typically employs a similar convolutional layer structure, and the model's output can be conceptualized as an average of results from various sub-model paths formed by these blocks.</p><p>For an input x, the output of a section with residual blocks can be expressed as:</p><p>Swapping blocks, say f 1 and f 2 , alters the equation but does not significantly affect the output, suggesting that f 1 (x) and f 2 (x) have similar contributions. This observation leads us to an important relationship:</p><p>Algebraic manipulation then brings us to the conclusion that</p><p>residual blocks within a given CNN are similar. This reasoning can be generalized to larger numbers of residual blocks with various layer structures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Statistical Evidence for Layer Similarity</head><p>We also statistically validate layer similarity. We consider the three different CNNs shown in Table <ref type="table">4</ref>: Pre-ResNet, MobileNetV2, and EfficientNetV2, which are respectively trained on the CIFAR-10, CIFAR-100, and Fashion MNIST datasets. We use models of different depths d k , with d k determining the number of residual blocks in each convolutional layer section, as shown in the table.</p><p>We do this by using the Pearson Correlation Coefficient (PCC). While the PCC has been used to assess correlations between filter weights, e.g., to facilitate model pruning <ref type="bibr">[17]</ref>, we directly utilize the PCC to gauge the similarity between convolutional layers and thus residual blocks. A high PCC value, combined with our previous result on similarities in weight scales, would indicate similarities in residual block weights.</p><p>To measure the similarity of two residual blocks, we first consider using the PCC to assess the similarities of two convolutional layers, A and B, in distinct residual blocks. Each layer contains N Cout filters (i.e., output channels), with each filter possessing N Cin weight maps (i.e., input channels). We use r i,j k,l to represent the PCC value of the k-th weight map of the i-th filter of layer A and the l-th weight map of the j-th filter of layer B.</p><p>We use R i,j to denote the matrix of the PCCs for all pairs of weight maps in the i-th filter of layer A and the j-th filter of layer B. Our goal is now to represent R i,j with a single scalar encapsulating the overall similarities of filters i and j.</p><p>Since weights in networks are initialized randomly for every new training iteration, filter sequences can differ, even with identical input feature maps, leading to diverse output feature map sequences (Figure <ref type="figure">5</ref>) <ref type="bibr">[8,</ref><ref type="bibr">12]</ref>. 
Importantly, these outputs then serve as input for subsequent convolutional layers, influencing the sequence of weight maps within each filter.</p><p>To match these sequences, we thus select the element with the highest PCC in each row of R^(i,j), with the constraint that only one element per column of R^(i,j) is selected. We then compute the average of the selected N_Cin elements, denoted as r̄_(i,j). Since we need to account for potential overlapping weight maps, any column with a selected element is excluded in subsequent steps.</p><p>When assessing two convolutional layers with N_Cout filters each, there are N_Cout^2 filter pair combinations. Just as with weight map similarity, we create a one-to-one filter matching, since previous research has highlighted the existence of many similar filters within a single layer <ref type="bibr">[17]</ref>. This one-to-one matching avoids pairing a given filter with an excessive number of overlapping filters and overstating the overall layer similarity. We then average the PCCs r̄_(i,j) of the matched filters.</p><p>The similarities between the convolutional layers of two models with varying depths are presented in Tables <ref type="table">7</ref>, <ref type="table">8</ref>, and <ref type="table">9</ref>. We consider models at epoch 0 (i.e., before training) and at the epoch where they achieve their highest testing accuracy. We exclude the first residual block of each section from our analysis, as it typically has a different input channel size compared to the other residual blocks in the same section.</p><p>From Tables <ref type="table">7</ref>, <ref type="table">8</ref>, and <ref type="table">9</ref>, all convolutional layers within a particular section have similarities greater than 0.5, regardless of the presence of skip connections. CNNs with skip connections usually show increased correlations post-training, with exceptions in 43 of the 138 cases (emphasized in bold). 
In contrast, CNNs without skip connections reveal decreased correlations post-training in 76 of the 138 cases (emphasized in bold).</p><p>Prior to training, high similarities are observed in both types of CNNs, suggesting potential matching of filters or weight maps, possibly due to the law of large numbers. This trend is even more evident as filter counts rise from lower to higher sections. Notably, while CNNs with skip connections mostly show an increase in similarity, over half the cases without them deviate from this trend.</p><p>This data shows that the weights of residual blocks within the same section of CNNs with skip connections remain similar, irrespective of depth or structural variations. We thus validate layer grafting to aggregate client models, as well as our method of sending model weight subsets to each client.</p></div>
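<div xmlns="http://www.tei-c.org/ns/1.0"><p>The PCC computation and greedy one-to-one matching described above can be sketched as follows. Layers are represented as nested lists of flattened weight maps, a simplification of real convolutional tensors, and the greedy row-wise selection stands in for the matching rule in the text.</p><p><![CDATA[
```python
# Sketch of the PCC-based similarity measure: compute the PCC between weight
# maps, then greedily match each map of layer A to an unused map of layer B
# (highest PCC first per row), and average the selected PCCs.

def pcc(a, b):
    """Pearson correlation coefficient of two equal-length weight lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def matched_similarity(maps_a, maps_b):
    """Greedy one-to-one matching: best unused column per row, then average."""
    used, picked = set(), []
    for wa in maps_a:
        best_l, best_r = None, -2.0
        for l, wb in enumerate(maps_b):
            if l in used:
                continue
            r = pcc(wa, wb)
            if r > best_r:
                best_l, best_r = l, r
        used.add(best_l)          # exclude this column from later rows
        picked.append(best_r)
    return sum(picked) / len(picked)

# Two filters whose weight maps are permuted copies of one another.
f_a = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
f_b = [[4.0, 3.0, 2.0, 1.0], [1.0, 2.0, 3.0, 4.0]]
print(matched_similarity(f_a, f_b))  # ~1.0: matching recovers the permutation
```
]]></p></div>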
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 Layer Grafting Based on Residual Block Similarity</head><p>The observed similarity among residual blocks in CNNs allows us to infer that any block from the same section is a viable candidate for grafting. This similarity supports the layer grafting method in two ways:</p><p>-Depth Modification: When modifying the depth of the global model for local models, any residual block from the same section can be used, ensuring consistency in feature representation and learning capability. -Model Adaptability: The similarity in blocks enables a more flexible approach to model aggregation and dissemination, as blocks can be added or removed without significantly impacting model performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.5 Summary</head><p>The statistical analysis of the layer grafting method in CNNs with skip connections underscores the viability of this approach in FL. By leveraging the inherent similarity among residual blocks, the layer grafting method not only preserves model integrity but also enhances adaptability and robustness across diverse client models. This analysis affirms the soundness of layer grafting as a key component in our FedFA framework in Figure <ref type="figure">6</ref>.  C Global Model distribution Algorithm Algorithm 3 This algorithm customizes the aggregated global model M t G for each client c by adjusting its structure to align with its original local network architecture, M t 1 c (line 7 of Algorithm 1). The global model has the number of residual blocks for section s, N (s) depth,max , whereas the local model, M t 1 c , has the number of residual blocks for each section, N (s) depth,c and the input and the output channel sizes for of each layer, C I and C o . Customization involves reducing the depth and width of M G by systematically removing residual blocks and filters over the client-specific thresholds. In the algorithm, the operator signifies the reverse of the layer grafting process. Require: Updated global model M t G . Ensure: Each client c receives an appropriately configured version of M t G , M t c . 1: for each section M (s) G in global model M t G do 2: D N (s) depth,max N (s) depth,c 3: for d = 1 to D do 4: R (s) last last residual block in section s 5: M (s) G M (s) G R (s) last 6: end for 7: end for 8: for each layer M (l) G in global model M t G do 9: CI , Co Input and output channel sizes of M (l) c 10: M (s) G M (s) G [: Co, : CI ] 11: end for 12: M t c M t G</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Efficacy of Heterogeneous Network Aggregation</head><p>This section investigates the theoretical underpinnings of heterogeneous network aggregation. Due to their simpler structure, shallow networks demonstrate a faster convergence rate than deeper networks. However, deeper networks are more adept at capturing complex features and hierarchical data structures, translating to superior performance on complex tasks <ref type="bibr">[25]</ref>. Therefore, aggregating shallow and deep models can accelerate convergence relative to only deep models and enhance performance compared to solely shallow models. This theoretical examination focuses on how the speed of increase in prediction variance (the output logits of the classifier) differs between shallow and deep models in classification tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Variance Analysis Across Model Complexities</head><p>When a model is fully trained in classification tasks, its output logits typically approach 1s for the correct classes and 0s for others, assuming one-hot encoding. This indicates that the variance of the output logits generally increases until the models are completely adapted to their data. This subsection explores the relationship between model complexity and variance in model predictions for classification tasks, analyzing models of varying complexities to discern their impact on predictive variability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preliminaries and Assumptions</head><p>To guide our analysis, we define essential concepts and assumptions:</p><p>-Model Output Functionality: In classification tasks, consider the model's output logits vector, y, is a function of the input x and weight vectors w i for the index of each layer i. We assume that the activation function g negates the correlated outputs of w i for analytical simplicity. Additionally, appropriately clipped weights w i can compute the output y without needing a softmax function. Thus, y can be expressed as a linear combination of g(w i , x), formulated as y = P i g(w i , x). Also, the analysis assumes the optimization process employs full batch gradient descent.</p><p>-Law of Large Numbers (LLN) Application: The LLN indicates that as the number of trials increases, the average of the outcomes converges to the expected value. This principle is instrumental in understanding the variance behavior with increasing model complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Complexity and Variance</head><p>We explore the relationship between model complexity (number of weights) and variance in a structured manner. The total variance of y is articulated through the law of total variance:</p><p>Here, E[Var(y|x)] signifies the expected value of the conditional variance of y given x, and Var(E[y|x]) represents the variance of the expected value of y given x.</p><p>Analyzing each component, assuming g(w i , x) is independent for each i:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>!</head><p>In a simple or linear model where f represents the model function, Var(y 0 ) could be nearly zero if P i g(w i,0 , x) yields consistent outputs. However, this variance may not be negligible for more complex models, though it remains relatively small.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Post-Training Variance Dynamics</head><p>After training, the weights w i are updated to minimize the loss function L, introducing variability in w i,t+1 which, in turn, injects diversity into the model's output: w i,t+1 = w i,t + &#8984;rL(y t , y &#8676; ) and L(y t+1 , y &#8676; ) &#63743; L(y t , y &#8676; ) As w i,t+1 is continually adjusted to reduce L, reflecting the learning process from the training data, the variance Var(y t+1 ) tends to increase, acknowledging the diversity in responses due to this learned variability: Var(y t+1 ) Var(y t ) for the training data x.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Toward a Specific Target Variance</head><p>This subsection systematically examines how the weights of deep and shallow models affect the speed toward a specific target variance Var target and finally shows that aggregating deep and shallow models is beneficial.</p><p>From earlier discussions:</p><p>1 N shallow Var(y t ) = Var shallow,t Var deep,t = 1 N deep Var(y t ) Var(y t+1 ) Var(y t )</p><p>We know from the given conditions that the accumulated variances for shallow and deep models equalize at certain times T shallow and T deep for y &#8676; , respectively:</p><p>Let r shallow and r deep represent the average rates of variance growth for shallow and deep models, respectively. These rates are defined as the change in variance per training iteration. Over time, the variances for shallow and deep models can be represented as a function of their growth rates and time:</p><p>Plugging these into the equation of accumulated variances gives:</p><p>To find a direct relationship between r shallow and r deep , rearrange the equation:</p><p>and knowing T deep T shallow <ref type="bibr">[25]</ref>, Therefore, we can infer that:</p><p>This implies that shallow models exhibit a faster increase in variance compared to deep models, illustrating that the variance in predictions of a shallow model increases more quickly during training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Aggregation Dynamics of Deep and Shallow Model Weights</head><p>The aggregation of weights from both deep and shallow models creates a new dynamic in the variance increase rate of the combined model. This can be quantified as:</p><p>where &#8675; is a factor that balances the contributions of deep and shallow model weights with 0 &lt; &#8675; &lt; 1. This formula shows that combining the weights of deep and shallow models in the combined model configuration yields a more marked increase in the variance rate per weight compared to using only the deep model's weights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Convergence Analysis of Gradient Aggregation in Federated Learning with Flexible Architectures</head><p>In this section, we show the convergence rate of our FedFA's object function in the context of FL <ref type="bibr">[24]</ref>.</p><p>Preliminary Concepts</p><p>1. Mitigation of Skewness Introduced by Data Heterogeneity: Different client local weights and data distributions, denoted as ! 1 , ! 2 , . . . , ! c and D 1 , D 2 , . . . , D c , yield gradients rf 1 , rf 2 , . . . , rf c that can vary significantly in magnitude. Formally:</p><p>To counteract this variability, we employ the L2 norm for scaling. The L2 norm is calculated based on the central 95% of the weights under the 95th percentile:</p><p>Unlike the FedFA's complete Algorithm 1, we assume ! c is a one-layered neural network architecture for simplicity. Let |C| symbolize the total client count within the FL framework. C is the participating rate of clients for each round. Convergence Analysis To understand the convergence of our algorithm in federated learning, we analyze it under two primary mathematical properties: strong convexity and Lipschitz continuity of the gradient.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumptions</head><p>1. The global loss function f (!) is &#181;-strongly convex, where &#181; is a positive real number, ! is weights. This property guarantees a unique minimum for f (!), which aids in the convergence of the optimization process. 2. The gradient of f (!) exhibits Lipschitz continuity with a constant denoted as L. Given that L is a non-negative real number, this continuity ensures that the gradient variations between consecutive iterations are bounded, ensuring stability during the optimization updates.</p><p>Analysis If a function f (!) is &#181;-strongly convex, then for every ! 1 and ! 2 :</p><p>The function of f (!) is L-Lipschitz continuous when:</p><p>The iterative process of full batch gradient descent updates the weight ! as:</p><p>Here, &#8984; is the learning rate. Thus, we can deduce:</p><p>We start with the strong convexity property:</p><p>Considering the Lipschitz continuity, the function value change due to a step in the direction of the gradient is:</p><p>Combining both inequalities, we get:</p><p>This inequality gives us a bound on the magnitude of the gradient at iteration t. If the magnitude of the gradient decreases (or remains below a certain threshold), this indicates convergence towards an optimum.</p><p>To understand the convergence properties of the FedFA framework, let's break down the gradient's behavior and its associated norms. Given the global gradient at iteration t as:</p><p>w/ scaled k We can express it as an aggregation of the scaled gradients from each client: kr X c2C sel p c &#8677; rf t,scaled c ) ! k The scaling factor &#8629; t c alters the gradient norm: krf t G,w/ scaled k &#63743; max c2C sel {&#8629; t c } &#8677; krf t G,w/o scaled k Using the bound from the unscaled case: krf t G,w/ scaled k &#63743; max c2C sel {&#8629; t c } &#8677; 2L 2 + &#181;&#8984; With the scaled full batch gradient descent update, the difference becomes: k! t+1 ! 
t k = &#8984;krf t G,w/ scaled k Substituting the bound for the scaled gradient norm: k! t+1 ! t k &#63743; &#8984; &#8677; max c2C sel {&#8629; t c } &#8677; 2L 2 + &#181;&#8984; Therefore, the convergence rate is: O max c2C sel ( 1 m P &#63743;2C sel ! 95%,&#63743; ! 95%,c ) &#8677; 2L&#8984; 2 + &#181;&#8984;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>!</head><p>This suggests that the convergence of FedFA is sensitive to the learning rate, scaling factors, and the inherent attributes of the loss function. It is crucial to recognize that the convergence rate usually fluctuates over time based on the selection of the learning rate &#8984;, which could be either constant or adaptive, tending towards zero as the number of iterations t increases.</p><p>Baseline models to each Model k, we observe variations according to network complexities, influenced by varying depth and width. Specifically, in Pre-ResNet, the distances are 0.98 1.36 times greater than the Baseline models' average magnitude of weights. For MobileNetV2, these distances range from 1.20 1.68 times the Baseline's average magnitude, and for EfficientNetV2, they are 1.35 1.70 times greater. These findings empirically show the scale variations linked to network complexities and highlight the necessity for a scalable aggregation method in our FedFA framework.</p><p>Table <ref type="table">10</ref>. According to network complexities, each network has a distinct scale in its weights, leading to scale variations across heterogeneous network architectures.  c) EfficientNetV2 Fig. 9. Weights distributions across different architectures of the first (left) and the last layers (right) for EfficientNetV2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model</head></div></body>
		</text>
</TEI>
