<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>FedTLU: Federated Learning with Targeted Layer Updates</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>04/06/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10610463</idno>
					<idno type="doi">10.1109/ICASSP49660.2025.10888369</idno>
					
					<author>Jong-Ik Park</author><author>Carlee Joe-Wong</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Federated learning (FL) addresses privacy concerns in training language models by enabling multiple clients to contribute to the training without sending their data to others. However, non-IID (not independent and identically distributed) data across clients often limits FL's performance. This issue is especially challenging during model fine-tuning, as noise due to variations in clients' data distributions can harm model convergence near stationary points. This paper proposes a targeted layer update strategy for fine-tuning in FL. Instead of randomly updating layers of the language model, as is often done in practice, we use a scoring mechanism to identify and update the most critical layers, avoiding excessively noisy or even poisoned updates by freezing the parameters in other layers. We show in extensive experiments that our method improves convergence and performance in non-IID settings, offering a more efficient approach to fine-tuning federated language models.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Data-driven language models have recently enabled major advances in natural language processing applications such as conversational agents and machine translation <ref type="bibr">[1]</ref>- <ref type="bibr">[4]</ref>. However, the widespread use of language data to train these models raises privacy concerns, as language data from individuals is personal and prone to exploitation for malicious purposes <ref type="bibr">[4]</ref>- <ref type="bibr">[6]</ref>.</p><p>Federated Learning (FL) addresses these concerns by training a global model without centralizing user data <ref type="bibr">[7]</ref>- <ref type="bibr">[9]</ref>. However, one of the core challenges in FL is developing a global model that performs well across clients, whose data can vary significantly in size and distribution <ref type="bibr">[10]</ref>, <ref type="bibr">[11]</ref>. This non-IID nature of client data often causes federated models to underperform compared to centralized models <ref type="bibr">[10]</ref>, <ref type="bibr">[12]</ref>, <ref type="bibr">[13]</ref>. In language models, this heterogeneity can manifest as clients exhibiting distinct linguistic patterns, content, and topics, complicating model convergence and generalization <ref type="bibr">[14]</ref>- <ref type="bibr">[16]</ref>. Toward the end of training, as the model nears a stationary point, stochastic local updates, amplified by data heterogeneity, create variability that makes it difficult to maintain global model consistency <ref type="bibr">[17]</ref>, <ref type="bibr">[18]</ref>. 
This variability slows convergence and prevents the global model from reaching optimal performance <ref type="bibr">[17]</ref>- <ref type="bibr">[20]</ref>.</p><p>While local fine-tuning helps models adapt to client-specific data <ref type="bibr">[21]</ref>, <ref type="bibr">[22]</ref>, fine-tuning only local models can lead to divergence: each client's model becomes too specialized to its local data and tends to overfit, reducing the global model's overall effectiveness across all clients <ref type="bibr">[23]</ref>. Moreover, individual clients often lack enough data for effective independent fine-tuning <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>. A well-optimized global model, updated across clients, provides a stronger starting point <ref type="bibr">[26]</ref>. Once the global model is aligned, local fine-tuning can enhance specific client tasks without diverging from the global objective. FL therefore requires a well-generalized global model to meet the collective needs of all participants, and our primary focus in this work is on fine-tuning the global model to ensure consistency across clients, with local tuning as an optional step for personalization.</p><p>In FL, a central server periodically aggregates locally updated client models. However, uniformly updating all model layers during aggregation, especially in the final stages, can amplify noise from certain clients, degrading performance and hindering convergence <ref type="bibr">[27]</ref>, <ref type="bibr">[28]</ref>. Toward the end of training, as the model nears a stationary point and gradients shrink, noisy or non-representative updates from certain clients can disproportionately affect the global model <ref type="bibr">[8]</ref>. We can mitigate the impact of such noisy updates by selectively updating only the most important parameter sets, such as layers, which are likely to significantly reduce global loss. However, the key challenge in this approach is determining which layers to update. 
Clients do not know the updates made by other clients and have limited information about the criticality of each layer. Therefore, server-side coordination is needed to aggregate client information and determine the most critical layers to update, though the server has only limited knowledge of the heterogeneous data distributions across clients.</p><p>In this context, we propose Federated Learning with Targeted Layer Updates (FedTLU). FedTLU selectively identifies and updates only the most critical layers to reduce global loss on the server side. By focusing on these key layers, FedTLU enables more efficient training, faster convergence, and improved global model performance across diverse clients. In this work, we make three primary contributions:</p><p>&#8226; We propose a novel server-side scoring mechanism that identifies critical layers to update without relying on client-side knowledge.</p><p>&#8226; FedTLU is designed to be robust against noisy updates, even in environments with non-representative or noisy client data.</p><p>&#8226; FedTLU consistently outperforms random and last-layer updates, achieving better test performance by up to 7.86% globally and up to 8.27% locally, particularly during the noisy fine-tuning phase.</p><p>We review related work in Section II, present the FedTLU algorithm in Section III, and offer theoretical support for FedTLU's design in Section IV. Section V presents experimental results that validate our approach, and we summarize our findings in Section VI.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>In both centralized learning (CL) and FL, fine-tuning is widely recognized for improving model performance, particularly during the final stages of training. In CL, where all data is stored on a central server, fine-tuning often involves updating only a small subset of model parameters, such as layers near the output nodes <ref type="bibr">[29]</ref>, or even selecting layers randomly for updating <ref type="bibr">[30]</ref>- <ref type="bibr">[32]</ref>. This selective approach improves convergence and reduces computational costs <ref type="bibr">[33]</ref>, <ref type="bibr">[34]</ref>. Refining a subset of layers allows the model to adjust task-specific representations while leveraging previously learned features <ref type="bibr">[9]</ref>, <ref type="bibr">[35]</ref>. These layers also can be selected based on their importance, determined through explainable neural network techniques like sensitivity or criticality scoring that assess how vital certain parameters are for generalization <ref type="bibr">[36]</ref>, <ref type="bibr">[37]</ref>. Such methods prioritize updates where they will have the most significant impact.</p><p>Similarly, fine-tuning a subset of layers in FL can benefit convergence. FL can theoretically mirror CL under certain conditions, particularly if there are enough communication rounds between clients and the central server <ref type="bibr">[8]</ref>, <ref type="bibr">[38]</ref>. Thus, some fine-tuning techniques from CL can be adapted for FL scenarios <ref type="bibr">[22]</ref>, <ref type="bibr">[39]</ref>. However, CL techniques for scoring layers are difficult to extend to FL: in FL, clients lack information about others' updates, and the server can only aggregate what is provided. As a result, simpler strategies like randomly selecting layers or updating the last few layers are often used <ref type="bibr">[40]</ref>. 
However, fine-tuning only the last few layers is insufficient in non-IID settings, where data heterogeneity leads to highly varied learned representations <ref type="bibr">[41]</ref>- <ref type="bibr">[43]</ref>. This variability can hinder the global model's generalization ability, as the last layer represents drastically different outputs depending on each client's data. We therefore focus on updating the main body of the model.</p><p>We aim to update the most significant layers contributing to reducing global loss, where each layer's update is guided by the local gradient magnitudes and directions during local training, resulting in varied contributions from different layers <ref type="bibr">[21]</ref>, <ref type="bibr">[27]</ref>, <ref type="bibr">[44]</ref>. Rather than random selection, we enhance model stability by selecting impactful layers. By targeting key layers that effectively reduce global loss, we can improve convergence, especially in language tasks, addressing the limitations of random or fixed-layer updates <ref type="bibr">[45]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. METHODOLOGY</head><p>This section introduces FedTLU, a method for selecting layers to update during global aggregation in FL, particularly for training language models with non-IID data. The goal is to efficiently update the most critical layers or blocks to improve model convergence and generalization, especially in the final training stages.</p><p>We assume the usual FL training framework, in which training proceeds iteratively. In each round t, clients first compute local model updates based on their local data. These updated model parameters are then sent to the server, which aggregates them to form the new global model with parameters <formula notation="TeX">W_{\mathrm{aggre},i}^{(t+1)}</formula> in each layer i.</p><p>Layer scoring: To quantify the significance of each layer, we define a score for the weights <formula notation="TeX">W_i</formula> in each layer i as follows:</p><p><formula notation="TeX">\mathrm{Score}(W_i) = \frac{\lVert \Delta W_i \rVert_2}{\sqrt{n_i} \cdot \mathrm{std}(\Delta W_i)}</formula></p><p>where <formula notation="TeX">\lVert \Delta W_i \rVert_2</formula> is the L2 norm of the difference between the parameters before round t's update, <formula notation="TeX">W_i^{(t)}</formula>, and the aggregated parameters after the update, <formula notation="TeX">W_{\mathrm{aggre},i}^{(t+1)}</formula>; <formula notation="TeX">n_i</formula> is the number of parameters in layer <formula notation="TeX">W_i</formula>; and <formula notation="TeX">\mathrm{std}(\Delta W_i)</formula> is the standard deviation of the differences in parameters in layer i. This score is large in two cases: a) Large Numerator (<formula notation="TeX">\lVert \Delta W_i \rVert_2</formula>): A large numerator indicates substantial changes between the current and aggregated parameters, suggesting that the layer is making a significant contribution to reducing the global loss <ref type="bibr">[46]</ref>. b) Small Denominator (<formula notation="TeX">\sqrt{n_i} \cdot \mathrm{std}(\Delta W_i)</formula>): A small standard deviation <formula notation="TeX">\mathrm{std}(\Delta W_i)</formula> reflects consistent parameter changes, indicating an effective gradient update. 
The factor <formula notation="TeX">\sqrt{n_i}</formula> normalizes by layer size, allowing for fair comparison across layers <ref type="bibr">[44]</ref>, <ref type="bibr">[46]</ref>.</p><p>Training algorithm: Modern deep neural networks, particularly language models, often consist of repeated blocks of layers with the same sequence of channel sizes <ref type="bibr">[1]</ref>, <ref type="bibr">[3]</ref>. While individual layers may differ in parameters, repeated blocks typically have the same total parameter count. To leverage this structure, we group layers into blocks for comparison and updating, so that we can fairly compare blocks with similar roles. By using aggregated scores for these blocks, we account for gradient magnitude and consistency.</p><p>Algorithm 1 (FedTLU), partial listing:<lb/>5: Server sends <formula notation="TeX">W^{(t)}</formula> to the selected clients<lb/>6: for each client k &#8712; {Selected Clients} in parallel do<lb/>8: Client k sends <formula notation="TeX">W_k^{(t+1)}</formula> to the server<lb/>9: end for<lb/>12: BlockScores &#8592; COMPUTEBLOCKSCORE(LayerScores)<lb/>13: Group blocks by the sequence of parameters<lb/>14: for each group <formula notation="TeX">G_i</formula> do<lb/>22: return SelectedB<lb/>23: end function</p><p>During each round of aggregation, the process begins by grouping blocks with the same number of parameters into sets, allowing for structured comparison. For each block <formula notation="TeX">B_i</formula>, an aggregated score is computed by summing the scores of all layers within the block:</p><p><formula notation="TeX">\mathrm{Score}(B_i) = \sum_{W_j \in B_i} \mathrm{Score}(W_j)</formula></p><p>Algorithm 1 outlines the FedTLU process. In each global round (line 2), the server randomly selects a subset of clients based on the participation rate C (line 3). Selected clients receive the global model <formula notation="TeX">W^{(t)}</formula> and perform local updates (line 6), then return their updated models to the server (line 7). The server aggregates these models (line 9) and computes a score for each layer based on the difference between current and aggregated parameters (line 10). 
Aggregation methods can vary, including FedAvg <ref type="bibr">[7]</ref> or FedProx <ref type="bibr">[8]</ref>.</p><p>Block scores are computed by summing the scores of all layers within each block (line 11), and blocks are grouped by their parameter sequence for fair comparison (line 13). The top S blocks from each group are selected based on their scores (line 14), and the global model is updated with these blocks (line 16). This process repeats for T global rounds, focusing updates on the most important layers or blocks to improve convergence across heterogeneous clients.</p></div>
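<div xmlns="http://www.tei-c.org/ns/1.0"><p>The scoring and selection steps described in this section can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (layer_score, select_blocks, apply_selected), the use of the population standard deviation, the small constant eps that guards against division by zero, and the toy parameter values are all hypothetical choices made for the example.</p>
<p><code lang="python"><![CDATA[
```python
import math
from statistics import pstdev


def layer_score(old, new, eps=1e-12):
    """Score one layer: ||dW||_2 / (sqrt(n) * std(dW)).

    `old` and `new` are flat lists of the layer's parameters before and
    after aggregation. `eps` is an assumption (not from the paper) that
    avoids division by zero when every parameter changes by the same amount.
    """
    delta = [b - a for a, b in zip(old, new)]
    norm = math.sqrt(sum(d * d for d in delta))
    return norm / (math.sqrt(len(delta)) * pstdev(delta) + eps)


def select_blocks(old_blocks, new_blocks, top_s):
    """Return the names of the top_s highest-scoring blocks in each group.

    Blocks map a name to a list of layers. Blocks whose layers have the same
    sequence of parameter counts are placed in the same group.
    """
    groups = {}
    for name, old_layers in old_blocks.items():
        score = sum(layer_score(o, n) for o, n in zip(old_layers, new_blocks[name]))
        signature = tuple(len(layer) for layer in old_layers)
        groups.setdefault(signature, []).append((score, name))
    selected = set()
    for members in groups.values():
        members.sort(reverse=True)  # highest block score first
        selected.update(name for _, name in members[:top_s])
    return selected


def apply_selected(old_blocks, new_blocks, selected):
    """Keep aggregated parameters for selected blocks; freeze the rest."""
    return {name: new_blocks[name] if name in selected else old_blocks[name]
            for name in old_blocks}


# Toy example with two single-layer blocks of equal size (made-up numbers).
old = {"block0": [[0.0, 0.0, 0.0, 0.0]], "block1": [[0.0, 0.0, 0.0, 0.0]]}
new = {"block0": [[1.0, 1.1, 0.9, 1.0]], "block1": [[1.0, -1.0, 1.0, -1.0]]}
chosen = select_blocks(old, new, top_s=1)
updated = apply_selected(old, new, chosen)
print(chosen)
```
]]></code></p>
<p>In the toy example, both blocks change by a similar overall magnitude, but the changes in block0 are consistent across parameters while those in block1 alternate in sign, so block0 receives the higher score and is the one selected.</p></div>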
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. THEORETICAL ANALYSIS</head><p>In this section, we theoretically analyze why updating only a few layers of the global model during aggregations can benefit FL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Assumptions</head><p>We make the following assumptions in our analysis: a) Gradient Smoothness: The gradients of the global loss function are Lipschitz continuous, implying that the global loss function is smooth. Specifically, the loss function <formula notation="TeX">\mathcal{L}(W)</formula> is L-smooth, meaning that for any aggregated global models W and W':</p><p><formula notation="TeX">\mathcal{L}(W') \le \mathcal{L}(W) + \nabla \mathcal{L}(W)^{\top} (W' - W) + \frac{L}{2} \lVert W' - W \rVert^2</formula></p><p>where L is the Lipschitz constant, and <formula notation="TeX">\nabla \mathcal{L}(W)</formula> is the gradient of the global loss function with respect to the model W.</p><p>b) Effective Gradient in Subset <formula notation="TeX">W_S^{(t)}</formula>: We assume that the subset of layers <formula notation="TeX">W_S^{(t)}</formula> contains layers where the gradient most effectively reduces the loss. Specifically, we assume that in each round t:</p><p>where &#948; is a positive constant. While smaller &#948; values indicate better alignment with the full model's gradient, a larger &#948; can be advantageous in noisy or complex scenarios, as perfect alignment may lead to updating every source of noise in the model.</p><p>c) Client Participation and Data Distribution: For simplicity, we assume full participation from all clients in every round of communication. To remove stochasticity from the analysis, we also assume that each client's local update is computed with one step of gradient descent with the full gradient of its local loss over all of its data in every round. Additionally, we assume that each client's local data distribution remains constant throughout the training process.</p></div>
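<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Analysis of Gradient Descent Updates</head><p>We now compare the impact of updating all layers versus updating only a subset of layers during the global aggregation phase.</p><p>a) Full Model Update: When the entire model W is updated during the global aggregation phase, the update rule for each layer <formula notation="TeX">W_i</formula> at iteration t is <formula notation="TeX">W_i^{(t+1)} = W_i^{(t)} - \eta \nabla \mathcal{L}(W_i^{(t)})</formula>. The overall change in the global loss function due to this update is governed by the smoothness condition, which can be expressed as:</p><p><formula notation="TeX">\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) - \eta \lVert \nabla \mathcal{L}(W^{(t)}) \rVert^2 + \frac{L \eta^2}{2} \lVert \nabla \mathcal{L}(W^{(t)}) \rVert^2</formula></p><p>where &#951; is the learning rate, <formula notation="TeX">\lVert \nabla \mathcal{L}(W^{(t)}) \rVert^2</formula> represents the squared L2 norm of the global gradient <formula notation="TeX">\nabla \mathcal{L}(W^{(t)})</formula>, and L is the Lipschitz constant of the global loss function.</p><p>b) Subset of Layers Update: Now, we consider updating only a subset of layers, S. We decompose the model <formula notation="TeX">W^{(t)}</formula> into two parts:</p><p>&#8226; <formula notation="TeX">W_S^{(t)}</formula>: The subset of layers that are being updated.</p><p>&#8226; <formula notation="TeX">W_{S^c}^{(t)}</formula>: The layers that are not being updated.</p><p>Thus, without loss of generality, the full model can be expressed as <formula notation="TeX">W^{(t)} = (W_S^{(t)}, W_{S^c}^{(t)})</formula>, where <formula notation="TeX">S^c</formula> denotes the complement of S. Since the update only affects <formula notation="TeX">W_S</formula>, the difference <formula notation="TeX">W^{(t+1)} - W^{(t)}</formula> simplifies to <formula notation="TeX">(W_S^{(t+1)} - W_S^{(t)}, 0)</formula>. Substituting into the smoothness condition:</p><p><formula notation="TeX">\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) + \nabla \mathcal{L}(W^{(t)})^{\top} (W_S^{(t+1)} - W_S^{(t)}, 0) + \frac{L}{2} \lVert (W_S^{(t+1)} - W_S^{(t)}, 0) \rVert^2</formula></p><p>Now, consider the gradient <formula notation="TeX">\nabla \mathcal{L}(W^{(t)})</formula>. This gradient is with respect to the entire model <formula notation="TeX">W^{(t)}</formula>, which means it includes components for both <formula notation="TeX">W_S^{(t)}</formula> and <formula notation="TeX">W_{S^c}^{(t)}</formula>: <formula notation="TeX">\nabla \mathcal{L}(W^{(t)}) = (\nabla \mathcal{L}(W_S^{(t)}), \nabla \mathcal{L}(W_{S^c}^{(t)}))</formula>, where <formula notation="TeX">\nabla \mathcal{L}(W_S^{(t)})</formula> is the gradient of the loss function with respect to the subset <formula notation="TeX">W_S^{(t)}</formula>. Substituting this, together with the update <formula notation="TeX">W_S^{(t+1)} - W_S^{(t)} = -\eta \nabla \mathcal{L}(W_S^{(t)})</formula>, into the smoothness condition gives:</p><p><formula notation="TeX">\mathcal{L}(W^{(t+1)}) \le \mathcal{L}(W^{(t)}) - \eta \lVert \nabla \mathcal{L}(W_S^{(t)}) \rVert^2 + \frac{L \eta^2}{2} \lVert \nabla \mathcal{L}(W_S^{(t)}) \rVert^2</formula></p></div>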
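<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Theoretical Comparison: Full Model vs. Subset Updates</head><p>Let us assume that the gradient over the subset <formula notation="TeX">W_S^{(t)}</formula> is not aligned with the full gradient, i.e., large &#948;. The loss reduction is defined as:</p><p><formula notation="TeX">\Delta \mathcal{L} = \mathcal{L}(W^{(t)}) - \mathcal{L}(W^{(t+1)})</formula></p><p>For the subset update, the loss reduction can be expressed as:</p><p><formula notation="TeX">\Delta \mathcal{L}_{S} \ge \eta \left(1 - \frac{\eta L}{2}\right) \lVert \nabla \mathcal{L}(W_S^{(t)}) \rVert^2</formula></p><p>For the full model update, the loss reduction is:</p><p><formula notation="TeX">\Delta \mathcal{L}_{\mathrm{full}} \ge \eta \left(1 - \frac{\eta L}{2}\right) \lVert \nabla \mathcal{L}(W^{(t)}) \rVert^2</formula></p><p>Given the loss reduction inequalities for subset and full model updates, consider the scenario where the subset update yields a greater loss reduction than the full model update. This implies that the subset update provides a tighter boundary. The inequality describing the difference between the loss reduction boundaries for the subset update and the full model update is given by:</p><p><formula notation="TeX">\eta \left(1 - \frac{\eta L}{2}\right) \left( \lVert \nabla \mathcal{L}(W_S^{(t)}) \rVert^2 - \lVert \nabla \mathcal{L}(W^{(t)}) \rVert^2 \right) \ge 0</formula></p><p>Because the subset gradient contains only some of the components of the full gradient, its squared norm cannot exceed that of the full gradient, so the second factor is non-positive. For this inequality to hold, we therefore require <formula notation="TeX">1 - \frac{\eta L}{2} \le 0</formula>, which implies <formula notation="TeX">\eta L \ge 2</formula>. This condition indicates that when the Lipschitz constant L is large, even a relatively small learning rate &#951; causes the loss function to behave less smoothly, meaning the updates may become noisy. In such cases, deviating from the full model with a large &#948; update can be beneficial, allowing for more efficient loss reduction by focusing on key layers.</p></div>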
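<div xmlns="http://www.tei-c.org/ns/1.0"><p>The comparison above can be checked numerically with a small Python sketch. It assumes the standard one-step bound implied by L-smoothness with a gradient step, namely a guaranteed loss reduction of lr * (1 - lr * L / 2) times the squared gradient norm. The per-layer gradient values, the choice of subset, and the function names are hypothetical and serve only to illustrate how the sign of 1 - lr * L / 2 decides which bound is larger.</p>
<p><code lang="python"><![CDATA[
```python
def reduction_bound(grad_sq_norm, lr, lipschitz):
    """Lower bound on the one-step loss reduction: lr * (1 - lr*L/2) * ||g||^2."""
    return lr * (1.0 - lr * lipschitz / 2.0) * grad_sq_norm


def sq_norm(layer_grads, layers):
    """Squared L2 norm of the gradient restricted to the given layer indices."""
    return sum(g * g for i in layers for g in layer_grads[i])


# Hypothetical per-layer gradients; layer 1 stands in for a small, noisy layer.
layer_grads = [[0.3, -0.1], [0.05, 0.02], [0.4, 0.2]]
full = sq_norm(layer_grads, range(len(layer_grads)))
subset = sq_norm(layer_grads, [0, 2])

lipschitz = 5.0
for lr in (0.1, 0.5):  # lr * L = 0.5 (below 2) and 2.5 (above 2)
    b_full = reduction_bound(full, lr, lipschitz)
    b_sub = reduction_bound(subset, lr, lipschitz)
    print(f"lr*L={lr * lipschitz:.1f}  full={b_full:.5f}  subset={b_sub:.5f}  "
          f"subset_bound_larger={b_sub >= b_full}")
```
]]></code></p>
<p>With lr * L below 2, the full update has the larger guaranteed reduction; once lr * L reaches 2, the factor becomes non-positive and the subset bound is at least as large, which is the regime discussed in the text.</p></div>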
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENTAL EVALUATION</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experimental Setup and Evaluation Metrics</head><p>We trained two language models for next-word prediction: a standard Transformer <ref type="bibr">[1]</ref> on UDPOS data and GPT-2 <ref type="bibr">[3]</ref> on Penn Treebank (Penn TB) for 150 communication rounds. The sequence length was set to 128. Both models were optimized with Stochastic Gradient Descent (SGD) with a learning rate of 0.0001 and a batch size of 16. Training used 100 clients, each performing 5 local epochs. We used FedAvg <ref type="bibr">[7]</ref> and FedProx <ref type="bibr">[8]</ref> (proximal term &#956; = 0.1) for aggregation. Experiments were conducted on a single NVIDIA GeForce RTX 2080 Ti.</p><p>Each network architecture consists of a repeated part (the main body) and a non-repeated part (the input and output layers). Four update strategies were evaluated: Full, FedTLU, Random, and Last updates. In the Full update, all layers were updated in each round. For Random and FedTLU, updates were applied to the non-repeated part and a portion of the main body layers, with FedTLU selecting layers based on scores and Random selecting layers uniformly at random. The Last update case involved updating only the last few layers of the non-repeated part. Importantly, for FedTLU, Random, and Last updates, the same number of layers was updated in each experiment to ensure a fair comparison.</p><p>Each case was evaluated three times, and we report the average perplexity, a measure of predictive quality for which lower is better. Global testing perplexity was calculated after each aggregation using test sets from the datasets, while local testing performance was measured by averaging perplexity scores from participating clients after local training, using the same test sets.</p></div>
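<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the metric, the following Python sketch shows the usual definition of perplexity as the exponential of the mean negative log-likelihood per token, together with averaging over repeated runs. The function names and the example values are hypothetical; the paper does not specify its evaluation code, so this is only a sketch of the standard computation.</p>
<p><code lang="python"><![CDATA[
```python
import math


def perplexity(token_log_probs):
    """exp of the mean negative natural-log probability of the true tokens."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


def mean_over_runs(values):
    """Average a metric over repeated runs (three runs per case in the text)."""
    return sum(values) / len(values)


# A model that assigns probability 0.5 to every true token has perplexity 2.
uniform_half = [math.log(0.5)] * 4
print(perplexity(uniform_half))

# Averaging hypothetical per-run perplexities.
print(mean_over_runs([2.0, 2.5, 3.0]))
```
]]></code></p>
<p>Lower perplexity corresponds to higher probability assigned to the observed tokens; a perplexity of 1 would mean every token was predicted with probability 1.</p></div>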
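<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Experimental Results</head><p>We compared FedTLU's performance under three conditions: 1) training the global model from scratch, 2) fine-tuning pre-trained models, and 3) scenarios with noisy or potentially malicious clients.</p>
<table><head>TABLE I: AVERAGED MINIMUM GLOBAL AND LOCAL TEST PERPLEXITIES WHEN THE GLOBAL MODEL WAS TRAINED FROM SCRATCH.</head>
<row role="label"><cell rows="3">Model (Dataset)</cell><cell>Aggregation</cell><cell cols="9">FedAvg</cell><cell cols="9">FedProx</cell></row>
<row role="label"><cell>Portion</cell><cell cols="3">0.75</cell><cell cols="3">0.50</cell><cell cols="3">0.25</cell><cell cols="3">0.75</cell><cell cols="3">0.50</cell><cell cols="3">0.25</cell></row>
<row role="label"><cell>Update</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell></row>
<row><cell rows="2">Transformer (Penn TB)</cell><cell>Global</cell><cell>2.107</cell><cell>2.108</cell><cell>4.652</cell><cell>2.399</cell><cell>2.406</cell><cell>4.652</cell><cell>2.874</cell><cell>2.879</cell><cell>4.652</cell><cell>2.114</cell><cell>2.114</cell><cell>4.665</cell><cell>2.407</cell><cell>2.413</cell><cell>4.665</cell><cell>2.883</cell><cell>2.888</cell><cell>4.665</cell></row>
<row><cell>Local</cell><cell>2.109</cell><cell>2.110</cell><cell>4.559</cell><cell>2.399</cell><cell>2.406</cell><cell>4.559</cell><cell>2.870</cell><cell>2.876</cell><cell>4.559</cell><cell>2.116</cell><cell>2.116</cell><cell>4.572</cell><cell>2.407</cell><cell>2.414</cell><cell>4.572</cell><cell>2.880</cell><cell>2.885</cell><cell>4.572</cell></row>
<row><cell rows="2">GPT-2 (UDPOS)</cell><cell>Global</cell><cell>10.091</cell><cell>10.124</cell><cell>14.832</cell><cell>10.231</cell><cell>10.295</cell><cell>14.832</cell><cell>10.464</cell><cell>10.583</cell><cell>14.832</cell><cell>10.093</cell><cell>10.125</cell><cell>14.838</cell><cell>10.232</cell><cell>10.297</cell><cell>14.838</cell><cell>10.466</cell><cell>10.585</cell><cell>14.838</cell></row>
<row><cell>Local</cell><cell>10.245</cell><cell>10.287</cell><cell>12.016</cell><cell>10.361</cell><cell>10.415</cell><cell>12.016</cell><cell>10.560</cell><cell>10.695</cell><cell>12.016</cell><cell>10.247</cell><cell>10.288</cell><cell>12.018</cell><cell>10.362</cell><cell>10.416</cell><cell>12.018</cell><cell>10.563</cell><cell>10.696</cell><cell>12.018</cell></row>
</table>
<p>Fig. 2. Perplexity curves for fine-tuning with Full, FedTLU, and Random updates in noisy or malicious client scenarios.</p>
<table><head>TABLE II: AVERAGED MINIMUM GLOBAL AND LOCAL TEST PERPLEXITIES WHEN THE GLOBAL MODEL WAS FINE-TUNED AFTER 400 ROUNDS.</head>
<row role="label"><cell rows="2">Model (Dataset)</cell><cell>Aggregation</cell><cell cols="4">FedAvg</cell><cell cols="4">FedProx</cell></row>
<row role="label"><cell>Update</cell><cell>Full</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell><cell>Full</cell><cell>Ours</cell><cell>Random</cell><cell>Last</cell></row>
<row><cell rows="2">Transformer (Penn TB)</cell><cell>Global</cell><cell>1.774</cell><cell>1.730</cell><cell>1.730</cell><cell>1.785</cell><cell>1.776</cell><cell>1.731</cell><cell>1.732</cell><cell>1.787</cell></row>
<row><cell>Local</cell><cell>1.776</cell><cell>1.731</cell><cell>1.732</cell><cell>1.787</cell><cell>1.778</cell><cell>1.733</cell><cell>1.734</cell><cell>1.789</cell></row>
<row><cell rows="2">GPT-2 (UDPOS)</cell><cell>Global</cell><cell>9.617</cell><cell>9.292</cell><cell>9.382</cell><cell>9.645</cell><cell>9.618</cell><cell>9.293</cell><cell>9.383</cell><cell>9.646</cell></row>
<row><cell>Local</cell><cell>9.702</cell><cell>9.406</cell><cell>9.493</cell><cell>9.722</cell><cell>9.702</cell><cell>9.407</cell><cell>9.494</cell><cell>9.722</cell></row>
</table>
<table><head>TABLE III: AVERAGED MINIMUM GLOBAL AND LOCAL TEST PERPLEXITIES WHEN THE MODELS WERE FINE-TUNED WITH NOISY OR MALICIOUS CLIENTS.</head>
<row role="label"><cell rows="2">Model (Dataset)</cell><cell>Aggregation</cell><cell cols="3">FedAvg</cell><cell cols="3">FedProx</cell></row>
<row role="label"><cell>Update</cell><cell>Full</cell><cell>Ours</cell><cell>Random</cell><cell>Full</cell><cell>Ours</cell><cell>Random</cell></row>
<row><cell rows="2">Transformer (Penn TB)</cell><cell>Global</cell><cell>1.815</cell><cell>1.768</cell><cell>1.764</cell><cell>1.817</cell><cell>1.770</cell><cell>1.766</cell></row>
<row><cell>Local</cell><cell>1.804</cell><cell>1.765</cell><cell>1.763</cell><cell>1.806</cell><cell>1.767</cell><cell>1.764</cell></row>
<row><cell rows="2">GPT-2 (UDPOS)</cell><cell>Global</cell><cell>9.597</cell><cell>9.318</cell><cell>10.050</cell><cell>9.763</cell><cell>9.559</cell><cell>10.050</cell></row>
<row><cell>Local</cell><cell>9.696</cell><cell>9.444</cell><cell>10.225</cell><cell>9.869</cell><cell>9.687</cell><cell>10.225</cell></row>
</table>
<p>For training from scratch, we varied the portion of layers updated in each communication round (75%, 50%, and 25%). Each local client was allocated up to a maximum number of tokens, calculated by dividing the total dataset size by the number of clients, where tokens refer to the units obtained from tokenizing the dataset. The minimum number of tokens per client was half of this maximum. Local datasets were non-overlapping, and 10% of clients participated per round. Table <ref type="table">I</ref> shows that FedTLU consistently achieved better test perplexity, reducing global perplexity by up to 1.14% and local perplexity by up to 1.3% compared to Random. FedTLU also showed a 41.7-120.8% improvement in global perplexity and 13.8-116.2% in local perplexity compared to last-layer updates. These results indicate that targeted layer updates in FedTLU enhance model stability.</p><p>For fine-tuning, we first trained the global model for 400 rounds with full updates. The same token distribution approach as in training from scratch was applied, ensuring consistency across experiments. During fine-tuning, 5% of clients participated per round, and each client's local data was halved. 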
The update portion was decreased sequentially from 75% to 50% to 25%, and if the loss did not decrease by at least 1% for 10 consecutive rounds, the portion was further reduced. Table <ref type="table">II</ref> and Fig. <ref type="figure">1</ref> show that FedTLU outperformed other strategies, reducing global perplexity by 2.54-3.50% compared to full updates, by up to 0.97% compared to Random, and by 3.18-3.80% compared to last-layer updates. Locally, the corresponding reductions against the same three baselines were 2.60-3.15%, 0.06-0.92%, and 3.23-3.36%, respectively. In Fig. <ref type="figure">1</ref>, FedTLU consistently maintained lower global perplexity during training.</p><p>In noisy or malicious client scenarios, 10% of clients were assigned shuffled labels (i.e., mismatched tokens). Table <ref type="table">III</ref> shows that FedTLU reduced global perplexity by 2.13-2.99% compared to full updates and by between -0.23% and 7.86% compared to Random, where a negative value means that Random was slightly better. Locally, the corresponding reductions were 1.88-2.67% and between -0.17% and 8.27%. Notably, in the GPT-2 (UDPOS) case, Random yielded higher perplexities than full updates, highlighting that randomly selecting layers may include less effective ones. These results emphasize the importance of selective layer updates. As seen in Fig. <ref type="figure">2</ref>, FedTLU consistently maintained lower global perplexity despite noisy updates. In summary, FedTLU demonstrated better resilience, reducing the impact of noisy or malicious clients and maintaining lower perplexity, providing robustness in federated learning environments.</p></div>
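<div xmlns="http://www.tei-c.org/ns/1.0"><p>The schedule for the update portion during fine-tuning can be sketched as a small patience-based controller in Python. This is one possible reading of the description, not the authors' code: it assumes that the 1% decrease is measured relative to the previous round's loss, that the stall counter resets after each reduction, and that the portion stays at 25% once reached. The class name PortionSchedule and the loss values are hypothetical.</p>
<p><code lang="python"><![CDATA[
```python
class PortionSchedule:
    """Step the update portion down when the loss stalls (assumed reading)."""

    def __init__(self, portions=(0.75, 0.50, 0.25), min_rel_drop=0.01, patience=10):
        self.portions = portions
        self.min_rel_drop = min_rel_drop  # required relative decrease (1%)
        self.patience = patience          # consecutive stalled rounds tolerated
        self.index = 0
        self.stalled = 0
        self.prev_loss = None

    def step(self, loss):
        """Record this round's loss and return the portion for the next round."""
        if self.prev_loss is not None:
            improved = loss <= self.prev_loss * (1.0 - self.min_rel_drop)
            self.stalled = 0 if improved else self.stalled + 1
        self.prev_loss = loss
        if self.stalled >= self.patience and self.index < len(self.portions) - 1:
            self.index += 1   # move to the next, smaller portion
            self.stalled = 0  # assumption: restart the count after each reduction
        return self.portions[self.index]


# Short demonstration with patience=3 instead of 10 to keep the trace small.
schedule = PortionSchedule(patience=3)
trace = [schedule.step(loss) for loss in (1.0, 0.9, 0.899, 0.899, 0.899)]
print(trace)
```
]]></code></p>
<p>In the demonstration, the loss improves by more than 1% once and then stalls for three rounds, at which point the portion drops from 0.75 to 0.50.</p></div>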
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>We introduced FedTLU, a targeted layer update strategy for federated learning (FL). FedTLU improves convergence and reduces the impact of noisy client updates. Experimental results show that FedTLU outperforms random and last-layer updates, providing a more efficient and stable solution for FL in language modeling.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>The authors are partially supported by the National Science Foundation under grants CNS-2409138 and CNS-2106891 and by the Department of Energy under grant DESC0025652.</p></note>
		</body>
		</text>
</TEI>
