<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>AnycostFL: Efficient On-Demand Federated Learning over Heterogeneous Edge Devices</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/17/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10536848</idno>
					<idno type="doi">10.1109/INFOCOM53939.2023.10229017</idno>
					
					<author>Peichun Li</author><author>Guoliang Cheng</author><author>Xumin Huang</author><author>Jiawen Kang</author><author>Rong Yu</author><author>Yuan Wu</author><author>Miao Pan</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In this work, we investigate the challenging problem of on-demand federated learning (FL) over heterogeneous edge devices with diverse resource constraints. We propose a cost-adjustable FL framework, named AnycostFL, that enables diverse edge devices to efficiently perform local updates under a wide range of efficiency constraints. To this end, we design the model shrinking to support local model training with elastic computation cost, and the gradient compression to allow parameter transmission with dynamic communication overhead. An enhanced parameter aggregation is conducted in an element-wise manner to improve the model performance. Focusing on Any-costFL, we further propose an optimization design to minimize the global training loss with personalized latency and energy constraints. By revealing the theoretical insights of the convergence analysis, personalized training strategies are deduced for different devices to match their locally available resources. Experiment results indicate that, when compared to the state-of-theart efficient FL algorithms, our learning framework can reduce up to 1.9 times of the training latency and energy consumption for realizing a reasonable global testing accuracy. Moreover, the results also demonstrate that, our approach significantly improves the converged global accuracy.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Federated learning (FL) is an emerging distributed learning paradigm that enables multiple edge devices to train a common global model without sharing individual data <ref type="bibr">[1]</ref>. This privacyfriendly data analytics technique over massive devices is envisioned as a promising solution to realize pervasive intelligence <ref type="bibr">[2]</ref>. However, in many real-world application areas, mobile devices are often equipped with different local resources, which raises the emerging challenges for locally on-demand training <ref type="bibr">[3]</ref>. Given different local resources status (e.g., computing capability and communication channel state) and personalized efficiency constraints (e.g., latency and energy), it is crucial to customize training strategies for heterogeneous edge devices.</p><p>We perform an in-depth analysis on the time delay and the energy consumption for performing the local model updates at edge devices. Specifically, we evaluate and record the cost of local training on three different NVIDIA Jetson family platforms (i.e., Nano, NX AGX, and Xavier AGX) under different channel states (i.e., good, medium, and poor). On the one hand, we observe that the learning efficiency differs significantly with diverse learning scenarios. As shown in Fig. <ref type="figure">1</ref>, the singleepoch training on Nano with poor communication condition consumes about 4.0 times training latency than that of Xavier AGX with good communication condition, while its energy consumption is about 0.7 times less than the latter one's. On the other hand, we observe that the bottlenecks of latency and energy are induced by parameter transmission and local model training, respectively.</p><p>The above observations provide insights for a proper design of the on-demand FL system. To handle the resource heterogeneity, it is suggested to alleviate the energy and the latency cost of the local device. More importantly, the computation and communication costs should be jointly reduced to achieve efficient local training. In the literature, most existing studies either employ resource allocation and device scheduling to mitigate the system cost <ref type="bibr">[4]</ref>- <ref type="bibr">[10]</ref>, or design gradient compression to accelerate the parameter transmission procedure <ref type="bibr">[11]</ref>- <ref type="bibr">[17]</ref>. The former method inherits the ideas of traditional design for mobile edge systems and takes no account of the optimization for neural networks, while the latter overlooks the computation cost of local model training.</p><p>In this paper, we propose "anycost" FL, named AnycostFL, to break the latency and energy bottlenecks for on-demand distributed training over heterogeneous edge devices. Our goal is to develop a cost-adjustable FL framework that enables edge devices to perform local updates under diverse learning scenarios. To this end, we first design the model shrinking and gradient compression to enable adaptive local updates with different computation and communication costs. Meanwhile, an enhanced parameter aggregation scheme is proposed to fuse the knowledge of the local updates. Following that, we investigate the on-demand learning of AnycostFL by regulating the local model structure, gradient compression policy and computing frequency under personalized latency and energy constraints. However, customizing training strategy for different learning scenarios is a non-trivial task, since how the global accuracy is affected by the local model structure and compression rate is still unknown. To address this issue, we theoretically reveal the convergence insights of our framework, which are further leveraged to guide optimization analysis. Finally, the optimal training strategy is derived for each device according to its locally available resource.</p><p>Our main contributions are summarized as follows.</p><p>&#8226; We propose a novel FL framework, named AnycostFL, that enables the local updates with elastic computation cost and communication overhead. &#8226; We theoretically present the optimal aggregation scheme and convergence analysis for AnycostFL. &#8226; We investigate the on-demand training problem of Any-costFL, and the optimal training strategy is devised to adapt the locally available resource.</p><p>&#8226; Extensive experiments indicate that the proposed Any-costFL outperforms the state-of-the-art efficient FL methods in terms of resource utilization and learning accuracy. The remainder of this paper is organized as follows. Section II describes related studies. In Section III, we detail the main operations of AnycostFL to fulfill the single-round training. The problem formulation, theoretical analysis and the corresponding solution are provided in Section IV. The experiment evaluations are presented in Section V, and we finally conclude the paper in Section VI and discuss the future directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. RELATED WORK</head><p>Resource Management Methods. Resource management methods aim to reduce the FL system cost by arranging the local and system resources. Resource allocation methods employ frequency scheduling <ref type="bibr">[18]</ref>, transmission power control <ref type="bibr">[19]</ref>, and bandwidth allocation <ref type="bibr">[20]</ref> to balance the cost of local training. Recent device selection methods directly exclude those weak devices with poor computation or communication capabilities to accelerate the convergence time <ref type="bibr">[21]</ref>- <ref type="bibr">[23]</ref>. Besides, topology-aware management is another very effective method to mitigate the network throughput <ref type="bibr">[18]</ref>, <ref type="bibr">[24]</ref>, <ref type="bibr">[25]</ref>. However, these methods inherit the ideas of the efficient design for traditional mobile systems and overlook the optimization of neural networks.</p><p>Neuron-aware Techniques. Neuron-aware techniques focus on revealing the black box of neural networks to improve the training efficiency of the FL system. Early gradient compression utilizes sparsification <ref type="bibr">[11]</ref>, <ref type="bibr">[26]</ref>, and quantization <ref type="bibr">[14]</ref>, <ref type="bibr">[27]</ref>, <ref type="bibr">[28]</ref> to reduce the transmission cost of FL system. In addition, feature maps fusion and knowledge distillation can be carried out to improve the information aggregation <ref type="bibr">[29]</ref>, <ref type="bibr">[30]</ref>.</p><p>Besides, FedMask proposes to train a personalized mask for each device to improve the test accuracy on the local dataset <ref type="bibr">[31]</ref>. Recently, model structure pruning enables multiple devices with different model architectures to train a shared global model <ref type="bibr">[32]</ref>, <ref type="bibr">[33]</ref>. Such methods can reduce the cost of local training, but how to customize optimal training strategies (e.g., gradient compression and model pruning policy) for different learning scenarios is still unknown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. TRAINING WITH ANYCOSTFL</head><p>In this section, we first outline the overall design of Any-costFL. Next, we detail the key techniques of our framework, including elastic model shrinking (EMS), flexible gradient compression (FGC), and all-in-one aggregation (AIO).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Outline of AnycostFL</head><p>We consider a generic application scenario of FL with a set of I edge devices </p><p>where |D i | is the size of D i . Given the specified learning task, the original training workload of single sample W and the data size of uncompressed gradient S can be empirically measured.</p><p>As shown in Fig. <ref type="figure">2</ref>(a), to reduce the computational complexity of the local model training and the communication cost of gradient update transmission, we propose AnycostFL with two device-side techniques, i.e., model shrinking and gradient compression. At the t-th global iteration of AnycostFL, the device i is enabled to adjust its training workload and gradient size as W t,i = &#945; t,i W and S t,i = &#946; t,i S, respectively. Here, &#945; t,i &#8712; (0, 1] and &#946; t,i &#8712; (0, 1] are defined as the model shrinking factor and the gradient compression rate, respectively. The training procedure of AnycostFL is summarized as follows.</p><p>1) Elastic local training: At the t-th global round, the device i downloads the latest global model w t from the parameter server. With the pre-calculated model shrinking factor &#945; t,i , the specialized sub-model w &#945; t,i = shrink(w t , &#945; t,i ) can be efficiently derived, where function shrink(&#8226;, &#8226;) indicates the operations for model shrinking. Then, the local training is conducted with sub-model w &#945; t,i and local data D i , and the updated local sub-model w &#945; t+1,i is obtained. Furthermore, the local gradient update can be acquired as u t,i = w &#945; t,i -w &#945; t+1,i . 2) Flexible gradient upload: To further reduce the uplink traffic, the local device i is motivated to compress the gradient update u t,i before the parameter transmission. With the given compression rate &#946; t,i , the compressed gradient update &#361;t,i = cmprs(u t,i , &#946; t,i ) is uploaded to the server, where cmprs(&#8226;, &#8226;) is the function for gradient compression. 3) Parameter aggregation: The server collects the compressed local updates {&#361; t,i } &#8704;i with different shrinking factors {&#945; t,i } &#8704;i and compression rates {&#946; t,i } &#8704;i . After that, the global update is calculated by &#361;t = aioagg({&#361; t,i } &#8704;i ), where aioagg(&#8226;) is the server-side all-in-one aggregation. Then, the updated global model is computed as w t+1 = w t -&#361;t .</p><p>After the T -round training of the above three-step iterations, the final global model w T is obtained. Before introducing how to customize the values of {&#945; t,i } &#8704;i and {&#946; t,i } &#8704;i in Section IV, we illustrate the details of model shrinking, gradient compression and update aggregation in the rest of this section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Elastic Model Shrinking</head><p>We aim to derive the sub-model w &#945; t,i with training complexity of &#945; t,i W from global model w t by reducing the width of the global model. The shrinking operations work as follows.</p><p>1) Server-side channel sorting: To avoid incurring extra memory cost for the edge devices, the server first sorts the channels of the latest global model before the model distribution. Given one layer of the weight of the global model, the server sorts the output channels in the current layer in descending order according to their values of L2 norm, and meanwhile, the input channels of the next layer should be sorted accordingly in the same order to maintain the permutation invariance of the whole model <ref type="bibr">[34]</ref>.</p><p>2) Layer-wise uniform shrinking: Next, the server broadcasts the weight of each layer of the global model in a channelby-channel manner. Instead of downloading the full global model, each device only receives those important parameters from the global model to assemble the local sub-model. Here, we utilize the fixed shrinking ratio for each layer in the same sub-model. Empirically, given model shrinking factor &#945; t,i , we can reduce the size of the hidden layer by &#8730; &#945; t,i to acquire the sub-model. For example, as shown in Fig. <ref type="figure">2</ref>(c), when shrinking a global model with hidden sizes of {16, 32, 64} under &#945; t,i = 1 4 , we approximately reduce the size of each hidden layer by half as {8, 16, 32} to form the sub-model.</p><p>At the beginning of the t-th global round, all device initialize their local sub-models {w &#945; t,i } &#8704;i by choosing the most important channels from the global model w t . In this way, the training complexity is significantly reduced while maintaining the performance of local sub-models. After that, the local training of device k is conducted with sub-model w &#945; t,i , which produces the local gradient u t,i with data size of &#945; t,i S.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Flexible Gradient Compression</head><p>Given the local update u t,i with the desired compression rate &#946; t,i , we aim to obtain the compressed update &#361;t,i with data size of &#945; t,i &#946; t,i S. Let &#961; t,i and L t,i denote the sparsity rate and the number of quantization levels, respectively. The gradient compression scheme works as follows.</p><p>1) Kernel-wise sparsification: Without loss of generality, we take the convolution neural network (CNN) as an example to illustrate the sparsification procedure. We aim to acquire the sparse update &#251;t,i from u t,i . Let u t,i [k] denote the k-th kernel of u t,i , and u t,i = {u t,i [k]} &#8704;k . We measure the importance of each kernel and obtain N = { u t,i [k] 2 } &#8704;k , where &#8226; 2 denotes the L2 norm operation. Next, by selecting the &#961; t,i Kth largest value in N as the threshold &#928;, the kernel-wise sparsification is expressed as</p><p>Meanwhile, the binary mask of &#251;t,i is denoted as m t,i .</p><p>2) Probabilistic quantization: Motivated by the studies in <ref type="bibr">[35]</ref>, <ref type="bibr">[36]</ref>, we aim to obtain the quantized update &#361;t,i with the given sparse &#251;t,i and the quantization level L t,i . Let u &#8712; &#251;t,i be a scalar value. To begin with, we first calculate the magnitude range of the non-zero elements of &#251;t,i , denoted as [u min , u max ], where u min = min{|u|} &#8704;u =0 , and</p><p>l=1 denote the set of quantization points, where Q l is computed by</p><p>ut,1 ut,2 ut,3 ut Fig. <ref type="figure">3</ref>. An illustration of the all-in-one aggregation.</p><p>For any u &#8712; &#251;t,i and u = 0, we can always find a quantization interval</p><p>, and its corresponding quantized value &#361; is further computed by</p><p>where sgn(&#8226;) calculates the sign of the given scalar. Furthermore, the set of the quantization indices of all &#361; &#8712; &#361;t,i is denoted as</p><p>3) Lossless encoding: Due to the distribution characteristics of L t,i that smaller indices may occur more frequently, we apply entropy coding to reduce the data size <ref type="bibr">[14]</ref>, <ref type="bibr">[37]</ref>. Besides, the sparse binary matrix m t,i can be compressed by Golomb encoding <ref type="bibr">[11]</ref>, <ref type="bibr">[38]</ref>.</p><p>After determining the compression scheme, we can vary the combinations of {&#961; t,i , L t,i } and record the corresponding compression rates. Based on the results, we can build a piecewise linear function to predict the compression strategy {&#961; t,i , L t,i } with the given &#946; t,i . Notably, this function can be efficiently fitted by the server with a rather small amount of public training data (e.g., 16 samples) in an offline manner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. All-in-One Aggregation</head><p>After all the devices upload their encoded updates, the server receives, decodes and then reconstructs the compressed local updates {&#361; t,i } &#8704;i . Our goal is to obtain the global update &#361;t by aggregating {&#361; t,i } &#8704;i . However, the aggregation of local updates in our framework cannot be supported by conventional FedAvg <ref type="bibr">[1]</ref>, since the local updates are produced by different model structures with different levels of precision (i.e., different quantization levels and sparsity).</p><p>To tackle the above challenge, we propose an all-in-one aggregation scheme that fuses the local updates in an elementwise manner. Let the set {1, 2, &#8226; &#8226; &#8226; , J} index elements of the global update &#361;t , and &#361;[j] t denote the j-th element of &#361;t . To accomplish the aggregation for &#361;[j] t , we first determine the subset of devices I j &#8838; I whose local model structure also contains the j-th element. Then, we have</p><p>t,i = 0,</p><p>where p t,i is the aggregation coefficient for the j-th device at the t-th global round. The optimal values of {p t,i } &#8704;i will be further analyzed in Section IV. Fig. <ref type="figure">3</ref> gives an example to illustrate the aggregation details. Specifically, different elements in the global update are updated by different subsets of devices, and more important elements will "absorb" knowledge from more devices. When the j-th element is zeroed out by all the devices in I j , we have &#361;[j] t = 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. THEORETICAL ANALYSIS AND OPTIMIZATION</head><p>In this section, we focus on the optimization of our framework by customizing the training strategies for diverse devices. We first formulate the on-demand training problem of Any-costFL. Then, we derive the upper bound of the convergence rate and reveal the key insights to improve the performance of AnycostFL. Based on the analysis, the optimization problem is transformed into a tractable form, and the closed-form solution is derived.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. AnycostFL over Wireless Networks</head><p>In this subsection, we formulate the computation and communication models for our framework. After that, we build up an on-demand learning problem that minimizes the global training loss with given delay and energy constraints.</p><p>1) Computation model: For the device i at the t-th global round, given the model shrinking factor &#945; t,i and computing frequency f t,i , the time consumption of local model training can be measured by</p><p>where &#964; denotes the number of local epochs. Meanwhile, the corresponding energy consumption can be given by</p><p>where i is the hardware energy coefficient of the device i.</p><p>2) Communication model: We consider the frequency division multiple access (FDMA) scheme for the transmission of the local gradient update. For the device i at the t-th global round, the achievable transmitting rate can be estimated by</p><p>where P com t,i is the transmitting power; b i is the achievable bandwidth; |h t,i | denotes the path loss of wireless channel; N 0 is the power spectral density of the additive white Gaussian noise. For the device i at t-th global round, given the update &#361;t,i generated by the local model with a shrinking factor of &#945; t,i and compression rate of &#946; t,i , the required time T com t,i and energy consumption E com t,i of uplink transmission can be respectively measured by</p><p>, and</p><p>With the above computation and communication models, we next focus on the optimization problem of AnycostFL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3) Problem formulation:</head><p>To optimize AnycostFL, we study an on-demand training problem. Specifically, the shared maximal latency for each round T max is determined by the server. The local energy consumption budget for each round E max t,i is customized by the device itself. Given multiple devices with diverse local resources (e.g., computation, communication and data), our goal is to customize the training strategy for each device to minimize the global training loss with personalized constraints (e.g., latency and energy). To sum up, at the t-th global round, we aim to optimize the following problem.</p><p>(P1) minF w t ; {&#945; t,i } &#8704;i , {&#946; t,i } &#8704;i (10) subject to:</p><p>) variables: {&#945; t,i , &#946; t,i , f t,i } &#8704;i , where F w t ; {&#945; t,i } &#8704;i , {&#946; t,i } &#8704;i denotes the global loss of the t-th round with given the global model weight w t under the training strategies of {&#945; t,i } &#8704;i and {&#946; t,i } &#8704;i . In the rest of this section, we analyze the relationship between training loss and training strategies. After that, Problem (P1) is further solved based on the theoretical insights.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Assumptions and Key Lemmas</head><p>Being in line with the studies in <ref type="bibr">[5]</ref>, <ref type="bibr">[39]</ref>, we make the following assumptions for the local loss function</p><p>Assumption 3. F i is twice-continuously differentiable. Based on Assumptions 1 and 2, we have &#957;I &#8711; 2 F i (w) &#955;I. Assumption 4. The ratios between the norms of &#8711;F i (w) and &#8711;F (w) are bounded: &#8711;F i (w) 2 &#8804; &#949; &#8711;F (w) 2 , where &#949; &#8805; 0 is a positive constant. Assumption 5. For the moderate shrinking factor &#945; &#8805; &#945; min , the first-shrinking-then-training can be approximated as firsttraining-then-shrinking: &#8711;F i (w &#945; ) = [&#8711;F i (w)] &#945; . Here, we use [&#8711;F i (w)] &#945; to denote the shrinking operation for &#8711;F i (w).</p><p>Next, we give the following two definitions. Definition 1 (Local gradient divergence). The local gradient divergence &#948; t,i is defined as the difference between u t,i and &#361;t,i , which is given by &#948; t,i = u t,i -&#361;t,i . Definition 2 (Global gradient divergence). The global gradient divergence &#916; t is defined as the difference between u t and &#361;t , which is measured by</p><p>Definition 1, u t,i and &#361;t,i may have different dimensions. We pad the missing elements in &#361;t,i with zeros before the arithmetic operation. Next, we are interested in how the training strategies {&#945; t,i , &#946; t,i } &#8704;i affect {&#948; t,i } &#8704;i and &#916; t . We derive the following two lemmas. Lemma 1. For the local training with the model shrinking factor &#945; t,i and compression rate &#946; t,i . The square of the local gradient divergence is bounded by E &#948; t,i 2 &#8804; 1&#945; t,i (2&#945; t,i ) &#946; t,i 2 E u t,i 2 . (11) Proof. See Appendix A. Lemma 2. For the local update {&#361; t,i , &#8704;i} with the corresponding training strategies {&#945; t,i , &#946; t,i } &#8704;i and aggregation coefficients {p t,i } &#8704;i , the square of the global gradient divergence is bounded by</p><p>(12) Proof. See Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Optimal Aggregation Scheme and Convergence Analysis</head><p>Intuitively, the local update u t,i generated with larger {&#945; t,i , &#946; t,i } may carry more accurate information, and thus a larger p t,i should be assigned during the aggregation. Based on Lemma 2, we deduce the following theorem. Theorem 1 (Optimal aggregation scheme). Given the local updates {&#361; t,i } &#8704;i with corresponding training strategies {&#945; t,i , &#946; t,i } &#8704;i , the optimal aggregation coefficients are</p><p>Proof. Based on Lemma 2, we study the following optimization problem to minimize the global gradient divergence.</p><p>(P2) min</p><p>subject to: pt,i &#8805;0, &#8704;i, (14a)</p><p>It can be verified that Problem (P2) is a convex optimization problem. We further solve the problem by the Karush-Kuhn-Tucker (KKT) conditions. Let { } &#8704;i and &#952; be the Lagrange multipliers for Constraints (14a) and (14b), respectively. Then, we obtain</p><p>Being in line with the study in <ref type="bibr">[40]</ref>, we can obtain</p><p>By putting Eqn. ( <ref type="formula">16</ref>) into Eqn. (14b), we obtain</p><p>Putting Eqn. <ref type="bibr">(17)</ref> into Eqn. ( <ref type="formula">16</ref>) completes the proof.</p><p>With the optimal aggregation scheme, we investigate the upper bound of the convergence rate of AnycostFL. Definition 3 (Local and global learning gains). The local and global learning gains are defined as g t,i = &#945; 4 t,i &#946; t,i and g t = i g t,i /I, respectively. Specifically, the local and global learning gains (i.e., g t,i &#8712; [0, 1] and g t &#8712; [0, 1]) measure the amount of effective information carried in the local and global updates, respectively. Theorem 2 (Convergence rate of AnycostFL). Let g min = min{g t } &#8704;t be the minimal global learning gain over the Tround training. The upper bound of the convergence rate of AnycostFL satisfies <ref type="bibr">(18)</ref> where</p><p>Recall that parameters &#957;, &#955; and are defined in Assumptions 1 to 4 before. Proof. See Appendix C.</p><p>Based on Definition 3 and Theorem 2, we derive the following proposition. Proposition 1. The key to minimizing the training loss of AnycostFL is to maximize the learning gain g t for each global round. If g t = 1 &#8704;t, AnycostFL degrades to conventional FL without model shrinking and gradient compression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D. Solution for Problem (P1)</head><p>Based on Theorem 2 and Proposition 1, Problem (P1) can be transformed into the following problem.</p><p>subject to: Constrains (10a) to (10e), variables:</p><p>Based on Constraints (10a) and (10b) for the training latency and energy, we obtain the following lemma. Lemma 3. The equality will always hold for Constraints (10a) and (10b) when confirming the optimal training strategy {&#945; * t,i , &#946; * t,i , f * t,i } &#8704;i , and thus T * t,i = T max and E t,i = E * t,i &#8704;i. Proof. The lemma can be proved by showing the contradiction. Suppose that there exists i 0 such that T * t,i0 &lt; T max . We can find a new solution {&#945; t,i0 , &#946; * t,i0 , f t,i0 } for device i 0 and &#945;</p><p>, such that T t,i0 = T max and E t,t0 = E max t,i . Since the global learning gain increases with the increase of &#945; t,i0 , we have g t &gt; g * t . Likewise, the contradiction also appears when E * t,i0 &lt; E max t,i0 , and thus we complete the proof.</p><p>Based on Lemma 3, we employ two intermediate variables (i.e., &#966; t,i and &#981; t,i ) for each device to reparameterize Problem (P3). Specifically, &#966; t,i &#8712; [0, 1] and &#981; t,i &#8712; [0, 1] are the splitting factors for latency and energy, respectively, such that</p><p>By combining Eqns ( <ref type="formula">6</ref>) and ( <ref type="formula">20</ref>), the local learning gain of the device i at the t-th round can be rewritten as</p><p>where</p><p>Note that Problem (P3) can be transformed into I subproblems because the decision-making procedure of each device is independent. Based on Eqn. ( <ref type="formula">21</ref>), the i-th subproblem can be expressed as a single-variable optimization problem with respect to &#966; t,i as follows.</p><p>(P4) m a x &#966;t,i g t,i &#966; t,i <ref type="bibr">(22)</ref> subject to: &#966; min t,i &#8804; &#966; t,i &#8804;&#966; max t,i , where the lower and upper limits of &#966; t,i can be acquired by</p><p>Based on the first-order optimality condition &#8706;g t,i /&#966; t,i = 0, we obtain the stationary points as</p><p>where &#968; t,i = 4(P com t,i T max )</p><p>2 -4E max t,i P com t,i T max + 9(E max t,i ) 2 . Let S t,i = {&#966; min t,i , &#966; max t,i , &#966; s1 t,i , &#966; s2 t,i } denote the union of the stationary points and the boundary points for Problem (P4). Then, S t,i = {&#966; t,i |&#966; t,i &#8712; [&#966; min t,i , &#966; max t,i ], &#966; t,i &#8712; S t,i } is the set of the feasible solutions of S t,i . The optimal solution for Problem (P4) can be acquired by &#966; * t,i = arg max &#966;t,i&#8712;S t,i g t,i (&#966; t,i ).</p><p>Furthermore, we obtain the optimal solution for device i at the t-th global round by putting &#966; * t,i into the following equations.</p><p>(26) Notably, the decision-making process of each device does not involve the auxiliary information of the resource status from other devices. At the beginning of each global round, each device can determine its training strategy locally.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>V. EXPERIMENT EVALUATIONS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Experiment Settings 1) Setup for FL training:</head><p>We consider the FL application with image classification on Fashion-MNIST and CIFAR-10 datasets <ref type="bibr">[41]</ref>, <ref type="bibr">[42]</ref>. For Fashion-MNIST, we use a small convolutional neural network (CNN) with data size of model update as 53.22Mb <ref type="bibr">[1]</ref>. For the CIFAR-10 dataset, we employ VGG-9 with data size of model update as 111.7Mb <ref type="bibr">[43]</ref>. For IID and non-IID data settings, we follow the dataset partition strategy in <ref type="bibr">[34]</ref>. For the learning hyper-parameters, the learning rate, batch size and local epoch are set as {0.01, 32, 1} for Fashion-MNIST and {0.08, 64, 1} for CIFAR-10 dataset. The maximal latency is set as T max = 10 seconds TABLE I PERFORMANCE COMPARISON BETWEEN ANYCOSTFL AND OTHER METHODS ON FASHION-MNIST AND CIFAR-10 DATASETS. IID non-IID Dataset Method #Round Energy (KJ) Latency (min) Comp. (TFLOPs) Comm. (GB) Best Acc. (%) #Round Energy (KJ) Latency (min) Comp. (TFLOPs) Comm. (GB) Best Acc. (%) FMNIST {90%, 89%} * STC 305 (1.7&#215;) 10.94 (1.4&#215;) 25.42 (1.7&#215;) 152.71 0.71 90.28&#177;0.18 283 (1.3&#215;) 10.17 (1.1&#215;) 23.56 (1.3&#215;) 141.53 0.66 89.47&#177;0.16 QSGD 283 (1.6&#215;) 11.40 (1.4&#215;) 23.56 (1.6&#215;) 141.53 0.80 90.39&#177;0.04 279 (1.3&#215;) 11.27 (1.2&#215;) 23.28 (1.3&#215;) 139.86 0.79 89.49&#177;0.07 UVeQFed 247 (1.4&#215;) 11.36 (1.4&#215;) 20.58 (1.4&#215;) 123.67 0.72 90.44&#177;0.10 266 (1.2&#215;) 12.21 (1.3&#215;) 22.14 (1.2&#215;) 133.01 0.77 89.64&#177;0.16 HeteroFL 233 (1.3&#215;) 12.03 (1.5&#215;) 21.78 (1.5&#215;) 92.21 0.57 90.43&#177;0.13 242 (1.1&#215;) 12.51 (1.3&#215;) 22.62 (1.3&#215;) 95.77 0.59 89.42&#177;0.10 FedHQ 288 (1.6&#215;) 13.89 (1.7&#215;) 24.03 (1.6&#215;) 144.36 0.86 90.21&#177;0.07 313 (1.5&#215;) 14.96 (1.6&#215;) 26.06 (1.5&#215;) 156.55 0.93 89.27&#177;0.19 AnycostFL 179 (1.0&#215;) 8.07 (1.0&#215;) 14.94 (1.0&#215;) 67.49 0.35 91.20&#177;0.09 214 (1.0&#215;) 9.63 (1.0&#215;) 17.83 (1.0&#215;) 80.51 0.42 90.32&#177;0.14 CIFAR-10 {82%, 80%} * STC 341 (1.2&#215;) 35.39 (1.3&#215;) 56.83 (1.2&#215;) 4160.56 1.78 85.38&#177;0.29 412 (1.1&#215;) 42.39 (1.3&#215;) 68.67 (1.1&#215;) 5026.84 2.15 83.09&#177;0.53 QSGD 337 (1.2&#215;) 39.82 (1.5&#215;) 56.17 (1.1&#215;) 4111.76 2.14 84.83&#177;0.54 430 (1.2&#215;) 50.29 (1.5&#215;) 71.61 (1.2&#215;) 5242.39 2.73 81.94&#177;0.13 UVeQFed 296 (1.0&#215;) 40.77 (1.5&#215;) 49.28 (1.0&#215;) 3607.45 2.12 85.09&#177;0.16 377 (1.0&#215;) 51.59 (1.5&#215;) 62.89 (1.0&#215;) 4603.87 2.71 82.30&#177;0.28 HeteroFL 332 (1.1&#215;) 50.07 (1.9&#215;) 69.14 (1.4&#215;) 3222.26 1.65 83.75&#177;0.55 413 (1.1&#215;) 62.88 (1.9&#215;) 85.78 (1.4&#215;) 3990.49 2.05 80.68&#177;0.45 FedHQ 340 (1.2&#215;) 48.95 (1.9&#215;) 56.67 (1.2&#215;) 4148.36 2.32 84.02&#177;0.22 435 (1.2&#215;) 61.99 (1.9&#215;) 72.44 (1.2&#215;) 5303.40 2.96 81.00&#177;0.41 AnycostFL 294 (1.0&#215;) 26.43 (1.0&#215;) 48.94 (1.0&#215;) 2459.92 1.56 87.72&#177;0.23 372 (1.0&#215;) 33.51 (1.0&#215;) 62.06 (1.0&#215;) 3118.60 1.98 84.91&#177;0.51 * {x, y}: x and y denote the target global model accuracy under IID and non-IID data settings, respectively.</p><p>and the energy budget is set as E max t,i &#8764; U <ref type="bibr">[3,</ref><ref type="bibr">9]</ref> joules for the CIFAR-10 dataset, and the corresponding hyper-parameters for the FMNIST dataset are halved by default. Additionally, we set &#945; min = 1/4 and &#946; max = 1/15.</p><p>2) Setup for mobile system: We investigate a mobile system with I = 60 devices located within a circle cell with a radius of 550 meters, and a base station is situated at the center. To simulate the mobility, the position of each device is refreshed randomly at the beginning of each round <ref type="bibr">[44]</ref>. For the computation, the energy coefficient is set as i &#8764; U [5 &#215; 10 -27 , 1 &#215; 1 -26 ]. For communication, the bandwidth is set as 1MHz equally for each device, and the path loss exponent is 3.76. The transmission power is set as 0.1W, and N 0 is set as -114dBm/MHz.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Performance Comparisons</head><p>We compare the proposed AnycostFL with the following efficient FL algorithms with three different random seeds.</p><p>&#8226; STC. The sparse ternary compression (STC) is adapted to reduce the cost of uplink parameter transmission <ref type="bibr">[11]</ref>. &#8226; QSGD. The TopK sparsification and probabilistic quantization are combined to compress the local gradient <ref type="bibr">[36]</ref>. &#8226; UVeQFed. The TopK sparsification and universal vector quantization are used to compress the local gradient <ref type="bibr">[14]</ref>. &#8226; HeteroFL. Each device trains the local sub-model in different widths to match its computation capacity <ref type="bibr">[32]</ref>. &#8226; FedHQ. Each device uses different quantization levels to compress the gradient according to its channel state <ref type="bibr">[40]</ref>.  <ref type="table">I</ref> provides the best accuracy and required system cost for achieving the specified test accuracy. Particularly, when compared with HeterFL and FedHQ, AnycostFL can reduce up to 1.9 times the energy consumption to reach the test accuracy of 82% on CIFAR-10 dataset under the IID setting. When compared with STC, AnycostFL can reduce up to 1.7 times the time consumption to reach the test accuracy of 90% on FMNIST dataset under the IID setting. Moreover, our framework can significantly improve the best accuracy of the global model by 2.33% and 1.82% on CIFAR-10 dataset under the IID and the non-IID settings, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Impact of Key Mechanisms and Hyper-parameters</head><p>Fig. <ref type="figure">5</ref>(a) verifies the advantages of the main techniques of AnycostFL. We gradually remove the elastic model shrinking (w/o EMS), the flexible gradient compression (w/o FGC) and the all-in-one aggregation (w/o AIO), and record the required system cost to achieve 80% test accuracy with CIFAR-10 dataset under the IID setting. We observe that the proposed EMS and FGC can significantly save the energy consumption and training time, respectively. Besides, AIO contributes to saving both energy and time. We next evaluate the impact of resource heterogeneity on the training efficiency in Fig. <ref type="figure">5(b-c</ref>). We set the average energy coefficient i as 7.5 &#215; 10 -27 and the average distance between the base station and edge devices as 400 meters, and then change their variances to simulate the computation and communication heterogeneity, respectively. The larger variance indicates a higher level of system heterogeneity. As we expect, the proposed AnycostFL shows more resilience than other baselines to tackle the high level of system heterogeneity.</p><p>We also evaluate the performance of sub-models in different widths in Fig. <ref type="figure">5(d)</ref>. Specifically, We compare AnycostFL with HeteroFL (i.e., local training with different widths) and STC (i.e., the best-performing compression-only method). The submodels are derived from the well-trained global model without further re-training. Surprisingly, the sub-models of the global model trained by AnycostFL can still maintain satisfactory test accuracy, which provides dynamic inference for diverse edge devices after the training time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>VI. CONCLUSION</head><p>In this paper, we proposed AnycostFL, a joint computation and communication efficient framework for FL, that enables edge devices with diverse resources to train a shared global model. We aimed to minimize the global training loss under given personalized latency and energy constraints. By leveraging the theoretical insight of AnycostFL, we decomposed the optimization problem into multiple sub-problems. Following that, the optimal training strategy is derived for each device according to its locally available resource. Experiments demonstrate the advantage of our framework in improving the system efficiency and model performance compared to the state-of-the-art methods.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENT</head><p>Rong Yu and Yuan Wu are the corresponding authors. This work was supported in part by National Key R&amp;D Program of China under Grant 2020YFB1807802, in part by National Natural Science Foundation of China under Grants 61971148, 62102099, U22A2054 and 62001125, in part by Science and Technology Development Fund of Macau SAR under Grant 0162/2019/A3, in part by FDCT-MOST Joint Project under Grant 0066/2019/AMJ, in part by the Guangdong Basic and Applied Basic Research Foundation (2022A1515011287), and in part by US National Science Foundation under grant CNS-2107057. APPENDIX A PROOF OF LEMMA 1</p><p>Proof. For the given local gradient &#361;t,i with shrinking factor &#945; t,i and gradient compression rate &#946; t,i , we aim to capture the divergence between &#361;t,i and u t,i . Suppose that the absolute value of the element in u t,i follows uniform distribution |u| &#8764; U (0, u max ), and u max = max{|u|} &#8704;u&#8712;ut,i .</p><p>For clear notation, we sort the element-wise absolute value of u t,i in ascending order. Then, we obtain u</p><p>Based on Assumption 5, the update generated from local training with w &#945; t,i is equal to shrink(u t,i , &#945; t,i ). The operation of model shrinking on u t,i with &#945; t,i can be viewed as removing (1&#945; t,i )J elements with the least value from u t,i . Then, we obtain shrink(u t,i , &#945; t,i ) = [0, . . . , 0, u</p><p>We next focus on the gradient compression. The operation of gradient sparsification on u t,i with sparsity of &#961; t,i can be viewed as removing &#961; t,i J elements with the least value from u t,i . Then, the quantization is conducted on the nonzero elements of &#251;t,i , and we obtain cmprs(u t,i , &#946; t</p><p>) Likewise to Eqn. (28), we have (A) = &#961; 3 t,i E u t,i 2 . Based on Eqn. (4) and the statistical feature of u t,i , we obtain (B) = (1&#961; t,i ) 3 E u t,i 2 /(2L 2 t,i ). Given plain update u t,i in 32-bit floating point and the desired compression rate &#946; t,i , we can set &#961; t,i = 1&#946; t,i and L t,i = 2 32 &#8730; &#946;t,i for the analysis. In this way, the operations of sparsification and quantization contribute equally to the gradient compression. Furthermore, we have E ut,icmprs(ut,i, &#946;t,i) 2 &#8804; (1&#946;t,i) 2 E ut,i 2 . (30) Next, we focus on the local divergence &#948; t,i with respect to &#945; t,i and &#946; t,i . According to the Definition 1, we have E &#948;t,i 2 = E ut,icmprs([ut,i] &#945; , &#946;t,i) 2 = E ut,i -[ut,i] &#945; 2 + E [ut,i] &#945;cmprs([ut,i] &#945; , &#946;t,i) 2 + 2 &lt; ut,i -[ut,i] &#945; , [ut,i] &#945;cmprs([ut,i] &#945; , &#946;t,i) &gt; (C)</p><p>. <ref type="bibr">(31)</ref> It can be verified that the two vectors in term (C) are orthogonal, and we obtain (C) = 0. According to Eqns ( <ref type="formula">28</ref>) and ( <ref type="formula">30</ref>), we further obtain</p><p>&#8804; (1&#945;t,i)</p><p>3 E ut,i 2 + (1&#946;t,i) 2 &#945;t,i(&#945; 2 t,i -3&#945;t,i + 3)E ut,i 2 (b) &#8804; 1&#945;t,i(2&#945;t,i) &#946;t,i 2 E ut,i 2 . (32) Likewise to Eqn. (28), inequality (a) stems from the fact that E [u t,i ] &#945; 2 = &#945; t,i (&#945; 2 t,i -3&#945; t,i +3)E u t,i 2 . Besides, inequality (b) holds for all &#945; t,i &#8712; [&#945; min , 1] and &#946; t,i &#8712; [0, &#946; max ]. Thus, we complete the proof. APPENDIX B PROOF OF LEMMA 2 Proof. Based on Definition 2 and Lemma 1, we have E &#916;t 2 = E I i=1 pt,iut,i -I i=1 pt,i &#361;t,i 2 &#8804; E I i=1 pt,i 1&#945;t,i(2&#945;t,i) &#946;t,i ut,i 2 . (33) We use &#951; to denote the learning rate, and u t,i = &#951;&#8711;F i (w t ). Based on Assumption 4, we obtain E &#916;t 2 &#8804; &#949;&#951; 2 I i=1 pt,i 1&#945;t,i(2&#945;t,i) &#946;t,i 2 E &#8711;F (wt) 2 . (34) According to Cauchy-Schwarz inequality, we obtain E &#916;t 2 &#8804; I&#949;&#951; 2 I i=1 p 2 t,i 1&#945;t,i(2&#945;t,i) &#946;t,i 2 E &#8711;F (wt) 2 .</p><p>(</p><p>Thus, we complete the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX C ON THE CONVERGENCE OF ANYCOSTFL</head><p>Proof. Inspired by the studies in <ref type="bibr">[5]</ref>, <ref type="bibr">[39]</ref>, we deduce the convergence analysis of AnycostFL. According to Taylor expansion and Assumption 3, we have</p><p>F (wt+1) &#8804; F (wt) + (wt+1 -wt) &#8711;F (wt) + &#955; 2 wt+1 -wt 2 = F (wt) -&#361; t &#8711;F (wt) + &#955; 2 &#361;t 2 . (36) By using learning rate &#951; = 1 &#955; , we obtain E F (wt+1) &#8804; E F (wt)&#955; (ut -&#916;t) ut + &#955; 2 ut -&#916;t 2 = E F (wt) -1 2&#955; &#8711;F (wt) 2 + &#955; 2 &#916;t 2 . (37) We now pay attention to the upper bound of &#916; t 2 . Based on Jensen's inequality and Eqn. (34), we obtain E &#916;t 2 &#8804; &#949;&#951; 2 I i=1 pt,i 1&#945;t,i(2&#945;t,i) &#946;t,i 2 (D) E &#8711;F (wt) 2 . (38) By putting Eqn. (13) into (A), we have E D &#8804; E I I i=1 1 (1-&#945;t,i(2-&#945;t,i) &#8730; &#946;t,i) 2 (c) &#8804;E I I i=1 1 1-&#945; 4 t,i &#946;t,i , (39) where (c) always holds for &#945; t,i &#8712; [0, 1] and &#946; t,i &#8712; [0, 1]. According to Definition 3, we have g t,i = &#945; 4 t,i &#946; t,i and g t = i g t,i /I. Since 1/ i 1 1-gt,i is a concave function with respect to g t,i , based on Jensen's inequality, we obtain E A &#8804; I i 1 1-E(&#945; 4 t,i &#946;t,i) = 1g t . (40) Since the training strategies of each device and the norm of the gradient of global data &#8711;F (w t ) are independent, by putting Eqn. (40) back to Eqn. (38), we obtain E &#916; t 2 &#8804; E &#949;&#951; 2 1g t &#8711;F (w t ) 2 . (41) Next, by putting Eqn. (41) back to Eqn. (37), we have E F (wt+1) &#8804; E F (wt) -1 + &#949; gt -1 2&#955; &#8711;F (wt) 2 . (42) Subtracting F (w * ) in both sides of Eqn. (42) yields E F (wt+1 -F (w * ) &#8804; E F (wt) -1 + &#949;(gt -1) 2&#955; &#8711;F (wt) 2 -F (w * ) .</p><p>(</p><p>Based on Assumptions 2 and 3, we have <ref type="bibr">[5]</ref>, <ref type="bibr">[45]</ref> &#8711;F (w t ) 2 &#8805; 2&#957; F (w t ) -F (w * ) . (</p><p>Plugging Eqn. <ref type="bibr">(44)</ref> into Eqn. ( <ref type="formula">43</ref>), we have E F (w t+1 ) -F (w * ) &#8804; Z t E F (w t ) -F (w * ) , <ref type="bibr">(45)</ref> where Z t = 1 -&#957; &#955; (1&#949;(1g t )). Let g min = min{g t } &#8704;t be the minimal global learning gain over T global rounds. By recursively applying the above inequality from iteration round 0 to T , we can obtain E F (w T ) -F (w * ) &#8804; Z T -1 E F (w 0 ) -F (w * ) , <ref type="bibr">(46)</ref> where Z = 1 -&#957; &#955; 1&#949;(1g min ) . Thus, we complete the proof.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Authorized licensed use limited to: University of Houston. Downloaded on August 28,2024 at 17:39:18 UTC from IEEE Xplore. Restrictions apply.</p></note>
		</body>
		</text>
</TEI>
