<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Quantifying the Impact of Label Noise on Federated Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10394400</idno>
					<idno type="doi"></idno>
					<title level='j'>AAAI-2023 Workshop on AI for Agriculture and Food Systems</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Ke S</author><author>Huang C</author><author>Liu X</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Federated Learning (FL) is a distributed machine learning paradigm where clients collaboratively train a model using their local (human-generated) datasets. While existing studies focus on FL algorithm development to tackle data heterogeneity across clients, the important issue of data quality (e.g., label noise) in FL is overlooked. This paper aims to fill this gap by providing a quantitative study on the impact of label noise on FL. We derive an upper bound for the generalization error that is linear in the clients' label noise level. Then we conduct experiments on MNIST and CIFAR-10 datasets using various FL algorithms. Our empirical results show that the global model accuracy linearly decreases as the noise level increases, which is consistent with our theoretical analysis. We further find that label noise slows down the convergence of FL training, and the global model tends to overfit when the noise level is high.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Federated Learning (FL) is a distributed machine learning paradigm where clients (e.g., distributed devices or organizations) collaboratively train a global model <ref type="bibr">(Kairouz et al. 2021)</ref>. The local data of the clients are often humangenerated and have critical privacy concerns. An FL process consists of some communication rounds. In each round, each client trains its local model with its local data and then uploads the model updates to a central server <ref type="bibr">(Hsu, Qi, and Brown 2019)</ref>. The central server aggregates the local updates from clients and sends back an aggregated global model to all clients. After that, clients update their local models according to the information from the central server <ref type="bibr">(Collins et al. 2022)</ref>. The client-server interaction stops when the global model converges.</p><p>There has been an increasing volume of research studies on FL over the last few years <ref type="bibr">(Kairouz et al. 2021;</ref><ref type="bibr">Wang et al. 2021;</ref><ref type="bibr">Li et al. 2021b;</ref><ref type="bibr">Jiang et al. 2022</ref>). Among these studies, a critical bottleneck, which without appropriate algorithmic treatment usually fails FL, is data heterogeneity (non-IID). For example, in a classification task, some clients may collect more data for class A while others may collect more data for class B. Previous studies among this line focused on two categories of non-IID: attribute skew and label skew <ref type="bibr">(Zhu et al. 2021)</ref>. Attribute skew refers to the case where the feature distribution of each client is different from one another. For example, attribute skew could occur in a handwritten digit classification task as users may write the same digit with different font styles, sizes, and stroke widths <ref type="bibr">(Kairouz et al. 2021)</ref>. Label skew refers to the case where the label distribution of each client is different from one another. Label skew, for example, could occur in an animal recognition task. Label distributions are different because clients are in different geo-regions and different animal habitats -dolphins only live near coastal regions, or aquariums <ref type="bibr">(Kairouz et al. 2021)</ref>.</p><p>While existing studies focus on tackling the non-IIDness, some implicitly assume that the data are clean, i.e., the data are correctly labeled. In practical applications, however, clients' datasets usually contain noisy labels <ref type="bibr">(Northcutt, Athalye, and Mueller 2021)</ref>. Label noise has been identified in many widely used FL datasets, including MNIST (Northcutt, <ref type="bibr">Jiang, and Chuang 2021;</ref><ref type="bibr">LeCun and Cortes 2005)</ref>, EMNIST (Al-Rawi and Karatzas 2018; <ref type="bibr">Cohen et al. 2017)</ref>, <ref type="bibr">CIFAR-10 (Krizhevsky 2009;</ref><ref type="bibr">Al-Rawi and Karatzas 2018)</ref>, ImageNet <ref type="bibr">(Northcutt, Jiang, and Chuang 2021;</ref><ref type="bibr">Russakovsky et al. 2014)</ref>, and Clothing1M <ref type="bibr">(Xiao et al. 2015)</ref>. The causes of label noise can be human error, subjective labeling tasks, non-exact data labeling processes, and malfunctioning data collection infrastructure <ref type="bibr">(Johnson and Khoshgoftaar 2022;</ref><ref type="bibr">Chen et al. 2021)</ref>. Moreover, in an FL setting, as clients collect and label local data in a distributed and private fashion, their labels are likely to be noisy and have different noise patterns <ref type="bibr">(Xu et al. 2022)</ref>. For example, wearable devices can access various human-generated data, such as heart rate, sleep patterns, medication records, and mental health logs. Such data could contain different levels of label noise due to various sensor precision issues and human bias <ref type="bibr">(Kim, Jo, and Choi 2021)</ref>.</p><p>Label noise is known to lessen model performance (Johnson and Khoshgoftaar 2022). This paper focuses on the issue of label noise in FL, and we are particularly interested in answering the following two key questions:</p><p>&#8226; Question 1: How does label noise affect FL convergence? &#8226; Question 2: How does label noise affect FL generalization?</p><p>To answer Question 1, we conduct numerical experiments 1 arXiv:2211.07816v5 [cs.LG] 28 Jan 2023</p><p>and show that the training loss converges slower with a higher noise level. To answer Question 2, we proceed from both theoretical and empirical perspectives. First, under minor assumptions, we prove that, for any distributed learning algorithm, the generalization error of the global model is linearly bounded above by a multiple of the system noise level. Then we conduct experiments using MNIST and CIFAR-10, showing that the results are consistent with the assumptions and theoretical results. We further show that the global model's accuracy decreases linearly in the clients' label noise level.</p><p>The key contributions of this paper are summarized below.</p><p>&#8226; To the best of our knowledge, this is the first quantitative study that analyzes the impact of label noise on FL. Our study bears practical significance for its use in different applications, e.g., incentive design <ref type="bibr">(Huang et al. 2022</ref>). &#8226; We provide a generic upper bound on the FL generalization error that applies to any FL algorithms. We further obtain a tighter upper bound considering the widely adopted ReLU networks in clients' local models. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Related Work</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Label noise</head><p>Label noise has been an active topic in FL over the last few years. We classify the existing methods into three categories: (1) Some methods apply noise-tolerant loss functions to achieve robust performance <ref type="bibr">(Sharma et al. 2022</ref>).</p><p>(2) Some methods distill confident training sample by selection or a weighting scheme <ref type="bibr">(Chen et al. 2021;</ref><ref type="bibr">Yang et al. 2022b,a;</ref><ref type="bibr">Ma et al. 2021;</ref><ref type="bibr">Chen et al. 2020;</ref><ref type="bibr">Fang and Ye 2022;</ref><ref type="bibr">Duan et al. 2022;</ref><ref type="bibr">Han and Zhang 2020;</ref><ref type="bibr">Li et al. 2021a;</ref><ref type="bibr">Kim et al. 2022;</ref><ref type="bibr">Li, Pei, and Huang 2022;</ref><ref type="bibr">Tuor et al. 2021</ref>). Li et al. discovered that label noise might cause overfitting for FedAvg algorithm. However, they did not analytically characterize the hidden linear relation between noise level and the global model's performance.</p><p>(3) Based on (2), some methods further correct noisy samples <ref type="bibr">(Xu et al. 2022;</ref><ref type="bibr">Zeng et al. 2022;</ref><ref type="bibr">Wang et al. 2022;</ref><ref type="bibr">Tsouvalas et al. 2022)</ref>   <ref type="bibr">(Neyshabur, Salakhutdinov, and Srebro 2015;</ref><ref type="bibr">Neyshabur et al. 2017)</ref>. Empirical studies showed that path-norm positively correlates with generalization in all categories of hyper-parameter <ref type="bibr">(Jiang* et al. 2020)</ref>.</p><p>The value of path-norm increases throughout the learning process. E et al. showed that the path-norm increases at most polynomially under centralized training <ref type="bibr">(Weinan et al. 2020)</ref>. In this work, we conduct the first formal study on the evolution of path-norm in FL. This is also the first work that analyzes the generalization ability of models in FL with path-norm proxy. We introduce path-norm proxy to the FL context because this proxy does not require unrealistic assumptions and allows us to characterize a large class of FL algorithms. For example, the assumptions on convexity, smoothness, etc., are no longer necessary in our analysis. Moreover, we have empirically verified our theory based on the definition of path-norm proxy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Preliminaries and Problem Statement Federated Learning</head><p>In this subsection, we briefly introduce the problem formulation and algorithmic framework of FL.</p><p>Consider a typical FL task <ref type="bibr">(Kairouz et al. 2021)</ref>, where N clients collaboratively train a global model under the coordination of a central server through R communication rounds. FL aims to solve a distributed optimization problem with distributed data. Here we first introduce the objective of the distributed optimization problem and then define the relevant notations. The objective is min</p><p>where we define</p><p>&#8226; Hypothesis space: W &#8834; R dw denotes the hypothesis space of all feasible parameters of learning models, and d w &#8712; N is the dimension of the hypothesis space. &#8226; Local data: Each client has a local dataset S k . We assume that in the k-th dataset S k , each data point is drawn from a distribution &#960; k over S &#8834; R dx+dy where d x denotes the dimension of feature space and d y denotes the dimension of the label space. A data point (x, y) &#8712; R dx+dy is a realvalued vector where x &#8712; R dx denotes its feature and y &#8712; R dy denotes its label. There are in total n k data points in client k's local dataset</p><p>Let &#181; k denote the ground truth distribution (i.e., clean labels) and &#960; k denote client k's possibly noisy data distribution. There exists label noise in the local dataset of client k if there exists x &#8712; R dx , y &#8712; R dy such that</p><p>where Pr represents a probability mass/density function with a given distribution and an event. One can consider the data points sampled from &#960; k as training data and those sampled from &#181; k as test data. &#8226; Global parameter and local parameter: We denote the global model's parameter as a real-valued vector W &#8712; W. Each client has a local model with parameter w k &#8712; W. &#8226; Meta model: We define the meta model f : R dx &#215; W &#8594; R dy as a function that maps the data feature and model parameter to an estimated label. For example, a meta model could be a neural network with variable parameters. We obtain a model by substituting the variable parameters with real number values. &#8226; Loss function: We denote the loss function as</p><p>For example, a squared loss function is defined as : (y, &#375;) &#8594; y -&#375; 2 .</p><p>In each communication round, a client trains its local model for E epochs to minimize the local training loss Different FL algorithms use different aggregation mechanisms. We use FedAvg as an example to explain the aggregation step in Algorithm 1. In FedAvg, the aggregation is defined as</p><p>where &#951; gl denotes the global learning rate. Note that in a realistic setting, there could be limitations on computation and end for 13: end for communication, including computational efficiency, communication bandwidth, and network robustness <ref type="bibr">(Ghosh et al. 2020)</ref>. For example, some clients may fail to communicate with the central server due to network issues. Therefore, the server only samples a subset of available clients. Since we focus on data noise, we ignore these realistic considerations and assume that all clients participate in all communication rounds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model performance</head><p>This subsection introduces the theoretical tools to measure a learning algorithm's performance. Here we inherit most notations from the last part with some revisions. We consider fixed data points for a FL process in the previous part. But in this part, we consider each data point and each local dataset S k as random variables to investigate the generalization performance of an algorithm given an arbitrary training dataset. The pair (x, y) in lowercase represents a deterministic data point, and the pair (X, Y ) in uppercase represents pair of random variables. We re-write a local dataset S k as</p><p>where n := N k=1 n k and W denotes the parameter of the global model. Given the ground truth distribution &#181; k of each client, we further define the ground-truth risk L &#8224; : W &#8594; R &#8805;0 of the global model as</p><p>Then we define the generalization error of the global model as (Yagli, Dytso, and Vincent Poor 2020)</p><p>Path-norm proxy This paper uses ReLU network (Figure <ref type="figure">1</ref>) and path-norm proxy (Figure <ref type="figure">2</ref>) for a case study of the generalization error. The authors in <ref type="bibr">(Weinan et al. 2020</ref>) provided a mathematical description of ReLU networks. Based on their definitions, we define the path-norm proxy below. Definition 1 (Path-norm proxy <ref type="bibr">(Weinan et al. 2020)</ref>). The path-norm proxy of an L-layer ReLU network is defined as</p><p>where &#952; denotes the parameter vector of the ReLU network; &#952; l (i l , i l+1 ) refers to the weight of the edge connecting the i l -th node in layer l and the i l+1 -th node in layer l + 1. The authors in <ref type="bibr">(Weinan et al. 2020</ref>) also proved that the path norm proxy controls the generalization error in a centralized learning setting. Next, we will show that the path norm proxy controls the generalization error in FL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Theoretical Results</head><p>In this section, we provide a theoretical analysis of the generalization error of the global model in FL. In particular, we give proof of the upper bound of the global model's generalization error.</p><p>In practical FL applications, local data distributions are complicated as we cannot explicitly find the distribution functions. To simplify our theoretical analysis, we make the following assumption: This assumption assures the feature distributions to be identical for all clients, which is a standard setting in studies about concept drift <ref type="bibr">(Jothimurugesan et al. 2022)</ref>. Although it is difficult to show that Assumption 2 holds in our experiment settings, the numerical results are still consistent with our theoretical results.</p><p>We first provide a general result on the upper bound of generalization error in Theorem 3. Then we extend this general bound by studying some specific cases with more assumptions in Corollary 8.</p><p>Theorem 3 (Bound the evolution of generalization error). Consider any Federated Learning algorithm with a neural network with an arbitrary structure for a classification task of C classes under label noise and use the cross-entropy function for loss computation, then under Assumption 2</p><p>where &#8486; is the upper bound of f . Interpretation of Theorem 3: This theorem implies that the generalization error of global model in FL is linearly bounded by the degree of label noise in the distributed system. The theorem quantitatively characterizes the impact of label noise. This linear bound is also consistent with our empirical findings. When N = 1, this linear bound applies to centralized learning.</p><p>We can interpret the expectation term in the upper bound with an example. In this example, we set N = 2, i.e., two clients. The input space consists of 25 discrete grid points and two classes. Client 2's local data distribution is identical to the ground truth. Client 1 has label noise in its local data where three circled data points in class A are mislabelled as class B. If the two clients has the same number of data samples, i.e., Before we prove Theorem 3, we need a lemma on crossentropy. Lemma 4. Consider a classification problem of C classes. Given a data distribution &#960; such that (x, y) &#8764; &#960;, y &#8712; [1 : C], a neural network f and a probability measure Pr, then the expectation of cross-entropy loss is</p><p>In most machine learning tasks, it is reasonable to assume that the input and output of the model are bounded, which we formalize in Assumptions 5 and 6. Assumption 5 (Bounded input space). The input space X is bounded in [0, 1] dx &#8834; R dx . Assumption 6 (Bounded model output). Consider a neural network f : R dx &#215; W &#8594; R dy . We assume that its range</p><p>Note that the upper bound of model output could change as we train the model for more epochs. To model the evolution of the output upper bound, we can relax Assumption 6 and study a specific family of classifiers: ReLU networks. Later we can bound the generalization error evolution given the growth of path-norm proxy through iterations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Proposition 7 (Polynomial growth of path-norm proxy).</head><p>Consider an FL process with a L-layer neural network f : R dx &#215; W &#8594; R dy as its global model, then its pathnorm increases at most polynomially,</p><p>where t &#8804; R denotes the number of communication rounds and E denotes the local training time.</p><p>If we consider a generic decentralized algorithm, we have</p><p>where C is a constant independent of t, L, E.</p><p>Corollary 8. We can specify &#8486; in Theorem 3 with various assumptions:</p><p>1. By Assumption 6, &#8486; = C f . 2. If we use a ReLU network as our model in the FL task, then &#8486; = f (&#8226;; &#952;(t)) pnp . 3. By Assumption 5 and Proposition 7,</p><p>where C 0 is a constant independent of t, E, L.</p><p>There are some important implications behind Corollary 8.</p><p>&#8226; Since the first two statements of the corollary do not rely on the aggregation mechanism of the algorithm, they could also be extended from FL to a decentralized learning scenario, e.g. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Numerical Results</head><p>We present three numerical experiments to validate our theoretical results and draw new insights. We first verify our theoretical work on the path-norm proxy. Then we show experiments of 2-client, 4-client, and 15-client FL settings.</p><p>Our main findings are 1) the growth of path-norm proxy empirically increases in a polynomial order in FL; 2) there exists an approximate negative linear relation between the test accuracy of global model and the number of incorrectly labeled data; 3) label noise slows down the convergence of FL algorithms and induces over-fitting to the global model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Path-norm Proxy</head><p>In this subsection, we study the path norm proxy and observe its relation with the number of layers and communication rounds. Compare different FL algorithms. We use the same neural network structure and consider N = 4 clients. We study These results empirically show that the path norm increases polynomially in FL. Compare ReLU networks with different numbers of layers. We train ReLU networks using FedAvg on MNIST. We use different numbers of layers in {2, 3, 4}. The result is illustrated in Figure <ref type="figure">5b</ref>. To verify the polynomial rule, we use a logarithmic scale on the y-axis. The three path-norm curves have similar shapes and almost differ up to a constant factor. This result is consistent with Proposition 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Pilot experiments</head><p>We run 2-client experiments with FedAvg algorithm on MNIST dataset. We study a 2-client setting for multiple considerations.</p><p>&#8226; 2-client setting exists in practice. In cross-silo FL, clients could be enterprises, and each client could provide abundant data, so the total number of clients is relatively small. For example, since 2019, two insurance companies, Swiss Re and WeBank, have collaborated on federated learning <ref type="bibr">(Huang, Huang, and Liu 2022</ref>). &#8226; This experiment serves as a starting point and gives us a thorough pedagogical understanding of the impact of label noise. We will study the 4-client and 15-client cases in the next subsection.</p><p>We generate the local datasets for two clients by dividing the whole dataset into two equally-sized parts. We add label noise to local datasets by uniformly flipping some instances' labels to other class labels. Each client has different noise levels. Denote the noise level of client i as wp i , then (wp 1 , wp 2 ) &#8712; {0%, 10%, 20%, . . . , 80%, 90%} 2 . Pathological noise levels (greater than 50%) have been studied in supervised learning settings <ref type="bibr">(Luo et al. 2022)</ref>. We illustrate the test accuracy of global models under different degrees of label noise as bar charts in Fig 6 . To verify the linear trend of test accuracy, we perform linear regression and visualize the result in Fig 7 . 
Negative bilinear trend by label noise. Figure <ref type="figure">6</ref> shows a negative bilinear relation between the test accuracy of the global model and noise label. When we apply linear regression on the test accuracy of the global model and the proportion of wrongly labeled data, we obtain a coefficient of determination of 0.98 in Figure <ref type="figure">7</ref>. That means the relation between the test accuracy and label noise has a strong linear relation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiments with larger cohort size</head><p>We run experiments on CIFAR-10 dataset respectively with four clients and fifteen clients. Local datasets are generated by dividing the whole dataset into equally-sized parts. We add label noise to local datasets by uniformly flipping some instances' labels to other class labels. In a case study by Gu et al., the real human annotation has a rater error rate of around 4.8% <ref type="bibr">(Gu et al. 2022)</ref>. Therefore it is reasonable to study the error rate within a relatively small range that contains 4.8%, i.e., from 0% to 10%. We set the same proportion of wrongly labeled data for each client (0%, 2%, 4%, 8%). Slow Convergence by label noise. In Figure <ref type="figure">8</ref> and Figure <ref type="figure">9</ref>, we plot how client 1's local model loss depends on the communication rounds at different percentages of wrongly la-  beled data. The training loss decreases slower with a larger proportion of wrongly labeled data, i.e., the algorithm converges slower with a larger proportion of wrongly labeled data.</p><p>Overfitting by label noise. We observe in Figure <ref type="figure">10</ref> that for all three algorithms, the global model's test accuracy decreases after 20 communication rounds. The global model is more over-fitted with a larger percentage of wrongly labeled data. This result provides an engineering insight in FL that the over-fitting of the global model could result from some wrongly labeled data in the local datasets. It also motivates the study of mitigating label noise in FL <ref type="bibr">(Li et al. 2021a</ref>). Negative linear trend by label noise. In Figure <ref type="figure">11</ref>, all three algorithms show a negative linear relation between the test accuracy of the global model and the proportion of wrongly labeled data. This is consistent with our theoretical analysis.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussions</head><p>In this section, we discuss the limitation and potential application of our work.  bound for the generalization error, and the bound is consistent with numerical results. However, the upper bound can be loose. One can provide a lower bound or improve the upper bound by making more restrictive assumptions. For example, one can consider a regression task with MSE loss function that provides nicer theoretical properties <ref type="bibr">(Damian, Ma, and Lee 2021)</ref>. More comprehensive experiments: Our experiments use a small number of clients, which applies to cross-silo FL. In future research, we plan to study the impact of label noise with a larger number of clients (e.g., as in cross-device FL). Application: Our results potentially serve as "domain knowledge" to improve FL algorithm design. Our work could also be used in designing incentive mechanisms in FL systems <ref type="bibr">(Huang et al. 2022)</ref>. In particular, the qualitative relation in this paper helps model the performance of global model under label noise. Methodology: The emergence of large machine learning models has shifted the nature of AI research from an engineering science (iteratively improving models) to a natural science (probing capabilities of the models we designed) <ref type="bibr">(Kambhampati 2022)</ref>. Researchers have been proposing hundreds of new models/algorithms for different AI problems. However, more must be done to understand how and why a proposed model/algorithm performs in a certain way. We must build theories based on observation and experiments to understand these artificial black boxes. In this way, we can transform AI research from engineering alchemy to white-box chemistry. Our work follows this scientific paradigm shift and conducts a case study for different FL models under label noise.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Concluding Remarks</head><p>This paper takes the first step to quantify the impact of label noise on the global model in FL. The critical challenge is that we have little knowledge of the underlying information related to local data distributions and we do not have an explicit expression of the outcome of an FL algorithm. We show with both empirical evidence and theoretical proof that 1) label noise linearly degrades the global model's performance in FL; 2) label noise slows down the convergence of the global model; 3) label noise induces overfitting to the global model. Our results could provide insights into the design of a noise-robust algorithm and the design of an incen-error as follows</p><p>Theorem 12 (Growth of path-norm proxy <ref type="bibr">(Weinan et al. 2020, Corollary 5)</ref>). Consider an arbitrarily wide L-layer ReLU neural network. If the network's weight evolves under continuous gradient flow dynamics, then the network's path-norm increases at most polynomially</p><p>) where &#952;(t) denotes the weight of the network at time t, C 0 is a constant and R is the expectation of a sufficiently smooth loss function</p><p>Proof sketch of Proposition 7. This proof assumes the gradient-flow evolution and an arbitrarily wide neural network based on the proofs in <ref type="bibr">(Weinan et al. 2020)</ref>. We first consider an FL setting. Let &#952; (i) (t k ) denote the collection of all the parameters of client i's neural network uploaded to the central server at the k-th communication round before FL aggregation. Let &#952; (i) j (t k ) denote the collection of j-th layer parameters of client i's neural network uploaded to the central server at the k-th communication round before FL aggregation. Let &#952;j (t k ) denote the collection jth layer parameters of the global neural network at the k-th communication round after FL aggregation, i.e.</p><p>Define the risk functional of client i:</p><p>where X denotes the feature/input space and Y denotes the label/output space.</p><p>We first consider the path-norm evolution during local update of client i. Let E denote the time range of local update under gradient flow and t k = t k+1 + E, &#8704;k &#8804; R. By Theorem 12, we have</p><p>Without loss of generality, consider the FedAvg aggregation scheme, i.e. for 1</p><p>Let &#8486; j denote the index space of the j-th layer of a neural network, and &#952; (i) j (w j+1 , w j , t k+1 ) denote the weight of the neural network given indices w j+1 , w j , then</p><p>&#8804; &#952;j (t k ) L 2 (&#960; j+1 &#8855;&#960; j ) + max 1&#8804;i&#8804;N</p><p>&#8804; &#952;j (t 0 ) L 2 (&#960; j+1 &#8855;&#960; j ) + (k + 1) max 1&#8804;i&#8804;N R (i) &#952;(t 0 ) E 1/2</p><p>Now we have an upper bound of the j-th layer parameters, we can then derive the upper bound of the path-norm. By Lemma 4.6 in <ref type="bibr">(Weinan et al. 2020)</ref>,</p><p>where C is a constant and C &#8805; &#952;j (t 0 ) L 2 (&#960; j+1 &#8855;&#960; j ) for all 1 &#8804; j &#8804; L.</p><p>Note that the above result also applies for SCAFFOLD with a new risk functional of client i:  where c i denotes the client control variate and M is a constant. We introduce M to ensure that the risk functional is always non-negative. The gradient-flow dynamics does not depend on the choice of M .</p><p>Next, we consider a decentralized learning setting. Here we take the multi-consensus stochastic variance reduced extragradient algorithm as an example <ref type="bibr">(Luo and Ye 2022)</ref>. The local update analysis is similar to FL. As for the aggregation step in decentralized optimization, the communication step is typically written as the matrix multiplication</p><p>where &#952; denotes the matrix which collects all clients' parameter vector &#952; (i) . We assume that &#8226; W ij = 0 if client i and j can exchange information; &#8226; W is a symmetric matrix; &#8226; 0 W I, W 1 = 1, null(I -W ) = span(1).</p><p>Let &#952; (i) (t k ) denote the collection of all the parameters of client i's neural network uploaded to the central server at the k-th communication round before decentralized communication. Let &#952; (i) j (t k ) denote the collection of j-th layer parameters of client i's neural network uploaded to the central server at the k-th communication round before decentralized communication. Let &#952;(i) j (t k ) denote the collection j-th layer parameters of client i's neural network at the k-th communication round after the decentralized communication.</p><p>By Lemma 2 in <ref type="bibr">(Liu and Morse 2011)</ref> and Lemma 2.1 in <ref type="bibr">(Luo and Ye 2022)</ref>, we obtain a bound on the mixing rate of parameters,</p><p>L 2 (&#960; j+1 &#8855;&#960; j ) &#8804;&#955; 2 (W ) &#952; where &#955; 2 (W ) denotes the second largest eigenvalue of W . Then &#952;(i) j (t k+1 ) L 2 (&#960; j+1 &#8855;&#960; j ) &#8804;(1 + 2&#955; 2 (W )) &#952;j (t k ) L 2 (&#960; j+1 &#8855;&#960; j )</p><p>again, by Lemma 4.6 in <ref type="bibr">(Weinan et al. 2020)</ref>, we have</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>In some work, the conditional distribution is also referred to as "feature-to-label mapping".</p></note>
		</body>
		</text>
</TEI>
