<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Switching Gradient Methods for Constrained Federated Optimization</title></titleStmt>
			<publicationStmt>
				<publisher>OPT2025: 17th Annual Workshop on Optimization for Machine Learning</publisher>
				<date>12/02/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10652574</idno>
					<idno type="doi"></idno>
					
					<author>Antesh Upadhyay</author><author>Sang Bin Moon</author><author>Abolfazl Hashemi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Constrained optimization problems arise in federated learning (FL) settings, where a global objective must be minimized subject to a functional constraint aggregated across clients. We introduce Federated Switching Gradient Methods (FedSGM), a primal-only, projection-free algorithm for federated constrained optimization. By extending switching gradient methods to the federated setting, FedSGM avoids the inner solves and penalty tuning required by dual or penalty-based methods, enabling lightweight and scalable deployment. Our analysis addresses three practical challenges simultaneously: (i) multi-step local updates to accommodate heterogeneous client compute, (ii) unbiased uplink compression to mitigate communication costs, and (iii) both hard and soft switching between objective and constraint gradients. We provide the first convergence guarantees for constrained FL that hold under these combined settings, recovering known centralized rates in special cases. In particular, we show that soft switching, recently proposed in the centralized literature, retains convergence guarantees while offering improved empirical stability near the constraint boundary.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>We study federated constrained optimization problems of the form</p><p>w * = arg min w &#8712; R d f (w) := (1/n) &#8721; n j=1 f j (w) subject to g(w) := (1/n) &#8721; n j=1 g j (w) &#8804; 0, (1)</p><p>where f j and g j represent the local objective and constraint functions on client j. This framework captures many real-world applications of federated learning <ref type="bibr">[18,</ref><ref type="bibr">22,</ref><ref type="bibr">24]</ref>, where models must be trained across private or geo-distributed data without central collection. Constraints g(w) &#8804; 0 encode feasibility conditions such as fairness mandates, energy budgets, or safety margins in autonomous systems and battery management <ref type="bibr">[1,</ref><ref type="bibr">5,</ref><ref type="bibr">19,</ref><ref type="bibr">38]</ref>. Still, solving <ref type="formula">(1)</ref> in realistic deployment scenarios remains challenging. We highlight three primary difficulties that simultaneously shape the design space for federated constrained optimization:</p><p>(i) Functional constraints. Federated tasks increasingly involve feasibility criteria beyond minimizing a loss: fairness across subpopulations <ref type="bibr">[1]</ref>, bounding risk exposure in financial models, or safety limits in batteries (e.g., maximum temperature rise). Enforcing such constraints requires algorithms that guarantee feasibility without resorting to expensive projections or inner constrained solves each round.</p><p>(ii) Severe bandwidth limits. Deep neural network models involve millions of parameters, yet FL often operates on commodity wireless or edge networks, where it is infeasible to send full-precision parameter updates every round. Communication-efficient training requires compression techniques such as Top-K, Rand-K, or quantization <ref type="bibr">[2,</ref><ref type="bibr">9,</ref><ref type="bibr">33]</ref>.</p><p>(iii) Heterogeneous on-device compute. 
Devices participating in FL vary by orders of magnitude in their FLOPS, memory, and energy capacity. A common strategy is to allow clients to perform multiple local updates (E &gt; 1) before each communication round <ref type="bibr">[20]</ref>, amortizing latency and increasing utilization. Yet, local updates cause drift between client and global iterates, complicating convergence analysis, especially when constraints must be satisfied globally.</p><p>Despite the importance of these challenges for real-world FL applications, no existing method provides provable guarantees that address them simultaneously.</p><p>Limitations of existing approaches. Constrained versions of FedAvg <ref type="bibr">[14]</ref> and primal-dual or AL/ADMM methods <ref type="bibr">[4,</ref><ref type="bibr">6,</ref><ref type="bibr">8,</ref><ref type="bibr">13,</ref><ref type="bibr">21,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">39]</ref> can certify feasibility but rely on dual-variable tuning, penalty scheduling, or inner projection steps. These methods also typically assume synchronous full participation and uncompressed updates, which is unrealistic in FL applications. Error-feedback compression algorithms, such as EF-SGD and Safe-EF <ref type="bibr">[2,</ref><ref type="bibr">9,</ref><ref type="bibr">17,</ref><ref type="bibr">33]</ref>, provide robustness against biased quantization but do not incorporate local update drift (E &gt; 1). Local-SGD methods <ref type="bibr">[20]</ref> directly address compute heterogeneity but are unconstrained, offering neither feasibility nor compression robustness. Finally, switching-gradient methods (SGM) <ref type="bibr">[23,</ref><ref type="bibr">27,</ref><ref type="bibr">30,</ref><ref type="bibr">36]</ref> provide a primal-only, projection-free mechanism for constrained optimization: if w t is nearly feasible, update along &#8711;f ; if not, update along &#8711;g. 
This design achieves the optimal O(&#1013;^{-2}) rate for convex, possibly non-smooth problems, and recent work <ref type="bibr">[7,</ref><ref type="bibr">15]</ref> extends these optimality guarantees to weakly convex objectives. However, all existing analyses assume centralized, synchronous, full-gradient access, which is not suitable for federated systems with compression and local updates.</p></div>
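To make the switching rule concrete, here is a minimal, runnable sketch of a centralized switching subgradient method on a toy problem (minimize f(w) = |w - 3| subject to g(w) = |w| - 1 &#8804; 0, whose optimum is w* = 1). The toy problem, step size, and helper names are our own illustrative choices, not the paper's setup.

```python
# Centralized switching subgradient method on a toy 1-D problem:
# minimize f(w) = |w - 3| subject to g(w) = |w| - 1 <= 0.
# Both functions are 1-Lipschitz; the constrained optimum is w* = 1.

def sign(v):
    return (v > 0) - (v < 0)

def sgm(w0, eta, eps, T):
    """If the iterate is nearly feasible (g(w) <= eps), take an objective
    subgradient step; otherwise take a constraint subgradient step.
    Return the average of the nearly feasible iterates."""
    feas = []
    w = w0
    for _ in range(T):
        if abs(w) - 1.0 <= eps:      # nearly feasible: step along a subgradient of f
            w -= eta * sign(w - 3.0)
        else:                        # violated: step along a subgradient of g
            w -= eta * sign(w)
        if abs(w) - 1.0 <= eps:
            feas.append(w)           # only (nearly) feasible iterates are averaged
    return sum(feas) / len(feas)

w_hat = sgm(w0=0.0, eta=0.01, eps=0.05, T=5000)
```

The averaged iterate lands close to the boundary point w* = 1, illustrating how the switching rule settles near the constraint boundary.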
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions.</head><p>&#8226; We investigate FedSGM as a unifying backbone for constrained FL by extending the primal-only philosophy of SGM. Unlike AL/ADMM-based methods, FedSGM avoids dual-variable tuning and penalty scheduling, ensuring lightweight per-round computation while certifying the feasibility of the averaged iterate.</p><p>&#8226; We analyze FedSGM under multiple local steps (E &gt; 1) by bounding the drift between local and global iterates. This yields rates of the form O(DG&#8730;(E + q u /n)/&#8730;T), where D := &#8741;w 0 -w * &#8741;, T is the total number of rounds, G is the Lipschitz constant, and q u is the uplink compression factor. The analysis recovers the canonical 1/&#8730;T rate of centralized SGM, while explicitly quantifying the impact of multi-step local updates and compression.</p><p>&#8226; We extend both hard switching, a binary choice of update direction, and soft switching <ref type="bibr">[36]</ref>, a smooth interpolation between &#8711;f and &#8711;g via a weighting function based on the magnitude of violation. Theorem 2 recovers the hard switching regime while improving stability near the feasibility boundary, without impacting the convergence rate.</p><p>Together, these results provide the first convergence guarantees for constrained FL with uplink-unbiased compression, multiple local updates, and soft switching.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem Setup and Preliminaries</head><p>Notation. We write [T ] := {0, 1, . . . , T -1} for an index set of length T , and [n] := {1, 2, . . . , n} for the set of all clients. For a set A, |A| denotes its cardinality. We use Id to denote the identity mapping, i.e., Id(x) = x for all x. For a scalar z &#8712; R, we define the positive part operator [z] + := max{0, z}. Additionally, 1 {&#8226;} denotes the standard binary indicator function.</p><p>In this work, we consider the problem defined in <ref type="formula">(1)</ref>, which is a standard constrained optimization problem in the FL setting, where f j : R d &#8594; R denotes the local objective of client j and g j : R d &#8594; R denotes its local constraint function. Here, n is the total number of clients and w &#8712; R d are the model parameters. The functions f (&#8226;) and g(&#8226;) represent the global objective and global constraint, respectively. The goal is to compute an &#1013;-solution w that satisfies f ( w) -f (w * ) &#8804; &#1013; and g( w) &#8804; &#1013;. The following assumptions are used in our analysis.</p><p>Assumption 1 Each function f j and g j is convex and G-Lipschitz continuous in R d . Consequently, f and g are also convex and G-Lipschitz.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Assumption 2 (Unbiased Compression)</head><p>Each compression operator C j is unbiased, i.e., E[C j (x)] = x, and satisfies the variance bound E&#8741;C j (x) -x&#8741; 2 &#8804; q u &#8741;x&#8741; 2 for all x &#8712; R d , where q u &#8805; 0 is the compression factor.</p></div>
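As an illustration of Assumption 2, the Rand-K compressor used later in the experiments is unbiased with q u = d/K - 1. The following sketch is our own (function names are hypothetical) and checks the unbiasedness and variance bound empirically.

```python
import numpy as np

def rand_k(x, k, rng):
    """Rand-K: keep k uniformly chosen coordinates and rescale by d/k,
    so E[C(x)] = x (unbiased). This satisfies Assumption 2 with
    q_u = d/k - 1."""
    d = x.size
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(x)
    out[idx] = x[idx] * (d / k)
    return out

# Empirical check of unbiasedness and of the variance bound
# E||C(x) - x||^2 <= q_u ||x||^2 (tight for Rand-K).
rng = np.random.default_rng(0)
d, k = 100, 10
x = rng.standard_normal(d)
samples = np.stack([rand_k(x, k, rng) for _ in range(10000)])
mean_err = np.linalg.norm(samples.mean(axis=0) - x)       # ~0 if unbiased
emp_var = np.mean(np.sum((samples - x) ** 2, axis=1))     # ~ q_u ||x||^2
q_u = d / k - 1
```

Since the bound is tight for Rand-K, the empirical second moment should match q u &#8741;x&#8741; 2 up to sampling noise.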
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">FedSGM: Federated Switching Gradient Methods</head><p>We introduce FedSGM, a projection-free and duality-free algorithm for constrained optimization in FL. FedSGM combines federated learning with the switching gradient method, focusing here on the setting with: (i) full client participation, (ii) both hard and soft switching between objective and constraint updates, (iii) multiple local updates per communication round, and (iv) uplink compression.</p><p>At round t, the server collects constraint values {g j (w t )} n j=1 and broadcasts the global average g(w t ). Each client j initializes w t j,0 = w t and performs E local steps. The update direction is</p><p>&#957; t j,&#964; = (1 -&#963; &#946; (g(w t ) -&#1013;))&#8711;f j (w t j,&#964; ) + &#963; &#946; (g(w t ) -&#1013;)&#8711;g j (w t j,&#964; ),</p><p>where &#963; &#946; is the switching rule: hard switching &#963; &#946; (z) = 1 {z&gt;0} , or soft switching &#963; &#946; (z) = min{1, [1 + &#946;z] + }. Clients then set w t j,&#964; +1 = w t j,&#964; -&#951;&#957; t j,&#964; . After E steps, each client transmits the compressed difference &#8710; t j,E = C((w t -w t j,E )/&#951;), with C &#8801; Id denoting no compression. The server updates w t+1 = w t -&#951; &#8226; (1/n) &#8721; n j=1 &#8710; t j,E .</p><p>This unified design (Alg. 1, see Appendix A) cleanly subsumes both hard and soft switching under compression.</p><p>Motivation for Soft Switching. Hard switching enforces a binary rule: each client either optimizes the objective or the constraint, depending on whether the global violation g(w t ) is within the tolerance &#1013;. While simple, this approach is highly sensitive when g(w t ) fluctuates around &#1013;, as already shown in the centralized case <ref type="bibr">[36]</ref>. Such fluctuations can trigger frequent back-and-forth updates, amplifying client drift and resulting in unstable trajectories.</p><p>Soft switching provides a remedy by introducing a smooth interpolation between the two gradients <ref type="bibr">[36]</ref>. 
Specifically, the switching weight &#963; &#946; (g(w t )-&#1013;) ensures that near the feasibility boundary, the update direction is a convex combination of &#8711;f j and &#8711;g j , rather than an abrupt choice. This smooth relaxation suppresses oscillations while preserving convergence guarantees. Our analysis demonstrates that, even under compression, soft switching attains the same order of convergence as hard switching, while empirical validation confirms its superior stability near the feasibility boundary. To the best of our knowledge, this is the first convergence guarantee establishing the effectiveness of soft switching for constrained FL with compression, thereby extending beyond prior results restricted to hard switching and single local updates <ref type="bibr">[17]</ref>. Next, we state the theorems that provide the convergence guarantees for FedSGM in Algorithm 1.</p><p>Theorem 1 (Hard switching FedSGM) Consider the problem in (1) and Algorithm 1, under Assumptions 1 and 2 (only for compression). Let D := &#8741;w 0 -w * &#8741; and define</p><p>where q u is the client-to-server compression factor. Now, set the constraint threshold and step size as</p><p>then A is nonempty, w is well-defined, and w is an &#1013;-solution of (1).</p><p>Theorem 2 (Soft switching FedSGM) Consider the problem in (1) and Algorithm 1, under Assumptions 1 and 2 (only for compression). Let D := &#8741;w 0 -w * &#8741; and define</p><p>where q u is the client-to-server compression factor. Now, set the constraint threshold and step size as</p><p>then A is nonempty, w is well-defined, and w is an &#1013;-solution of (1).</p><p>The details of the proofs are presented in Appendix A. Now, we discuss the implications of Theorems 1 and 2.</p><p>&#8226; n = 1, q u = 0, C j &#8801; Id, E = 1, i.e., centralized with no compression: In this case, we can infer from Theorems 1 and 2 that the rate we obtain is O(DG/&#8730;T), corresponding to the rates present in <ref type="bibr">[23,</ref><ref type="bibr">28,</ref><ref type="bibr">36]</ref>.</p><p>&#8226; g j &#8801; 0, &#8704;j &#8712; [n], E = 1, i.e., no constraint: In this case, our algorithm reduces to the well-known FedCOM method <ref type="bibr">[12]</ref>, where we recover the standard O(DG/&#8730;T) rate. The scaling of &#8730;E in our general bound captures the effect of client drift in this federated constrained setting, deviating from previous centralized results.</p><p>&#8226; FedSGM with uplink compression: We focus on the case where only uplink transmissions are compressed, which captures the dominant communication bottleneck in federated learning. Most prior works on uplink compression have studied restricted classes of compressors, such as absolute compression <ref type="bibr">[34]</ref> or unbiased operators <ref type="bibr">[10,</ref><ref type="bibr">11,</ref><ref type="bibr">29,</ref><ref type="bibr">35]</ref>, and largely target unconstrained or single-step settings. Recent progress on biased compressors <ref type="bibr">[3,</ref><ref type="bibr">16,</ref><ref type="bibr">31]</ref> has established convergence guarantees for unconstrained optimization, while constrained FL with compression has only been addressed under hard switching and single local updates <ref type="bibr">[17]</ref>. In this work, we analyze FedSGM in the unbiased uplink regime, extending the scope to multiple local updates and, crucially, to soft switching between objective and constraint updates. To the best of our knowledge, this is the first convergence result showing that soft switching remains effective in constrained FL under uplink-unbiased compression, thereby laying the groundwork for subsequent extensions to biased and bi-directional compression.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Numerical Experiments</head><p>We evaluate the proposed method in a federated Neyman-Pearson (NP) classification setting with constrained optimization objectives. The aim is to examine how different switching strategies behave under uplink compression, with particular focus on convergence, stability, and feasibility.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">NP Classification</head><p>We consider the constrained optimization problem in <ref type="formula">(1)</ref>, where the objective is to minimize the empirical loss on the majority class while ensuring that the loss on the minority class remains below a prescribed tolerance. For each client j, the local objective and constraint are defined as</p><p>where D j,0 and D j,1 denote the local samples of class 0 and class 1, respectively, and m j0 , m j1 are their cardinalities. The function &#981; is the binary logistic loss.</p><p>This captures the NP paradigm: f (w) enforces performance on the majority class, while the constraint g(w) &#8804; &#1013; ensures the minority class loss does not exceed the tolerance. We use the breast cancer dataset <ref type="bibr">[37]</ref>, containing 569 samples with 30 features. To simulate the federated setting, the data is split in an IID fashion among n = 10 clients, such that each client receives an equal number of samples and the same class ratio. Each client performs local gradient descent updates for E = 5 epochs before communication, and the global model is updated for T = 100 communication rounds. We compare both hard and soft switching using a tolerance &#1013; = 0.1.</p><p>For communication efficiency, we evaluate the Rand-K <ref type="bibr">[12,</ref><ref type="bibr">32]</ref> compressor, which transmits a fraction K/d of coordinates with unbiased scaling. Performance is measured in terms of the majority-class objective loss f (w), the minority-class constraint loss g(w), and the number of rounds in which the feasibility condition g(w) &#8804; &#1013; is violated. Each experiment is repeated with three random seeds, and we report mean trajectories with variance bands. From Figure <ref type="figure">1</ref>, we observe that both hard and soft switching achieve convergence of the majority-class objective f (w). 
However, soft switching exhibits markedly improved stability in the constraint g(w): while hard switching frequently oscillates around the tolerance, soft switching maintains feasibility more consistently. This translates into a substantial reduction in the number of violations, with soft switching yielding roughly 4&#215; fewer violations than hard switching. Notably, this improvement in feasibility is achieved without compromising the convergence behavior of the objective loss; in fact, it yields more stable results, as evident from the variance bands, highlighting the advantage of soft switching in balancing accuracy and constraint satisfaction under communication compression.</p></div>
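The per-client NP losses above can be sketched as follows, assuming &#177;1 label encoding and &#981;(z) = log(1 + e^{-z}); the sign convention and helper names are our own assumptions, not taken from the paper.

```python
import numpy as np

def logistic_loss(w, X, y):
    """Mean binary logistic loss phi(y_i * <x_i, w>) with
    phi(z) = log(1 + exp(-z)) and labels y_i in {-1, +1}."""
    margins = y * (X @ w)
    return float(np.mean(np.log1p(np.exp(-margins))))

def f_j(w, X0):
    """Majority-class (class 0, encoded as -1) empirical loss on client j."""
    return logistic_loss(w, X0, -np.ones(len(X0)))

def g_j(w, X1):
    """Minority-class (class 1, encoded as +1) empirical loss on client j;
    the NP constraint requires g_j's global average to stay below eps."""
    return logistic_loss(w, X1, np.ones(len(X1)))
```

At w = 0 every margin is zero, so both losses evaluate to log 2, which gives a quick sanity check of the implementation.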
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusion</head><p>FedSGM provides a unified algorithmic framework for constrained federated learning by integrating projection-free and duality-free switching-gradient methods with multi-step local updates and uplink compression. Our analysis provides the first convergence guarantees in this regime, yielding rates with explicit dependence on the compression factor, and recovers classical centralized and FedCOM bounds as special cases. Beyond hard switching, we demonstrated that soft switching recovers the hard regime when &#946; &#8805; 2/&#1013;, but is empirically more stable near the feasibility boundary, suppressing oscillations without altering the rate. Overall, FedSGM robustly balances feasibility, client drift, and communication efficiency, and lays the foundation for extensions to biased/bidirectional compression and partial participation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix A. Appendix</head><p>Algorithm 1: FedSGM(T, E, C j , &#951;, &#946;) Input: Number of rounds T ; local updates E; compression operator C j (with C j = Id denoting no compression); learning rate &#951;; initial model w 0 ; switching rule &#963; &#946; (&#8226;) (hard or soft) for t &#8592; 0 to T -1 do foreach client j &#8712; [n] // in parallel do send g j (w t ) to server // cheap single float communication end compute g(w t ) &#8592; 1 n n j=1 g j (w t ) and broadcast; foreach client j &#8712; [n] // in parallel do set w t j,0 &#8592; w t ; for &#964; &#8592; 0 to E -1 do compute switching weight &#945; t &#8592; &#963; &#946; (g(w t ) -&#1013;); update direction &#957; t j,&#964; &#8592; (1 -&#945; t )&#8711;f j (w t j,&#964; ) + &#945; t &#8711;g j (w t j,&#964; ); update w t j,&#964; +1 &#8592; w t j,&#964; -&#951;&#957; t j,&#964; ; end send &#8710; t j,E &#8592; C j wt-w t j,E &#951; to server; end server computes w t+1 &#8592; w t -&#951; &#8226; 1 n n j=1 &#8710; t j,E and broadcasts; end A.1. Lemmas Lemma 3 (Bound on the Expected Norm of Compressed Aggregates) Under Assumptions 1 and 2, for all rounds t &#8712; [T ], the following bound holds,</p><p>Proof Let X t j := E-1 &#964; =0 &#957; t j,&#964; . We aim to bound:</p><p>Using the identity for the second moment:</p><p>we apply this to our setup:</p><p>where the inequality uses independence across j and linearity of expectation. Applying Assumption 2 (variance bound of the compression operator), E C j &#8741;C j (X t j ) -X t j &#8741; 2 &#8804; q u &#8741;X t j &#8741; 2 , we get,</p><p>Now substitute X t j = E-1 &#964; =0 &#957; t j,&#964; and apply Jensen's inequality again,</p><p>Hence,</p></div>
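A runnable sketch of Algorithm 1 under full participation might look as follows. The oracle interfaces and the toy quadratic test problem are our own illustrative assumptions (the quadratic is only locally Lipschitz, so it serves as a demonstration rather than an instance of Assumption 1).

```python
import numpy as np

def fedsgm(grads_f, grads_g, g_vals, w0, T, E, eta, eps, beta=None,
           compress=lambda x: x):
    """Sketch of Algorithm 1 (FedSGM), full participation.
    grads_f / grads_g: per-client gradient oracles for f_j and g_j;
    g_vals: per-client constraint evaluations g_j(w);
    beta=None selects hard switching, a float beta selects soft switching;
    `compress` models the uplink compressor C_j (identity = none)."""
    n = len(grads_f)
    w = np.asarray(w0, dtype=float).copy()
    for _ in range(T):
        # server aggregates constraint values, broadcasts the average
        z = np.mean([g_vals[j](w) for j in range(n)]) - eps
        if beta is None:
            alpha = 1.0 if z > 0 else 0.0                # hard switching
        else:
            alpha = min(1.0, max(0.0, 1.0 + beta * z))   # soft switching
        deltas = []
        for j in range(n):
            wj = w.copy()
            for _ in range(E):   # E local switching-gradient steps
                nu = (1 - alpha) * grads_f[j](wj) + alpha * grads_g[j](wj)
                wj = wj - eta * nu
            deltas.append(compress((w - wj) / eta))      # compressed uplink
        w = w - eta * np.mean(deltas, axis=0)            # server update
    return w
```

On a toy two-client quadratic (f_j(w) = 0.5&#8741;w - a_j&#8741;&#178;, g_j(w) = 0.5&#8741;w&#8741;&#178; - 0.5, constraint active at the optimum), both hard switching (beta=None) and soft switching (e.g., beta = 2/eps-style values) drive the iterate close to the constraint boundary while decreasing the objective.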
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 4 (Inner Product Bound under Compression and Switching)</head><p>Under Assumptions 1 and 2, for all rounds t &#8712; [T ], the following bound holds,</p><p>Proof We now analyze the cross term in the squared distance recursion,</p><p>We handle the second term by applying the convexity of f j or g j . For t &#8712; A, where updates use &#8711;f j , we apply f j (w * ) &#8805; f j (w t j,&#964; ) + &#10216;&#8711;f j (w t j,&#964; ), w * -w t j,&#964; &#10217;, which implies</p><p>-&#10216;w t j,&#964; -w * , &#8711;f j (w t j,&#964; )&#10217; &#8804; f j (w * ) -f j (w t j,&#964; ). Thus,</p><p>Term B = -2&#951;&#10216;w t j,&#964; -w * , &#8711;f j (w t j,&#964; )&#10217; &#8804; 2&#951;(f j (w * ) -f j (w t j,&#964; )).</p><p>A similar argument holds for t &#8712; B with &#8711;g j , resulting in Term B &#8804; 2&#951;(g j (w * ) -g j (w t j,&#964; )). Again, while upper bounding Term A, we need to deal with two cases depending on whether t &#8712; A or t &#8712; B. Firstly, we start with the case where t &#8712; A, for any &#945; &gt; 0</p><p>Similarly, for t &#8712; B, we get</p><p>Substituting the bounds for both Term A and Term B back into the original expectation furnishes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 5 (Global-local iterate bound) Under Assumption 1, for all rounds t &#8712; [T ], the following bound holds,</head><p>Therefore,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Main Theorem FedSGM-Unbiased Compression</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.1. Hard Switching -Full Participation</head><p>Theorem 6 (FedSGCM) Consider the problem in eq. ( <ref type="formula">1</ref>) and Algorithm 1, under Assumptions 1 and 2. Define D := &#8741;w 0 -w * &#8741; and</p><p>, where &#915; = 2E 2 + Eq u 2n it holds that A is nonempty, w is well-defined, and w is an &#1013;-solution for P .</p><p>Proof Using Algorithm 1, the update rule for the global model is</p><p>where we define the compressed update using a stochastic compression operator C j (&#8226;), and also</p><p>We analyze the squared distance to the optimal point w * as follows,</p><p>Taking the expectation with respect to the compression operator C j and using Lemmas 3 and 4, we get</p><p>Now our goal is to handle the term f j (w * ) f j (w t j,&#964; ), so we first rewrite, f j (w t j,&#964; ) &#8805; f j (w t ) + &#10216;&#8711;f j (w t ), w t j,&#964; -w t &#10217; (by convexity)</p><p>Using Young's inequality with parameter &#945; &gt; 0, we get</p><p>Thus, we get,</p><p>Similarly, we can handle the other term with g and get,</p><p>So, putting these inequalities back in eq. ( <ref type="formula">5</ref>) along with the use of Lemma 5 and eq. ( <ref type="formula">1</ref>), we get</p><p>Now, for &#945; = &#951;, and rearranging the terms we get</p><p>Defining D := &#8741;w 0 -w * &#8741; and summing the expression above for t = 0, 1, &#8226; &#8226; &#8226; , T -1 and then dividing by T , we get</p><p>Note that when &#1013; is sufficiently large, A is nonempty. Assuming an empty A, we can find the largest "bad" &#1013;:</p><p>for some N &#8805; 0. With this choice, A is guaranteed to be nonempty. Now, we consider two cases. 
Either &#8721; t&#8712;A (f (w t ) -f (w * )) &#8804; 0, which implies by the convexity of f and g for w = (1/|A|) &#8721; t&#8712;A w t we have</p><p>Otherwise, if &#8721; t&#8712;A (f (w t ) -f (w * )) &gt; 0, then</p><p>By rearranging,</p><p>implying f ( w) -f (w * ) &lt; &#1013; and further, by convexity of g for w = (1/|A|) &#8721; t&#8712;A w t , we also have g( w) &#8804; &#1013;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2.2. Soft Switching -Full Participation</head><p>Theorem 7 (FedSSGCM) Consider the problem in (1) and Algorithm 1, under Assumption 1. Define D := &#8741;w 0 -w * &#8741; and</p><p>where</p><p>Then, if</p><p>, and &#946; = 2 &#1013; where &#915; = 2E 2 + Eq 2n it holds that A is nonempty, w is well-defined, and w is an &#1013;-solution for P .</p><p>Proof Using Algorithm 1, the update rule for the global model is</p><p>where,</p><p>We analyze the squared distance to the optimal point w * as follows,</p><p>Firstly, we start with upper-bounding T erm -A. Taking the expectation with respect to the compression operator C j and using Lemmas 3 and the fact that &#963; &#946; (&#8226;) &#8712; [0, 1], we have</p><p>Now we aim to upper bound T erm -B. First, we take the expectation with respect to C j and</p><p>Now we define &#963; t &#946; = &#963; &#946; (g(w t ) -&#1013;) before we start upper bounding T erm -B 1 for any &#945; &gt; 0.</p><p>Similarly, we start to upper bound T erm -B 2 as well.</p><p>Putting T erm -B 1 and T erm -B 2 back in T erm -B, we get</p><p>Substituting T erm -A and T erm -B back in the original expression after taking the expectation w.r.t C j , we get for &#945; = &#951;</p><p>Note that for all t &#8712; B it holds that &#963; &#946; (g(w t ) -&#1013;) = 1 and g(w t ) -g(w * ) &#8805; &#1013;. Further, for all t &#8712; A if &#963; &#946; (g(w t ) -&#1013;) &#8805; 0 it holds that g(w t ) -g(w * ) &#8805; g(w t ) &#8805; &#1013; -1/&#946;. With these observations, using convexity of f and g and decomposing the sum over t according to the definitions of A and B and division by T yields,</p><p>(10) Similar to the previous proofs, we first need to find the smallest &#1013; to ensure A is non-empty. So, to find a lower bound on &#1013;, assume A is empty in eq. ( <ref type="formula">10</ref>) and observe that as long as condition</p><p>ET . 
Now, like before, we consider two cases based on the sign of &#8721; t&#8712;A (1 -&#963; t &#946; )(f (w t ) -f (w * )). As before, when the sum is non-positive we are done by the definition of A, which implies 0 &lt; 1 -&#963; &#946; (g(w t ) -&#1013;) &#8804; 1 and the convexity of f and g.</p><p>Assuming the sum is positive, dividing eq. ( <ref type="formula">10</ref>) by &#8721; t&#8712;A (1 -&#963; t &#946; ) (which by the definition of A is strictly positive), using convexity of f , and the definition of w, we have</p><p>where we used |B| = T -|A|.</p><p>Let us now find a lower bound on &#946; to ensure the second term in the bound is non-positive. Note this is done for simplicity, and as long as the second term is O(&#1013;), an &#1013;-solution can be found. Immediate calculations show the second term in the bound is non-positive when</p><p>Since &#8721; t&#8712;A &#963; t &#946; &lt; T , a sufficient (and highly conservative) condition for all T &#8805; 1 is to set &#946; &#8805; 2/&#1013;. Thus, we proved the suboptimality gap result. The feasibility result is immediate given the definition of A and the convexity of g. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3. Main Theorem FedSGM</head><p>Then, if</p><p>, and &#951; = D 2 4G 2 E 3 T it holds that A is nonempty, w is well-defined, and w is an &#1013;-solution for P .</p><p>Proof Using Algorithm 1, the update rule for the global model is</p><p>We analyze the squared distance to the optimal point w * as follows,</p><p>Firstly, we start with upper-bounding T erm -A.</p><p>T erm</p><p>Now we can use Lemma 4 to bound T erm -B. So, putting T erm -A and T erm -B back into the expression we get</p><p>Now our goal is to handle the term f j (w * ) -f j (w t j,&#964; ), so we first rewrite, f j (w t j,&#964; ) &#8805; f j (w t ) + &#10216;&#8711;f j (w t ), w t j,&#964; -w t &#10217; (by convexity)</p><p>Using Young's inequality with parameter &#945; &gt; 0, we get -&#10216;&#8711;f j (w t ), w t j,&#964; -w t &#10217; &#8804;</p><p>Thus, we get,</p><p>Similarly, we can handle the other term with g and get,</p><p>So, putting these inequalities back in eq. ( <ref type="formula">12</ref>) along with the use of Lemma 5 and eq. ( <ref type="formula">1</ref>), we get</p><p>Now, for &#945; = &#951;, and rearranging the terms we get</p><p>Defining D := &#8741;w 0 -w * &#8741; and summing the expression above for t = 0, 1, &#8226; &#8226; &#8226; , T -1 and then dividing by T , we get</p><p>Note that when &#1013; is sufficiently large, A is nonempty. Assuming an empty A, we can find the largest "bad" &#1013;:</p><p>for some N &#8805; 0. With this choice, A is guaranteed to be nonempty. Now, we consider two cases. Either t&#8712;A f (w t ) -f (w * ) &#8804; 0 which implies by the convexity of f and g for w = 1 |A| t&#8712;A w t we have</p><p>Otherwise, if t&#8712;A f (w t ) -f (w * ) &gt; 0, then</p><p>By rearranging</p><p>Implying f ( w) -f (w * ) &lt; &#1013; and further by convexity of g for w = 1 |A| t&#8712;A w t , we also have g( w) &#8804; &#1013;. where</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3.2. Soft Switching -Full participation</head><p>Then, if</p><p>, and &#946; = 2 &#1013; it holds that A is nonempty, w is well-defined, and w is an &#1013;-solution for P .</p><p>Proof Using Algorithm 1, the update rule for the global model is</p><p>where, &#957; t j,&#964; = &#963; &#946; (g(w t ) -&#1013;)&#8711;g j (w t j,&#964; ) + (1 -&#963; &#946; (g(w t ) -&#1013;))&#8711;f j (w t j,&#964; )</p><p>We analyze the squared distance to the optimal point w * as follows,</p><p>Firstly, we start with upper-bounding T erm -A.</p><p>T erm</p><p>Now we aim to upper bound T erm -B.</p><p>-2&#951; w t j,&#964; -w * , &#957; t j,&#964; )</p><p>T erm-B 2</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#63737; &#63738; &#63739;</head><p>Now we define &#963; t &#946; = &#963; &#946; (g(w t ) -&#1013;) before we start upper bounding T erm -B 1 for any &#945; &gt; 0. T erm -B 1 = -2&#951; w t -w t j,&#964; , &#957; t j,&#964; = -2&#951;&#963; t &#946; w t -w t j,&#964; , g j (w t j,&#964; ) -2&#951;(1 -&#963; t &#946; ) w t -w t j,&#964; , f j (w</p><p>t j,&#964; ) Y oung &#8242; s &#8804; &#963; t &#946; &#951; &#945; &#8741;w t -w t j,&#964; &#8741; 2 + &#951;&#945;&#8741;&#8711;g j (w t j,&#964; )&#8741; 2 + (1 -&#963; t &#946; ) &#951; &#945; &#8741;w t -w t j,&#964; &#8741; 2 + &#951;&#945;&#8741;&#8711;f j (w t j,&#964; )&#8741; 2 G-Lip &#8804; &#951; &#945; &#8741;w t -w t j,&#964; &#8741; 2 + &#951;&#945;G 2</p><p>Similarly, we start to upper bound T erm -B 2 as well.</p><p>T erm -B 2 = -2&#951; w t j,&#964; -w * , &#957; t j,&#964; ) = -2&#951;&#963; t &#946; w t j,&#964; -w * , g j (w t j,&#964; ) -2&#951;(1 -&#963; t &#946; ) w t j,&#964; -w * , f j (w t j,&#964; )</p><p>Cvx.</p><p>&#8804; 2&#951;&#963; t &#946; g j (w * ) -g j (w t j,&#964; ) + 2&#951;(1 -&#963; t &#946; ) f j (w * ) -f j (w t j,&#964; ) &#8804;2&#951;&#963; t &#946; g j (w * ) -g j (w t ) +</p><p>1 2&#945; &#8741;w t -w t j,&#964; &#8741; 2 + &#945; 2 G 2 (by Cvx., Young's, &amp; G-Lip) + 2&#951;(1 -&#963; t &#946; ) f j (w * ) -f j (w t ) + 1 2&#945; &#8741;w t -w t j,&#964; &#8741; 2 + &#945; 2 G 2 Putting T erm -B 1 and T erm -B 2 back in T erm -B, we get T erm -B &#8804; 1 n n j=1 E-1 &#964; =0 2&#951; &#945; &#8741;w t -w t j,&#964; &#8741; 2 + 2&#951;&#945;G 2 + 2&#951;&#963; t &#946; (g j (w * ) -g j (w t )) +2&#951;(1 -&#963; t &#946; ) (f j (w * ) -f j (w t )) Lemma 5 &#8804; 2 3&#945; &#951; 3 E 3 G 2 + 2&#951;&#945;EG 2 + 2&#951;E&#963; t &#946; (g(w * ) -g(w t )) + 2&#951;E(1 -&#963; t &#946; ) (f (w * ) -f (w t )) Substituting T erm -A and T erm -B back in eq. 
(17), we get for &#945; = &#951; &#8741;w t+1 -w * &#8741; 2 &#8804; &#8741;w t -w * &#8741; 2 + &#951; 2 E 2 G 2 + 2&#951; 2 EG 2 + 2 3 &#951; 2 E 3 G 2 + 2&#951;E&#963; t &#946; (g(w * ) -g(w t )) + 2&#951;E(1 -&#963; t &#946; ) (f (w * ) -f (w t )) Let A = {t &#8712; [T ]|g(w t ) &lt; &#1013;} and B = [T ]\A = {t &#8712; [T ]|g(w t ) &#8805; &#1013;}. Note that for all t &#8712; B it holds that &#963; &#946; (g(w t ) -&#1013;) = 1 and g(w t ) -g(w * ) &#8805; &#1013;. Further, for all t &#8712; A if &#963; &#946; (g(w t ) -&#1013;) &#8805; 0 it holds that g(w t ) -g(w * ) &#8805; g(w t ) &#8805; &#1013; -1/&#946;. With these observations, using convexity of f and g and decomposing the sum over t according to the definitions of A and B and division by T yields, D 2 2&#951;E + 1 2 &#951;EG 2 T + &#951;G 2 T + 1 3 &#951;E 2 G 2 T &#8805; t&#8712;A &#963; t &#946; (g(w t ) -g(w * )) + t&#8712;A (1 -&#963; t &#946; ) (f (w t ) -f (w * )) + t&#8712;B (g(w t ) -g(w * )) . Now choosing &#951; = D 2 2G 2 ET &#915; , where &#915; = 1 2 E + 1 + 1 3 E 2 , we get 2D 2 G 2 T &#915; E &#8805; t&#8712;A &#963; t &#946; (g(w t ) -g(w * )) + t&#8712;A (1 -&#963; t &#946; ) (f (w t ) -f (w * )) + t&#8712;B (g(w t ) -g(w * )) &#8805; t&#8712;A (1 -&#963; t &#946; ) (f (w t ) -f (w * )) + &#1013;|B| + &#1013; -1 &#946; t&#8712;A &#963; t &#946; .</p><p>(18) Similar to the previous proofs, we first need to find the smallest &#1013; to ensure A is non-empty. So, to find a lower bound on &#1013;, assume A is empty in eq. ( <ref type="formula">18</ref>) and observe that as long as condition</p><p>&lt; &#1013; is met, A is non-empty. We choose to set &#1013; = 2 2D 2 G 2 &#915; ET . Now, like before, we consider two cases based on the sign of t&#8712;A 1 -&#963; t &#946; f (w t ) -f (w * ) . 
As before, when the sum is non-positive we are done by the definition of A, which implies 0 &lt; 1 -&#963; &#946; (g(w t ) -&#1013;) &#8804; 1 and the convexity of f and g.</p><p>Assuming the sum is positive, dividing eq. ( <ref type="formula">18</ref>) by t&#8712;A 1 -&#963; t &#946; (which by the definition of A is strictly positive), using convexity of f , and the definition of w, we have f ( w) -f (w * ) &#8804; 0.5&#1013;T -&#1013;|B| -(&#1013; -</p><p>1 &#946; ) t&#8712;A &#963; t &#946; |A| -t&#8712;A &#963; &#946; (g(w t ) -&#1013;) = &#1013; + -0.5&#1013;T + &#946; -1 t&#8712;A &#963; t &#946; |A| -t&#8712;A &#963; t &#946; ,</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>&#169; A. Upadhyay, S.B. Moon &amp; A. Hashemi.</p></note>
		</body>
		</text>
</TEI>
