<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Federated Natural Policy Gradient and Actor Critic Methods for Multi-task Reinforcement Learning</title></titleStmt>
			<publicationStmt>
				<publisher>38th Conference on Neural Information Processing Systems</publisher>
				<date>12/10/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10593003</idno>
					<idno type="doi"></idno>
					
					<author>Tong Yang</author><author>Shicong Cen</author><author>Yuting Wei</author><author>Yuxin Chen</author><author>Yuejie Chi</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Federated reinforcement learning (RL) enables collaborative decision making of multiple distributed agents without sharing local data trajectories. In this work, we consider a multi-task setting, in which each agent has its own private reward function corresponding to different tasks, while sharing the same transition kernel of the environment. Focusing on infinite-horizon Markov decision processes, the goal is to learn a globally optimal policy that maximizes the sum of the discounted total rewards of all the agents in a decentralized manner, where each agent only communicates with its neighbors over some prescribed graph topology. We develop federated vanilla and entropy-regularized natural policy gradient (NPG) methods in the tabular setting under softmax parameterization, where gradient tracking is applied to estimate the global Q-function to mitigate the impact of imperfect information sharing. We establish non-asymptotic global convergence guarantees under exact policy evaluation, where the rates are nearly independent of the size of the state-action space and illuminate the impacts of network size and connectivity, and further establish its robustness against inexact policy evaluation. We further propose a federated natural actor critic (NAC) method for multi-task RL with function approximation and stochastic policy evaluation, and establish its finitetime sample complexity taking the errors of function approximation into account. To the best of our knowledge, this is the first time that near dimension-free global convergence is established for federated multi-task RL using policy optimization.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Federated reinforcement learning (FRL) is an emerging paradigm that combines the advantages of federated learning (FL) and reinforcement learning (RL) [QZLZ21, ZFL + 19], allowing multiple agents to learn a shared policy from local experiences, without exposing their private data to a central server nor other agents. FRL is poised to enable collaborative and efficient decision making in scenarios where data is distributed, heterogeneous, and sensitive, which arise frequently in applications such as edge computing, smart cities, and healthcare [WHM + 23, WKNL20, ZFL + 19], to name just a few. As has been observed [LZZ + 17], decentralized training can lead to performance improvements in FL by avoiding communication congestions at busy nodes such as the server, especially under high-latency scenarios. This motivates us to design algorithms for the fully decentralized setting, a scenario where the agents can only communicate with their local neighbors over a prescribed network topology. <ref type="foot">6</ref>In this work, we study the problem of federated multi-task RL [AR21, QZLZ21, YLS + 20], where each agent collects its own reward -possibly unknown to other agents -corresponding to the local task at hand, while having access to the same dynamics (i.e., transition kernel) of the environment. The collective goal is to learn a shared policy that maximizes the total rewards accumulated from all the agents; in other words, one seeks a policy that performs well in terms of overall benefits, rather than biasing towards any individual task, achieving the Pareto frontier in a multi-objective context. There is no shortage of application scenarios where federated multi-task RL becomes highly relevant. For instance, in healthcare [ZBW + 20], different hospitals may be interested in finding an optimal treatment for all patients without disclosing private data, where the effectiveness of the treatment can vary across different hospitals due to demographical differences. See Appendix B.1 for more application scenarios of our setting.</p><p>Nonetheless, despite the promise, provably efficient algorithms for federated multi-task RL remain substantially under-explored, especially in the fully decentralized setting. The heterogeneity of local tasks leads to a higher degree of disagreements between the global value function and local value functions of individual agents. Due to the lack of global information sharing, care needs to be taken to judiciously balance the use of neighboring information (to facilitate consensus) and local data (to facilitate learning) when updating the policy. To the best of our knowledge, very few algorithms are currently available to find the global optimal policy with non-asymptotic convergence guarantees even for tabular infinite-horizon Markov decision processes.</p><p>Motivated by the connection with decentralized optimization, it is tempting to take a policy optimization perspective to tackle this challenge. Policy gradient (PG) methods, which seek to learn the policy of interest via first-order optimization methods, play an eminent role in RL due to their simplicity and scalability. In particular, natural policy gradient (NPG) methods <ref type="bibr">[Ama98,</ref><ref type="bibr">Kak01]</ref> are among the most popular variants of PG methods, underpinning default methods used in practice such as trust region policy optimization (TRPO) [SLA + 15] and proximal policy optimization (PPO) [SWD + 17]. On the theoretical side, it has also been established recently that the NPG method enjoys fast global convergence to the optimal policy in an almost dimension-free manner <ref type="bibr">[AKLM21,</ref><ref type="bibr">CWC21]</ref>, where the iteration complexity is nearly independent of the size of the state-action space. These benefits can be translated to their sample-based counterparts such as the natural actor critic (NAC) method <ref type="bibr">[BSGL09,</ref><ref type="bibr">XWL20,</ref><ref type="bibr">KDRM22]</ref>, where the policies are evaluated via stochastic samples. It is natural to ask:</p><p>Can we develop federated NPG and NAC methods with non-asymptotic global convergence guarantees for multi-task RL in the fully decentralized setting?</p><p>1.1 Our contributions Focusing on infinite-horizon Markov decision processes (MDPs), we provide an affirmative answer to the above question, by developing federated NPG (FedNPG) methods for solving both the vanilla and entropy-regularized multi-task RL problems with finite-time global convergence guarantees. While entropy regularization is often incorporated as an effective strategy to encourage exploration during policy learning, solving the entropy-regularized RL problem is of interest in its own right, as the optimal regularized policy possesses desirable robust properties with respect to reward perturbations <ref type="bibr">[EL21,</ref><ref type="bibr">MP95]</ref>. Due to the multiplicative update nature of NPG methods under softmax parameterization, it is more convenient to work with the logarithms of local policies in the decentralized setting.</p><p>In each iteration of the proposed FedNPG method, the logarithms of local policies are updated by a weighted linear combination of two terms (up to normalization): a gossip mixing <ref type="bibr">[NO09]</ref> of the logarithms of neighboring local policies, and a local estimate of the global Q-function tracked via the technique of dynamic average consensus <ref type="bibr">[ZM10]</ref>, a prevalent idea in decentralized optimization that allows for the use of large constant learning rates [DLS16, NOS17, QL17] to accelerate convergence. We further develop sample-efficient federated NAC (FedNAC) methods that allow for both stochastic policy evaluation and function approximation. Our contributions are as follows.</p><p>&#8226; We propose FedNPG methods for both the vanilla and entropy-regularized multi-task RL problems, where each agent only communicates with its neighbors and performs local computation using its own reward or task information.</p><p>setting algorithms iteration complexity optimality criteria unregularized NPG <ref type="bibr">[AKLM21]</ref> O</p><p>Table <ref type="table">1</ref>: Iteration complexities of NPG and FedNPG (ours) methods to reach &#949;-accuracy of the vanilla and entropy-regularized problems, where we assume exact gradient evaluation, and only keep the dominant terms w.r.t. &#949;. The policy estimates in the t-iteration are &#960; (t) and &#960;(t) for NPG and FedNPG, respectively, where T is the number of iterations. Here, N is the number of agents, &#964; &#8804; 1 is the regularization parameter, &#963; &#8712; [0, 1] is the spectral radius of the network, &#947; &#8712; [0, 1) is the discount factor, |A| is the size of the action space, and &#951; &gt; 0 is the learning rate. The iteration complexities of FedNPG reduce to their centralized counterparts when &#963; = 0. For vanilla FedNPG, the learning rate</p><p>; for entropy-regularized FedNPG, the learning rate satisfies 0 &lt; &#951; &lt; &#951; 0 = O (1-&#947;) 7 (1-&#963;) 2 &#964; &#963;N .</p><p>&#8226; Assuming access to exact policy evaluation, we establish that the average iterate of vanilla FedNPG converges globally at a rate of O(1/T 2/3 ) in terms of the sub-optimality gap for the multi-task RL problem, and that the last iterate of entropy-regularized FedNPG converges globally at a linear rate to the regularized optimal policy. Our convergence theory highlights the impacts of all salient problem parameters (see Table <ref type="table">1</ref> for details), such as the size and connectivity of the communication network. In particular, the iteration complexities of FedNPG are again almost independent of the size of the state-action space, which recover prior results on the centralized NPG methods when the network is fully connected. &#8226; We further demonstrate the stability of the proposed FedNPG methods when policy evaluations are only available in an inexact manner. To be specific, we prove that their convergence rates remain unchanged as long as the approximation errors are sufficiently small in the &#8467; &#8734; sense. &#8226; We go beyond the tabular setting and black-box policy evaluation by proposing FedNAC-a federated actor critic method for multi-task RL with function approximation and stochastic policy evaluation -and establish a finite-sample sample complexity on the order of O(1/&#949; 7/2 ) for each agent in terms of the expected sub-optimality gap for the fully decentralized setting. To the best of our knowledge, the proposed federated NPG and NAC methods are the first policy optimization methods for multi-task RL that achieve near dimension-free global convergence guarantees in terms of iteration and sample complexities, allowing for fully decentralized communication without any need to share local reward/task information. We conduct numerical experiments in a multi-task GridWorld environment to corroborate the efficacy of the proposed methods (see Appendix H). We defer the readers to Appendix A for more related work, and Appendix B.2 for additional discussions on our theoretical contributions.</p><p>Notation. Boldface small and capital letters denote vectors and matrices, respectively. Sets are denoted with curly capital letters, e.g., S, A. We let (R d , &#8741;&#8226;&#8741;) denote the d-dimensional real coordinate space equipped with norm &#8741;&#8226;&#8741;. The &#8467; p -norm of v is denoted by &#8741;v&#8741; p , where 1 &#8804; p &#8804; &#8734;, and the spectral norm and the Frobenius norm of a matrix M are denoted by &#8741;M &#8741; 2 and &#8741;M &#8741; F , resp. We let [N ] denote {1, . . . , N }, use 1 N to represent the all-one vector of length N , and denote by 0 a vector or a matrix consisting of all 0's. We allow the application of functions such as log(&#8226;) and exp(&#8226;) to vectors or matrices, with the understanding that they are applied in an element-wise manner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Model and backgrounds</head><p>Markov decision processes. We consider an infinite-horizon discounted Markov decision process (MDP) denoted by M = (S, A, P, r, &#947;), where S and A denote the state space and the action space, respectively, &#947; &#8712; [0, 1) indicates the discount factor, P : S &#215; A &#8594; &#8710;(S) is the transition kernel, and r : S &#215; A &#8594; [0, 1] stands for the reward function. To be more specific, for each state-action pair (s, a) &#8712; S &#215; A and any state s &#8242; &#8712; S, we denote by P (s &#8242; |s, a) the transition probability from state s to state s &#8242; when action a is taken, and r(s, a) the instantaneous reward received in state s when action a is taken. Furthermore, a policy &#960; : S &#8594; &#8710;(A) specifies an action selection rule, where &#960;(a|s) specifies the probability of taking action a in state s for each (s, a) &#8712; S &#215; A.</p><p>For any given policy &#960;, we denote by V &#960; : S &#8594; R the corresponding value function, which is the expected discounted cumulative reward with an initial state s 0 = s, given by &#8704;s &#8712; S : V &#960; (s) := E &#8734; t=0 &#947; t r(s t , a t )|s 0 = s ,</p><p>where the randomness is over the trajectory generated following the policy a t &#8764; &#960;(&#8226;|s t ) and the MDP dynamic s t+1 &#8764; P (&#8226;|s t , a t ). We also overload the notation V &#960; (&#961;) to indicate the expected value function of policy &#960; when the initial state follows a distribution &#961; over S, namely, V &#960; (&#961;) := E s&#8764;&#961; [V &#960; (s)]. Similarly, the Q-function Q &#960; : S &#215; A &#8594; R of policy &#960; is defined by</p><p>for all (s, a) &#8712; S &#215;A, which measures the expected discounted cumulative reward with an initial state s 0 = s and an initial action a 0 = a, with expectation taken over the randomness of the trajectory. The optimal policy &#960; &#8902; refers to the policy that maximizes the value function V &#960; (s) for all states s &#8712; S, which is guaranteed to exist <ref type="bibr">[Put14]</ref>. The corresponding optimal value function and Q-function are denoted as V &#8902; and Q &#8902; , respectively.</p><p>Entropy-regularized RL. Entropy regularization [WP91, ALRNS19] is a popular technique in practice that encourages stochasticity of the policy to promote exploration, as well as robustness against reward uncertainties. Mathematically, this can be viewed as adjusting the instantaneous reward based the current policy in use as</p><p>where &#964; &#8805; 0 denotes the regularization parameter. Typically, &#964; should not be too large to outweigh the actual rewards; for ease of presentation, we assume &#964; &#8804; min 1, 1 log |A| <ref type="bibr">[CCDX22]</ref>. Equivalently, this amounts to the entropy-regularized (also known as "soft") value function, defined as</p><p>where</p><p>Analogously, for all (s, a) &#8712; S &#215; A, the regularized (or soft) Q-function Q &#960; &#964; of policy &#960; is related to the soft value function V &#960; &#964; (s) as</p><p>The optimal regularized policy, the optimal regularized value function, and the Q-function are denoted by &#960; &#8902; &#964; , V &#8902; &#964; , and Q &#8902; &#964; , respectively. Natural policy gradient methods. Natural policy gradient (NPG) methods lie at the heart of policy optimization, serving as the backbone of popular heuristics such as TRPO [SLA + 15] and PPO [SWD + 17]. Instead of directly optimizing the policy over the probability simplex, one often adopts the softmax parameterization, which parameterizes the policy as &#960; &#952; := softmax(&#952;) or &#960; &#952; (a|s) := exp &#952;(s, a)</p><p>for any &#952;: S &#215; A &#8594; R and (s, a) &#8712; S &#215; A.</p><p>In the tabular setting, the update rule of vanilla NPG at the t-th iteration can be concisely represented as</p><p>Turning to the regularized problem, we note that the update rule of entropy-regularized NPG becomes</p><p>where &#951; &#8712; (0, 1-&#947; &#964; ] is the learning rate, and</p><p>&#964; is the soft Q-function of policy &#960; (t) .</p><p>3 Federated NPG methods for multi-task RL</p><p>In this paper, we consider the federated multi-task RL setting, where a set of agents learn collaboratively a single policy that maximizes its average performance over all the tasks using only local computation and communication.</p><p>Multi-task RL. Each agent n &#8712; [N ] has its own private reward function r n (s, a) -corresponding to different tasks -while sharing the same transition kernel of the environment. The goal is to collectively learn a single policy &#960; that maximizes the global value function given by V &#960; (s) =</p><p>Clearly, the global value function corresponds to using the average reward of all agents r(s, a) =</p><p>In parallel, we are interested in the entropy-regularized setting, where each agent n &#8712; [N ] is equipped with a regularized reward function given by r &#964;,n (s, a) := r n (s, a)&#964; log &#960;(a|s). And we define similarly the regularized value functions as</p><p>and the global soft Q-function is given by</p><p>s, a). Federated policy optimization in the fully decentralized setting. We consider a federated setting with fully decentralized communication, that is, all the agents are synchronized to perform information exchange over some prescribed network topology denoted by an undirected weighted graph G([N ], E).</p><p>Here, E stands for the edge set of the graph with N nodes -each corresponding to an agentand two agents can communicate with each other if and only if there is an edge connecting them. The information sharing over the graph is best described by a mixing matrix <ref type="bibr">[NO09]</ref>, denoted by</p><p>, where w ij is a positive number if (i, j) &#8712; E and 0 otherwise. We also make the following standard assumptions on the mixing matrix. Assumption 3.1 (double stochasticity). The mixing matrix W</p><p>The following standard metric measures how fast information propagates over the graph. Definition 3.2 (spectral radius). The spectral radius of W is given as</p><p>The spectral radius &#963; determines how fast information propagate over the network. For instance, in a fully-connected network, we can achieve &#963; = 0 by setting</p><p>For control of 1/(1&#963;) regarding different graphs, we refer the readers to <ref type="bibr">[NOR18]</ref>. In an Erd&#246;s-R&#233;nyi random graph, as long as the graph is connected, one has with high probability &#963; &#8781; 1. Another immediate consequence is that for any x &#8712; R N , letting x = 1 N 1 &#8868; N x be its average, we have</p><p>where the consensus error contracts by a factor of &#963;.</p><p>Algorithm 1 Federated NPG (FedNPG)</p><p>1: Input: learning rate &#951; &gt; 0, iteration number T &#8712; N + , mixing matrix W &#8712; R N &#215;N . 2: Initialize:</p><p>Update the policy for each (s, a) &#8712; S &#215; A:</p><p>where</p><p>Evaluate Q (t+1) .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>6:</head><p>Update the global Q-function estimate for each (s, a) &#8712; S &#215; A:</p><p>7: end for 3.1 Proposed federated NPG algorithms Assuming softmax parameterization, the problem can be formulated as decentralized optimization,</p><p>where &#960; &#952; := softmax(&#952;) subject to communication constraints. Motivated by the success of NPG methods, we aim to develop federated NPG methods to achieve our goal. For notational convenience, let &#960; (t) := &#960;</p><p>&#8868; be the collection of policy estimates at all agents in the t-th iteration.</p><p>Let</p><p>which satisfies that &#960; (t) (a|s)</p><p>could be seen as the normalized geometric mean of {&#960;</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8868;</head><p>. We shall often abuse the notation and treat &#960; (t) , Q</p><p>&#964; as matrices in R N &#215;|S||A| , and treat &#960; (t) (a|s), Q</p><p>&#964; (a|s) as vectors in R N , for all (s, a) &#8712; S &#215; A.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Vanilla federated NPG methods.</head><p>To motivate the algorithm development, observe that the NPG method (cf. (8)) applied to (12) adopts the update rule &#960; (t+1) (a|s) &#8733;</p><p>for all (s, a) &#8712; S &#215; A. Two challenges arise when executing this update rule: the policy estimates are maintained locally without consensus, and the global Q-function are unavailable in the decentralized setting. To address these challenges, we apply the idea of dynamic average consensus <ref type="bibr">[ZM10]</ref>, where each agent maintains its own estimate T (t) n (s, a) of the global Q-function, which are collected as vector</p><p>At each iteration, each agent updates its policy estimates based on its neighbors' information via gossip mixing, in addition to a correction term that tracks the difference Q</p><p>n n (s, a) of the local Q-functions between consecutive policy updates. Note that the mixing is applied linearly to the logarithms of local policies, which translates into a multiplicative mixing of the local policies. Algorithm 1 summarizes the detailed procedure of the proposed algorithm written in a compact matrix form, which we dub as federated NPG (FedNPG). Note that the agents do not need to share their reward functions with others, and agent n &#8712; [N ] will only be responsible to evaluate the local policy &#960; (t) n using the local reward r n .</p><p>Entropy-regularized federated NPG methods. Moving onto the entropy regularized case, we adopt similar algorithmic ideas to decentralize (9), and propose the federated NPG (FedNPG) method with entropy regularization, summarized in Algorithm 2 (see Appendix C.1). Clearly, the entropyregularized FedNPG method reduces to vanilla FedNPG in the absence of the regularization (i.e., when &#964; = 0).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Theoretical guarantees</head><p>Global convergence of FedNPG with exact policy evaluation. We begin with the global convergence of FedNPG (cf. Algorithm 1), stated in the following theorem. The formal statement and proof can be found in Appendix D.3, and see Appendix B.2 for discussions on the technical challenges.</p><p>Theorem 3.3 characterizes the average-iterate convergence of the average policy &#960; (t) (cf. ( <ref type="formula">14</ref>)) across the agents, which depends logarithmically on the size of the action space, and independently on the size of the state space. Theorem 3.3 indicates that in the server-client setting with &#963; = 0, the convergence rate of FedNPG recovers the O(1/T ) rate, matching that of the centralized NPG established in <ref type="bibr">[AKLM21]</ref>; on the other end, in the decentralized setting where &#963; &gt; 0, FedNPG slows down and eventually converges at the slower O(1/T 2/3 ) rate.</p><p>We state the iteration complexity in Corollary 3.4.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Corollary 3.4 (Iteration complexity of exact FedNPG).</head><p>To reach</p><p>Global convergence of FedNPG with inexact policy evaluation. In practice, the policies need to be evaluated using samples collected by the agents, where the Q-functions are only estimated approximately. We are interested in gauging how the approximation error impacts the performance of FedNPG, as demonstrated in the following theorem. The formal statement, detailed discussions, and proof of this result is given in Appendix D.4. Theorem 3.5 (Global sublinear convergence of inexact FedNPG (informal)). Suppose that an estimate q</p><p>in Algorithm 1. Under the assumptions of Theorem 3.3, when</p><p>, we have</p><p>Equipped with existing sample complexity bounds on policy evaluation, e.g. using a simulator as in <ref type="bibr">[LWCC23a]</ref>, this immediate leads to the sample complexity per state-action pair at each agent to find an &#949;-optimal policy is at most</p><p>for sufficiently small &#949;.</p><p>Global convergence of entropy-regularized FedNPG with exact policy evaluation. Next, we present our global convergence guarantee of entropy-regularized FedNPG with exact policy evaluation (cf. Algorithm 2). Theorem 3.6 (Global linear convergence of exact entropy-regularized FedNPG (informal)). For any &#947; &#8712; (0, 1) and 0 &lt; &#964; &#8804; 1, there exists</p><p>, such that if 0 &lt; &#951; &#8804; &#951; 0 , then we have</p><p>where and <ref type="figure">C</ref> 1 is some problem-dependent constant. Furthermore, the consensus error satisfies</p><p>The exact expressions of C 1 and &#951; 0 are specified in Appendix D.1. Theorem 3.6 confirms that entropy-regularized FedNPG converges at a linear rate to the optimal regularized policy, which is almost independent of the size of the state-action space, highlighting the positive role of entropy regularization in federated policy optimization. When the network is fully connected, i.e. &#963; = 0, the iteration complexity of entropy-regularized FedNPG reduces to O 1 &#951;&#964; log 1 &#949; , matching that of the centralized entropy-regularized NPG established in <ref type="bibr">[CWC21]</ref>. When the network is less connected, one needs to be more conservative in the choice of learning rates, leading to a higher iteration complexity, as described in the following corollary. Corollary 3.7 (Iteration complexity of exact entropy-regularized FedNPG). To reach log &#960; &#8902; &#964;log &#960; (t) &#8734; &#8804; &#949;, the iteration complexity of entropy-regularized FedNPG is at most</p><p>up to logarithmic factors. Especially, when &#951; = &#951; 0 , the best iteration complexity becomes</p><p>Global convergence of entropy-regularized FedNPG with inexact policy evaluation. Last but not the least, we present the informal convergence results of entropy-regularized FedNPG with inexact policy evaluation, whose formal version can be found in Appendix D.2. Theorem 3.8 (Global linear convergence of inexact entropy-regularized FedNPG (informal)). Suppose that an estimate q</p><p>n &#964;,n in Algorithm 2. Under the assumptions of Theorem 3.6, we have</p><p>where</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Federated NAC with function approximation and stochastic evaluation</head><p>In this section, motivated by the design and analysis of FedNPG, we go beyond the tabular setting and exact policy evaluation, by proposing a federated natural actor-critic (FedNAC) method with function approximation and stochastic policy evaluation. Specifically, we consider the policy with function approximation under softmax parameterization is of the following form:</p><p>for all (s, a) &#8712; S &#215; A and &#958; &#8712; R p , where &#981; : S &#215; A &#8594; R p is a known feature map. We assume &#981; is bounded over S &#215; A, i.e., there exists C &#981; &gt; 0 such that &#8741;&#981;(s, a)&#8741; 2 &#8804; C &#981; holds for all (s, a) &#8712; S &#215; A.</p><p>Following [AKLM21, YDG + 22], given any w &#8712; R p , Q : S &#215; A &#8594; R and probability distribution &#950; &#8712; &#8710;(S &#215; A) over the state-action space, we define the function approximation error &#8467;(w, Q, &#950;) as follows:</p><p>By searching for w that minimizes &#8467;(w, Q, &#950;), it approximates Q(s, a) using the feature map &#981; with respect to the distribution &#950;.</p><p>Algorithm design. Let us now discuss the high-level design of FedNAC, which is presented in Algorithm 3, with more details provided in Appendix C.2. At the t-th iteration (t = 0, . . . , T -1), denote the actor (concerning the policies) parameters of all agents as &#958; (t) = (&#958;</p><p>N ) &#8868; &#8712; R N &#215;p , and the critic parameters of all agents as w (t) = (w</p><p>&#8226; First, the critic parameter w (t) n is locally updated at each agent by aiming to minimize &#8467;(w, Q</p><p>)) with gradient descent, where Q (t) n is the local Q-function of the local policy f &#958; (t) n , and d(t) n is the state-action visitation distribution induced by the local policy f &#958; (t) n and an initial state-action distribution &#957; (determined from the data sampling mechanism, cf. (30)). However, since</p><p>n is not directly available, it needs to be estimated from samples. Therefore, the critic update takes K steps of stochastic gradient descent with critic learning rate &#946;, given by</p><p>, and Q &#958; (s k , a k ) is a careful estimate of the Q-value using a trajectory with expected length 1/(1&#947;) (see Algorithm 5 in Appendix C.2 adopted from [YDG + 22, Lemma 4]), and w 0 = 0 for simplicity. The final critic is updated as w</p><p>The total sample complexity of the critic update per iteration is then on the order of K/(1&#947;).</p><p>&#8226; Next, the critic parameter h (t) n for estimating the global Q-function can then be estimated by averaging with the neighbors with the Q-tracking term, given by h (t) = W h (t-1) + w (t)w (t-1) .</p><p>&#8226; Finally, the actor parameter &#958; (t) n can be updated via averaging with the neighbors along with the policy gradient informed by h (t) n , given by &#958; (t+1) = W &#958; (t) + &#945;h (t) , where &#945; is the learning rate of the actor.</p><p>Note that the sample complexity of FedNAC is on the order of KT /(1&#947;). An important aspect of the FedNAC method is that the policy is updated using trajectory data collected via executing the learned policy, which is closer to practice and more challenging to learn than using the generative model.</p><p>Theoretical guarantees. We first state the assumptions that are needed to guarantee the convergence of Algorithm 3, which are all commonly used in the literature, e.g., [YDG + 22, AKLM21]. To begin, we require the covariance matrix of the feature map induced by the initial state-action distribution &#957; satisfies the following assumption to guarantee the convergence of the critic. Assumption 4.1 (PSD of the covariance matrix of the feature map). There exists &#181; &gt; 0 such that E (s,a)&#8764;&#957; &#981;(s, a)&#981; &#8868; (s, a) = &#931; &#957; &#8805; &#181;I.</p><p>We also need to ensure that the Q-values can be well approximated by the linear function approximation using feature map &#981;(s, a), which is captured next. Assumption 4.2 (Bounded approximation error). For each n &#8712; [N ], there exists &#949; n approx &#8805; 0 such that for all t &#8712; N, it holds that E &#8467; w</p><p>) n &#8804; &#949; n approx , where w (t) &#8902;,n := arg min w &#8467; w (t) &#8902;,n , Q (t) n , d(t) n . We denote the average approximation error as &#949;approx = 1 N N n=1 &#949; n approx . Similar as [YDG + 22], we need the following assumption that bounds the transfer errors due to distribution shifts. Assumption 4.3 (Bounded transfer error). There exists C &#957; &gt; 0 such that for all n &#8712; [N ] and t &#8712; N, it holds that E (s,a)&#8764; d(t) n h &#960; (s,a) d(t) n (s,a) 2 &#8804; C &#957; , where h &#960; (s, a) is the state-action visitation distribution induced by any policy &#960; from initial state distribution &#961;.</p><p>Note that if we choose &#957;(s, a) &gt; 0 for all (s, a) &#8712; S &#215; A, then Assumption 4.3 is guaranteed to hold true (see Lemma E.4 in Appendix E). We are now ready to state the convergence guarantee, whose formal version and proof could be found in Appendix E.</p><p>Theorem 4.4 (Convergence rate of Algorithm 3 (informal)). Let &#958;</p><p>n , and f (t) := f&#958;(t) as the average policy. Then under Assumption 3.1, 4.1, 4.2 and 4.3, with appropriately chosen learning rates &#945; and &#946;, as long as the number of actor iterations satisfies</p><p>and the number of critic iterations satisfies</p><p>In the server-client setting when &#963; = 0, to reach (26), it suffices to choose</p><p>The sample complexity matches that of (centralized) Q-NPG established in [YDG + 22] with a single agent. On the other end, in the fully decentralized setting when &#963; is not close to 0, FedNAC requires O 1 (1-&#947;) 45/4 &#949; 7/2 (1-&#963;) 3/2 samples for each agent and O 1 &#949; 3/2 (1-&#947;) 17/4 (1-&#963;) 3/2 rounds of communication to reach (26), for sufficiently small &#949;. Encouragingly, the dependency on the accuracy level &#949; -the dominating factor -in the sample complexity matches that of FedNPG given in (19) when assuming access to the generative model, which allows query of arbitrary state-action pairs. In contrast, FedNAC only collects on-policy samples, and therefore is much more challenging to guarantee its convergence.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Conclusions</head><p>This work proposes the first provably efficient federated NPG (FedNPG) methods for solving vanilla and entropy-regularized multi-task RL problems in the fully decentralized setting. The established finite-time global convergence guarantees are almost independent of the size of the state-action space up to some logarithmic factor, and illuminate the impacts of the size and connectivity of the network. Furthermore, the proposed FedNPG methods are provably robust vis-a-vis inexactness of local policy evaluations. Last but not least, we also propose FedNAC, which can be viewed as an extension of FedNPG with function approximation and stochastic policy evaluation, and establish its finite-time sample complexity. Future directions include generalizing the framework of federated policy optimization to allow personalized policy learning in a shared environment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Related work</head><p>Global convergence of NPG methods for tabular MDPs. <ref type="bibr">[AKLM21]</ref> first establishes a O(1/T ) last-iterate convergence rate of the NPG method under softmax parameterization with constant step size, assuming access to exact policy evaluation. When entropy regularization is in place, <ref type="bibr">[CWC21]</ref> establishes a global linear convergence to the optimal regularized policy for the entire range of admissible constant learning rates using softmax parameterization and exact policy evaluation, which is further shown to be stable in the presence of &#8467; &#8734; policy evaluation errors. The iteration complexity of NPG methods is nearly independent with the size of the state-action space, which is in sharp contrast to softmax policy gradient methods that may take exponential time to converge <ref type="bibr">[LWCC23b,</ref><ref type="bibr">MXSS20]</ref>. <ref type="bibr">[Lan23]</ref> proposed a more general framework through the lens of mirror descent for regularized RL with global linear convergence guarantees, which is further generalized in [ZCH + 23, LLZ23]. Earlier analysis of regularized MDPs can be found in <ref type="bibr">[SEM20]</ref>. Besides, <ref type="bibr">[Xia22]</ref> proves that vanilla NPG also achieves linear convergence when geometrically increasing learning rates are used; see also <ref type="bibr">[KJVM21,</ref><ref type="bibr">BR21]</ref>. [ZLK + 22] developed an anchor-changing NPG method for multi-task RL under various optimality criteria in the centralized setting.</p><p>Convergence and sample complexity results of NAC. The convergence and sample complexity of a variety of natural actor-critic methods (NACs) are extensively studied in the literature [BSGL09, WCYW19, KDRM22, AKLM21, YDG + 22]. More pertinent to our work, <ref type="bibr">[AKLM21]</ref> introduced Q-NPG-a sample version of the NPG method with function approximation under softmax parameterization -and obtained a convergence rate of O(1/ &#8730; T ). [YDG + 22] weakens some of its assumptions and improves the convergence rate to O(1/T ) and gives the O(1/&#949; 3 ) sample complexity using a constant actor learning rate. The FedNAC method we propose in this paper can be seen as a decentralized version of Q-NPG, and in the server-client setting where the network is fully connected, our convergence rate and sample complexity match those in [YDG + 22].</p><p>Distributed and federated RL. There have been a variety of settings being set forth for distributed and federated RL. [MBM + 16, ESM + 18, ARB + 19, KSJM22, WJC23] focused on developing federated versions of RL algorithms to accelerate training, assuming all agents share the same transition kernel and reward function; in particular, [KSJM22, WJC23, WSJC24] established the provable benefits of federated learning in terms of linear speedup. More pertinent to our work, [ZRY + 23, AR21] considered the federated multi-task framework, allowing different agents having private reward functions. [ZRY + 23] proposed an empirically probabilistic algorithm that can seek an optimal policy under the server-client setting, while <ref type="bibr">[AR21]</ref> developed new attack methods in the presence of adversarial agents. Recently [LWA + 23] discussed how to avoid transmitting the Hessian matrix during communication in the server-client setting where all agents share the same reward function. Different from the FRL framework, [CZGB21, CZC21, OPA + 17, KMP12, CFGW22, ZAD + 21] considered the distributed multi-agent RL setting where the agents interact with a dynamic environment through a multi-agent Markov decision process, where each agent can have their own state or action spaces.</p><p>[ZAD + 21] developed a decentralized policy gradient method where different agents have different MDPs, where a special case of their setting recovers ours. However, the convergence rate developed in [ZAD + 21] has rather pessimistic dependencies with the size of the state-action space, together with other parameters, without leveraging natural policy gradients and gradient tracking techniques.</p><p>Decentralized first-order optimization algorithms. Early work of consensus-based first-order optimization algorithms for the fully decentralized setting include but are not limited to [LO08, NO09, DAW11]. Gradient tracking, which leverages the idea of dynamic average consensus <ref type="bibr">[ZM10]</ref> to track the gradient of the global objective function, is a popular method to improve the convergence speed [QL17, NOS17, DLS16, PN21, LCCC20].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Additional Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 Application Related to Federated Multi-task RL</head><p>In this section, we elaborate more on our motivation and the application scenarios where federated multi-task RL becomes highly relevant.</p><p>We first provide some key motivations for our federated multi-task RL setting as follows.</p><p>&#8226; Efficient knowledge transfer: multi-task RL enables agents to transfer knowledge across related tasks, accelerating learning and improving performance by leveraging experiences gained from one task to another. For instance, in our healthcare example in Section 1, by learning across hospitals with varying demographics, the agent can identify treatment strategies that are effective across diverse patient populations without directly accessing sensitive patient information.</p><p>&#8226; Generalization and adaptability: agents trained with multi-task RL can generalize their learned policies, adapt to new tasks, and handle diverse environments more effectively, enhancing their robustness and adaptability. In the healthcare example, an optimal treatment over different hospitals better adapts to variations in patient characteristics.</p><p>&#8226; Resource optimization: training a single policy for multiple tasks optimizes resource usage compared to training separate policies for each task, making it more efficient in scenarios with limited data or computational resources. In the healthcare example, the collaborative approach enhances learning efficiency and scalability while preserving data privacy, particularly in settings where each hospital has limited access to patient information.</p><p>Below we provide more application scenarios of our setting.</p><p>1. To enhance ChatGPT's performance across different tasks or domains [MA22, RTR + 23], one might consult domain experts to chat and rate ChatGPT's outputs for solving different tasks, and train ChatGPT in a federated manner without exposing private data or feedback of each expert.</p><p>2. Our setting is especially suitable for the multi-task problems where each agent only have partial access of the "global" task. There are a lot of such problems.</p><p>&#8226; An example is the problem we consider in our experiments (see Appendix H), where we distributedly train the agents to learn a shared policy to follow a predetermined trajectory while each agent only has partial information of this trajectory. &#8226; The above problem could be seen as a simplified version of the Unmanned Aerial Vehicle (UAV) Patrol Mission, each unmanned aerial vehicle (UAV) patrols only in a specific area, and they need to collectively train a strategy utilizing information from the entire patrol range. &#8226; In the game setting, different agents aim to train a character to perform well in multiple tasks, and each agent trains on one task.</p><p>Despite the promise, provably efficient algorithms for federated multi-task RL remain substantially under-explored, especially in the fully decentralized setting. Our work is the first to provide efficient algorithms with global convergence guarantees for federated multi-task RL.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Theoretical Contribution</head><p>In this section, we stress that while our work is built upon the algorithmic ideas in the distributed learning, reinforcement learning and optimization literature, it is not a strightforward combination and the theoretical analysis is by no means trivial.</p><p>One key difficulty is to estimate the global Q-functions using only neighboring information and local data. To address this issue, we invoke the "Q-tracking" step (see Algorithm 1, 2), which is inspired by the gradient tracking method in decentralized optimization. Note that this generalization is highly non-trivial: to the best of our knowledge, the utility of gradient tracking has not been exploited in policy optimization, and the intrinsic nonconcavity issue, together with the use of natural gradients, prevents us from directly using the results from decentralized optimization. It is thus of great value to study if the combination of NPG and gradient tracking could lead to fast globally convergent algorithms as in the standard decentralized optimization literature despite the nonconcavity.</p><p>Besides, due to the lack of global information sharing, care needs to be taken to judiciously balance the use of neighboring information (to facilitate consensus) and local data (to facilitate learning) when updating the policy. Compared to the centralized version of our proposed algorithms, a much more delicate theoretical analysis is required to prove our convergence results. For example, the key step to establish the convergence rate of the single-agent exact entropy-regularized NPG is to form the 2nd-order linear system in Eq. ( <ref type="formula">46</ref>) in [CCC + 22a], while in our corresponding analysis, a 4th-order linear system in Eq. ( <ref type="formula">49</ref>) is needed, where the inequality in each line is non-trivial and requires the introduction of some intricate and novel auxiliary lemmas, see Appendix D.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Omitted Algorithms</head><p>C.1 Federated NPG (FedNPG) with entropy regularization</p><p>We record the entropy-regularized FedNPG method in Algorithm 2 here due to space limits.</p><p>Algorithm 2 Federated NPG (FedNPG) with entropy regularization</p><p>Update the policy for each (s, a) &#8712; S &#215; A:</p><p>Update the global Q-function estimate for each (s, a) &#8712; S &#215; A:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Development of FedNAC</head><p>For any policy &#960;, we let d &#960; s0 denote the discounted state visitation distribution of &#960; given an initial state s 0 &#8712; S, i.e., &#8704;s &#8712; S :</p><p>For a distribution &#961; &#8712; &#8710;(S), we define</p><p>We also define the state-action visitation distribution d&#960; &#961; as</p><p>Furthermore, we extend the definition of d&#960; &#961; by specifying the initial state-action distribution &#957; &#8712; &#8710;(S &#215; A) and define</p><p>Our proposed federated NAC method FedNAC could be seen as a decentralized version of Q-NPG method [AKLM21, YDG + 22], which we briefly review as follows.</p><p>Q-NPG method. Q-NPG is a sample version of NPG with function approximation which is suitable for the case where S or A is large or infinite. We consider the policy with function approximation under softmax parameterization (24).</p><p>Given an approximate solution w (t) for minimizing the function approximation error &#8467;(w, Q</p><p>) (see ( <ref type="formula">25</ref>)), the Q-NPG update rule &#958; (t+1) = &#958; (t) + &#945;w (t) , when plugged in parameterization (24), results in the following policy update rule when we set &#945; = &#951;/(1&#947;):</p><p>which could be seen as the function approximation version of the update rule (8) of vanilla NPG method.</p><p>Federated NAC method. FedNAC (describe in Section 4) is presented in Algorithm 3, whose subroutines are written in Algorithm 4, 5. In each iteration t of FedNAC, each agent n updates the critic parameter w</p><p>n locally using Algorithm 4, which aims to minimize &#8467;(w, Q</p><p>n , d(t) n ) by stochastic gradient descent. Note that since we don't know the Q-function</p><p>n in the gradients, we need to invoke Algorithm 5 [YDG + 22, Algorithm 3] to give an unbiased estimate</p><p>). As a consequence, in line 4 of Algorithm 4, we have</p><p>In each actor iteration, agents share with their neighbors actor and critic parameters, where the tracking scheme is also used.</p><p>Algorithm 3 Federated Natural Actor-Critic (FedNAC) 1: Input: number of actor iterations T , number of critic iterations K, actor learning rate &#945;, critic learning rate &#946;, discounted factor &#947; &#8712; [0, 1) 2: Initialization: initial state-action distribution &#957;, actor parameter &#958; (0) = (&#958;</p><p>Update the critic parameter for estimating the global Q-function:</p><p>Actor update:</p><p>7: end for Algorithm 4 Critic(K, &#957;, &#958;, &#947;, &#946;, r): sample-based regression solver to minimize &#8467;(w, Q</p><p>Compute the stochastic gradient estimator of L Q :</p><p>Critic Update:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Convergence analysis of FedNPG</head><p>For technical convenience, we present first the analysis for entropy-regularized FedNPG and then for vanilla FedNPG.</p><p>Algorithm 5 Q-Sampler(&#957;, &#960;, &#947;, r)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Analysis of entropy-regularized FedNPG with exact policy evaluation</head><p>To facilitate analysis, we introduce several notation below. For all t &#8805; 0, we recall &#960; (t) as the normalized geometric mean of {&#960;</p><p>from which we can easily see that for each (s, a)</p><p>&#964;,1</p><p>. . .</p><p>In addition, we define</p><p>For notational convenience, we also denote</p><p>Following [CCC + 22b], we introduce the following auxiliary sequence {&#958; (t) = (&#958;</p><p>where T (t) (s, a) is updated via (16). Similarly, we introduce an averaged auxiliary sequence {&#958;</p><p>We introduces four error metrics defined as</p><p>where u (t) , v (t) &#8712; R |S||A| are defined as</p><p>We collect the error metrics above in a vector &#8486; (t) &#8712; R 4 :</p><p>With the above preparation, we are ready to state the convergence guarantee of Algorithm 2 in Theorem D.1 below, which is the formal version of Theorem 3.6. Theorem D.1. For any N &#8712; N + , &#964; &gt; 0, &#947; &#8712; (0, 1), there exists &#951; 0 &gt; 0 which depends only on N, &#947;, &#964;, &#963;, |A|, such that if 0 &lt; &#951; &#8804; &#951; 0 and 1&#963; &gt; 0, then the updates of Algorithm 2 satisfy</p><p>Moreover, the consensus errors satisfy:</p><p>The dependency of &#951; 0 on N, &#947;, &#964;, &#963;, |A| is made clear in Lemma D.3 that will be presented momentarily in this section. The rest of this section is dedicated to the proof of Theorem D.1. We first state a key lemma that tracks the error recursion of Algorithm 2. Lemma D.2. The following linear system holds for all t &#8805; 0:</p><p>where we let</p><p>and</p><p>In addition, it holds for all t &#8805; 0 that</p><p>Proof. See Appendix F.1.</p><p>Let &#961;(&#951;) denote the spectral norm of A(&#951;). As &#8486; (t) &#8805; 0, it is immediate from (49) that</p><p>2 , and therefore we have</p><p>2 . It remains to bound the spectral radius &#961;(&#951;), which is achieved by the following lemma. Lemma D.3 (Bounding the spectral norm of A(&#951;)). Let</p><p>where</p><p>Proof. See Appendix F.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Analysis of entropy-regularized FedNPG with inexact policy evaluation</head><p>We define the collection of inexact Q-function estimates as</p><p>and then the update rule (U T ) should be understood as</p><p>in the inexact setting. For notational simplicity, we define e n &#8712; R as</p><p>and let e = (e</p><p>&#964; , the approximation of</p><p>With slight abuse of notation, we adapt the auxiliary sequence {&#958; (t) } t=0,&#8226;&#8226;&#8226; to the inexact updates as</p><p>In addition, we define</p><p>where</p><p>We let &#8486; (t) be</p><p>With the above preparation, we are ready to state the inexact convergence guarantee of Algorithm 2 in Theorem D.4 below, which is the formal version of Theorem 3.8.</p><p>Theorem D.4. Suppose that q</p><p>n &#964;,n in Algorithm 2. For any N &#8712; N + , &#964; &gt; 0, &#947; &#8712; (0, 1), there exists &#951; 0 &gt; 0 which depends only on N, &#947;, &#964;, &#963;, |A|, such that if 0 &lt; &#951; &#8804; &#951; 0 and 1&#963; &gt; 0, we have</p><p>Moreover, the consensus errors satisfy:</p><p>then inexact entropy-regularized FedNPG could still achieve 2&#949;-accuracy (i.e.</p><p>Remark D.5. When &#951; = &#951; 0 (cf. ( <ref type="formula">54</ref>) and ( <ref type="formula">53</ref>)) and &#964; &#8804; 1, the RHS of (67) is of the order</p><p>which can be translated into a crude sample complexity bound when using fresh samples to estimate the soft Q-functions in each iteration.</p><p>The rest of this section outlines the proof of Theorem D.4. We first state a key lemma that tracks the error recursion of Algorithm 2 with inexact policy evaluation, which is a modified version of Lemma D.2.</p><p>Lemma D.6. The following linear system holds for all t &#8805; 0:</p><p>where A(&#951;) is provided in Lemma D.2. In addition, it holds for all t &#8805; 0 that</p><p>Proof. See Appendix F.3.</p><p>By (68), we have</p><p>which gives</p><p>Here, (71</p><p>Recall that the bound on &#961;(&#951;) has already been established in Lemma D.3. Therefore we complete the proof of Theorem D.4 by combining the above inequality with (69) and (70) in a similar fashion as before. We omit further details for conciseness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Analysis of FedNPG with exact policy evaluation</head><p>We state the formal version of Theorem 3.3 below. Theorem D.7. Suppose all &#960; (0) n in Algorithm 1 are initialized as uniform distribution. When</p><p>we have</p><p>for any fixed state distribution &#961;. Furthermore, we have</p><p>The rest of this section is dedicated to prove Theorem D.7. Similar to (37), we denote the Q-functions of &#960; (t) by Q (t) :</p><p>In addition, similar to (38), we define</p><p>Following the same strategy in the analysis of entropy-regularized FedNPG, we introduce the auxiliary sequence {&#958; (t) = (&#958;</p><p>as well as the averaged auxiliary sequence {&#958; (t) &#8712; R |S||A| }:</p><p>As usual, we collect the consensus errors in a vector</p><p>, where u (t) , v (t) &#8712; R |S||A| are defined as:</p><p>Step 1: establishing the error recursion. The next key lemma establishes the error recursion of Algorithm 1. Lemma D.8. The updates of FedNPG satisfy</p><p>for all t &#8805; 0, where</p><p>In addition, we have</p><p>where</p><p>Moreover, when &#951; &#8804; &#951; 1 , we have</p><p>Proof. See Appendix F.4.</p><p>Note that when all &#960; (0)</p><p>n in Algorithm 1 are initialized as uniform distribution, &#8486; (0) = 0 and (84) indicates (73) in Theorem D.7.</p><p>Step 2: bounding the value functions. Let p &#8712; R 2 be defined as:</p><p>the rationale for this choice will be made clear momentarily. We define the following Lyapunov function</p><p>which satisfies</p><p>Here, the second inequality follows from (82). One can verify that the second term vanishes due to the choice of p(&#951;):</p><p>Therefore, we conclude that</p><p>.</p><p>Step 3: simplifying the expression. We first upper bound the first term in the RHS of (89). Assuming uniform initialization for all &#960; (0)</p><p>Therefore, putting together relations (86) and ( <ref type="formula">221</ref>) we have</p><p>To continue, we upper bound the second term in the RHS of (89). Note that</p><p>Thus we have</p><p>where the first inequality follows from (91) and the second inequality follows from the definition of &#951; 1 and J. By (92), we deduce</p><p>) and our advertised bound (72) thus follows from plugging (90) and ( <ref type="formula">93</ref>) into (89).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.4 Analysis of FedNPG with inexact policy evaluation</head><p>We state the formal version of Theorem 3.5 below.</p><p>Theorem D.9. Suppose that q</p><p>we have</p><p>for any fixed state distribution &#961;.</p><p>Furthermore, we have</p><p>We next outline the proof of Theorem D.9. With slight abuse of notation, we again define e n &#8712; R as</p><p>and let e = (e 1 , &#8226; &#8226; &#8226; , e n ) &#8868; . We define the collection of inexact Q-function estimates as</p><p>and then the update rule (16) should be understood as</p><p>in the inexact setting. Define q (t) , the approximation of Q (t) as</p><p>we adapt the averaged auxiliary sequence {&#958; (t) &#8712; R |S||A| } to the inexact updates as follows:</p><p>As usual, we define the consensus error vector as</p><p>The following lemma characterizes the dynamics of the error vector &#8486; (t) , perturbed by additional approximation error.</p><p>Lemma D.10. The updates of inexact FedNPG satisfy</p><p>In addition, we have</p><p>where &#981; (t) (&#951;) is defined in (83).</p><p>Moreover, when &#951; &#8804; &#951; 1 , we have</p><p>Proof. See Appendix F.5. Similar to (87), we can recursively bound &#934; (t) (&#951;) (defined in (86)) as</p><p>From the above expression we know that</p><p>via telescoping. Combining the above expression with (90), ( <ref type="formula">92</ref>) and (93), we have</p><p>which establishes (94).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Convergence analysis of FedNAC</head><p>Let &#960; &#8902; be an optimal policy and does not need to belong to the log-linear policy class. Fix a state distribution &#961; &#8712; &#8710;(S) and a state-action distribution &#957;. To simplify the notation, we denote</p><p>, and define d</p><p>n and d(t) n analogously. We also let</p><p>and assume &#977; &#961; &lt; &#8734;.</p><p>We also introduce a weighted KL divergence given by</p><p>where</p><p>Given a state distribution &#961; and an optimal policy &#960; &#8902; , we define a state-action measure d&#8902; as</p><p>The following theorem guarantees that for any fixed policy &#960; and state-action distribution &#957; &#8712; &#8710;(S &#215; A), the Q-Sampler algorithm (cf. Algorithm 5) samples (s, a) from d&#960; &#957; and gives an unbiased estimate Q &#960; (s, a) of Q &#960; (s, a), whose proof can be found in [YDG + 22, Lemma 4].</p><p>Lemma E.1 (Lemma 4 in [YDG + 22]). Consider the output (s h , a h ) and Q &#960; (s h , a h ) of Algorithm 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>It follows that</head><p>To present the convergence results of FedNAC, we further introduce the following notation, where t &#8712; N represents the iteration step in FedNAC:</p><p>&#8902;,n &#8712; arg min</p><p>For convenience of narration, we introduce the following bounded statistical error assumption.</p><p>Assumption E.2 (Bounded statistical error). For all n &#8712; [N ], there exists &#949; n stat &gt; 0 such that for all t &#8712; N in Algorithm 3, we have</p><p>When solving the regression problem with sampling based approaches, we can expect &#949; n stat = O(1/K), where K is the iteration number of Algorithm 4.</p><p>Theorem E.3 (Convergence rate of Critic (Algorithm 4)). For Algorithm 4, let w 0 = 0 and &#946; = 1 2C &#981; . Then under Assumption 4.1, we have</p><p>where w &#8902; &#8712; arg min w &#8467; w, Q &#958; , d&#958; .</p><p>The proof of Theorem E.3 is postponed to Appendix G.5.</p><p>The following lemma provide a (very pessimistic) upper bound of C &#957; in Assumption 4.3.</p><p>Lemma E.4 (Upper bound of C &#957; ). If &#957;(s, a) &gt; 0 for all state-action pairs (s, a) &#8712; S &#215; A, then we have</p><p>Proof. We only need to note that</p><p>where the last inequality follows from (??).</p><p>We give some key lemmas which will be used in our proof of Theorem 4.4.</p><p>Lemma E.5 (consensus properties). For all t &#8712; N, we have</p><p>Proof. (115) could be obtained directly by using mathematical induction and update rule (33) (note that 1 N 1 &#8868; h (-1) = &#373;(-1) = 0, see line 2 of Algorithm 3), and (114) could be obtained by averaging both sides of (34) and using (115).</p><p>Lemma E.6 (Young's inequalities). Let {x 1 , &#8226; &#8226; &#8226; , x m } be a set of m vectors in R l . Then for any &#950; &gt; 0, we have</p><p>Lemma E.7 (Lipschitzness of Q-function with function approximation). Assume that r(s, a) &#8712; [0, 1], &#8704;(s, a) &#8712; S &#215; A. For any &#958;, &#958; &#8242; &#8712; R p , we have</p><p>Proof. See Appendix G.6.</p><p>For each iteration step t in Algorithm 3, we let &#958;(t</p><p>We let</p><p>and define &#948; (t) := V &#8902; -V (t) (&#961;), where V (t) is shorthand for V f (t) . We give the following performance improvement lemma. Lemma E.8 (Performance improvement of FedNAC). Fix a state distribution &#961;, then we have</p><p>Proof. See Appendix G.7.</p><p>Lemma E.9 (linear system). For any t &#8712; N, we let</p><p>2 ) &#8868; . Then for any &#950; &gt; 0, we have &#8486; (t+1) &#8804; C&#8486; (t) + s, (124) where</p><p>Proof. See Appendix G.8. Now we are ready to give the formal version of Theorem 4.4 and its proof.</p><p>Theorem E.10 (Convergence rate of FedNAC (formal)). Let &#958;</p><p>N in FedNAC (Algorithm 3), let the w (0) = 0 and the critic stepsize &#946; = 1 2C &#981; in Algorithm 4. Then under Assumptions 3.1, 4.1, 4.2 and 4.3, when the actor stepsize satisfies</p><p>where L Q is defined in Lemma E.7, we have</p><p>Moreover, the consensus errors could be upper bounded by</p><p>where</p><p>Remark E.11 (Sample and communication complexity). When &#963; &gt; 0 and</p><p>Consequently, we need</p><p>In Algorithm 5, each trajectory has the expected length 1/(1&#947;). Consider only the term where &#949; dominates, FedNAC requires O 1 (1-&#947;) 45/4 &#949; 7/2 (1-&#963; 2 ) 3/2 samples for each agent and O</p><p>On the other end, when &#963; = 0, (128) becomes:</p><p>Consequently, for any fixed &#945; &gt; 0, when &#963; = 0 or close to 0, with</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.1 Proof of Theorem E.10</head><p>We suppose Assumptions 3.1, E.2, 4.1, 4.2 and 4.3 holds. By Lemma E.9 and nonnegativity of each entry of C, s and &#8486; (t) where t &#8712; N, it's easy to see that</p><p>where &#8730; &#8226; is exerted element-wise.</p><p>In addition, taking expectation on both sides of (123) and using the act that</p><p>we have</p><p>We define the Lyapunov function &#934; (t) as follows:</p><p>where</p><p>It's straightforward to verify that when &#950; = 1-&#963; 2 2 , we have the entries in C (cf. (125)) satisfies</p><p>we deduce</p><p>which gives</p><p>which together with (140) and the fact</p><p>indicates q 1 , q 2 &gt; 0 and that</p><p>Thus by ( <ref type="formula">133</ref>) and (134) we have</p><p>which gives</p><p>Summing the above inequality over t = 0, 1, &#8226; &#8226; &#8226; , T -1 and divide both sides by T , we have</p><p>Since</p><p>and</p><p>By Theorem E.3 we know that &#8730; &#949;stat could be upper bounded as follows:</p><p>(128) follows from plugging (149) into (148) and noting that when &#958; </p><p>where the third inequality uses (138), (139), and the fourth inequality uses (127).</p><p>Therefore, similar to (230), when &#945; &#8804; &#945; 1 , we have</p><p>Combining the above inequality with (146), and (149), we obtain (129).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 Proof of Theorem E.3</head><p>The proof of Theorem E.3 could be found in Appendix C.5 in [YDG + 22]. We present it for completeness. To prove Theorem E.3, we need the following Theorem G.2.</p><p>Theorem E.12 (Theorem 1 in <ref type="bibr">[BM13]</ref>). Consider the following assumptions:</p><p>(i) The observations (a k , b k ) &#8712; R p &#215; R p are independent and identically distributed.</p><p>Consider the stochastic gradient recursion</p><p>In the proof of Theorem E.3 we'll show that for Algorithm 4, the assumptions in Theorem G.2 are all satisfied and thus we can use the result (267). Let H be the length of trajectory for estimating Q &#958; (s, a). Then Q &#958; (s, a) 2 is bounded by</p><p>from which we deduce E Q &#958; (s, a)&#981;(s, a)</p><p>Furthermore, we introduce the residual</p><p>then from Lemma 7 in [YDG + 22] we know that</p><p>To verify (iv), we let R = C &#981; in Theorem G.2, then E &#8741;&#981;(s, a)&#8741; 2 2 &#981;(s, a)&#981;(s, a) &#8868; &#8804; C 2 &#981; E &#981;(s, a)&#981;(s, a) &#8868; . Also note that</p><p>from which we deduce</p><p>The above expression implies</p><p>Therefore, (iv) is verified.</p><p>Thus by (267), with stepsize &#946; = 1 2C 2 &#981; , initialization w 0 = 0 and K steps of critic updates, we have</p><p>which gives (113).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Proof of key lemmas</head><p>F.1 Proof of Lemma D.2</p><p>Before proceeding, we summarize several useful properties of the auxiliary sequences (cf. ( <ref type="formula">40</ref>) and (41)), whose proof is postponed to Appendix G.1.</p><p>Lemma F.1 (Properties of auxiliary sequences {&#958; (t) } and {&#958; (t) }). {&#958; (t) } and {&#958; (t) } have the following properties:</p><p>1. &#958; (t) can be viewed as an unnormalized version of &#960; (t) , i.e.,</p><p>2. For any t &#8805; 0, log &#958; (t) keeps track of the average of log &#958; (t) , i.e.,</p><p>It follows that &#8704;s &#8712; S, t &#8805; 0 :</p><p>Lemma F.2 ([CCC + 22b, Appendix. A.2]). For any vector &#952; = [&#952; a ] a&#8712;A &#8712; R |A| , we denote by</p><p>Step 1: bound u (t+1) (s, a) = log &#958; (t+1) (s, a)log &#958; (t+1) (s, a)1 N 2 . By (40b) and (41b) we have</p><p>where the penultimate step results from the averaging property of W (property (11)). Taking maximum over (s, a) &#8712; S &#215; A establishes the bound on &#8486; (t+1) 1 in (49).</p><p>Step 2: bound v (t+1) (s, a) = T (t+1) (s, a) -</p><p>(s, a)1 N 2 . By (U T ) we have</p><p>where the penultimate step uses property (11), and the last step is due to</p><p>Step 3: bound</p><p>&#8734; . We decompose the term of interest as</p><p>which gives</p><p>The last step is due to log &#958;</p><p>n (s, a)log &#958; (t) (s, a) &#8804; u (t) (s, a), while the penultimate step results from writing</p><p>and applying the following lemma.</p><p>Plugging ( <ref type="formula">169</ref>) into (168) gives</p><p>Step 4: bound</p><p>Let w (t) : S &#215; A &#8594; R be defined as</p><p>Again, we treat w (t) as vectors in R |S||A| whenever it is clear from context. For any (s, a) &#8712; S &#215; A and n &#8712; [N ], by Lemma F.3 it follows that</p><p>and consequently</p><p>It boils down to control w (t) &#8734; . To do so, we first note that for each (s, a) &#8712; S &#215; A, we have</p><p>where (a) is due to the doubly stochasticity property of W and (b) is from the fact &#8741;W -I N &#8741; 2 &#8804; 2.</p><p>We further bound the second term as follows:</p><p>Here, the first step results from the following relation established in <ref type="bibr">[NNXS17]</ref>:</p><p>which also leads to</p><p>by Lemma F.2. For the remaining terms in (176), we have</p><p>and</p><p>where the first inequality again results from Lemma F.2. Plugging (178), ( <ref type="formula">179</ref>), ( <ref type="formula">180</ref>) into (176) and using the definition of u (t) , v (t) , we arrive at</p><p>Using previous display, we can write (174) as</p><p>Combining (167) with the above expression (181), we get</p><p>Step 5: bound</p><p>For any state-action pair (s, a) &#8712; S &#215; A, we observe that</p><p>where the first step invokes the definition of Q &#964; (cf. (6a)), and the second step is due to the following expression of V &#8902; &#964; established in <ref type="bibr">[NNXS17]</ref>:</p><p>To continue, note that by ( <ref type="formula">162</ref>) and (41b) we have</p><p>Plugging ( <ref type="formula">185</ref>) into ( <ref type="formula">183</ref>) and ( <ref type="formula">181</ref>) establishes the bounds on</p><p>for any (s, a) &#8712; S &#215; A. In view of property (164), the first term on the right-hand side of (186) can be bounded by</p><p>Plugging the above expression into (186), we have</p><p>which gives</p><p>(187) Plugging the above inequality into (171) and (182) establishes the bounds on &#8486; (t+1) 3</p><p>and &#8486; (t+1) 2 in (49), respectively. Step 6: boundmin s,a Q (t+1) &#964; (s, a)&#964; log &#958; (t+1) (s, a) . We need the following lemma which is adapted from Lemma 1 in [CCC + 22b]: Lemma F.4 (Performance improvement of FedNPG with entropy regularization). Suppose 0 &lt; &#951; &#8804; (1&#947;)/&#964; . For any state-action pair (s 0 , a 0 ) &#8712; S &#215; A, one has</p><p>Proof. See Appendix G.3.</p><p>Using (189), we have</p><p>which gives</p><p>This establishes the bounds on &#8486; (t+1) 4</p><p>in (49).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.2 Proof of Lemma D.3</head><p>Let f (&#955;) denote the characteristic function. In view of some direct calculations, we obtain</p><p>where, for the notation simplicity, we let</p><p>Note that among all these new notation we introduce, S, d are dependent of &#951;. To decouple the dependence, we give their upper bounds as follows</p><p>where (194) follows from &#951; &#8804; (1&#947;)/&#964; , and (195) uses the fact that &#945; &#8804; 1 and 1&#945; &#8804; 1.</p><p>Since A(&#961;) is a nonnegative matrix, by Perron-Frobenius Theorem (see <ref type="bibr">[HJ12]</ref>, Theorem 8.3.1), &#961;(&#951;) is an eigenvalue of A(&#961;). So to verify (55), it suffices to show that f (&#955;) &gt; 0 for any &#955; &#8712; [&#955; &#8902; , &#8734;).</p><p>To do so, in the following we first show that f (&#955; &#8902; ) &gt; 0, and then we prove that f is non-decreasing on [&#955; &#8902; , &#8734;).</p><p>&#8226; Showing f (&#955; &#8902; ) &gt; 0. We first lower bound f 0 (&#955; &#8902; ). Since &#955; &#8902; &#8805; 3+&#963; 4 , we have</p><p>and from &#955; &#8902; &#8805; 1+(1-&#945;)&#947;+&#945; 2 we deduce</p><p>which gives</p><p>Combining ( <ref type="formula">200</ref>), ( <ref type="formula">197</ref>), (198), we have that</p><p>To continue, we upper bound f 1 (&#955; &#8902; ) as follows.</p><p>Plugging ( <ref type="formula">201</ref>),( <ref type="formula">202</ref>) into (192) and using (199), we have</p><p>where the penultimate inequality uses 1 2&#963; &#8804; 1-&#963; 2 , and the last inequality follows from the definition of &#950; (cf. (53)).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.3 Proof of Lemma D.6</head><p>Note that bounding u (t+1) (s, a) is identical to the proof in Appendix F.1 and shall be omitted. The rest of the proof also follows closely that of Lemma D.2, and we only highlight the differences due to approximation error for simplicity.</p><p>Step 2: bound v (t+1) (s, a) = T (t+1) (s, a)q</p><p>Similar to (167) we have</p><p>Step 3: bound</p><p>&#8734; . In the context of inexact updates, (168) writes</p><p>For the last term, following a similar argument in (169) leads to</p><p>Combining the above two inequalities, we obtain</p><p>Step 4: bound Q</p><p>. We remark that the bound established in (174) still holds in the inexact setting, with the same definition for w (t) :</p><p>To deal with the approximation error, we rewrite (176) as</p><p>where the second term can be upper-bounded by</p><p>Combining ( <ref type="formula">207</ref>), (206) and the established bounds in (175), ( <ref type="formula">178</ref>), (180) leads to</p><p>Combining the above inequality with (205) and (203) gives</p><p>Step 5: bound</p><p>. It is straightforward to verify that (187) applies to the inexact updates as well:</p><p>Plugging the above inequality into (204) and ( <ref type="formula">208</ref>) establishes the bounds on &#8486; (t+1) 3</p><p>and &#8486; (t+1) 2 in (68), respectively.</p><p>Step 6: boundmin s,a Q</p><p>(s, a)-&#964; log &#958; (t+1) (s, a) . We obtain the following lemma by interpreting the approximation error e as part of the consensus error</p><p>in Lemma F.4. Lemma F.5 (inexact version of Lemma F.4). Suppose 0 &lt; &#951; &#8804; (1&#947;)/&#964; . For any state-action pair (s 0 , a 0 ) &#8712; S &#215; A, one has</p><p>Using (210), we have</p><p>F.4 Proof of Lemma D.8</p><p>Step 1: bound u (t+1) (s, a) = log &#958; (t+1) (s, a)log &#958; (t+1) (s, a)1 N</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2</head><p>. Following the same strategy in establishing (166), we have</p><p>or equivalently</p><p>Step 2: bound v (t+1) (s, a) = T (t+1) (s, a) -Q (t+1) (s, a)1 N 2 . In the same vein of establishing (167), we have</p><p>The term Q (t+1) (s, a) -Q (t) (s, a) 2 can be bounded in a similar way in (174):</p><p>where the coefficient (1+&#947;)&#947; (1-&#947;) 2 comes from M in Lemma F.3 when &#964; = 0, and w</p><p>It remains to bound w (t) 0</p><p>&#8734; . Towards this end, we rewrite (175) as w</p><p>Note that it holds for all (s, a) &#8712; S &#215; A:</p><p>. This along with (218) gives</p><p>Combining the above inequality with (216) and (215), we arrive at</p><p>Step 3: establish the descent equation. The following lemma characterizes the improvement in &#981; (t) (&#951;) for every iteration of Algorithm 1, with the proof postponed to Appendix G.4. Lemma F.6 (Performance improvement of exact FedNPG). For all starting state distribution &#961; &#8712; &#8710;(S), we have the iterates of FedNPG satisfy</p><p>where</p><p>It remains to control the term Q</p><p>&#8734; . Similar to (169), for all t &#8805; 0, we have</p><p>where (a) invokes Lemma F.3 with &#964; = 0 and (b) stems from the definition of u (t) . This along with (220) gives</p><p>Step 4: bound the consensus error. To bound the consensus error log &#960;</p><p>we first upper bound the spectral norm of B(&#951;) which we denote as &#961;(B(&#951;)). Since B(&#951;) is a nonnegative matrix, by Perron-Frobenius Theorem, &#961;(B(&#951;)) is an eigenvalue of B(&#951;). So we only need to upper bound the eigenvalue of &#961;(B(&#951;)).</p><p>which gives</p><p>Note that when &#951; &#8804; &#951; 1 , we have (recall that J = 2(1+&#947;)&#947;</p><p>(1-&#947;) 2 &#8730; N ):</p><p>Plugging the above two expressions into (223) yields</p><p>Therefore, when &#951; &#8804; &#951; 1 , we have</p><p>Combining the above inequality with the following fact:</p><p>where the first inequality uses (165), we obtain (84).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F.5 Proof of Lemma D.10</head><p>The bound on u (t+1) (s, a) is already established in Step 1 in Appendix F.1 and shall be omitted. As usual we only highlight the key differences with the proof of Lemma D.8 due to approximation error.</p><p>Step</p><p>. From (96), we have</p><p>) Note that (216) still holds for inexact FedNPG:</p><p>0 is defined in (217). We rewrite (218), the bound on w (t) 0 (s, a), as w</p><p>With the following bound</p><p>Putting all pieces together, we obtain</p><p>Step 2: establish the descent equation. Note that Lemma F.6 directly applies by replacing Q (t) with q (t) :</p><p>To bound the middle term, for all t &#8805; 0, we have</p><p>Hence, ( <ref type="formula">102</ref>) is established by combining the above two inequalities.</p><p>Step 4: bound the consensus error. Similar as ( <ref type="formula">224</ref>), here we have </p><p>n is the normalization term (cf. line 5, Algorithm 2) and {c</p><p>n (s)} are some constants. To prove the second claim, &#8704;t &#8805; 0, &#8704;(s, a) &#8712; S &#215; A, let</p><p>Taking inner product with 1 N 1 for both sides of (U T ) and using the double stochasticity property of W , we get</p><p>By the choice of T (0) (line 2 of Algorithm 2), we have T</p><p>&#964; and hence by induction &#8704;t &#8805; 0 : T</p><p>Therefore, to prove (161), it suffices to verify the claim for t = 0:</p><p>By taking logarithm over both sides of the definition of &#960; (t+1) (cf. ( <ref type="formula">27</ref>)), we get</p><p>for some constant z (t) (s), which deviate from the update rule of log &#958; (t+1) by a global constant shift and hence verifies (162).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.2 Proof of Lemma F.3</head><p>For notational simplicity, we let</p><p>and Q &#960; &#952; &#964; , respectively. From (6a) we immediately know that to bound</p><p>, it suffices to control V &#952; &#964; (s) -V &#952; &#8242; &#964; (s) for each s &#8712; S. By (4) we have</p><p>so in the following we bound both terms in the RHS of (235).</p><p>Step 1: bounding H(s, &#960; &#952; ) -H(s, &#960; &#952; &#8242; ) . We first bound H(s, &#960; &#952; ) -H(s, &#960; &#952; &#8242; ) using the idea in the proof of Lemma 14 in <ref type="bibr">[MXSS20]</ref>. We let</p><p>and let h (t) &#8712; R |S| be &#8704;s &#8712; S :</p><p>Note that h (t) &#8734; &#8804; log |A|. We also denote H (t) : S &#8594; R |A|&#215;|A| by: &#8704;s &#8712; S :</p><p>then we have &#8704;s &#8712; S :</p><p>where &#8706;h (t) (s) &#8706;&#952; (t) (&#8226;|s) stands for &#8706;h (t) (s) &#8706;&#952;(&#8226;|s) &#952;=&#952; (t) . The first term in (239) is further upper bounded as</p><p>By Lagrange mean value theorem, there exists t &#8712; (0, 1) such that</p><p>where the inequality follows from (239) and the above inequality. Combining (5) with the above inequality, we arrive at</p><p>Step 2: bounding V &#952; (s) -V &#952; &#8242; (s) . Similar to the previous proof, we bound V &#952; (s) -V &#952; &#8242; (s) by bounding dV &#952; (t) dt (s) . By Bellman's consistency equation, the value function of &#960; &#952; (t) is given by</p><p>which can be represented in a matrix-vector form as</p><p>where e s &#8712; R |S| is a one-hot vector whose s-th entry is 1,</p><p>with P t &#8712; R |S|&#215;|S| denoting the induced state transition matrix by &#960; &#952; (t) P t (s, s &#8242; ) = a&#8712;A &#960; &#952; (t) (a|s)P(s &#8242; |s, a) , (243) and r t &#8712; R |S| is given by &#8704;s &#8712; S : r t (s) := a&#8712;A &#960; &#952; (t) (a|s)r(s, a) . (244) Taking derivative w.r.t. t in (241), we obtain [PP08] dV &#952; (t) (s) dt = &#947; &#8226; e &#8868; s M t dP t dt M t r t + e &#8868; s M t dr t dt .</p><p>We now calculate each term respectively.</p><p>&#8226; For the first term, it follows that</p><p>where the second and fourth lines use the fact &#8741;M t &#8741; 1 &#8804; 1/(1&#947;) [LWCC23a, Lemma 10], and the last line follow from We defer the proof of (246) to the end of proof.</p><p>&#8226; For the second term, it follows that</p><p>where the first inequality follows again from &#8741;M t &#8741; 1 &#8804; 1/(1&#947;), and the second inequality follows from</p><p>Plugging the above two inequalities into (245) and using Lagrange mean value theorem, we have</p><p>Step 3: sum up. Combining (250), ( <ref type="formula">240</ref>) and (235), we have</p><p>Combining ( <ref type="formula">251</ref>) and (6a), (170) immediately follows.</p><p>Proof of (246). For any vector x &#8712; R |S| , we have</p><p>t dt x s = s &#8242; &#8712;S a&#8712;A d&#960; &#952; (t) (a|s) dt P(s &#8242; |s, a)x(s &#8242; ) , from which we can bound the l &#8734; norm as dP t dt x &#8734; &#8804; max s a&#8712;A s &#8242; &#8712;S P(s &#8242; |s, a) d&#960; &#952; (t) (a|s) dt &#8741;x&#8741; &#8734; = max s a&#8712;A</p><p>as desired, where the last line follows from the following fact:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.3 Proof of Lemma F.4</head><p>To simplify the notation, we denote</p><p>We first rearrange the terms of (234) and obtain</p><p>(254) This in turn allows us to express V (t) &#964; (s 0 ) for any s 0 &#8712; S as follows</p><p>where the first identity makes use of (6b), the second line follows from (254). Invoking (6b) again to rewrite the z(s 0 ) appearing in the first term of (255), we reach</p><p>-&#964; log &#960; (t+1) (a 0 |s 0 ) + r(s 0 , a 0 ) + &#947;V</p><p>Note that for any (s 0 , a 0 ) &#8712; S &#215; A, we have</p><p>To finish up, applying (256) recursively to expand V (t) &#964; (s i ), i &#8805; 1 and making use of (257), we arrive at</p><p>where the third line follows since V (t+1) &#964; can be viewed as the value function of &#960; (t+1) with adjusted rewards r (t+1) (s, a) := r(s, a)&#964; log &#960; (t+1) (s|a). And (188) follows immediately from the above inequality (258). By (6a) we can easily see that (189) is a consequence of (188).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.4 Proof of Lemma F.6</head><p>We first introduce the famous performance difference lemma which will be used in our proof.</p><p>Lemma G.1 (Performance difference lemma). For any policy &#960;, &#960; &#8242; &#8712; &#8710;(A) S and &#961; &#8712; &#8710;(S), we have</p><p>For all t &#8805; 0, we define the advantage function A (t) as:</p><p>Then for Alg. 1, the update rule of &#960; (Eq. ( <ref type="formula">234</ref>)) can be written as</p><p>where &#948; (t) is defined in (253) and</p><p>where the first inequality follows by Jensen's inequality on the concave function log x and the last equality uses a &#8242; &#8712;A &#960; (t) (a &#8242; |s)A (t) (s, a &#8242; ) = 0.</p><p>For all starting state distribution &#181;, we use d (t+1) as shorthand for d &#960; (t+1) &#181; , the performance difference lemma (Lemma G.1) implies:</p><p>from which we can see that</p><p>where we use (263), and that</p><p>which follows from</p><p>For any fixed &#961;, we use d &#8902; as shorthand for d &#960; &#8902; &#961; . By the performance difference lemma (Lemma G.1),</p><p>where we use (262) in the second equality.</p><p>By applying (265) with &#181; = d &#8902; as the initial state distribution, we have</p><p>Plugging the above equation into (266), we obtain</p><p>which gives Lemma F.6.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.5 Proof of Theorem E.3</head><p>The proof of Theorem E.3 could be found in Appendix C.5 in [YDG + 22]. We present it for completeness. To prove Theorem E.3, we need the following Theorem G.2.</p><p>Theorem G.2 (Theorem 1 in <ref type="bibr">[BM13]</ref>). Consider the following assaumptions:</p><p>(i) The observations (a k , b k ) &#8712; R p &#215; R p are independent and identically distributed.</p><p>In the proof of Theorem E.3 we'll show that for Algorithm 4, the assumptions in Theorem G.2 are all satisfied and thus we can use the result (267).</p><p>Proof of Theorem E.3. We let a k and b k in Theorem G.2 be &#981;(s, a) and Q &#958; &#981;(s, a) in Algorithm 4, respectively. And we let &#8741;</p><p>Let H be the length of trajectory for estimating Q &#958; (s, a). Then Q &#958; (s, a) 2 is bounded by</p><p>Furthermore, we introduce the residual</p><p>then from [YDG + 22, Lemma 7] we know that E</p><p>To verify (iv), we let</p><p>from which we deduce</p><p>The above expression implies</p><p>Therefore, (iv) is verified.</p><p>Thus by (267), with stepsize &#946; = 1 2C 2 &#981; , initialization w 0 = 0 and K steps of critic updates, we have</p><p>which gives (113).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.6 Proof of Lemma E.7</head><p>Proof of Lemma E.7. For notational simplicity we let V &#958; , V &#958; &#8242; denote V f &#958; , V f &#958; &#8242; , resp. Same as in Lemma F.3, We define &#958; (t) = &#958; + t(&#958; &#8242;&#958;) and define P t , M t , r t by replacing &#960; &#958; (t) with f &#958; (t) in ( <ref type="formula">243</ref>),( <ref type="formula">242</ref>) and (244), respectively. Define</p><p>then we have &#8706;f &#958; (a|s) &#8706;&#958; = f &#958; (a|s) &#966;&#958; (s, a) .</p><p>Analogous to (252), we have</p><p>where the last line follows is due to</p><p>Same as (245) in Lemma F.3, we have</p><p>And similar to (249), we deduce</p><p>which gives</p><p>Following the same steps in (247), we deduce</p><p>Combining the above two expressions (277) and ( <ref type="formula">278</ref>) with (276), we deduce</p><p>which implies </p><p>Proof of Lemma E.8. By the update rule (114) and the parameterization (24) we know know that</p><p>where Z (t) (s) is a normalization coefficient to ensure a&#8712;A f (t+1) (s, a) = 1 for each s &#8712; S. Note that the above &#960; (t+1) could also be obtained by a mirror descent update:</p><p>where &#934;(s) &#8712; R |A|&#215;p is a matrix with rows &#981; &#8868; (s, a) &#8712; R p for a &#8712; A, and D(&#8226;, &#8226;) denotes the KL divergence defined in (109).</p><p>We apply the three-point descent lemma-Lemma G.3 with C = &#8710;(A), f = -&#945;&#10216;&#934;(s) &#373;(t) , &#8226;&#10217; and h : &#8710;(A) &#8594; R is the negative entropy with h(q) = a&#8712;A q(a) log q(a) and deduce that for any q &#8712; &#8710;(A), we have</p><p>Rearranging terms and dividing both sides by -&#945;, we obtain</p><p>Let q = f (t) (&#8226;|s) and &#960; &#8902; (&#8226;|s),resp., we have the following two inequalities:</p><p>Taking expectation w.r.t. distribution d &#8902; on both sides of (285), we arrive at</p><p>To simplify the notation we let Q(t) and V (t) denote Q f (t) and V f (t) , respectively. Note that the first expectation in the above expression (286) could be upper bounded as follows:</p><p>where the first inequality uses (??) and the definition of &#977; &#961; (107) and the last line follows from (260) in Lemma G.1. We separate the second term of the last line into four terms as follows:</p><p>Applying again Lemma G.1, we deduce the equivalent form of the second expectation in (286) as follows:</p><p>where the second term of the last line could be decomposed into the following terms:</p><p>Plugging (288), (290) into (287) and (289), resp., and making use of (286), we have </p><p>where L Q is defined in Lemma E.7, the second line uses Cauchy-Schwartz's inequality and Young's inequality (117) and the last inequality uses Assumption 4.3 and Jensen's inequality. </p><p>Plugging ( <ref type="formula">296</ref>),( <ref type="formula">298</ref>),( <ref type="formula">299</ref>),(300) into (291) and dividing both sides by (1&#947;) yield &#977; &#961; &#948; (t+1)&#948; (t) +&#948; (t) &#8804; D</p><p>(t) &#8902; (1&#947;)&#945; -D (t+1) &#8902; (1&#947;)&#945; + 2 &#8730; C &#957; (&#977; + 1) 1&#947; &#63723; &#63725; &#949;(t) stat + 2 &#949;(t) approx + L 2 Q N &#958; (t) -1( &#958;(t) ) &#8868; 2 F &#63734; &#63736; .</p><p>Taking expectation on both sides of the above expression and making use of the simple fact that</p><p>we reach the conclusion (123).</p><p>G.8 Proof of Lemma E.9</p><p>Proof of Lemma E.9. For any &#950; &gt; 0, by the actor update rule (34) and (114) we have that</p><p>where the last line follows from Young's inequality (116) and (11). By the gradient tracking step (33) , Young's inequality (116) and (11), we have h (t+1) -1 &#373;(t+1)&#8868; 2 F = W (h (t) + w (t+1)w (t) ) -1 &#373;(t)&#8868; + 1( &#373;(t)&#8868;&#373;(t+1)&#8868; ) 2 F = W h (t) -1 &#373;(t)&#8868; + W (w (t+1)w (t) ) -1( &#373;(t+1)&#8868;&#373;(t)&#8868; ) + w (t)w</p><p>where the first inequality uses Young's inequality (117).</p><p>Note that by the update rule (34), the double stochasticity of the mixing matrix W and the consensus property (11) we have</p><p>where the penultimate line uses Jensen's inequality and the last line follows from (304), Assumption E.2 and (271).</p><p>Combining (310) and (309) with (302), we deduce</p><p>Finally, (124) follows from taking expectations on both sides of (301) and (311).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H Numerical experiments</head><p>Experimental setup. We study the empirical performance of FedNPG (Algorithm 1) and entropyregularized FedNPG (Algorithm 2) on a K &#215; K GridWorld problem. To be specific, the collective goal of N agents is to learn a global optimal policy to follow a predetermined path which starts at the top left corner and ends at the bottom right corner. However, each agent only has access to partial information about the whole map: in figure <ref type="figure">1</ref> (where we take N = 3 and K = 9 as an example), agent n explores on map n, n &#8712; [N ]. After taking an action, only when the agent is at the shaded positions can it get reward 1, otherwise it gets 0. We stipulate the action space of all agents to be A = {right, down}, i.e. movement is allowed only to the right or down. If an agent takes an action that will lead it out of the boarder of the map, we stipulate the agent's state doesn't change and receive reward 0. Each agent starts at the top left corner. To learn a shared policy to follow the path, we aim to maximize the average value function of all agents.</p><p>&#8226; It should be clear whether the error bar is the standard deviation or the standard error of the mean. &#8226; It is OK to report 1-sigma error bars, but one should state it. The authors should preferably report a 2-sigma error bar than state that they have a 96% CI, if the hypothesis of Normality of errors is not verified. &#8226; For asymmetric distributions, the authors should be careful not to show in tables or figures symmetric error bars that would yield results that are out of range (e.g. negative error rates). &#8226; If error bars are reported in tables or plots, The authors should explain in the text how they were calculated and reference the corresponding figures or tables in the text.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Experiments Compute Resources</head><p>Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?</p><p>Answer: <ref type="bibr">[NA]</ref> Justification: the results are irrelevant to the compute resources.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not include experiments.</p><p>&#8226; The paper should indicate the type of compute workers CPU or GPU, internal cluster, or cloud provider, including relevant memory and storage. &#8226; The paper should provide the amount of compute required for each of the individual experimental runs as well as estimate the total compute. &#8226; The paper should disclose whether the full research project required more compute than the experiments reported in the paper (e.g., preliminary or failed experiments that didn't make it into the paper).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Code Of Ethics</head><p>Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics <ref type="url">https://neurips.cc/public/EthicsGuidelines</ref>?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Answer: [Yes]</head><p>Justification: the research conducted in the paper conforms with the NeurIPS Code of Ethics.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the authors have not reviewed the NeurIPS Code of Ethics.</p><p>&#8226; If the authors answer No, they should explain the special circumstances that require a deviation from the Code of Ethics. &#8226; The authors should make sure to preserve anonymity (e.g., if there is a special consideration due to laws or regulations in their jurisdiction).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Broader Impacts</head><p>Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?</p><p>Answer: <ref type="bibr">[NA]</ref> Justification: this is a theoretical paper and it has no societal impact.</p><p>Guidelines:</p><p>&#8226; The answer NA means that there is no societal impact of the work performed.</p><p>&#8226; If the authors answer NA or No, they should explain why their work has no societal impact or why the paper does not address societal impact. &#8226; Examples of negative societal impacts include potential malicious or unintended uses (e.g., disinformation, generating fake profiles, surveillance), fairness considerations (e.g., deployment of technologies that could make decisions that unfairly impact specific groups), privacy considerations, and security considerations.</p><p>&#8226; The conference expects that many papers will be foundational research and not tied to particular applications, let alone deployments. However, if there is a direct path to any negative applications, the authors should point it out. For example, it is legitimate to point out that an improvement in the quality of generative models could be used to generate deepfakes for disinformation. On the other hand, it is not needed to point out that a generic algorithm for optimizing neural networks could enable people to train models that generate Deepfakes faster. &#8226; The authors should consider possible harms that could arise when the technology is being used as intended and functioning correctly, harms that could arise when the technology is being used as intended but gives incorrect results, and harms following from (intentional or unintentional) misuse of the technology. &#8226; If there are negative societal impacts, the authors could also discuss possible mitigation strategies (e.g., gated release of models, providing defenses in addition to attacks, mechanisms for monitoring misuse, mechanisms to monitor how a system learns from feedback over time, improving the efficiency and accessibility of ML).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="11.">Safeguards</head><p>Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pretrained language models, image generators, or scraped datasets)?</p><p>Answer: [NA]</p><p>Justification: the paper aims to provide a better understanding on existing algorithms and thus poses no such risks.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper poses no such risks.</p><p>&#8226; Released models that have a high risk for misuse or dual-use should be released with necessary safeguards to allow for controlled use of the model, for example by requiring that users adhere to usage guidelines or restrictions to access the model or implementing safety filters. &#8226; Datasets that have been scraped from the Internet could pose safety risks. The authors should describe how they avoided releasing unsafe images. &#8226; We recognize that providing effective safeguards is challenging, and many papers do not require this, but we encourage authors to take this into account and make a best faith effort.</p><p>12. Licenses for existing assets Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?</p><p>Answer: <ref type="bibr">[NA]</ref> Justification: the paper does not use existing assets.</p><p>Guidelines:</p><p>&#8226; The answer NA means that the paper does not use existing assets.</p><p>&#8226; The authors should cite the original paper that produced the code package or dataset.</p><p>&#8226; The authors should state which version of the asset is used and, if possible, include a URL. &#8226; The name of the license (e.g., CC-BY 4.0) should be included for each asset.</p><p>&#8226; For scraped data from a particular source (e.g., website), the copyright and terms of service of that source should be provided. &#8226; If assets are released, the license, copyright information, and terms of use in the package should be provided. For popular datasets, paperswithcode.com/datasets has curated licenses for some datasets. Their licensing guide can help determine the license of a dataset. &#8226; For existing datasets that are re-packaged, both the original license and the license of the derived asset (if it has changed) should be provided.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_0"><p>Our work seamlessly handles the server-client setting as a special case, by assuming the network topology as a fully connected network.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_1"><p>Here &#8741;&#8226;&#8741; could be any norm in R p .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_2"><p>Here &#8741;&#8226;&#8741; could be any norm in R p .</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p>Results. In the following we discuss the empirical results of our algorithms. In all the experiments, we fix the discounted factor &#947; = 0.99. In our experiments, we also don't require the mixing matrix to</p></note>
		</body>
		</text>
</TEI>
