<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Scalable Reinforcement Learning for Multiagent Networked Systems</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/23/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10324690</idno>
					<idno type="doi">10.1287/opre.2021.2226</idno>
					<title level='j'>Operations Research</title>
<idno type="issn">0030-364X</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Guannan Qu</author><author>Adam Wierman</author><author>Na Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact in a local manner where the objective is to find localized policies such that the (discounted) global reward is maximized. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a scalable actor critic (SAC) framework that exploits the network structure and finds a localized policy that is an [Formula: see text]-approximation of a stationary point of the objective for some [Formula: see text], with complexity that scales with the local state-action space size of the largest [Formula: see text]-hop neighborhood of the network. We illustrate our model and approach using examples from wireless communication, epidemics, and traffic.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>The modeling and optimization of networked systems such as wireless communication networks and traffic networks is a long-standing challenge. Because these systems are complex and include many unknown or unmodeled dynamics, analytic models must typically make numerous simplifying assumptions to remain tractable. Given the success of Reinforcement Learning (RL) in a wide array of domains such as game play <ref type="bibr">[Silver et al., 2016;</ref><ref type="bibr">Mnih et al., 2015]</ref>, robotics <ref type="bibr">[Duan et al., 2016]</ref>, and autonomous driving <ref type="bibr">[Li et al., 2019]</ref>, it has emerged as a promising tool for tackling the complexity of networked systems. However, when seeking to use RL in the context of the control and optimization of large-scale networked systems, scalability quickly becomes an issue. The goal of this paper is to develop scalable multi-agent RL for networked systems.</p><p>Motivated by real-world networked systems like wireless communication, epidemics, and traffic, we consider an RL model of n agents with a local interaction structure. Specifically, each agent i has local state s i and local action a i , and the agents are associated with an underlying dependence graph G and interact locally, i.e., the distribution of s i (t + 1) only depends on the current states of the local neighborhood of i as well as the local action a i (t). Further, each agent is associated with a stage reward r i that is a function of s i , a i , and the global stage reward is the average of the r i . In this setting, the design goal is to find a decision policy that maximizes the (discounted) global reward. This setting captures a wide range of applications, e.g. 
epidemics <ref type="bibr">[Mei et al., 2017]</ref> and social networks <ref type="bibr">[Chakrabarti]</ref>. To illustrate our model and our results, we provide stylized examples of applications in three areas: multi-access wireless communication, epidemics, and traffic signal control in Section 2.2. We conduct numerical experiments to demonstrate the performance of the approach using both synthetic examples and an application to wireless communication in Section 5.</p><p>Related Literature. Our problem falls into the category of the "succinctly described" MDPs in <ref type="bibr">Blondel and Tsitsiklis [2000, Section 5.2]</ref>, where the state/action space is a product space formed by the individual state/action spaces of multiple agents. As the state/action space is exponentially large, such problems are not scalable in general, even when the problem has structure <ref type="bibr">[Blondel and Tsitsiklis, 2000;</ref><ref type="bibr">Whittle, 1988;</ref><ref type="bibr">Papadimitriou and Tsitsiklis, 1999]</ref>. Despite this, there is a large literature on RL/MDPs in multi-agent settings, which we discuss below.</p><p>Multi-agent RL dates back to the early work of <ref type="bibr">Littman [1994]</ref>; <ref type="bibr">Claus and Boutilier [1998]</ref>; <ref type="bibr">Littman [2001]</ref>; <ref type="bibr">Hu and Wellman [2003]</ref> (see <ref type="bibr">Bu et al. [2008]</ref> for a review) and has been actively studied, e.g. <ref type="bibr">Zhang et al. [2018]</ref>; <ref type="bibr">Kar et al. [2013]</ref>; <ref type="bibr">Macua et al. [2015]</ref>; <ref type="bibr">Mathkar and Borkar [2017]</ref>; <ref type="bibr">Wai et al. [2018]</ref>, with a more recent review in <ref type="bibr">Zhang et al. [2021]</ref>. Multi-agent RL encompasses a broad range of settings, including competitive agents and Markov games. 
The case most relevant to ours is cooperative multi-agent RL, where typically the agents can take their own actions but share a common global state and maximize a global reward <ref type="bibr">[Bu et al., 2008]</ref>. This is in contrast to the model we study, in which each agent has its own state and acts upon its own state. Despite sharing a global state, multi-agent RL still faces scalability issues since the joint action space is exponentially large. Methods have been proposed to deal with this, including independent learners <ref type="bibr">[Tan, 1993;</ref><ref type="bibr">Claus and Boutilier, 1998;</ref><ref type="bibr">Matignon et al., 2012]</ref>, where each agent employs a single-agent RL method. While successful in some cases, the independent learner approach can suffer from instability <ref type="bibr">[Matignon et al., 2012]</ref>. Alternatively, one can use function approximation schemes to approximate the large Q-table, e.g. linear function approximation <ref type="bibr">[Zhang et al., 2018]</ref> or neural networks <ref type="bibr">[Lowe et al., 2017]</ref>. Such methods can reduce computational complexity significantly, but it is unclear whether the performance loss caused by the function approximation is small. In contrast, our technique not only reduces computation but also guarantees small performance loss.</p><p>Factored MDPs are problems where every agent has its own state and the state transition factorizes in a way similar to our model <ref type="bibr">[Kearns and Koller, 1999;</ref><ref type="bibr">Guestrin et al., 2003;</ref><ref type="bibr">Osband and Van Roy, 2014]</ref>. However, they differ from the model we consider in that each agent does not have its own action. Instead, there is a global action affecting every agent. Despite the difference, factored MDPs still suffer from scalability issues. Approaches similar to those used for multi-agent RL have been applied, e.g., <ref type="bibr">Guestrin et al. 
[2003]</ref> propose a class of "factored" linear function approximators; however, it is unclear whether the loss caused by the approximation is small.</p><p>Other Related Work. Our work is also related to weakly coupled MDPs, where every agent has its own state and action but the transitions are decoupled <ref type="bibr">[Meuleau et al., 1998]</ref>. Additionally, our model shares some similarity with Glauber dynamics in physics <ref type="bibr">[Lokhov et al., 2015;</ref><ref type="bibr">Mezard and Montanari, 2009]</ref>, though our focus is very different from these works. As we consider the class of localized policies, another related line of work is Partially Observable MDPs (POMDPs) <ref type="bibr">[Nair et al., 2005;</ref><ref type="bibr">Oliehoek and Amato, 2016;</ref><ref type="bibr">Bertsekas, 2005]</ref>, though our formulation and results are very different from those works.</p><p>Finally, this work is related to our earlier work <ref type="bibr">Qu and Li [2019]</ref>, which assumes full knowledge of the MDP model (i.e., it is not an RL setting) and imposes strong assumptions on the graph. In contrast, our work here does not need knowledge of the MDP and significantly relaxes the assumptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Preliminaries</head><p>In this section, we introduce our model, provide a few illustrative examples, and provide important background in RL that underlies our analysis. Throughout this paper, ∥ · ∥ denotes the Euclidean norm and ∥ · ∥ ∞ denotes the infinity norm. Notation t and T are reserved as iteration counters for the inner loop of the algorithm to be introduced later, m and M for the outer loop, and κ is used for counting the hops of neighbors. Notation O(·) hides constants and Õ(·) hides log factors with respect to iteration variables T , M and variable κ. The total variation distance between two distributions π, π′ over a finite set S is defined as TV(π, π′) = sup E⊂S |π(E) − π′(E)|.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Model</head><p>We consider a network of n agents that are associated with an underlying undirected graph G = (N , E), where N = {1, . . . , n} is the set of agents and E ⊂ N × N is the set of edges. Each agent i is associated with a state s i ∈ S i and an action a i ∈ A i , where S i and A i are finite sets. The global state is denoted as s = (s 1 , . . . , s n ) ∈ S := S 1 × · · · × S n and similarly the global action a = (a 1 , . . . , a n ) ∈ A := A 1 × · · · × A n . At time t, given the current state s(t) and action a(t), the next individual state s i (t + 1) is independently generated and depends only on neighbors:</p><p>P(s(t + 1) | s(t), a(t)) = ∏ i∈N P i (s i (t + 1) | s N i (t), a i (t)), (1)</p><p>where notation N i means the neighborhood of i (including i itself) and notation s N i means the states of the agents in N i . In addition, for integer κ ≥ 0, we use N κ i to denote the κ-hop neighborhood of i, i.e. the nodes whose graph distance to i has length less than or equal to κ. We also let f (κ) = sup i |N κ i |. Each agent is associated with a class of localized policies ζ θ i i parameterized by θ i . The localized policy ζ θ i i (a i |s i ) is a distribution on the local action a i conditioned on the local state s i , and each agent, conditioned on observing s i (t), takes an action a i (t) independently drawn from ζ θ i i (·|s i (t)). We use θ = (θ 1 , . . . , θ n ) to denote the tuple of the localized policies ζ θ i i , and also use ζ θ (a|s) = ∏ n i=1 ζ θ i i (a i |s i ) to denote the joint policy, which is a product distribution of the localized policies as each agent acts independently.</p><p>Further, each agent is associated with a stage reward function r i (s i , a i ) that depends on the local state and action, and the global stage reward is r(s, a) = (1/n) ∑ n i=1 r i (s i , a i ). 
The objective is to find a localized policy tuple θ such that the discounted global stage reward is maximized, starting from some initial state distribution π 0 :</p><p>max θ J(θ) := E s(0)∼π 0 [ ∑ ∞ t=0 γ t r(s(t), a(t)) ]. (2)</p><p>Remark 1. In the state transition (1) of our model, the distribution of each node's next state is allowed to depend on its neighbors' states s N i (t), but only on its own action a i (t) as opposed to its neighbors' actions a N i (t). This restriction is imposed only for simplicity of exposition. With a simple change of notation, our model, algorithm and analysis can be extended to the more general dependence on a N i (t). Similarly, each agent's reward function r i (s i , a i ) can be generalized to depend on its neighbors' state-action pairs, i.e. r i (s N i , a N i ), and each agent's localized policy can also be generalized to depend on its neighbors' states, i.e. ζ θ i i (a i |s N i ).</p><p>Remark 2. In this paper, the interaction graph G is undirected, but our model and results can be easily generalized to the directed graph setting without essential changes in the algorithm and the analysis. In detail, to generalize to the directed graph case, the only change needed is to redefine the "neighbors". Specifically, N i needs to be redefined as the "in-neighborhood" of i (including i itself), i.e. i itself and the set of nodes that have a directed link pointing towards i. In addition, N κ i needs to be redefined as the κ-hop "in-neighborhood" of i, i.e. the nodes whose shortest directed path to i has length less than or equal to κ (including i itself).</p></div>
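The factorized transition structure of the model — each next local state drawn from a local kernel given the neighborhood states and the local action, and each local action drawn from a localized policy given the local state — can be sketched as follows. This is a minimal illustration under our own interface assumptions (kernels and policies returning dicts of probabilities), not the paper's code:

```python
import random

# Minimal sketch of one step of the networked MDP of Section 2.1.
# Assumptions (ours): kernels[i] maps (s_{N_i}, a_i) to a dict
# {next local state: probability}; policies[i] maps s_i to a dict
# {local action: probability}; neighbors[i] lists N_i including i itself.

def sample(dist, rng=random):
    """Draw one outcome from a dict {outcome: probability}."""
    r, acc = rng.random(), 0.0
    for outcome, p in dist.items():
        acc += p
        if acc > r:
            return outcome
    return outcome  # guard against rounding of the probabilities

def step(state, neighbors, kernels, policies, rng=random):
    """Each agent acts on its own state; each next local state depends
    only on the neighborhood states s_{N_i}(t) and the local action a_i(t)."""
    n = len(state)
    actions = tuple(sample(policies[i](state[i]), rng) for i in range(n))
    next_state = tuple(
        sample(kernels[i](tuple(state[j] for j in neighbors[i]), actions[i]), rng)
        for i in range(n)
    )
    return next_state, actions
```

The key structural point mirrors the factorization in the model: the global transition is a product of local kernels, so one global step costs only n local samples rather than a draw from the exponentially large joint table.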
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Examples</head><p>In this section, we provide three networked system examples in wireless communication, epidemics, and traffic that feature the local dependence structure we study in this paper. For ease of exposition, we present simple versions of these examples, keeping the essence of the model and highlighting the dependence structure while ignoring some application-specific details.</p><p>Wireless Communication. We consider a wireless network with multiple access points <ref type="bibr">[Zocca, 2019]</ref>, where there is a set of users N = {1, 2, . . . , n} and a set of network access points Y = {y 1 , y 2 , . . . , y m }. Each user i only has access to a subset Y i ⊆ Y of the access points. We define the interaction graph as the conflict graph, in which two users i and j are neighbors if and only if they share an access point, i.e. the neighborhood of user i is N i = {i} ∪ {j ∈ N : Y i ∩ Y j ≠ ∅}.</p><p>Each user i maintains a queue of packets defined as follows. At time step t, with probability p i , user i receives a new packet with an initial deadline d i . Then, user i can choose to send the earliest packet in its queue to one access point in its available set Y i , or not send anything at all. If an action of sending to y k ∈ Y i is taken, and if no other user sends to the same access point at this time, then the earliest packet in user i's queue is transmitted with success probability q k , which depends on the access point y k ; however, if another user also chooses to send to y k , then there is a conflict and no transmission occurs. If the packet is successfully transmitted, it is removed from user i's queue and user i receives a reward of 1. After this, the system moves to the next time step, with all deadlines of the remaining packets decreasing by 1 and packets with deadline 0 being discarded. 
In this example, the local state s i of user i is a characterization of its queue of packets, and is represented by a binary tuple of length d i , s i = (e 1 , . . . , e d i ) ∈ {0, 1} d i ,</p><p>where for each ℓ ∈ {1, . . . , d i }, e ℓ ∈ {0, 1} indicates whether user i has a packet with remaining deadline ℓ. The action space is A i = {null} ∪ Y i , where null represents the action of not sending. The detailed transition is provided in Table <ref type="table">1</ref>, where s i (t + 1) only depends on s N i (t), a N i (t), which fits into the local interaction structure we consider. The local reward is given by r i (s N i (t), a N i (t)) = 1 in the case of the last row of Table <ref type="table">1</ref>, and r i (s N i (t), a N i (t)) = 0 in all other cases. The local state space S i , the local action space A i , the local transition probabilities in Table <ref type="table">1</ref> and the local reward function r i (·) form a networked MDP model as described in Section 2.1. The above model serves as a basis for more complex multi-access wireless communication models studied in the literature, including those with multiple channels <ref type="bibr">[Block and Van Houdt, 2016]</ref> and (imperfect) carrier sensing <ref type="bibr">[Kim et al., 2011]</ref>. Existing analytical approaches typically require knowledge of modeling details and parameters such as the packet arrival rate <ref type="bibr">[Tassiulas and Ephremides, 1990;</ref><ref type="bibr">Yun et al., 2012]</ref>, while our RL-based approach does not require such knowledge and learns to improve performance in a model-free manner.</p><p>Epidemic Network. 
We consider an SIS (Susceptible-Infected-Susceptible) epidemic network model <ref type="bibr">[Mei et al., 2017;</ref><ref type="bibr">Ahn, 2014;</ref><ref type="bibr">Azizan Ruhi et al., 2016a]</ref>, where there is an undirected graph of nodes G = (N , E), and each node has a binary state space S i = {susceptible, infected}, as well as a finite action space A i with action a i ∈ A i representing epidemic control measures like different levels of vaccination <ref type="bibr">[Preciado et al., 2013]</ref>.</p><p>(Table <ref type="table">1</ref>: State transition for the wireless communication example. In the successful transmission case (with a i (t) = y k and no conflict), the leftmost "1" in s i (t) is flipped to "0" w.p. q k , then s i (t) is left shifted and Bernoulli(p i ) is appended. "Left shift" means, for a binary tuple, discarding the leftmost bit. Bernoulli(p i ) means a random variable sampled i.i.d. from the Bernoulli distribution that is 1 with probability p i and 0 with probability 1 − p i .)</p><p>The evolution of the states follows a local interaction structure: the probability of a node turning from susceptible to infected depends on whether its neighboring nodes are infected or not, as well as its control action in place <ref type="bibr">[Azizan Ruhi et al., 2016b;</ref><ref type="bibr">Preciado et al., 2013]</ref>; the probability of a node turning from infected to susceptible depends on the recovery rate. More precisely, s i (t + 1) only depends on s N i (t) and a i (t), and the state transition is given by</p><p>P(s i (t + 1) = infected | s i (t) = susceptible, s N i (t), a i (t)) = 1 − (1 − β i (a i (t))) |{j∈N i \{i}: s j (t)=infected}| , P(s i (t + 1) = susceptible | s i (t) = infected) = δ i , (3)</p><p>where δ i ∈ (0, 1) is a given recovery rate parameter, β i : A i → (0, 1) is a given transmission rate function that depends on the control action a i (t), and |{j ∈ N i \ {i} : s j (t) = infected}| is the number of neighboring nodes, excluding i itself, that are infected. 
The local reward of each node is given by</p><p>r i (s i , a i ) = 1(s i = susceptible) − c i (a i ), (4)</p><p>which consists of two parts: a positive reward of 1 if the node is free from infection, minus a given cost function c i (a i ) on the epidemic control measure a i . In this setup, the expected global reward is a weighted balance between the overall infection level and the epidemic control cost. The above-defined local state space S i , local action space A i , local transition probabilities in (3) and local rewards in (4) form a networked MDP model as in Section 2.1. We comment that the above SIS model is a basis for more complex models, e.g. those with "exposed" and "recovered" states <ref type="bibr">[Kuznetsov and Piccardi, 1994;</ref><ref type="bibr">Britton, 2010]</ref> or more complex control interventions <ref type="bibr">[Morris et al., 2020]</ref>. These more complex models can also be captured using the framework above. We reiterate that the approach in our paper does not require knowledge of model specifications and learns in a model-free manner.</p><p>Traffic Network. Lastly, we consider a traffic signal control problem adapted from <ref type="bibr">Varaiya [2013]</ref>. In this setting, each node i represents a road link and the interaction graph represents the physical connection of the road links. Given road link i, the local state s i = (x i,j ) i→j is a tuple of variables with j ranging over the neighboring links that traffic on i can turn to, where x i,j is the number of vehicles on link i that intend to turn to link j and can only take values in [S] = {0, 1, . . . , S}. Correspondingly, the local state space is S i = [S] N i \{i} . Similarly, the local action a i = (y i,j ) i→j is the binary traffic signal tuple with y i,j controlling the on-off status of turn movement (i → j), and the local action space is A i = {0, 1} N i \{i} . At each time, a random number of vehicles in the queue x i,j will flow into link j when the traffic signal y i,j is on. 
Meanwhile, link i will receive vehicles from other incoming links, a random fraction of which are then assigned to each of the queues in (x i,j ) i→j . Mathematically,</p><p>x i,j (t + 1) = [ x i,j (t) − C i,j (t) y i,j (t) + R i,j (t) ∑ k C k,i (t) y k,i (t) ] S 0 , (5)</p><p>where the sum is over the links k incoming to i, [x] S 0 means max(min(x, S), 0), C i,j (t) (and similarly C k,i (t)) is an i.i.d. random variable indicating the random number of vehicles leaving i for j, and R i,j (t) is an i.i.d. random variable that controls the split of the inflow to link i among the queues x i,j (t). See <ref type="bibr">Varaiya [2013]</ref> for the complete details. Given a fixed distribution on the random variables C i,j (t), R i,j (t), (5) provides a complete characterization of the distribution of s i (t + 1) conditioned on s N i (t) and a N i (t), in which the local state at each link s i (t + 1) only depends on its neighbors' current states and current actions, which fits into our local interaction structure (equation (1) and Remark 1). The local reward is a characterization of the congestion level at link i, and one version of the reward is the negative queue length,</p><p>r i (s i , a i ) = − ∑ j: i→j x i,j . (6)</p><p>The local state space S i , the local action space A i , the local transition probabilities (5) and the local rewards (6) form a networked MDP as discussed in Section 2.1. We comment that policies for traffic signal control, like the max pressure policy in <ref type="bibr">Varaiya [2013]</ref>, typically require knowledge of the statistics of the random variables C i,j (t) and R i,j (t), while our approach learns from data in a model-free manner.</p></div>
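The SIS local transition described in the epidemic example can be rendered as a short simulation step. This is a hedged sketch under the dynamics the text describes (recovery w.p. δ i ; infection w.p. 1 − (1 − β i (a i )) k for k infected neighbors); all function and variable names are ours:

```python
import random

# Sketch of the SIS local transition from the epidemic example: an infected
# node recovers w.p. delta[i], and a susceptible node with k infected
# neighbors becomes infected w.p. 1 - (1 - beta[i](a_i))**k, matching the
# description in the text. Names and interfaces are illustrative.

SUSCEPTIBLE, INFECTED = 0, 1

def sis_step(states, neighbors, delta, beta, actions, rng=random):
    """One synchronous update of all local states; neighbors[i] includes i."""
    nxt = []
    for i, s_i in enumerate(states):
        if s_i == INFECTED:
            # Recover with probability delta[i].
            nxt.append(SUSCEPTIBLE if delta[i] > rng.random() else INFECTED)
        else:
            k = sum(1 for j in neighbors[i] if j != i and states[j] == INFECTED)
            p_infect = 1.0 - (1.0 - beta[i](actions[i])) ** k
            nxt.append(INFECTED if p_infect > rng.random() else SUSCEPTIBLE)
    return nxt
```

Note that each next local state uses only s N i (t) and a i (t), which is exactly the local interaction structure of the model.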
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Background in RL</head><p>To provide background for the analysis in this paper, we review a few key concepts in RL. First, fixing a localized policy tuple θ = (θ 1 , . . . , θ n ), an important notion is the Q-function, which is defined for policy θ as a "table" of values for each state-action pair (s, a) ∈ S × A and is the expected infinite horizon discounted reward under policy θ conditioned on the initial state and action being (s, a):</p><p>Q θ (s, a) = E [ ∑ ∞ t=0 γ t r(s(t), a(t)) | s(0) = s, a(0) = a ] = (1/n) ∑ n i=1 Q θ i (s, a).</p><p>In the last step, we have defined Q θ i (s, a), which is the Q-function for the individual reward r i . Both Q θ and Q θ i are exponentially large tables and, therefore, are intractable to compute and store. Additionally, another important concept we use is the policy gradient theorem, which provides a characterization of the gradient of the objective J(θ) and is the basis of many algorithmic results in RL. The policy gradient theorem shows that the gradient of J(θ) depends on Q θ and, therefore, is intractable to compute using the form in Lemma 1.</p><p>Lemma 1 <ref type="bibr">(Sutton et al. [2000]</ref>). Let π θ be a distribution on the state space given by π θ (s) = (1 − γ) ∑ ∞ t=0 γ t π θ t (s), where π θ t is the distribution of s(t) under a fixed policy θ when s(0) is drawn from π 0 . Then, we have ∇J(θ) = (1/(1 − γ)) E s∼π θ , a∼ζ θ (·|s) [ Q θ (s, a) ∇ log ζ θ (a|s) ].</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Algorithm Design and Results</head><p>In this paper we propose an algorithm, Scalable Actor Critic (SAC), which provably finds an O(ρ κ+1 )-stationary point of the objective J(θ) (i.e. a θ s.t. ∥∇J(θ)∥ 2 ≤ ε with ε = O(ρ κ+1 )) for some ρ ≤ γ, with complexity scaling in the size of the local state-action space of the largest κ-hop neighborhood. We state our main result formally in Theorem 4 after introducing the details of SAC and the key idea underlying its design.</p></div>
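Since the Q-function above is an expected discounted return, it can be estimated on small examples by Monte Carlo rollouts. A hedged sketch (the rollout interface is our own assumption, not the paper's):

```python
# Monte Carlo view of the Q-function: Q(s, a) is the expectation of
# sum_t gamma^t r(s(t), a(t)) given s(0) = s, a(0) = a. Here `rollout` is
# any callable returning the list of stage rewards of one trajectory
# started from (s, a); this interface is an illustrative assumption.

def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t] by backward accumulation."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def mc_q_estimate(rollout, gamma, num_episodes=100):
    """Average the discounted return over independent rollouts from (s, a)."""
    total = sum(discounted_return(rollout(), gamma) for _ in range(num_episodes))
    return total / num_episodes
```

Of course, this tabular view is exactly what does not scale: one such estimate is needed per (s, a) pair, and the number of pairs grows exponentially in n, which motivates the truncation developed next.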
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Key Idea: Exponential Decay of Q-function Leads to Efficient Approximation</head><p>Recall that the policy gradient in Lemma 1 is intractable to compute due to the dimension of the Q-function. Our key idea is that exponential decay of the Q-function allows efficient approximation of the Q-function via truncation. To illustrate this, we start with the definition of the exponential decay property. Recall that N κ i is the κ-hop neighborhood of node i and define N κ -i = N \ N κ i , i.e. the set of agents that are outside of i's κ-hop neighborhood. We write state s as (s N κ i , s N κ -i ), i.e. the states of agents that are in the κ-hop neighborhood of i and outside of the κ-hop neighborhood, respectively. Similarly, we write a as (a N κ i , a N κ -i ). The exponential decay property is then defined as follows.</p><p>Definition 1. The (c, ρ)-exponential decay property holds if, for any localized policy θ, any i ∈ N , and any s N κ i , a N κ i and s N κ -i , a N κ -i , s′ N κ -i , a′ N κ -i , the Q-function satisfies |Q θ i (s N κ i , s N κ -i , a N κ i , a N κ -i ) − Q θ i (s N κ i , s′ N κ -i , a N κ i , a′ N κ -i )| ≤ cρ κ+1 .</p><p>It may not be immediately clear when the exponential decay property holds. Lemma 2 (a) below highlights that the exponential decay property holds generally with ρ = γ, without any assumption on the transition probabilities except for the factorization structure (1) and the localized policy structure. Further, many MDPs in practice have ergodicity and fast mixing properties, and Lemma 2 (b) shows that when such a fast mixing property holds, the (c, ρ)-exponential decay property holds for some ρ &lt; γ depending on the mixing rate. The proof of Lemma 2 is postponed to Appendix A. The condition on the mixing rate in Lemma 2 (b) is similar to those used in the literature on the finite time analysis of RL methods, e.g. <ref type="bibr">Zou et al. [2019]</ref>. 
In fact, our condition is weaker than the common mixing rate condition in that we only require the distribution of the local state-action pair (s i (t), a i (t)) to mix, instead of the full state-action pair (s(t), a(t)). We leave it as future work to study such "local" mixing behavior and its relation to the local transition probabilities (1).</p><p>Lemma 2. Assume ∀i, r i is upper bounded by r. Then the following holds.</p><p>(a) The (r/(1 − γ), γ)-exponential decay property holds.</p><p>(b) If there exist c &gt; 0 and µ ∈ (0, 1) s.t. under any policy θ, the Markov chain is ergodic and, starting from any initial state, TV(π t,i , π ∞,i ) ≤ cµ t , ∀t, where π t,i is the distribution of (s i (t), a i (t)) and π ∞,i is the distribution of (s i , a i ) in stationarity (recall TV(·, ·) is the total variation distance), then the (2cr/(1 − γµ), γµ)-exponential decay property holds.</p><p>Our definition of exponential decay is similar in spirit to the "correlation decay" or "spatial decay" that has been studied in the literature <ref type="bibr">[Gamarnik, 2013;</ref><ref type="bibr">Gamarnik et al., 2014;</ref><ref type="bibr">Bamieh et al., 2002]</ref>, though these works consider very different settings. For example, <ref type="bibr">Gamarnik [2013]</ref> and <ref type="bibr">Gamarnik et al. [2014]</ref> study optimization in a graphical model setting (no concept of state and/or time), and show that the effect of cost functions far away on the optimal solution at a particular node shrinks exponentially in their graph distance, under certain weak interaction assumptions. Compared to these works, where the optimization problem is static, we focus on an MDP setting which has states that evolve on a time axis. Further, our exponential decay is in terms of the Q-functions, as opposed to the optimal solution. 
That being said, we believe there are deep connections between our results and those in <ref type="bibr">Gamarnik [2013]</ref>; <ref type="bibr">Gamarnik et al. [2014]</ref>, and we leave this investigation as future work.</p><p>The power of the exponential decay property is that such properties usually lead to scalable and distributed algorithm design, as in <ref type="bibr">Gamarnik [2013]</ref>. In our context, the exponential decay property guarantees that the dependence of Q θ i on other agents shrinks quickly as the distance between them grows. This motivates us to consider the following class of truncated Q-functions,</p><p>Q̂ i (s N κ i , a N κ i ) = ∑ s N κ -i , a N κ -i w i (s N κ -i , a N κ -i ; s N κ i , a N κ i ) Q θ i (s N κ i , s N κ -i , a N κ i , a N κ -i ), (9)</p><p>where the w i (s N κ -i , a N κ -i ; s N κ i , a N κ i ) are any non-negative weights satisfying ∑ s N κ -i , a N κ -i w i (s N κ -i , a N κ -i ; s N κ i , a N κ i ) = 1.</p><p>With the definition of the truncated Q-function, our key insight is the following Lemma 3, which says that when the exponential decay property holds, the truncated Q-function (9) approximates the full Q-function with high accuracy and can be used to approximate the policy gradient. The proof of Lemma 3 is postponed to Appendix B.</p><p>Lemma 3. Under the (c, ρ)-exponential decay property, the following holds:</p><p>(a) Any truncated Q-function in the form of (9) satisfies sup (s,a)∈S×A |Q̂ i (s N κ i , a N κ i ) − Q θ i (s, a)| ≤ cρ κ+1 .</p><p>(b) Given i, define the following truncated policy gradient,</p><p>ĥ i (θ) = (1/(1 − γ)) E s∼π θ , a∼ζ θ (·|s) [ ( (1/n) ∑ j∈N κ i Q̂ j (s N κ j , a N κ j ) ) ∇ θ i log ζ θ i i (a i |s i ) ], (11)</p><p>where Q̂ j can be any truncated Q-function in the form of (9). Then, if ∥∇ θ i log ζ θ i i (a i |s i )∥ is bounded, the truncated policy gradient ĥ i (θ) approximates the true partial gradient ∇ θ i J(θ) with error on the order of ρ κ+1 .</p><p>(Algorithm 1 — SAC: Scalable Actor Critic. Input: θ i (0); parameter κ; T , the length of each episode; step size parameters h, t 0 , η. In each outer iteration, each agent i initializes its truncated Q estimate to the all-zero vector, runs a TD inner loop over the episode, then calculates an approximated gradient and conducts a gradient step.)</p><p>The power of this lemma is that the truncated Q-function has a much smaller dimension than the true Q-function, and is thus scalable to compute and store. However, despite the reduction in dimension, the error resulting from the approximation is small. In the next section, we use this idea to design a scalable algorithm.</p></div>
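The uniform-weight special case of the truncation in (9) can be illustrated by brute force on a tiny example: average the full Q i over every configuration of the agents outside the κ-hop neighborhood. A hedged sketch under our own data layout (a Q-table keyed by global state and action tuples), for intuition only:

```python
from itertools import product

# Brute-force sketch of a truncated Q-function with uniform weights:
# hat-Q_i(s_in, a_in) averages the full Q_i over all state-action
# configurations of the agents outside the kappa-hop neighborhood. Under
# exponential decay this stays within c * rho**(kappa + 1) of the full Q_i.
# Data layout (ours): q_full maps ((s_1,...,s_n), (a_1,...,a_n)) to a value.

def truncated_q(q_full, inside, outside, states, actions):
    """inside/outside: sorted agent index lists partitioning {0,...,n-1}."""
    hat_q = {}
    out_space = list(product(*[list(product(states, actions)) for _ in outside]))
    for s_in in product(states, repeat=len(inside)):
        for a_in in product(actions, repeat=len(inside)):
            total = 0.0
            for assign in out_space:
                s, a = dict(zip(inside, s_in)), dict(zip(inside, a_in))
                for j, (s_j, a_j) in zip(outside, assign):
                    s[j], a[j] = s_j, a_j
                key = (tuple(s[k] for k in sorted(s)), tuple(a[k] for k in sorted(a)))
                total += q_full[key]
            hat_q[(s_in, a_in)] = total / len(out_space)
    return hat_q
```

The point of the construction is the size of the output table: it is indexed only by the neighborhood tuple, so its size scales with the local state-action space of N κ i rather than with the full product space (the algorithm of Section 3.2 estimates such a table directly, without ever forming the full Q).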
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Algorithm Design: Scalable Actor Critic (SAC)</head><p>The good properties of the truncated Q-function open many possibilities for algorithm design. For instance, one can first obtain the truncated Q-function in some way (which could be much easier than directly computing the full Q-function) and then do a policy gradient step using Lemma 3. In this subsection, we propose one particular approach using the actor critic framework. Our approach, Scalable Actor Critic (SAC), uses temporal difference (TD) learning to obtain the truncated Q-function and then uses the policy gradient for policy improvement. The pseudocode of the proposed algorithm is given in Algorithm 1.</p><p>Overall structure. The overall structure of SAC is a for-loop from line 1 to line 13. Inside the outer loop, there is an inner loop (line 4 through line 9) that uses temporal difference learning to get the truncated Q-function, which is followed by a policy gradient step that does policy improvement.</p><p>The Critic: TD inner loop. Lines 4 through 9 are the policy evaluation inner loop that obtains the truncated Q-function, where lines 7 and 8 are the temporal difference update. We note that steps 7 and 8 use the same update equation as TD learning, except that the update "pretends" (s N κ i , a N κ i ) is the true state-action pair, while the true state-action pair is (s, a). As will be shown in the theoretical analysis, such a TD update implicitly gives an estimate of a truncated Q-function.</p><p>The Actor: policy gradient. Lines 10 through 12 define the actor actions. Here, each agent calculates an estimate of the truncated gradient based on (11), and then conducts a gradient step.</p><p>Communication. 
To implement our training algorithm, each agent needs to communicate with other agents in its κ-hop neighborhood in line 7 and line 11; after training is done, each agent implements its localized policy, which does not need communication. This communication requirement is weaker than the "centralized training with decentralized execution" paradigm in the multi-agent RL literature <ref type="bibr">[Lowe et al., 2017]</ref>, in which global communication is used during the training phase. We also comment that when κ = 0, our algorithm does not need communication and is effectively the same as the independent learner approach in the literature <ref type="bibr">[Tan, 1993;</ref><ref type="bibr">Lowe et al., 2017]</ref>, as each agent simply runs a single-agent actor critic method based on its local state and local action. When κ &gt; 1, our algorithm requires communication with agents beyond the direct 1-hop neighbors, which may be unrealistic for some applications. An interesting future direction is to reduce the communication requirements, e.g. potentially using consensus schemes like those in <ref type="bibr">Zhang et al. [2018]</ref>, as well as techniques that only communicate quantized bits as opposed to real numbers <ref type="bibr">[Magnússon et al., 2020]</ref>.</p><p>Discussion. Our algorithm serves as an initial concrete demonstration of how to make use of the truncated Q-functions to develop a scalable RL method for networked systems. There are many extensions and other approaches that could be pursued, either within the actor critic framework or beyond. One immediate extension is to do a warm start, i.e., initialize Q̂ 0 i as the final estimate Q̂ T i from the previous outer loop. Additionally, one can use the TD(λ) variant of TD learning, incorporate variance reduction schemes like the advantage function (Advantage Actor Critic), or incorporate function approximation. 
Further, beyond the actor critic framework, another direction is to develop Q-learning/SARSA type algorithms based on the truncated Q-functions. An appealing aspect of Q-learning/SARSA algorithms is that they may exhibit better convergence properties, but the challenge is that, unlike the actor critic framework, it is not straightforward to enforce the policy to be local in Q-learning/SARSA algorithms. These are interesting topics for future work.</p><p>Remark 3 (Model-based vs model-free). We note that by using an actor critic framework, the proposed approach is model-free, meaning it does not explicitly estimate the transition probabilities and the reward function. This is in contrast to model-based RL, which explicitly estimates the transition probabilities (or parameters that determine the transition probabilities, like the &#948; i , &#946; i (&#8226;) parameter in the epidemic example in Section 2.2). On one hand, it is known that model-based RL can be more sample efficient than model-free RL in certain circumstances <ref type="bibr">[Tu and Recht, 2019]</ref>.</p><p>Additionally, for specific applications, model-based control design may come with properties like robustness <ref type="bibr">[Varaiya, 2013]</ref>. On the other hand, model-free RL offers more flexibility since it does not impose assumptions on the model class. The comparison and tradeoff between model-based and model-free approaches is an open research question and is beyond the scope of this paper. We refer the reader to <ref type="bibr">Tu and Recht [2019]</ref>; <ref type="bibr">Qu et al. [2020b]</ref> for more details.</p></div>
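To make the loop structure described in this subsection concrete, here is a minimal, self-contained Python sketch of the SAC skeleton on a toy two-agent chain. The environment, the tabular dictionary representation of the truncated Q-function, and the crude actor update are our own assumptions added for illustration; the paper's actual actor step uses the truncated gradient estimator in (11).

```python
import math
import random

random.seed(0)

GAMMA, H, T0 = 0.7, 1.0, 10.0

def softmax_action(theta_i, s_i):
    # Localized softmax policy over binary actions; the logit for action 0
    # is fixed at 0 and theta_i[s_i] is the logit for action 1.
    p1 = 1.0 / (1.0 + math.exp(-theta_i[s_i]))
    return 1 if random.random() < p1 else 0

def env_step(s, a):
    # Toy 2-agent chain used only to make the sketch runnable (our own
    # assumption, not the paper's model): agent i keeps its neighbor's
    # state when a_i = 1 and resets to 0 otherwise; local reward is s_i.
    s_next = (s[1] if a[0] else 0, s[0] if a[1] else 0)
    return s_next, (float(s[0]), float(s[1]))

def sac_sketch(M_outer=3, T_inner=200, eta=0.05):
    n = 2
    theta = [[0.0, 0.0] for _ in range(n)]       # theta_i indexed by local state
    Q = [dict() for _ in range(n)]
    for m in range(M_outer):                     # outer loop (lines 1-13)
        Q = [dict() for _ in range(n)]           # truncated Q, keyed by (s, a)
        s = (1, 1)
        a = tuple(softmax_action(theta[i], s[i]) for i in range(n))
        for t in range(T_inner):                 # critic inner loop (lines 4-9)
            s2, r = env_step(s, a)
            a2 = tuple(softmax_action(theta[i], s2[i]) for i in range(n))
            alpha = H / (t + T0)                 # step size alpha_t = h/(t + t0)
            for i in range(n):
                z, z2 = (s, a), (s2, a2)         # with n = 2, the kappa-hop
                q, q2 = Q[i].get(z, 0.0), Q[i].get(z2, 0.0)  # neighborhood is global
                # TD update that "pretends" the local pair is the full pair
                Q[i][z] = q + alpha * (r[i] + GAMMA * q2 - q)
            s, a = s2, a2
        for i in range(n):                       # actor step (lines 10-12);
            qbar = sum(Q[i].values()) / max(len(Q[i]), 1)
            theta[i][1] += eta * qbar            # crude stand-in for estimator (11)
    return theta, Q

theta, Q = sac_sketch()
```

Because the rewards lie in [0, 1] and the step size stays below 1, the tabular TD iterates remain in [0, 1/(1-&#947;)] throughout, mirroring the boundedness used in the analysis.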
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Approximation Bound</head><p>In this section, we state and discuss the formal approximation guarantee for SAC. Before stating the theorem, we first state the assumptions we use. The first assumption is standard in the RL literature and bounds the reward and state/action space size.</p><p>Assumption 1 (Bounded reward and state/action space size). The reward is upper bounded as 0 &#8804; r i (s i , a i ) &#8804; r, &#8704;i, s i , a i . The individual state and action space sizes are upper bounded as</p><p>Assumption 2 (Exponential decay). The (c, &#961;)-exponential decay property holds for some &#961; &#8804; &#947;.</p><p>Note that under Assumption 1, Assumption 2 automatically holds with &#961; = &#947;, cf. Lemma 2 (a). However, we state the exponential decay property as an assumption to account for the more general case that &#961; could be strictly less than &#947;, cf. Lemma 2 (b).</p><p>Our third assumption can be interpreted as an ergodicity condition that ensures the state-action pairs are sufficiently visited.</p><p>Assumption 3 (Sufficient local exploration). There exist a positive integer &#964; and &#963; &#8712; (0, 1) such that under any fixed policy &#952; and any initial state-action (s, a)</p><p>Assumption 3 requires that every state-action pair in the &#954;-hop neighborhood be visited with some positive probability after some time. This type of assumption is common for finite time convergence results in RL. For example, in <ref type="bibr">Srikant and Ying [2019]</ref>; <ref type="bibr">Li et al. [2020]</ref>, it is assumed that every state-action pair is visited with positive probability in the stationary distribution and the state-action distribution converges to the stationary distribution with some rate. 
This implies our assumption, which is weaker in the sense that we only require the local state-action pair (s N &#954; i , a N &#954; i ) to be visited, as opposed to the full state-action pair (s, a). Having said that, we note that by making Assumption 3, we do not consider the exploration-exploitation tradeoff, which is a challenging issue even in single-agent RL. One potential way to relax Assumption 3 is to use Upper Confidence Bound (UCB) bonuses to encourage exploration, an approach that has been proposed in single-agent RL <ref type="bibr">[Jin et al., 2018]</ref>. We leave the study of the exploration-exploitation tradeoff in the multi-agent networked setting as future work.</p><p>Finally, we assume boundedness and Lipschitz continuity of the gradients, which is standard in the RL literature.</p><p>Assumption 4 (Bounded and Lipschitz continuous gradient). For any i, a i , s i and &#952; i , we assume</p><p>With these assumptions in hand, we are ready to state our convergence result.</p><p>Theorem 4. Under Assumptions 1, 2, 3 and 4, for any &#948; &#8712; (0, 1), M &#8805; 3, suppose the critic step size</p><p>where</p><p>The proof of Theorem 4 is deferred to Section 4. To interpret the result, note that the first term in (13) converges to 0 in the order of &#213;(1/&#8730;M) and the second term, which we denote as &#949; &#954; , is the bias caused by the truncation of the Q-function and scales in the order of O(&#961; &#954;+1 ). As such, our method SAC will eventually find an O(&#961; &#954;+1 )-approximation of a stationary point of the objective function J(&#952;), which could be very close to a true stationary point even for small &#954;, as &#949; &#954; decays exponentially in &#954;.</p><p>In terms of complexity, (13) gives that, to reach an O(&#949; &#954; )-approximate stationary point, the number of outer-loop iterations required is</p><p>which scales polynomially with the parameters of the problem. 
We emphasize that it does not scale exponentially with n. Further, since the left hand side of ( <ref type="formula">12</ref>) decays to 0 as T increases in the order of &#213;(1/&#8730;T) and the right hand side of ( <ref type="formula">12</ref>) is in the same order as O(&#949; &#954; ), the inner-loop length required is</p><p>This iteration complexity for the inner loop can potentially be further reduced if we do a warm start for the inner loop, as the Q-estimate from the previous outer loop should already be a good estimate for the current outer loop. We leave the finite time analysis of the warm start variant as future work.</p><p>In the complexity bound, a key parameter is &#963;, which we recall is defined in Assumption 3 and roughly means the probability that a state-action pair in a &#954;-hop neighborhood is visited. Our approach can also be interpreted as follows. In multi-agent RL, the typical way to handle the curse of dimensionality is to use function approximation of the Q-function. For example, <ref type="bibr">Zhang et al. [2018]</ref> uses linear functions to approximate the Q-function. Alternatively, it has also been popular to use neural networks to do the approximation <ref type="bibr">[Lowe et al., 2017]</ref>. However, in these approximation methods, the resulting steady state error in the actor critic framework depends on the approximation error, and it is generally unclear how to choose a function approximator that is both computationally tractable and accurate in representing the true Q-function. In the context of these works, our truncated Q-functions can be viewed as a specific way of function approximation that exploits the local interaction structure in our problem, and our method is not only computationally tractable but also has a small approximation error.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Proof of Main Result</head><p>In this section, we provide the proof of the main result, Theorem 4, with some auxiliary derivations postponed to the appendix. As our algorithm is an actor critic algorithm, the proof is divided into two parts: first, we provide an analysis of the critic, i.e. TD learning that estimates the truncated Q-function; second, we analyze the actor and finish the proof of Theorem 4.</p><p>Analysis of the critic. The first part of the analysis concerns the critic, and we show that the critic inner loop converges to an estimate with steady-state error exponentially small in &#954;. Specifically, recall that within iteration m the inner loop update is</p><p>where Q0 i is initialized to be all zero, and &#945; t = h/(t+t 0 ) is the step size. We note that when implementing (14) within outer loop iteration m, trajectory (s(t), a(t)) is generated by the agents taking a fixed policy &#952;(m). Let Q &#952;(m) i &#8712; R S&#215;A be the true Q-function for reward r i under this fixed policy &#952;(m) as defined in (7). Given the above notation, we prove the following theorem on the critic, which bounds the error between the approximation QT i generated by ( <ref type="formula">14</ref>) and the true Q &#952;(m) i . Theorem 5. Assume Assumptions 1, 2 and 3 hold and suppose t 0 , h satisfy h &#8805; (1/&#963;) max(2, 1/(1-&#8730;&#947;)) and t 0 &#8805; max(2h, 4&#963;h, &#964;). Then, inside outer loop iteration m, for each i &#8712; N , with probability at least 1 -&#948;, we have the following error bound,</p><p>where</p><p>The proof of Theorem 5 is postponed to Section 4.1. The result in Theorem 5 is an upper bound on the infinity-norm error between the truncated Q-function QT i obtained by TD learning and the true Q-function. 
This error bound can be further decomposed into two parts: a transient part that converges to zero in the order of &#213;(1/&#8730;T + 1/T), and a steady-state error that is exponentially small in &#954;. In proving Theorem 5, we use two key techniques. The first is using the exponential decay property (cf. Definition 1, Lemma 2) to show that in the "steady state", the error of the truncated Q-function is bounded by O(&#961; &#954;+1 ). This is possible due to our result in Lemma 3, which shows that the class of truncated Q-functions provides good approximations of the full Q-function. Our second technique is a set of novel finite time analysis tools for TD learning, built on a recursive decomposition of the error, which yield the finite time error bound. Compared to existing work on the finite time analysis of TD learning in <ref type="bibr">Srikant and Ying [2019]</ref>, our result does not need the ergodicity assumption (requiring the weaker Assumption 3 instead) and obtains error bounds in the infinity norm as opposed to the (weighted) Euclidean norm in <ref type="bibr">Srikant and Ying [2019]</ref>. This finite time proof technique could be of independent interest.</p><p>Analysis of the actor. We view the actor step as a biased stochastic gradient step, with the bias characterized by our result on the critic (Theorem 5). Under this viewpoint, we finish the analysis of the actor and the proof of the main result in Section 4.2.</p></div>
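To make the step-size conditions of Theorem 5 concrete, the sketch below picks an admissible pair (h, t0) and evaluates the resulting schedule. Taking equality in the conditions is our own choice for illustration; any larger values also satisfy them.

```python
import math

def critic_stepsize_params(sigma, gamma, tau):
    # Choose (h, t0) meeting the conditions stated in Theorem 5:
    #   h  >= (1/sigma) * max(2, 1/(1 - sqrt(gamma))),
    #   t0 >= max(2h, 4*sigma*h, tau).
    # Taking equality (our own choice) gives the smallest admissible pair.
    h = (1.0 / sigma) * max(2.0, 1.0 / (1.0 - math.sqrt(gamma)))
    t0 = max(2.0 * h, 4.0 * sigma * h, float(tau))
    return h, t0

def alpha(t, h, t0):
    # Critic step size alpha_t = h / (t + t0) used by the TD inner loop.
    return h / (t + t0)

h, t0 = critic_stepsize_params(sigma=0.1, gamma=0.81, tau=5)
```

With sigma = 0.1 and gamma = 0.81, the schedule starts at alpha_0 = h/t0 = 0.5 and decays like 1/t, matching the diminishing step sizes assumed in the critic analysis.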
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Analysis of the Critic: Proof of Theorem 5</head><p>Since Theorem 5 is entirely about a particular outer-loop iteration m, inside which the policy is fixed to be &#952;(m), to simplify notation we drop the dependence on m and &#952;(m) throughout this section. In particular, we refer to Q &#952;(m) i as Q * i , which is the true Q-function for reward r i under policy &#952;(m) (cf. ( <ref type="formula">7</ref>)). We also introduce short-hand notation z = (s, a) &#8712; Z = S &#215; A to represent a particular state-action pair (s, a) &#8712; S &#215; A. Similarly, we define z i = (s i , a i ) &#8712; Z i = S i &#215; A i , and</p><p>A vector v &#8712; R Z means a vector of dimension |Z| that is indexed by z &#8712; Z, with its z'th entry denoted by v(z). For example, the full Q-function Q * i will be treated as a vector in R Z with its z'th entry denoted by Q * i (z). Similarly, a vector v &#8712; R Z N &#954; i is indexed by z N &#954; i , and its z N &#954; i 'th entry will be denoted by v(z N &#954; i ). For example, the truncated Q-functions Qt i will be treated as vectors in R Z N &#954; i . Following a similar convention, a matrix A &#8712; R Z&#215;Z will be a |Z|-by-|Z| matrix indexed by (z, z') &#8712; Z &#215; Z, with its (z, z')'th entry denoted by A(z, z').</p><p>Theorem 5 essentially says that the critic iterate Qt i in (14) will become a good estimate of Q * i as t increases. We note that the full Q-function Q * i must satisfy the Bellman equation <ref type="bibr">[Bertsekas and Tsitsiklis, 1996]</ref>,</p><p>where TD : R Z &#8594; R Z is the standard Bellman operator for reward r i and P &#8712; R Z&#215;Z is the transition probability matrix from z(t) to z(t + 1) under policy &#952;(m). Note in (15), without causing any confusion, r i is interpreted as a vector in R Z although r i only depends on z i .</p><p>Our proof is divided into 3 steps. In Step 1, we rewrite (14a) and (14b) in a linear update form (cf. 
( <ref type="formula">17</ref>)), study its behavior (Lemma 6), and decompose the error into a recursive form (Lemma 7). In Step 2, we bound the noise sequences in the error decomposition (Lemma 10 and Lemma 11). Finally, in Step 3, we use the recursive error decomposition and the bounds on the noise sequences to prove Theorem 5.</p><p>Step 1: error decomposition. Define e z N &#954; i to be the indicator vector in R Z N &#954; i , i.e. the z N &#954; i 'th entry of e z N &#954; i is 1 and all other entries are zero. Then, the critic update equation ( <ref type="formula">14</ref>) can be written as,</p><p>with Q0 i being the all zero vector in R Z N &#954; i . Since ( <ref type="formula">16</ref>) is linear in Qt-1 i , we can make the following definition (where we omit the dependence on i in notation A z,z' , b z ),</p><p>and rewrite ( <ref type="formula">16</ref>) in a linear form</p><p>We define the following simplifying notations,</p><p>. Let F t be the &#963;-algebra generated by z(0), . . . , z(t). Then, clearly A t-1 is F t -measurable and b t-1 is F t-1 -measurable. As a result, Qt i is F t -measurable. Let &#964; &gt; 0 be the integer in Assumption 3. Let</p><p>&#964; measurable random vectors (matrices). With these notations, (17) can be rewritten as,</p><p>where in the last step, we have defined sequences &#1013; t-1 , &#966; t-1 &#8712; R Z N &#954; i , which are noise sequences that we will later control in Step 2. For now, we focus on the term &#256;t-1 Qt-1 i + bt-1 , and show the following Lemma 6. The proof of Lemma 6 is similar to the analysis of fixed points for TD learning with linear function approximation, see e.g. <ref type="bibr">Van Roy and Tsitsiklis [1995]</ref>. We postpone the detailed proof of Lemma 6 to Appendix C.1. Lemma 6. &#8704;t, there exists diagonal matrix</p><p>i and operator g t-1 : R</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#256;t-1</head><p>Qt-</p><p>where D t-1 satisfies D t-1 &#963;I (recall &#963; &gt; 0 is from Assumption 3) with its z N &#954; i 'th diagonal entry being dt-1 (z</p><p>From Lemma 6, we can see the first two terms in (18) can be written as (I -</p><p>, which roughly speaking drives the iterate to Q * ,t-1 i , the fixed point of operator g t-1 . Though depending on t, Q * ,t-1 i is a good approximation of the true Q-function Q * i regardless of t, as shown in Lemma 6. As such, moving towards Q * ,t-1 i at each time step should eventually produce a good estimate of the full Q-function Q * i . In the following, we turn the above intuition into an error decomposition. In details, we unroll (18) and get,</p><p>where we have used the following short-hand notations</p><p>Since every diagonal entry of D is lower bounded by &#963; almost surely (Lemma 6), we have every entry of B k,t is upper bounded by &#946; k,t and every entry of Bk,t is upper bounded by &#946;k,t almost surely. With these short-hand notations, we state the error decomposition below, which is a consequence of ( <ref type="formula">20</ref>) and the property of operator g k (&#8226;) and its fixed point Q * ,k i in Lemma 6. The proof of Lemma 7 is postponed to Appendix C.2.</p><p>The following recursion holds almost surely,</p><p>where b k,t (z</p><p>From Lemma 7, it is clear that to bound the error &#958; t , we need to bound t-1 k=&#964; &#945; k Bk,t k &#8734; and t-1 k=&#964; &#945; k Bk,t &#966; k &#8734; , which is the focus of the next step.</p><p>Step 2: bound the k and the &#966; k -sequence. The goal of this step is to bound</p><p>We first show some simple properties of Qt i , t and &#966; t in Lemma 8. Since every entry of &#945; k Bk,t is upper bounded by &#946; k,t , we also show some properties of &#946; k,t , &#946;k,t in Lemma 9. The proofs of the two lemmas are postponed to Appendix C.3.</p><p>Lemma 8. 
We have almost surely, (a)</p><p>Lemma 9. If &#945; t = h/(t+t 0 ), where t 0 &#8805; h &gt; 2/&#963;, t 0 &#8805; 4&#963;h, and t 0 &#8805; &#964;, then &#946; k,t , &#946;k,t satisfy (a)</p><p>We now bound &#8721; t-1 k=&#964; &#945; k Bk,t &#1013; k &#8734; . Clearly, &#1013; t-1 is F t -measurable, and satisfies</p><p>where the last equality is due to the definition of &#256;t-1 and bt-1 and the fact that Qt-&#964; i is F t-&#964; -measurable. Equation ( <ref type="formula">21</ref>) shows that &#1013; t-1 is a "shifted" martingale difference sequence (it is not a standard martingale difference sequence, which would require E[&#1013; t-1 |F t-1 ] = 0). Therefore, &#8721; t-1 k=&#964; &#945; k Bk,t &#1013; k &#8734; can be controlled by Azuma-Hoeffding type inequalities, as shown by Lemma 10. We comment that Bk,t is also random and Bk,t &#1013; k is no longer a martingale difference sequence. As a result, proving Lemma 10 requires more than a direct application of the Azuma-Hoeffding bound. For more details, see the full proof of Lemma 10 in Appendix C.4.</p><p>Lemma 10. We have with probability 1 -&#948;,</p><p>Finally, we bound the &#966; k -sequence, which is small due to the step size selection. The proof of Lemma 11 can also be found in Appendix C.4.</p><p>Lemma 11. We have almost surely,</p><p>Step 3: bound the critic error.</p><p>We are now ready to use the error decomposition in Lemma 7 as well as the bounds on the &#1013; k - and &#966; k -sequences in Lemma 10 and Lemma 11 to bound the error of the critic. Recall that Theorem 5 states with probability 1 -&#948;,</p><p>where C 0 = 2c&#961; &#954;+1 /(1-&#947;), and</p><p>To prove (22), we start by applying Lemma 10 to t &#8804; T with &#948; replaced by &#948;/T . Then, using a union bound, we get with probability 1 -&#948;, for any t &#8804; T ,</p><p>Combining this with Lemma 7 and using Lemma 11, we get with probability 1 -&#948;, for all &#964; &#8804; t &#8804; T ,</p><p>We now condition on ( <ref type="formula">23</ref>) being true and use induction to show (22). Eq. 
( <ref type="formula">22</ref>) is true for t = &#964; , as</p><p>Then, assuming ( <ref type="formula">22</ref>) is true up to k &#8804; t-1, we have by ( <ref type="formula">23</ref>),</p><p>We use the following auxiliary lemma, whose proof is provided in Section C.5.</p><p>Lemma 12. Recall &#945; k = h/(k+t 0 ), and b k,t (z</p><p>and &#945; 0 &#8804; 1/2, then, for any z N &#954; i , and any 0 &lt; &#969; &#8804; 1,</p><p>With Lemma 12, and using the bound on &#946;&#964;-1,t in Lemma 9 (a), we have</p><p>To finish the induction, it suffices to show F t &#8804; Ca/&#8730;(t+t 0 ) and F t &#8804; C a /(t+t 0 ). To see this, note</p><p>one can check that our selection of C a and C a satisfies the above three inequalities, and so the induction is finished and the proof of Theorem 5 is concluded.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Analysis of the Actor and Proof of Main Result (Theorem 4)</head><p>With the error of the critic bounded in Theorem 5, the second part of the proof focuses on the actor, i.e. the policy gradient step. Recall that at iteration m, the policy gradient step is given by &#952;</p><p>where Qm,T i is the final estimate of the Q-function for r i at the end of the critic loop in iteration m, where we have added an additional superscript m to Qm,T i to indicate its dependence on m; {s(t), a(t)} T t=0 is the state-action trajectory with s(0) drawn from &#960; 0 (the initial state distribution defined in the objective function J(&#952;), cf. ( <ref type="formula">2</ref>)) and the agents taking policy &#952;(m). Our goal is to show that &#285;i (m) is approximately the right gradient direction, &#8711; &#952; i J(&#952;(m)), which by Lemma 1 can be written as,</p><p>where &#960; &#952;(m) t is the distribution of s(t) under fixed policy &#952;(m) when the initial state is drawn from &#960; 0 ; Q &#952;(m) is the true Q-function for the global reward r under policy &#952;(m), cf. (7).</p><p>To bound the difference between &#285;i (m) and the true gradient &#8711; &#952; i J(&#952;(m)), we define the following additional sequences,</p><p>where</p><p>is the true Q-function for r i under policy &#952;(m). We also use notation h(m), g(m), &#285;(m) to denote the respective h i (m), g i (m), &#285;i (m) stacked into a larger vector. The following result, whose proof is postponed to Appendix D.1, is an immediate consequence of Assumptions 1 and 4.</p><p>Lemma 13. 
We have almost surely, &#8704;m &#8804; M , max( &#285;(m) , g(m) , h(m) , &#8711;J(&#952;(m))</p><p>With these definitions, we decompose the error between the gradient estimator &#285;(m) and the true gradient &#8711;J(&#952;(m)) into the following three terms,</p><p>In the following, we will provide bounds on e 1 (m), e 2 (m), e 3 (m), and then combine these bounds to prove our main result, Theorem 4.</p><p>Bounds on e 1 (m). Notice that the difference between &#285;i (m) and g i (m) is that the critic estimate Qm,T j is replaced with the true Q-function Q &#952;(m) j . By Theorem 5, Qm,T j is close to Q &#952;(m) j with high probability when T is large enough, based on which we can bound e 1 (m) ; this is formally provided in Lemma 14. The proof of Lemma 14 is postponed to Appendix D.2. Lemma 14. When T is large enough s.t.</p><p>with &#175; = 4 r/(1-&#947;) + 2r, then we have with probability at least</p><p>Bounds on e 2 (m). Let G m be the &#963;-algebra generated by the trajectories in the first m outer-loop iterations. Then, &#952;(m) is G m-1 -measurable, and so is h i (m). Further, by the way that the trajectory is generated, e 2 (m) is a martingale difference sequence w.r.t. G m , and we have the following bound in Lemma 15, which is a direct consequence of the Azuma-Hoeffding bound. The proof of Lemma 15 is postponed to Section D.3. Lemma 15. With probability at least 1 -&#948;/2, we have</p><p>Bounds on e 3 (m). The term e 3 (m) is the error between h(m) and the true gradient &#8711;J(&#952;(m)), where h(m) is similar to the truncated policy gradient in Lemma 3 and therefore should be close to the true gradient by Lemma 3. We provide Lemma 16 below to bound e 3 (m) . Due to technical issues, h(m) is not exactly the same as the truncated policy gradient defined in Lemma 3, but a variant of it instead. Therefore the conclusion of Lemma 3 cannot be directly used in the proof of Lemma 16. 
Nevertheless, the proof of Lemma 16 follows essentially the same arguments as in Lemma 3 and is postponed to Appendix D.4.</p><p>, we have almost surely, e 3 (m) &#8804; 2Lc&#961; &#954;+1 /(1-&#947;).</p><p>Combining the bounds and proof of Theorem 4. With the above bounds on e 1 (m), e 2 (m) and e 3 (m), we are now ready to prove the main result, Theorem 4. Since &#8711;J(&#952;) is L-Lipschitz continuous, we have</p><p>Using the decomposition of &#285;(m) in ( <ref type="formula">28</ref>), we get,</p><p>Plugging the above bounds on &#285;(m) 2 and &#8711;J(&#952;(m)), &#285;(m) into (29), we have,</p><p>where &#949; m,0 = &#8711;J(&#952;(m)), e 2 (m) , &#949; m,1 = &#8711;J(&#952;(m)) ( e 1 (m) + e 3 (m) ), &#949; m,2 = 2L ( e 1 (m) 2 + e 2 (m) 2 + e 3 (m) 2 ). Telescoping (30), we get</p><p>where we have used</p><p>We now apply our bounds on e 1 (m), e 2 (m), e 3 (m). By Lemma 15, we have with probability</p><p>By Lemma 14 and Lemma 16, we have with probability 1</p><p>By Lemma 13, we have almost surely, max( e 1 (m) , e 2 (m) , e 3 (m) ) &#8804; 2 rL/(1-&#947;) 2 , and hence,</p><p>Using a union bound, we have with probability 1 -&#948; that all three events (33), ( <ref type="formula">34</ref>) and ( <ref type="formula">35</ref>) hold, which when combined with (32) implies</p><p>Further, we use the bounds J(&#952;(M )) &#8804; r/(1-&#947;) and J(&#952;(0)) &#8805; 0 almost surely. Combining these results, we get with probability 1 -&#948;,</p><p>This concludes the proof of Theorem 4.</p></div>
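The estimator &#285;i (m) analyzed in this section has the familiar score-function (REINFORCE-style) form, with the truncated Q-estimate in place of the full Q-function. Below is a minimal sketch for a tabular softmax policy; the function names and data layout (a trajectory of (s_i, a_i, z_N) tuples and a dict-valued Q-estimate) are our own assumptions, not the paper's notation.

```python
import math

def softmax_probs(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def truncated_gradient_estimate(traj, Q_i, theta_i, gamma):
    # Sketch of agent i's gradient estimate: a discounted sum over the
    # trajectory of Qhat_i(z_N(t)) * grad_theta_i log pi(a_i(t) | s_i(t)).
    # theta_i[s] holds the action logits for local state s.
    grad = [[0.0 for _ in row] for row in theta_i]   # same shape as theta_i
    for t, (s_i, a_i, z_N) in enumerate(traj):
        probs = softmax_probs(theta_i[s_i])
        q = Q_i.get(z_N, 0.0)
        for a, p in enumerate(probs):
            # grad log pi(a_i | s_i) w.r.t. logit a is 1{a = a_i} - pi(a | s_i)
            indicator = 1.0 if a == a_i else 0.0
            grad[s_i][a] += (gamma ** t) * q * (indicator - p)
    return grad

g = truncated_gradient_estimate(
    traj=[(0, 1, "zN0")], Q_i={"zN0": 2.0},
    theta_i=[[0.0, 0.0]], gamma=0.5)
```

Replacing the full Q-function with the truncated estimate is exactly what introduces the e 1 (m) and e 3 (m) bias terms bounded above; the sampling of a single trajectory introduces the martingale term e 2 (m).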
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Numerical Studies</head><p>In this section, we first conduct numerical studies in Section 5.1 using a synthetic example where the optimal solution is known in closed form to verify our theoretical results. Then, in Section 5.2, we demonstrate our approach using the wireless communication example introduced in Section 2.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Synthetic Experiments</head><p>We first study a synthetic example where the interaction graph is a line of n nodes N = {1, 2, . . . , n} with the leftmost node labeled as 1 and the rightmost as n. Each node has a binary local state space and local action space S i = A i = {0, 1}. The leftmost node (node 1) receives a reward of 1 whenever s 1 = 1; all other reward values of node 1 and the rewards of all other nodes are 0. Further, s 1 (t + 1) = 1 with probability 1 when the second node has state s 2 (t) = 1, and in all other cases s 1 (t + 1) = 0 with probability 1. For the i'th node with 2 &#8804; i &#8804; n -1,</p><p>all other cases.</p><p>For the last node i = n, s n (t + 1) = 1 w.p. 1 when a n (t) = 1, and s n (t + 1) = 0 w.p. 1 when a n (t) = 0. The initial states of all nodes are s i (0) = 1.</p><p>In this example, the rewards and transitions are designed in a way such that the optimal policy can be determined explicitly. Only agent 1 has a non-zero reward when its state is s 1 = 1, and it will stay in state s 1 = 1 only when s 2 = 1, which happens with high probability when a 2 = 1 and s 3 = 1, and so forth. Therefore, it is clear that the optimal policy is for all nodes to always take action a i (t) = 1 regardless of the local state, in which case the states of all agents will stay as s i (t) = 1, and the resulting optimal global discounted reward is 1/n &#183; 1/(1-&#947;).</p><p>In our experiments, we set n = 8, &#947; = 0.7. We use the softmax policy for the localized policies, which is a standard policy parameterization that encompasses all stochastic policies from S i to A i <ref type="bibr">[Sutton et al., 1998</ref>]. We run the SAC algorithm with &#954; = 0 up to &#954; = 7. 
We plot the global discounted reward throughout the training process in Figure <ref type="figure">1</ref>, and we also plot the optimality gap in Figure <ref type="figure">2</ref>, computed as the difference between the optimal discounted reward (1/n &#183; 1/(1-&#947;)) and the discounted reward achieved by the algorithm with the respective &#954; values. Figure <ref type="figure">1</ref> shows that as &#954; increases, the performance of the algorithm improves, though the improvement becomes small when &#954; &gt; 1. This is also confirmed by Figure <ref type="figure">2</ref>, which shows that the optimality gap decays exponentially as &#954; increases. This is consistent with our theoretical result in Theorem 4. We note that the exponential decay appears to stop in Figure <ref type="figure">2</ref> when &#954; &gt; 4; this is due to the increasing complexity of training for large &#954; as discussed in Section 3.3, and may also be due to the fact that any further potential improvement may be too small (on the order of 1 &#215; 10 -3 ) to be noticed because of the noise.</p></div>
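The construction above can be checked in a few lines. Since the middle-node transition probabilities are elided in the text, the sketch below uses a deterministic version of the dynamics (an assumption on our part) under which the stated optimal value 1/n &#183; 1/(1-&#947;) is recovered exactly by the all-ones policy.

```python
def line_env_step(s, a, n):
    # One transition of the synthetic line example. The paper specifies the
    # middle-node dynamics only up to a "high probability"; here we use a
    # deterministic version (our own assumption): node i moves to state 1
    # exactly when a_i = 1 and its right neighbor's state is 1.
    s_next = [0] * n
    s_next[0] = 1 if s[1] == 1 else 0               # node 1 copies node 2's state
    for i in range(1, n - 1):                       # middle nodes 2..n-1
        s_next[i] = 1 if (a[i] == 1 and s[i + 1] == 1) else 0
    s_next[n - 1] = 1 if a[n - 1] == 1 else 0       # node n is controlled directly
    return s_next

def discounted_reward_all_ones(n=8, gamma=0.7, horizon=200):
    # Global discounted reward under the all-ones policy starting from the
    # all-ones initial state; only node 1 earns reward (contributing 1/n to
    # the global average per step) while s_1 = 1.
    s, total = [1] * n, 0.0
    for t in range(horizon):
        total += (gamma ** t) * (1.0 / n) * (1.0 if s[0] == 1 else 0.0)
        s = line_env_step(s, [1] * n, n)
    return total
```

With n = 8 and gamma = 0.7, the all-ones policy keeps every state at 1, and the truncated geometric sum matches (1/8)/(1-0.7) up to a negligible gamma^200 tail.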
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Multi-Access Wireless Communication</head><p>We next study the multi-access wireless communication example discussed in Section 2.2. We consider a grid of users in Figure <ref type="figure">3</ref>, where each user has access points on the corners of the block it is in. In the experiments, we set the grid size as 6 &#215; 6, deadline as d i = 2, and all parameters p i (packet arrival probability for user i) and q k (success transmission probability for access point y k ) are generated uniformly at random from [0, 1]. We set &#947; = 0.7 and the initial state to be uniformly random, and run the SAC algorithm with &#954; = 0, 1, 2 to learn a localized softmax policy, starting from an initial policy where the action is chosen uniformly at random. We compare the proposed method with a benchmark based on the localized ALOHA protocol <ref type="bibr">[Roberts, 1975]</ref>, where each user has a certain probability of sending the earliest packet and otherwise not sending at all. When it sends, it sends the packet to a random access point in its available set, with probability proportional to the success transmission probability of this access point and inversely proportional to the number of users that share this access point. The results are shown in Figure <ref type="figure">4</ref>, which shows that the proposed algorithm can outperform the ALOHA-based benchmark even though the proposed algorithm does not have access to the transmission probabilities q k that the benchmark uses. It also shows that SAC with &#954; = 1, 2 outperforms &#954; = 0, which, as we mentioned in Section 3.2, corresponds to the independent learner approach in the literature <ref type="bibr">[Tan, 1993;</ref><ref type="bibr">Lowe et al., 2017]</ref>. SAC with &#954; = 2 outperforms &#954; = 1, but the improvement is small, which is consistent with the results in the synthetic experiments in Section 5.1. 
We also study a case with the same 6 &#215; 6 grid of access points, but where the 36 users are assigned randomly to the square blocks in the grid. We plot the results in Figure <ref type="figure">5</ref>. Similar phenomena can be observed in Figure <ref type="figure">5</ref> as in Figure <ref type="figure">4</ref>, with SAC outperforming the benchmark and SAC with &#954; = 1, 2 outperforming &#954; = 0, the independent learner approach.</p></div>
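For reference, the localized ALOHA-style benchmark policy described above can be sketched as follows. The sending probability `p_send` and the data layout are our own assumptions; only the weighting rule (proportional to the access point's success probability, inversely proportional to the number of sharing users) comes from the text.

```python
import random

def aloha_action(avail_points, q, shared_counts, p_send=0.5):
    # ALOHA-style benchmark: send the earliest packet with probability
    # p_send, otherwise stay silent; when sending, choose access point k
    # with probability proportional to q[k] / shared_counts[k].
    if random.random() >= p_send:
        return None                                  # do not transmit this slot
    weights = [q[k] / shared_counts[k] for k in avail_points]
    total = sum(weights)
    r, acc = random.random() * total, 0.0
    for k, w in zip(avail_points, weights):
        acc += w
        if r <= acc:
            return k
    return avail_points[-1]                          # guard against rounding
```

Unlike SAC, this policy needs the success probabilities q as inputs, which is the informational advantage of the benchmark noted above.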
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Concluding Remarks and Extensions</head><p>This paper proposes a SAC algorithm that provably finds a close-to-stationary point of J(&#952;) in time that scales with the local state-action space size of the largest &#954;-hop neighborhood, which can be much smaller than the full state-action space size when the graph is sparse. This represents the first scalable RL method for localized control of multi-agent networked systems with such a provable guarantee. There are many possible extensions and open questions, which we discuss below.</p><p>Average Reward. One extension is to consider average reward instead of discounted reward, i.e., to consider</p><p>Under appropriate ergodicity assumptions, the average reward above is equivalent to the reward under the stationary distribution. Average reward is common in applications where the performance is measured in stationarity, e.g. throughput or waiting time in communication and queueing networks. Despite the importance in applications, average reward RL is known to be more challenging even in single-agent settings, see e.g. <ref type="bibr">Tsitsiklis and</ref><ref type="bibr">Van Roy [1999, 2002]</ref>. For example, the Q function needs to be defined in a different way,</p><p>and because of the lack of a discounting factor &#947;, the associated Bellman operator no longer has &#947; as the natural contraction factor. In ongoing work summarized in <ref type="bibr">Qu et al. [2020a]</ref>, we have begun to study an average reward multi-agent RL setting with the same local interaction structure (1) as in this paper. While similar exponential decay properties on the Q-functions can be defined, in the average reward setting, <ref type="bibr">Qu et al. [2020a]</ref> shows that the exponential decay only holds under a form of bounded interaction strength assumption. Under this assumption, <ref type="bibr">Qu et al. 
[2020a]</ref> proposes a variant of the SAC algorithm that achieves similar guarantees to the algorithm in this paper. However, the bounded interaction strength assumption in <ref type="bibr">Qu et al. [2020a]</ref> may be restrictive, and searching for weaker assumptions that guarantee the exponential decay (or weaker forms of decay) is an interesting open future direction.</p><p>Time-varying dependence structure. In this paper, the interaction graph that defines the dependence structure (1) is a fixed graph, while in many real-world applications the graph is time-varying. In other words, if the interaction graph at time t is G t , then (1) becomes,</p><p>where N i (G t ) denotes the set of neighbors of i in graph G t . In another of our extensions of this work <ref type="bibr">[Lin et al., 2021]</ref>, we study a stochastic graph setting where, at each time step, the graph G t is sampled from a fixed graph distribution in which each link is present with probability exponentially decreasing in a predefined distance measure between the two nodes. <ref type="bibr">Lin et al. [2021]</ref> shows that a weaker form of exponential decay holds in this setting, termed &#181;-decay, and a similar SAC algorithm can be used. Beyond <ref type="bibr">Lin et al. [2021]</ref>, an open question is how to handle the case where G t can be arbitrarily chosen, rather than stochastically sampled. An issue in this setting is that the transition kernel may no longer be time-homogeneous, which makes this a challenging open problem.</p><p>Other future directions. While SAC is an actor critic based algorithm, the idea underlying SAC, including the exponential decay property and the truncated Q-function (9), is a contribution in its own right, and can potentially lead to other scalable RL methods for networked systems. 
For example, within the actor critic framework, one idea is to change the critic to variants like TD-&#955;, to change the actor to include the advantage function, or to use the simultaneous-update version of the actor critic algorithm (as opposed to the inner-loop structure used in this paper). Beyond the actor critic framework, one can develop policy-iteration type algorithms (e.g. Q-learning/SARSA type methods) based on our truncated Q-functions. Further, our proof technique for showing the finite-time convergence of TD learning may be of independent interest. In fact, we have already considered preliminary applications of our technique in the context of Q-learning and TD learning with state aggregation <ref type="bibr">[Lin et al., 2021]</ref>, and we expect other applications to emerge. Additionally, the setting we consider here does not investigate the tradeoff between exploration and exploitation; adding consideration of this tradeoff is an interesting direction for future work. Finally, other future directions include studying the landscape of our policy optimization problem and studying the robustness of the trained policies.</p></div>
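<p>To make the neighborhood-size scaling in the complexity guarantee concrete, the following toy sketch (Python used purely for illustration; the line graph, binary state/action spaces, and sizes are hypothetical, not from the paper) computes f(&#954;), the largest &#954;-hop neighborhood size, and compares the per-agent truncated critic table size (|S||A|)^{f(&#954;)} with the full joint table size (|S||A|)^n.</p>
<p><![CDATA[
```python
# Toy illustration (hypothetical setup) of why truncation helps: each agent
# keeps a critic table indexed only by its kappa-hop neighborhood state-actions.

def k_hop_neighborhood(adj, i, kappa):
    """BFS: all nodes within kappa hops of node i (including i itself)."""
    frontier, seen = {i}, {i}
    for _ in range(kappa):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return seen

n, S, A, kappa = 10, 2, 2, 1  # 10 agents on a line, binary states and actions
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

f_kappa = max(len(k_hop_neighborhood(adj, i, kappa)) for i in range(n))
local_table = (S * A) ** f_kappa   # truncated critic size per agent
global_table = (S * A) ** n        # full joint state-action space size

print(f_kappa, local_table, global_table)  # 3 64 1048576
```
]]></p>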
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A The Exponential Decay Property</head><p>Our main results depend on the (c, &#961;)-exponential decay of the Q-function (cf. Definition 1), which means that for any i, any s</p><p>In Section 3.1, we have pointed out in Lemma 2 that the (c, &#961;)-exponential decay property always holds with &#961; &#8804; &#947;, assuming the rewards r_i are upper bounded. We now provide the proof of Lemma 2.</p><p>Proof of Lemma 2. We first prove part (a). For notational simplicity, denote s = (s_{N^&#954;_i}, s_{N^&#954;_{-i}}), s' = (s_{N^&#954;_i}, s'_{N^&#954;_{-i}}) and a = (a_{N^&#954;_i}, a_{N^&#954;_{-i}}), a' = (a_{N^&#954;_i}, a'_{N^&#954;_{-i}}). Let &#960;_{t,i} be the distribution of (s_i(t), a_i(t)) conditioned on (s(0), a(0)) = (s, a) under policy &#952;, and let &#960;'_{t,i} be the distribution of (s_i(t), a_i(t)) conditioned on (s(0), a(0)) = (s', a') under policy &#952;. Then, we must have &#960;_{t,i} = &#960;'_{t,i} for all t &#8804; &#954;. The reason is that, due to the local dependence structure (1) and the localized policy structure, &#960;_{t,i} only depends on (s_{N^t_i}, a_{N^t_i}) (the initial state-action of agent i's t-hop neighborhood), which is the same as (s'_{N^t_i}, a'_{N^t_i}) when t &#8804; &#954;, per the way the initial state-action pairs (s, a), (s', a') are chosen. With these definitions, we expand the definition of</p><p>where TV(&#960;_{t,i}, &#960;'_{t,i}) is the total variation distance between &#960;_{t,i} and &#960;'_{t,i}, which is upper bounded by 1. The above inequality shows that the (r/(1-&#947;), &#947;)-exponential decay property holds and concludes the proof of Lemma 2 (a).</p><p>The proof of part (b) is almost identical to that of part (a). The only change is that in step (37), we use TV(&#960;_{t,i}, &#960;'_{t,i}) &#8804; 2c&#181;^t.</p></div>
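<p>For readability, the chain of inequalities in the proof above can be written out as follows (a reconstruction consistent with the surrounding text, assuming each reward r_i takes values in [0, r], with r the uniform upper bound on rewards):</p>
<p><![CDATA[
```latex
\left| Q^{\theta}_i(s,a) - Q^{\theta}_i(s',a') \right|
  = \Big| \sum_{t=0}^{\infty} \gamma^{t}
      \big( \mathbb{E}_{\pi_{t,i}} r_i - \mathbb{E}_{\pi'_{t,i}} r_i \big) \Big|
  \le \sum_{t=\kappa+1}^{\infty} \gamma^{t} \, r \,
      \mathrm{TV}\!\left( \pi_{t,i}, \pi'_{t,i} \right)
  \le \frac{r}{1-\gamma}\, \gamma^{\kappa+1},
```
]]></p>
<p>where the terms with t &#8804; &#954; vanish because &#960;_{t,i} = &#960;'_{t,i}, and the last step uses TV &#8804; 1.</p>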
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Lemma 3</head><p>We first show part (a): the truncated Q-function is a good approximation of the true Q-function.</p><p>To see this, we have for any (s, a) &#8712; S &#215; A, by (<ref type="formula">9</ref>) and (10),</p><p>where in the last step, we have used the (c, &#961;)-exponential decay property, cf. Definition 1. Next, we show part (b). Recall by the policy gradient theorem (Lemma 1),</p><p>where we have used</p><p>by the localized policy structure. With the above equation, we can compute &#293;_i(&#952;) - &#8711;_{&#952;_i} J(&#952;),</p><p>We claim that E_2 = 0. To see this, consider for any j &#8712; N^&#954;_{-i},</p><p>where the sum is over s, a_1, ..., a_{i-1}, a_{i+1}, ..., a_n, and in the last equality, we have used that Q^&#952;_j(s_{N^&#954;_j}, a_{N^&#954;_j}) does not depend on a_i since i &#8713; N^&#954;_j; and</p><p>where in the last step, we have used (38) and the upper bound &#8214;&#8711;_{&#952;_i} log &#950;^{&#952;_i}_i(a_i|s_i)&#8214; &#8804; L_i. This concludes the proof of Lemma 3.</p></div>
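<p>The step showing E_2 = 0 ultimately rests on the score-function identity: summing &#8711;_{&#952;_i} &#950;^{&#952;_i}_i(a_i|s_i) over a_i gives &#8711;_{&#952;_i} 1 = 0. A quick numerical sanity check for a softmax policy (a hypothetical parameterization chosen only for illustration, not the paper's policy class):</p>
<p><![CDATA[
```python
import math

# Score-function identity: sum_a pi(a|s) * d/dtheta_k log pi(a|s) = 0.
# Softmax policy over 3 actions with hypothetical logits.
theta = [0.3, -1.2, 0.7]
expz = [math.exp(t) for t in theta]
pi = [v / sum(expz) for v in expz]

# For softmax, d/dtheta_k log pi(a) = 1{a == k} - pi[k].
for k in range(3):
    s = sum(pi[a] * ((1.0 if a == k else 0.0) - pi[k]) for a in range(3))
    assert abs(s) < 1e-12
print("score-function identity verified")
```
]]></p>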
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Proof of Auxiliary Results for the Analysis of the Critic</head><head>C.1 Proof of Lemma 6</head><p>The goal of Lemma 6 is to understand the expected update &#256;_{t-1} Q^{t-1}_i + b&#772;_{t-1}.</p><p>Recall that P is the transition matrix from z(t-1) to z(t). Given any distribution d on the state-action space Z, we define &#195;_z = E_{z'&#8764;P(&#183;|z)} A_{z,z'} and &#256;_d = E_{z&#8764;d} &#195;_z, b&#772;_d = E_{z&#8764;d} b_z. Then, &#256;_{t-1} and b&#772;_{t-1} can be rewritten as &#256;_{t-1} = &#256;_{d_{t-1}}, b&#772;_{t-1} = b&#772;_{d_{t-1}}. In what follows, we provide characterizations of &#256;_d, b&#772;_d for general distributions d.</p><p>Firstly, &#195;_z is given by,</p><p>where P(&#183;|z) is understood as the z'th row of P and is treated as a row vector. Also, we have defined &#934; &#8712; R^{Z&#215;Z_{N^&#954;_i}} to be a matrix with each row indexed by z &#8712; Z and each column indexed by z_{N^&#954;_i} &#8712; Z_{N^&#954;_i}. Further, the z'th row of &#934; is the indicator vector e_{z_{N^&#954;_i}}; in other words, &#934;(z, z'_{N^&#954;_i}) = 1 if z'_{N^&#954;_i} = z_{N^&#954;_i} and 0 otherwise.</p><p>Then, &#256;_d, b&#772;_d are given by,</p><p>where diag(d) &#8712; R^{Z&#215;Z} is a diagonal matrix with the z'th diagonal entry being d(z); in the last equation, r_i is understood as a vector over the entire state-action space Z, though it only depends on z_i. With the above characterizations, we show the following property of &#256;_d and b&#772;_d in Lemma 17. Lemma 17 (with d set as d_{t-1}) will directly lead to the results in Lemma 6, with D_{t-1} being the D in Lemma 17, which satisfies D_{t-1} &#8829; &#963;I due to Assumption 3; and with g_{t-1} and Q^{*,t-1}_i being g_{d_{t-1}} and Q^{d_{t-1}}_i in Lemma 17.</p><p>Lemma 17.
Given a distribution d on state-action pairs z whose marginalization onto z_{N^&#954;_i} is non-zero for every z_{N^&#954;_i}, we have that &#256;_d Q_i + b&#772;_d can be written as</p><p>where</p><p>D is a diagonal matrix, with the z_{N^&#954;_i}'th entry being the marginalized distribution of z_{N^&#954;_i} under distribution d; g_d(&#183;) is given by g_d(Q_i) = &#928;_d T_D &#934; Q_i, where</p><p>&#934;^T diag(d)&#934; is a diagonal matrix, and the z_{N^&#954;_i}'th diagonal entry is the marginal probability of z_{N^&#954;_i} under d, which is non-zero by the assumption of the lemma. Therefore, &#934;^T diag(d)&#934; &#8712; R^{Z_{N^&#954;_i}&#215;Z_{N^&#954;_i}} is invertible, and the matrix &#928;_d = (&#934;^T diag(d)&#934;)^{-1} &#934;^T diag(d) is well defined. Further, the z_{N^&#954;_i}'th row of &#928;_d is in fact the conditional distribution of the full state z given z_{N^&#954;_i}. So, &#928;_d must be a stochastic matrix and is non-expansive in the infinity norm.</p><p>By the definition of &#256;_d and b&#772;_d, we have,</p><p>where T_D is the Bellman operator for reward r_i defined in (15), and the operator g_d is given by</p><p>Notice that &#934; is non-expansive in the &#8214;&#183;&#8214;_&#8734; norm, since each row of &#934; has precisely one entry equal to 1 and all others equal to zero. Also, since &#928;_d is non-expansive in the &#8214;&#183;&#8214;_&#8734; norm and T_D is a &#947;-contraction in the &#8214;&#183;&#8214;_&#8734; norm, g_d = &#928;_d T_D &#934; is a &#947;-contraction in the &#8214;&#183;&#8214;_&#8734; norm. As a result, g_d has a unique fixed point Q^d_i. Finally, we show (42), which bounds the distance between &#934; Q^d_i and Q^*_i, where Q^*_i is the true Q-function for reward r_i and is the unique fixed point of the T_D operator (15). We have,</p><p>where the equality follows from the fact that Q^d_i is the fixed point of &#928;_d T_D &#934; and Q^*_i is the fixed point of T_D; the last inequality is due to &#934;&#928;_d T_D being a &#947;-contraction in the infinity norm.
Therefore,</p><p>Next, recall that the z_{N^&#954;_i}'th row of &#928;_d is the distribution of the state-action pair z conditioned on its N^&#954;_i coordinates being fixed to z_{N^&#954;_i}. We denote this conditional distribution of the states outside of</p><p>And therefore,</p><p>Further, we have</p><p>where the last inequality is due to the exponential decay property (cf. Definition 1 and Assumption 2). Therefore,</p><p>Combining the above with (43), we get the desired result</p></div>
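<p>The structural facts established above &#8212; that &#934;^T diag(d)&#934; is diagonal with the marginals of z_{N^&#954;_i} on the diagonal, and that each row of &#928;_d = (&#934;^T diag(d)&#934;)^{-1} &#934;^T diag(d) is a conditional distribution, so &#928;_d is row-stochastic &#8212; can be spot-checked numerically on a toy example (the 3-coordinate state space and the weights below are hypothetical, chosen only for illustration):</p>
<p><![CDATA[
```python
from itertools import product

# Toy check that Pi_d is row-stochastic: its z_N-th row is the conditional
# distribution of the full z given its neighborhood coordinates z_N.
# Here z = (z1, z2, z3), binary coordinates; the neighborhood is (z1, z2).
Z = list(product([0, 1], repeat=3))
N = lambda z: z[:2]  # projection onto the observed neighborhood coordinates

w = {z: 1.0 + z[0] + 2 * z[1] + 3 * z[2] for z in Z}  # arbitrary positive weights
d = {z: w[z] / sum(w.values()) for z in Z}            # full-support distribution

# Phi^T diag(d) Phi is diagonal (indicator columns of disjoint groups), with
# entries equal to the marginals of z_N, so Pi_d has a simple closed form.
marg = {}
for z in Z:
    marg[N(z)] = marg.get(N(z), 0.0) + d[z]
Pi = {(g, z): (d[z] / marg[g] if N(z) == g else 0.0) for g in marg for z in Z}

for g in marg:  # every row sums to one: a stochastic matrix
    assert abs(sum(Pi[(g, z)] for z in Z) - 1.0) < 1e-12
print("Pi_d is row-stochastic")
```
]]></p>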
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.2 Proof of Lemma 7</head><p>Recall that the z_{N^&#954;_i}'th diagonal entry of B_{k,t} is b_{k,t}(z_{N^&#954;_i}), and we define similarly the z_{N^&#954;_i}'th diagonal entry of B&#772;_{k,t} to be b&#772;_{k,t}(z_{N^&#954;_i}). Using these notations, equation (<ref type="formula">20</ref>) can be written as,</p><p>Notice that by definition, b&#772;_{&#964;-1,t}(z</p><p>where in the third inequality, we have used that g_k is a &#947;-contraction in the infinity norm with fixed point Q^{*,k}_i, and in the last inequality, we have used the property of Q^{*,k}_i in Lemma 6. Combining the above with (44), we have</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 Proof of Lemma 8 and Lemma 9</head><p>In this section, we provide proofs of the two auxiliary lemmas, Lemma 8 and Lemma 9. We start with the proof of Lemma 8.</p><p>Proof of Lemma 8. First, notice that A</p><p>Part (a) can be proved by induction. Part (a) is true for t = 0 as Q^0</p><p>or in other words,</p><p>For the other entries of Q^t_i, they stay the same as those of Q^{t-1}_i,</p><p>which finishes the induction and the proof of part (a).</p><p>For part (b), notice that &#949;_t = (A_t - &#256;_t) Q^{t+1-&#964;}_i + b_t - b&#772;_t. Therefore, it is easy to check by part (a) that &#8214;&#949;_t&#8214;_&#8734; &#8804; 4r/(1-&#947;) + 2r = &#949;&#772;.</p><p>For part (c), notice that, for any k</p><p>Therefore, by the triangle inequality,</p><p>As a consequence,</p><p>Proof of Lemma 9. Notice that log(1-x) &#8804; -x for all x &lt; 1. Then,</p><p>which leads to the bound on &#946;_{k,t} and &#946;&#772;_{k,t}.</p><p>For part (b),</p><p>where we have used (k+1+t_0)^{2&#963;h} &#8804; 2(k+t_0)^{2&#963;h}, which is true when t_0 &#8805; 4&#963;h. Then,</p><p>where in the last inequality we have used 2&#963;h-1 &gt; &#963;h.</p><p>For part (c), notice that for k</p><p>where we have used (k+1+t_0)^{&#963;h} &#8804; 2(k+t_0)^{&#963;h}, and &#963;h-1 &gt; (1/2)&#963;h.</p></div>
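<p>The two elementary scalar inequalities used in the proof of Lemma 9 can be spot-checked numerically (the values of &#963;, h, and the test ranges below are arbitrary, chosen only for illustration):</p>
<p><![CDATA[
```python
import math

# Spot-check of the scalar facts used in the proof of Lemma 9.
sigma, h = 0.3, 10.0
t0 = 4 * sigma * h  # the proof assumes t0 >= 4*sigma*h

# Fact 1: log(1 - x) <= -x for all x < 1.
for x in [-0.5, 0.0, 0.3, 0.9]:
    assert math.log1p(-x) <= -x + 1e-15

# Fact 2: (k + 1 + t0)^(2*sigma*h) <= 2*(k + t0)^(2*sigma*h) when t0 >= 4*sigma*h.
for k in range(1, 200):
    assert (k + 1 + t0) ** (2 * sigma * h) <= 2 * (k + t0) ** (2 * sigma * h)

print("inequalities verified")
```
]]></p>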
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.4 Proof of Lemma 10 and Lemma 11</head><p>Since</p><p>and Lemma 9 (c) directly leads to the bound in Lemma 11. So, in this section, we focus on the proof of Lemma 10. We start by stating a variant of the Azuma-Hoeffding bound that handles our "shifted" martingale difference sequence.</p><p>Lemma 18. Let X_t be an F_t-adapted stochastic process satisfying E[X_t | F_{t-&#964;}] = 0. Further, |X_t| &#8804; X&#772;_t almost surely. Then with probability 1-&#948;, we have,</p><p>Proof of Lemma 18. Let &#8467; be an integer between 0 and &#964;-1. For each &#8467;, define the process Y_k = X_{&#964;k+&#8467;}, the scalar Y&#772;_k = X&#772;_{k&#964;+&#8467;}, and the filtration F&#772;_k = F_{&#964;k+&#8467;}. Then, Y_k is F&#772;_k-adapted, and satisfies</p><p>Therefore, applying the Azuma-Hoeffding bound to Y_k, we have</p><p>i.e., with probability at least 1-&#948;/&#964;, summing over {k : k&#964;+&#8467; &#8804; t},</p><p>Using the union bound for &#8467; = 0, ..., &#964;-1, we get that with probability at least 1-&#948;,</p><p>where the last inequality is due to Cauchy-Schwarz.</p><p>Recall that Lemma 10 is an upper bound on</p><p>with d (z</p><p>Proof of Lemma 19. Let p_k be a scalar sequence defined as follows. Set p_&#964; = 0, and</p><p>and to prove Lemma 19 we need to bound</p><p>We claim that for all k &#8805; k_0+1, p_k and p&#772;_k have the same sign, and |p_k| &#8804; |p&#772;_k|. This is obviously true for k = k_0+1. Suppose it is true for k-1. Without loss of generality, suppose both p_{k-1} and p&#772;_{k-1} are non-negative. Since k-1 &gt; k_0 and by the definition of k_0, we must have</p><p>These imply p&#772;_k &#8805; p_k &gt; 0. The case where both p_{k-1} and p&#772;_{k-1} are negative is similar. This finishes the induction, and as a result,</p><p>By the definition of k_0, we have</p><p>where in the last step, we have used the upper bound on &#8214;&#949;_{k_0}&#8214;_&#8734; in Lemma 8 (b). As a result,</p><p>With the above preparations, we are now ready to prove Lemma 10.</p><p>Proof of Lemma 10.
Fix z_{N^&#954;_i} and &#964; &#8804; k_0 &#8804; t-1. As shown in (21), &#949;_k(z_{N^&#954;_i}) satisfies the conditions of Lemma 18 (a shifted martingale difference sequence).</p><p>As a result, we can use the Azuma-Hoeffding bound in Lemma 18 to get that, with probability 1-&#948;,</p><p>By a union bound over &#964; &#8804; k_0 &#8804; t-1, we get that, with probability 1-&#948;, </p></div>
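<p>Unpacking the argument of Lemma 18: splitting the &#964;-dependent sequence into &#964; interleaved martingale difference sequences, applying Azuma-Hoeffding to each residue class, and recombining via a union bound and Cauchy-Schwarz gives a bound of the following shape (a reconstruction consistent with the proof sketch above; the exact constants in the paper's display may differ):</p>
<p><![CDATA[
```latex
\Big| \sum_{t=\tau}^{T} X_t \Big|
  \le \sum_{\ell=0}^{\tau-1} \Big| \sum_{k:\, k\tau+\ell \le T} X_{k\tau+\ell} \Big|
  \le \sum_{\ell=0}^{\tau-1}
      \sqrt{ 2 \log\tfrac{2\tau}{\delta} \sum_{k:\, k\tau+\ell \le T} \bar X_{k\tau+\ell}^2 }
  \le \sqrt{ 2\tau \log\tfrac{2\tau}{\delta} \sum_{t=\tau}^{T} \bar X_t^2 },
```
]]></p>
<p>with probability at least 1-&#948;, where the middle step is Azuma-Hoeffding on each residue class (each holding with probability 1-&#948;/&#964;) and the final step is Cauchy-Schwarz.</p>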
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.5 Proof of Lemma 12</head><p>Throughout the proof, we fix z_{N^&#954;_i} and prove the desired upper bound. For notational simplicity, we drop the dependence on z_{N^&#954;_i} and write b_{k,t} and d&#772;_k instead, and we will use the property d&#772;_k &#8805; &#963;. Define the sequence</p><p>We use induction to show that e_t &#8804; 1/(&#8730;&#947;(t+t_0)^&#969;). The statement is clearly true for t = &#964;+1, as e_{&#964;+1} = b_{&#964;,&#964;+1}/(&#964;+t_0)^&#969; = &#945;_&#964; d&#772;_&#964;/(&#964;+t_0)^&#969; &#8804; 1/(&#8730;&#947;(&#964;+1+t_0)^&#969;) (the last step needs &#945;_&#964; &#8804; 1/2 and (1+1/t_0)^&#969; &#8804; 2/&#8730;&#947;, implied by t_0 &#8805; 1, &#969; &#8804; 1). Let the statement be true for t-1. Then, notice that,</p><p>where the inequality is based on the induction assumption. Then, plugging in &#945;_{t-1} = h/(t-1+t_0) and using d&#772;_{t-1} &#8805; &#963;, we have,</p><p>Now using the inequality (1+x) &#8804; e^x for any x &gt; -1, we have, 1 - &#963;h/(t-1+t_0)</p><p>(1-&#8730;&#947;)(1 + 1/(t-1+t_0))</p><p>where we have used that Q^{m,T}</p><p>The upper bounds for g(m), h(m) and &#8711;J(&#952;(m)) can be obtained in an almost identical way, and their proofs are therefore omitted.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Proof of Lemma 14</head><p>Let G_m be the &#963;-algebra generated by the trajectories in the first m outer-loop iterations. Then, Theorem 5 implies that, fixing each m &#8804; M and i &#8712; N, conditioned on G_{m-1}, the following event happens with probability at least 1-&#948;:</p><p>where</p><p>with &#949;&#772; = 4r/(1-&#947;) + 2r. We can take the expectation and average out G_{m-1}, and apply the union bound over 0 &#8804; m &#8804; M-1 and i &#8712; N, getting that with probability at least 1-</p><p>As a result, sup_{0 &#8804; m &#8804; M-1} &#8214;&#285;(m) - g(m)&#8214; &#8804; 4cL&#961;^{&#954;+1}/(1-&#947;)^3, which holds conditioned on the event (47), which happens with probability at least 1-&#948;/2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3 Proof of Lemma 15</head><p>The rest of the proof is essentially the same as that of Lemma 3. For completeness, we provide a proof below. Combining the above two equations, we have,</p><p>Clearly, the second term satisfies E_2 &#8804; L_i r &#947;^{T+1}/(1-&#947;)^2. For E_1, we have, as defined in (9). We claim E_4 is zero. To see this, consider for any j &#8712; N^&#954;_{-i} and any t,</p><p>where in the last equality, we have used that Q^{&#952;(m)}_j(s_{N^&#954;_j}, a_{N^&#954;_j}) does not depend on a_i since i &#8713; N^&#954;_j; and</p><p>&#8721;_{a_i} &#8711;_{&#952;_i} &#950;^{&#952;_i}_i(a_i|s_i) = &#8711;_{&#952;_i} 1 = 0. For E_3, by the exponential decay property, the truncated Q-function has a small error, cf. (<ref type="formula">38</ref>), and as a result,</p><p>Therefore,</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>(|S||A|)^{f(&#954;)}, where we recall f(&#954;) is the size of the largest &#954;-hop neighborhood around any node, and (|S||A|)^{f(&#954;)} is the largest state-action space size of &#954;-hop neighborhoods of any node. Then, the iteration complexity scales with 1/&#963; &#8764; (|S||A|)^{f(&#954;)}, whereas the steady-state error depends on &#961;^{&#954;+1}. Therefore, &#954; is a parameter that balances complexity and performance: the larger &#954; is, the higher the complexity but the smaller the steady-state error. Exactly how the complexity grows depends on f(&#954;), the size of &#954;-hop neighborhoods, which in turn depends on the topology of the interaction graph. On one hand, for a sparse graph where f(&#954;) is a constant much smaller than the number of nodes n, the state-action space size of &#954;-hop neighborhoods is much smaller than the global state-action space size, in which case our algorithm can avoid the exponential scaling in n and is scalable to implement.
In the case where the graph is very dense or even complete, we have f(&#954;) = &#8486;(n) for any &#954; &gt; 0, and our algorithm still suffers from the curse of dimensionality, as the complexity scales with (|S||A|)^{&#8486;(n)}. However, in the case of dense or complete graphs, the local interaction structure becomes degenerate, as it takes exponentially many parameters to even specify the local transition probabilities in (1). How to handle the dense or complete graph case remains an interesting future direction, and we believe more structural assumptions are needed to break the curse of dimensionality in that case. Another thing to note in Theorem 4 is that our guarantee is an upper bound on the running average of the squared norm of the gradient and is essentially a local convergence guarantee. This kind of local convergence is typical for actor critic methods; see e.g. <ref type="bibr">Konda and Tsitsiklis [2000]</ref>; <ref type="bibr">Zhang et al. [2018]</ref>. Recently, there have been works studying the optimization landscape for policy optimization, showing that in single-agent settings and for certain parameterizations of the policy, the objective of the policy optimization problem J(&#952;) may satisfy the gradient dominance property, indicating that any stationary point will be a global optimum [Bhandari and Russo, </p></note>
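<p>As a companion to the footnote's discussion, the contrast between sparse and dense graphs can be illustrated with a short computation of f(&#954;) (the ring and complete graphs and the value n = 20 are hypothetical choices for illustration):</p>
<p><![CDATA[
```python
# f(kappa) = largest kappa-hop neighborhood size: sparse ring vs complete graph.

def k_hop_size(adj, i, kappa):
    """BFS: number of nodes within kappa hops of node i (including i)."""
    frontier, seen = {i}, {i}
    for _ in range(kappa):
        frontier = {v for u in frontier for v in adj[u]} - seen
        seen |= frontier
    return len(seen)

n, kappa = 20, 1
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
complete = {i: [j for j in range(n) if j != i] for i in range(n)}

f_ring = max(k_hop_size(ring, i, kappa) for i in range(n))          # 2*kappa + 1
f_complete = max(k_hop_size(complete, i, kappa) for i in range(n))  # already n at kappa = 1

print(f_ring, f_complete)  # 3 20
```
]]></p>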
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>2019;<ref type="bibr">Agarwal et al., 2021]</ref>. One interesting future direction is to show whether similar properties hold in our multi-agent networked setting with the local interaction structure, and whether the local convergence guarantee can imply a global optimality guarantee.When put in the context of the broader literature in multi-agent RL, our contribution can be</p></note>
		</body>
		</text>
</TEI>
