<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Unbiased Asymmetric Reinforcement Learning under Partial Observability</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10329260</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the  International Joint Conference on Autonomous Agents and Multiagent Systems</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Andrea Baisero</author><author>Christopher Amato</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In partially observable reinforcement learning, offline training gives access to latent information which is not available during online training and/or execution, such as the system state. Asymmetric actor-critic methods exploit such information by training a history-based policy via a state-based critic. However, many asymmetric methods lack theoretical foundation, and are only evaluated on limited domains. We examine the theory of asymmetric actor-critic methods which use state-based critics, and expose fundamental issues which undermine the validity of a common variant, and limit its ability to address partial observability. We propose an unbiased asymmetric actor-critic variant which is able to exploit state information while remaining theoretically sound, maintaining the validity of the policy gradient theorem, and introducing no bias and relatively low variance into the training process. An empirical evaluation performed on domains which exhibit significant partial observability confirms our analysis, demonstrating that unbiased asymmetric actor-critic converges to better policies and/or faster than symmetric and biased asymmetric baselines.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Partial observability is a key characteristic of many real-world reinforcement learning (RL) control problems where the agent lacks access to the system state, and is restricted to operate based on the observable past, a.k.a. the history. Such control problems are commonly encoded as partially observable Markov decision processes (POMDPs) <ref type="bibr">[15]</ref>, which are the focus of a significant amount of research effort. Offline learning/online execution is a common RL framework where an agent is trained in a simulated offline environment before operating online, which offers the possibility of using latent information not generally available in online learning, e.g., the simulated system state, or the state belief from the agent's perspective <ref type="bibr">[6,</ref><ref type="bibr">14,</ref><ref type="bibr">16,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">34]</ref>.</p><p>Figure <ref type="figure">1</ref>: Memory-Four-Rooms-9x9, a procedurally generated navigation task which requires information-gathering and memorization. The agent must avoid the bad exit and reach the good exit, which is identifiable by the color of the beacon.</p><p>Offline learning methods are in principle able to exploit this privileged information during training to achieve better online performance, so long as the resulting agent does not use the latent information during online execution. Specifically, actor-critic methods <ref type="bibr">[17,</ref><ref type="bibr">31]</ref> are able to adopt this approach via critic asymmetry, where the policy and critic models receive different information <ref type="bibr">[9,</ref><ref type="bibr">18,</ref><ref type="bibr">20,</ref><ref type="bibr">26,</ref><ref type="bibr">32,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref>, e.g., the history and latent state, respectively. 
This is possible because the critic is merely a training construct, and is not required or used by the agent to operate online. By the very nature of actor-critic methods, critic models which are unable or slow to learn accurate values act as a performance bottleneck on the policy. Consequently, critic asymmetry is a powerful tool which, if carried out with rigor, may provide significant benefits and bootstrap the agent's learning performance.</p><p>Unfortunately, existing asymmetric methods use asymmetric information heuristically, and demonstrate their validity only via empirical experimentation on selected environments <ref type="bibr">[9, 18, 20, 21, 25-28, 32, 36, 37]</ref>; the lack of a sound theoretical foundation leaves uncertainty as to whether these methods are truly able to generalize to other environments, particularly those which feature higher degrees of partial observability (see Figure <ref type="figure">1</ref>). In this work, (a) we analyze a standard variant of asymmetric actor-critic and expose analytical issues associated with the use of a state critic, namely that the state value function is generally ill-defined and/or causes learning bias; (b) we prove an asymmetric policy gradient theorem for partially observable control, an extension of the policy gradient theorem which explicitly uses latent state information; (c) we propose a novel unbiased asymmetric actor-critic method, which lacks the analytical issues of its biased counterparts and is, to the best of our knowledge, the first of its kind to be theoretically sound; (d) we validate our theoretical findings through empirical evaluations on environments which feature significant amounts of partial observability, and demonstrate the advantages of our unbiased variant over the symmetric and biased asymmetric baselines.</p><p>This work sets the stage for other asymmetric critic-based policy gradient methods to exploit asymmetry in a principled manner, while learning under partial 
observability. Although we focus on advantage actor-critic (A2C), our method is easily extended to other critic-based learning methods such as off-policy actor-critic <ref type="bibr">[8,</ref><ref type="bibr">33]</ref>, (deep) deterministic policy gradient <ref type="bibr">[19,</ref><ref type="bibr">29]</ref>, and asynchronous actor-critic <ref type="bibr">[22]</ref>. Offline training is also the dominant paradigm in multiagent RL, where many asymmetric actor-critic methods could be similarly improved <ref type="bibr">[9,</ref><ref type="bibr">18,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">27,</ref><ref type="bibr">28,</ref><ref type="bibr">32,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>The use of latent information during offline training has been successfully adopted in a variety of policy-based methods <ref type="bibr">[7,</ref><ref type="bibr">9,</ref><ref type="bibr">18,</ref><ref type="bibr">20,</ref><ref type="bibr">26,</ref><ref type="bibr">32,</ref><ref type="bibr">34,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref> and value-based methods <ref type="bibr">[7,</ref><ref type="bibr">21,</ref><ref type="bibr">27,</ref><ref type="bibr">28]</ref>. Among the single-agent methods, asymmetric actor-critic for robot learning <ref type="bibr">[26]</ref> uses a reactive variant of DDPG with a state-based critic to help address partial observability; belief-grounded networks <ref type="bibr">[25]</ref> use a belief-reconstruction auxiliary task to train history representations; and Warrington et al. <ref type="bibr">[34]</ref> and Chen et al. <ref type="bibr">[6]</ref> use a fully observable agent trained offline on latent state information to train a partially observable agent via imitation.</p><p>Asymmetric learning has also become popular in the multi-agent setting: COMA <ref type="bibr">[9]</ref> uses reactive control and a shared asymmetric critic which can receive either the joint observations of all agents or the system state to solve cooperative tasks; MADDPG <ref type="bibr">[20]</ref> and M3DDPG <ref type="bibr">[18]</ref> use the same form of asymmetry with individual asymmetric critics to solve cooperative-competitive tasks; R-MADDPG <ref type="bibr">[32]</ref> uses recurrent models to represent non-reactive control, and the centralized critic uses the entire histories of all agents; CM3 <ref type="bibr">[37]</ref> uses a state critic for reactive control; while ROLA <ref type="bibr">[36]</ref> trains centralized and local history/state critics to estimate individual advantage values. 
Asymmetry is also used in multi-agent value-based methods: QMIX <ref type="bibr">[28]</ref>, MAVEN <ref type="bibr">[21]</ref>, and WQMIX <ref type="bibr">[27]</ref> all train individual Q-models using a centralized but factored Q-model, itself trained using state, joint histories, and joint actions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">BACKGROUND</head><p>In this section, we review background topics relevant to understanding our work, i.e., POMDPs, the RL graphical model, standard (symmetric) actor-critic, and asymmetric actor-critic.</p><p>Notation. We denote sets with calligraphy X, set elements with lowercase &#119909; &#8712; X, random variables (RVs) with uppercase &#119883; , and the set of distributions over set X as &#916;X. Occasionally, we will need absolute and/or relative time indices; we use subscript &#119909; &#119905; to indicate absolute time, and superscript &#119909; (&#119896;) to indicate the relative time of variables, e.g., &#119909; (0) marks the beginning of a sequence happening at an undetermined absolute time, and &#119909; (&#119896;) is the variable &#119896; steps later. We also use the bar notation to represent a sequence of superscripted variables x = (&#119909; (0) , &#119909; (1) , &#119909; (2) , . . .). </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">POMDPs</head><p>A POMDP <ref type="bibr">[15]</ref> is a discrete-time control problem defined by state, action, and observation spaces S, A, and O, a state transition function, an observation function, a reward function R(&#119904;, &#119886;), and a discount factor &#120574;. In the partially observable setting, the agent lacks access to the underlying state, and actions are selected based on the observable history &#8462;, i.e., the sequence of past actions and observations. We denote the space of realizable histories as H &#8838; (A &#215; O) * , and the space of realizable histories of length &#119897; as H &#119897; &#8838; (A &#215; O) &#119897; . Generally, an agent operating under partial observability might have to consider the entire history to achieve optimal behavior <ref type="bibr">[30]</ref>, i.e., its policy should represent a mapping &#120587; : H &#8594; &#916;A. The belief-state &#119887; : H &#8594; &#916;S is the conditional distribution over states given the observable history, i.e., &#119887; (&#8462;) = Pr(&#119878; | &#8462;), and is a sufficient statistic of the history for optimal control <ref type="bibr">[15]</ref>. We define the history reward function as R(&#8462;, &#119886;) = E &#119904; |&#8462; [R(&#119904;, &#119886;)]; from the agent's perspective, this is the reward function of the decision process. We denote the last observation in a history &#8462; as &#119900; &#8462; , and say that an agent is reactive if its policy &#120587; : O &#8594; &#916;A only uses &#119900; &#8462; rather than the entire history. A policy's history value function &#119881; &#120587; : H &#8594; R is the expected return following a realizable history &#8462;,</p><p>which supports an indirect recursive Bellman form,</p></div>
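The two displayed equations referenced above were lost in extraction; the following is a reconstruction consistent with the surrounding definitions (the history reward function R(h, a), discount factor γ, and hao denoting history h extended by action a and observation o):

```latex
% History value: expected discounted return following a realizable history h
V^\pi(h) = \mathbb{E}\left[ \sum_{k=0}^{\infty} \gamma^k R\big(h^{(k)}, a^{(k)}\big) \,\middle|\, h^{(0)} = h \right]
% Indirect recursive Bellman form
V^\pi(h) = \sum_{a} \pi(a; h) \Big( R(h, a) + \gamma \sum_{o} \Pr(o \mid h, a)\, V^\pi(hao) \Big)
```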
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">The RL Graphical Model</head><p>Some of the theory and results developed in this document concern whether certain RVs of interest are well-defined; therefore, we review the RVs defined by POMDPs. The environment dynamics and the agent policy jointly induce a graphical model (see Figure <ref type="figure">2</ref>) over timed RVs &#119878; &#119905; , &#119860; &#119905; , and &#119874; &#119905; . Note that only timed RVs are defined. A probability is a numeric value associated with the assignment of a value &#119909; from a sample space X to an RV &#119883; , e.g., Pr(&#119883; = &#119909;). Although it is common to use simplified notation to informally omit the RV assignment (e.g., Pr(&#119909;)), it must always be implicitly clear which RV (&#119883; ) is involved in the assignment. In the reinforcement learning graphical model, a probability is well-defined if and only if (a) it is grounded (implicitly or explicitly) to timed RVs (or functions thereof); or (b) it is time-invariant (i.e., it can be implicitly grounded to any time index). For example, Pr(&#119904; &#8242; | &#119904;, &#119886;) is implicitly grounded to the RVs of a state transition Pr(&#119878; &#119905; +1 = &#119904; &#8242; | &#119878; &#119905; = &#119904;, &#119860; &#119905; = &#119886;), and although the time-index &#119905; is not clear from context, the probability is time-invariant and thus well defined. As another example, Pr(&#119904; | &#8462;) is implicitly grounded to the RVs of a belief Pr(&#119878; &#119905; = &#119904; | &#119867; &#119905; = &#8462;), where the time-index &#119905; is implicitly grounded to the history length &#119905; = |&#8462;|, which makes the probability well defined.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">(Symmetric) Actor-Critic for POMDPs</head><p>Policy gradient methods <ref type="bibr">[31]</ref> for fully observable control can be adapted to partially observable control by replacing occurrences of the system state &#119904; with the history &#8462; (which is the Markov-state of an equivalent history-MDP). In advantage actor-critic methods (A2C) <ref type="bibr">[17]</ref>, a policy model &#120587; : H &#8594; &#916;A parameterized by &#120579; is trained using gradients estimated from sample data, while a critic model V : H &#8594; R parameterized by &#120599; is trained to predict history values &#119881; &#120587; (&#8462;). Note that we annotate parametric critic models with a hat V , to distinguish them from their analytical counterparts &#119881; &#120587; . In A2C, the critic is used to bootstrap return estimates and as a baseline, both of which are techniques for the reduction of estimation variance <ref type="bibr">[10]</ref>. The actor and critic models are respectively trained on L policy (&#120579; ) + &#120582;L neg-entropy (&#120579; ) and L critic (&#120599;).</p><p>Policy Loss. The policy loss L policy (&#120579; ) = &#8722;E[&#8721; &#119905; &#120574; &#119905; R(&#119904; &#119905; , &#119886; &#119905; )] encodes the agent's performance as the expected return. The policy gradient theorem <ref type="bibr">[17,</ref><ref type="bibr">31]</ref> provides an analytical expression for the policy loss gradient w.r.t. the policy parameters,</p><p>The value &#119876; &#120587; (&#8462; &#119905; , &#119886; &#119905; ) is replaced by the temporal difference (TD) error &#120575; &#119905; to reduce variance (at the cost of introducing modeling bias),</p><p>Critic Loss. The critic loss L critic (&#120599;) = E[&#8721; &#119905; &#120575; &#119905; &#178;] is used to minimize the total TD error, the gradient of which should propagate through V (&#8462; &#119905; ), but not through the bootstrapping V (&#8462; &#119905; +1 ).</p><p>Negative-Entropy Loss. 
Finally, the negative-entropy loss is commonly used, L neg-entropy (&#120579; ) = &#8722;E[&#8721; &#119905; H[&#120587; (&#119860; &#119905; ; &#8462; &#119905; )]], in combination with a decaying weight &#120582;, to avoid premature convergence of the policy model and to promote exploration <ref type="bibr">[35]</ref>.</p></div>
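As a concrete illustration of the three losses described in Section 3.3, the following is a minimal, framework-free Python sketch. The tabular setup, function names, and trajectory format are our own illustrative assumptions, not the paper's implementation:

```python
# Minimal sketch of the A2C losses (policy, critic, negative-entropy).
gamma = 0.99  # discount factor

def td_error(r, v_h, v_h_next, done):
    """delta_t = r_t + gamma * V(h_{t+1}) - V(h_t); no bootstrap on terminal."""
    target = r + (0.0 if done else gamma * v_h_next)
    return target - v_h

def losses(trajectory, lam=0.01):
    """trajectory: list of (logp_a, entropy, r, v_h, v_h_next, done) tuples.

    Returns (actor loss, critic loss); the TD error is treated as a constant
    w.r.t. the policy parameters, as described in the text.
    """
    L_policy = L_critic = L_negent = 0.0
    discount = 1.0
    for logp_a, entropy, r, v_h, v_h_next, done in trajectory:
        delta = td_error(r, v_h, v_h_next, done)
        L_policy += -discount * delta * logp_a  # policy-gradient surrogate loss
        L_critic += delta ** 2                  # squared TD error
        L_negent += -entropy                    # negative entropy, weighted by lam
        discount *= gamma
    return L_policy + lam * L_negent, L_critic

# Toy check: with a zero-valued critic, the TD error of a terminal
# transition reduces to the reward itself.
print(td_error(1.0, 0.0, 0.0, True))  # 1.0
```

In an actual implementation the critic values would come from a recurrent model over histories, and the losses would be minimized by automatic differentiation.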
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Asymmetric Actor-Critic for POMDPs</head><p>While asymmetric actor-critic can be understood to be an entire family of methods which use critic asymmetry, for the remainder of this document we will be specifically referring to a non-reactive and non-deterministic variant of the work by Pinto et al. <ref type="bibr">[26]</ref>, which uses critic asymmetry to address image-based robot learning. Their work uses a reactive variant of deep deterministic policy gradient (DDPG) <ref type="bibr">[19]</ref> trained in simulation, and replaces the reactive observation critic V (&#119900;) with a state critic V (&#119904;); the variant we will be analyzing applies the same critic substitution to A2C. In practice, this state-based asymmetry is obtained by replacing the TD error of Equation ( <ref type="formula">6</ref>) (used in both the policy and critic losses) with</p><p>Although <ref type="bibr">[26]</ref> claim that their work addresses partial observability, their evaluation is based on reactive environments which are effectively fully observable; while the agent only receives a single image, each image provides a virtually complete and occlusion-free view of the entire workspace. In practice, the images are merely high-dimensional representations of a compact state.</p></div>
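The displayed TD errors (Equation (6) and its asymmetric replacement) were dropped in extraction; reconstructed from the description above, the substitution is:

```latex
% Symmetric TD error over history critics (Equation (6)):
\delta_t = R_t + \gamma \hat{V}(h_{t+1}) - \hat{V}(h_t)
% Asymmetric variant analyzed here: the history critic is replaced by a state critic
\delta_t = R_t + \gamma \hat{V}(s_{t+1}) - \hat{V}(s_t)
```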
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">THEORY OF ASYMMETRIC ACTOR-CRITIC</head><p>In this section, we analyze the theoretical implications of using a state critic under partial observability, as described in Section 3.4, and expose critical underlying issues. The primary result will be that the time-invariant state value function &#119881; &#120587; (&#119904;) of a non-reactive agent is generally ill-defined. Then, we show that the time-invariant state value function &#119881; &#120587; (&#119904;) of a reactive agent is well-defined under mild assumptions, but generally introduces a bias into the training process which may undermine learning. Finally, we show that the time-invariant state value function &#119881; &#120587; (&#119904;) of a reactive agent under stronger assumptions can be both well-defined and unbiased. Later, in Section 5, we provide a more general alternative which guarantees well-defined and unbiased time-invariant state-based value functions for arbitrary policies and control problems.</p><p>Informally, the issue with &#119881; &#120587; (&#119904;) is that the state alone does not contain sufficient information to determine the agent's future behavior-which generally depends on the history-and is thus unable to accurately represent expected future returns. Ironically, state values suffer from a form of history aliasing, i.e., being unable to infer the agent's history from the system's state. This is particularly evident in control problems which require the agent to perform forms of information gathering (a common occurrence in partially observable control) which are not reflected in the system state, e.g., reach a certain spot to observe a piece of information which is necessary to determine future optimal behavior and solve the control task. 
In such cases, the state alone does not generally indicate whether the agent has collected the necessary information in the past or not, and is therefore unable to adequately represent whether the current state is a positive or negative occurrence. Formally, we will show that &#119881; &#120587; (&#119904;) is generally not a well-defined quantity and, even in special cases where it is well-defined, generally introduces a bias in the learning process caused by the imperfect correlation between histories and states; in essence, the average value of histories inferred from the current state is not an accurate estimate of the current history's value.</p><p>Methodology. We note that replacing the history critic is intrinsically questionable: the policy gradient theorem for POMDPs (Equation (4)) specifically requires history values, and replacing them with other state-based values will generally result in biased gradients and a general loss of theoretical guarantees. Therefore, we analyze state values &#119881; &#120587; (&#119904;) as stochastic estimators of history values &#119881; &#120587; (&#8462;) and consider the corresponding estimation bias, i.e., the difference between the expected estimate E &#119904; |&#8462; [&#119881; &#120587; (&#119904;)] and the ground truth estimation target &#119881; &#120587; (&#8462;) for any given history &#8462;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">General Policy under Partial Observability</head><p>A policy's state value function &#119881; &#120587; : S &#8594; R is tentatively defined as the expected return following a realizable state &#119904;,</p><p>which, if well-defined, supports an indirect recursive Bellman form,</p><p>In Equation ( <ref type="formula">9</ref>), we note the term Pr(&#119886; | &#119904;), which encodes the likelihood of an action being taken from a given state. Because the agent policy depends on histories (not states), this term is not directly available, but must be derived indirectly by integrating over possible histories. Further, because &#119904; is timeless, and no additional context is available to narrow down time, there is no choice but to integrate over histories of all possible lengths.</p><p>Equation ( <ref type="formula">11</ref>) reveals the probability term Pr(&#8462; | &#119904;), which encodes the likelihood of a history having taken place in the past given a current state. While Pr(&#8462; | &#119904;) may look harmless, it is the underlying cause of serious analytical issues. As discussed in Section 3.2, a probability is only well-defined if associated with well-defined RVs, and unfortunately such RVs do not exist for Pr(&#8462; | &#119904;). On one hand, timed RVs Pr(&#119867; &#119905; = &#8462; | &#119878; &#119905; = &#119904;) cannot be used, because Equation <ref type="bibr">(11)</ref> integrates over the sample space of all histories, and not just those of a given length &#119905;. On the other hand, time-less RVs Pr(&#119867; = &#8462; | &#119878; = &#119904;) cannot be used, because such time-less RVs do not exist in the RL graphical model. Ultimately, Pr(&#8462; | &#119904;) is mathematically ill-defined, which consequently causes both Pr(&#119886; | &#119904;) and &#119881; &#120587; (&#119904;) to be ill-defined as well.</p><p>Theorem 4.1. 
In partially observable control problems, a time-invariant state value function &#119881; &#120587; (&#119904;) is generally ill-defined.</p><p>The practical implications of an ill-defined value function are not obvious; even though the analytical value function &#119881; &#120587; (&#119904;) is ill-defined, the state critic's V (&#119904;) training process is based on valid calculations over sample data, which results in syntactically valid updates of the critic parameters. However, given that asymptotic convergence is theoretically impossible when &#119881; &#120587; (&#119904;) is ill-defined, the critic's target will continue shifting indefinitely based on the recent batches of training data, even when unbiased Monte Carlo return estimates are used to train the critic (without bootstrapping). In practice, the effects are not necessarily catastrophic for all control problems, and likely vary depending on the amount of partial observability, on the agent's need to gather and remember information, and on the specific state and observation representations.</p><p>In principle, timed value functions &#119881; &#120587; &#119905; (&#119904;) represent a straightforward solution to all these issues (see appendix <ref type="bibr">[2]</ref>). However, learning a timed critic model is likely to pose additional learning challenges, due to the need to generalize well and accurately across time-steps. Rather, we will demonstrate that there are special cases of the general control problem which do guarantee well-defined time-invariant value functions &#119881; &#120587; (&#119904;) (see Sections 4.2 and 4.3). However, before that, we can already show that, even when &#119881; &#120587; (&#119904;) is guaranteed to be well-defined, it is not guaranteed to be unbiased. Theorem 4.2. 
Even when well-defined, a time-invariant state value function &#119881; &#120587; (&#119904;) is generally a biased estimate of &#119881; &#120587; (&#8462;), i.e., it is not guaranteed that &#119881; &#120587; (&#8462;) = E &#119904; |&#8462; [&#119881; &#120587; (&#119904;)].</p><p>Proof. Consider two histories which are different, &#8462; &#8242; &#8800; &#8462; &#8242;&#8242; , and result in different action distributions, &#120587; (&#119860;; &#8462; &#8242; ) &#8800; &#120587; (&#119860;; &#8462; &#8242;&#8242; ), but are associated with the same belief, &#119887; (&#8462; &#8242; ) = &#119887; (&#8462; &#8242;&#8242; )-a fairly common occurrence in many POMDPs (see appendix <ref type="bibr">[2]</ref>). On one hand, because the two histories result in different behaviors, future trajectories and rewards will differ, leading to different history values, &#119881; &#120587; (&#8462; &#8242; ) &#8800; &#119881; &#120587; (&#8462; &#8242;&#8242; ). On the other hand, because the two beliefs are equal, the expected state values must also be equal, E &#119904; |&#8462; &#8242; [&#119881; &#120587; (&#119904;)] = E &#119904; |&#8462; &#8242;&#8242; [&#119881; &#120587; (&#119904;)]. If &#119881; &#120587; (&#8462;) = E &#119904; |&#8462; [&#119881; &#120587; (&#119904;)] held for all histories, then it would hold for &#8462; &#8242; and &#8462; &#8242;&#8242; too, which implies &#119881; &#120587; (&#8462; &#8242; ) = E &#119904; |&#8462; &#8242; [&#119881; &#120587; (&#119904;)] = E &#119904; |&#8462; &#8242;&#8242; [&#119881; &#120587; (&#119904;)] = &#119881; &#120587; (&#8462; &#8242;&#8242; ) -a simple contradiction. Therefore, either &#119881; &#120587; (&#8462; &#8242; ) &#8800; E &#119904; |&#8462; &#8242; [&#119881; &#120587; (&#119904;)] or &#119881; &#120587; (&#8462; &#8242;&#8242; ) &#8800; E &#119904; |&#8462; &#8242;&#8242; [&#119881; &#120587; (&#119904;)] (or both). &#9633;</p></div>
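The displayed equations of Section 4.1 (Equations (9) and (11), lost in extraction) can be reconstructed from the surrounding discussion: the tentative Bellman form of the state value function requires an action likelihood Pr(a | s), which can only be obtained by integrating over histories of all lengths, exposing the ill-defined term Pr(h | s):

```latex
% Tentative Bellman form of the state value function (cf. Equation (9))
V^\pi(s) = \sum_{a} \Pr(a \mid s) \Big( R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, V^\pi(s') \Big)
% Action likelihood via histories of all lengths (cf. Equation (11))
\Pr(a \mid s) = \sum_{h \in \mathcal{H}} \Pr(h \mid s)\, \pi(a; h)
```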
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Reactive Policy under Partial Observability</head><p>We show that &#119881; &#120587; (&#119904;) is well-defined if we make two assumptions about the agent and environment: (a) that the policy is reactive (a common but inadequate assumption); and (b) that the POMDP observation function depends only on the current state, O : S &#8594; &#916;O, rather than the entire state transition (a mild assumption). Under these assumptions, we can expand Pr(&#119886; | &#119904;) by integrating over the space of all observations (rather than all histories),</p><p>In this case, Pr(&#119900; | &#119904;) is time-invariant, and can therefore be implicitly grounded to RVs of any time index Pr(&#119874; &#119905; = &#119900; | &#119878; &#119905; = &#119904;). This leads to a well-defined value &#119881; &#120587; (&#119904;) which, however, generally remains biased compared to &#119881; &#120587; (&#8462;), per Theorem 4.2. In addition to Theorem 4.2, which is applicable in a more general setting, see appendix <ref type="bibr">[2]</ref> for two additional proofs which also take into account the specific assumptions made here. Broadly speaking, the bias is caused by the fact that hidden in &#119881; &#120587; (&#119904;) is an expectation over observations &#119900; which are not necessarily consistent with the true history &#8462;; each proof covers this issue from different angles.</p><p>Although the value function is well-defined under reactive control, there are still two significant issues which preclude these assumptions from representing a general solution: (a) reactive policies are inadequate to solve many POMDPs; and (b) the value function bias may prevent the agent from learning a satisfactory behavior.</p></div>
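The displayed expansion referenced in Section 4.2 was dropped in extraction; under the two stated assumptions (reactive policy, state-only observation function) it reads:

```latex
% Reactive policy and state-only observation function:
% integrate over observations rather than histories
\Pr(a \mid s) = \sum_{o} \Pr(o \mid s)\, \pi(a; o)
```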
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Reactive Policy under Full Observability</head><p>We show that the state value function is both well-defined and unbiased under two assumptions: (a) that the policy is reactive (a common but inadequate assumption); and (b) that there is a bijective abstraction &#120601; : O &#8594; S between observations and states (an unrealistic assumption). The abstraction &#120601; encodes the fact that the environment is not truly partially observable, but rather that states and observations fundamentally contain the same information, albeit at different levels of abstraction. For example, in the control problems used by Pinto et al. <ref type="bibr">[26]</ref>, an image displaying a workspace without occlusions is a low-level abstraction (observation), while a concise vector representation of the object poses in the workspace is a high-level abstraction (state).</p><p>In this case, the action probability term Pr(&#119886; | &#119904;) does not need to be obtained indirectly by integrating other variables; rather, bijection &#120601; can be used to relate it to the policy model Pr(&#119886; | &#119904;) = &#120587; (&#119886;; &#120601; -1 (&#119904;)). Contrary to the previous cases, the overall state value function &#119881; &#120587; (&#119904;) is not only well-defined, but also unbiased. Proof. The bijection between &#119900; &#8462; and &#119904; not only implies a many-to-one relationship between histories and states, but also fully determines the agent's state-conditioned action. In the following derivation, we use these facts to determine the first action and reward, a process which can be repeated indefinitely for future actions and rewards.</p><p>(repeat process until end of episode)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9633;</head><p>The benefit of using a state critic under this scenario is that the critic model can avoid learning a representation of the observations before learning the values <ref type="bibr">[26]</ref>. Naturally, the main disadvantage of this scenario is that most POMDPs do not satisfy the bijective abstraction assumption; if anything, this assumption is intrinsically incompatible with partial observability, and any POMDP which satisfies this assumption is really an MDP in disguise. Nonetheless, if a control problem only deviates mildly from full observability, it is likely that a state critic will benefit the learning agent despite the theoretical issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">UNBIASED ASYMMETRIC ACTOR-CRITIC</head><p>In this section, we introduce unbiased asymmetric actor-critic, an actor-critic variant able to exploit asymmetric state information during offline training while avoiding the issues of state value functions exposed in Section 4. Consider the history-state value function &#119881; &#120587; (&#8462;, &#119904;) <ref type="bibr">[5]</ref>, defined as the expected return following a realizable history-state pair &#8462; and &#119904;,</p><p>which supports an indirect recursive Bellman form,</p><p>Note that the history &#8462; and state &#119904; cover different and orthogonal roles: the history &#8462; determines the future behavior of the agent, while the state &#119904; determines the future behavior of the environment. Compared to the history value &#119881; &#120587; (&#8462;), the state information in &#119881; &#120587; (&#8462;, &#119904;) provides additional context to determine the agent's true underlying situation, its rewards, and its expected return. Compared to the state value &#119881; &#120587; (&#119904;), the history information in &#119881; &#120587; (&#8462;, &#119904;) provides additional context to determine the agent's future behavior, which guarantees that &#119881; &#120587; (&#8462;, &#119904;) is well-defined and unbiased.</p><p>Theorem 5.1. For arbitrary control problems and policies, &#119881; &#120587; (&#8462;, &#119904;) is an unbiased estimate of &#119881; &#120587; (&#8462;), i.e., &#119881; &#120587; (&#8462;) = E &#119904; |&#8462; [&#119881; &#120587; (&#8462;, &#119904;)].</p><p>Proof. Follows from Equations ( <ref type="formula">1</ref>) and ( <ref type="formula">14</ref>),</p></div>
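The proof's displayed derivation was lost in extraction; writing G for the discounted return (our shorthand), it follows from the law of total expectation:

```latex
V^\pi(h) = \mathbb{E}\left[ G \mid h \right]
         = \mathbb{E}_{s \mid h}\!\left[ \mathbb{E}\left[ G \mid h, s \right] \right]
         = \mathbb{E}_{s \mid h}\!\left[ V^\pi(h, s) \right]
```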
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9633;</head><p>As we have done for state values &#119881; &#120587; (&#119904;), we are interested in the properties of history-state values &#119881; &#120587; (&#8462;, &#119904;) in relation to history values &#119881; &#120587; (&#8462;). Theorem 5.1 shows that history and history-state values are related by &#119881; &#120587; (&#8462;) = E &#119904; |&#8462; [&#119881; &#120587; (&#8462;, &#119904;)], i.e., history-state values are interpretable as Monte Carlo (MC) estimates of the respective history values. In expectation, history-state values provide the same information as the history values, therefore an asymmetric variant of the policy gradient theorem can be formulated. Theorem 5.2 (Asymmetric Policy Gradient).</p><p>Proof. Following Theorem 5.1, we have</p><p>Therefore,</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9633;</head><p>As estimators, history-state values &#119881; &#120587; (&#8462;, &#119904;) can be described in terms of their bias and variance w.r.t. history values &#119881; &#120587; (&#8462;). Beyond providing the inspiration for the MC interpretation, Theorem 5.1 already proves that &#119881; &#120587; (&#8462;, &#119904;) is unbiased, while its variance is dynamic and depends on the history &#8462; via the belief-state Pr(&#119878; | &#8462;); in particular, low-uncertainty belief-states result in low variance, and deterministic belief-states result in no variance. Given that operating optimally in a partially observable environment generally involves information-gathering strategies associated with low-uncertainty belief-states, the practical variance of the history-state value is likely to be relatively low once the agent has learned to solve the task to some degree of success.</p><p>Inspired by Theorem 5.2, we propose unbiased asymmetric A2C, which uses a history-state critic V : H &#215; S &#8594; R trained to model history-state values &#119881; &#120587; (&#8462;, &#119904;),</p><p>Because V (&#8462;, &#119904;) receives the history &#8462; as input, it can still predict reasonable estimates of the agent's expected future discounted returns; and because it receives the state &#119904; as input, it is still able to exploit state information while introducing no bias into the learning process, e.g., for the purposes of bootstrapping the learning of critic values and/or aiding the learning of history representations.</p></div>
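As a concrete illustration of this arrangement, the following is a minimal sketch, not the paper's implementation: linear function approximators and hand-supplied history/state feature vectors are illustrative assumptions, and all names are hypothetical. The critic conditions on both history and state features and is trained by TD(0); the actor is a softmax policy over history features alone and uses the critic's TD error as its advantage estimate.

```python
import numpy as np

class HistoryStateCritic:
    """Linear history-state critic V(h, s): unlike a state critic V(s),
    it conditions on both the history and the (offline-only) state, so
    its targets remain unbiased estimates of the history value V(h)."""

    def __init__(self, h_dim, s_dim, lr=0.1):
        self.w = np.zeros(h_dim + s_dim)
        self.lr = lr

    def value(self, h_feat, s_feat):
        return float(self.w @ np.concatenate([h_feat, s_feat]))

    def td_update(self, h_feat, s_feat, reward, h_next, s_next,
                  done, gamma=0.99):
        """One TD(0) step toward r + gamma * V(h', s'); returns the
        TD error, which the actor uses as its advantage estimate."""
        x = np.concatenate([h_feat, s_feat])
        bootstrap = 0.0 if done else self.w @ np.concatenate([h_next, s_next])
        td = reward + gamma * bootstrap - self.w @ x
        self.w += self.lr * td * x
        return td

def actor_grad(theta, h_feat, action, td_error):
    """Policy-gradient ascent direction for a softmax policy pi(a | h)
    over history features only (the state never enters the actor)."""
    logits = theta @ h_feat                  # (n_actions,)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    grad_log = -np.outer(probs, h_feat)      # d log pi(a|h) / d theta
    grad_log[action] += h_feat
    return td_error * grad_log
```

Because only the critic consumes the state, it can be discarded at execution time without changing the policy's inputs.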
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Interpretations of State</head><p>Although the history-state value is analytically well-defined, it remains worthwhile to question why the inclusion of the state information should help the actor-critic agent at all. We attempt to address this open question and consider two competing interpretations, which we call state-as-information and state-as-a-feature.</p><p>State as Information. Under this interpretation, state information is valuable because it is latent information unavailable in the history, which results in more informative values that help train the policy. However, we argue that this interpretation is flawed for two reasons: (a) The policy gradient theorem specifically requires &#119881; &#120587; (&#8462;), which contains precisely the correct information required to accurately estimate policy gradients. In this context, history values already contain the correct type and amount of information necessary to train the policy, and there is no such thing as "more informative values" than history values. (b) In theory, the history-state value in Theorem 5.2 could use any other state sampled according to s &#8764; &#119887; (&#8462;), rather than the true system state, which would also result in the same analytical bias and variance properties. In practice, we only use the true system state due to it being directly available during offline training; however, we believe that its identity as the true system state is analytically irrelevant, which leads to the next interpretation of state.</p><p>State as a Feature. We conjecture an alternative interpretation according to which the state can be seen as a stochastic high-level feature of the history. Consider a history critic V (&#8462;); to appropriately model the value function &#119881; &#120587; (&#8462;), V (&#8462;) must first learn an adequate history representation, which is in and of itself a significant learning challenge.
The critic model would likely benefit from receiving auxiliary high-level features of the history &#120601; (&#8462;). The resulting critic V (&#8462;, &#120601; (&#8462;)) remains fundamentally a history critic, as the auxiliary features are exclusively a modeling/architecture construct. Next, we consider what kind of high-level features &#120601; (&#8462;) would be useful for control. While the specifics of what makes a good history representation depend strongly on the task, there is a natural choice which is arguably useful in many cases: the belief-state &#119887; (&#8462;). Because the belief-state is a sufficient statistic of the history for control, providing it to the critic model V (&#8462;, &#119887; (&#8462;)) is likely to greatly improve its ability to generalize across histories. Finally, we conjecture that any state sampled according to the belief-state &#119904; &#8764; &#119887; (&#8462;), including the true system state, can be considered a stochastic realization of the belief-state feature, resulting in the history-state critic V (&#8462;, &#119904;). According to this interpretation, the importance of the state in the history-state critic lies not in its identity as the true system state, but in its role as a stochastic realization of hypothetical belief-state features, and presumably any other state sampled from the belief-state s &#8764; &#119887; (&#8462;) could be used equivalently.</p></div>
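For a discrete POMDP with known models, the belief-state feature &#119887;(&#8462;) discussed above can be maintained recursively with a Bayes filter. A minimal sketch, under assumed tensor layouts T[a, s, s'] = Pr(s' | s, a) and O[a, s', o] = Pr(o | s', a) (names hypothetical):

```python
import numpy as np

def belief_update(b, a, o, T, O):
    """Bayes filter over a discrete state space:
    b'(s') ∝ Pr(o | s', a) * sum_s Pr(s' | s, a) * b(s)."""
    b_pred = b @ T[a]            # predict: sum_s b(s) T[a, s, s']
    b_new = b_pred * O[a][:, o]  # correct: weight by Pr(o | s', a)
    return b_new / b_new.sum()   # normalize to a distribution
```

Under the state-as-a-feature interpretation, the true system state fed to the history-state critic is then just one sample &#119904; &#8764; &#119887;(&#8462;) of this feature.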
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">EVALUATION</head><p>We compare the learning performances of five actor-critic variants. A2C, A2C-asym-s, and A2C-asym-hs are respectively (symmetric) A2C with history critic V (&#8462;), asymmetric A2C with state critic V (&#119904;), and asymmetric A2C with history-state critic V (&#8462;, &#119904;); the remaining two variants are baselines included to demonstrate that the environments feature significant partial observability.</p><p>Our proposed unbiased asymmetric variant A2C-asym-hs displays some of the best learning characteristics across all environments. In Cleaner, Memory-Four-Rooms-7x7, and Memory-Four-Rooms-9x9, its performance matches that of A2C (Figures 3f to 3h), while in Car-Flag it matches that of A2C-asym-s (Figure <ref type="figure">3e</ref>). In and of itself, this indicates that A2C-asym-hs is able to exploit whichever source of information (history or state) happens to be more suitable in practice for a given task. On top of that, A2C-asym-hs demonstrates strictly better final performance and/or convergence speed than both A2C and A2C-asym-s in Shopping-5 and Shopping-6 (Figures <ref type="figure">3c</ref> and <ref type="figure">3d</ref>), demonstrating that it is not only able to use the better source of information, but also to combine both sources to achieve a higher best-of-both-worlds performance. This ability is pushed one step further in Heaven-Hell-3 and Heaven-Hell-4, where A2C-asym-hs is the only method capable of learning to solve the task at all (Figures <ref type="figure">3a</ref> and <ref type="figure">3b</ref>). These results strongly demonstrate the importance of exploiting asymmetric information in theoretically justified and sound ways, as done in our work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.2">Critic Values.</head><p>To further inspect the behavior of each critic, Figure <ref type="figure">4</ref> shows the evolution of critic values over the course of training for important history-state pairs in Heaven-Hell-4. We use four deliberately chosen history-state pairs which are particularly important in this environment. In each case, the agent is located at the fork between heaven and hell, and the cases differ by the position of heaven (left or right) and whether the agent has previously performed the information-gathering sequence of actions necessary to know the position of heaven (by visiting the priest).</p><p>Unsurprisingly, we first note that critic values are correlated with the respective agent's performance (Figure <ref type="figure">3b</ref>). Beyond that, the critics show certain individual characteristics: namely, the critics which focus on a single aspect of the joint history-state output the exact same values for different history-states. Although hard to see, the A2C critic V (&#8462;) outputs are identical in Figures <ref type="figure">4a</ref> and <ref type="figure">4b</ref>, as those values are associated with the same histories (but not the same states). Similarly, the A2C-asym-s critic V (&#119904;) outputs are identical in Figures <ref type="figure">4a</ref> and <ref type="figure">4c</ref> and Figures <ref type="figure">4b</ref> and <ref type="figure">4d</ref> respectively, as those values are associated with the same states (but not the same histories). This confirms a straightforward truth: the state critic V (&#119904;) is intrinsically unable to differentiate between values associated with different histories if they happen to be associated with the same state, which can be particularly detrimental in such information-gathering and memory-dependent tasks.
On the other hand, the A2C-asym-hs critic V (&#8462;, &#119904;) has the ability to output different values, as needed, for each of the four cases. Note, in particular, that the A2C-asym-hs critic is able to assign a higher value to the agent's situation if it has already performed the information-gathering actions (Figures <ref type="figure">4c</ref> and <ref type="figure">4d</ref>), compared to when it has not (Figures <ref type="figure">4a</ref> and <ref type="figure">4b</ref>), which helps the agent determine that the information-gathering actions are important and should be performed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSIONS</head><p>In partially observable control problems, the offline training/online execution framework offers the peculiar opportunity to access the system's state during training, which otherwise remains latent during execution. Asymmetric methods trained offline can potentially exploit such privileged information to train agents that reach better performance, and/or to train more efficiently, using less data. While this idea has great potential, current state-of-the-art methods are motivated and driven by empirical results rather than theoretical analysis. In this work, we exposed fundamental theoretical issues with a standard variant of asymmetric actor-critic which makes use of state critics &#119881; &#120587; (&#119904;), and proposed an unbiased asymmetric variant which makes use of history-state critics &#119881; &#120587; (&#8462;, &#119904;) and is the first of its kind to be analytically sound and theoretically justified. Although this represents a relatively simple change, its effects are profound, as demonstrated in both theoretical analysis and empirical results. Our evaluations confirm our analysis, and demonstrate both the issues with state-based critics and the benefits of history-state critics in environments which exhibit significant partial observability.</p><p>Although our evaluation only concerns A2C, the same concepts are easily extensible to other critic-based RL methods <ref type="bibr">[8,</ref><ref type="bibr">19,</ref><ref type="bibr">22,</ref><ref type="bibr">29]</ref>. The potential for future work is varied. One possibility is to extend the theory of history-state value functions to optimal value functions &#119876; * (&#8462;, &#119904;, &#119886;), and develop theoretically sound asymmetric variants of value-based deep RL methods such as DQN <ref type="bibr">[23]</ref>. 
Another possibility is to integrate asymmetric information with state-of-the-art maximum entropy value/critic-based methods such as soft Q-learning <ref type="bibr">[11]</ref>, and soft actor-critic <ref type="bibr">[12]</ref>. Finally, another avenue for improvement is to extend our theory and approach to multiagent methods, potentially bringing theoretical rigor and improved performance <ref type="bibr">[9,</ref><ref type="bibr">18,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">27,</ref><ref type="bibr">28,</ref><ref type="bibr">32,</ref><ref type="bibr">37]</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proc. of the 21st International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2022), P. Faliszewski, V. Mascardi, C. Pelachaud, M.E. Taylor (eds.), May 9-13, 2022, Online. &#169; 2022 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.</p></note>
		</body>
		</text>
</TEI>
