<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Symmetric Machine Theory of Mind</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10356960</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 39th International Conference on Machine Learning</title>
<idno></idno>
<biblScope unit="volume">162</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Melanie Sclar</author><author>Graham Neubig</author><author>Yonatan Bisk</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Theory of mind, the ability to model others’ thoughts and desires, is a cornerstone of human social intelligence. This makes it an important challenge for the machine learning community, but previous works mainly attempt to design agents that model the "mental state" of others as passive observers or in specific predefined roles, such as in speaker-listener scenarios. In contrast, we propose to model machine theory of mind in a more general symmetric scenario. We introduce a multi-agent environment SymmToM where, like in real life, all agents can speak, listen, see other agents, and move freely through the world. Effective strategies to maximize an agent’s reward require it to develop a theory of mind. We show that reinforcement learning agents that model the mental states of others achieve significant performance improvements over agents with no such theory of mind model. Importantly, our best agents still fail to achieve performance comparable to agents with access to the gold-standard mental state of other agents, demonstrating that the modeling of theory of mind in multi-agent scenarios is very much an open challenge.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Human communication is shaped by the desire to efficiently cooperate and achieve communicative goals <ref type="bibr">(Tomasello, 2009)</ref>. Children quickly learn that other people have independent mental states, and that communicating is necessary to obtain information from or shape the intentions of those they interact with. Remembering and reasoning over others' mental states ensures efficient communication by avoiding having to repeat information, and contributes to achieving common goals with minimal effort.</p><p>Figure <ref type="figure">1</ref>. In SymmToM, agents aim to gain all available information (depicted as diamonds, black for known, white for unknown). Since an agent's hearing is limited to its neighboring cells, agents must guess what happened beyond this range. Agents can see the whole grid, but mistakes in inference may happen (as with the red agent).</p><p>Because of this, there is growing interest in developing agents that can exhibit this kind of behavior, referred to as Theory of Mind (ToM) by developmental psychologists <ref type="bibr">(Premack &amp; Woodruff, 1978)</ref>. Previous work on agents imbued with such capabilities has focused mainly on two types of tasks. The first type comprises tasks where the agent is a passive observer of a scene and has to predict the future by reasoning over others' mental states. These tasks may involve natural language <ref type="bibr">(Nematzadeh et al., 2018)</ref> or be purely spatial <ref type="bibr">(Gandhi et al., 2021;</ref><ref type="bibr">Rabinowitz et al., 2018;</ref><ref type="bibr">Baker et al., 2011)</ref>. The second type comprises tasks where the theory of mind agent has a specific role, such as "the speaker" in speaker-listener scenarios <ref type="bibr">(Zhu et al., 2021)</ref>.</p><p>In contrast, human cooperation and communication are often multi-party, and rarely assume that people have singular pre-specified roles. 
Moreover, human interlocutors are seldom passive observers of a scene but instead active participants. These dynamics mean human communication has additional complexities, such as the coordination between theory of mind, planning, and action, that are not easily tested in previous work. Therefore, we develop a more flexible environment, SymmToM, where we can study what happens when all participants must act as both speaker and listener. SymmToM is a fully symmetric multi-agent environment where all agents can see, hear, speak, and move, and are active players of a simple information-gathering game. To solve SymmToM, agents need to exhibit different levels of theory of mind, as well as efficiently communicate through a simple channel with a fixed set of symbols.</p><p>SymmToM is partially observable for all agents: even if agents have full vision, hearing may be limited. This also differentiates SymmToM from prior work, as modeling may require probabilistic theory of mind. In other words, agents need to not only remember and infer other agents' knowledge based on what they saw, but also estimate the probability that certain events happened. This estimation may be performed by assuming other agents' optimal behavior and processing the partial information available.</p><p>Despite its simple action space, SymmToM both fulfills the properties required for symmetric theory of mind to arise (which will be discussed in the following section), and empirically cannot be completely solved either by using well-known multi-agent deep reinforcement learning (RL) models, or even by tailoring those models to our task. In addition, all dimensions of complexity can be easily scaled to be more or less challenging, and we demonstrate how to test for different levels of theory of mind with corresponding metrics. 
Given this simplicity, flexibility, and difficulty, we contend that the SymmToM environment is an attractive first step towards testing the ability of agents to develop symmetric machine theory of mind.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Theory-of-Mind (ToM) Agents</head><p>A Theory-of-Mind agent can be defined as a modification of the standard multi-agent RL paradigm, where the agents' policies are conditioned on their beliefs about others. Formally, we define a reinforcement learning problem M as a tuple of a state space S, action space A, state transition probability function T : S &#215; A &#215; S &#8594; [0,1], and reward <formula notation="tex">R : S \times A \to \mathbb{R}</formula>, i.e. M := &#10216;S, A, T, R&#10217;. In this setting, an agent learns a (possibly probabilistic) policy &#960; : S &#8594; A mapping states to actions to maximize its reward.</p><p>In a multi-agent RL setting each agent can potentially have its own state space, action space, transition probabilities, and reward function, so we can define an instance <formula notation="tex">M_i = \langle S_i, A_i, T_i, R_i \rangle</formula> for each agent i. For convenience, we can also define a joint state space <formula notation="tex">S = \bigcup_i S_i</formula> that describes the entire world in which all agents are interacting. Importantly, in this setting each agent will have its own view of the entirety of the world, described by a conditional observation function <formula notation="tex">\omega_i : S \to \Omega_i</formula> that maps from the state of the entire environment to only the information observable by agent i.</p><p>Since theory of mind is the ability to know (and act upon) the knowledge that an agent has, agents with no theory of mind will follow a policy that depends only on their current (potentially partial or noisy) observation of their environment: <formula notation="tex">\pi_i(a_{i,t} \mid \omega_i(s_t))</formula>. Agents with zeroth order theory of mind <ref type="bibr">(Flobbe et al., 2008;</ref><ref type="bibr">Hedden &amp; Zhang, 2002)</ref> can reason over their own knowledge. These agents will be stateful, <formula notation="tex">\pi_i(a_{i,t} \mid \omega_i(s_t), h^{(i)}_t)</formula>, where <formula notation="tex">h^{(i)}_t</formula> is i's hidden state. Hidden states are always accessible to their owner, i.e. i has access to <formula notation="tex">h^{(i)}_t</formula>.</p><p>Agents with capabilities of reasoning over other agents' mental states will need to estimate <formula notation="tex">h^{(j)}_t</formula> for <formula notation="tex">j \neq i</formula>. We denote i's estimation of j's mental state at time t as <formula notation="tex">\hat{h}^{(i,j)}_t</formula>.</p><p>How do we estimate <formula notation="tex">\hat{h}^{(i,j)}_t</formula>? As a function g of the hidden state of i (the predicting agent) at t-1, i's observation at t-1, and i's prediction of the hidden states of every agent in the previous turn:</p><p><formula notation="tex">\hat{h}^{(i,j)}_t = g\big(h^{(i)}_{t-1},\ \omega_i(s_{t-1}),\ \{\hat{h}^{(i,k)}_{t-1}\}_{k}\big)</formula></p><p>i's prediction of other agents' observations at t-1 is also crucial, but is not explicitly mentioned since it can be computed using <formula notation="tex">\omega_i(s_{t-1})</formula>. For the initial turn, <formula notation="tex">\hat{h}^{(i,j)}_0</formula> may be initialized depending on the problem: if initial knowledge is public, <formula notation="tex">\hat{h}^{(i,j)}_0</formula> can be set to <formula notation="tex">h^{(j)}_0</formula>; otherwise, <formula notation="tex">\hat{h}^{(i,j)}_0</formula> may be estimated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Symmetric Theory-of-Mind</head><p>We define symmetric theory of mind environments as settings where theory of mind is required to perform a task successfully, and all agents have the same abilities. Having the same abilities means that all agents would have the same set of legal actions if placed in the same state (in terms of both location and knowledge), which is independent of the policy each agent executes. There are at least four defining characteristics for symmetric theory of mind to arise:</p><p>Symmetric action space. In symmetric theory of mind all agents are required to have the same action space (in contrast to, for example, theory of mind tasks in speaker-listener settings). Concretely, <formula notation="tex">A_i = A_j \neq \emptyset \ \ \forall i, j</formula>.</p><p>Imperfect information. In perfect information scenarios all knowledge is public, making it impossible to have agents with different mental states. In theory of mind tasks in general, there could be a subset of agents with perfect information (e.g. a passive observer predicting future behavior). In symmetric theory of mind, since all agents have the same abilities and roles, all agents must have imperfect information. More precisely, <formula notation="tex">\omega_i</formula> (the subset of the full state that agent i can observe if placed in each state) must not be the identity for any agent i.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Observation of others.</head><p>Agents must have at least partial information of another agent to estimate its mental state. In contrast to passive-observer settings, in symmetric theory of mind every agent must be able to partially observe all others. More precisely, <formula notation="tex">\omega_i</formula> must observe at least partial information about <formula notation="tex">s^{(j)}_t</formula> (the subset of <formula notation="tex">s_t</formula> that refers to agent j), although we do not require the observed portion of <formula notation="tex">s^{(j)}_t</formula> to be non-empty in every single turn. Moreover, if communication is allowed, it is desirable to partially observe or infer interactions between two or more agents to develop second order theory of mind (i.e. predicting what an agent thinks about what another agent is thinking) or higher.</p><p>Information-seeking behavior. Gathering as much information as possible should be relevant to performing the task successfully, and this information-gathering should involve some level of reasoning over other agents' knowledge. This is true for first-order theory of mind tasks in general, and can be formalized as <formula notation="tex">\pi^* \neq \pi</formula> for any zeroth-order theory of mind policy &#960;.</p><p>In general, tasks can incorporate perpetual information-seeking behaviors, to incentivize efficient play even in long episodes. However, achieving this with finite capacity requires forgetting. Forgetting can be implemented as an explicit loss of knowledge under specific conditions, or as a degradation of memories. This introduces the concept of information staleness. Since information is not cumulative and the environment is only partially observable, agents will need to estimate whether what they knew to be true still holds in the present.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">The SymmToM Environment</head><p>SymmToM is an environment where n agents are placed in a w&#215;w grid world, and attempt to maximize their reward by gathering all the information available in the environment. Its construction mirrors the requirements specified above. There are c available information pieces, which each agent may or may not know initially. Information pieces known at the start of an episode are referred to as first-hand information. Each turn, an agent may move through the grid to one of its four neighboring cells, and may speak exactly one of its currently known information pieces. More precisely, the action space of agent j is defined as follows:</p><p><formula notation="tex">A_j = \{\text{left}, \text{right}, \text{up}, \text{down}, \text{no move}\} \times \{1, \dots, c\} \qquad (1)</formula></p><p>When an agent utters an information piece, it is heard by every agent in its hearing range (a (2h + 1) &#215; (2h + 1) grid centered on the speaking agent, with 2h + 1 &lt; w). The agents who heard the utterance can share this newly learned information with others in following turns. We refer to this as second-hand information, since it is learned, as opposed to first-hand information, which is given at the start of each episode. The state space comprises the positions of the agents and their current knowledge: <formula notation="tex">S = \{\{(p_i, k_i) : i \in \{1, \dots, n\}\} \mid p_i \in \{1, \dots, w\} \times \{1, \dots, w\},\ k_i \in \{0, 1\}^c\}</formula>. Each agent aims to maximize its individual reward <formula notation="tex">R_i</formula> via information seeking and sharing. Rewards are earned by hearing a new piece of information, giving someone else a new piece of information, or correctly using recharge bases. Recharge bases are special cells that reset an agent's knowledge in exchange for a large reward (e.g. (n-1)c times the reward for listening to or sharing new information). Each agent has its own stationary recharge base during an episode. 
To trigger a base, an agent steps into its base having acquired all the available pieces of information, causing the agent to lose all the second-hand information it learned. Recharge bases guarantee that there is always a reward for seeking information. Concretely, let <formula notation="tex">s = \{(p_i, k_i) : i \in \{1, \dots, n\}\}</formula> be a state and <formula notation="tex">a_i = (a^{dir}_i, a^{comm}_i) \in A_i</formula> an action, where <formula notation="tex">a^{dir}_i</formula> represents the physical and <formula notation="tex">a^{comm}_i</formula> the communicative action. We define agent i's reward <formula notation="tex">R_i</formula> as the sum of three components. First, the reward for hearing new information, measured as the number of new information pieces heard by i. Second, the reward for sharing new information, computed as the number of agents that heard what i said and for whom it was new. And lastly, the reward for using the recharge base correctly. In the formal definition, <formula notation="tex">k_{i, a^{comm}_j} = 0</formula> denotes that the <formula notation="tex">a^{comm}_j</formula>-th element of <formula notation="tex">k_i</formula> is unknown (i.e. zero).</p><p>A non-theory of mind agent can only achieve limited success. Without reasoning about its own knowledge (i.e. without zeroth order theory of mind), it does not know when to use a recharge base. Moreover, without knowledge about other agents' knowledge (i.e. without first order theory of mind) it is not possible to know which agents possess the information it is lacking. Even if it accidentally hears information, a non-first-order theory of mind agent cannot efficiently decide what to utter in response to maximize its reward. Higher order theory of mind is also often needed in SymmToM, as we will discuss further in &#167;8.</p><p>Even though we only discussed a collaborative task for SymmToM, it can easily be extended to competitive tasks<ref type="foot">foot_0</ref>. Moreover, all our models are also designed to work under competitive settings. SymmToM satisfies the desiderata we laid out in the previous section, as we detail below.</p><p>Symmetric action space. As defined in Eq. 1, <formula notation="tex">A_i = A_j</formula> for all i, j. 
Only a subset may be available at a time since agents cannot step outside the grid, speak a piece they have not heard, or move if they would collide with another agent in the same cell, but they all share the same action space.</p><p>Imperfect information. Messages sent by agents outside of the hearing range will not be heard. For example, in Fig. <ref type="figure">2a</ref> green sends a message but it is not heard by anyone, since it is outside of red's and blue's range. Hearing ranges are guaranteed not to cover the whole grid, since 2h+1 &lt; w.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Observation of others.</head><p>Agents have perfect vision of the grid, even if they cannot hear what was said outside of their hearing range. Hence, an agent may see that two agents were in range of each other, and thus probably interacted, but not hear what was communicated. An example of this can be seen in Fig. <ref type="figure">2a</ref>, where green observes blue and red interacting without hearing what was uttered. The uncertainty in the observation also differentiates SymmToM from prior work: to solve the task perfectly, an agent needs to assess the probability that other agents outside its hearing range shared a specific piece of information to avoid repetition. This estimation may be performed using the knowledge of what each agent knows (first order theory of mind), the perceived knowledge of each of the agents in the interaction (second order theory of mind), as well as higher order theory of mind.</p><p>Information-seeking behavior Rewards are explicitly given for hearing and sharing novel information, guaranteeing information-seeking is crucial in SymmToM.</p><p>Recharge bases (Fig. <ref type="figure">2b</ref>) ensure that the optimal solution is not for all agents to accumulate in the same spot and quickly share all the information available; and that the information tracking required is more complex than accumulating past events. Conceptually, with recharge bases we introduce an explicit and observable forgetting mechanism. As discussed in Section 3, this allows for perpetual information seeking and requires information staleness estimation.</p></div>
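<div xmlns="http://www.tei-c.org/ns/1.0"><p>The hearing range and the three reward components described in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a unit reward per new piece, 0-indexed pieces, that a listener is rewarded once per distinct new piece, and that a speaker is rewarded once per listener for whom the piece was new. The function names <code>in_hearing_range</code>, <code>communication_step</code>, and <code>recharge</code> are hypothetical.</p>
<p><![CDATA[
```python
from typing import List, Tuple

Position = Tuple[int, int]


def in_hearing_range(p: Position, q: Position, h: int) -> bool:
    """True if q lies in the (2h+1) x (2h+1) square centered on p."""
    return abs(p[0] - q[0]) <= h and abs(p[1] - q[1]) <= h


def communication_step(positions: List[Position],
                       knowledge: List[List[int]],
                       utterances: List[int],
                       h: int):
    """One simultaneous exchange: returns (rewards, updated knowledge).

    knowledge[i][p] is 1 if agent i knows piece p; utterances[j] is the
    piece agent j speaks (assumed to be a piece j already knows).
    """
    n = len(positions)
    rewards = [0] * n
    new_knowledge = [list(k) for k in knowledge]
    for i in range(n):
        heard_new = set()
        for j in range(n):
            if j == i or not in_hearing_range(positions[i], positions[j], h):
                continue
            piece = utterances[j]
            # Novelty is judged against knowledge at the start of the turn,
            # since information exchange is simultaneous.
            if knowledge[i][piece] == 0:
                heard_new.add(piece)
                rewards[j] += 1  # speaker j shared something new with i
        rewards[i] += len(heard_new)  # listener reward for new pieces
        for piece in heard_new:
            new_knowledge[i][piece] = 1
    return rewards, new_knowledge


def recharge(k_i: List[int], first_hand_i: List[int],
             on_base: bool, n: int, c: int):
    """Recharge bonus (n-1)*c if the agent is on its base knowing every
    piece; the agent then keeps only its first-hand information."""
    if on_base and all(k_i):
        return (n - 1) * c, list(first_hand_i)
    return 0, list(k_i)
```
]]></p>
<p>For example, with three agents of which only the first two are adjacent, each of those two earns one listening reward and one sharing reward in a turn where they exchange pieces the other lacks, while the third agent earns nothing.</p></div>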
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Baseline Learning Algorithms and Bounds</head><p>To learn a strong baseline policy for SymmToM, we use MADDPG <ref type="bibr">(Lowe et al., 2017)</ref>, a well-known multi-agent actor-critic framework with centralized training and decentralized execution, to counter the non-stationary nature of multi-agent settings. In MADDPG, each actor policy receives its observation space as input, and outputs the probability of taking each action. Notably, actors in MADDPG have no way of remembering past turns. This is a critical issue in SymmToM, as agents cannot remember which pieces they know, which ones they shared and with whom, or other interactions they witnessed. To mitigate this, it is necessary to add a mechanism to carry over information from past turns, for example by incorporating a recurrent network as RMADDPG <ref type="bibr">(Wang et al., 2020)</ref> does.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Perfect Information, Heuristic and Lower Bound Models</head><p>Performance is difficult to interpret without simpler baselines. As a lower bound model we use the original MADDPG, that since it does not have recurrence embedded, should perform worse or equal to any of the modifications described above. We also include an oracle model (MADDPG-Oracle), that does not require theory of mind since it receives the current knowledge K for all agents in its observation space. The performance of MADDPG-Oracle may not always be achieved, as there could be unobserved communication with multiple situations happening with equal probability. Moreover, as the number of agents and size of the grid increases, current reinforcement learning models may not be able to find an optimal spatial exploration policy; they may also not be capable of inferring the optimal piece of information to communicate in larger settings. In these cases, MADDPG-Oracle may not perform optimally, so we also include a baseline with heuristic agents to compare performance. Heuristic agents will always move to the center of the board and communicate round-robin all the information pieces they know until they have all the available knowledge. Then, they will move efficiently to their recharge base and come back to the center of the grid, where the process restarts. We must mention that this heuristic is not necessarily the perfect policy, but it will serve as a baseline to note settings where current multi-agent reinforcement learning models fail even with perfect information. Qualitatively, smaller settings have shown to approximately follow a policy like the heuristic just described.</p></div>
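<div xmlns="http://www.tei-c.org/ns/1.0"><p>The heuristic baseline described above can be sketched as a single decision function. This is only an illustration under assumptions that the text does not specify: the center is taken to be cell <code>(w // 2, w // 2)</code>, movement resolves the row before the column, collisions with other agents are ignored, and the round-robin index is the turn counter. The names <code>step_towards</code> and <code>heuristic_action</code> are hypothetical.</p>
<p><![CDATA[
```python
from typing import List, Optional, Tuple

Position = Tuple[int, int]


def step_towards(pos: Position, target: Position) -> Position:
    """Move one cell along a single axis so the distance to target shrinks."""
    (r, c), (tr, tc) = pos, target
    if r != tr:
        return (r + (1 if tr > r else -1), c)
    if c != tc:
        return (r, c + (1 if tc > c else -1))
    return pos  # already at the target: no move


def heuristic_action(pos: Position, base: Position, known: List[int],
                     turn: int, w: int) -> Tuple[Position, Optional[int]]:
    """Return (next position, piece to utter) for one heuristic agent.

    The agent heads to its recharge base once it knows every piece and
    to the center of the board otherwise, cycling through the pieces it
    knows in round-robin order.
    """
    center = (w // 2, w // 2)
    target = base if all(known) else center
    next_pos = step_towards(pos, target)
    known_pieces = [i for i, k in enumerate(known) if k]
    utterance = known_pieces[turn % len(known_pieces)] if known_pieces else None
    return next_pos, utterance
```
]]></p>
<p>After the environment applies a recharge, the agent's knowledge is no longer complete, so the same function sends it back to the center, which restarts the cycle.</p></div>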
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Explicit Modeling of Symmetric Theory of Mind</head><p>In contrast to RMADDPG <ref type="bibr">(Wang et al., 2020)</ref>, we specifically design algorithms for our environment. This will ensure that we test the current limits of performance with known multi-agent deep reinforcement learning models.</p><p>If even these models fail to solve the task, it will be a clear signal that more modeling research is needed, and that SymmToM will be a useful benchmark to develop and test on. Intuitively, our model computes a matrix <formula notation="tex">K \in \{0, 1\}^{c \times n}</formula> that reflects the information pieces known by each agent from the perspective of the agent being modeled: <formula notation="tex">K_{ij}</formula> reflects whether the agent being modeled believes that agent j knows piece i. K is updated every turn and used as input in the agent's following turn, obtaining the desired recurrent behavior. K is also concatenated to the usual observation space, to be processed by a two-layer ReLU MLP and obtain the probability distributions for speech and movement, as in the original MADDPG. There are several ways to approximate K. It is important to note that each agent can only partially observe communication, and thus it is impossible to perfectly compute K deterministically.</p><p>The current knowledge comprises first-hand information (the initial knowledge of every agent, F, publicly available) and second-hand information. Second-hand information may have been heard this turn (S, whose computation will be discussed below) or in previous turns (captured in the K received from the previous turn, denoted <formula notation="tex">K^{(t-1)}</formula>). Additionally, knowledge may be forgotten when an agent steps on a base having all the information pieces.</p><p>To express this, we precompute a vector <formula notation="tex">B \in \{0, 1\}^n</formula> that reflects whether each agent is currently on its base, and a vector <formula notation="tex">E \in \{0, 1\}^n</formula> that determines whether an agent is entitled to use its recharge base, i.e. whether it knows all c information pieces:</p><p><formula notation="tex">E_j = \mathbb{1}\Big[\sum_{i=1}^{c} K^{(t-1)}_{ij} = c\Big] \quad \text{for all } j \in \{1, \dots, n\}</formula></p><p>We are then able to compute K as follows:</p><p><formula notation="tex">K_{ij} = F_{ij} \text{ or } \big((S_{ij} \text{ or } K^{(t-1)}_{ij}) \text{ and not } (B_j \text{ and } E_j)\big) \qquad (2)</formula></p><p>F, <formula notation="tex">K^{(t-1)}</formula>, and B are given as input, but we have not yet discussed the computation of the second-hand information S. S often cannot be deterministically computed, since our setting is partially observable. We will identify three behaviors and then compute S as the sum of the three:</p><p><formula notation="tex">S = S^{[0]} + S^{[1]} + S^{[2]}</formula></p><p>For simplicity, we will assume that we are modeling agent k. <formula notation="tex">S^{[0]}</formula> will symbolize the implications of the information spoken by agent k: if agent k speaks a piece of information, it thus knows that every agent in its hearing range must have heard it (first order theory of mind). <formula notation="tex">S^{[1]}</formula> will symbolize the implications of information heard by k: this includes updating k's known information (zeroth order theory of mind) and the information of every agent that is also in hearing range of the speaker heard by k. <formula notation="tex">S^{[2]}</formula> will symbolize the estimation of information pieces communicated between agents that are out of k's hearing range. Since we assume perfect vision, k will be able to see if two agents are in range of each other, but not hear what they communicate (if they do at all).</p><p><formula notation="tex">S^{[0]}</formula> and <formula notation="tex">S^{[1]}</formula> can be deterministically computed. To do so, it is key to note that every actor knows the set of communicative actions <formula notation="tex">A \in \{0, 1\}^{c \times n}</formula> performed by each agent last turn, given that those actions were performed in its hearing range. Moreover, each agent knows which agents are in its range, as they all have perfect vision. We precompute <formula notation="tex">H \in \{0, 1\}^{n \times n}</formula> to denote whether two given agents are in range of each other.</p></div>
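<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the bookkeeping concrete, the following Python sketch shows one possible reading of the in-range matrix H and of the knowledge update in Eq. 2. It assumes that first-hand information always survives a recharge, that an agent is entitled to recharge exactly when the previous-turn belief says it knows all c pieces, and that matrices are stored as lists of rows indexed <code>[piece][agent]</code>. The function names are hypothetical and this is not the authors' code.</p>
<p><![CDATA[
```python
from typing import List, Tuple

Matrix = List[List[int]]


def hearing_matrix(positions: List[Tuple[int, int]], h: int) -> Matrix:
    """H[a][b] = 1 if agents a and b are within hearing range of each other."""
    n = len(positions)
    return [
        [int(abs(positions[a][0] - positions[b][0]) <= h
             and abs(positions[a][1] - positions[b][1]) <= h)
         for b in range(n)]
        for a in range(n)
    ]


def entitled(K_prev: Matrix) -> List[int]:
    """E[j] = 1 if agent j is believed to know every piece (assumption)."""
    c, n = len(K_prev), len(K_prev[0])
    return [int(all(K_prev[i][j] for i in range(c))) for j in range(n)]


def update_knowledge(F: Matrix, S: Matrix, K_prev: Matrix,
                     B: List[int], E: List[int]) -> Matrix:
    """K[i][j] = F[i][j] or ((S[i][j] or K_prev[i][j]) and not (B[j] and E[j])).

    Second-hand knowledge (S and K_prev) is wiped for an agent that is on
    its base and entitled to recharge; first-hand knowledge F is kept.
    """
    c, n = len(F), len(F[0])
    return [
        [int(F[i][j] or ((S[i][j] or K_prev[i][j]) and not (B[j] and E[j])))
         for j in range(n)]
        for i in range(c)
    ]
```
]]></p>
<p>In this sketch an agent standing on its base with full knowledge falls back to its first-hand column of F, while every other column of K simply accumulates newly heard pieces.</p></div>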
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Then, <formula notation="tex">S^{[0]}_{ij} = 1</formula> if and only if information piece i was said by k, and agents k and j are in hearing range of each other:</p><p><formula notation="tex">S^{[0]}_{ij} = A_{ik} \text{ and } H_{kj}</formula></p><p>Similarly, <formula notation="tex">S^{[1]}_{ij} = 1</formula> if and only if agent k (the actor we are modeling) heard some agent &#8467; speaking information piece i, and agent j is also in range of agent &#8467;. Note that agent k does not need to be in hearing range of agent j. More precisely,</p><p><formula notation="tex">S^{[1]}_{ij} = A_{i\ell} \text{ and } H_{k\ell} \text{ and } H_{\ell j}</formula> for any agent &#8467;.</p><p><formula notation="tex">S^{[2]}</formula>, the interactions between agents not in hearing range of the agent we are modeling, can be estimated in different ways. A conservative approach would be to not estimate interactions we do not witness (<formula notation="tex">S^{[2]} = 0</formula>, which we will call MADDPG-ConservativeEncounter (MADDPG-CE)); another would be to assume that every interaction we do not witness results in sharing a piece of information that will maximize the rewards in that immediate turn. We will call this last approach MADDPG-GreedyEncounter (MADDPG-GE). MADDPG-GE assumes agents play optimally, but the modeling agent does not necessarily know everything that the other agents know, which can lead to a wrong prediction. This is particularly true during training, as agents may not behave optimally. The computation of <formula notation="tex">S^{[2]}</formula> for MADDPG-GE is as follows.</p><p>First, we predict the information piece <formula notation="tex">U_\ell</formula> that agent &#8467; uttered. MADDPG-GE predicts that <formula notation="tex">U_\ell</formula> will be the piece, among those &#8467; knows, that the least number of agents in range know, as it will maximize immediate reward:</p><p><formula notation="tex">U_\ell = \arg\min_{i \,:\, K_{i\ell} = 1} \sum_{j \neq \ell} H_{\ell j} K_{ij}</formula></p><p>With this prediction, agent j will know information i if at least one agent in its range said it:</p><p><formula notation="tex">S^{[2]}_{ij} = 1 \text{ if and only if } H_{j\ell} = 1 \text{ and } U_\ell = i \text{ for some agent } \ell</formula></p><p>MADDPG-EE estimates the probability that an agent j uttered each piece of information (<formula notation="tex">U_j \in \mathbb{R}^c</formula>) by providing the current information of all agents in its range, <formula notation="tex">\{K_{1\ell}, \dots, K_{c\ell} \text{ for all } \ell \text{ where } H_{j\ell}\}</formula>, to an MLP f. Then, the probability of having heard a specific piece of information will be the complement of not having heard it, which in turn means that none of the agents in range said it:</p><p><formula notation="tex">S^{[2]}_{ij} = 1 - \prod_{\ell \,:\, H_{j\ell}} \big(1 - U_{\ell, i}\big)</formula></p><p>Since MADDPG-EE requires functions to be differentiable, we use a differentiable approximation of Eq. 2. Pseudocode for MADDPG-EE's implementation can be found in Section A.4. MADDPG-EE solely focuses on first order theory of mind, and we leave modeling with second order theory of mind to future work. The structure of the model would be similar but with an order of magnitude more parameters.</p></div>
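<div xmlns="http://www.tei-c.org/ns/1.0"><p>The three second-hand components described in this section can be sketched in plain Python as follows. This is an illustrative reading, not the authors' implementation. It assumes matrices indexed <code>[piece][agent]</code>, that the greedy prediction is restricted to pieces the speaker is believed to know, that ties are broken by the lowest piece index, and that only speakers outside the modeling agent's range are guessed. All function names are hypothetical.</p>
<p><![CDATA[
```python
from typing import List, Optional, Sequence

Matrix = List[List[int]]


def s0(A: Matrix, H: Matrix, k: int) -> Matrix:
    """Piece i was said by k, and j is in hearing range of k."""
    c, n = len(A), len(H)
    return [[int(A[i][k] and H[k][j]) for j in range(n)] for i in range(c)]


def s1(A: Matrix, H: Matrix, k: int) -> Matrix:
    """k heard some speaker l say piece i, and j is in range of l."""
    c, n = len(A), len(H)
    return [
        [int(any(A[i][l] and H[k][l] and H[l][j]
                 for l in range(n) if l != k))
         for j in range(n)]
        for i in range(c)
    ]


def greedy_utterance(K: Matrix, H: Matrix, l: int) -> Optional[int]:
    """Greedy guess: the piece known to l that fewest in-range agents know."""
    c, n = len(K), len(H)
    candidates = [i for i in range(c) if K[i][l]]
    if not candidates:
        return None
    listeners = [j for j in range(n) if j != l and H[l][j]]
    # min() keeps the first minimal candidate, i.e. the lowest piece index.
    return min(candidates, key=lambda i: sum(K[i][j] for j in listeners))


def s2_greedy(K: Matrix, H: Matrix, k: int) -> Matrix:
    """Guess what was shared in encounters that k could not hear."""
    c, n = len(K), len(H)
    S2 = [[0] * n for _ in range(c)]
    for l in range(n):
        if l == k or H[k][l]:
            continue  # k hears l directly, so nothing needs to be guessed
        u = greedy_utterance(K, H, l)
        if u is None:
            continue
        for j in range(n):
            if j != l and H[l][j]:
                S2[u][j] = 1
    return S2


def heard_probability(U: Sequence[Sequence[float]], speakers: Sequence[int],
                      piece: int) -> float:
    """Probabilistic variant: 1 minus the chance that no speaker said it."""
    p_none = 1.0
    for l in speakers:
        p_none *= 1.0 - U[l][piece]
    return 1.0 - p_none
```
]]></p>
<p>Setting the third component to all zeros corresponds to the conservative variant, <code>s2_greedy</code> to the greedy one, and <code>heard_probability</code> to the combination step of the probabilistic one once per-speaker utterance probabilities are available.</p></div>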
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Experiments</head><p>Next we compare the aforementioned algorithms. The observation space consists of a processed version of the last turn in the episode, to keep the input size controlled. More precisely, the observation space is composed of: the position of all agents, all recharge bases, the current direction each agent is moving towards, what they communicated in the last turn, the presence of a wall in each of the immediate surroundings, and every agent's first-hand information. First-hand information is publicly available in our experiments to moderate the difficulty of the setup<ref type="foot">foot_1</ref>, but this constraint could also be removed. To lift this constraint, one approach would be to assume that <formula notation="tex">F_{ij} = 0</formula> for every unknown piece of first-hand information, and learn K only based on heard interactions (modeled in <formula notation="tex">S^{[1]}</formula>).</p><p>We use reward as our main evaluation metric. This metric indirectly evaluates theory of mind capabilities, since information-seeking is at the core of SymmToM. We train for 60000 episodes, with 9 random seeds to account for high variance. Our policies are parametrized by a two-layer ReLU MLP with 64 units per layer, as in the original MADDPG <ref type="bibr">(Lowe et al., 2017)</ref>. MADDPG-EE's function f is also a two-layer ReLU MLP with 64 units per layer.</p><p>We test two board sizes (w &#8712; {6, 12}), two numbers of agents (n &#8712; {3, 4}), and three quantities of information pieces (c &#8712; {n, 2n, 3n}). Agents are placed randomly, and initial information is distributed randomly but equitably: each information piece is initially known by the same number of agents. Information exchange is simultaneous among agents. 
h = 1 for all our experiments: only agents' immediate neighbors will hear what they communicate.</p><p>Running experiments with the same number of turns for every setting would imply that agents can move less in combinations with larger values of w. Therefore, we set the length of each episode to 5w, to make the length of each episode proportional to the grid size. Since the duration of the experiment is directly proportional to the length of the episodes, we settled on a small multiplier. 5w allows agents to move to each edge of the grid and back to the center. More design and experimental details can be found in &#167;A.5.</p></div>
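<div xmlns="http://www.tei-c.org/ns/1.0"><p>The experimental grid described above can be enumerated with a short Python sketch. The dictionary keys and the function name <code>experiment_settings</code> are hypothetical. The values (board sizes, agent counts, piece counts as multiples of n, h = 1, and episode length 5w) are taken from the text.</p>
<p><![CDATA[
```python
from itertools import product


def experiment_settings():
    """Enumerate every (w, n, c) combination with h = 1 and 5w turns."""
    settings = []
    for w, n, multiplier in product((6, 12), (3, 4), (1, 2, 3)):
        settings.append({
            "w": w,                   # board side length
            "n": n,                   # number of agents
            "c": multiplier * n,      # information pieces: n, 2n, or 3n
            "h": 1,                   # only immediate neighbors can hear
            "episode_length": 5 * w,  # turns per episode, proportional to w
        })
    return settings
```
]]></p>
<p>This yields twelve settings, each of which keeps the hearing square of side 2h + 1 strictly smaller than the board.</p></div>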
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1.">Main Results</head><p>As we can observe in Table <ref type="table">1</ref>, there is a significant difference in performance between MADDPG-Oracle and MADDPG (MADDPG-Oracle is 127% better on average): this confirms that developing theory of mind and recurrence is vital to perform successfully in SymmToM. MADDPG-Oracle is often not an upper bound: when c &gt; n, the heuristic performs better (92% on average, see details in &#167;A.5). This shows that even with perfect information, it can be difficult to learn the optimal policy using MADDPG.</p><p>Moreover, models with recurrence perform significantly better than MADDPG (&#8764;60% better), showing that remembering past information gives a notable advantage. As expected, recurrent models tailored to our problem resulted in better performance than a vanilla LSTM (RMADDPG). The performance of the best of the tailored models (MADDPG-{CE,GE,EE}) was 42% better on average than plain RMADDPG. The LSTM was able to surpass the best of the tailored models only for n = 3, w = 12, c = 3n.</p><p>Table <ref type="table">1</ref>. Average reward per agent evaluated during 1000 episodes. 9 runs are averaged for each learned agent, using the best checkpoint to compensate for collapses in performance seen in Fig. <ref type="figure">5</ref>.</p><p>Increasing c generally decreases global rewards for learned agents (on average, c = 2n rewards are 74% of those for c = n, and c = 3n rewards are 76.5% of those for c = n). This suggests that probabilistic decisions are harder to learn, or impossible to successfully navigate when several events are equally likely. MADDPG-EE did not show improvements over the other agents, and in some cases performance decreased dramatically (e.g. w = 6, c = 3n). MADDPG-EE uses an MLP in its definition of <formula notation="tex">S^{[2]}</formula>, which provides flexibility but complicates learning. 
We leave exploration of other probabilistic agents to future work, but the significant performance gap between learned models and the MADDPG-Oracle / heuristic shows there is ample space for improvement in this task, and hence shows SymmToM to be a simple yet unsolved benchmark.</p><p>Increasing n results in an 11% reduction of performance on average for learned models. Nonetheless, the heuristic improved its rewards by an average of 46%, given the larger opportunities for rewards when including an additional listener. Overall, this implies that increasing n also makes the setup significantly more difficult. Finally, increasing w did not have a conclusive result: for n = 4 it consistently decreased performance by 17%, but for n = 3 we saw an improvement of 18% and 61% for c = n and c = 2n respectively, and a decrease of 27% for c = 3n.</p><p>In sum, modifying c and n provides an easy way of making a setting more difficult without introducing additional rules.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">Discussion</head><p>A classic example of a scenario specifically designed to test theory of mind is the Sally-Anne task <ref type="bibr">(Wimmer &amp; Perner, 1983)</ref>. This false belief task, originally designed for children, aims to test if a passive observer can answer questions about the beliefs of another person, in situations where that belief may not match reality. If we were to use it for machine theory of mind, we could repeat the experiment and ask an agent to predict the position of an object while varying the underlying conditions. This test is feasible because there is only one agent with freedom of action, which ensures that the desired conditions are met every time. We can set up a similar setting in SymmToM if we allow for manual control of all agents but one, as shown in Fig. <ref type="figure">3</ref>. Other tests besides the ones shown may be designed. In particular, in Fig. <ref type="figure">3d</ref> we show an example of probabilistic theory of mind where two communicative events are equally likely, but one could modify this scenario to have different probabilities and test the expected number of turns until red successfully shares an information piece. One could also design retroactive deduction tests: for example, in Fig. <ref type="figure">3d</ref> if red communicates and receives no reward, it can deduce that green had received that information from blue. If there had been another agent (e.g. a yellow agent) in range of blue when it spoke to green, the red agent could also update its knowledge about yellow. Results and full discussion for the proposed tests are detailed in App. A.1. Models generally failed the tests depicted in Fig. <ref type="figure">3a</ref> and <ref type="figure">3b</ref>, with significant variance between runs. As expected, w = 12 proved harder than the same test in a smaller grid. 
In w = 6 models often converged to a suboptimal but reasonable policy, whereas in w = 12 efficient movement to a suboptimal goal was nontrivial. Notably, the second-order theory of mind test (Fig. <ref type="figure">3c</ref>) averaged a &#8764;75% success rate, which we hypothesize is due to having a mobile agent that the tested agent perceives as feedback.</p><p>Post-hoc analysis also has its challenges in multi-agent settings, even in the most direct cases. Thanks to our reward shaping, using recharge bases is always the optimal move when an agent has all the information available: an agent will have a reward of (n - 1)c for using the base, whereas it can only gain up to n - 1 + c - 1 per turn if it decides not to use it. Even in this case, small delays in using the base may occur, for example if the agent can gather additional rewards on its path to the base. More generally, having multiple agents makes a specific behavior attributable to any of the several events happening at once, or to a combination of them.</p><p>Figure <ref type="figure">3</ref>. Example tests for 0th, 1st, 2nd order, and probabilistic Theory of Mind. We test red agents, immobilize gray agents, and control blue and green agents' movements. In Fig. <ref type="figure">3a</ref>, red will go to the top right if it remembers having heard the first piece, and to the left otherwise. In Fig. <ref type="figure">3b</ref>, red will move to the right if and only if it assumes that the two agents on the left played optimally (red cannot hear them). In Fig. <ref type="figure">3c</ref>, blue is controlled to ensure it will seek out the agent on the bottom left (its optimal play, in five moves). Red's optimal move is to meet blue, and hence it must only move to the bottom left, even if the agent currently there will not provide any reward. In Fig. <ref type="figure">3d</ref>, red will interact with green not knowing what blue previously shared with it. Red should be able to share the missing piece with green with an expected value of 1.5 turns.</p><p>Even though it may be difficult to establish causality when observing single episodes, we developed metrics that comparatively show which models are using specific features of the environment better than others. Reward can also be understood as a metric with a more indirect interpretation.</p><p>Post-hoc analyses of single episodes can also be blurred by emergent communication. Since agents were trained together, they may assign special meaning to specific physical movements or messages. Even though qualitatively this does not seem to be the case for the models presented, tests should also account for future developments.</p><p>This also implies that one should not over-interpret small differences in metrics.</p><p>We briefly describe the developed metrics below; full tables of results are available in Appendix A.2. All metrics are normalized by the number of agents (i.e., they show the score for a single agent). This allows for better comparison between n = 3 and n = 4 settings.</p><p>Unsuccessful recharge base rate: Average times per episode an agent steps on its recharge base without having all the information available (i.e. wrong usage of the recharge base). Note that an agent may step on its base just because it is on the shortest path to another cell. Therefore, a perfect theory of mind agent will likely not have zero on this score; but generally, lower is better. See A.2 Table <ref type="table">4</ref>.</p><p>Wrong communication piece selection: Average times per episode an agent attempted to communicate information it does not currently possess. In these cases, no communication happens. Lower is better. 
See A.2 Table <ref type="table">5</ref>.</p><p>Useless communication piece selection: Average times per episode an agent communicated an information piece that everyone in its hearing range already knew, when having a piece of information that at least one agent in its range did not know. Lower is better. See A.2 Table <ref type="table">6</ref>.</p><p>Useless movement: Average times per episode an agent moves away from every agent that does not have the exact same information it has, given that the agent does not currently possess all the information available. This means that the agent is moving away from any possible valuable interaction. Lower is better. See A.2 Table <ref type="table">7</ref>.</p><p>Appendix A.2 contains the full results tables. Briefly, we saw that MADDPG-CE and MADDPG-GE used recharge bases unsuccessfully at rates similar to Oracle, whereas RMADDPG performed 41% worse. Regarding information sharing, results suggest all models may be making wrong communication decisions, but RMADDPG is more biased towards sharing redundant information when in doubt, whereas MADDPG-CE and MADDPG-EE tend towards not communicating at all (the true effect of trying to share information one does not know).</p></div>
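<div xmlns="http://www.tei-c.org/ns/1.0"><p>The reward-shaping argument in the discussion above can be checked numerically. The following is a minimal Python sketch, not taken from the released code base: the function names are hypothetical, the two formulas are copied from the text, and we assume n denotes the number of agents and c the number of information pieces.</p>
<eg>
```python
def base_reward(n: int, c: int) -> int:
    """Reward for a successful recharge-base use, (n - 1) * c, as stated in the text."""
    return (n - 1) * c


def max_turn_reward_without_base(n: int, c: int) -> int:
    """Stated upper bound on the per-turn reward when skipping the base: (n - 1) + (c - 1)."""
    return (n - 1) + (c - 1)


# Settings evaluated in the paper: n in {3, 4} and c in {n, 2n, 3n}.
for n in (3, 4):
    for c in (n, 2 * n, 3 * n):
        gap = base_reward(n, c) - max_turn_reward_without_base(n, c)
        # Algebraically the gap factors as (n - 2) * (c - 1).
        assert gap == (n - 2) * (c - 1)
        print(f"n={n}, c={c}: base={base_reward(n, c)}, "
              f"no-base bound={max_turn_reward_without_base(n, c)}, gap={gap}")
```
</eg>
<p>Under these two formulas the gap equals (n - 2)(c - 1), so using the base is strictly better in every setting listed, and the two options only tie when n = 2 or c = 1.</p></div>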
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">Related Work</head><p>Theory-of-Mind has been studied for decades in cognitive science <ref type="bibr">(Premack &amp; Woodruff, 1978;</ref><ref type="bibr">Wellman, 1992;</ref><ref type="bibr">Astington &amp; Baird, 2005)</ref>. More recently, there has been work on developing agents that show that they can reason over the beliefs and goals of others <ref type="bibr">(Rabinowitz et al., 2018;</ref><ref type="bibr">Rescorla, 2015)</ref>. In many cases, models have been evaluated by being passive omniscient observers of a scene, either in a 2D <ref type="bibr">(Rabinowitz et al., 2018)</ref> or natural language <ref type="bibr">(Nematzadeh et al., 2018)</ref> world. Trained models are asked to predict the future given omniscient knowledge, but communication between observed agents is either nonexistent or handcrafted. In cases where the modeled agent is active in the scene, movement or speech may be restricted for some agents but not others, leading to an asymmetric dynamic. For example, MADDPG <ref type="bibr">(Lowe et al., 2017)</ref> has two tasks where oral communication is allowed, but there is only one speaker and the listener(s) have to react. Moreover, the speaker is immobile, in contrast to the listener(s). Other theory of mind speaker-listener tasks were evaluated only with two conversational agents, such as <ref type="bibr">Zhu et al. (2021)</ref>.</p><p>Work in reinforcement learning also often implicitly has some theory of mind modeling, especially in collaborative tasks. Even if the models can scale to multi-agent scenarios, models are often evaluated with only two agents for simplicity <ref type="bibr">(Wang et al., 2020;</ref><ref type="bibr">Jain et al., 2019)</ref>. Evaluating on only two agents often limits the opportunities for efficiently using higher-order theory of mind to solve a task: for example, agents never have to reason about i's assessment of j's modeling of k's mental state. 
One fully symmetric multi-agent task often used in reinforcement learning is cooperative navigation <ref type="bibr">(Lowe et al., 2017)</ref>, a collaborative task that requires agents to cover landmarks without collision. Agents need to estimate where other agents will move, thus modeling their mental states. Traditionally, this task does not allow explicit communication between agents, resulting in impoverished theory of mind capabilities <ref type="bibr">(Astington &amp; Baird, 2005)</ref>. Concurrently with our work, ToM2C <ref type="bibr">(Wang et al., 2021)</ref> extended cooperative navigation and another related task (target coverage) to allow communication. Since all agents are symmetric, ToM2C may be understood as an example of multi-agent Symmetric Machine Theory of Mind. Nonetheless, some key differences arise: ToM2C only allows for targeted communication between pairs of sender and receiver, preventing bystanders from making deductions about a specific sent message. Moreover, ToM2C only allows the sender to communicate its current estimation of the receiver's goals, whereas, as detailed in Section 2, SymmToM allows agents to communicate pieces of information that they estimate are not known to agents in their vicinity, but they never reveal this knowledge estimation directly. Information gathering plays a much more crucial role in SymmToM, and agents also need to be able to predict that other agents may forget information.</p><p>In the present work, we only focus on creating a task for analyzing complex reasoning over other agents' knowledge or lack thereof. Although theory of mind typically refers to reasoning over mental states, other aspects of theory of mind include understanding preferences, goals, intentions, and desires of others. 
Passive-observer benchmarks <ref type="bibr">(Gandhi et al., 2021;</ref><ref type="bibr">Shu et al., 2021)</ref> have been proposed for evaluating the understanding of agents' goals and preferences, as well as understanding agent intentions <ref type="bibr">(Ullman et al., 2009;</ref><ref type="bibr">Netanyahu et al., 2021)</ref>. Modeling is often analyzed by comparing to a human baseline, which is mainly possible due to the static nature of these datasets. Recently, <ref type="bibr">Tejwani et al. (2021;</ref><ref type="bibr">2022)</ref> developed a reinforcement learning framework called Social MDP, which incorporates social interactions into MDPs by reasoning recursively about the goals of other agents. As mentioned, reasoning about others' goals is another aspect of theory of mind, and it is complementary to our work. Social MDP's agents have full observation and only need to estimate other agents' goals based on their (fully observable) behavior. In SymmToM, in contrast to Social MDP, all agents have the same publicly known information-sharing goal. What is unknown to SymmToM agents is the full state, particularly what other agents know at a given time: agents' reasoning aims to deduce interactions they did not witness. Although our reward fosters collaboration, agents in SymmToM do not directly gain from any increase or decrease in others' rewards as in Social MDP. Moreover, Social MDP's task does not have verbal communication, limiting communication to physical signaling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="10.">Conclusions and Future Work</head><p>We defined a framework to analyze machine theory of mind (ToM) in a multi-agent symmetric setting, a richer and more realistic setup than theory of mind tasks currently used. Based on the four properties needed for symmetric theory of mind to arise, we provided a simplified setup on which to test the problem, and showed we can easily increase difficulty by growing the number of agents or communication pieces. Our goal in this work was not to solve symmetric theory of mind, but rather to give a starting point to explore more complex models in this area. Even with this minimal set of rules, SymmToM proves algorithmically difficult for current multi-agent deep reinforcement learning models, even when tailored to our specific task. We leave to future work the development of models that handle second-order theory of mind and beyond, and models that reevaluate past turns to make new deductions with information gained a posteriori (i.e., models that pass retroactive deduction tests). Another interesting direction is to replace the information pieces with constrained natural language: our communication sharing is binary, whereas in language there is flexibility to communicate different subsets of a knowledge base using a single sentence. It would also be interesting to test humans on this task. We hypothesize they may converge to a suboptimal policy (like the heuristic) due to human memory constraints and the difficulty of methodically updating and estimating knowledge. These should not be limiting factors for agents and thus we expect better performance in agent-agent interactions. We also think it would be valuable to test the differences in human performance if we alleviate memory limitations by allowing participants to take notes and re-watch past turns.</p><p>We test on the four examples shown in Figure <ref type="figure">3</ref>, adapting the examples to fit one of the grid sizes we already experimented on. 
For the tests described in Figure <ref type="figure">3a</ref> and Figure <ref type="figure">3b</ref>, we test two different grid sizes: w = 6 and w = 12. For the tests described in Figure <ref type="figure">3c</ref> and Figure <ref type="figure">3d</ref> we only test w = 12 and w = 6 respectively. Image depictions of the exact test configurations can be seen in Figure <ref type="figure">4</ref>.</p><p>We measure three metrics: average success rate (SR), average failure rate (FR), and ratio of average turns to succeed vs. optimum (RATSO). Note that the average success rate and average failure rate do not necessarily sum to 1, since these two metrics only include trials where the agent reached either of the two proposed outcomes. If, for example, the agent never moved from the starting point, the trial would count towards neither the average success rate nor the average failure rate. In addition, RATSO is the ratio between the average number of turns it took to succeed in successful trials and the optimum number of turns to succeed in a specific trial (lower is better, minimum possible value is 1.0).</p><p>For the tests in Figure <ref type="figure">4a</ref>, 4b, 4d, and 4e, the trial ends when the red agent reaches the hearing range of one of the two possible target agents. The test depicted in Figure <ref type="figure">4f</ref> is a pass/fail test: if red moves suboptimally at any point before meeting blue, the trial is declared as failed. This makes it a particularly difficult test to pass at random. Because of the nature of this second-order theory of mind test, we only report the average success rate. Finally, for the probabilistic theory of mind test (Figure <ref type="figure">4c</ref>) we want to measure how fast red can communicate all the information it has to green. The optimal number of turns is 1.5 (as described in Figure <ref type="figure">3</ref>). 
Since this test can end either when all information has been shared with green or when the maximum number of turns has been reached, we only report SR and RATSO. In other words, by design FR = 0 will always hold in this test.</p><p>Results are shown in Table <ref type="table">2</ref>. All tests show there is significant work to be done in improving agents' reasoning. Even in the Oracle setting, agents often fail the tests. For example, MADDPG-Oracle always fails the zeroth-order theory of mind test with w = 6 (depicted in Figure <ref type="figure">4a</ref>). This shows that the trained model has learned a suboptimal but reasonable policy, since it moves towards an agent that will earn it a reward. In contrast, a high 1 - SR - FR in the test in Figure <ref type="figure">4a</ref> shows that the agent never moved to the hearing range of either of the two possible "goal" agents, hence earning zero reward. Even though this would suggest MADDPG-GE performs the worst for this test, it is important to note that immobilizing agents introduces a new confounding variable (as all ad-hoc tests do, in line with what we argue in the main text). For example, if an agent sees that another one is not moving towards it, it might infer that this agent is judging the interaction as useless and avoid interaction as well. In a test with a movement-controlled agent instead of all immobilized ones (second-order theory of mind test, Figure <ref type="figure">4f</ref>) MADDPG-GE performed best among all learned agents, moving optimally in 75% of the trials. The success rate for the best of the untrained models was only 33%, showing that learning significantly improves performance on this test.</p><p>As expected, tests in smaller grids proved easier than the same test performed on agents trained in a larger grid (see results for 0th and 1st+2nd order theory of mind in Table <ref type="table">2</ref>; no model was successful for w = 12). 
Since reward signals tend to be more sparse in larger grids, all models show larger values of 1 - SR - FR. This may suggest that even efficiently moving towards a suboptimal goal may be a challenge, or that agents converged to a policy that places a larger weight on making deductions based on other agents' movements. For w = 6, the best success rates were shown by MADDPG-*E models, although they still show ample room for improvement. As can be seen in Table <ref type="table">3</ref>, even when the average success rate is low, there sometimes exist seeds with exceptional performance. Concretely, there was one MADDPG-EE seed that was able to solve the 0th-order test for w = 6 to perfection.</p><p>Finally, in the probabilistic theory of mind test we see that no agent was able to consistently have perfect success in the task (71% success rate was the highest achieved, by MADDPG-Oracle). This means that the red agent was not able to consistently communicate all the information to the green agent before the maximum number of turns was reached. Nonetheless, if we constrain ourselves to the successful trials, we see that MADDPG-EE was able to finish the test in less than twice the time of the theoretical optimum (1.89x, a similar rate to MADDPG-Oracle, whose RATSO was 2.1). This suggests that when agents succeed, they do so fairly quickly. For comparison, untrained agents have a RATSO between 5 and 7, showing the training procedure improved this metric significantly.</p><p>As we emphasized in the main text, many more tests can be proposed. Our released code base also allows for easily adding new tests to the suite.</p></div>
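<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the three test metrics defined above concrete, the following is a minimal Python sketch of how SR, FR, and RATSO could be computed from a list of trial outcomes. It is an illustration under stated assumptions rather than the released implementation: the Trial record, its field names, and the summarize function are hypothetical, and we assume every trial of a given test shares the same optimum number of turns.</p>
<eg>
```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Trial:
    """One test trial (hypothetical record; field names are assumptions)."""
    outcome: str  # "success", "failure", or "none" if neither outcome was reached
    turns: int    # number of turns the trial lasted


def summarize(trials: List[Trial], optimal_turns: float) -> Tuple[float, float, Optional[float]]:
    """Return (SR, FR, RATSO) for a non-empty list of trials of one test."""
    total = len(trials)
    success_turns = [t.turns for t in trials if t.outcome == "success"]
    failures = sum(1 for t in trials if t.outcome == "failure")
    sr = len(success_turns) / total
    fr = failures / total
    # RATSO only averages over successful trials; it is undefined if none succeeded.
    ratso = None
    if success_turns:
        mean_turns = sum(success_turns) / len(success_turns)
        ratso = mean_turns / optimal_turns
    return sr, fr, ratso


# Toy data: two successes, one failure, and one trial that reached neither outcome.
trials = [Trial("success", 3), Trial("success", 6), Trial("failure", 4), Trial("none", 10)]
sr, fr, ratso = summarize(trials, optimal_turns=3)
print(sr, fr, ratso)
```
</eg>
<p>In this toy example SR and FR do not sum to 1, because the trial that reached neither outcome counts towards neither rate; the quantity 1 - SR - FR discussed above is exactly the share of such trials.</p></div>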
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2. Post-hoc analyses</head><p>RMADDPG had the worst scores for unsuccessful recharge base use rate and useless communication piece selection count (see Tables <ref type="table">4</ref> and <ref type="table">6</ref>). RMADDPG scored 41% more than Oracle for unsuccessful base usage on average, and 64% more than Oracle on average for usage of a useless communication piece (in all our metrics, lower is better). The best tailored models (MADDPG-CE and MADDPG-GE) performed similarly to Oracle on average for these two metrics. In contrast, MADDPG-CE and MADDPG-GE performed significantly worse than Oracle for the wrong communication piece selection count (49% and 53% more than Oracle on average, see Table <ref type="table">5</ref>). This suggests that all models may be making wrong decisions, but RMADDPG is biased towards communicating redundant information whereas MADDPG-CE and MADDPG-EE tend towards not communicating at all (the true effect of trying to communicate something they are not allowed to). Further analysis is needed to understand whether these apparently wrong behaviors occurred in turns where the agent had all the information available to make a better move, or whether this is their default when they believe they have nothing of value to communicate. A priori, RMADDPG's bias seems more principled, but it still showed worse performance overall.</p><p>No learned model performed notably better in the useless movement metric (average differences in performance were less than 15%, see Table <ref type="table">7</ref>), suggesting that they perform pointless movements at similar frequencies. It is important not to overinterpret small differences in these metrics. For example, a useless movement may be a signal of emergent communication. Furthermore, an agent may communicate something suboptimal for its immediate reward, but this move may not affect its expected reward for the trial.  
</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>There are many possible competitive extensions. For example, if we ended the trial when an agent steps successfully on their base for the b-th time (b &gt; 1, to preserve the forgetting mechanism), giving that agent a positive reward and a negative one to all others, we would encourage competition.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>This simple setting is still partially observable, since the agents cannot hear interactions outside of their hearing range.</p></note>
		</body>
		</text>
</TEI>
