<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>The VoxWorld Platform for Multimodal Embodied Agents</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022 Summer</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10379209</idno>
					<idno type="doi"></idno>
					<title level='j'>LREC proceedings</title>
<idno>2522-2686</idno>
<biblScope unit="volume">13</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Nikhil Krishnaswamy</author><author>William Pickard</author><author>Brittany Cates</author><author>Nathaniel Blanchard</author><author>James Pustejovsky</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We present a five-year retrospective on the development of the VoxWorld platform, first introduced as a multimodal platform for modeling motion language, which has evolved into a platform for rapidly building and deploying embodied agents with contextual and situational awareness, capable of interacting with humans in multiple modalities, and exploring their environments. In particular, we discuss the evolution from the theoretical underpinnings of the VoxML modeling language to a platform that accommodates both neural and symbolic inputs to build agents capable of multimodal interaction and hybrid reasoning. We focus on three distinct agent implementations and the functionality needed to accommodate all of them: Diana, a virtual collaborative agent; Kirby, a mobile robot; and BabyBAW, an agent who self-guides its own exploration of the world.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Multimodal language understanding has been a topic of intense interest in the natural language processing community over at least the last half a decade. The role that perception plays in human language understanding has long been clear to much of the community and many efforts have attempted to build similar functionality into computer systems <ref type="bibr">(Pylyshyn, 1978;</ref><ref type="bibr">Waltz, 1978;</ref><ref type="bibr">Rehm and Goecke, 2000;</ref><ref type="bibr">Rohlfing et al., 2003;</ref><ref type="bibr">Pecher and Zwaan, 2005;</ref><ref type="bibr">Mooney, 2008;</ref><ref type="bibr">Scherer et al., 2012;</ref><ref type="bibr">Krishnamurthy and Kollar, 2013;</ref><ref type="bibr">Miller and Johnson-Laird, 2013)</ref>. There has also been interest in context-aware language understanding systems, and the role that situatedness and embodiment play in such processes. This is an even harder problem than merging computer vision and NLP into a single system, and perhaps because of this, interest has remained high. In 2016, at the LREC conference, we introduced a modeling language VoxML (Pustejovsky and <ref type="bibr">Krishnaswamy, 2016)</ref>, intended to capture the semantics necessary to construct 3D visualizations of concepts denoted by linguistic expressions; as well as VoxSim, later that year at COLING 2016 (Krishnaswamy and Pustejovsky, 2016), a software system that generated such visualizations from the VoxML semantics. Interest in this approach has persisted through the community <ref type="bibr">(Cohen, 2017;</ref><ref type="bibr">Quick and Morrison, 2017;</ref><ref type="bibr">Fischer et al., 2018;</ref><ref type="bibr">Abrami et al., 2020;</ref><ref type="bibr">Bonn et al., 2020;</ref><ref type="bibr">Henlein et al., 2020;</ref><ref type="bibr">Rodrigues et al., 2020;</ref><ref type="bibr">Tamari et al., 2020;</ref><ref type="bibr">Kozierok et al., 2021;</ref><ref type="bibr">Richard-Bollans, 2021)</ref>, and out of our initial goal to model visualized and situated object and event semantics grew an additional line of research: creating intelligent agent behaviors capable of performing reasoning over such semantics <ref type="bibr">(Krishnaswamy et al., 2017;</ref><ref type="bibr">Narayana et al., 2018)</ref>. VoxML and VoxSim became the platform on which to build such agent behaviors. However, intelligent agents can have many forms and many purposes. The process of developing such agents in time made clear what constituted the key components of a platform for rapidly developing multimodal embodied agents, what was essential and what was extraneous, and how to develop and sustain such a platform for continued use. In this paper we present a retrospective of the evolution of the VoxML/VoxSim platform, now collectively known as VoxWorld, discuss our sometimes surprising conclusions regarding the development of VoxWorld as a distinct resource, and demonstrate how we now facilitate development of new interactive agents that synthesize our work with various other contributions, representations, and pipelines from the NLP/CL and AI communities. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Earlier work in "agent architectures" detailed the arrangement of conceptual components, e.g., facts, goals, and plans, for particular types of intelligent control system. A canonical example in NLP is TRIPS <ref type="bibr">(Ferguson et al., 1998)</ref>. Other cognitive architectures for intelligent agents include SOAR <ref type="bibr">(Laird, 2019)</ref>, <ref type="bibr">ACT-R (Anderson et al., 1997)</ref>, and other Belief-Desire-Intention (BDI) agents <ref type="bibr">(Rao et al., 1995)</ref>. Explicit representation has largely fallen out of favor in modern AI, and earlier agent architectures have therefore adapted to integrate deep learning inputs. <ref type="bibr">Brooks (1991)</ref> presaged this move, but notably viewed the shortcomings of symbolic AI as including its lack of capturing either situatedness and embodiment with regard to robotics; the robotics community has, in turn, provided many treatments for robots that can plan and reason about actions, language, and social behaviors <ref type="bibr">(Breazeal et al., 2004;</ref><ref type="bibr">Dzifcak et al., 2009;</ref><ref type="bibr">Tellex et al., 2020)</ref>, with platforms for robotic development (e.g., <ref type="bibr">Thrun et al. (2000)</ref>, <ref type="bibr">Schermerhorn et al. (2006)</ref>, <ref type="bibr">Rusu et al. (2008)</ref>). Finally, along with the GPUs that facilitated the successes of deep learning came sophisticated graphics engines that allow developers, often with roots in the video gaming community, to develop photo-realistic simulators in which to develop and test intelligent agents and their ability to learn. Much of this work comes from the deep reinforcement learning community, where simulated environments are used for navigation, game-playing, and problem solving via deep RL <ref type="bibr">(Kempka et al., 2016;</ref><ref type="bibr">Kolve et al., 2017;</ref><ref type="bibr">Savva et al., 2017;</ref><ref type="bibr">Juliani et al., 2018;</ref><ref type="bibr">Yan et al., 2018;</ref><ref type="bibr">Savva et al., 2019)</ref>. These environmental platforms are not developed specifically to focus on communication, underspecification resolution, language grounding, or concept acquisition, though they may be used for these. One particular suite of work of note is the work done by the EASE Collaborative Research Center at the University of Bremen. VoxWorld and the EASE ecosystem are similar in some of the tasks that agents are being trained to do, e.g., moving objects around a virtual environment to achieve certain goals <ref type="bibr">(Kazhoyan and Beetz, 2019)</ref>. There are also compatibilities in information representation techniques, for example the semantics of expressing locations, relationships of objects <ref type="bibr">(Pomarlan and Bateman, 2020)</ref>, and affordances <ref type="bibr">(Be&#223;ler et al., 2020)</ref>. However, the EASE work is more geared toward teaching robots to perform human tasks than to receiving direction from a human on a cooperative task. The EASE work on hand tracking and motion interpretation <ref type="bibr">(Ramirez-Amaro et al., 2019;</ref><ref type="bibr">Rosskamp et al., 2020)</ref> is distinct from yet complementary with VoxWorld's use of gesture, gaze, etc., for communication. 
In VoxWorld, we provide a common architecture that is compatible with both explicit logical and implicit subsymbolic representations, making it a robust platform for situated, embodied, neurosymbolic AI. VoxWorld is unique compared to other game engine-based environments: whereas in other environments the constraints on and results of object interactions are often coded into the system ad hoc as needed for a particular scenario, VoxWorld runs on a rigorous composition of object, event, relation, attribute, and function semantics at runtime, where the results of actions in the world flow naturally from the semantics of the objects over which they are enacted, enabling generalizable situated reasoning. To our knowledge, VoxWorld is the first platform to bring the above areas together, with a core yet extensible architecture driven by VoxML, in a 3D simulated world that enables multimodal interaction with, and observation of, agents in real or virtual environments.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">What Makes an Agent?</head><p>Discussion of agents necessarily raises the question: what makes any given system an agent? Before the emergence of AI, many in philosophy held the view that action is to be explained in terms of the intentionality associated with an act. The concepts of "agency" and "agent" were mostly used to refer to the "exercise of the capacity to perform intentional actions" <ref type="bibr">(Anscombe, 1957;</ref><ref type="bibr">Davidson, 1963)</ref>. In psychology, agency entails intentionality <ref type="bibr">(Dennett, 1987)</ref> and it cannot be meaningfully argued that computational "agents" as currently known have intentionality. "Agents" in AI avoid this criticism by eschewing direct comparisons to human intelligence. There is little argument in the AI community about whether intelligent agents display "real" intelligence or not. They are simply systems that "can be viewed as perceiving [its] environment through sensors and acting upon that environment through actuators" <ref type="bibr">(Russell and Norvig, 2002)</ref>. These actuators can be articulators like limbs, as in robots, or simply routines that effect change upon the world, as in an automated thermostat or flight-booking system. Personal digital assistants are perhaps the most common intelligent agents in use; Siri, Alexa, etc., live in our pockets or our offices, where they retrieve information or execute specific tasks on command. Agents can also, in the context of virtual environments, exist entirely in simulation, where the only world they directly affect is one of data structures and pixels. Virtual worlds may be one of the most interesting and fruitful places to study agent behaviors, because it is much simpler in a virtual world to create an agent that is situated. A necessary condition on an situated agent is that it have an epistemic point of view associated with it, from which it can observe the world, and this has been an object of previous study <ref type="bibr">(Pustejovsky and Krishnaswamy, 2019)</ref>. Once the epistemic condition is imposed, the rendering of the virtual world from that point of view becomes a mode of presentation of the agent's understanding of the situation as encoded within the virtual environment. Thus the rendering serves as a rough analogue for human perception, by allowing an observer to perceive what the agent does. Shared perception is a critical component of human communication <ref type="bibr">(Kuhl, 1998;</ref><ref type="bibr">Pustejovsky and Krishnaswamy, 2020)</ref>. When co-situated in a space together, humans make some tacit assumptions about what the other people are aware of and how they may behave, but part of those behaviors may require the agent to not only have a point of view as above, but also an embodiment. Alexa cannot see what you are pointing at, and neither can she point herself. Thus embodied agents add new dimensions to human/agent interactions compared to voice-or text-only conversational agents <ref type="bibr">(Allouch et al., 2021)</ref>. To act or manipulate within their world (real or virtual), agents must be equipped with the appropriate output, such as actuators or inverse kinematic solvers to move their joints, and text-to-speech engines, for which there are many available solutions. 
If an agent can interact with the world, allowing humans to interact with it requires the inverse functionality; it must be able to recognize and interpret inputs in multiple modalities such as gesture, speech, gaze, and action, and solving these problems has driven the development of VoxWorld as a platform. In Sec. 4 of this paper, we provide an overview of the VoxWorld platform and detail the core components that we have developed to facilitate creation of agents that act in the world and interact with people. In Sec. 5, we describe three specific agent implementations (Fig. <ref type="figure">1</ref>): Diana, a virtual collaborative agent; Kirby, a mobile robot; and BabyBAW, an agent who self-guides its own exploration of the world, and discuss specific considerations for each that informed the development of VoxWorld. We conclude by describing how we are making VoxWorld available as a public resource.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">VoxWorld Platform Overview</head><p>At the core of VoxWorld are the VoxML modeling language and its real-time interpreter, VoxSim. VoxSim is built on the Unity game engine making it a clear companion to other Unity-based frameworks (e.g., Juliani et al. ( <ref type="formula">2018</ref>), see Sec. 5.3). VoxML models contextual and common-sense information about objects and events that is otherwise difficult to capture in unimodal corpora, e.g., balls roll because they are round. VoxML is particularly apt for capturing information about habitats <ref type="bibr">(Pustejovsky, 2013)</ref> and affordances <ref type="bibr">(Gibson, 1977)</ref>. Fig. <ref type="figure">2</ref> shows the affordance structure for a [[BLOCK]] voxeme, or the visual instantiation of a block object. In implementation, VoxML is saved in XML format in a directory accessible to a VoxWorld-based agent implementation.   Because these event programs are necessarily underspecified (i.e., "move to where?"), the complete operationalization of an event relies on parameters that are inferred from the composition of the arguments with the program (see Sec. 4.1).</p><p>VoxSim provides a mechanism to do this composition in real time, and accommodates symbolic and logical methods such as qualitative calculi (e.g., <ref type="bibr">Allen (1983)</ref>, <ref type="bibr">Randell et al. (1992)</ref>, <ref type="bibr">Gatsoulis et al. (2016)</ref>), or machine learning for automated inference (e.g., <ref type="bibr">Brockman et al. (2016)</ref>, <ref type="bibr">Raffin et al. (2019)</ref>). Such 3rd-party resources are easily integrated via TCP sockets, REST connections, or direct integration through Unity plugins. Fig. <ref type="figure">3</ref> shows the high-level architecture of Vox-World, absent any specific agent implementation.</p><p>Figure <ref type="figure">3</ref>: VoxWorld Platform Architecture The agent then becomes an executor of events, which can involve manipulating objects, moving about the world, or even participating in dialogue. Event execution is handled by the event manager, which composes, interprets, grounds, directs, and monitors events frame by frame, testing for completion and satisfaction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Action and Event Composition</head><p>The composition of events is driven by the semantics of VoxML. VoxML was designed to capture event semantics in a form that can be realized as a minimal model in a dynamic logic <ref type="bibr">(Pustejovsky and Moszkowicz, 2011)</ref> while also providing space to draw more specific information from the environment itself where necessary a la <ref type="bibr">Brooks (1991)</ref>. This section details how the theoretical foundations of VoxML were translated into a robust working system that accommodates arbitrary events given a well-formed representation, and the evolution of event compositionality in implementation from our initial efforts in <ref type="bibr">Krishnaswamy and Pustejovsky (2016)</ref> to the present. Initially in VoxSim, all input was required to be typed English in the imperative mood, which was part-of-speech tagged, dependency-parsed, and transformed into a simple predicate-logic format denoting an action and its arguments, e.g., put(the(black(block)), on(the(white(block)))). This format still forms the core of VoxSim processing, but where previously, each individual predicate in the parsed input had to be operationalized directly in C# for execution in Unity, event composition has since evolved to function directly from the VoxML semantics, which breaks down complex predicates into compositions of the aforementioned primitives. In the process, VoxML specifications of events have been changed from strongly-typed feature structures to more of a duck-typed language like Python, wrapped in an attribute-value matrix (AVM)-like data structure. This makes development in VoxWorld more tractable for NLP researchers, who are likely to be more familiar with Python than C#. Composing events, and interpreting and grounding predicates more generally, makes extensive use of reflection, a process that allows managed code to read its own metadata, and therefore find assemblies and execute methods by invoking them by name rather than as declared variables. Reflection also allows us to, rather than invoking a specific hard-coded name, invoke other methods that handle the interpretation of events using both VoxML semantics and the current state of the world, that in turn invoke handler methods that ground objects, attributes, relations, and functions to the current context. Fig. <ref type="figure">4</ref> shows a high-level diagram of the flow of control when composing an event program from VoxML semantics, including where information is incorporated from the world. For example, a [[PUT]] voxeme may have the following subevent structure, with arguments x, the agent, y, the object, and z, the destination, and conditional operators while and if (Fig. <ref type="figure">5</ref>):  ] primitive such that agent x executes an instruction to move y to z. P() is a pointer to an optional function. The semantics of <ref type="bibr">[[MOVE]</ref>] in simulation assumes this to be a path planner, that takes the subsequent array (the location of y, the destination, and y itself) as additional arguments. If no planner is specified, the object is moved directly in a straight line without regard for obstacles. VoxSim comes with an A* path planner available for use, but developers may supply their own through a C# class or any other endpoints that can supply a FIFO or indexable data structure of 3D waypoints.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Reasoning about Results and States</head><p>For an agent to reason about the world it inhabits, it must be able to track relations and changes of state that come to exist as the result of acting upon the environment. For instance, if "put the black block on the white block" is parsed and executed according to the process outlined in Sec. 4.1, the result of the event changes the state of the environment such that white block supports black block. This change of state is recorded in the affordances of [[BLOCK]] (see Fig. <ref type="figure">2</ref>). VoxWorld comes with a basic level of compositional spatial reasoning capabilities, based on various qualitative reasoning calculi (e.g., <ref type="bibr">Randell et al. (1992)</ref>, <ref type="bibr">Balbiani et al. (1998)</ref>, <ref type="bibr">Moratz et al. (2002)</ref>), and through the external endpoint connections, additional reasoning or inference clients can be integrated. These may be either symbolic or machine learning-driven, and the output of any reasoner may be interpreted in terms of VoxML semantics as in Fig. <ref type="figure">6</ref>. When building embodied agents that can both query and interpret relational predicates, relations in situated environments pose a unique problem. Formally, relations are first-order functions over multiple arguments that return a boolean, e.g., [is white block in front of black block?] ! {TRUE, FALSE}. In affordance and relation composition, VoxWorld treats relations with multiple arguments as a testable proposition like this, but in a situated context with linguistic grounding, relations also have another interpretation as locations or regions, exemplified by the nested predicate in front(the(x))). In this case, we treat the relational predicate as demanding interpretation as a causal result of satisfying the propositional equivalent, and return an element R 2 R 3 denoted by a configurational relation, therefore in front(x) becomes IN (x, R), which can be left as a testable proposition for any to-be-specified object x. Force dynamic relations like [[SUPPORT]] likewise trigger operations in the Unity physics engine. Using the same VoxML semantics for both types of relational computations allows a VoxWorld agent to easily perform both execution ("put the white block in front of the black block") and recognition ("where is the black block?"). This also makes changing the frame of reference simple: the inequality in the CONSTR of Fig. <ref type="figure">6</ref> just needs to be flipped to accommodate the opposite perspective. This can been done by directly mutating the encoding in the VoxML library, or though symbolic or machine learning methods triggered at runtime through dialogue, for example, if a human directs an agent to "put the white block in front of the black block" followed by "no, on the other side." Thereafter the agent can use the frame of reference adopted by its human partner.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Blackboard Architecture and PDA</head><p>Managing the inputs and dialogue state for even a unimodal agent is no trivial task. In speech alone, humans may jump back and forth across established context; managing multimodal inputs is an order of magnitude more complex as each modality may prompt a different kind of state update simultaneously. A particular point of user frustration with interactive agents comes when the agent falls into a rigid turn-taking pattern, and does not permit interruption or redirection <ref type="bibr">(Kozierok et al., 2021)</ref>. Human communication is asynchronous; we use multiple modalities in conjunction, and keep track of interlocutors' states while continuing the interaction. Therefore robust interactive agents should also have some of this capacity. To illustrate, consider the following scenario (see Fig. <ref type="figure">8</ref>). The agent should be able to take initiative to act upon partial information (e.g., "put the yellow block there," followed by pointing), and so it may start by picking up a yellow block in the environment and moving it in the direction of deixis even if the destination is not certain. At any point, though, the human partner may change their mind or decide that the agent needs to be corrected (e.g., "no, on the white one"). There is no doubt that a person would be able to follow these instructions, and an interactive agent should support the same. Therefore in the process of developing various interactive agents and their respective dialogues, we incorporated two relatively old but powerful ideas from early AI. The first is a blackboard. Proposed in the 1970s for the Hearsay NLU system <ref type="bibr">(Erman et al., 1980)</ref>, blackboard architectures facilitate asynchronous updates from arbitrary knowledge sources that are managed by a control shell. Our blackboard was originally developed specifically for the Diana agent (Sec. 5.1) but has now been incorporated into VoxWorld directly as a convenient general-purpose data structure for managing third-party inputs. It is a strongly-typed key-value store in a modified singleton pattern where the control shell allows member functions of subclasses of the blackboard's ModuleBase class to subscribe to any or all keys on the blackboard and execute upon changes to the associated values <ref type="bibr">(Strout, 2020)</ref>. Fig. <ref type="figure">7</ref> shows the blackboard integration with knowledge sources used by the Diana agent (Sec. 5.1). Similar endpoints are used for the Kirby agent. The "Event and Dialogue Management" box connects to the rest of VoxSim (the purple box) via the event manager. Orange boxes denote agent outputs. Red boxes denote custom recognizers. Green boxes denote 3rd-party recognizers. The second is a variant on a pushdown automaton (PDA), used for higher-level dialogue management. The states are coarse-grained states in the dialogue (e.g., is the agent engaged with a human?, is the agent answering a question?, is the agent learning something new?, etc.), and the transition relation between states may be dictated by a custom-defined stack symbol class as suits the developer's needs, or may even exploit the entire blackboard for use as a stack symbol. Since the size of the stack symbol may be arbitrarily large, we write the transition relation in turns of satisfiable predicates. 
Therefore, an arbitrary condition may be defined in the transition relation, and a stack symbol that satisfies that condition may be used to trigger the associated state transition. In doing so, we exploit a continuation-passing style semantics (Van Eijck and Unger, 2010) to facilitate the asynchronous exchange of information across the blackboard while maintaining the separation of different high-level dialogue states (i.e., the agent must note and store various pieces of information before switching states).<ref type="foot">foot_0</ref> When the PDA enters a new state, an equivalently-named function is executed if one exists. Any change that is to be effected on the agent's world can be written into these functions. More details on this portion of the architecture can be found in <ref type="bibr">Krishnaswamy and Pustejovsky (2019b)</ref>. VoxWorld developers can instantiate new blackboard keys by simply writing a new arbitrary string key and value type to the blackboard and then subscribing functions to changes on that key, as sketched below. They can also create custom PDAs by typing the contents of their stack symbol and then adding states and transition relations to their PDA class. Functions that are called upon changes to the blackboard or PDA can thereafter access the agent and the world to drive interaction.</p></div>
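<p>The blackboard pattern described above can be summarized in a minimal sketch: a key-value store in which modules subscribe callbacks to keys and are notified when the associated values change. This is an illustration of the design only, not the actual ModuleBase and control shell implementation.</p>
<p><![CDATA[
using System;
using System.Collections.Generic;

// Minimal sketch of a blackboard: modules subscribe to keys and are
// called back when the associated values change. Illustrative only;
// the VoxWorld blackboard adds typing, a control shell, and ModuleBase.
public sealed class Blackboard
{
    public static Blackboard Instance { get; } = new Blackboard();

    private readonly Dictionary<string, object> values = new Dictionary<string, object>();
    private readonly Dictionary<string, List<Action<string, object>>> subscribers =
        new Dictionary<string, List<Action<string, object>>>();

    private Blackboard() { }

    // A module (e.g., a gesture or speech recognizer wrapper) subscribes
    // a callback to a key of interest.
    public void Subscribe(string key, Action<string, object> callback)
    {
        if (!subscribers.ContainsKey(key))
            subscribers[key] = new List<Action<string, object>>();
        subscribers[key].Add(callback);
    }

    // Writing a value notifies every subscriber of that key.
    public void Write(string key, object value)
    {
        values[key] = value;
        if (subscribers.TryGetValue(key, out var callbacks))
            foreach (var cb in callbacks)
                cb(key, value);
    }

    public object Read(string key) =>
        values.TryGetValue(key, out var v) ? v : null;
}

// Usage (hypothetical keys and values):
// Blackboard.Instance.Subscribe("user:lastGesture",
//     (k, v) => Console.WriteLine($"Gesture observed: {v}"));
// Blackboard.Instance.Write("user:lastGesture", "point(left)");
]]></p>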
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Agent Implementations</head><p>In this section we detail 3 specific agents that use the VoxWorld platform, accompanying agent architectures, and underlying semantics. These agents all have different capabilities and use cases, and demonstrate how VoxWorld can use the same framework to create diverse agents. We enumerate key lessons learned in the development of each agent type that shaped the development of VoxWorld as a platform.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Diana</head><p>Diana <ref type="bibr">(Krishnaswamy et al., 2017;</ref><ref type="bibr">Narayana et al., 2018;</ref><ref type="bibr">McNeely-White et al., 2019;</ref><ref type="bibr">Krishnaswamy et al., 2020a;</ref><ref type="bibr">Krishnaswamy et al., 2020b</ref>) is an interactive multimodal agent intended to develop and test peer-to-peer human-computer communication. With multiple modalities even a simple use case like Blocks World presents a challenge of extracting and integrating meaning from each modality. These are the challenges that VoxSim, and later VoxWorld, was explicitly designed to address. Advancing the state of the art in peer-to-peer human-computer communication necessarily entailed a deep study of how humans conduct the same task. Therefore, the set of gestures that Diana interprets were derived from EGGNOG, a dataset of human-human interactions in a shared Blocks World construction task <ref type="bibr">(Wang et al., 2017)</ref>.</p><p>Figure <ref type="figure">8</ref>: Diana reaches for a block to demonstrate an interpretation of the person's deixis. Diana provided the first use case of an agent that could not be isolated to the Unity environment; an agent designed to interpret gesture and speech must have access to those inputs. The gestures of Diana's human interlocutor are recognized via custom deep CNNs over depth video. Speech recognition is currently handled using Google Cloud ASR. Outputs from these endpoints are posted to the blackboard for processing, facilitating integration of diverse multimodal inputs such as new gesture or speech engines.</p><p>One key lesson learned during the development of Diana was that it is not simply enough to give an agent access to multimodal sensor data; it must know how to react to those inputs even with incomplete information. For example, if the human indicates an object but not what to do with it, if Diana receives that information, she must react in a way that demonstrates that receipt. If she does not react, the system has no explicit way of moving the interaction forward. Fig. <ref type="figure">8</ref> shows this, where the human points to a the purple block, and Diana demonstrates her understanding by reaching for the purple block in turn.<ref type="foot">foot_1</ref> Diana has been used for studying referring expressions <ref type="bibr">(Krishnaswamy and Pustejovsky, 2019a;</ref><ref type="bibr">Krishnaswamy and Pustejovsky, 2020)</ref>, human-computer interaction <ref type="bibr">(Pustejovsky and Krishnaswamy, 2021b;</ref><ref type="bibr">Krishnaswamy and Pustejovsky, 2021)</ref>, and object affordances <ref type="bibr">(Pustejovsky and Krishnaswamy, 2021a)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.1.">Diana Evaluation</head><p>Diana was evaluated in the summer of 2021 with the aim of assessing whether naive users could interact with her to build specific block structures without prior instruction. We assessed our evaluation based on task completion rate and user satisfaction according to the System Usability Scale (SUS), a common HCI metric <ref type="bibr">(Brooke, 1996;</ref><ref type="bibr">Bangor et al., 2008)</ref>. Thirty subjects, evenly divided between men and women, aged 18-57 (&#181; = 27, = 11.8) participated in the evaluation. Users were asked to collaborate with Diana to build a variety of block structures with a 10minute time limit. Diana achieved a high task completion rate of 90%. Out of 240 total trials, only 24 could not be completed within the time limit. According to the SUS, 68 is considered "average" and 80 or above is considered "excellent." Diana achieved a mean SUS of 74.3 ( = 8.2), with scores ranging from 67.5-90. Only four scores with a 67.5 missed the "average" mark of 68 and eight SUS scores (27% of the total) received scores above 80. Qualitative feedback from participants was also positive and highlighted the multimodal aspect of the interaction, e.g., "the combination of speech and gesture at the same time was useful and unique."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.2.">Diana in a Web Browser</head><p>The fully-featured Diana system relies on some specialized hardware, such as Kinect cameras, but the underlying interactive mechanisms can be used independently of these. We subsequently took the Diana system and deployed a version that can run natively in a web browser, using the mouse for deixis and the Web-Speech API <ref type="bibr">(Adorf, 2013)</ref> for speech recognition. By simply switching out the endpoints, we keep the core interaction between Diana and the human intact. The same control processing shown in Fig. <ref type="figure">4</ref> and the same underlying VoxML semantics drive the dialogue. The web-deployable version of Diana uses Unity's We-bGL build functionality. As such, the VoxWorld plat- form needed to be made maximally compatible with the restrictions of WebGL. This involved 3 key modifications:</p><p>1. Web builds restrict access to external files and directory trees. The solution is to compile all required VoxML encoding files and directories directly into the final binary as Unity resources. 2. No just-in-time (JIT) code is allowed by WebGL, so the functional PDA architecture, which encodes stack symbol states in the transition relation as satisfiable predicates, cannot be accommodated, and so must be deactivated for web builds. The blackboard architecture, which is compiled ahead-of-time (AoT), can still be used, so all behaviors need to be written using the blackboard. 3. Certain 3rd-party dependencies which VoxWorld initially incorporated are not compatible with We-bGL (e.g., Google ASR), therefore, removing 3rdparty dependencies from the core VoxWorld platform and leaving their inclusion to the discretion of individual agent developers is key (see Sec. 7)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Kirby</head><p>An embodied agent in a purely virtual world can directly access parameters of the world, such as exact locations of objects. Therefore the real challenge for embodied agent behaviors comes when they must be enacted in the real world where inference is noisier. Krajovic et al. ( <ref type="formula">2020</ref>) presented a prototype of a mobile robotic agent built on the VoxWorld platform. This agent, Kirby, uses the same gesture and speech recognition components as Diana, but exists not only in a virtual world, but as a real mobile robot (specifically a modified GoPiGo3 outfitted with a LIDAR) using the common Robot Operating System (ROS). Kirby functions as an agent that can navigate through locations where the human is not physically present or cannot go. As Kirby navigates its environment, it builds up a coherent model of obstacles in the environment using the LIDAR data, and of items in the environment using object or fiducial detection from its onboard camera. The human can point and gesture to guide Kirby (e.g., beckoning for Kirby to move toward the human), along with speech (e.g., "go there", "go to the green one", etc.), and can use gestures to change camera views (through swiping) or rotate the camera in three dimensions (through iterative directional gestures). Fig. <ref type="figure">9</ref> depicts Kirby's interface. We use a Redis store connected via VoxSim's socket connections to exchange messages with the ROS client running on the robot, and the blackboard manages speech and gesture inputs while the PDA manages dialogue state.</p><p>One key lesson learned in building Kirby was the ease with which Diana's blackboard architecture can be used to manage inputs, as demonstrated by the reuse of Diana's gesture set in the Kirby use case. This prompted the integration of the blackboard architecture, initially created for Diana, into VoxWorld directly. An evaluation of Kirby was planned but had to be canceled due to the COVID-19 global pandemic and an inability to hold trials with in-person subjects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">BabyBAW</head><p>Diana and Kirby are deterministic agents. Their behaviors, while customizable, are programmed for known use cases. A known set of inputs will lead to a known set of outputs or actions. However, 3D environments and embodied simulation are also very useful for exploration and learning through interaction with the world. This is a common area of research in reinforcement learning <ref type="bibr">(Aluru et al., 2015;</ref><ref type="bibr">Kolve et al., 2017;</ref><ref type="bibr">Savva et al., 2017;</ref><ref type="bibr">Juliani et al., 2019)</ref> and developmental psychology <ref type="bibr">(Battaglia et al., 2013;</ref><ref type="bibr">Ullman et al., 2017)</ref>.</p><p>BabyBAW is an agent that learns through interaction with the environment. It can be initialized with different levels of underlying knowledge (e.g., knowledge of gravity, different object properties, or different actions), and given tasks to test what it can accomplish. It uses neural networks, symbolic reasoning, and embodied simulation for their respective strengths ("Best of All Worlds") to approximate certain aspects of infant and child learning <ref type="bibr">(Hartshorne and Pustejovsky, 2021)</ref>.</p><p>Learning from exploration and interaction is an obvious problem for reinforcement learning (RL). To develop a VoxWorld agent for RL, we focused on integrating two common, well-developed RL platforms: Unity ML-Agents <ref type="bibr">(Juliani et al., 2018)</ref> and OpenAI Gym <ref type="bibr">(Brockman et al., 2016)</ref>. The goal here was to make building environments and tasks for BabyBAW and testing multiple environmental and architectural configurations as simple and rapid as possible. BabyBAW agent behaviors interface directly with the Unity ML-Agents API and the VoxWorld event management, relational reasoning, and interactive architecture components.</p><p>Our current experiments in BabyBAW are based on explorations of infant intuition about objects and support relations from developmental psychology. At slightly more than 6 months old, most infants appear able to intuit than an object will not fall if supported from the bottom on over 50% of its lower surface <ref type="bibr">(Baillargeon et al., 1992;</ref><ref type="bibr">Dan et al., 2000;</ref><ref type="bibr">Huettel and Needham, 2000;</ref><ref type="bibr">Spelke and Kinzler, 2007)</ref>. Therefore, an RL algorithm should be able to solve for a policy that resembles this intuition in a stacking task.</p><p>Due to the popularity of OpenAI Gym and increasing adoption of Unity ML-Agents, successful Vox-World integration with these tools allows the use in turn of other packages that are compatible with them. In the current work, we use the Stable-Baselines3 package <ref type="bibr">(Raffin et al., 2019)</ref>, a set of reliable implementations of RL algorithms written in PyTorch. For a continuous action space, we use a DDPG or TD3 algorithm <ref type="bibr">(Lillicrap et al., 2016;</ref><ref type="bibr">Fujimoto et al., 2018)</ref> and explore training an agent to stack objects in a 3D world (Fig. <ref type="figure">10</ref>). We evaluate BabyBAW in this task in Sec. 5.3.1 </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.1.">BabyBAW Evaluation</head><p>We evaluated BabyBAW in VoxWorld to test its ability to learn a stacking task, given the ability to extract measurements of certain concepts from the environment. We evaluate two agents: one trained with the mechanism to measure height and center of gravity, and a baseline without those capabilities.</p><p>Our initial evaluation uses a 2D action that corresponds to a location in 3D space calculated relative to the surface of the destination block. The optimal solution places the theme object exactly centered atop the destination block and the agent must solve for an action that corresponds to that event in VoxWorld. To increase the problem complexity, we can make the action space (when rescaled in the 3D environment) arbitrarily large so that the optimal solution lies in a very small section of the rescaled action space, and can perturb the action space through VoxWorld so that the optimal solution may not lie in the exact center of the action space. Each action (attempt to stack) is one timestep and a max of 10 timesteps are allowed per episode. The agent receives a negative reward for missing the destination block entirely, a small positive reward for touching the destination block with the theme block even if it falls off, and a large positive reward for stacking successfully, with a 10% decay on each additional attempt. Fig. <ref type="figure">11</ref> shows the training reward plots for three Baby-BAW stacking policies. The observation space is defined by the height of the stack and center of gravity of the stack relative to that of the bottom object. The blue plot (the accurate policy is well-optimized), while the red plot (the imprecise policy) is less so. In the green plot, where the reward starts climbing around timestep 350, the action space was perturbed so the optimal policy is far from the center of the action space, to test the algorithm's ability to generalize. Max reward is 1000, and policies were trained for 2000 timesteps.</p><p>Figure <ref type="figure">12</ref>: Reward vs. evaluation episodes for Baby-BAW and baseline. Episodes terminate upon successful stacking, so more episodes elapse in the 100 testing timesteps using the trained model than the baseline. When BabyBAW is given the mechanism to extract veridical knowledge of height and center of gravity from the environment, it can rapidly solve a stacking task in real time (&#8672;15 mins.), without even speeding up rendering. Fig. <ref type="figure">12</ref> shows the reward and cumulative mean reward per episode for a trained model (solid red and green lines) vs. the baseline trained without the height and center of gravity information (dashed blue and orange lines). A mean reward close to 1000 means BabyBAW frequently stacked the blocks on the first try.</p><p>Ongoing BabyBAW work involves learning about object properties from differences in their behavior (e.g., what happens when BabyBAW tries to stack a sphere on top of a cube?), and correlating those differences in behavior to novel object classes and linguistic labels.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Timeline of Selected Publications</head><p>This section provides a brief list of selected prior VoxWorld-related publications.</p><p>&#8226; LREC 2016: <ref type="bibr">(Pustejovsky and Krishnaswamy, 2016)</ref>. First publication of the VoxML modeling language.</p><p>&#8226; COLING 2016: <ref type="bibr">(Krishnaswamy and Pustejovsky, 2016)</ref>. First demonstration of the VoxSim software.</p><p>&#8226; IWCS 2017: <ref type="bibr">(Krishnaswamy et al., 2017)</ref>. First publication of the Diana agent.</p><p>&#8226; IntelliSys 2018: <ref type="bibr">(Narayana et al., 2018)</ref>. Diana agent with integrated gesture and speech.</p><p>&#8226; IEEE HCC 2019: <ref type="bibr">(McNeely-White et al., 2019)</ref>. "Modern" Diana agent with blackboard architecture.</p><p>&#8226; AAAI 2020: <ref type="bibr">(Krishnaswamy et al., 2020b)</ref>. Public demo of the Diana agent.</p><p>&#8226; RoboDial 2020: <ref type="bibr">(Krajovic et al., 2020)</ref>. First publication of the Kirby agent.</p><p>&#8226; CogSci 2022: <ref type="bibr">(Krishnaswamy and Ghaffari, 2022)</ref>. BabyBAW agent used for novel concept detection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusion</head><p>Interaction takes many forms. VoxWorld is intended to accommodate as many as possible with an extensible, event-centric language semantics and straightforward pipeline. In this paper we detailed how VoxWorld evolved from the theoretical VoxML framework into a distinct platform targeted to developers of embodied reasoning agents, and discussed key lessons learned through the development of different agent types. One final takeaway is the importance of keeping VoxWorld independent of 3rd-party packages, thus allowing developers to use their preferred methods for things like animation, speech recognition or text-to-speech. We have refactored the VoxWorld API to allow developers to incorporate their own endpoints as universally as possible, with a combination of C# interface classes and event handlers, allowing us to make stable builds of VoxWorld available as a single Unity package.</p><p>The bleeding edge version of the source code is at <ref type="url">https://github.com/VoxML/VoxSim</ref>, and we have created a sample project with a simple interaction to let interested researchers get started quickly at <ref type="url">https://github.com/VoxML/VoxWorld-QS</ref>.</p><p>Online documentation is under construction at <ref type="url">https://www.voxicon.net/api/</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>This functional design pattern is not compatible with web builds (Sec. 5.1.2).</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>A video of Diana can be viewed here.</p></note>
		</body>
		</text>
</TEI>
