<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Vocal Sandbox: Continual Learning and Adaptation for Situated Human-Robot Collaboration</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 8th Conference on Robot Learning (CoRL)</publisher>
				<date>11/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10567467</idno>
					<idno type="doi"></idno>
					
					<author>Jennifer Grannen</author><author>Siddharth Karamcheti</author><author>Suvir Mirchandani</author><author>Percy Liang</author><author>Dorsa Sadigh</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We introduce Vocal Sandbox, a framework for enabling seamless human-robot collaboration in situated environments. Systems in our framework are characterized by their ability to adapt and continually learn at multiple levels of abstraction from diverse teaching modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To enable such adaptation, we design lightweight and interpretable learning algorithms that allow users to build an understanding of and co-adapt to a robot's capabilities in real-time, as they teach new behaviors. For example, after demonstrating a new low-level skill for "tracking around" an object, users are provided with trajectory visualizations of the robot's intended motion when asked to track a new object. Similarly, users teach high-level planning behaviors through spoken dialogue, using pretrained language models to synthesize behaviors such as "packing an object away" as compositions of low-level skills - concepts that can be reused and built upon. We evaluate Vocal Sandbox in two settings: collaborative gift bag assembly and LEGO stop-motion animation. In the first setting, we run systematic ablations and user studies with 8 non-expert participants, highlighting the impact of multi-level teaching. Across 23 hours of total robot interaction time, users teach 17 new high-level behaviors with an average of 16 novel low-level skills, requiring 22.1% less active supervision compared to baselines and yielding more complex autonomous performance (+19.7%) with fewer failures (-67.1%). Qualitatively, users strongly prefer Vocal Sandbox systems due to their ease of use (+20.6%), helpfulness (+10.8%), and overall performance (+13.9%). Finally, we pair an experienced system-user with a robot to film a stop-motion animation; over two hours of continuous collaboration, the user teaches progressively more complex motion skills to shoot a 52 second (232 frame) movie.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Effective human-robot collaboration requires systems that can seamlessly work with and learn from people <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref>. This is especially important for situated interactions <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref> where robots and people share the same space, working together to achieve complex goals. Such settings require robots that continually adapt from diverse feedback, learning and grounding new concepts online. Yet, recent work <ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref> remains limited in the types of teaching and generalization it permits. For example, many systems use language models to map user utterances to sequences of skills from a static, predefined library <ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref>. While these systems may generalize at the plan level, they trivially fail when asked to execute new low-level skills - regardless of how simple that skill might be (e.g., "hold this still"). Instead, we argue that situated human-robot collaboration requires learning and adaptation at multiple levels of abstraction, empowering collaborators to continuously teach new high-level planning behaviors and low-level skills over the course of an interaction.</p><p>Building on this insight, we present Vocal Sandbox, a framework for situated human-robot collaboration that enables users to teach through diverse modalities such as spoken dialogue, object keypoints, and kinesthetic demonstrations. 
Systems in our framework consist of a language model (LM) planner [19] that maps user utterances to sequences of high-level behaviors, and a family of low-level skills that ground each behavior to robot actions. Crucially, we design lightweight and interpretable learning algorithms to incorporate a user's teaching feedback, dynamically growing the capabilities of the system in real-time. Consider a user and robot collaborating to film a LEGO stop-motion animation (Fig. <ref type="figure">1</ref>). <ref type="foot">1</ref> The user is the animator, articulating LEGO figures and structures to rig each keyframe (Fig. <ref type="figure">1</ref>; Bottom), while the robot carries out precise camera motions (e.g., tracking smoothly to zoom into a character). Early in the interaction, the user asks: "Let's get a tracking shot around the Hulk?" In response, the LM attempts to generate a plan, but fails - "tracking" is not a known concept. The robot vocalizes its failure, deferring to the user for next steps. Empowered by our framework, the user chooses to explicitly teach the robot how to "track" by providing a kinesthetic demonstration. Using this single example as supervision (Fig. <ref type="figure">1</ref>; Left), we immediately parameterize a new skill (i.e., &#946; track ), and synthesize the corresponding behavior (i.e., track(loc: Location)). Later (Fig. <ref type="figure">1</ref>; Right), when the user directs the robot to "push in on Loki and then track around the tower," the robot can immediately use what it has learned to generate both the full task plan (zoom_in(Loki); track(Tower)) and a visualization of the motion trajectory (shown via a custom interface; &#167;3.3). On user confirmation, the robot executes in the real world, capturing the desired shot.</p><p>We evaluate Vocal Sandbox through systematic ablations and long-horizon user studies. 
In our first setting, non-expert participants work with a Franka Emika fixed-arm manipulator for the task of collaborative gift bag assembly; we run a within-subjects user study with N = 8 users spanning a total of 23 hours of robot interaction time, comparing a Vocal Sandbox system with two baselines: a static variant of our system without learning at either the high or low level <ref type="bibr">[8]</ref>, as well as a variant of our system with only high-level language teaching. We show that our contributions enable users to teach 17 new high-level behaviors and an average of 16 new low-level skills, requiring 22.1% less active supervision and yielding more complex autonomous performance (+19.7%) with fewer failures (-67.1%) compared to baselines; qualitatively, participants strongly prefer our system over all baselines due to its ease of use (+20.6%), helpfulness (+10.8%) and overall performance (+13.9%). We then scale to a more advanced setting where an experienced system-user (an author of this work) works with a robot to shoot a stop-motion animation (Fig. <ref type="figure">1</ref>); over two hours of continuous collaboration, the user teaches progressively more complex concepts, building a rich library of skills and behaviors to produce a 52 second (232 frame) movie.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Vocal Sandbox: A Framework for Continual Learning and Adaptation</head><p>Vocal Sandbox is characterized by a language model planner that maps user utterances to sequences of high-level behaviors and a family of skill policies that ground behaviors to robot actions (Fig. <ref type="figure">2</ref>). In this section, we first formalize planning ( &#167;2.1) and skill execution ( &#167;2.2), then introduce our contributions for teaching new behaviors from diverse modes of feedback ( &#167;2.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">High-Level Planning with Language Models</head><p>We implement high-level planning using a language model (LM) prompted with an API specification &#923; t = (F t , C t ) (Fig. <ref type="figure">2</ref>; Left) that consists of a set of functions F t (synonymous with "behaviors") and argument literals C t , indexed by interaction timestep t - this indexing is important as users can teach new behaviors (add to the API) over the course of a collaboration. Each argument c &#8712; C t is a typed literal - for example, we define Enums such as Location that associate a canonical name (e.g., Location.HOME in Fig. <ref type="figure">2</ref>) to values that are used by the robot when executing a skill. We define each function f &#8712; F t as a tuple (n, &#963;, d, b) consisting of a semantically meaningful name n (e.g., goto), a type signature &#963; (e.g., [ObjectRef] &#8594; None), a human-readable docstring d (e.g., "Move above the specified obj"), and a function body b. Crucially, we assume we can define new function compositions - for example, pickup(obj: ObjectRef) as goto(obj); grasp().</p><p>Given a user utterance u t at time t, the language model generates a plan p t that is composed of a program, or sequence of function invocations. For example, a valid plan for an instruction such as "place the candy in the gift bag" subject to the API in Fig. <ref type="figure">2</ref> would be p t = pickup(ObjectRef.CANDY); goto(ObjectRef.GIFT_BAG); release(). We formalize planning as generating p t = LM(u t , &#923; t , h t ), where u t is the user's utterance, &#923; t is the current API, and h t corresponds to the full interaction history through t. Concretely, h t = [(u 1 , p 1 ), (u 2 , p 2 ), . . .
(u t-1 , p t-1 )]; see our project page for full prompts.</p><p>Note that there are cases where the language model may not be able to generate a valid plan -for example, given an utterance describing a behavior that does not exist in the API (e.g., "can you pack a gift bag with three candies"). In such a situation, we raise an exception, and rely on the commonsense understanding of the LM to generate a helpful error message (e.g., "I am not sure how to pack; could you teach me?"). These exceptions (as well as the successfully generated plans) are shown to users via a custom graphical interface (GUI; &#167;3.3) to inform their next action.</p></div>
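To make the planning loop above concrete, here is a minimal Python sketch of validating an LM-generated plan against the current API &#923; t. This is an illustration only, not the authors' implementation: the plan encoding as (name, arguments) tuples, the table names, and the error messages are assumptions for exposition.

```python
API_FUNCTIONS = {
    # name -> (type signature, docstring): a tiny stand-in for F_t
    "goto": ("[ObjectRef] -> None", "Move above the specified obj"),
    "pickup": ("[ObjectRef] -> None", "Go to the object and grasp it"),
    "release": ("[] -> None", "Open the gripper"),
}
API_LITERALS = {"ObjectRef.CANDY", "ObjectRef.GIFT_BAG", "Location.HOME"}  # C_t

def validate_plan(plan):
    """Check an LM-generated plan (a sequence of function invocations)
    against the current API; unknown behaviors or literals raise an
    exception whose message doubles as a prompt for the user to teach."""
    for name, args in plan:
        if name not in API_FUNCTIONS:
            raise ValueError(f"I am not sure how to {name}; could you teach me?")
        for arg in args:
            if arg not in API_LITERALS:
                raise ValueError(f"I don't know '{arg}'; could you teach me?")
    return plan

# "place the candy in the gift bag" -> a valid plan p_t
p_t = [("pickup", ["ObjectRef.CANDY"]),
       ("goto", ["ObjectRef.GIFT_BAG"]),
       ("release", [])]
assert validate_plan(p_t) == p_t

# "can you pack a gift bag ..." -> `pack` is unknown, so we defer to the user
try:
    validate_plan([("pack", ["ObjectRef.CANDY"])])
except ValueError as err:
    print(err)  # -> I am not sure how to pack; could you teach me?
```

In the full system, the raised message would be vocalized and surfaced through the GUI, prompting the user toward argument or function teaching.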
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Low-Level Skills for Action Execution</head><p>To ground plans p t to robot actions, we assemble skill policies &#960; f (a | o t , c 1 , c 2 , . . . ) that define the execution of a function f &#8712; F as a mapping from a state observation o t &#8712; R n and individual "resolved" arguments c 1 , c 2 , . . . to a sequence of robot actions a &#8712; R T &#215;D (e.g., end-effector poses). Of particular note is the resolution operator that maps an LM-generated plan p t to a sequence of skill policy invocations; to do this, we first iterate through each function invocation in p t and recursively expand higher-order functions as necessary. In our "place the candy in the gift bag" example, this means expanding the initial pickup(ObjectRef.CANDY) to its resolved function body b pickup = goto(ObjectRef.CANDY); grasp(). We then resolve each argument, substituting each literal (variable name) with its corresponding value - for example, ObjectRef.CANDY = "A gummy, sandwich-shaped candy" while Location.HOME = ([0.36, 0.00, 0.49] pos , [1, 0, 0, 0] rot ). The corresponding values are passed to the underlying policy implementation &#960; f .</p><p>In this work, we look at three different classes of skill policies: hand-coded primitives (e.g., GO_HOME or GRASP), visual keypoint-conditioned policies that identify end-effector poses given natural language object descriptions, and dynamic movement primitives ( &#167;3.2). Note that in general, our formulation permits arbitrary policy classes, including learned closed-loop control policies <ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref>.</p></div>
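The resolution step described above can be sketched as a short recursive procedure. The following is a hedged, self-contained illustration; the Fn record, API tables, and plan encoding are invented for exposition, and the paper's actual operator additionally handles typed Enums and dispatch to the skill policies.

```python
from dataclasses import dataclass

@dataclass
class Fn:
    params: list  # formal parameter names, e.g. ["obj"]
    body: list    # [(fn_name, [args...])] for compositions; [] for primitives

API_FNS = {
    "goto": Fn(["obj"], []),
    "grasp": Fn([], []),
    "release": Fn([], []),
    # pickup(obj) is defined by composition: goto(obj); grasp()
    "pickup": Fn(["obj"], [("goto", ["obj"]), ("grasp", [])]),
}
LITERALS = {
    "ObjectRef.CANDY": "A gummy, sandwich-shaped candy",
    "ObjectRef.GIFT_BAG": "A red paper gift bag",
    "Location.HOME": ([0.36, 0.00, 0.49], [1, 0, 0, 0]),  # (pos, quat)
}

def resolve(plan):
    """Map an LM plan p_t to a flat sequence of primitive skill invocations:
    recursively expand composed function bodies (binding formal parameters to
    actual arguments), then substitute each literal with its grounded value."""
    calls = []
    for name, args in plan:
        fn = API_FNS[name]
        if fn.body:  # higher-order: expand the body, then resolve the expansion
            binding = dict(zip(fn.params, args))
            sub_plan = [(n, [binding.get(a, a) for a in sub_args])
                        for n, sub_args in fn.body]
            calls.extend(resolve(sub_plan))
        else:        # primitive: ground arguments and hand off to pi_f
            calls.append((name, [LITERALS[a] for a in args]))
    return calls

p_t = [("pickup", ["ObjectRef.CANDY"]),
       ("goto", ["ObjectRef.GIFT_BAG"]),
       ("release", [])]
print(resolve(p_t))  # pickup expands to goto + grasp before grounding
```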
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Teaching via Program Synthesis</head><p>A core component of Vocal Sandbox is the ability to synthesize new behaviors and skills from diverse modalities of teaching feedback such as spoken dialogue, object keypoints, and kinesthetic demonstrations. To do this, we leverage the commonsense priors and strong generalization ability of our underlying language model to synthesize new functions and arguments, updating the API &#923; t in real-time. Given a failed plan, each "teaching" step first uses the language model to vocalize the "missing" concept(s) to be taught, followed by an interaction that prompts the user to provide targeted feedback. The language model then synthesizes new functions and arguments for the API. Fig. <ref type="figure">2</ref> (Right) shows the two types of teaching we develop: 1) argument teaching and 2) function teaching.</p><p>The goal of argument teaching is to teach and ground new literals c&#770; &#8712; C - for example, identifying "green toy car" as a new argument given the utterance "Grab the green toy car" in Fig. <ref type="figure">2</ref> (Right). To do this, the LM parses as much of the utterance as possible subject to the current API, mapping "grab" to pickup. Because it is able to identify the correct function (but not the arguments), the LM then uses the corresponding type signature to infer that "green toy car" should be of type ObjectRef; it then automatically synthesizes an API update, adding a new literal ObjectRef.TOY_CAR. This addition is then shown to the user (as part of the second stage of teaching), who then provides the supervision needed to successfully ground the literal (i.e., providing the keypoint supervision to localize the object). 
On successful execution, the LM commits this change, yielding &#923; t+1 .</p><p>The goal of function teaching is to teach new functions f&#770; &#8712; F - for example, defining pack from the utterance "now can you pack the candy in the bag" in Fig. <ref type="figure">2</ref> (Right). Here, the LM cannot partially parse the utterance - "pack" does not have an associated function, so there is no reliable way to infer a type signature. Instead, the LM highlights "pack" as a new behavior, and explicitly asks the user to teach its meaning through decomposition <ref type="bibr">[23,</ref><ref type="bibr">24]</ref>, breaking "pack" down into a chain of existing skills. In this case, the user says: "Pick up the candy; go above the bag; drop it" with program pickup(ObjectRef.CANDY); goto(ObjectRef.GIFT_BAG); release(). The LM then explicitly synthesizes the new function f&#770; = (n&#770;, &#963;&#770;, d&#770;), with name n&#770; = pack, signature &#963;&#770; = obj: ObjectRef &#8594; None, and docstring d&#770; = "Retrieve the object and place it in the gift bag." We also generate the "lifted" body pickup(obj); goto(ObjectRef.GIFT_BAG); release() via first-order abstraction <ref type="bibr">[24,</ref><ref type="bibr">25]</ref>. This combination of argument and function teaching enables the expressivity and real-time adaptivity of our framework; in defining lightweight algorithms for learning and synthesizing new API specifications from interaction, we provide users with a reliable method of growing and reasoning over the robot's capabilities during the course of a collaboration.</p></div>
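First-order abstraction in the function-teaching example above amounts to replacing the taught literal with a formal parameter. A minimal sketch, with the understanding that the lift helper and the plan encoding are hypothetical names for exposition, not the authors' code:

```python
def lift(taught_plan, taught_literal, param="obj"):
    """First-order abstraction: replace the concrete literal the user taught
    with a formal parameter, yielding a reusable 'lifted' function body."""
    return [(name, [param if a == taught_literal else a for a in args])
            for name, args in taught_plan]

# "Pick up the candy; go above the bag; drop it"
taught = [("pickup", ["ObjectRef.CANDY"]),
          ("goto", ["ObjectRef.GIFT_BAG"]),
          ("release", [])]

# The synthesized function entry added to the API for the next timestep
new_fn = {
    "name": "pack",
    "signature": "obj: ObjectRef -> None",
    "docstring": "Retrieve the object and place it in the gift bag.",
    "body": lift(taught, "ObjectRef.CANDY"),
}
print(new_fn["body"])  # [('pickup', ['obj']), ('goto', ['ObjectRef.GIFT_BAG']), ('release', [])]
```

Note that only the candy is lifted to a parameter; the gift bag stays concrete, so pack generalizes over objects while always targeting the same destination.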
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Implementation &amp; Design Decisions</head><p>While we introduce Vocal Sandbox as a general learning framework, this section provides implementation details specific to the experimental settings we explore in our experiments. Notably, our first experimental setting involves a collaborative gift-bag assembly task ( &#167;4.1) where a non-expert user and robot work together to assemble a gift bag with a set of known objects (visualized in Fig. <ref type="figure">2</ref>). Our second setting ( &#167;4.2) pairs an experienced system user (an author of this work) with a robot for the task of LEGO stop-motion animation (visualized in Fig. <ref type="figure">1</ref>). For all settings, we use a Franka Emika Panda arm equipped with a Robotiq 2F-85 gripper following DROID <ref type="bibr">[26]</ref>; we also assume access to an overhead ZED 2 RGB-D camera with known intrinsics and extrinsics to obtain point clouds.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Language Models for High-Level Planning</head><p>We use GPT-3.5 Turbo with function calling [v11-06; 27, 28] due to its low latency and cost-effective pricing. We encode &#923; t as a Python code block (formatted as Markdown) in the system prompt u prompt . To constrain the LM outputs to well-formed programs, we use the function calling feature provided by the OpenAI chat completion endpoint <ref type="bibr">[28]</ref>, formatting each function in &#923; t as a "tool" that the LM can invoke in response to a user utterance. Full code and prompts are on our project page.</p></div>
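The "tool" encoding described above can be sketched as follows: each function in &#923; t becomes a JSON-schema tool entry of the shape accepted by the OpenAI chat completions endpoint, with typed literals exposed as string enums so generated invocations are well-formed by construction. The to_tool helper and the example argument choices are assumptions for illustration; the authors' exact prompts and schemas are on their project page.

```python
def to_tool(name, description, params):
    """Encode one API function as an OpenAI-style 'tool' entry, with each
    parameter's admissible literals exposed as a JSON Schema string enum."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": {p: {"type": "string", "enum": choices}
                               for p, choices in params.items()},
                "required": list(params),
            },
        },
    }

tools = [
    to_tool("goto", "Move above the specified obj",
            {"obj": ["ObjectRef.CANDY", "ObjectRef.GIFT_BAG"]}),
    to_tool("release", "Open the gripper", {}),
]
# These would be passed as `tools=tools` to the chat completions call,
# alongside a system prompt containing the Markdown-formatted API.
```

Because the API grows over the interaction, this list would be rebuilt from &#923; t on every request, so newly taught behaviors become callable immediately.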
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Skill Policies for Object Manipulation and Dynamic Motions</head><p>We implement different families of skill policies &#960; f for each of our two experimental settings. In our first setting ( &#167;4.1) we implement skills using a visual keypoints model, while for our second setting ( &#167;4.2) we implement skills as dynamic movement primitives [DMPs; <ref type="bibr">29,</ref><ref type="bibr">30]</ref>.</p><p>Visual Keypoint-Conditioned Policies for Object Manipulation. For the collaborative gift-bag assembly setting ( &#167;4.1), we implement skills &#960; goto and &#960; pickup via learned keypoint-conditioned models that ground object referring expressions (e.g., "a green toy car") to point clouds in a scene <ref type="bibr">[31]</ref>. Specifically, given an (image, language) input, we first predict a 2D keypoint (x, y) that corresponds to the centroid of the desired object in pixel space, then use off-the-shelf models [i.e., FastSAM; 32] to produce an object segmentation mask. Finally, we deproject this mask through our calibrated camera to obtain the object's point cloud to inform manipulation. To ensure consistency, we use a learned mask propagation model <ref type="bibr">[XMem;</ref><ref type="bibr">33]</ref> to maintain a dictionary mapping language expressions to existing object masks -this allows us to ensure that referring expressions such as "the green toy car" always refer to the same object instance for the duration of the interaction (as opposed to predictions changing over time, confusing users). We provide further detail in &#167;B.3.</p><p>Note that we adopt this modular approach (predict keypoints from language, then extract a segmentation mask) for two reasons. 
First, we found learning a visual keypoints model to be extremely data-efficient (requiring only a couple dozen examples) and reliable, significantly outperforming existing open-vocabulary detection and end-to-end models such as OWL-v2 <ref type="bibr">[34,</ref><ref type="bibr">35]</ref> as well as vision-language foundation models such as GPT-4V <ref type="bibr">[36]</ref>; we quantify these results via offline evaluation in &#167;B.3. Second, we found keypoints to be the right interface for interpretability and teaching: to specify a new object, users need only "click" on the relevant part of the image via our GUI ( &#167;3.3).</p><p>Learning Dynamic Movement Primitives from Demonstration. For our stop-motion animation setting ( &#167;4.2), the robot learns to control the camera to execute different cinematic motions such as panning, tracking, or zooming (amongst others); due to the dynamic nature of these motions, we implement skills as (discrete) DMPs <ref type="bibr">[29,</ref><ref type="bibr">30]</ref>, where users teach new motions by providing a single kinesthetic demonstration. Using DMPs allows us to generalize motions to novel start and goal positions, while also providing useful affordances for re-timing trajectories (e.g., speeding up a motion by a factor of 2x, or constraining a motion to execute in K steps). Furthermore, as rolling out a DMP generates an entire trajectory, we can provide users with a visualization of the robot's intended path via our custom GUI ( &#167;3.3; visualized in Fig. <ref type="figure">6</ref>). To learn a DMP, we divide a user's demonstration into a fixed sequence of waypoints, and learn a set of radial basis function weights (e.g., &#946; track from Fig. <ref type="figure">1</ref>) that smoothly interpolate between them; we provide further detail in &#167;B.4.</p></div>
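To illustrate the DMP learning described above, here is a minimal 1-D discrete DMP sketch in the standard Ijspeert-style formulation: fit radial-basis-function forcing-term weights from a single demonstration, then roll out to a novel start/goal with optional re-timing. All gains, basis settings, and helper names are assumptions for exposition, not the authors' implementation (which operates on full end-effector trajectories).

```python
import numpy as np

def dmp_fit(demo, n_basis=20, alpha=25.0, dt=0.01):
    """Fit a 1-D discrete DMP to a single demonstration: recover the forcing
    term the demo implies, then regress RBF weights against the canonical phase."""
    beta, ax = alpha / 4.0, alpha / 3.0          # standard gain heuristics
    T = len(demo)
    y = np.asarray(demo, float)
    y0, g = y[0], y[-1]
    yd = np.gradient(y, dt)
    ydd = np.gradient(yd, dt)
    x = np.exp(-ax * dt * np.arange(T))          # canonical phase x(t)
    f_target = ydd - alpha * (beta * (g - y) - yd)
    centers = np.exp(-ax * np.linspace(0.0, dt * T, n_basis))
    widths = n_basis ** 1.5 / centers            # narrower bases at small phase
    psi = np.exp(-widths * (x[:, None] - centers[None, :]) ** 2)
    xs = x * (g - y0 + 1e-8)                     # forcing scales with x * (g - y0)
    w = np.array([(xs * psi[:, i] @ f_target) / (xs ** 2 @ psi[:, i] + 1e-10)
                  for i in range(n_basis)])      # locally weighted regression
    return dict(w=w, centers=centers, widths=widths,
                alpha=alpha, beta=beta, ax=ax, dt=dt, T=T)

def dmp_rollout(p, y0, g, T=None):
    """Roll out from a novel start/goal; re-timing to T steps rescales dt
    (e.g., a 2x-speed motion uses T = p['T'] // 2)."""
    T = T or p["T"]
    dt = p["dt"] * p["T"] / T
    y, yd, x, traj = float(y0), 0.0, 1.0, []
    for _ in range(T):
        psi = np.exp(-p["widths"] * (x - p["centers"]) ** 2)
        f = (psi @ p["w"]) * x * (g - y0) / (psi.sum() + 1e-10)
        ydd = p["alpha"] * (p["beta"] * (g - y) - yd) + f
        yd += ydd * dt
        y += yd * dt
        x += -p["ax"] * x * dt                   # Euler step of the canonical system
        traj.append(y)
    return np.array(traj)

demo = np.sin(np.linspace(0.0, np.pi / 2, 100))  # a smooth 0 -> 1 camera motion
params = dmp_fit(demo)
traj = dmp_rollout(params, y0=0.2, g=0.8)        # generalizes to a new start/goal
```

Because the forcing term vanishes as the phase decays, the rollout converges to the new goal while preserving the demonstrated shape, which is what makes re-timed and re-targeted camera motions safe to visualize before execution.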
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">A Structured Graphical User Interface for Transparency and Intervention</head><p>Finally, we design a graphical user interface (GUI; Fig. <ref type="figure">5</ref>) to provide users with transparency into the system state and to enable teaching. Consider an utterance "pack the candy" and a failure mode that has the robot packing a different object (e.g., the ball) instead. Without any additional information, it is impossible for the user to identify whether this failure is a result of the LM planner (e.g., generating the incorrect plan pack(ball)), or a failure in the skill policy (e.g., incorrectly predicting a keypoint on the ball instead of the candy). The GUI counteracts this confusion by explicitly visualizing the plan and the interpretable traces produced by each skill (e.g., the predicted keypoints and segmentation mask, or the robot's intended path as output by a DMP) -these interpretable traces also double as interfaces for eliciting teaching feedback (e.g., "clicks" to designate keypoints). The GUI also provides 1) the transcribed user utterance, 2) the current "mode" (e.g., normal execution, teaching, etc.), and 3) the prior interaction history.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>We evaluate the usability, adaptability, and helpfulness of Vocal Sandbox systems through two experimental settings: 1) a real-world human-subjects study with N = 8 inexperienced participants for the task of collaborative gift bag assembly ( &#167;4.1), and 2) a more complex setting that pairs an experienced system-user (author of this work) and robot to film a stop-motion animation ( &#167;4.2). All studies were IRB-approved, with all participants providing informed consent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">User Studies: Collaborative Gift Bag Assembly</head><p>This study has participants work with the robot to assemble four gift bags with a fixed set of objects (candy, Play-Doh, and a toy car), along with a hand-written card (transcribed from a 96-word script). This task is repetitive and time-intensive, serving to study how well users can teach the robot maximally helpful behaviors to parallelize work and minimize their time spent supervising the robot.</p><p>Participants and Procedure. We conduct a within-subjects study with 8 non-expert users (3 female/5 male, age 25.2 &#177; 1.22). Each user assembled four bags with three different systems, with a random ordering of methods across users. Prior to engaging with the robot, we gave users a sheet describing the robot's capabilities (i.e., the base API functionality), and instructions for using any teaching interfaces (if applicable). Prior to starting the bag assembly task, users were allowed to practice using each method on the unrelated task of throwing (disjoint) objects from a table into a trash can.</p><p>Independent Variables - Robot Control Method. We compare the full Vocal Sandbox system (VS) to two baselines. First, VS -(Low, High) (without Low, High), a static version of Vocal Sandbox that ablates both low-level skill teaching and high-level plan teaching, reflecting prior work like MOSAIC [8] that assumes a fixed skill library and an LM planner seeded with a base API. Our second baseline VS -Low ablates low-level skill teaching, but allows for teaching new high-level behaviors.</p><p>Dependent Measures. We consider objective metrics such as time supervising the robot, number of commands spoken, number of low-level skills executed per command (behavior complexity), as well as the number of teaching interactions at both the high and low level. 
Intuitively, we expect that methods capable of learning high-level behaviors from user feedback require less active supervision time and yield more complex behaviors than those using a static (naive) planner; similarly, we expect methods that permit teaching new low-level skills to see fewer low-level skill execution errors. In addition to objective metrics, we asked users to fill out a qualitative survey after engaging with each method, describing their experience and ranking each method.</p><p>Results - Objective Metrics. We report objective results in Fig. <ref type="figure">4</ref>. We measure robot supervision time (Fig. <ref type="figure">4</ref>; Left) as a proxy for robot capability - a more capable robot should require less supervision (as it performs more actions autonomously). We observe that Vocal Sandbox (VS) systems outperform both the VS -(Low, High) and VS -Low baselines in terms of supervision time, demonstrating the capabilities afforded by the ability to teach at multiple levels of abstraction; the strengths of VS systems are made further evident by visualizing how users spend their time during a collaboration across instructing, confirming, and parallelizing (Fig. <ref type="figure">4</ref>; Bottom-Left).</p><p>We additionally measure the complexity of (taught) high-level behaviors, measured as the number of low-level skills executed per language utterance (Fig. <ref type="figure">4</ref>; Middle). We find that VS systems yield increasingly more complex behaviors compared to both baselines, with significantly (p &lt; 0.05) more complex behaviors taught (and used) compared to the VS -(Low, High) baseline. This further highlights the importance of high-level teaching via structured function synthesis ( &#167;2.3). Of equal importance is the ability of the robot to learn low-level skills, exhibiting a declining rate of skill execution failures over time (Fig. <ref type="figure">4</ref>; Right). 
We see that VS systems show significantly (p &lt; 0.05) fewer skill failures than both baselines, underscoring the isolated importance of teaching at the low level. We further note that VS systems show fewer skill failures as a function of the number of bags assembled, demonstrating the ability of VS systems to improve over time.</p><p>Results - Subjective Metrics. We report subjective survey results in Fig. <ref type="figure">5 (Left)</ref>, where users ranked the three methods across six different measures. In terms of ease, helpfulness, intuitiveness, and willingness to use again, we find that users prefer our VS system over both baselines due to its transparency and teachability, as users noted "teaching is useful" and "I loved how I was able to teach the robot certain skills". We also note that users significantly (p &lt; 0.05) prefer VS to the naive VS -(Low, High) baseline, which "felt chunkier to use". For predictability and trust, we observe that users prefer our VS system over baselines; however, this trend is less pronounced because any autonomous execution incurs some loss of predictability (due to imperfect robot execution with learned policies).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Scaling Experiment: LEGO Stop Motion Animation</head><p>Finally, to push the limits of our framework, we consider a LEGO stop motion animation setting, where an experienced system-user (author of this work) works with the robot over two hours of continuous collaboration to shoot a stop-motion film. The user is the director, leading the creative vision for the film by directing footage and arranging LEGO set pieces, while the robot controls the camera to capture different types of dynamic shots. Specifically, the user teaches the robot to perform cinematic concepts including "tracking", "zooming" and "panning" via kinesthetic demonstrations, which we use to fit different DMPs (as described in &#167;3.2). The user iteratively builds on these behaviors to shoot progressively more complex frame sequences, ultimately resulting in a 52-second stop-motion film, consisting of 232 individual frames. Of the total frame count, 43% were shot with completely autonomous dynamic camera motions taught by the user - that is, camera motions where the LEGO scene remained fixed, but the camera moved subject to a DMP rollout (e.g., a "zoom-in" shot). We found the taught skills were able to generalize across different start and end positions, as well as different timing constraints - for example, "pan around slowly" commanded a pan_around motion with more frames (N = 30; 8 seconds), while "pan around quickly" modulated the same skill to execute in fewer frames (N = 8; 1.33 seconds). Over the two-hour long interaction, the robot executed 40 novel commands, ranging from "let's frame the tower in this shot" to "zoom into Iron Man" (see Appx. D and our project page for more examples).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>Vocal Sandbox engages with a large body of work proposing systems for human-robot collaboration that pair task planning with learned skill policies for executing actions; we provide an extended treatment of related work in &#167;C.3, focusing here on prior work that centers language models for task planning or introduces generalizable approaches for learning skills from different feedback modalities.</p><p>Task Planning with Language Models. Recent work investigates methods for using LMs such as GPT-4 and PaLM <ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref> for general-purpose task planning <ref type="bibr">[9,</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref>. Especially relevant are approaches that use LMs to generate plans as programs <ref type="bibr">[19,</ref><ref type="bibr">[40]</ref><ref type="bibr">[41]</ref><ref type="bibr">[42]</ref><ref type="bibr">[43]</ref>. While some methods explore using LMs to generate new skills - e.g., by parameterizing reward functions <ref type="bibr">[44,</ref><ref type="bibr">45]</ref> - they require expensive simulation and offline learning. In contrast, Vocal Sandbox designs lightweight learning algorithms to learn new behaviors online, from natural user interactions.</p><p>Learning Generalizable Skills from Mixed-Modality Feedback. A litany of prior approaches in robotics study methods for interpreting diverse feedback modalities, from intent inference in HRI <ref type="bibr">[8,</ref><ref type="bibr">46,</ref><ref type="bibr">47]</ref> to learning from implicit expressions <ref type="bibr">[11,</ref><ref type="bibr">48]</ref> or explicit gestures <ref type="bibr">[12,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref>. 
With the advent of LMs, language has become an increasingly popular interaction modality; however, most methods are limited to specific types of language feedback such as goal specifications <ref type="bibr">[13,</ref><ref type="bibr">51]</ref> or corrections <ref type="bibr">[14,</ref><ref type="bibr">52,</ref><ref type="bibr">53]</ref>. In contrast, Vocal Sandbox demonstrates that language alone is not sufficient when teaching new behaviors -especially for teaching new object groundings or dynamic motions. Instead, our framework leverages multiple feedback modalities simultaneously to guide learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Discussion</head><p>We present Vocal Sandbox, a framework for situated human-robot collaboration that continually learns from mixed-modality interaction feedback in real time. Vocal Sandbox has two components: a high-level language model planner, and a family of skill policies that ground plans to actions. This decomposition allows users to give targeted feedback at the correct level of abstraction, in the relevant modality. We evaluate Vocal Sandbox through a user study with N = 8 participants, observing that our system is preferred by users (+13.9%), requires less supervision (-22.1%), and yields more complex autonomous performance (+19.7%) with fewer failures (-67.1%) than non-adaptive baselines.</p><p>Limitations and Future Work. As execution relies on low-level skills that are quickly adapted from sparse feedback, this framework struggles in dexterous settings (e.g., assistive bathing) where more data is necessary to capture behavioral nuances. Another shortcoming is that the collaborations enabled by our system are relatively homogeneous -users are teachers and robots are followers -which is not suited for all settings. Future work will explore algorithms for cross-user improvement as well as sample-efficient algorithms for learning more expressive skills. We will also further probe the user's model of robot capabilities to investigate questions about human-robot trust and collaboration styles.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Beyond the language model task planner that uses GPT-3.5 Turbo with function calling [v11-06; 27, 28] and the (lightweight) learned skill policy, we use a combination of Whisper <ref type="bibr">[54]</ref> for real-time speech recognition (mapping user utterances to text) and the OpenAI text-to-speech (TTS) API <ref type="bibr">[55]</ref> for vocalizing confirmation prompts and querying users for teaching feedback. All models, cameras, and API calls are run through a single laptop equipped with an NVIDIA RTX 4080 GPU (12 GB).</p><p>For the gift-bag assembly user study (N = 8), the total cost of all external APIs (Whisper, OpenAI TTS, GPT-3.5 Turbo) amounted to $0.47 + $0.08 + $1.24 = $1.79. For the entirety of the project, GPT-3.5 API spend was $5.79, with &#8764;$4.00 spent on Whisper and TTS (&lt; $10.00 total).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Q2.</head><p>Given the use of powerful closed-source foundation models such as GPT-3.5/GPT-4, why adopt a modular approach for implementing the visual keypoints (and similarly, dynamic movement primitives for learning policies)? Why not adopt an end-to-end approach building on top of GPT-4 with Vision, or existing pretrained multitask policies?</p><p>We choose to adopt a modular approach in this work for two reasons. First, existing end-to-end models are still limited when it comes to fine-grained perception and grounding; we quantify this more explicitly through head-to-head static evaluations of our keypoint model vs. pretrained models such as OWLv2 <ref type="bibr">[34,</ref><ref type="bibr">35]</ref> in &#167;B.3. Second, we argue that modularity allows users to systematically isolate failures and address them via multimodal feedback, at the right level of abstraction. We expand on this further in &#167;C.1.</p></div>
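The API spend reported above can be reproduced from the per-unit pricing quoted in &#167;B.1 and &#167;B.2 ($0.006 per minute for Whisper, $15.00 per 1M characters for TTS, $2.00 per 1M tokens for GPT-3.5 Turbo). A minimal sketch of this back-of-envelope accounting; the usage figures in the example are hypothetical placeholders, not measured values from the study:

```python
# Back-of-envelope API cost model for a Vocal Sandbox-style system.
# Pricing follows the figures quoted in the appendix; the usage numbers
# in the example below are hypothetical, not measured study values.

WHISPER_PER_MIN = 0.006              # $ per minute of transcribed audio
TTS_PER_CHAR = 15.00 / 1_000_000     # $ per synthesized character
GPT35_PER_TOKEN = 2.00 / 1_000_000   # $ per GPT-3.5 Turbo token

def api_cost(transcribed_minutes: float, tts_characters: int, lm_tokens: int) -> float:
    """Total external-API spend in dollars, rounded to the cent."""
    total = (transcribed_minutes * WHISPER_PER_MIN
             + tts_characters * TTS_PER_CHAR
             + lm_tokens * GPT35_PER_TOKEN)
    return round(total, 2)

# e.g., ~78 minutes of speech, ~5,300 TTS characters, and ~620,000 LM
# tokens (hypothetical usage) land in the ballpark of the reported $1.79.
cost = api_cost(78, 5_300, 620_000)
```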
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Q3.</head><p>The baseline methods in the user study ( &#167;4.1) are framed as ablations, rather than instances of existing systems that combine language model planning with learning from multiple modalities (e.g., on-the-fly language corrections, gestures). How were these ablations chosen? What is their explicit relationship to prior work?</p><p>In our user studies, we constructed our baselines in a way that best represented the contributions of prior work while still fulfilling the necessary prerequisites to perform in our situated human-robot collaboration setting. Though we labeled these "ablations" in the paper, each one is representative of prior work -connections we make explicit in &#167;C.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Q4. How does Vocal Sandbox fit into the context of prior human-robot interaction works? What are the new capabilities Vocal Sandbox is bringing to the table?</head><p>While the main body of the paper situates our framework against prior work in task planning and skill learning from different modalities, Vocal Sandbox builds on a rich history of work that develops systems for different modes of human-robot interaction. We provide an extended treatment of related work, as well as directions for future work in &#167;C.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Implementing Vocal Sandbox</head><p>Implementing a system in the Vocal Sandbox framework requires not only the learned components for language-based task planning and low-level skill execution, but also broader support for interfacing with users: automated speech-to-text, text-to-speech for vocalizing failures or confirmation prompts, and a screen for displaying the graphical user interface. In the following sections, we describe these additional components and provide more detail on the implementation of the learned components of the systems instantiated in our paper.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.1 System Architecture</head><p>For robust and cheap automated speech recognition (mapping user utterances to text), we use Whisper <ref type="bibr">[27,</ref><ref type="bibr">54]</ref>, accessed via the OpenAI API endpoint. Whisper is a state-of-the-art model for natural speech transcription, and we find that the latency for a given transcription request (&lt; 0.5s round-trip) is low enough for all of our use-cases. API pricing is also affordable, with the Whisper API (through OpenAI) charging $0.006 / minute of transcription (less than $0.50 to run our entire gift-bag assembly user study). Note that we implement speech-to-text via an explicit "push-to-talk" interface, rather than an "always-listening" approach; we find that this not only allows us to keep cost and word-error rate down, but also improves user experience. By gating the listening and stop-listening features with explicit audio cues, users are more aware of what the system is doing, and can more quickly localize any failures stemming from malformed speech transcriptions.</p><p>In addition to automated speech-to-text, we adopt off-the-shelf solutions for real-time text-to-speech; this is mostly for implementing confirmation prompts ("does this plan look ok to you?") and for vocalizing the system state, but also includes an adaptive component when probing users to teach new visual concepts or behaviors ("I'm sorry, I'm not sure what the 'jelly-candy thing' looks like, could you teach me?"). For these queries, we use the OpenAI TTS API <ref type="bibr">[55]</ref> with a similarly affordable pricing scheme of $15.00 per 1M characters (or approximately 200K words); to run our gift-bag assembly study, this cost less than $0.08. 
For hardware (for both speech recognition and text-to-speech), we use a standard USB speaker-microphone (the Anker PowerConf S3).</p><p>To display the graphical user interface, we use an external monitor (27 inches), placed outside of the robot's workspace. We drive the GUI, all API calls (speech recognition, text-to-speech, and language modeling via GPT-3.5 Turbo), the ZED 2 camera, and all our learned models -including our visual keypoint-conditioned policy, FastSAM <ref type="bibr">[32]</ref>, and XMem <ref type="bibr">[33]</ref> -from a single Alienware M16 laptop with an NVIDIA RTX 4080 GPU with 12 GB of VRAM; this laptop was purchased matching the DROID platform specification <ref type="bibr">[26]</ref>.</p><p>Modifications for Gift-Bag Assembly User Study. For the gift-bag assembly user study (N = 8), we implement the "push-to-talk" speech recognition interface with physical buttons placed on the table; users are provided two buttons -one for "talking" and one for "cancelling" the prior action (which serves a dual function as a secondary, software-based emergency stop when the robot is moving). These buttons are placed on the side of the user's non-dominant hand, always within reach.</p><p>Modifications for LEGO Stop-Motion Animation. For the LEGO stop-motion animation study, we use the same components as above, with two additions. As the expert user is directing and framing individual camera shots during the course of the collaboration, they add an additional laptop (a MacBook, running Stop Motion Studio) to the workspace (disconnected from the rest of the system). As the user requires both hands free for this study (for articulating LEGO minifigures and structures, or editing the clip on their laptop), we replace the tabletop "push-to-talk" buttons with a USB-connected foot pedal with two switches that provide the same talk and cancel functionality.</p></div>
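The push-to-talk gating described in &#167;B.1 can be sketched as a small state machine. The class and event names below are illustrative assumptions (the paper does not specify this interface), and the actual Whisper transcription call is stubbed out:

```python
# Minimal push-to-talk gate: the system only listens between an explicit
# "talk" press and release, playing audio cues so the user always knows
# the system state; a second button cancels the pending action and
# doubles as a software e-stop while the robot is moving.
# Class and method names are illustrative, not taken from the paper.

class PushToTalk:
    IDLE, LISTENING = "idle", "listening"

    def __init__(self):
        self.state = self.IDLE
        self.cue_log = []          # audio cues played to the user
        self.pending_audio = None  # audio buffer awaiting transcription

    def press_talk(self):
        if self.state == self.IDLE:
            self.state = self.LISTENING
            self.cue_log.append("listen-start-chime")  # explicit audio cue

    def release_talk(self, audio_chunk):
        if self.state == self.LISTENING:
            self.state = self.IDLE
            self.cue_log.append("listen-stop-chime")
            self.pending_audio = audio_chunk  # would be sent to Whisper here

    def press_cancel(self, robot_moving: bool) -> str:
        # Cancels the pending utterance/plan; while the robot is moving,
        # the same button acts as a software emergency stop.
        self.pending_audio = None
        return "soft-estop" if robot_moving else "cancel"
```

Gating with audible start/stop cues keeps transcription cost and word-error rate down while letting users quickly localize malformed transcriptions, as discussed above.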
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.2 Language Models for Task Planning and Skill Induction</head><p>As described in the main body of the paper, we use GPT-3.5 Turbo with function calling [v11-06; <ref type="bibr">27,</ref><ref type="bibr">28]</ref> as our base language model for the task planner. This was the latest, most affordable, and lowest-latency language model at the time we began this work (prior to the release of GPT-4 and GPT-4o), with a response time of 1-3s on average, and a cost of $2.00 / 1M tokens; for this work, the total cost we spent on GPT-3.5 API calls (including development) was $5.79, with the gift-bag assembly user study itself only amounting to $1.24 of the total spend.</p><p>We use the GPT-3.5 function calling capabilities throughout our work, which requires formatting our API specification following a custom JSON schema set by OpenAI; we provide these function calling prompts (and all GPT-3.5 prompts for our work) on our supplementary website for easy visualization: <ref type="url">https://vocal-sandbox.github.io/#language-prompts</ref>. All language model outputs were generated with low-temperature sampling (0.2).</p></div>
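Function calling requires each skill to be exposed to GPT-3.5 as a tool described by OpenAI's JSON schema. A hedged sketch of what such a specification might look like; the exact prompts are on the project website, and the `pick_up` skill, its parameters, and the `teach_behavior` helper here are illustrative, not the paper's actual schemas:

```python
# Sketch of an OpenAI function-calling tool specification for low-level
# skills. The outer shape (type/function/parameters) follows OpenAI's
# tools schema; the particular skill and fields are illustrative only.

def skill_to_tool(name: str, description: str, params: dict, required: list) -> dict:
    """Wrap a skill signature in the JSON schema OpenAI function calling expects."""
    return {
        "type": "function",
        "function": {
            "name": name,
            "description": description,
            "parameters": {
                "type": "object",
                "properties": params,
                "required": required,
            },
        },
    }

TOOLS = [
    skill_to_tool(
        "pick_up",
        "Pick up the referenced object from the workspace.",
        {"object_ref": {"type": "string",
                        "description": "Natural-language object referent, e.g. 'candy'"}},
        ["object_ref"],
    ),
]

def teach_behavior(name: str, description: str, params: dict, required: list):
    """When a user teaches a new high-level behavior, append it to the tool
    list so the planner can compose and reuse it in future turns."""
    TOOLS.append(skill_to_tool(name, description, params, required))
```

Registering taught behaviors as new tools is one natural way to realize the "concepts that can be reused and built upon" behavior the framework describes.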
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.3 Visual Keypoint-Conditioned Policy Implementation</head><p>As described in the main body of the paper, we use three components to implement Vocal Sandbox's object manipulation skills: (1) a learned language-conditioned keypoint model, (2) a pretrained mask propagation model <ref type="bibr">[XMem;</ref><ref type="bibr">33]</ref>, and (3) a point-conditioned segmentation model <ref type="bibr">[FastSAM;</ref><ref type="bibr">32]</ref>.</p><p>Our learned keypoint model predicts object centroids from language, enabling us to generalize across object instances where XMem struggles. Given an RGB image o_t ∈ ℝ^(H×W×3) and a natural language literal c_ref from the high-level language planner, it predicts a matrix of per-pixel scores H ∈ [0, 1]^(H×W). We take the coordinate-wise argmax of H as the predicted keypoint. We implement our model with a two-stream architecture following Shridhar et al. <ref type="bibr">[57]</ref> that fuses pretrained CLIP <ref type="bibr">[58]</ref> textual embeddings with a fully-convolutional architecture. We train this model on a small, cheap-to-collect dataset of 25 unique images, each annotated with 3 keypoints (75 examples total). To fit our model, we create heatmaps from each ground-truth label, centering a 2D Gaussian around each keypoint with a fixed standard deviation of 6 pixels; we train our model by minimizing the binary cross-entropy between model predictions and these heatmaps, augmenting images with various label-preserving affine transformations (e.g., random crops, shears, rotations).</p><p>Our mask propagation model, XMem <ref type="bibr">[33]</ref>, tracks object segmentation masks from one image frame to the next; we provide a brief overview here. XMem comprises three convolutional networks (a query encoder e, a decoder d, and a value encoder v) and three memory modules (a short-term sensory memory, a working memory, and a long-term memory). 
For a given image I_t, the query encoder outputs a query q = e(I_t) and performs attention-based memory reading from the working and long-term memory stores to extract features F, conditioned on the language utterance c_ref (e.g., "candy"). The decoder d then takes as input q, F, and h_(t-1) (the short-term sensory memory) to output a predicted mask M_t. Finally, the value encoder v(I_t, M_t) outputs new features to be added to the memory history h_t. The query encoder e and value encoder v are instantiated with ResNet-50 and ResNet-18 <ref type="bibr">[59]</ref> respectively. The decoder d concatenates the short-term memory history h_(t-1) with the extracted features F, upsampling by a factor of 2x until reaching a stride of 4. While upsampling, the decoder fuses skip connections from the query encoder e at every level. The final feature map is passed through a 3 &#215; 3 convolution to output a single-channel logit, which is upsampled to the image size. See Cheng and Schwing <ref type="bibr">[33]</ref> for additional details.</p><p>Finally, our point-conditioned segmentation model, FastSAM <ref type="bibr">[32]</ref>, is used to extract an object mask from a predicted keypoint. It has two components: a YOLOv8 <ref type="bibr">[60]</ref> segmentation model s for all-instance segmentation, and a point prompt-guided selection for identifying the object mask in which the point lies. For a given predicted keypoint p, the segmentation model outputs the mask M from s that encompasses p. We refer to <ref type="bibr">[32]</ref> for additional details. This predicted mask M is subsequently added to the XMem memory storage after being passed through the value encoder v.</p><p>Robot actions are coded as parameterized primitives (e.g., pick_up or goto) that take object locations as input and output trajectories.</p><p>Static Evaluations -Robust Object Grounding. 
To highlight the need for a data-efficient, domain-specific vision system, we evaluate our vision module implementation (as described above) against existing closed-source foundation models such as OWLv2 and GPT-4V. To compare, we consider an (image, annotation) dataset of all visual queries from the N = 8 user study, where each annotation marks where the user confirmed or manually selected the correct object location. We report measures of accuracy and precision: keypoint mean squared error in pixel distance, and success counts for predictions within a toy-car radius (14 pixels) of the annotation. For the Vocal Sandbox predicted mask, the centroid of the mask is used for these point-to-point calculations. We observe that while the mean squared error is comparable across all three methods, our Vocal Sandbox vision module greatly outperforms the foundation model baselines on the precision metric. This is because all the objects are clustered together on a table (Fig. <ref type="figure">6</ref>): randomly selecting between these objects yields low-MSE predictions; however, a nearby prediction is not sufficient to identify and isolate the correct object for grasping.</p></div>
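The heatmap supervision used to train the keypoint model in &#167;B.3 (a 2D Gaussian with a fixed 6-pixel standard deviation around each annotated keypoint, decoded at inference by a per-pixel argmax) can be sketched in a few lines. This is a pure-Python illustration of the target construction and decoding, not the paper's training code:

```python
import math

SIGMA = 6.0  # fixed Gaussian standard deviation in pixels, per the appendix

def gaussian_heatmap(h: int, w: int, ky: int, kx: int) -> list:
    """Ground-truth target: per-pixel scores in [0, 1] from a 2D Gaussian
    centered on the annotated keypoint (ky, kx)."""
    return [[math.exp(-((y - ky) ** 2 + (x - kx) ** 2) / (2 * SIGMA ** 2))
             for x in range(w)] for y in range(h)]

def decode_keypoint(heatmap: list) -> tuple:
    """Predicted keypoint = coordinate-wise argmax over per-pixel scores."""
    best, best_yx = -1.0, (0, 0)
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best:
                best, best_yx = score, (y, x)
    return best_yx
```

Training then minimizes binary cross-entropy between the model's predicted score map and these Gaussian targets, with label-preserving affine augmentations applied to the images.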
<div xmlns="http://www.tei-c.org/ns/1.0"><figure type="table"><head>Static evaluations: Keypoint MSE (px) and Precision</head><table><row role="label"><cell>Method</cell><cell>Keypoint MSE (px)</cell><cell>Precision</cell></row><row><cell>OWLv2 Ensemble (ViT-L/14)</cell><cell>35.3 &#177; 1.01</cell><cell>1.83 &#177; 0.91</cell></row><row><cell>GPT-4-Turbo (w/ Vision)</cell><cell>36.39 &#177; 1.73</cell><cell>15.94 &#177; 2.55</cell></row><row><cell>Vocal Sandbox (Ours)</cell><cell>30.46 &#177; 3.61</cell><cell>69.41 &#177; 3.12</cell></row></table></figure></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.4 Learning Discrete Dynamic Movement Primitives from Demonstration</head><p>For our LEGO stop-motion animation setting, we implement our low-level skill policy as a library of discrete Dynamic Movement Primitives <ref type="bibr">[29,</ref><ref type="bibr">30]</ref>. We adopt the traditional discrete DMP formulation from Ijspeert et al. <ref type="bibr">[30]</ref>, defining a second-order point dynamical system in terms of the system state y, a goal g, and a phase variable x such that:</p><formula>&#964;&#178; &#255;&#776; = &#945; (&#947; (g &#8722; y) &#8722; &#964; &#7823;) + f(x, g)</formula><p>where &#945; and &#947; define gain terms, &#964; &#8712; (0, 1] denotes a temporal scaling factor, and f (x, g) is the learned forcing function that drives a DMP to follow a specific trajectory to the goal g; f (x, g) is implemented as a learned linear combination of J radial basis functions &#968;_j and the phase variable x such that:</p><formula>f(x, g) = (&#931;_j &#968;_j(x) w_j / &#931;_j &#968;_j(x)) &#183; x (g &#8722; y_0), with &#968;_j(x) = exp(&#8722;h_j (x &#8722; c_j)&#178;)</formula><p>where c_j and h_j are the heuristically chosen centers and heights of the basis functions, respectively, and y_0 is the initial state. We fit the DMP weights &#946; = {w_1, w_2, ..., w_J} with locally-weighted regression (LWR) from the provided kinesthetic demonstration. For all DMPs in this work, we use J = 32, with gain values &#945;_y = 25, &#947;_y = 25/4, and basis function parameters set following prior work <ref type="bibr">[30]</ref>.</p><p>We choose (discrete) DMPs to implement skill learning as they permit efficient learning from a kinesthetic demonstration, and have two properties that enable rich generalization: 1) to new goals (by specifying a new g), and 2) to arbitrary temporal scaling (by rescaling &#964;). 
This lets us induce a simple algebra for parameterizing our policy &#960;_(d,&#946;)(c_ref, l, N), indexing each learned DMP with a learned referent c_ref, a new goal location l, and a number of waypoints N (used to set &#964;) -in other words, each newly taught DMP becomes a callable skill such as track(loc: Location) that we can invoke with arbitrary new locations (from novel initial states) and arbitrary timing parameters (e.g., "can you track around Loki in 30 frames" or "I need a tracking shot around the tower... let's try 2 seconds").</p><p>Visualizing DMP Rollouts. Another advantage of using DMPs for parameterizing control is that they allow us to visualize entire trajectories prior to execution. Similar to how we visualize the keypoints and object segmentation masks in the collaborative gift-bag assembly setting, we provide a GUI that shows the robot and the planned path (and end-effector poses) via a simple MuJoCo-based viewer. Fig. <ref type="figure">6</ref> (Right) provides an example -we plot the original kinesthetic demonstration relative to the current robot pose in green (for reference), and the planned DMP trajectory in blue, along with the end-effector orientation frames at the beginning and end of the trajectory. Users can additionally advance the simulation dynamically to visualize the entire rollout (at the actual speed of execution).</p></div>
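Under the standard discrete-DMP formulation from Ijspeert et al. [30] that &#167;B.4 adopts, a rollout can be sketched as below. The first-order canonical system with gain `ALPHA_X`, the Euler integration step, and the particular basis-function centers are our assumptions (the paper defers these details to prior work), while the gains (25 and 25/4) and J = 32 follow the reported values:

```python
import math

ALPHA, GAMMA = 25.0, 25.0 / 4.0   # transformation-system gains from the appendix
ALPHA_X = 1.0                     # canonical-system gain (assumed, not specified)
J = 32                            # number of radial basis functions

def forcing(x: float, g: float, y0: float, w: list, c: list, h: list) -> float:
    """f(x, g): normalized RBF combination, scaled by phase x and span (g - y0)."""
    psi = [math.exp(-h[j] * (x - c[j]) ** 2) for j in range(J)]
    return sum(p * wj for p, wj in zip(psi, w)) / (sum(psi) + 1e-10) * x * (g - y0)

def rollout(y0: float, g: float, w: list, tau: float = 1.0,
            dt: float = 1e-3, steps: int = 4000) -> float:
    """Euler-integrate tau*dy = z, tau*dz = ALPHA*(GAMMA*(g - y) - z) + f,
    and the canonical system tau*dx = -ALPHA_X * x; returns the final state."""
    c = [math.exp(-ALPHA_X * j / (J - 1)) for j in range(J)]  # heuristic centers
    h = [1.0 / (ci ** 2) for ci in c]                         # heuristic heights
    y, z, x = y0, 0.0, 1.0
    for _ in range(steps):
        f = forcing(x, g, y0, w, c, h)
        z += dt / tau * (ALPHA * (GAMMA * (g - y) - z) + f)
        y += dt / tau * z
        x += dt / tau * (-ALPHA_X * x)
    return y
```

Because the forcing term vanishes as the phase x decays, the rollout converges to any newly specified goal g, and rescaling `tau` stretches or compresses the motion in time; these are exactly the two generalization properties exploited above.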
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B.5 Physical Robot Platform &amp; Controller Parameters</head><p>We use a Franka Emika Panda 7-DoF robot arm with a Robotiq 2F-85 parallel jaw gripper following the platform specification from DROID <ref type="bibr">[26]</ref>. The robot and its base are positioned at one side of a 3' x 5' table, across from the user, such that the user and robot share the tabletop workspace. We use an overhead ZED 2 RGB-D camera with known intrinsics and extrinsics. For robot control, we use a modified version of the DROID control stack based on Polymetis <ref type="bibr">[61]</ref>. Low-level policies command joint positions at 10 Hz to a joint impedance controller running at 1 kHz. We implement two compliance modes: a stiff mode, activated while the robot is executing a low-level skill, and a compliant mode for when the user provides a kinesthetic demonstration.</p><p>Safety. We include multiple safeguards to ensure user safety. First, users can cancel any proposed behavior when its interpretable trace is presented, using the physical Cancel button described in &#167;B.1 -this prevents execution and immediately backtracks the Vocal Sandbox system. Second, during execution of any low-level skill, the user can interrupt the robot's motion with this button as well; this halts the motion and the arm immediately becomes fully compliant. 
Lastly, during user studies, both the user and the proctor have access to the hardware emergency stop button, which cuts the robot's power supply and mechanically locks the robot arm.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.3 Extended Related Work</head><p>Vocal Sandbox builds on a rich history of work on human-robot interaction, ranging from systems for collaboration in situated environments <ref type="bibr">[1,</ref><ref type="bibr">4,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">66]</ref>, to learned methods for shared autonomy <ref type="bibr">[67]</ref><ref type="bibr">[68]</ref><ref type="bibr">[69]</ref>, to platforms for assistive robotics <ref type="bibr">[70]</ref><ref type="bibr">[71]</ref><ref type="bibr">[72]</ref>, amongst many others <ref type="bibr">[2]</ref>.</p><p>While Vocal Sandbox is heavily inspired by this prior work, especially approaches that learn language interfaces for grounding user intent to low-level robot behavior <ref type="bibr">[8,</ref><ref type="bibr">73,</ref><ref type="bibr">74]</ref>, this is only the beginning. Future iterations of our framework will broaden the types of interactions and learning we permit (e.g., multi-robot teaming, or integrating modalities such as touch and nonverbal feedback), all driving towards general and seamless human-robot collaboration.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Additional Experiment Visualizations</head><p>We present additional visualizations of the LEGO stop-motion animation experiment, including examples of user-taught camera motions and stills from the two-hour interaction during filming.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>Wikipedia provides a thorough overview of the rich history and different styles of stop-motion animation, with a dedicated page on "Brickfilm" (e.g., using LEGOs) -the sub-genre we focus on in this work.</p></note>
		</body>
		</text>
</TEI>
