<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ComposableNav: Instruction-Following Navigation in Dynamic Environments via Composable Diffusion</title></titleStmt>
			<publicationStmt>
				<publisher>Conference on Robot Learning (CoRL)</publisher>
				<date>11/02/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10668399</idno>
					<idno type="doi"></idno>
					
					<author>Zichao Hu</author><author>Chen Tang</author><author>Michael J Munje</author><author>Yifeng Zhu</author><author>Alex Liu</author><author>Shuijing Liu</author><author>Garrett Warnell</author><author>Peter Stone</author><author>Joydeep Biswas</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>where instructions specify the navigation goals. In contrast, we focus on instruction-following navigation in dynamic environments. Specifically, we consider the under-explored setting in which the instructions describe specific robot interaction behaviors (e.g., "yield to a pedestrian") with respect to other dynamic obstacles or agents. Addressing this problem requires developing methods capable of grounding high-level instructions into fine-grained, low-level actions that account for the dynamic behaviors of other agents. Solving this problem would allow end users (human or AI agents) to customize robotic behaviors beyond their default settings, in ways that align with user preferences and nuanced social interactions.</p><p>A crucial challenge in instruction-following navigation is that a single instruction may contain multiple specifications for the robot to follow, e.g., "overtake the pedestrian while staying on the right side of the road" consists of two specifications: "overtake the pedestrian" and "walk on the right side of the road." Following an instruction amounts to simultaneously satisfying each of its constituent specifications. As the robot's capabilities and environment complexity increase, the space of possible combinations of such specifications grows exponentially. This combinatorial expansion makes popular learning-based methods, such as imitation learning <ref type="bibr">[4]</ref> or reinforcement learning <ref type="bibr">[5,</ref><ref type="bibr">6]</ref>, impractical as they demand substantial data and computational resources.</p><p>To address this challenge, we build our solution upon the idea of composition. Rather than training a single model to handle an exponential number of possible combinations, we propose to train separate motion primitives for each individual category of specifications. 
At deployment time, we assume that an upstream module, such as a large language model, can decompose a natural-language instruction into a set of specifications online. The corresponding motion primitives can then be composed to generate a trajectory that follows the instruction. This approach significantly reduces complexity from exponential to linear: a relatively small set of motion primitives can support a combinatorially large space of instructions, enabling users to specify diverse robot behaviors required in real-world social navigation. Notably, in our setting, the relevant primitives are blended while composed, rather than stitched together sequentially <ref type="bibr">[7,</ref><ref type="bibr">8]</ref>, since the robot's trajectory needs to satisfy all the specifications simultaneously.</p><p>To this end, we present ComposableNav, a composable, diffusion-based motion planner that composes motion primitives based on the instruction specifications to generate instruction-following motion trajectories. The core intuition motivating our approach is that diffusion models <ref type="bibr">[9,</ref><ref type="bibr">10]</ref> are highly effective at representing complex probability distributions, and that these models can be composed to form joint distributions <ref type="bibr">[11]</ref>. Leveraging this property, we can separately train diffusion models to learn motion primitives, each represented as a distribution over trajectories that satisfy a specific instruction specification. At deployment time, ComposableNav composes the relevant motion primitives based on the provided instruction specifications, constructs the corresponding joint distribution, and samples a trajectory that simultaneously satisfies all specified instructions. Additionally, to avoid the onerous need for demonstrations of individual motion primitives, we introduce a two-stage training procedure consisting of supervised pre-training followed by reinforcement learning (RL) fine-tuning. 
Finally, to ensure real-time performance, we incorporate a model predictive controller (MPC) <ref type="bibr">[12]</ref> and an online replanning strategy for low-latency action execution.</p><p>We demonstrate the effectiveness of ComposableNav through simulated and real-world experiments. With just six motion primitives (see Tab. 1 in the Appendix for the instruction list), we build a testbed with 24 instructions featuring various unseen specification combinations. Our results show that ComposableNav outperforms baseline approaches at following unseen instructions. Our main contributions are summarized as follows. (1) We introduce the use of composition as a strategy for instruction-following navigation in dynamic environments, making the problem tractable under limited data and computational resources for training. (2) We propose a diffusion-based learning method to model motion primitives as probability distributions, enabling their composition at deployment time. (3) We develop a two-stage training procedure, combining supervised pre-training and reinforcement learning fine-tuning, that effectively learns motion primitives without the need for specialized demonstration datasets for each primitive.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>Instruction Following Navigation. A key area in instruction-following navigation is vision-language navigation (VLN) <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3]</ref>, which combines natural language understanding with visual perception to guide agents through 3D environments <ref type="bibr">[13,</ref><ref type="bibr">14,</ref><ref type="bibr">15,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">18]</ref>. However, these methods assume static settings and overlook scenarios involving dynamic agents, where instructions specify interactions with moving agents. In contrast, social robot navigation focuses on enabling robots to operate in dynamic environments <ref type="bibr">[19,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">27]</ref>, but lacks instruction conditioning. Recent work has shown promise in using vision-language models (VLMs) to address this gap. CoNVOI <ref type="bibr">[28]</ref> and VLM-Social-Nav <ref type="bibr">[29]</ref> leverage VLMs' reasoning to interpret environment observations and suggest actions, but they face high inference latency and planning inconsistency. BehAV <ref type="bibr">[30]</ref> addresses some of these issues by generating cost maps from VLM outputs, yet struggles with sample inefficiency in geometric planning. Our work proposes a novel alternative using diffusion models <ref type="bibr">[9]</ref> to compose motion primitives, without relying on costly VLM inference.</p><p>Diffusion For Robotics. 
Diffusion models have emerged as powerful tools for solving a variety of robotics tasks <ref type="bibr">[31,</ref><ref type="bibr">32,</ref><ref type="bibr">33,</ref><ref type="bibr">34,</ref><ref type="bibr">35,</ref><ref type="bibr">36]</ref>, with training typically performed using either supervised learning <ref type="bibr">[32,</ref><ref type="bibr">33,</ref><ref type="bibr">35,</ref><ref type="bibr">31]</ref> or RL <ref type="bibr">[34,</ref><ref type="bibr">37]</ref>. A distinctive advantage of diffusion models is their ability to guide the sampling process after training <ref type="bibr">[38]</ref>. Janner et al. <ref type="bibr">[39]</ref> first relate this guided sampling mechanism to the control-as-inference framework <ref type="bibr">[40]</ref>, demonstrating that classifier guidance enables the generation of motion plans for previously unseen goal configurations. However, designing such classifiers can be challenging. To address this, Luo et al. <ref type="bibr">[36]</ref> interpret diffusion models as energy-based models, training separate models and composing them at inference time to generalize to novel environments. Building on this idea of composition, our work differs in that we do not assume access to diverse motion primitive datasets for supervised training. Instead, we propose using reinforcement learning <ref type="bibr">[34]</ref> to fine-tune diffusion models, and composing them to generate trajectories that satisfy unseen combinations of specifications from an instruction.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Formulation</head><p>We consider the problem of instruction-following robot navigation in dynamic environments, where the objective is to generate a motion trajectory &#964; that follows a given instruction I, based on the robot's observation O of the environment. We represent the motion trajectory &#964; as a sequence of 2D waypoints at fixed time intervals, which are then tracked by a model predictive controller to produce fine-grained actions in real time. The observation O encodes the state of entities relevant to the instruction, such as the current and predicted positions of dynamic agents. Note that other representations are also possible, such as full SE(3) poses for &#964; or RGB images for O.</p><p>In this work, we assume an instruction I can be decomposed into a set of independent specifications I &#8594; &#966;^{(1)}, &#966;^{(2)}, &#8230;, &#966;^{(k)}. Each specification &#966;^{(i)} : &#964; &#215; O &#8594; {0, 1} evaluates whether the trajectory meets the corresponding requirement, returning 1 if it does and 0 otherwise. To determine whether a trajectory &#964; follows an instruction I, &#964; must satisfy all relevant specifications. Formally, &#964; follows I if and only if &#966;^{(i)}(&#964;, O) = 1 for all i &#8712; {1, &#8230;, k}.</p><p>Solving this problem is challenging because the trajectory must simultaneously satisfy all specifications &#966;^{(i)}, whose combinations can grow exponentially. In the following sections, we explain how leveraging diffusion models enables us to compose motion primitives and generate trajectories that can follow instructions during robot deployment.</p></div>
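<div xmlns="http://www.tei-c.org/ns/1.0"><p>The conjunction of specifications described above can be illustrated with a minimal Python sketch. The two specification functions, the observation keys, and the numeric thresholds are hypothetical choices made for illustration (they are not the rule-based checks used in this work), and "right" is assumed to mean a y-coordinate below the lane centerline for a robot travelling along +x.</p>
<ab><![CDATA[
```python
# Illustrative sketch only: the two specification functions, the observation
# keys, and the thresholds below are hypothetical and not taken from the paper.
import math


def stays_right(trajectory, observation):
    """Return 1 if every waypoint is right of the centerline, else 0.

    Assumes the robot travels along +x, so "right" means y below the centerline.
    """
    centerline_y = observation["centerline_y"]
    return int(all(y < centerline_y for _, y in trajectory))


def keeps_distance(trajectory, observation):
    """Return 1 if each waypoint stays at least min_gap away from the
    pedestrian's predicted position at the same time index, else 0."""
    min_gap = observation["min_gap"]
    pedestrian = observation["pedestrian_positions"]
    return int(all(
        math.dist(robot_xy, ped_xy) >= min_gap
        for robot_xy, ped_xy in zip(trajectory, pedestrian)
    ))


def follows_instruction(trajectory, observation, specifications):
    """A trajectory follows the instruction only if every specification returns 1."""
    return int(all(spec(trajectory, observation) == 1 for spec in specifications))


observation = {
    "centerline_y": 0.0,
    "min_gap": 1.5,
    "pedestrian_positions": [(0.0, 2.0), (1.0, 2.0), (2.0, 2.0)],
}
good = [(0.0, -1.0), (1.0, -1.0), (2.0, -1.0)]
bad = [(0.0, -1.0), (1.0, 0.5), (2.0, -1.0)]  # second waypoint crosses the centerline
specs = [stays_right, keeps_distance]
print(follows_instruction(good, observation, specs))
print(follows_instruction(bad, observation, specs))
```
]]></ab>
<p>Because each specification is a separate predicate, adding a new one only extends the list passed to the checker; the difficulty addressed in the following sections is generating a trajectory that passes all of them at once.</p></div>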
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Preliminaries</head><p>We provide a brief overview of the two key techniques used in ComposableNav, conditional diffusion models <ref type="bibr">[10,</ref><ref type="bibr">9,</ref><ref type="bibr">41,</ref><ref type="bibr">42]</ref> and denoising diffusion policy optimization (DDPO) <ref type="bibr">[34]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Conditional Diffusion Models</head><p>In this work, we consider conditional diffusion probabilistic models <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">42]</ref>, which belong to a family of generative models trained to represent a conditional distribution p(x | c), where c is the corresponding context. These models are trained to reverse a forward diffusion process q(x_t | x_{t-1}) that gradually adds Gaussian noise to the data x_0 &#8764; p(x | c). To learn this reverse process, the model is trained to predict the noise &#949; at each step t using a denoising network, f_&#952;(x_t, t, c) &#8776; &#949;, where x_t is the noisy data at step t. The network is optimized using a training objective that penalizes the mean squared error between the predicted and actual noise at step t:</p><formula notation="TeX">\mathcal{L}(\theta) = \mathbb{E}_{x_0, c, \epsilon, t}\left[\lVert \epsilon - f_\theta(x_t, t, c) \rVert^2\right]</formula><p>This objective is derived from maximizing a variational lower bound on the data log-likelihood <ref type="bibr">[9]</ref>.</p><p>At inference time, the model generates a data sample by starting from Gaussian noise x_T &#8764; N(0, I) and progressively denoising it using the learned denoising network for T steps. The reverse process at each timestep t follows a Gaussian distribution with a time-dependent covariance matrix &#963;_t^2 I, where &#963;_t^2 is treated as a hyperparameter:</p><formula notation="TeX">p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\left(x_{t-1};\ \mu_\theta(x_t, t, c),\ \sigma_t^2 I\right)</formula><p>Here, the mean &#956;_&#952;(x_t, t, c) is computed from x_t and the predicted noise f_&#952;(x_t, t, c). This iterative process continues until a final sample x_0 is obtained, which is approximately distributed according to the true conditional distribution p(x | c).</p></div>
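<div xmlns="http://www.tei-c.org/ns/1.0"><p>The noise-prediction objective described above can be sketched in a few lines of plain Python. This is an illustration under assumptions rather than the training code used in this work: the closed-form noising step and the symbol alpha_bar_t follow the usual DDPM convention <ref type="bibr">[9]</ref>, data are flat lists of floats, and the "denoiser outputs" are hand-picked stand-ins for a network f_&#952;.</p>
<ab><![CDATA[
```python
# Illustrative sketch only. alpha_bar_t follows the usual DDPM notation for the
# cumulative signal level at step t; it is an assumption, not the paper's symbol.
import math
import random


def add_noise(x0, alpha_bar_t, rng):
    """Sample x_t from x_0 in closed form and return (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    scale_signal = math.sqrt(alpha_bar_t)
    scale_noise = math.sqrt(1.0 - alpha_bar_t)
    x_t = [scale_signal * x + scale_noise * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(predicted_eps, true_eps):
    """Mean squared error between the predicted and the actual noise."""
    return sum((p - e) ** 2 for p, e in zip(predicted_eps, true_eps)) / len(true_eps)


rng = random.Random(0)
x0 = [0.5, -0.25, 1.0]
x_t, eps = add_noise(x0, alpha_bar_t=0.5, rng=rng)
print(noise_prediction_loss(eps, eps))         # a perfect denoiser has zero loss
print(noise_prediction_loss([0.0] * 3, eps))   # predicting "no noise" is penalized
```
]]></ab>
<p>In an actual training loop the first argument of the loss would be the network output f_&#952;(x_t, t, c), and the loss would be averaged over randomly drawn data, contexts, and steps t.</p></div>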
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Denoising Diffusion Policy Optimization (DDPO)</head><p>ComposableNav uses denoising diffusion policy optimization (DDPO), proposed by Black et al. <ref type="bibr">[34]</ref>, to fine-tune diffusion models with reinforcement learning (RL) so that they generate the motion primitives corresponding to the instruction specifications. DDPO models the multi-step denoising process as a multi-step Markov decision process (MDP), defined as a tuple M = (S, A, &#961;_0, P, R), where S is the state space, A is the action space, &#961;_0 is the distribution of initial states, P is the transition kernel, and R is the reward function. We denote the timestep of this multi-step MDP as i. The denoising process is mapped into this MDP as follows:</p><p>where &#948;_y denotes the Dirac distribution with nonzero density only at y.</p><p>The key insight behind this technique is that the reverse process in a diffusion model is a Markovian process, where each denoising step p_&#952;(x_{t-1} | x_t, c) is modeled as a Gaussian distribution (see Eq. 3). By interpreting each denoising step as the policy &#960;(a_i | s_i) in an MDP, the policy itself becomes Gaussian, which allows for the exact evaluation of log-likelihoods and their gradients with respect to the diffusion model parameters. As a result, this formulation enables the use of policy gradient methods, such as PPO <ref type="bibr">[43]</ref>, to optimize the diffusion model's denoising network.</p><p>The DDPO algorithm alternates between (1) collecting denoising trajectories x_T, x_{T-1}, &#8230;, x_0 via sampling and (2) updating the model parameters using gradient descent. Finally, the policy gradient objective used in DDPO can be expressed as:</p><p>where the expectation is taken over trajectories generated using the previous model parameters &#952;_old.</p></div>
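<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the "Gaussian policy" view concrete, the sketch below evaluates the exact log-likelihood of one denoising step and plugs it into a PPO-style clipped surrogate <ref type="bibr">[43]</ref>. It is a simplified illustration, not the DDPO implementation: the function names, the clip range of 0.2, the scalar advantage, and the toy numbers are all assumptions, and a real update would sum this term over all denoising steps and differentiate it with respect to the network parameters.</p>
<ab><![CDATA[
```python
# Illustrative sketch only: function names, clip_range, and the toy values are
# assumptions made for this example, not the DDPO reference implementation.
import math


def gaussian_log_prob(x_prev, mean, sigma):
    """Exact log-density of x_prev under the isotropic Gaussian N(mean, sigma^2 I)."""
    d = len(x_prev)
    squared_error = sum((x - m) ** 2 for x, m in zip(x_prev, mean))
    return (-0.5 * squared_error / sigma ** 2
            - d * math.log(sigma)
            - 0.5 * d * math.log(2.0 * math.pi))


def clipped_surrogate(log_prob_new, log_prob_old, advantage, clip_range=0.2):
    """PPO-style clipped objective for a single denoising step."""
    ratio = math.exp(log_prob_new - log_prob_old)  # importance weight vs. old policy
    clipped_ratio = max(1.0 - clip_range, min(ratio, 1.0 + clip_range))
    return min(ratio * advantage, clipped_ratio * advantage)


# One denoising step sampled under the old parameters, re-scored under new ones.
x_prev = [0.1, -0.2]
old_mean = [0.0, 0.0]
new_mean = [0.05, -0.1]
sigma_t = 0.5
advantage = 1.0  # e.g. a normalized reward of the final trajectory (assumed)

objective = clipped_surrogate(
    gaussian_log_prob(x_prev, new_mean, sigma_t),
    gaussian_log_prob(x_prev, old_mean, sigma_t),
    advantage,
)
print(objective)
```
]]></ab>
<p>The clipping keeps the updated denoiser close to the one that generated the samples, which is one reason a small learning rate and repeated resampling are natural companions to this objective.</p></div>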
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">ComposableNav</head><p>In this section, we present ComposableNav, a diffusion-based planner for instruction-following navigation. As shown in Fig. <ref type="figure">2</ref>, ComposableNav first learns motion primitives via a two-stage training procedure (see Sec. 5.1). At deployment, given instruction specifications, it selects relevant primitives and composes them by summing the predicted noise from each diffusion model during the denoising process (Sec. 5.2). Finally, for real-time control, ComposableNav is paired with an MPC (see Appendix F.2). Please refer to our codebase for implementation details. <ref type="foot">3</ref></p><p>fication. The diffusion model then generates trajectories for these environments, which are evaluated using a reward function based on how well they align with the instruction. While the reward function can take various forms, we adopt a simple rule-based heuristic approach, as the primitives considered in our experiments are straightforward to evaluate (example shown in Appendix C). The resulting trajectories and rewards are stored in a replay buffer, and the model is updated using PPO <ref type="bibr">[43]</ref>. Finally, after fine-tuning, we obtain multiple diffusion models f_&#952;^{&#966;^{(i)}}(&#964;_t, t, O), each representing a motion primitive associated with a specification &#966;^{(i)}.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Generating Instruction-Following Trajectories via Composing Motion Primitives</head><p>ComposableNav models the instruction-following motion trajectories &#964; as a conditional distribution p(&#964; | &#966;^{(1)}, o^{(1)}, &#8230;, &#966;^{(k)}, o^{(k)}), where each &#966;^{(i)} is a specification extracted from the instruction I, i.e., I &#8594; &#966;^{(1)}, &#8230;, &#966;^{(k)}, and each o^{(i)} is the environment observation corresponding to &#966;^{(i)}. We assume both the specifications and the environment observations can be extracted using off-the-shelf large language models and vision foundation models. Given the conditional independence assumption for each specification &#966;^{(i)} discussed in Sec. 3, the conditional distribution can be factorized as follows (derivation shown in Eq. 8):</p><formula notation="TeX">p(\tau \mid \phi^{(1)}, o^{(1)}, \dots, \phi^{(k)}, o^{(k)}) \propto p(\tau) \prod_{i=1}^{k} \frac{p(\tau \mid \phi^{(i)}, o^{(i)})}{p(\tau)}</formula><p>Here, each conditional trajectory distribution p(&#964; | &#966;^{(i)}, o^{(i)}) corresponds to a motion primitive represented by a diffusion model with denoising network f_&#952;^{&#966;^{(i)}}(&#964;_t, t, o^{(i)}). In contrast, the marginal trajectory distribution p(&#964;) is an unconditioned motion primitive, obtained by replacing the observation o^{(i)} with a null input &#8709;, i.e., f_&#952;^{&#966;^{(i)}}(&#964;_t, t, &#8709;), following the classifier-free guidance approach <ref type="bibr">[42]</ref>.</p><p>Following prior work <ref type="bibr">[11,</ref><ref type="bibr">36]</ref>, we compose motion primitives by summing the predicted noise from denoising networks, with user-defined weights w_i controlling the guidance strength for the i-th primitive (w_i is set to 1 for all primitives in this work). 
The composed noise is:</p><p>Here, with separate diffusion models for each primitive, p(&#964;) is defined as the average of the unconditioned outputs f_&#952;^{&#966;^{(i)}}(&#964;_t, t, &#8709;) across all considered primitives.</p><p>Finally, ComposableNav generates trajectories by iteratively applying the reverse diffusion process, starting from a noisy trajectory &#964;_T &#8764; N(0, I) and denoising via p_compose(&#964;_{t-1} | &#964;_t, &#966;^{(1)}, o^{(1)}, &#8230;, &#966;^{(k)}, o^{(k)}) = N(&#964;_t &#8722; &#949;, &#963;_t^2 I), where &#949; denotes the composed noise. After T steps, the process yields a trajectory &#964;_0, drawn from a distribution concentrated on trajectories that satisfy all specifications of the given instruction (see Appendix B for an intuitive explanation of diffusion composition, based on the score-based interpretation <ref type="bibr">[47]</ref>).</p></div>
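<div xmlns="http://www.tei-c.org/ns/1.0"><p>The composition step can be sketched as follows. This is a hedged illustration, not the released implementation: it assumes the weighted composition rule from the composable-diffusion literature <ref type="bibr">[11]</ref> (unconditioned noise plus weighted differences between each conditioned prediction and the unconditioned one), takes the unconditioned term as the average over primitives as stated in the text, and assumes a simplified reverse step whose mean is the current trajectory minus the composed noise. Function names and the toy noise vectors are hypothetical.</p>
<ab><![CDATA[
```python
# Illustrative sketch only. Function names are hypothetical, and the composition
# rule is the weighted form from the composable-diffusion literature, assumed here.
import random


def compose_noise(conditional, unconditional, weights=None):
    """Blend per-primitive noise predictions into one composed noise vector.

    conditional[i]   : noise predicted by primitive i given its observation
    unconditional[i] : noise predicted by primitive i given the null input
    """
    k = len(conditional)
    dim = len(conditional[0])
    if weights is None:
        weights = [1.0] * k  # w_i = 1 for every primitive, as in the text
    # Unconditioned term: average of the unconditioned outputs of all primitives.
    base = [sum(u[j] for u in unconditional) / k for j in range(dim)]
    return [
        base[j] + sum(w * (c[j] - base[j]) for w, c in zip(weights, conditional))
        for j in range(dim)
    ]


def reverse_step(tau_t, eps_hat, sigma_t, rng):
    """One denoising step, assuming the simplified mean tau_t - eps_hat."""
    return [x - e + sigma_t * rng.gauss(0.0, 1.0) for x, e in zip(tau_t, eps_hat)]


# Two primitives acting on a (flattened) two-dimensional toy trajectory.
conditional = [[1.0, 2.0], [3.0, 4.0]]
unconditional = [[0.0, 0.0], [2.0, 2.0]]
eps_hat = compose_noise(conditional, unconditional)
tau_next = reverse_step([0.5, 0.5], eps_hat, sigma_t=0.1, rng=random.Random(0))
print(eps_hat)
print(tau_next)
```
]]></ab>
<p>With a single primitive and unit weight the composed noise reduces to that primitive's conditioned prediction, so composition adds no overhead beyond one extra unconditioned forward pass per primitive.</p></div>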
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Experiments and Results</head><p>We evaluate ComposableNav in simulation and the real world to address the following questions:</p><p>(1) Can ComposableNav learn individual motion primitives that satisfy each instruction specification without relying on demonstration data? (2) To what extent can ComposableNav compose motion primitives to generate trajectories that satisfy unseen combinations of specifications, in comparison to baseline approaches? (3) Can ComposableNav operate in real time when deployed on a real-world robot and enable the robot to follow instructions in dynamic environments involving pedestrian interactions? In our experiments, we consider six navigation motion primitives, as shown in Fig. <ref type="figure">3a</ref> and Appendix C, with training details provided in Appendix D.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Simulation Experiments</head><p>Environment Setup. Using six motion primitives, we built a testbed with 24 instructions featuring various unseen specification combinations (see Appendix E.3.2). Instructions are grouped by</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Limitations and Future Work</head><p>This work has several limitations that future research could address.</p><p>First, we only considered six commonly used navigation primitives when composing novel behaviors. These primitives are relatively simple and can be described with straightforward rule-based reward functions (see Appendix C). However, manually designing such reward functions does not scale well as the number of primitives grows. Importantly, our method is not tied to a particular way of crafting rewards. A promising and direct direction for improving scalability is to leverage vision-language models (VLMs) as verifiers, which were shown to be effective in DDPO <ref type="bibr">[34]</ref>, to automatically learn diverse and complex behaviors without relying on handcrafted rewards.</p><p>Second, we assume that tasks such as parsing instructions into specifications and detecting relevant observations can be handled by existing methods, such as off-the-shelf LLMs and VLMs. Since our focus is on composable planning, we abstract these components away in our experiments. Future work may stack high-level VLM-based modules, as in prior studies <ref type="bibr">[30,</ref><ref type="bibr">49]</ref>, on top of ComposableNav to close the loop. Furthermore, a high-level task planner <ref type="bibr">[50,</ref><ref type="bibr">51]</ref> could be introduced to further support long-horizon instruction-following navigation.</p><p>Third, although ComposableNav significantly outperforms baseline methods, it still shows a notable decline in success rate as the number of instruction specifications increases. This limitation may stem from the underlying composition strategy: following Liu et al. <ref type="bibr">[11]</ref>, our method composes diffusion models by summing the predicted noise from individual denoising networks. As Du et al. <ref type="bibr">[38]</ref> highlight, however, this approach can lead to suboptimal results. 
Future work may explore more advanced sampling techniques for diffusion models, such as Hamiltonian Monte Carlo <ref type="bibr">[52,</ref><ref type="bibr">53]</ref>, to improve composition performance under higher instruction complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.2 Training</head><p>To train the base model, we collected trajectories across three types of environments: (1) collision-free trajectories in dynamic settings, (2) trajectories that avoid specified regions, and (3) trajectories that intentionally traverse specific regions. The first two types were generated in simulation by randomly placing obstacles and planning trajectories to avoid them. For the third type, we sampled trajectories from the first two types, randomly selected a region that each trajectory passes through, and re-labeled this region as the observation to form training pairs.</p><p>We collected approximately 2 million collision-free trajectories and trained the denoising network for 2000 epochs, using a learning rate of 2 &#215; 10^{-4} and a dropout rate of 0.1. Training followed the classifier-free guidance approach <ref type="bibr">[42]</ref>, where the model was conditioned on a null context (represented by zero vectors) with a probability of 20%, instead of using features extracted from the context encoder. We also applied an exponential moving average (EMA) to the model parameters during training, which stabilizes optimization and improves generalization by smoothing out noisy updates. Following prior work <ref type="bibr">[33]</ref>, the diffusion model performs a total of 25 denoising steps using an exponential noise schedule, generating a trajectory composed of a sequence of fixed-length, time-dependent waypoints.</p><p>To fine-tune the base denoising network for each motion primitive, we adapted the DDPO implementation <ref type="foot">5</ref>. For simplicity, we replaced the original vision-language model (VLM)-based reward function with a heuristic-based one, designed according to the specific definitions of each motion primitive in this work. During each training epoch, we generated 32 different environments and trained the model for a total of 1000 epochs. 
A noteworthy observation from our experiments is that a significantly lower learning rate greatly enhances training performance. Based on this finding, we adopted a learning rate of 1 &#215; 10^{-6}, which is also consistent with results reported in the literature <ref type="bibr">[6]</ref>.</p></div>
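<div xmlns="http://www.tei-c.org/ns/1.0"><p>Two of the training details above, null-context conditioning with probability 20% and the exponential moving average of parameters, can be sketched in plain Python. The function names, the flat-list representation of features and parameters, and the EMA decay of 0.999 are assumptions made for illustration; the actual decay value is not stated here.</p>
<ab><![CDATA[
```python
# Illustrative sketch only: names, the list-based parameters, and the default
# EMA decay are assumptions; only the 20% null-context probability is from the text.
import random


def maybe_drop_context(context_features, rng, drop_prob=0.2):
    """Classifier-free guidance training: with probability drop_prob, replace the
    context features by a null context represented as a zero vector."""
    if rng.random() < drop_prob:
        return [0.0] * len(context_features)
    return list(context_features)


def ema_update(ema_params, params, decay=0.999):
    """Move the averaged parameters a small step toward the current parameters."""
    return [decay * e + (1.0 - decay) * p for e, p in zip(ema_params, params)]


rng = random.Random(0)
features = [0.3, -1.2, 0.7]
dropped = sum(
    maybe_drop_context(features, rng) == [0.0, 0.0, 0.0] for _ in range(10000)
)
print(dropped / 10000)  # empirical fraction of null contexts, close to drop_prob
print(ema_update([1.0, 1.0], [0.0, 2.0], decay=0.9))
```
]]></ab>
<p>Training on both conditioned and null-context inputs is what later allows the same network to supply the unconditioned noise prediction needed for composition.</p></div>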
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Additional Simulation Experiment Details E.1 Simulation ComposableNav Setup</head><p>We evaluate ComposableNav in a 20 m &#215; 20 m 2D simulation arena. Dynamic humans are modeled as spheres, whose future positions are predicted under a constant-velocity assumption, while the static environment is represented as a rectangular region specified by the coordinates of its four corner points. For each instruction, 20 environments are randomly initialized, assigning initial positions and speeds to the entities based on the specific requirements of the instruction. The simulation runs with a control time step of &#8710;t = 0.1 s (i.e., at 10 Hz), and each episode lasts for a maximum of 300 timesteps, equivalent to 30.0 seconds.</p></div>
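<div xmlns="http://www.tei-c.org/ns/1.0"><p>The constant-velocity prediction and the episode horizon described above amount to the following minimal sketch. The function name and the example pedestrian velocity are hypothetical; only the time step of 0.1 s and the limit of 300 steps come from the setup.</p>
<ab><![CDATA[
```python
# Illustrative sketch only: the function name and the example velocity are
# hypothetical; DT and MAX_STEPS mirror the values stated in the setup.
DT = 0.1          # control time step in seconds
MAX_STEPS = 300   # maximum episode length in time steps


def predict_constant_velocity(position, velocity, num_steps, dt=DT):
    """Predict future 2D positions assuming the agent keeps its current velocity."""
    x, y = position
    vx, vy = velocity
    return [(x + vx * dt * k, y + vy * dt * k) for k in range(1, num_steps + 1)]


# A pedestrian at the origin walking along +x at 1.2 m/s (example values).
future = predict_constant_velocity((0.0, 0.0), (1.2, 0.0), num_steps=3)
print(future)
print(MAX_STEPS * DT)  # episode time limit in seconds
```
]]></ab>
<p>Predictions of this form can be stacked into the observation that conditions a motion primitive, which is why prediction error grows with the planning horizon and online replanning remains necessary.</p></div>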
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E.2 Baseline Setup</head><p>In this work, we consider three baseline methods: VLM-Social-Nav <ref type="bibr">[29]</ref>, CoNVOI <ref type="bibr">[28]</ref>, and BehAV <ref type="bibr">[30]</ref>. These baselines fall into two categories: the first two treat the VLM as a black-box policy that proposes a target action (e.g., next waypoint or velocity) for a geometric planner to track, while the third computes composable cost maps for planning.</p><p>None of these baseline methods is explicitly designed to solve the problem considered in this work. Therefore, we adapt them for our experimental setup. For VLM-Social-Nav and CoNVOI, we use the latest GPT-4.1 model, disregard the high inference latency associated with invoking a remote VLM, and focus on evaluating their prediction accuracy. Additionally, we modify these approaches by providing annotated screenshots of the simulated scenes and prompting the VLMs for reasoning.</p><p>For BehAV, whose core idea is to create composable cost maps and plan trajectories over them (similar to VoxPoser <ref type="bibr">[49]</ref>), we simplify the setup by abstracting away the segmentation vision model.</p><p>them under a general instruction category: "Pass a person" (denoted as P) for motion composition purposes. Accordingly, we evaluate our method, ComposableNav, on its ability to handle both left and right variants using a single instruction specification.</p><p>The quantitative results are summarized in Tab. 3. Across all testbed scenarios, ComposableNav consistently outperforms all baseline methods in terms of success rate, particularly as the number of motion primitives in an instruction increases. 
While baseline methods perform reasonably with simple instructions, their performance deteriorates notably with more complex instruction sets.</p><p>Methods relying on vision-language models (VLMs) as black-box policies perform particularly poorly. These models are not designed for such navigation tasks and often fail to maintain planning consistency, especially as instruction complexity grows. Similarly, BehAV performs adequately with one or two motion primitives but suffers as composition complexity increases. In addition, BehAV has the lowest goal-reaching rate, which suggests that such a cost-map-based method tends to get trapped in local minima and is unable to complete tasks within the allotted time.</p><p>Furthermore, baseline methods exhibit high variance in performance across different instruction combinations. In many cases, they fail to generate any viable, instruction-following trajectory. In contrast, ComposableNav, despite not being explicitly trained for any possible instruction composition, demonstrates strong generalization capabilities and consistently higher success rates across a wide range of scenarios.</p><p>To complement the quantitative analysis, we provide a qualitative illustration in Fig. <ref type="figure">7</ref>. Here, we observe that while VLM-based methods may initially steer the robot in the correct direction, they lack the responsiveness and consistency needed for sustained instruction following. These methods often begin to avoid regions or yield to pedestrians, but then fail to complete subsequent specifi-</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.3 Deployment Failure Case Analysis</head><p>We conducted a qualitative analysis of the failure cases observed when deploying ComposableNav on the robot and identified two common issues. The first issue stems from human tracking errors. Since both the robot and the human are in continuous motion, the person may temporarily exit the camera's field of view, particularly when the robot turns, causing the system to lose track of them even if they later reappear. While we applied a simple nearest-neighbor heuristic to reassign the human based on previous tracking data, occasional failures still occur in which the robot is unable to reliably re-identify the person. The second issue arises during replanning. We observed that the newly generated plan can sometimes diverge significantly from the original one. This can lead the MPPI controller to issue large acceleration or deceleration commands, resulting in jerky movements. Consequently, the robot may overshoot its intended state and struggle to stay on the planned path. We plan to address these issues in future work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_0"><p>Code is released at https://github.com/ut-amrl/ComposableNav.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_1"><p>https://github.com/lucidrains/denoising-diffusion-pytorch</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_2"><p>https://github.com/kvablack/ddpo-pytorch</p></note>
		</body>
		</text>
</TEI>
