<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Group-based Motion Prediction for Navigation in Crowded Environments</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10341308</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 5th Conference on Robot Learning</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Allan Wang</author><author>Christoforos Mavrogiannis</author><author>Aaron Steinfeld</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We focus on the problem of planning the motion of a robot in a dynamic multiagent environment such as a pedestrian scene. Enabling the robot to navigate safely and in a socially compliant fashion in such scenes requires a representation that accounts for the unfolding multiagent dynamics. Existing approaches to this problem tend to employ microscopic models of motion prediction that reason about the individual behavior of other agents. While such models may achieve high tracking accuracy in trajectory prediction benchmarks, they often lack an understanding of the group structures unfolding in crowded scenes. Inspired by the Gestalt theory from psychology, we build a Model Predictive Control framework (G-MPC) that leverages group-based prediction for robot motion planning. We conduct an extensive simulation study involving a series of challenging navigation tasks in scenes extracted from two real-world pedestrian datasets. We illustrate that G-MPC enables a robot to achieve statistically significantly higher safety and lower number of group intrusions than a series of baselines featuring individual pedestrian motion prediction models. Finally, we show that G-MPC can handle noisy lidar-scan estimates without significant performance losses.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Over the past three decades, there has been strong interest in robot navigation in pedestrian environments <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4,</ref><ref type="bibr">5]</ref>. Planning robot motion in such environments can be challenging due to the lack of rules regulating traffic, the close proximity of agents, and the complex emerging multiagent interactions. Further, accounting for human safety and comfort as well as robot efficiency adds to the complexity of the problem.</p><p>To address such specifications, a common paradigm <ref type="bibr">[3,</ref><ref type="bibr">4,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8]</ref> involves the integration of a behavior prediction model into a planning mechanism. Recent models tend to predict the individual interactions among agents to enable the robot to determine collision-free candidate paths <ref type="bibr">[3,</ref><ref type="bibr">4,</ref><ref type="bibr">9]</ref>. While this paradigm is well-motivated, it tends to ignore the structure of interaction in such environments. Often, the motion of pedestrians is coupled as a result of social grouping. Further, the motion of multiple agents can often be effectively grouped as a result of similarity in motion characteristics. Lacking a mechanism for understanding the emergence of this structure, the robot motion generation mechanism may yield unsafe or uncomfortable paths for human bystanders, often violating the space of social groups.</p><p>Motivated by such observations, we draw inspiration from human navigation to propose the use of group-based prediction for planning in crowd navigation domains. We argue that humans do not employ detailed individual trajectory prediction mechanisms. In fact, our motion prediction capabilities are short-term and do not scale with the number of agents. 
We do, however, employ effective grouping techniques that enable us to discover safe and efficient paths among motions of crowd networks. This anecdotal observation is aligned with gestalt theory from psychology <ref type="bibr">[10]</ref>, which suggests that organisms tend to perceive and process formations of entities, rather than individual components. Such techniques have recently led to advances in computer vision <ref type="bibr">[11]</ref> and computational photography <ref type="bibr">[12]</ref>. Similarly, we envision that a robot could reason about the formation of effective groups in a crowded environment and react to their motion as an effective way to navigate safely.</p><note place="foot"><p>5th Conference on Robot Learning (CoRL 2021), London, UK.</p></note><p>Figure <ref type="figure">1</ref>: Based on a representation of social grouping <ref type="bibr">[13]</ref>, we build a group behavior prediction model to empower a robot to perform safe and socially compliant navigation in crowded spaces. (a) Example of our representation overlaid on top of a scene from a real-world dataset <ref type="bibr">[14]</ref>. (b) A model predictive controller equipped with our prediction model is able to navigate around the group socially (right) as opposed to the baseline that cuts through the group (left).</p><p>In this paper, we propose a group-based representation coupled with a prediction model based on the group-space approximation model of <ref type="bibr">Wang and Steinfeld [13]</ref>. This model groups a crowd into sets of agents with similar motion characteristics and draws geometric enclosures around them, given observation of their states. The prediction module then predicts future states of these enclosures. We conduct an extensive empirical evaluation over 5 different scenes drawn from two human datasets <ref type="bibr">[14,</ref><ref type="bibr">15]</ref>, each with a flow-following and a crossing scenario. 
We further conduct the same set of evaluations with agents powered by ORCA <ref type="bibr">[16]</ref> that share the start and end locations in the datasets. Last but not least, we conduct an evaluation given inputs in the form of simulated laser scans, from which pedestrians are only partially observable or even completely occluded. We compare the performance of our group-based formulation against three individual reasoning baselines: a) a reactive baseline with no prediction; b) a constant velocity prediction baseline; c) one based on individual S-GAN trajectory predictions <ref type="bibr">[17]</ref>. We present statistically significant evidence suggesting that agents powered by our formulation produce safer and more socially compliant behavior and are potentially able to handle imperfect state estimates.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related Work</head><p>In recent years, a considerable amount of research has been devoted to the problem of robot navigation in crowded pedestrian environments <ref type="bibr">[3,</ref><ref type="bibr">4,</ref><ref type="bibr">8,</ref><ref type="bibr">18,</ref><ref type="bibr">19,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23]</ref>. Such environments often comprise groups of pedestrians navigating as coherent entities. &#352;ochman and Hogg <ref type="bibr">[24]</ref> suggest that 50-70% of pedestrians walk in groups. Many works address group detection. One popular area in this domain is static group detection, often leveraging F-formation theories <ref type="bibr">[25]</ref>. However, dynamic groups often dominate pedestrian-rich environments and exhibit different spatial behavior <ref type="bibr">[26]</ref>. In dynamic group detection, one common approach is to treat grouping as a probabilistic process in which groups reflect a close probabilistic association of pedestrian trajectories <ref type="bibr">[27,</ref><ref type="bibr">28,</ref><ref type="bibr">29,</ref><ref type="bibr">30,</ref><ref type="bibr">31]</ref>. Others use graph models to build inter-pedestrian relationships, with strong graphical connections indicating groups <ref type="bibr">[32,</ref><ref type="bibr">33]</ref>. The social force model <ref type="bibr">[34]</ref> also inspired Mazzon et al. <ref type="bibr">[35]</ref> and &#352;ochman and Hogg <ref type="bibr">[24]</ref> to develop features that indicate groups. Clustering is another common family of techniques, which groups pedestrians with similar features <ref type="bibr">[36,</ref><ref type="bibr">37,</ref><ref type="bibr">38,</ref><ref type="bibr">39]</ref>. 
For our formulation, it is sufficient to employ a simple clustering-based grouping method proposed by Chatterjee and Steinfeld <ref type="bibr">[39]</ref>. Other grouping methods will simply result in different group membership assignments.</p><p>Applications involving groups often focus on a specific aspect of behavior. For example, one focus in this area is how a robot should behave as part of the group formation <ref type="bibr">[40]</ref>. On dyad groups involving a single human and a robot, some researchers examined socially appropriate following behavior <ref type="bibr">[41,</ref><ref type="bibr">42,</ref><ref type="bibr">43,</ref><ref type="bibr">44]</ref> and guiding behavior <ref type="bibr">[45,</ref><ref type="bibr">46,</ref><ref type="bibr">47]</ref>. In works that do not include robots as part of pedestrian groups, some research teams studied how a robot should guide a group of pedestrians <ref type="bibr">[48,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref>. From a navigation perspective, Yang and Peters <ref type="bibr">[26]</ref> leverage groups as obstacles, but their group space only involves occasional O-space modeling from F-formation theories. Without the engineered occurrence of O-space, their representation reduces to one of our baselines. Katyal et al. <ref type="bibr">[51]</ref> introduce an additional cost term that leverages the robot's distance to the closest group in a reinforcement learning framework. They model groups using convex hulls directly generated from pedestrian coordinates instead of taking personal spaces into consideration. In our work, we additionally explore the capabilities of groups in handling imperfect sensor inputs. While our focus is on analysing the benefits of groups, our group-based formulation can be easily incorporated into the framework of Katyal et al. <ref type="bibr">[51]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Problem Statement</head><p>Consider a robot navigating in a workspace $\mathcal{W} \subseteq \mathbb{R}^2$ amongst $n$ other dynamic agents. Denote by $s \in \mathcal{W}$ the state of the robot and by $s^i \in \mathcal{W}$ the state of agent $i \in \mathcal{N} = \{1, \dots, n\}$. The robot is navigating from a state $s_0$ towards a destination $s_T$ by executing a policy $\pi : \mathcal{W}^{n+1} \times \mathcal{U} \to \mathcal{U}$ that maps the assumed fully observable world state $S = s \cup_{i=1:n} s^i$ to a control action $u \in \mathcal{U}$, drawn from a space of controls $\mathcal{U} \subseteq \mathbb{R}^2$. We assume that the robot is not aware of agents' destinations $s^i_T$ or policies $\pi^i : \mathcal{W}^{n+1} \times \mathcal{U}^i \to \mathcal{U}^i$, $i \in \mathcal{N}$. In this paper, our goal is to design a policy $\pi$ that enables the robot to navigate from $s_0$ to $s_T$ safely in a socially compliant fashion.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Group-based Prediction</head><p>We introduce a group representation building on prior work <ref type="bibr">[13]</ref> and a model for group-based prediction that is amenable for use in decentralized multiagent navigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Group Representation</head><p>Define as $\theta^i \in [0, 2\pi)$ the orientation of agent $i \in \mathcal{N}$, which is assumed to be aligned with the direction of its velocity $u^i$, extracted via finite differencing of its position over a timestep $dt$, and denote by $v^i = \|u^i\| \in \mathbb{R}_+$ its speed. We define an augmented state for agent $i$ as $$q^i = (s^i, \theta^i, v^i).$$</p><p>We treat a social group as a set of agents who are in close proximity and share similar motion characteristics. Assume that a set of $J$ groups, $\mathcal{J} = \{1, \dots, J\}$, navigates in a scene. Define by $g^i \in \mathcal{J}$ a label indicating the group membership of agent $i$. We then define a group $j \in \mathcal{J}$ as a set $G^j = \{i \in \mathcal{N} \mid g^i = j\}$ and collect the set of all groups in a scene into a set $\mathbf{G} = \{G^j \mid j \in \mathcal{J}\}$.</p><p>Extracting Group Membership. We define the combined augmented state of all agents as $q = \cup_{i=1:n} q^i$. To obtain group memberships for a set of agents $\mathcal{N}$, we apply the Density-Based Spatial Clustering of Applications with Noise algorithm (DBSCAN) <ref type="bibr">[52]</ref> on agent states: $$\{g^i\}_{i \in \mathcal{N}} = \mathrm{DBSCAN}(q \mid \epsilon_s, \epsilon_\theta, \epsilon_v),$$ where $\epsilon_s$, $\epsilon_\theta$, and $\epsilon_v$ are threshold values on agent distances, orientations, and speeds, respectively.</p><p>Extracting the Social Group Space. For each group $G^j$, $j \in \mathcal{J}$, we define a social group space as a geometric enclosure $\mathcal{G}^j$ around agents of the group. For each agent $i \in G^j$, we define a personal space $P^i$ as a two-dimensional asymmetric Gaussian based on the model introduced by Kirby <ref type="bibr">[53]</ref>. Refer to Appendix A for detailed descriptions.</p><p>Given the personal spaces $P^i$, $i \in G^j$, of all agents in a group $j$, we extract the social group space of the whole group as a convex hull: $$\mathcal{G}^j = \mathrm{ConvexHull}(\{P^i \mid i \in G^j\}).$$ The shape described by $\mathcal{G}^j$ is an obstacle-space representation of a group containing agents in close proximity with similar motion characteristics. 
For convenience, let us collect the spaces of all groups in a scene into a set $\mathcal{G} = \{\mathcal{G}^j \mid j \in \mathcal{J}\}$.</p></div>
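<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two extraction steps of this section can be illustrated with a minimal Python sketch. It is not the implementation used in the paper, and several choices are assumptions made for illustration only: two agents count as neighbours when all three thresholds hold at once, DBSCAN is run with a minimum cluster size of one (in which case it reduces to connected components of the neighbour graph), and each personal space is replaced by a circle of hypothetical radius instead of the asymmetric Gaussian of Appendix A. The function names are hypothetical as well.</p>
<p><![CDATA[
```python
import math


def angle_diff(a, b):
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def extract_groups(states, eps_s, eps_theta, eps_v):
    """Return one group label per agent.

    states: list of augmented states (x, y, theta, v).
    Assumption: min_samples = 1, so every agent is a core point and the
    clusters are the connected components of the neighbour graph.
    """
    n = len(states)
    labels = [-1] * n
    next_label = 0
    for seed in range(n):
        if labels[seed] != -1:
            continue
        labels[seed] = next_label
        stack = [seed]
        while stack:  # flood-fill over neighbours of the current cluster
            i = stack.pop()
            xi, yi, ti, vi = states[i]
            for j in range(n):
                if labels[j] != -1:
                    continue
                xj, yj, tj, vj = states[j]
                close = math.hypot(xi - xj, yi - yj) <= eps_s
                aligned = angle_diff(ti, tj) <= eps_theta
                paced = abs(vi - vj) <= eps_v
                if close and aligned and paced:
                    labels[j] = next_label
                    stack.append(j)
        next_label += 1
    return labels


def convex_hull(points):
    """Andrew's monotone chain; hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def group_space(states, members, radius=0.5, samples=16):
    """Hull around circular stand-ins for the members' personal spaces."""
    boundary = []
    for i in members:
        x, y = states[i][0], states[i][1]
        for k in range(samples):
            a = 2 * math.pi * k / samples
            boundary.append((x + radius * math.cos(a), y + radius * math.sin(a)))
    return convex_hull(boundary)
```
]]></p>
<p>Calling extract_groups on the augmented states and then group_space on the members of each label yields one polygon per group. Swapping the circular stand-in for a sampled asymmetric Gaussian contour would change only the points fed to convex_hull.</p></div>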
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Group Space Prediction Oracle</head><p>Based on the group-space representation of Sec. 4.1, we describe a prediction oracle that outputs an estimate of the future spaces occupied by a set of groups, $\mathcal{G}_{t:t_f}$, up to a time $t_f = t + f$, where $f$ is a future horizon, given a past sequence of group spaces $\mathcal{G}_{t_h:t}$ from time $t_h = t - h$, where $h$ is a window of past observations: $$\mathcal{G}^j_{t:t_f} = O^j(\mathcal{G}^j_{t_h:t}), \quad j \in \mathcal{J}, \tag{3}$$ where $O^j$ is a model generating a group space prediction for group $G^j$. Refer to Appendix B for a detailed description of partial input handling.</p><p>We implement the oracle $O^j$ of eq. (<ref type="formula">3</ref>) using a simple encoder-decoder network. The encoder follows the 3D convolutional architecture in <ref type="bibr">[54]</ref>, whereas the decoder mirrors the model layout of the encoder. The encoder-decoder network takes as input a sequence<ref type="foot">foot_0</ref> $\mathcal{G}_{t_h:t}$ and outputs a sequence $\mathcal{G}_{t+1:t_f}$, which we pass through a sigmoid layer. We supervise the encoder-decoder network's output using the binary cross entropy loss.</p><p>We verified the effectiveness of our encoder-decoder network on the 5 scenes of our experiments by conducting a cross-validation comparison against a baseline. The baseline predicts the future shapes by linearly translating the last social group shape using its geometric center velocity. We use Intersection over Union (IoU) as our metric. This metric divides the number of pixels shared by the ground truth and the prediction by the number of pixels occupied by either of them. As shown in Table <ref type="table">1</ref>, our encoder-decoder network outperforms the baseline.</p></div>
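<div xmlns="http://www.tei-c.org/ns/1.0"><p>The IoU comparison against the constant-velocity baseline can be sketched in a few lines of Python. This is an illustrative sketch rather than the evaluation code behind Table 1: group spaces are assumed to be given as sets of occupied integer pixel coordinates, the velocity of the geometric center is assumed to be expressed in pixels per step, and the helper names iou and translate are hypothetical.</p>
<p><![CDATA[
```python
def iou(pred, truth):
    """Intersection over Union of two sets of occupied (x, y) pixels."""
    union = pred | truth
    if not union:
        # Assumption: two empty masks count as perfect agreement.
        return 1.0
    return len(pred & truth) / len(union)


def translate(mask, velocity, steps):
    """Constant-velocity baseline: shift the last observed mask.

    velocity is (dx, dy) in pixels per step; the shift is rounded to
    whole pixels so that the result stays on the pixel grid.
    """
    dx, dy = velocity
    shift_x, shift_y = round(dx * steps), round(dy * steps)
    return {(x + shift_x, y + shift_y) for x, y in mask}
```
]]></p>
<p>For example, shifting a 2-by-2 block of pixels one pixel to the right leaves 2 shared pixels out of 6 occupied ones, so the IoU between the block and its shifted copy is 2/6.</p></div>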
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Model Predictive Control with Group-based Prediction</head><p>We describe G-MPC, a model predictive control (MPC) framework for navigation in multiagent environments that leverages the group-based prediction oracle of Sec. 4.</p><p>At planning time $t$, given a (possibly partial) augmented world state history $Q_{t_h:t}$, we first extract a sequence of group spaces $\mathcal{G}_{t_h:t}$ based on the method of Sec. 4.1. Given these, the robot computes an optimal control trajectory $u^* = u^*_{1:K}$ of length $K$ by solving the following optimization problem:</p><p>$(s^*, u^*) = \arg\min$</p><p>where $\gamma$ is the discount factor and $J$ represents a cost function; eq. (<ref type="formula">5</ref>) initializes the group space history ($k = 2 - h$ is the timestep displaced a horizon $h$ in the past from the first MPC-internal timestep $k = 1$); eq. (<ref type="formula">6</ref>) initializes the robot state to the current robot state $s_t$; eq. (<ref type="formula">7</ref>) is an update rule recursively generating a predicted future group sequence up to timestep $k_f = k + f$ given history from time $k_h = k - h$ up to time $k$; $O$ represents a group-space prediction oracle based on Sec. 4; and eq. (<ref type="formula">9</ref>) is the robot state transition assuming a fixed time parametrization of step size $dt$.</p><p>We employ a weighted sum of costs $J_g$ and $J_d$, penalizing respectively distance to the robot's goal and proximity to groups:</p><p>where $\lambda$ is a weight representing the balance between the two costs and $J_g$ penalizes a rollout according to the distance of the last collision-free waypoint to the robot's goal. Further, we define $J_d$ as:</p><p>where $D(s_k - \mathcal{G}^j_k)$ returns the minimum distance between the robot state and the space occupied by group $j$ at time $k$. Building on this per-group distance $D$, the function $\mathcal{D}$ computes the minimum distance to any group at a given time. 
In most cases, the robot lies outside of groups, i.e., $s_k \notin \mathcal{G}^j_k$; therefore, the cost $J_d$ tries to maximize the distance $\mathcal{D}$. Sometimes, the robot might end up entering the group space $\mathcal{G}$; in those cases, $J_d$ tries to minimize $\mathcal{D}$, to steer the robot towards the direction of quickest escape from the group. If the robot is inside a group to begin with, we shrink the group sizes in Sec. 4.1 until the robot is outside the groups again.</p><p>To solve eq. (<ref type="formula">4</ref>), we search over a finite set $\mathbf{U}$ of control trajectories of horizon $K$. Assuming that the robot is holonomic and is not under any kinematic constraints, we use a set of $R$ control rollouts $\mathbf{U} = \{u^1, \dots, u^R\}$ with three levels of tangential speeds and a set of turning speeds, i.e.,</p><p>To ensure compatibility between our group-based prediction model and our MPC formulation, we set the control rollout time horizon to be the prediction model's prediction horizon, i.e., $K = f$.</p></div>
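<div xmlns="http://www.tei-c.org/ns/1.0"><p>A search over a finite set of rollouts of the kind described in this section might look as follows in Python. This is a sketch under stated assumptions, not the G-MPC implementation: predicted group spaces are approximated by discs (cx, cy, radius) instead of convex hulls, the proximity cost proximity_cost is a hypothetical choice that only mimics the described behavior (decreasing with distance outside a group, increasing with depth inside), the two costs are combined as goal cost plus a weighted discounted proximity cost, and the speed levels and turning rates are placeholder values apart from the 1.75 m/s maximum speed. The horizon of 8 steps and the step size of 0.1 follow the values quoted in the evaluation.</p>
<p><![CDATA[
```python
import math


def min_group_distance(pos, groups):
    """Signed distance from pos to the nearest group; negative when inside.

    Assumption: each group space is a disc (cx, cy, radius).
    """
    return min(
        (math.hypot(pos[0] - cx, pos[1] - cy) - r for cx, cy, r in groups),
        default=math.inf,
    )


def proximity_cost(d):
    """Hypothetical J_d: decays with distance outside a group and grows
    with penetration depth once the robot is inside one."""
    return math.exp(-d) if d >= 0 else 1.0 - d


def rollout(start, heading, speed, turn_rate, K, dt):
    """Waypoints reached by holding (speed, turn_rate) for K steps."""
    x, y = start
    path = []
    for _ in range(K):
        heading += turn_rate * dt
        x += speed * math.cos(heading) * dt
        y += speed * math.sin(heading) * dt
        path.append((x, y))
    return path


def plan(start, goal, predicted_groups, K=8, dt=0.1, lam=1.0, gamma=0.9,
         speeds=(0.5, 1.0, 1.75), turn_rates=(-1.0, -0.5, 0.0, 0.5, 1.0)):
    """Return (cost, (speed, turn_rate), path) of the cheapest rollout.

    predicted_groups[k] lists the group discs predicted for waypoint k.
    """
    heading0 = math.atan2(goal[1] - start[1], goal[0] - start[0])
    best = None
    for speed in speeds:
        for turn_rate in turn_rates:
            path = rollout(start, heading0, speed, turn_rate, K, dt)
            last_free, collided, j_d = start, False, 0.0
            for k, pos in enumerate(path):
                d = min_group_distance(pos, predicted_groups[k])
                j_d += gamma ** (k + 1) * proximity_cost(d)
                collided = collided or d < 0
                if not collided:
                    last_free = pos  # last waypoint before any intrusion
            j_g = math.dist(last_free, goal)
            cost = j_g + lam * j_d
            if best is None or cost < best[0]:
                best = (cost, (speed, turn_rate), path)
    return best
```
]]></p>
<p>In a receding-horizon loop, only the first control of the returned rollout would be executed before the group spaces are re-extracted and the search is repeated. With no groups predicted, the sketch simply picks the fastest straight rollout towards the goal.</p></div>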
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Evaluation</head><p>We evaluate our framework through a simulation study in which the robot performs a navigation task (a transition between two points) within a crowd of dynamic agents in a set of scenes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">Experimental Setup</head><p>We consider a set of realistic pedestrian scenes, drawn from the ETH <ref type="bibr">[14]</ref> (ETH and HOTEL scenes) and UCY <ref type="bibr">[15]</ref> (ZARA1, ZARA2 and UNIVERSITY scenes) datasets, which often serve as benchmarking testbeds in the motion prediction and social navigation literature <ref type="bibr">[17,</ref><ref type="bibr">55,</ref><ref type="bibr">56,</ref><ref type="bibr">57]</ref>. In each scene, we define two navigation tasks (see Fig. <ref type="figure">2</ref>): Flow, in which the robot navigates along the crowd flow, and Cross, in which the robot crosses the traffic flow perpendicularly. For each task, we generate a set of trials by segmenting the scene recording into blocks involving challenging interactions. We define a challenging interaction to be a segment involving at least 5 pedestrians inside the test region drawn in black in Fig. <ref type="figure">2</ref>. This process provided us with a distribution of trials as shown in <ref type="table">Table 2</ref>. Across all trials, we keep the robot's maximum speed at 1.75 m/s.</p><p>We consider two experimental conditions: an Offline and an Online one. In the Offline one, the robot navigates among a crowd moving according to a recording of a human crowd. Under this condition, pedestrians act as dynamic obstacles that do not react to the robot, a situation which could arise in cases where robots are short and could thus be easily missed by navigating pedestrians. In the Online one, the robot navigates among a crowd<ref type="foot">foot_1</ref> moving by running ORCA <ref type="bibr">[16]</ref>, a policy that is frequently used as a simulation engine for benchmarking in the social navigation literature <ref type="bibr">[8,</ref><ref type="bibr">57,</ref><ref type="bibr">58]</ref>.</p><p>To investigate the value of G-MPC, we develop three variants of it. 
group-pred is a G-MPC in which the encoder-decoder network has a history h = 8 and a horizon f = 8. group-nopred is a variant that features no prediction at all; it just reacts to observed groups at every timestep, and it is equivalent to the framework of Yang and Peters <ref type="bibr">[26]</ref>. Finally, laser-group-pred is identical to group-pred, but instead of using ground-truth pose information, it takes as input noisy lidar scan readings. We simulate this by modeling pedestrians as 1m-diameter circles and lidar scans as rays projecting from the robot. We refer to the spec sheet of a SICK LMS511 2D lidar for simulation parameters. We further inject noise into the readings according to the spec sheet. Under this simulation, pedestrians may only be partially observable or even completely occluded from the robot.</p><p>We compare the performance of these policies against a set of MPC variants using mechanisms for individual motion prediction. ped-nopred is a vanilla MPC that reacts to the current states of other agents without making predictions about their future states. ped-linear is a vanilla MPC that estimates future states of agents by propagating agents' current velocities forward. This baseline is motivated by recent work showing that constant-velocity models yield competitive performance in pedestrian motion prediction tasks <ref type="bibr">[59]</ref>. Finally, ped-sgan is an MPC that uses S-GAN <ref type="bibr">[17]</ref> to extract a sequence of future state predictions for agents based on inputs of their past states. We selected S-GAN because it is a recent, high-performing model. 
To ensure a fair comparison, all the MPC policy variants are integrated with the same MPC controller evaluated at dt = 0.1.</p><p>We measure the performance of the policies with respect to four different metrics: a) Success rate, defined as the ratio of successful trials over the total number of trials; b) Comfort, defined as the ratio of trials in which the robot does not enter any social group space over the total number of trials; c) Minimum distance to pedestrians, defined as the smallest distance between the robot and any agent per trial; d) Path length, defined as the total distance traversed by the robot in a trial.</p><p>To track the performance of G-MPC, we design hypotheses targeting aspects of safety and group space violation, which we investigate under both experimental conditions, i.e., offline and online:</p><p>H1: To explore the benefits of group-based representations alone, we hypothesize that group-nopred is safer than ped-nopred while achieving similar success rates but worse efficiency.</p><p>H2: To explore the full benefit of the group-based formulation, we hypothesize that group-pred is safer than ped-linear and ped-sgan while achieving similar success rates but worse efficiency.</p><p>H3: To explore how our formulation handles imperfect inputs, we hypothesize that laser-group-pred achieves similar safety to group-pred while achieving similar success rate and efficiency.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>H4: To check that our formulation is socially compliant, we hypothesize that group-nopred, group-pred and laser-group-pred violate agents' group space less often than the baselines.</p></div>
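<div xmlns="http://www.tei-c.org/ns/1.0"><p>The four metrics defined above are simple enough to state as code. The following Python sketch shows one way to compute them. The trial record layout (a dictionary with the keys success and intruded) and the function names are hypothetical, and the robot and pedestrian paths are assumed to be time-aligned lists of (x, y) positions.</p>
<p><![CDATA[
```python
import math


def path_length(path):
    """Total distance traversed along a list of (x, y) waypoints."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


def min_pedestrian_distance(robot_path, pedestrian_paths):
    """Smallest robot-to-agent distance over one trial.

    Assumption: every pedestrian path is sampled at the same timesteps
    as the robot path, so positions can be compared index by index.
    """
    return min(
        math.dist(r, p)
        for ped_path in pedestrian_paths
        for r, p in zip(robot_path, ped_path)
    )


def summarize(trials):
    """Success rate and comfort over a list of trial records.

    Each record is a dict with boolean entries 'success' (goal reached)
    and 'intruded' (robot entered at least one social group space).
    """
    n = len(trials)
    return {
        "success_rate": sum(t["success"] for t in trials) / n,
        "comfort": sum(not t["intruded"] for t in trials) / n,
    }
```
]]></p>
<p>Success rate and comfort are ratios over trials, whereas minimum distance and path length are per-trial quantities that would be aggregated afterwards, for instance by averaging across trials of the same scene and task.</p></div>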
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Results</head><p>Quantitative Analysis. H1: We can see from both Fig. <ref type="figure">3</ref> and Fig. <ref type="figure">4</ref> that G-MPC achieves statistically significantly larger minimum distances to pedestrians across all scenarios, often with p &lt; 0.001. This illustrates that the group representation is in itself capable of upgrading a simple MPC with no prediction. As expected, we observe that the price G-MPC pays for that is a larger average path length. We also see that success rates are comparable. Overall, we conclude that H1 holds.</p><p>H2: When future state predictions are considered, G-MPC obtains statistically significant results in most scenes, supporting the claim that it is safer at the cost of worse efficiency. Thus H2 is partially confirmed. In offline scenarios, G-MPC has lower success rates in crossing scenarios. Upon closer inspection, most failure cases are due to timeouts from G-MPC's conservative behavior. However, in online scenarios where pedestrians react to the robot, G-MPC achieves high success rates.</p><p>In real-world situations, to cross dense traffic, the robot needs to plan its actions with expectations of reactive pedestrians. Otherwise, the robot will most probably run into the freezing robot problem <ref type="bibr">[4]</ref>.</p><p>H3: Group-based representations have the potential to robustly account for imperfect state estimates. Overall, we observe that with simulated imperfect states, G-MPC does not perform statistically significantly worse in terms of safety, but in dense crowds of the UNIV scenes it has worse efficiency and worse success rates in online cases. This shows that H3 holds in terms of safety and, in moderately dense human crowds, holds in terms of efficiency. Future work on better group representation is needed to achieve better efficiency in high-density human crowds given imperfect states.</p><p>H4: From Fig. <ref type="figure">3</ref> and Fig. 
<ref type="figure">4</ref>, we can see that G-MPC often has fewer group-space intrusions than its baselines. While this relationship is not always statistically significant, we do see a general trend of the group-based approaches respecting group spaces more often than individual ones. Thus, we conclude that H4 is partially confirmed.</p><p>We additionally observe a general trend that group-pred is better than group-nopred in terms of higher success rates, lower chances of group intrusions, longer minimum distances to pedestrians and shorter path lengths. This shows that our group prediction model offers benefits to the robot's navigation. However, in a few scenarios group-nopred performs better. We largely attribute this to the finite inaccuracies of future group predictions and the freezing robot problem that accompanies the robot's more conservative behavior in group-pred than in group-nopred.</p><p>Qualitative Analysis. Qualitatively, regular MPC performs aggressive and socially inappropriate maneuvers more often than G-MPC. Fig. <ref type="figure">5</ref> shows two examples executed by ped-sgan and group-pred agents. In the offline condition, the MPC agent aggressively cuts in front of the two pedestrians to the left before proceeding headlong into the cluster of pedestrians, only managing to avoid the deadlock by escaping through the narrow gap that opens up. G-MPC, in contrast, tracks the movements of the two pedestrian groups coming from the left. When the two pedestrian groups merge, the agent turns around and reevaluates its approach to cross. In the online condition, we observe that the MPC agent cuts through a pedestrian group to reach the other side, forcing a member of the group to stop and yield, as indicated by the pedestrian's shrinking personal space, which is proportional to its speed. In the same situation, the G-MPC agent chooses to pass behind the social group.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>We introduced a methodology for generating group-based representations and predicting their future states. Through an extensive evaluation over flow and crossing scenarios drawn from 5 different real-world scenes from 2 different human datasets, with both reactive and non-reactive agents, we demonstrate the value of group-based prediction in enabling safe and socially compliant navigation. Through experimentation with simulated laser scans, our model displays promising potential to process noisy sensor inputs without much performance degradation.</p><p>Several improvements to our framework are possible. For example, we could incorporate state-of-the-art oracles in the form of advanced video prediction models <ref type="bibr">[60]</ref> or incorporate inter-group interaction modeling. Revisiting design choices such as the set of rollouts or the cost functions could possibly increase performance. We could also integrate our prediction model into alternative control frameworks such as reinforcement learning policies.</p><p>Finally, we plan on validating our findings on a real-world robot to fully test the capability of G-MPC to handle noisy sensor inputs. We also plan to investigate ways to improve computation time to enhance our approach's real-world applicability. These include simplifying group representation geometry and predicting future group states in metric space instead of in image space.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>The oracle input sequence is first converted into image-space coordinates using the homography matrix of the scene. We also preprocess inputs to have normalized scale and group positions. The autoencoder output is converted back into Cartesian coordinates using the inverse homography transform.</p></note>
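<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coordinate conversion mentioned in footnote 1 can be sketched as follows. This is an assumption-laden illustration rather than the preprocessing code of the paper: the homography is taken to be a 3-by-3 matrix H stored as nested lists, and the direction of the mapping (which side is image space and which is Cartesian space) depends on how the scene's matrix is defined. The function names are hypothetical.</p>
<p><![CDATA[
```python
def apply_homography(H, point):
    """Map a 2D point through a 3x3 homography using homogeneous coordinates."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def invert_3x3(H):
    """Inverse of a 3x3 matrix via the adjugate; raises if H is singular."""
    (a, b, c), (d, e, f), (g, h, i) = H
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("homography is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]
```
]]></p>
<p>Group-space vertices would be mapped with apply_homography before being rasterized for the network, and predicted shapes would be mapped back with the matrix returned by invert_3x3.</p></div>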
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>For consistency, the agents in the crowd start and end at the same spots as the agents in the recorded crowd from the Offline condition.</p></note>
		</body>
		</text>
</TEI>
