<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>InfoSTGCAN: An Information-Maximizing Spatial-Temporal Graph Convolutional Attention Network for Heterogeneous Human Trajectory Prediction</title></titleStmt>
			<publicationStmt>
				<publisher>MDPI</publisher>
				<date>06/01/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10527277</idno>
					<idno type="doi">10.3390/computers13060151</idno>
					<title level='j'>Computers</title>
<idno>2073-431X</idno>
<biblScope unit="volume">13</biblScope>
<biblScope unit="issue">6</biblScope>					

					<author>Kangrui Ruan</author><author>Xuan Di</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Predicting the future trajectories of multiple interacting pedestrians within a scene has increasingly gained importance in various fields, e.g., autonomous driving, human–robot interaction, and so on. The complexity of this problem is heightened due to the social dynamics among different pedestrians and their heterogeneous implicit preferences. In this paper, we present Information Maximizing Spatial-Temporal Graph Convolutional Attention Network (InfoSTGCAN), which takes into account both pedestrian interactions and heterogeneous behavior choice modeling. To effectively capture the complex interactions among pedestrians, we integrate spatial-temporal graph convolution and spatial-temporal graph attention. For grasping the heterogeneity in pedestrians’ behavior choices, our model goes a step further by learning to predict an individual-level latent code for each pedestrian. Each latent code represents a distinct pattern of movement choice. Finally, based on the observed historical trajectory and the learned latent code, the proposed method is trained to cover the ground-truth future trajectory of this pedestrian with a bi-variate Gaussian distribution. We evaluate the proposed method through a comprehensive list of experiments and demonstrate that our method outperforms all baseline methods on the commonly used metrics, Average Displacement Error and Final Displacement Error. Notably, visualizations of the generated trajectories reveal our method’s capacity to handle different scenarios.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>It is important to accurately predict pedestrian trajectories <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref>. For example, in situations like crosswalks and crowded public areas, accurately predicting pedestrian trajectories can improve safety and prevent potential accidents <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref>.</p><p>In monitoring systems, predicting pedestrian trajectories is pivotal in facilitating the detection of anomalous behaviors <ref type="bibr">[9]</ref><ref type="bibr">[10]</ref><ref type="bibr">[11]</ref>. Additionally, it can help optimize the planning of transportation systems with better insights into pedestrian flow and behavior modeling <ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref>.</p><p>Forecasting the trajectory of a pedestrian remains a significant challenge, primarily for two reasons: (1) the complexity of interactions among pedestrians in a given environment and (2) the heterogeneity in individual behavioral preferences. Regarding the first reason, there are multiple factors influencing a pedestrian's trajectory, e.g., static obstacles like trees and roads and dynamic components including vehicles and other pedestrians. As reported by <ref type="bibr">[17]</ref>, up to 70% of pedestrians in a crowd move in groups, such as families or friends walking together. Such interactions are mainly driven by "social interactions" <ref type="bibr">[18,</ref><ref type="bibr">19]</ref>. 
Regarding the second reason, different individuals usually display varied behaviors under similar circumstances <ref type="bibr">[20]</ref>, which makes it complicated to establish a universal behavioral model that fully represents the</p><p>Physics-based models One classical approach to tackle the challenge of pedestrian trajectory prediction is to utilize physics-based models. The physics-based models are usually characterized by some basic rules or generic functions, taking into account both physical constraints and pedestrians' social or psychological factors <ref type="bibr">[2,</ref><ref type="bibr">[36]</ref><ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref><ref type="bibr">[40]</ref>. One well-known physics-based model is the "cellular automaton model" <ref type="bibr">[41,</ref><ref type="bibr">42]</ref>. The cellular automaton model is a discrete model, which is based on the discrete motions of pedestrians traversing a grid of cells, and it is assumed that each cell is in one of a finite number of states.</p><p>A second type of physics-based model is the "social force model" <ref type="bibr">[18]</ref>. The social force model is a microscopic continuous model, which describes the motion of pedestrians through social forces, such as the destination choice and the need to avoid collisions with other pedestrians. Basically, the original social force model assumes that most scenarios encountered by pedestrians are standard, such that behavioral strategies acquired through experience can be utilized.</p><p>Later, ref. <ref type="bibr">[19]</ref> introduced "Nomad", a generalized version of the social force model, which incorporates behavioral rules and continuous route choice. This activity-based approach allows pedestrians to adapt their movements based on different traffic conditions, e.g., distance to a destination. 
In <ref type="bibr">[36]</ref>, the authors provided a comprehensive review of crowd motion simulation models, explaining their characteristics, applicability, and the underlying crowd movement phenomena. Interested in pedestrian behaviors at signalized crosswalks, ref. <ref type="bibr">[43]</ref> adapted the social force model and calibrated its parameters using maximum likelihood estimation.</p><p>Another type of physics-based model is represented by the category of "velocity-based models", which have been widely used in the game industry and robots <ref type="bibr">[44]</ref><ref type="bibr">[45]</ref><ref type="bibr">[46]</ref><ref type="bibr">[47]</ref>. Technically speaking, velocity-based models rely on differential equations, and their associated speed functions depend on the relative positions of the neighboring pedestrians and obstacles. For example, the reciprocal velocity obstacle model <ref type="bibr">[45]</ref> is able to navigate multiple agents in real time and generates safe and oscillation-free motions. In <ref type="bibr">[48]</ref>, the authors proposed the optimal reciprocal collision avoidance (ORCA) model, which can provide local collision-free motions for a large number of agents within a time interval.</p><p>Deep learning-based trajectory prediction Inspired by the success of deep learning models [22, <ref type="bibr">[49]</ref><ref type="bibr">[50]</ref><ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref>, numerous research studies have focused on utilizing deep learning models for the task of pedestrian trajectory prediction. Social long short-term memory (Social LSTM) <ref type="bibr">[27]</ref> is one of them. The Social LSTM model employs a type of recurrent network to learn the sequential movement of each pedestrian. To predict the trajectory afterwards, the "social pooling" mechanism is utilized to aggregate the output of the RNNs. 
Specifically, the model pools the neighbor hidden states of a pedestrian within a distance threshold. Later, based on LSTMs and Generative Adversarial Networks <ref type="bibr">[30]</ref>, Social GAN <ref type="bibr">[34]</ref> was proposed. Social GAN designed a novel pooling mechanism that calculates interactions according to the relative distances among pedestrians.</p><p>Subsequent works have built upon Social LSTM and Social GAN <ref type="bibr">[28,</ref><ref type="bibr">29,</ref><ref type="bibr">35,</ref><ref type="bibr">54]</ref>. For example, State-Refinement LSTM (SR-LSTM) <ref type="bibr">[29]</ref> proposed a new pooling mechanism, which leverages the intentions of neighboring pedestrians. This approach iteratively refines the states of all pedestrians using a mechanism known as "social-aware information selection". Peek Into The Future (PITF) <ref type="bibr">[28]</ref> incorporates visual features and proposes modules that take pedestrian-scene interactions into consideration. The Sophie framework <ref type="bibr">[35]</ref> is an LSTM-based generative adversarial network. It utilizes convolutional neural networks (CNNs) to extract scene features, followed by a dual attention mechanism. Subsequently, Sophie combines the attention outputs with the scene features.</p><p>Given that graph structures are able to explicitly represent the interactions of pedestrians, there has been a growing research interest in graph-based approaches <ref type="bibr">[55,</ref><ref type="bibr">56]</ref>. Graph attention networks (GATs) <ref type="bibr">[55]</ref> leverage the architecture of Bicycle-GAN and capture the pedestrian interactions with the help of the graph attention mechanism <ref type="bibr">[55]</ref>. Recursive Social Behavior Graphs <ref type="bibr">[56]</ref> utilize graph convolution networks, combined with additional social information from expert sociologists.</p><p>To directly utilize spatial and temporal information together, ref. 
<ref type="bibr">[57]</ref> proposed the Social Spatial-Temporal Graph Convolutional Neural Network (Social-STGCNN). Social-STGCNN models pedestrian trajectories as a spatial-temporal graph, where the spatial edges represent social interactions between pedestrians, weighted by their relative distances. However, the graph kernel function in Social-STGCNN is still based on some predefined rules, e.g., pedestrians with shorter distances have higher weights. By incorporating the spatial-temporal attention mechanism, we are able to move beyond such "predefined rules". Our proposed model learns to assign varying levels of importance to different neighbor nodes based on their features, while also taking the predefined rules into consideration.</p><p>In summary, while most previous works focus on modeling pedestrian interactions, the future trajectory of a pedestrian is usually uncertain and different pedestrians may exhibit distinct preferences regarding their behaviors. Therefore, it is not only crucial to model pedestrian interactions, but also imperative to explicitly quantify the inherent heterogeneity present within pedestrian trajectories. To bridge this gap, we introduce the InfoSTGCAN framework in this study. The proposed framework encapsulates pedestrian trajectories within a spatial-temporal graph, leveraging both convolutional and attention mechanisms across spatial and temporal dimensions. Furthermore, we model the behavioral patterns of each pedestrian as a latent distribution derived from their trajectories. This approach intuitively assigns unique latent codes to pedestrians, corresponding with distinct trajectory styles. Figure <ref type="figure">1</ref> illustrates this concept and Table <ref type="table">1</ref> summarizes the differences between the proposed framework and previous methods. 
</p><p>Convolutional graph neural networks Basically, there are two major types of convolutional graph neural networks: spectral-based approaches and spatial-based approaches. Spectral-based approaches develop convolution operations based on the graph Fourier transform, e.g., GCNs <ref type="bibr">[60]</ref> and ChebNet <ref type="bibr">[62]</ref>. Spatial-based approaches perform convolution directly on the edges, making them suitable for asymmetric adjacency matrices, e.g., GraphSAGE <ref type="bibr">[63]</ref> and DGCNN <ref type="bibr">[64]</ref>. To deal with spatial-temporal data, ST-GCN <ref type="bibr">[70]</ref> expands the spatial GCN into a spatial-temporal version for the task of skeleton-based action recognition, incorporating information from a localized spatial-temporal context.</p><p>Graph attention networks Attention has been widely used in a series of tasks, e.g., machine translation <ref type="bibr">[71]</ref>, entity resolution <ref type="bibr">[72]</ref>, and so on. Proposed by <ref type="bibr">[73]</ref>, graph attention networks bring attention mechanisms to graph neural networks, which calculate the relative weights between two connected nodes by the attention scores. Later, ref. <ref type="bibr">[74]</ref> proposed GeniePath, which controls the flow of information by some LSTM-style gating mechanisms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Contributions</head><p>Specifically, the main contributions of this paper are highlighted as follows:</p><p>1. We formulate the task of pedestrian trajectory prediction as a spatial-temporal graph and propose a novel trajectory prediction model, InfoSTGCAN. This model takes both pedestrian interactions and heterogeneous behavior choice modeling into consideration. Through a comprehensive list of experiments, we demonstrate the superiority of InfoSTGCAN in comparison to existing baseline methods.</p><p>2. Our proposed method integrates spatial-temporal graph convolution and spatial-temporal graph attention. This fusion enables our method to more effectively model pedestrian interactions by evaluating pedestrian importance using a combination of prior knowledge and data-driven features.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>3. Based on the technique of variational mutual information maximization, our model generates an individual-level latent code for each pedestrian. These distinct latent codes facilitate the generation of trajectories with heterogeneous behavior choices.</p><p>The remainder of this paper is organized as follows: Section 2 presents the preliminaries and major notations and defines the problem of human trajectory prediction. In Section 3, we explicate the proposed method, focusing on the spatial-temporal graph network, the variational mutual information maximization, and the multi-objective loss function. The proposed model is then evaluated in Section 4, and the design is validated through a process that includes performance comparison, results visualization, and ablation studies. Finally, we draw conclusions and discuss future research directions in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Problem Statement</head><p>Suppose there are N pedestrians within a scene. Given their past observed trajectories $tr^{1:N}_{obs}$ during a period of time $T_{obs}$, our objective is to predict their future trajectories $tr^{1:N}_{pred}$ over the forthcoming time period $T_{pred}$. In this study, we jointly predict the future trajectories of all pedestrians.</p><p>To clarify, we begin by defining the observed trajectory of a pedestrian n ($n \in \{1, \ldots, N\}$). Specifically, for pedestrian n, its observed trajectory $tr^{n}_{obs}$ can be formulated as follows:</p><p><formula notation="TeX">tr^{n}_{obs} = \{ (x^{n}_{t}, y^{n}_{t}) \mid t \in \{1, \ldots, T_{obs}\} \}</formula></p><p>where $(x^{n}_{t}, y^{n}_{t})$ is a pair of random variables that indicate the location distribution of the n-th pedestrian at time step t. Following a similar formulation in <ref type="bibr">[27,</ref><ref type="bibr">57]</ref>, the probability to predict an individual-level latent code for each pedestrian that possesses high mutual information with the future predicted trajectory. As a result, it enables the model to effectively capture the complexity and inherent heterogeneity of the pedestrian trajectories.</p><p>To summarize, the proposed approach is not only able to present a holistic view of pedestrian interactions but also able to capture the heterogeneity in pedestrian movements. Details regarding the optimization of the proposed method, including the multi-objective loss function, can be found in Section 3.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Spatial-Temporal Graph Convolutional Attention Network</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.1.">Spatial-Temporal Graph Representation of Pedestrian Trajectories</head><p>To begin with, we need to construct the spatial-temporal graph representation from pedestrian trajectories. Essentially, a spatial-temporal graph is an attributed graph where the node attributes evolve through time <ref type="bibr">[59]</ref>. The key idea behind the spatial-temporal graph is its capacity to simultaneously account for both spatial and temporal dependencies.</p><p>To build a spatial-temporal graph, we commence by formulating a sequence of spatial graphs. For every step t, a spatial graph $G_t$ is constructed to represent the locations of pedestrians within a given scene. Each $G_t$ is composed of three parts: a set of vertices $V_t$, a set of edges $E_t$, and an adjacency matrix $A_t$, i.e., $G_t = (V_t, E_t, A_t)$.</p><p>Elaborating further, the vertex set is given by $V_t = \{ v^{n}_{t} \mid \forall n \in \{1, \ldots, N\} \}$, where each vertex $v^{n}_{t}$ represents a pedestrian. The edge set $E_t$ is composed of edges between two vertices, i.e., $E_t = \{ e^{mn}_{t} \mid \forall m, n \in \{1, \ldots, N\} \}$. Specifically, the edge $e^{mn}_{t}$ indicates whether the vertex $v^{m}_{t}$ and the vertex $v^{n}_{t}$ are connected: $e^{mn}_{t} = 1$ if they are connected, and $e^{mn}_{t} = 0$ otherwise. The adjacency matrix $A_t \in \mathbb{R}^{N \times N}$ is defined as follows:</p><p><formula notation="TeX">A_t = \begin{bmatrix} a^{11}_{t} &amp; a^{12}_{t} &amp; \cdots &amp; a^{1N}_{t} \\ a^{21}_{t} &amp; a^{22}_{t} &amp; \cdots &amp; a^{2N}_{t} \\ \vdots &amp; \vdots &amp; \ddots &amp; \vdots \\ a^{N1}_{t} &amp; a^{N2}_{t} &amp; \cdots &amp; a^{NN}_{t} \end{bmatrix}</formula></p><p>Each item $a^{mn}_{t}$ ($m, n \in \{1, \ldots, N\}$) models the influence between the vertex $v^{m}_{t}$ and the vertex $v^{n}_{t}$. As such, a spatial-temporal graph $G_{1:T}$ is finally constructed, consisting of a series of spatial graphs $G_{1:T} = \{G_1, \ldots, G_T\}$.</p></div>
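<div xmlns="http://www.tei-c.org/ns/1.0"><p>The construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the nested-list data layout, the fully connected edge set, and the constant placeholder influence function are all hypothetical choices made for the example.</p>
<p><![CDATA[
```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]
SpatialGraph = Tuple[List[Point], List[List[int]], List[List[float]]]


def build_spatial_graph(positions: List[Point],
                        influence: Callable[[Point, Point], float]) -> SpatialGraph:
    """Build G_t = (V_t, E_t, A_t) for a single time step."""
    count = len(positions)
    vertices = list(positions)
    # Assumption: every pair of pedestrians is connected, so each edge entry is 1.
    edges = [[1] * count for _ in range(count)]
    # a_t^{mn} is whatever the supplied influence function returns for (v_m, v_n).
    adjacency = [[influence(vertices[m], vertices[n]) for n in range(count)]
                 for m in range(count)]
    return vertices, edges, adjacency


def build_spatial_temporal_graph(trajectories: List[List[Point]],
                                 influence: Callable[[Point, Point], float]
                                 ) -> List[SpatialGraph]:
    """Build G_{1:T}; trajectories[n][t] is the location of pedestrian n at step t."""
    steps = len(trajectories[0])
    return [build_spatial_graph([track[t] for track in trajectories], influence)
            for t in range(steps)]


# Three pedestrians observed for two steps (made-up coordinates).
example_trajectories = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (1.0, 1.0)],
    [(2.0, 2.0), (2.0, 3.0)],
]
# Placeholder influence: every pair has weight 1.0.
graph = build_spatial_temporal_graph(example_trajectories, lambda a, b: 1.0)
```
]]></p>
<p>Passing the influence function as an argument keeps the graph construction separate from the choice of kernel, so a distance-based or learned weighting can be substituted without changing the rest of the sketch.</p></div>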
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.2.">Spatial-Temporal Graph Convolution</head><p>Based on the spatial-temporal graph representation developed in Section 3.1.1, we introduce the spatial-temporal graph convolution (ST-GC) operation in this subsection. Before diving into the complete technical details of ST-GC, we start with a more general version of the convolution operation, which is on a two-dimensional grid or feature map.</p><p>In deep learning, neural networks that employ convolution operations are referred to as "convolutional neural networks (CNNs)" <ref type="bibr">[49,</ref><ref type="bibr">53]</ref>. Typically, a CNN consists of multiple convolutional layers, and within each layer, multiple learnable filters or kernels are applied to the input feature map. These filters help to capture local patterns or features, enabling CNNs to learn hierarchical representations from the input data. Additionally, CNNs significantly benefit from the idea of "parameter sharing", which reduces the number of parameters compared to fully connected layers.</p><p>For example, suppose the kernel size is equal to k, feature (l) denotes the feature map at layer l, and feature (l+1) denotes the feature map at layer l + 1. Through the training process, convolution operations are able to learn to aggregate the information from the neighbors centering around each location in feature (l) . Therefore, the convolution operation on layer l can be summarized as:</p><p>Typically, an attention mechanism consists of three major components: a query, a key, and a value <ref type="bibr">[72]</ref>. Intuitively speaking, the query is a representation of the current item or context that the model is trying to process. It serves as a reference for determining essential elements within the input data. The key represents an item in the input sequence, which the model compares with the query to determine their similarity or relevance. 
The value represents the significant information associated with each element in the input data.</p><p>The attention mechanism computes a score for each key-query pair, typically using a dot product, scaled exponential, or some other similarity function <ref type="bibr">[71]</ref><ref type="bibr">[72]</ref><ref type="bibr">[73]</ref><ref type="bibr">79,</ref><ref type="bibr">80]</ref>. These scores are then passed through a softmax function and converted into a probability distribution, which represents "the attention weights". Intuitively speaking, when a weight between a query and a key is higher, the corresponding key-value pair is more important; thus, the attention mechanism pays more "attention" to this pair.</p><p>Finally, the attention mechanism computes a weighted sum of the values using the obtained attention weights, effectively selecting and aggregating the relevant information from different elements. This aggregated context vector is then used in subsequent layers of the model to make predictions. The process can be summarized into the following equation:</p><p><formula notation="TeX" n="8">\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^{\top}}{\sqrt{d_k}} \right) V</formula></p><p>where $Q$, $K$, and $V$ collect the queries, keys, and values, $d_k$ is the dimension of each query, and the term $1/\sqrt{d_k}$ can enhance the numerical stability of attention mechanisms.</p><p>In the context of GNNs, the previously mentioned three components, namely, a query, a key, and a value, are employed to learn meaningful representations of nodes, based on their local features and the structure of the underlying graph. As discussed in Section 3.1.1, each vertex represents a pedestrian, and $v^{n}_{t}$ represents pedestrian n at step t. We represent his or her corresponding query vector as $qry^{n}_{t} = f_{Qry}(v^{n}_{t})$, the key vector as $key^{n}_{t} = f_{Key}(v^{n}_{t})$, and the value vector as $val^{n}_{t} = f_{Val}(v^{n}_{t})$. As illustrated in Figure <ref type="figure">4</ref>, there are two major kinds of attention mechanisms in our framework: spatial attention and temporal attention. 
Both of them can be viewed as a way of message passing on a connected graph <ref type="bibr">[73,</ref><ref type="bibr">81]</ref>.</p><p>Spatial attention focuses more on message exchanges among nodes within the same time step. Intuitively speaking, it helps to generate feasible trajectories by aggregating information from other pedestrians. Suppose the message passed from node $v^{m}_{t}$ to $v^{n}_{t}$ is $msg^{m \to n}_{t}$, which is defined as:</p><p><formula notation="TeX">msg^{m \to n}_{t} = qry^{n}_{t} \cdot key^{m}_{t}</formula></p><p>and based on Equation (8), we may formulate spatial attention as:</p><p>On the other hand, temporal attention focuses more on the process of temporal message passing by aggregating information through the relevant time steps. Intuitively speaking, it helps to generate feasible trajectories by incorporating the temporally significant features of pedestrians.</p><p>Here, we define the temporal message of $v^{n}_{t}$ passed from $t'$ to $t$:</p><p>and based on Equation (<ref type="formula">8</ref>), we may formulate temporal attention as follows:</p><p>Figure 4. $msg^{m \to n}_{t}$ stands for the message passed from node $v^{m}_{t}$ to $v^{n}_{t}$, and $msg^{n}_{t' \to t}$ stands for the temporal message of $v^{n}_{t}$ passed from $t'$ to $t$. There are two primary types of attention mechanisms in our framework: spatial attention and temporal attention. (a) Spatial attention models the crowd as a graph and helps to predict a pedestrian's trajectory based on the movements of her or his neighboring pedestrians. (b) Temporal attention, on the other hand, focuses on each individual pedestrian and primarily assists in capturing her or his trajectory trends over time.</p><p>Difference Between ST-GC and ST-GAT Spatial-temporal graph convolution (ST-GC) and spatial-temporal graph attention (ST-GAT) are two commonly used techniques in GNNs. Both of them can learn meaningful representations of pedestrian trajectories through the spatial and temporal dimensions. Their major difference lies in how they aggregate information from neighboring nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>&#8226; In ST-GC, information from neighboring nodes is communicated by applying convolution filters or kernels on the graph, which typically involves a weighted sum of features across neighboring nodes. Usually, those weights can be identical (e.g., GraphSAGE <ref type="bibr">[63]</ref>), predetermined, or learnable (<ref type="bibr">[60,</ref><ref type="bibr">70]</ref>). Therefore, the weights are considered to be "explicitly" assigned to the neighborhoods of the focused node during the aggregation process <ref type="bibr">[59]</ref>.</p><p>&#8226; However, in ST-GAT, the weights between two connected nodes are considered to be "implicitly" computed. Specifically, those weights are learned based on the similarity of their feature representations, which takes into account the relative importance for different node pairs <ref type="bibr">[59,</ref><ref type="bibr">73]</ref>. Typically, more important nodes tend to have higher similarity scores, resulting in them being assigned larger weights.</p></div>
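<div xmlns="http://www.tei-c.org/ns/1.0"><p>The contrast between the two aggregation schemes can be illustrated for a single node and its neighbors. The sketch below is an assumption-laden toy in plain Python, not the proposed network: the function names are hypothetical, the fixed weights stand in for a predetermined ST-GC kernel, and the attention branch uses the scaled dot-product form described in Section 3.1.3 with hand-picked query, key, and value vectors.</p>
<p><![CDATA[
```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights: List[float], values: List[Vector]) -> Vector:
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def aggregate_fixed(weights: List[float], values: List[Vector]) -> Vector:
    """ST-GC style: neighbor weights are supplied explicitly (e.g., by a kernel)."""
    return weighted_sum(weights, values)


def aggregate_attention(query: Vector, keys: List[Vector],
                        values: List[Vector]) -> Tuple[List[float], Vector]:
    """ST-GAT style: weights come from scaled query-key similarity."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, key) * scale for key in keys])
    return weights, weighted_sum(weights, values)


# Made-up vectors: the first key is most aligned with the query.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_weights, context = aggregate_attention(query, keys, values)
fixed_context = aggregate_fixed([0.5, 0.5], [[2.0, 0.0], [0.0, 2.0]])
```
]]></p>
<p>In the attention branch the weights depend on the features themselves, so changing the query changes how much each neighbor contributes, whereas the fixed branch returns the same mixture regardless of the node features.</p></div>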
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Social Interaction Modeling</head><p>In this subsection, based on the discussed ST-GC and ST-GAT, we clarify how they model pedestrian social interactions.</p><p>As introduced in Section 3.1.1, the adjacency matrix $A_t$ can be considered as a representation of the graph edge attributes. In the spatial-temporal graph convolution part, we incorporate our prior knowledge about the social relations among different pedestrians into a kernel function, e.g., pedestrians closer in distance tend to be more important. The kernel function maps node attributes at $v^{n}_{t}$ and $v^{m}_{t}$ to the attribute value $a^{mn}_{t}$, which is defined as follows:</p><p>Equation (<ref type="formula">13</ref>) is consistent with the intuition that pedestrians are more likely to be influenced by each other if they are closer. Additionally, the kernel function is symmetric:</p><p><formula notation="TeX">a^{mn}_{t} = a^{nm}_{t}</formula></p><p>However, since the form of the kernel function is predetermined, and some pedestrian interactions are asymmetric, the interaction modeling in a purely spatial-temporal graph convolution-based model might be insufficient. Therefore, we integrate the spatial-temporal graph attention mechanism into our model. As discussed in Section 3.1.3, the relative importance of pedestrian m to pedestrian n is data-dependent, which is determined by the value of $msg^{m \to n}_{t} = qry^{n}_{t} \cdot key^{m}_{t}$. Therefore, there is no predetermined kernel in the spatial-temporal graph attention mechanism. Generally, the spatial-temporal graph attention mechanism is able to capture the asymmetric importance:</p><p><formula notation="TeX">msg^{m \to n}_{t} \neq msg^{n \to m}_{t}</formula></p><p>which stems from the fact that, in general, $qry^{m}_{t} \neq qry^{n}_{t}$ and $key^{m}_{t} \neq key^{n}_{t}$.</p></div>
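<div xmlns="http://www.tei-c.org/ns/1.0"><p>The symmetry argument can be checked numerically on a toy example. The sketch below rests on several assumptions that are not taken from the paper: the kernel is taken to be the inverse Euclidean distance (one common distance-based choice), raw 2D positions serve as node features, and the query and key maps are small hand-written linear maps named W_QRY and W_KEY. Only the message form, the dot product of the receiver's query with the sender's key, follows the text.</p>
<p><![CDATA[
```python
import math
from typing import List, Sequence

# Hypothetical linear maps standing in for f_Qry and f_Key.
W_QRY = [[1.0, 0.0], [0.0, 1.0]]
W_KEY = [[0.0, 1.0], [2.0, 0.0]]


def inverse_distance_kernel(pos_m: Sequence[float], pos_n: Sequence[float]) -> float:
    """Assumed kernel: closer pedestrians get larger weights; zero for identical points."""
    distance = math.dist(pos_m, pos_n)
    return 1.0 / distance if distance > 0 else 0.0


def linear(matrix: List[List[float]], vector: Sequence[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def message(pos_m: Sequence[float], pos_n: Sequence[float]) -> float:
    """Message from m to n: query of the receiver n dotted with key of the sender m."""
    qry_n = linear(W_QRY, pos_n)
    key_m = linear(W_KEY, pos_m)
    return sum(q * k for q, k in zip(qry_n, key_m))


pedestrian_m = (1.0, 2.0)
pedestrian_n = (3.0, 1.0)

kernel_mn = inverse_distance_kernel(pedestrian_m, pedestrian_n)
kernel_nm = inverse_distance_kernel(pedestrian_n, pedestrian_m)
msg_m_to_n = message(pedestrian_m, pedestrian_n)
msg_n_to_m = message(pedestrian_n, pedestrian_m)
```
]]></p>
<p>With these made-up values the kernel returns the same weight in both directions, while the two messages differ, which is the asymmetry that the attention mechanism is meant to capture.</p></div>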
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Variational Mutual Information Maximization</head><p>In real-world scenarios, when a pedestrian encounters other pedestrians, their reactions can vary from person to person. This variance can be influenced by factors like age, with different age groups having distinct walking behaviors <ref type="bibr">[24]</ref>. Furthermore, an individual's walking preference may change notably depending on whether he or she is walking alone or in a group <ref type="bibr">[22,</ref><ref type="bibr">25]</ref>. Although a few frameworks have been employed to produce such diverse trajectory styles (e.g., the variety loss <ref type="bibr">[34]</ref>), there is still a need to understand how to effectively capture the intrinsic heterogeneity within the pedestrian trajectories.</p><p>In this subsection, to address this issue, we introduce the technique of variational mutual information maximization. We begin by considering the principles of information theory <ref type="bibr">[82]</ref><ref type="bibr">[83]</ref><ref type="bibr">[84]</ref><ref type="bibr">[85]</ref>. Suppose X and Y are random variables. If we want to measure the "amount of information" learned about Y by providing the knowledge of X or vice versa, mutual information $I(X; Y)$ is utilized. The mutual information $I(X; Y)$ can be expressed as the difference between the self-entropy of Y and the conditional entropy of Y given X:</p><p><formula notation="TeX">I(X; Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y)</formula></p><p>where $H(X)$ denotes the self-entropy of X, and $H(Y)$ denotes the self-entropy of Y. $H(X \mid Y)$ denotes the conditional entropy of X given Y, and $H(Y \mid X)$ denotes the conditional entropy of Y given X.</p><p>The conditional mutual information is defined as below:</p><p><formula notation="TeX">I(X; Y \mid Z) = H(X \mid Z) - H(X \mid Y, Z)</formula></p><p>Intuitively speaking, $I(X; Y \mid Z)$ can be treated as how much uncertainty is reduced in X when Y is observed, given Z. If X and Y are independent, the knowledge of X does not provide any information about Y and vice versa. 
As a result, the mutual information between X and Y is zero. However, given Z, if the knowledge of X provides extensive information about Y, then $I(X; Y \mid Z)$ can be extremely high.</p><p>This interpretation helps to formulate the idea: given the past trajectory $tr^{n}_{obs}$, in order to learn meaningful representations for the future pedestrian trajectory, there should be high conditional mutual information between the latent code $c^{n}$ and the generated trajectory $G(tr^{n}_{obs}, c^{n})$. In other words, $I(c^{n}; G(tr^{n}_{obs}, c^{n}) \mid tr^{n}_{obs})$ should be high. As such, we propose an information-theoretic loss:</p><p>Based on Equations (<ref type="formula">19</ref>) and (<ref type="formula">20</ref>), we are able to derive the following equation:</p><p>However, in practice, directly maximizing the mutual information $I(c^{n}; G(tr^{n}_{obs}, c^{n}) \mid tr^{n}_{obs})$ is extremely challenging, as it requires the true, unknown posterior $P(c^{n} \mid tr^{n}_{obs}, tr^{n}_{pred})$. Therefore, we utilize a common technique in statistics and machine learning to address this problem, i.e., variational inference [85-88]. 
By defining an approximate posterior $Q_{\phi}(c^{n} \mid tr^{n}_{obs}, tr^{n}_{pred})$ over the original unknown posterior $P(c^{n} \mid tr^{n}_{obs}, tr^{n}_{pred})$, we are able to construct a lower bound over the original quantity $-H(c^{n} \mid G(tr^{n}_{obs}, c^{n}), tr^{n}_{obs})$:</p><p><formula notation="TeX" n="22">\begin{aligned} -H(c^{n} \mid G, tr^{n}_{obs}) &amp;= \mathbb{E}_{tr^{n}_{pred} \sim G}\, \mathbb{E}_{c' \sim P(c \mid tr^{n}_{obs}, tr^{n}_{pred})} \left[ \log P(c' \mid tr^{n}_{obs}, tr^{n}_{pred}) \right] \\ &amp;= \mathbb{E}_{tr^{n}_{pred} \sim G} \left[ D_{KL}\left( P(\cdot \mid tr^{n}_{obs}, tr^{n}_{pred}) \,\|\, Q_{\phi}(\cdot \mid tr^{n}_{obs}, tr^{n}_{pred}) \right) + \mathbb{E}_{c' \sim P(c \mid tr^{n}_{obs}, tr^{n}_{pred})} \log Q_{\phi}(c' \mid tr^{n}_{obs}, tr^{n}_{pred}) \right] \\ &amp;\geq \mathbb{E}_{tr^{n}_{pred} \sim G}\, \mathbb{E}_{c' \sim P(c \mid tr^{n}_{obs}, tr^{n}_{pred})} \left[ \log Q_{\phi}(c' \mid tr^{n}_{obs}, tr^{n}_{pred}) \right] \end{aligned}</formula></p><p>where G is the abbreviation for $G(tr^{n}_{obs}, c^{n})$, $D_{KL}(\cdot \,\|\, \cdot)$ stands for the Kullback-Leibler (KL) divergence, and the last step holds true because the KL divergence is always non-negative <ref type="bibr">[82,</ref><ref type="bibr">84]</ref>. Therefore, we may construct a lower bound $L_I$ over the original objective Equation (<ref type="formula">21</ref>):</p><p>As the approximate posterior $Q_{\phi}(c^{n} \mid tr^{n}_{obs}, tr^{n}_{pred})$ approaches the true posterior distribution $P(c^{n} \mid tr^{n}_{obs}, tr^{n}_{pred})$, $D_{KL}(P(\cdot \mid tr^{n}_{obs}, tr^{n}_{pred}) \,\|\, Q_{\phi}(\cdot \mid tr^{n}_{obs}, tr^{n}_{pred}))$ approaches zero. Therefore, the lower bound $L_I$ approaches the mutual information $I(c^{n}; G(tr^{n}_{obs}, c^{n}) \mid tr^{n}_{obs})$ and becomes tighter. It is worth mentioning that we also optimize the conditional entropy of the latent code $H(c^{n} \mid tr^{n}_{obs})$, so that the latent variable distribution and the predictor are learned simultaneously.</p><p>To summarize, the final objective of the variational mutual information maximization part can be written as:</p><p>where $P_{\theta}(c^{n} \mid tr^{n}_{obs})$ is the conditional prior distribution for the latent code $c^{n}$. The primary differences between Equation (<ref type="formula">20</ref>) and the mutual information-inspired objective in <ref type="bibr">[85]</ref> are:</p><p>1. In ref. <ref type="bibr">[85]</ref>, there is only one latent code for each training example. However, in this paper, there are multiple latent codes for each training example. Different pedestrians may have distinct preferences and walking styles. 
It is generally unrealistic to assume that all pedestrians share the same preference or walking style. Therefore, each pedestrian n has their own latent code c^n, and different pedestrians generally have different latent codes, allowing the proposed framework to effectively model the latent patterns in pedestrian trajectories.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.</head><p>In this paper, the proposed information-theoretic loss is based on the conditional mutual information. However, in ref. <ref type="bibr">[85]</ref>, the loss is based on the mutual information.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.</head><p>Unlike <ref type="bibr">[85]</ref>, where the prior latent code distribution is assumed to be fixed, we also optimize the prior distribution P_&#952;(c^n | tr^n_obs).</p></div>
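<div xmlns="http://www.tei-c.org/ns/1.0"><p>The variational argument of Section 3.2 can be checked numerically on a toy example. The sketch below is illustrative only: it uses two small categorical distributions over three latent codes as stand-ins for the true posterior P and the approximate posterior Q_&#966;. The values of p_true and q_approx and the helper names are hypothetical and are not taken from the paper.</p>
<p>
```python
import numpy as np


def expected_log_prob(p, q):
    """Return E_{c ~ p}[log q(c)] for categorical distributions p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(q)))


def kl_divergence(p, q):
    """Return D_KL(p || q) for categorical distributions with full support."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p) - np.log(q))))


# Hypothetical true posterior P(c | tr_obs, tr_pred) over three latent codes.
p_true = np.array([0.7, 0.2, 0.1])
# Hypothetical approximate posterior Q_phi(c | tr_obs, tr_pred).
q_approx = np.array([0.5, 0.3, 0.2])

neg_entropy = expected_log_prob(p_true, p_true)    # E_P[log P], i.e. -H
lower_bound = expected_log_prob(p_true, q_approx)  # E_P[log Q]
gap = kl_divergence(p_true, q_approx)              # D_KL(P || Q)

# The identity E_P[log P] = D_KL(P || Q) + E_P[log Q] holds exactly,
# so both printed numbers agree up to floating-point error.
print(neg_entropy, lower_bound + gap)
```
</p>
<p>Because the KL term is non-negative, lower_bound can never exceed neg_entropy, and the gap shrinks to zero as q_approx approaches p_true. This mirrors the tightening of the bound described above.</p></div>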
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Multi-Objective Loss Function</head><p>To optimize the proposed method, we utilize the multi-objective loss defined below:</p><p>where L_pred denotes the prediction loss, L_GAN denotes the generative adversarial loss, and L_info denotes the information loss. The weights &#955;_1, &#955;_2 and &#955;_3 control the relative importance of each loss.</p><p>&#8226; L_pred: The prediction loss is the negative log-likelihood, which is defined as:</p><p>where c^n &#8764; P_&#952;(c^n | tr^n_obs). Intuitively, as L_pred decreases, the log-likelihood of tr^n_pred increases. The model G(tr^n_obs, c^n) and the conditional prior distribution P_&#952;(c^n | tr^n_obs) are together encouraged to accurately predict the ground-truth future trajectory.</p><p>&#8226; L_GAN: The generative adversarial loss relies on the generator G and the discriminator D, which are jointly trained. The generator G captures the distribution of the future trajectory, and the discriminator D distinguishes whether a sample comes from the training data or from the generator G.</p><p>&#8226; L_info: The information-theoretic loss relies on the conditional prior distribution P_&#952;(c^n | tr^n_obs), the model G(tr^n_obs, c^n), and the approximate posterior Q_&#966;(c^n | tr^n_obs, tr^n_pred), as discussed in Section 3.2.</p><p>UNIV. These datasets consist of real-world pedestrian trajectories with complex human interactions. Specifically, the datasets cover many challenging pedestrian behaviors, such as pedestrians crossing each other, walking together, avoiding collisions, and groups assembling and disbanding <ref type="bibr">[91]</ref>. Following the strategy used in previous studies <ref type="bibr">[27,</ref><ref type="bibr">34]</ref>, all trajectories are sampled every 0.4 s.</p></div>
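<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the prediction loss L_pred of Section 3.3 more concrete, the sketch below evaluates the negative log-likelihood of observed positions under a bivariate Gaussian, the output distribution named in the abstract. It is a minimal illustration under stated assumptions: the parameter names (mu_x, mu_y, sigma_x, sigma_y, rho), the averaging over time steps, and the toy numbers are hypothetical and do not reproduce the authors' implementation.</p>
<p>
```python
import numpy as np


def bivariate_gaussian_nll(x, y, mu_x, mu_y, sigma_x, sigma_y, rho):
    """Negative log-likelihood of points (x, y) under a bivariate Gaussian.

    sigma_x and sigma_y are positive standard deviations and rho is the
    correlation coefficient, strictly between -1 and 1.
    """
    zx = (x - mu_x) / sigma_x
    zy = (y - mu_y) / sigma_y
    one_minus_rho2 = 1.0 - rho ** 2
    # Quadratic form of the standardized residuals.
    quad = (zx ** 2 - 2.0 * rho * zx * zy + zy ** 2) / one_minus_rho2
    # Log of the normalizing constant of the density.
    log_norm = np.log(2.0 * np.pi * sigma_x * sigma_y * np.sqrt(one_minus_rho2))
    return 0.5 * quad + log_norm


# Toy ground-truth positions over three future time steps (hypothetical).
true_x = np.array([1.0, 2.0, 3.0])
true_y = np.array([0.0, 0.1, 0.2])
# Toy predicted Gaussian parameters for the same steps (hypothetical).
loss = bivariate_gaussian_nll(
    true_x, true_y,
    mu_x=np.array([1.1, 2.0, 2.8]), mu_y=np.array([0.0, 0.0, 0.3]),
    sigma_x=0.5, sigma_y=0.5, rho=0.2,
).mean()
print(loss)
```
</p>
<p>Lower values of loss mean that the predicted distribution places more probability mass on the observed future positions, which is the sense in which decreasing L_pred increases the log-likelihood of tr^n_pred.</p></div>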
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Evaluation Metrics</head><p>In accordance with prior work <ref type="bibr">[27,</ref><ref type="bibr">34,</ref><ref type="bibr">57,</ref><ref type="bibr">93]</ref>, we choose to use the following evaluation metrics:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#8226;</head><p>Average Displacement Error (ADE): The average L_2 distance between the predicted trajectory and the ground-truth trajectory across all time steps, which is defined as follows:</p><p>where (x^n_t, y^n_t) are the real locations, and (x&#770;^n_t, &#375;^n_t) are the predicted locations.</p><p>&#8226; Final Displacement Error (FDE): The L_2 distance between the predicted final destination and the true final destination at the end of the prediction period T_pred, which is defined as follows:</p><p>Intuitively, the two metrics serve different purposes. ADE evaluates the average prediction error across the whole trajectory, whereas FDE focuses solely on the prediction error at the destination.</p></div>
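<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two metrics described above can be sketched in a few lines of NumPy. The function name displacement_errors and the array layout (pedestrians, time steps, coordinates) are assumptions made for illustration. The sketch scores a single predicted trajectory per pedestrian and averages over all pedestrians, which may differ from the sampling protocol used in the experiments.</p>
<p>
```python
import numpy as np


def displacement_errors(pred, truth):
    """Return (ADE, FDE) for arrays of shape (N, T_pred, 2)."""
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # L2 distance at every pedestrian and time step: shape (N, T_pred).
    dist = np.linalg.norm(pred - truth, axis=-1)
    ade = float(dist.mean())         # average over pedestrians and steps
    fde = float(dist[:, -1].mean())  # average over pedestrians, last step only
    return ade, fde


# One pedestrian, three predicted steps (toy values).
truth = np.array([[[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]])
pred = np.array([[[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]])
ade, fde = displacement_errors(pred, truth)
print(ade, fde)  # 1.0 2.0
```
</p>
<p>In this toy case the per-step errors are 0, 1 and 2, so the ADE is their mean (1.0) while the FDE is the error at the final step (2.0). The example shows how a prediction that drifts over time is penalized more heavily by FDE than by ADE.</p></div>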
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Implementation Details</head><p>In this section, we provide important details for implementing the proposed model. To facilitate the learning process <ref type="bibr">[60,</ref><ref type="bibr">70]</ref>, we normalize the adjacency matrix A_t at each time step t as follows:</p><p>where I is an identity matrix, which serves to add self-connections to all nodes, and &#923;_t is the diagonal node degree matrix of (A_t + I). We use A to denote the stack of all adjacency matrices A_1 + I, . . . , A_T + I, and &#923; to denote the stack of matrices &#923;_1, . . . , &#923;_T. Let V^(l) denote the vertex values at layer l, stacked across all time steps 1, . . . , T. We can now employ the matrices defined above to implement the ST-GC layers:</p><p>where W^(l) represents the learnable parameters at the l-th layer. Equation <ref type="bibr">(38)</ref> follows similar ideas to those in <ref type="bibr">[57,</ref><ref type="bibr">70]</ref>.</p></div>
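<div xmlns="http://www.tei-c.org/ns/1.0"><p>The normalization step described above can be illustrated for a single time step. The sketch assumes the symmetric form commonly used in graph convolutional networks, in which (A_t + I) is multiplied on both sides by the inverse square root of its degree matrix. The exact expression is not recoverable from the text here, so this form, the function names, and the toy matrices are assumptions. Entries of the adjacency matrix are assumed to be non-negative, and the activation function is omitted.</p>
<p>
```python
import numpy as np


def normalize_adjacency(adj):
    """Symmetrically normalize (adj + I) by its diagonal degree matrix."""
    a_hat = adj + np.eye(adj.shape[0])   # add self-connections
    degree = a_hat.sum(axis=1)           # diagonal entries of Lambda_t
    d_inv_sqrt = 1.0 / np.sqrt(degree)   # positive thanks to self-connections
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def graph_conv(vertices, adj, weight):
    """One graph-convolution step: normalized adjacency times V times W."""
    return normalize_adjacency(adj) @ vertices @ weight


# Three pedestrians with toy interaction weights (hypothetical values).
adj = np.array([[0.0, 1.0, 0.5],
                [1.0, 0.0, 0.0],
                [0.5, 0.0, 0.0]])
vertices = np.ones((3, 2))        # two features per pedestrian
weight = np.full((2, 4), 0.1)     # stand-in for learnable parameters W
out = graph_conv(vertices, adj, weight)
print(out.shape)  # (3, 4)
```
</p>
<p>Adding the identity before computing degrees guarantees that every node has a non-zero degree, so the inverse square root is always defined even for an isolated pedestrian.</p></div>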
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Architecture and Training Setup</head><p>The proposed model consists of a series of ST-GC and ST-GAT layers, which extract spatial-temporal node embeddings from the input data. These node embeddings are then concatenated with the latent codes and passed through several convolutional layers, so that the output time dimension matches the length of the prediction horizon T_pred.</p><p>Unless noted otherwise, we use PReLU <ref type="bibr">[94]</ref> as the activation function throughout our model. During training, we used a batch size of 128 and Stochastic Gradient Descent (SGD) as the default optimizer. The initial learning rate was 0.01, and it was decreased using exponential scheduling with a decay factor of 0.97. To prevent overfitting to the training data, we applied dropout to the features with a probability of 0.5.</p></div>
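<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a small worked example of the exponential learning-rate schedule mentioned above, the sketch below computes the learning rate after a given number of decay steps. It assumes that the decay factor is applied once per epoch, which the text does not state. The function name exponential_lr is hypothetical.</p>
<p>
```python
def exponential_lr(initial_lr, decay, epoch):
    """Learning rate after `epoch` applications of the decay factor."""
    return initial_lr * decay ** epoch


# Initial rate 0.01 and decay factor 0.97, as reported in the training setup.
schedule = [exponential_lr(0.01, 0.97, epoch) for epoch in range(3)]
print(schedule)
```
</p>
<p>Under this assumption the rate falls from 0.01 to about 0.0097 after one epoch and to about 0.0094 after two, so the decay is gentle and the rate roughly halves only after about 23 epochs.</p></div>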
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Results Analysis</head><p>In this subsection, we begin by comparing our results with baseline models. Subsequently, we provide a comprehensive qualitative analysis of how our proposed method models pedestrian interactions and takes heterogeneous behavior choices into account. We illustrate cases where InfoSTGCAN successfully predicts collision-free trajectories for scenarios such as pedestrians walking in the same direction, approaching from opposing directions, or merging at angles. Moreover, our model is able to generate socially acceptable trajectories based on the predicted personalized latent codes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1.">Comparison with Baseline Models</head><p>Baselines. We compare the proposed method with the following baselines:</p><p>1.</p><p>Linear: A linear regression model fitted by minimizing the least-squares error.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.</head><p>Social LSTM (S-LSTM) <ref type="bibr">[27]</ref>: An LSTM approach that incorporates the "social pooling" mechanism for hidden states.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.</head><p>S-GAN-Pooling <ref type="bibr">[34]</ref>: A GAN-based approach that utilizes global pooling for pedestrian interactions.</p><p>4. SR-LSTM-2 <ref type="bibr">[29]</ref>: An LSTM-based method that leverages a state refinement technique.</p><p>5. GAT <ref type="bibr">[55]</ref>: A graph attention network leveraging the sequence-to-sequence architecture.</p><p>6. Sophie <ref type="bibr">[35]</ref>: A GAN-based method that takes both scene and social factors into account through a dual attention mechanism.</p><p>7. SCAN <ref type="bibr">[58]</ref>: An LSTM-based encoder-decoder framework that incorporates a novel spatial attention mechanism to predict trajectories for all pedestrians.</p><p>8. Social-STGCNN <ref type="bibr">[57]</ref>: A spatial-temporal graph-based approach that employs a spatial-temporal graph convolutional network to handle complex social interactions.</p><p>The performance of the proposed method is evaluated against the baseline models on the ADE/FDE metrics, as presented in Table <ref type="table">2</ref>. In general, our method outperforms all baseline methods on the two metrics. Our proposed model achieves an error of 0.62 on the average FDE metric, roughly a 17% reduction relative to the previous best performance (0.75). For the average ADE metric, the proposed model is better than the previous best performance by 5%. Interestingly, although our model does not need the vision signal containing scene context information, it can still outperform methods that utilize such information, such as SR-LSTM and Sophie.</p><p>Table 3. The ablation study on &#955;. Multiple different values are tested to show the model performance on the HOTEL dataset.</p><p>These findings validate the importance of the GAN loss component and the significance of maintaining a balanced weight between the prediction loss L_pred and the information loss L_info for optimal performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions and Future Research</head><p>In this paper, we formulate the task of pedestrian trajectory prediction as a spatial-temporal graph problem and develop a novel pedestrian trajectory prediction model, InfoSTGCAN. The proposed model takes into account both pedestrian interactions and heterogeneous behavior choices. Specifically, to better model pedestrian interactions, our proposed model consists of two parts, spatial-temporal graph convolution and spatial-temporal graph attention, enabling the analysis of interactions through a combination of prior knowledge and data-driven methods. To address the heterogeneity within the pedestrian behavior choices, we utilize the variational mutual information maximization technique, which is primarily composed of a conditional prior distribution and an approximate posterior distribution.</p><p>The proposed method outperforms baseline models across several publicly accessible datasets. Visualization of the generated trajectories reveals our method's capacity to handle various scenarios, including pedestrians going straight from different directions or making a right turn first and then going straight. We also conduct a qualitative analysis of the proposed method in different situations, such as collision avoidance, parallel walking, and pedestrians merging. In these situations, InfoSTGCAN tends to generate realistic collision-free trajectories. Additionally, we show that our framework is able to generate satisfactory trajectories through learning a personalized pedestrian-level latent code.</p><p>Nevertheless, we identify several promising directions for future work. First, metrics related to probabilistic trajectory prediction beyond the standard ADE/FDE could be explored for training and evaluation, e.g., Mahalanobis distance <ref type="bibr">[95]</ref>. 
Secondly, our methodology currently models pedestrian social interactions through ST-GC and ST-GAT; an exciting direction is to integrate more socially aware or physics-based methods <ref type="bibr">[96]</ref>. Thirdly, one could pursue an integrative approach that incorporates heuristic optimization <ref type="bibr">[97]</ref>, causal inference <ref type="bibr">[22,</ref><ref type="bibr">23,</ref><ref type="bibr">26]</ref> or clustering techniques <ref type="bibr">[98]</ref>.</p></div></body>
		</text>
</TEI>
