<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Language2Pose: Natural Language Grounded Pose Forecasting</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>09/16/2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10125510</idno>
					<idno type="doi"></idno>
					<title level='j'>3DV</title>
<idno>0219-6921</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Chaitanya Ahuja</author><author>Louis-Philippe Morency</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Figure 1: Overview of our model which uses joint multimodal space of language and pose to generate an animation conditioned on the input sentence.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Generating animations from natural language descriptions is a first step for movie script visualization <ref type="bibr">[11,</ref><ref type="bibr">20]</ref>, where individual animations can later be stitched together while maintaining coreferences in the story-line <ref type="bibr">[38]</ref>. These language grounded animations can also be useful in applications such as virtual human animation <ref type="bibr">[30,</ref><ref type="bibr">7,</ref><ref type="bibr">6]</ref> and robot motion and task planning <ref type="bibr">[16,</ref><ref type="bibr">2]</ref>.</p><p>An animation consists of a sequence of poses, where each pose can be represented by the positions of different joints in the body, such as the Root (base of the spine), head, shoulders, wrists and knees.</p><p>Figure <ref type="figure">2</ref>: Overview of our proposed model Joint Language-to-Pose (or JL2P). Language and pose are mapped to a joint embedding space Z, which can then be used by a trained pose decoder q d to generate a pose sequence. At train time both p e and q e are used to create the joint embedding using a training curriculum. At inference time the sentence is encoded by p e into z &#8712; Z and decoded by q d , giving us a model that can generate an animation (or sequence of poses) from a free-form description.</p><p>Pose forecasting conditioned on natural language has 3 major challenges. First, pose and natural language are very different modalities. The model needs a joint space where both natural language sentences and poses can be mapped, and it should also be able to decode animations from this embedding space. Second, different words of a sentence represent different qualities of the animation: verbs and adverbs describe the action and the speed/acceleration of the action, while nouns and adjectives describe locations and directions respectively. The model has to map these concepts to short pose sequences and then stitch them together to render convincing animations. 
Third, we want to see if objective metrics correlate with subjective metrics for this task, as our models are trained using objective distance metrics, but the quality of generated animations can only be judged by humans.</p><p>In this paper, our two main contributions tackle the modeling challenges of pose and natural language. First, we propose a model, Joint Language-to-Pose (or JL2P), that learns a joint embedding space of these two modalities. Second, we use a training curriculum that lets the model focus on shorter and easier sequences first and longer and harder sequences later. Additionally, to make the training regimen robust to outliers in the dataset, we use Smooth L1 as the distance metric in our loss function. Through multiple objective and subjective experiments, we show that our model can generate more accurate and natural animations from natural language sentences than other data-driven models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Pose Forecasting: Data driven human pose forecasting attempts to understand the behaviour of the subject from its history of poses and generates the next sequence of poses. Short-term predictions <ref type="bibr">[24]</ref> focus on modeling joint angles corresponding to hands, legs, head and torso. Long-term predictions <ref type="bibr">[10,</ref><ref type="bibr">31,</ref><ref type="bibr">24]</ref> additionally model the position of the human character to generate animations like walking, running, jumping and crawling.</p><p>While some works use different actions (such as running, kicking, and more) as conditioning variables to generate the future pose <ref type="bibr">[31,</ref><ref type="bibr">18]</ref>, others rely solely on the history of poses to predict what kind of motion will follow <ref type="bibr">[8]</ref>. Pose forecasting for locomotion is a more commonly researched topic, where models decide where and when to run/walk based on low-level control parameters such as trajectory and terrain <ref type="bibr">[13]</ref>. Task-based locomotion (such as writing on a whiteboard, moving a box, or sitting on a box) adds the nuances of transitioning from one task to another, but pose generation is based on task-specific footstep plans that act as motion templates <ref type="bibr">[1]</ref>.</p><p>All these approaches are either action specific, or require a set of low-level control parameters to forecast the future pose. In this work, we aim to replace low-level control parameters with high-level control parameters (e.g. natural language) to control actions and their speed and direction for the generated pose.</p><p>Image or speech conditioned pose forecasting: Images with a human can act as context to forecast what comes next. Chao et al. <ref type="bibr">[5]</ref> use one image frame to predict the next few poses. 
These generated poses can then be used to aid the generation of a video <ref type="bibr">[35]</ref> or a sequence of images <ref type="bibr">[19]</ref>. An image, as a high-level control parameter, carries action information for pose generation, but it does not provide fine-grained control over the speed and acceleration of the motion trajectory.</p><p>Speech can also be used to control animations of virtual characters. Taylor et al. <ref type="bibr">[32]</ref> use a data driven approach to model facial animation, while upper body pose forecasting conditioned on speech inputs has been tackled by Takeuchi et al. <ref type="bibr">[30]</ref>. But these pose sequences model the non-verbal behaviours (such as head nods, pose switches, hand waving and so on) of the character and do not offer fine-grained control over the character's next movements.</p><p>Language conditioned pose forecasting: Natural language sentences consist of verbs describing the actions, adverbs describing the speed/acceleration of the action, and nouns with adjectives describing the direction or target. This information can provide more fine-grained control over pose generation than images or speech.</p><p>Statistical models <ref type="bibr">[29,</ref><ref type="bibr">28]</ref>, which use bigram models for natural language, have been trained to encode motion sequences from sentences. Ahn et al. <ref type="bibr">[2]</ref> use around 2100 hours of YouTube videos with annotated text descriptions to train a pose generation model. Pose sequences extracted from videos have limited translation and occluded lower bodies, hence their model only predicts the upper body with a static Root joint. Some works use 3D motion capture data instead <ref type="bibr">[26,</ref><ref type="bibr">34]</ref>.</p><p>Human motions generally involve translation of the Root joint, hence forecasting the trajectory is important to get natural looking animations. Lin et
al. <ref type="bibr">[17]</ref> generate poses for all the joints of the body by pretraining a pose2pose autoencoder model before mapping language embeddings onto the learned pose space. But the embedding space is not learned jointly <ref type="bibr">[23]</ref>, which may limit the generative power of the pose decoder. In contrast, our proposed approach learns a joint embedding space of language and pose using a curriculum learning training regime.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Problem Statement</head><p>As an example, consider a natural language sentence which describes a human's motion: "A person walks in a circle". The goal of this cross-modal language-to-pose translation task is to generate an animation representing the sentence; i.e. an animation that shows a person following the trajectory of a circle with a walking motion (see figure <ref type="figure">1</ref>).</p><p>Formally, given a sentence represented by an N-sized sequence of word vectors X 1:N = [x 1 , x 2 , . . . , x N ], the goal is to generate a T-sized sequence of poses Y 1:T = [y 1 , y 2 , . . . , y T ] that are coherent with the semantics of the sentence. x i &#8712; R K is the i th word vector with dimension K. y t &#8712; R J&#215;3 is the pose matrix at time t. Rows of y t represent joints of the skeleton and columns are the xyz-coordinates of each joint. Tensors X and Y are elements of sets X and Y respectively.</p><p>Modeling language-to-pose is done by training a model f : R K&#215;N &#8594; R J&#215;3&#215;T to predict a pose sequence &#374; 1:T = f (X 1:N ; &#920;), where &#920; are the trainable parameters of the model f.</p></div>
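As a minimal sketch of the tensor contract above, the following toy code fixes hypothetical dimensions (K, N, J, T are illustrative, not the paper's values) and shows the input/output shapes the model f must respect:

```python
import numpy as np

# Hypothetical dimensions for illustration only (not taken from the paper):
K, N = 300, 8    # word-vector size, number of words in the sentence
J, T = 21, 32    # number of joints, number of predicted time steps

def f(X):
    """Stand-in for the learned model f : R^{KxN} -> R^{Jx3xT}.
    A real model has trainable parameters Theta; here we only
    demonstrate the tensor shapes from the problem statement."""
    assert X.shape == (K, N)
    return np.zeros((J, 3, T))   # Y_hat: joints x xyz-coordinates x time

X = np.random.randn(K, N)        # a sentence as a K x N matrix of word vectors
Y_hat = f(X)
assert Y_hat.shape == (J, 3, T)
```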
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Joint Language-to-Pose</head><p>Language-to-pose models should be able to grasp nuanced concepts like the speed and direction of motion and the kind of action from the language, and translate them into pose sequences (or animations). This requires the model to learn a multimodal joint space of language and pose. In doing so, it should also be able to generate sequences that humans deem correlated with the sentence.</p><p>To achieve this objective, we propose the Joint Language-to-Pose (or JL2P) model to learn the joint embedding space. Given an input sentence, an animation can be sampled from this model at the inference stage.</p><p>In this section, a joint embedding space of language and pose is formalized. This is followed by an algorithm to train the joint embedding space and a discussion of the practical edge cases at inference time for our Joint Language-to-Pose model.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Joint Embedding Space for Language and Pose</head><p>To learn a joint embedding space of language and pose, the sentence X 1:N and pose Y 1:T are first mapped to a latent representation using a sentence encoder p e (X 1:N ; &#934; e ) and a pose encoder q e (Y 1:T ; &#936; e ) respectively. These estimate the latent representations (or embeddings) z x = p e (X 1:N ; &#934; e ) and z y = q e (Y 1:T ; &#936; e ) in the embedding space Z &#8834; R h . z x and z y should lie close to each other in Z as they represent the same concept. To ensure that they do lie close together, a joint translation loss is constructed (refer to Figure <ref type="figure">2</ref>) and trained end to end with a training curriculum.</p></div>
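To make the shared-space idea concrete, here is a toy sketch of the two encoders. The real p e and q e are recurrent networks (Section 5.2); the linear maps, feature sizes, and weight matrices below are hypothetical stand-ins whose only purpose is to show that both modalities land in the same R^h:

```python
import numpy as np

rng = np.random.default_rng(0)
h = 16                                   # embedding dimension h (hypothetical)

# Toy linear stand-ins for the sentence encoder p_e and pose encoder q_e;
# any map into the shared space Z subset of R^h has this interface.
W_lang = rng.standard_normal((h, 300))   # sentence features -> Z
W_pose = rng.standard_normal((h, 63))    # flattened pose features -> Z

def p_e(x_feat):
    return W_lang @ x_feat               # z_x = p_e(X; Phi_e)

def q_e(y_feat):
    return W_pose @ y_feat               # z_y = q_e(Y; Psi_e)

z_x = p_e(rng.standard_normal(300))
z_y = q_e(rng.standard_normal(63))
# Because both embeddings live in the same R^h, a loss can directly
# pull z_x and z_y toward each other for paired (sentence, pose) data.
assert z_x.shape == z_y.shape == (h,)
```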
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Joint Loss Function</head><p>Once we have the embedding z x or z y , a pose decoder q d (; &#936; d ) is used to generate an animation from the joint embedding space Z. The output of the pose decoder must now lie close to the pose sequence Y 1:T . Hence, using X 1:N as input and Y 1:T as output, the cross-modal translation loss is defined as L c = d(q d (p e (X 1:N ; &#934; e ); &#936; d ), Y 1:T ) (Equation 4), and using Y 1:T as both input and output, the uni-modal translation (or autoencoder) loss is defined as L u = d(q d (q e (Y 1:T ; &#936; e ); &#936; d ), Y 1:T ) (Equation 5), where d(x, y) is a function to calculate the distance between the predicted and ground-truth pose values. &#934; e , &#936; e and &#936; d are trainable parameters of the sentence encoder, pose encoder and pose decoder respectively.</p><p>Combining Equations 4 and 5, we get the joint translation loss L j = L c + L u (Equation 6). Jointly optimizing the loss L j pushes z x and z y closer together, improving generalizability, and additionally trains the pose decoder, which is useful for inference from the joint embedding space. As L j is a multivariate function in X 1:N and Y 1:T , coordinate descent <ref type="bibr">[33]</ref> is a natural choice for optimizing the loss function and is described in Algorithm 1.</p></div>
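The joint loss can be sketched directly from the definitions above. In this toy version, p_e, q_e and q_d are passed in as plain functions (the identity functions in the check are illustrative stand-ins, not the paper's networks), and d(x, y) is the Smooth L1 distance from Section 4.4:

```python
import numpy as np

def smooth_l1(a, b):
    """Distance d(x, y) used in the losses (Smooth L1, see Section 4.4)."""
    e = np.abs(a - b)
    return np.where(e < 1, 0.5 * e**2, e - 0.5).mean()

def joint_loss(X, Y, p_e, q_e, q_d):
    """L_j = L_c + L_u for one (sentence, pose) pair."""
    L_c = smooth_l1(q_d(p_e(X)), Y)   # cross-modal: language -> pose
    L_u = smooth_l1(q_d(q_e(Y)), Y)   # uni-modal: pose autoencoding
    return L_c + L_u

# Toy check with identity encoders/decoder and matching dimensions:
ident = lambda v: v
Y = np.ones(10)
X = np.ones(10) * 1.5
# L_u is 0 (perfect autoencoding); L_c reflects the 0.5 error per element,
# which falls in the quadratic region of Smooth L1: 0.5 * 0.5**2 = 0.125.
assert np.isclose(joint_loss(X, Y, ident, ident, ident), 0.125)
```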
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.">Training Curriculum</head><p>Cross modal pose forecasting can be a challenging task to train <ref type="bibr">[5]</ref>. Starting with simpler examples before moving on to tougher ones can be beneficial to the training process <ref type="bibr">[4,</ref><ref type="bibr">37,</ref><ref type="bibr">36]</ref>.</p><p>The curriculum design commonly used for pose forecasting <ref type="bibr">[5]</ref> is adapted for our joint model. We first optimize the model to predict 2 time steps conditioned on the complete sentence. This easy task helps the model learn very short pose sequences like leg motions for walking, hand motions for waving and torso motions for bending. Once the loss on the validation set starts increasing, we move on to the next stage in the curriculum, where the model is given twice the number of poses to predict. The complexity of the task is increased at every stage until the maximum number of time-steps (T) of prediction is reached. The complete training process is described in Algorithm 1 (Learning the language-pose joint embedding).</p></div>
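The horizon schedule described above can be sketched as a small helper. In the actual training loop a stage ends when the validation loss starts increasing; here we only compute the sequence of prediction horizons (start at 2 frames and double until T):

```python
def curriculum_stages(T):
    """Prediction horizons used by the training curriculum: start with
    2 time steps and double at every stage until the full sequence
    length T is reached."""
    t, stages = 2, []
    while t < T:
        stages.append(t)
        t *= 2
    stages.append(T)          # final stage always trains on the full length
    return stages

# e.g. for a 32-frame target sequence:
assert curriculum_stages(32) == [2, 4, 8, 16, 32]
```

Note that for a T that is not a power of two, the last doubling is simply clipped to T, e.g. `curriculum_stages(20)` yields `[2, 4, 8, 16, 20]`.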
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4.">Optimization</head><p>For the distance metric d(x, y) in Equations <ref type="formula">4</ref>, 5 and 6, the Smooth L1 loss (similar to the Huber loss <ref type="bibr">[15]</ref>) is used, which is defined as SmoothL1(x, y) = 0.5(x &#8722; y)&#178; if |x &#8722; y| &lt; 1, and |x &#8722; y| &#8722; 0.5 otherwise (7). In contrast, Lin et al. <ref type="bibr">[17]</ref> use the L2 loss for d(x, y). The L2 loss is more sensitive to outliers than the L1 loss, due to its gradient being linearly proportional to the error, while the L1 loss has a constant gradient of 1 or &#8722;1. But the L1 loss can become unstable when |x &#8722; y| &#8776; 0, due to gradients oscillating between 1 and &#8722;1. Smooth L1, on the other hand, is continuous and smooth near 0 (and more generally for all x, y &#8712; R), hence it is more stable than L1 as a loss function.</p></div>
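Equation 7 translates directly into a few lines of code; the two asserted values below trace its quadratic and linear regions:

```python
import numpy as np

def smooth_l1(x, y):
    """Equation 7, elementwise: quadratic for |x - y| < 1, linear beyond."""
    e = np.abs(np.asarray(x, float) - np.asarray(y, float))
    return np.where(e < 1, 0.5 * e**2, e - 0.5)

# Quadratic region (|x - y| < 1): behaves like scaled L2, smooth at 0,
# so gradients do not oscillate the way L1's constant +/-1 gradient does.
assert smooth_l1(0.5, 0.0) == 0.125      # 0.5 * 0.5**2
# Linear region: behaves like L1 shifted down by 0.5, so the gradient
# saturates at +/-1 and outliers do not dominate the loss as in L2.
assert smooth_l1(3.0, 0.0) == 2.5        # 3.0 - 0.5
```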
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>Joint language to pose modeling can be broken down into three core challenges: 1. Prediction Accuracy by Joint Space: How accurate is pose prediction from the joint embedding?</p><p>2. Human Judgment: Which of the generated animations is more representative of the input sentence? Does the subjective evaluation correlate with the results from the objective evaluations?</p><p>3. Modeling nuanced language concepts: Is the model able to map nuanced concepts such as speed, direction and action into the generated animations?</p><p>Experiments are designed to evaluate these challenges of language grounded pose forecasting.</p><p>In the following subsections, the dataset and its pre-processing are briefly described, followed by the evaluation metrics for both objective and subjective evaluations. Finally, the design choices of the encoder and decoder models are described, which are used to construct the baselines in the final subsection.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Dataset</head><p>Our models are trained and evaluated on the KIT Motion-Language Dataset <ref type="bibr">[25]</ref>, which pairs human motion with natural language descriptions. It consists of 3911 recordings (approximately 11.23 hours) which are re-targeted to a kinematic model of a human skeleton with 50 DoFs (6 DoF for the Root joint's orientation and position, and the remaining 44 DoFs for the arms, legs, head and torso). The dataset also contains 6278 English sentences (approximately 8 words per sentence) describing the recordings. This is more than the number of recordings because each recording has one or more descriptions, which are annotated by human volunteers. We use 20% of the data as a randomly sampled held-out set for evaluating all models.</p><p>There is a wide variety of motions in this dataset, ranging from locomotion (e.g. walking, running, jogging) to performing (e.g. playing violin/guitar) and gesticulation (e.g. waving). Many recordings have adjectives that further describe the motion, such as speed (e.g. fast and slow), direction (left, right and forward), and counts for periodic motions (e.g. walk 4 steps).</p><p>We use the pre-processing steps of Holden et al. <ref type="bibr">[14]</ref>. All the frames of the motion are transformed such that the body always faces the Z-axis. Joint rotation angles are transformed to 3D positions in the skeleton's local frame of reference, with the Root as the origin. The Root's position on the XZ-plane and its orientation along the Y-axis are represented as velocities instead of absolute values.</p><p>Motion sequences are then sub-sampled from 100 Hz down to a frequency of 12.5 Hz. This is low enough to produce sufficient variance between two consecutive time-steps for the decoder to train on as a regression task, while not compromising the human's perception of the animation <ref type="bibr">[5]</ref>.</p></div>
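Two of these pre-processing steps are mechanical enough to sketch: the 100 Hz to 12.5 Hz sub-sampling and the velocity representation of the Root's XZ translation. The joint count and random motion below are toy placeholders, not the dataset's actual layout:

```python
import numpy as np

# Toy motion clip: 8 seconds at 100 Hz, 21 joints (counts are illustrative).
motion = np.random.default_rng(2).standard_normal((800, 21, 3))

# Sub-sample 100 Hz -> 12.5 Hz by keeping every 8th frame.
step = int(100 / 12.5)
clip = motion[::step]
assert len(clip) == 100          # 8 s at 12.5 Hz

# Represent Root XZ translation as per-frame velocity (frame differences).
root_xz = clip[:, 0, [0, 2]]     # assume joint 0 is the Root (toy convention)
vel = np.diff(root_xz, axis=0)
# Integrating the velocities recovers absolute positions up to the start,
# which is why errors accumulate over time when decoding (see Section 6.1).
recovered = root_xz[0] + np.cumsum(vel, axis=0)
assert np.allclose(recovered, root_xz[1:])
```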
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Implementation Details</head><p>For the pose encoder (q e ), a network of Gated Recurrent Units (GRUs) <ref type="bibr">[9]</ref> is used in our model JL2P. The pose decoder (q d ) is the same, except that it has a residual connection from the input to the output layer. This is similar to the pose decoder in Lin et al. <ref type="bibr">[17]</ref>, except that the extra layer used to predict the trajectory (or Trajectory Predictor) is discarded in our model.</p><p>For the language encoder (p e ), a network of Long Short-Term Memory units (LSTMs) <ref type="bibr">[12]</ref> is used. Each token of the sentence is converted into a distributional embedding using a pre-trained Word2Vec model <ref type="bibr">[22]</ref>. <ref type="foot">1</ref></p></div>
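The residual connection in the pose decoder can be illustrated with a toy numpy loop. This is only a sketch of the residual idea (predicting pose offsets rather than absolute poses); the tanh recurrence, weight matrix, and dimensions are placeholders, not the actual GRU decoder:

```python
import numpy as np

rng = np.random.default_rng(1)
h, p = 32, 63    # hidden size and flattened pose dimension (hypothetical)

W = 0.01 * rng.standard_normal((p, h))

def residual_decoder(z, T):
    """Toy sketch of the residual connection in the pose decoder q_d:
    at each step the network output is added to the previous pose, so
    the model predicts pose *offsets*. The real decoder uses GRUs; the
    tanh recurrence here is just a placeholder."""
    poses, prev, state = [], np.zeros(p), z
    for _ in range(T):
        state = np.tanh(state)       # placeholder for the GRU state update
        prev = prev + W @ state      # residual: output = previous pose + offset
        poses.append(prev.copy())
    return np.stack(poses)

Y_hat = residual_decoder(rng.standard_normal(h), T=16)
assert Y_hat.shape == (16, p)
```

Predicting offsets keeps consecutive poses close together, which tends to stabilize regression on smooth motion data.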
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Baselines</head><p>There has been limited work in the domain of data-driven cross-modal translation from natural language descriptions to pose sequence generation. The closest work to our proposed approach is by Lin et al. <ref type="bibr">[17]</ref><ref type="foot">2</ref>. As mentioned in Section 2, their model does not follow a training curriculum and uses the L2 loss as its loss function. Their model also maps language embeddings to an existing embedding space of poses instead of jointly learning it.</p><p>We also compare our model JL2P (see Section 4) with three ablations derived from it. These ablations study the 3 main components of the model: the joint embedding space, curriculum learning and the Smooth L1 loss. &#8226; JL2P w/o L1 - the L2 loss is used instead of the Smooth L1 loss as the distance metric d(x, y).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>&#8226; JL2P w/o Joint Emb. - instead of the joint training described in Section 4.2, the autoencoder loss L u is minimized first, followed by optimization of the cross-modal translation loss L c .</p><p>&#8226; JL2P w/o Curriculum - the model is trained directly on full-length sequences instead of following the training curriculum of Section 4.3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4.">Objective Evaluation Metrics</head><p>All models are evaluated on the held-out set with the metric Average Position Error (APE). Given a particular joint j, it can be denoted as APE(j) = (1/T) &#931; t ||y t [j] &#8722; &#375; t [j]|| 2 , where y t [j] is the true location and &#375; t [j] &#8712; &#374; is the predicted location of joint j at time t. Another metric, Probability of Correct Keypoints (PCK) <ref type="bibr">[3,</ref><ref type="bibr">27]</ref>, is also used as an evaluation metric; it measures the fraction of predicted keypoints that lie within a distance threshold &#963; of the ground truth.</p></div>
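Both metrics are straightforward to compute from T x J x 3 pose arrays. The PCK variant below (a simple distance threshold, averaged over all joints and frames) is the common formulation; the exact thresholding in the cited works [3, 27] may differ slightly:

```python
import numpy as np

def ape(Y_true, Y_pred, j):
    """Average Position Error for joint j: mean Euclidean distance
    between true and predicted positions of joint j over time."""
    return float(np.linalg.norm(Y_true[:, j] - Y_pred[:, j], axis=-1).mean())

def pck(Y_true, Y_pred, sigma):
    """Fraction of predicted keypoints within distance sigma of the truth."""
    d = np.linalg.norm(Y_true - Y_pred, axis=-1)   # T x J distances
    return float((d <= sigma).mean())

Y = np.zeros((10, 21, 3))          # T x J x 3 ground truth (toy data)
Y_hat = Y.copy()
Y_hat[:, 0, 0] = 1.0               # Root (joint 0) off by exactly 1 unit
assert ape(Y, Y_hat, 0) == 1.0     # Root APE is the 1-unit offset
assert np.isclose(pck(Y, Y_hat, sigma=0.5), 20 / 21)  # only Root fails
```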
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5.">User Study: Subjective Evaluation Metric</head><p>Joint language to pose generation is a subjective task, hence a human's subjective judgment of the quality of prediction is an important metric for this task.</p><p>To this end, we design a user study which asks human annotators to rank two videos generated by 2 different models with the same sentence as input. One of the videos is generated by Lin et al. and the other is either the ground truth or generated by JL2P, JL2P w/o Curriculum, JL2P w/o Joint Emb., or JL2P w/o L1. The annotators answer the following question for each pair of videos: Which of the 2 generated animations is better described by "&lt;sentence&gt;"? To ensure that annotators spend enough time to decide, any annotations which took less than 20 seconds<ref type="foot">3</ref> were rejected. This study subjectively evaluates humans' preference for animations generated by different models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Results and Discussion</head><p>In this section we first use objective measures and then conduct a user study to get a subjective evaluation. Finally, we probe some qualitative examples to understand the effectiveness of the model in tackling the core challenges described in Section 5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1.">Prediction Accuracy by Joint Space</head><p>JL2P demonstrates at least a 9% improvement over Lin et al. (see Table <ref type="table">1</ref>) for all joints. The maximum improvement, around 15%, is seen for the Root joint. Errors in Root prediction can lead to a "sliding-effect" of the feet, where the character translates faster than the feet are stepping. Improvements in APE scores for long-term prediction, especially for the Root, can help remove these artifacts from the generated animation.</p><p>When compared to its variants, JL2P suffers the largest increase in APE when it is trained without the curriculum (i.e. JL2P w/o Curriculum). As discussed in Section 4.3, learning to predict shorter sequences before moving on to longer ones proves beneficial for pose generation. APE performance drops by 4% if the L2 loss is used instead of Smooth L1. In an output space as diverse as pose sequences, it becomes important to ignore outliers which may drive the model to overfit. APE performance drops only by 1% if the embedding space is not trained with the joint loss L j .</p><p>APE values across time for JL2P for different parts of the body (Root, Legs, Arms, Torso and Head) are plotted in Figure <ref type="figure">6</ref>. The Root's APE scores have the fastest rate of increase, followed by Arms, Legs and then Head and Torso. Two out of three coordinates of the Root are represented as velocities, which accumulate errors when integrated back to absolute positions; this is probably a contributing factor to the rapid increase of prediction error over time.</p><p>Our final objective metric is PCK. PCK values (for 35 &#8804; &#963; &#8804; 55) on generated animations are compared among JL2P, its variants and Lin et al. in Figure <ref type="figure">7</ref>. JL2P and its ablations show a consistent improvement over Lin et al., which further strengthens the claim about the prediction accuracy of our model's joint space.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.">Human Judgment</head><p>Human judgment is quantified by the preference scores in Figure <ref type="figure">5</ref>. Human preference for all our baseline models and the ground truth is compared against Lin et al.'s animations. JL2P has a preference of 75%, which is shy of the ground truth by 10%. Preference scores consistently drop for all the other variants of JL2P. JL2P w/o Joint Emb. has the lowest preference score, 60%, when ranked against Lin et al. It is still preferred over Lin et al., but far less likely to be picked when pitted against JL2P. This is an interesting change in trend: removing the joint loss from JL2P did not affect the objective scores significantly, but it lowered the model's human preference by a significant fraction. This leads us to conclude that objective metrics are not enough to judge the performance of a model. Instead, a combination of human judgment and objective metrics is necessary for evaluating pose generation models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3.">Modeling nuanced language concepts</head><p>The Root joint determines the trajectory of the animation, which is crucial for translating concepts like speed (e.g. fast and slow) and direction (e.g. left, right, forward and backward) from natural language to animation. We plot the trajectories generated by JL2P, the ground truth and Lin et al. for different sentences in Figure <ref type="figure">3</ref>.</p><p>Modeling direction: For sentences describing direction, the trajectories generated by JL2P are similar to the ground truth trajectories. In contrast, Lin et al.'s trajectories tend to be semantically incorrect and show a slightly curved forward motion for these sentences.</p><p>Modeling speed: For the sentence "A person runs very fast forward", JL2P is able to understand that the animation has to move faster. It covers approximately the same distance as the ground truth in the same amount of time, hence it has the same speed. In contrast, even though Lin et al.'s motion is in the forward direction, it is not able to maintain the speed required by the sentence.</p><p>Modeling actions: In Figure <ref type="figure">4</ref>, we plot animations generated from a diverse set of sentences. JL2P is able to understand the action from the sentences and generate an animation corresponding to that action. It is able to handle many actions, ranging from kneeling (with complex leg motions) to jogging (with periodic hand and leg motion).</p><p>We show, via qualitative examples, that our model JL2P is able to capture nuanced language concepts which are then reproduced in the animations generated at inference time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Conclusions</head><p>In this paper, we proposed a neural architecture called Joint Language-to-Pose (or JL2P), which integrates language and pose to learn a joint embedding space in an end-to-end training paradigm. This embedding space can then be used to generate animations conditioned on an input description. We also proposed the use of a curriculum learning approach which forces the model to generate shorter sequences before moving on to longer ones. We evaluated our proposed model on a parallel corpus of 3D pose data and human-annotated sentences, with objective metrics to measure prediction accuracy as well as a user study to measure human judgment. Our results confirm that our approach of learning a joint embedding in a curriculum learning paradigm enables JL2P to generate animations that are more accurate and deemed more visually representative of the input sentences by humans than those of the state-of-the-art model.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We also trained a variant of the model with BERT as the language encoder, but it did not show any significant improvements.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>As we could not find code or pre-trained models for this work, we use our own implementation and train it on the same data as all other baselines.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2"><p>Each video is 8 seconds long on average. We set a threshold of 20 seconds to give annotators 4 seconds to make their decision.</p></note>
		</body>
		</text>
</TEI>
