<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Video2Commonsense: Generating Commonsense Descriptions to Enrich Video Captioning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>01/01/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10276932</idno>
					<idno type="doi">10.18653/v1/2020.emnlp-main.61</idno>
					<title level='j'>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Zhiyuan Fang</author><author>Tejas Gokhale</author><author>Pratyay Banerjee</author><author>Chitta Baral</author><author>Yezhou Yang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset "Video-to-Commonsense (V2C)" that contains ~9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>When humans watch videos they can typically understand and reason about various aspects of the scene beyond the visible objects and actions. This involves understanding that some objects are active agents that not only perform actions and manipulate objects, but are motivated by intentions, have pre-conditions, and that their actions have an effect on the world and their own mental states. For instance, in analyzing the video clip in Figure <ref type="figure">1</ref>, humans employ various capabilities such as perception, reasoning, inference, and speculation, to come up with a description for the observable sequence of events, but also reason about latent aspects such as the intention of the group of runners "to win the medal", the effect of being "congratulated at the finish line", and the attribute "athletic".</p><p>The above example also illustrates that recognition of objects, actions, and events is often not enough; understanding causal relationships, social interactions, and commonsense aspects behind them provides context and a more semantic interpretation of the video <ref type="bibr">(Gupta et al., 2009)</ref>. A model that can provide such detailed interpretations facilitates answering inferential questions, such as "Will the player get angry later?". However, existing visual understanding systems are unable to perform such tasks that require speculative reasoning. A critical missing element in complex video understanding is the capability of performing commonsense inference, especially with a generative model. 
Existing efforts seek to find textual explanations or intentions of human activities as a classification task <ref type="bibr">(Vondrick et al., 2016)</ref> or a vision-to-text alignment problem <ref type="bibr">(Zhu et al., 2015)</ref>.</p><p>In this paper we propose the Video to Commonsense (V2C) framework to generate visually grounded commonsense descriptions about the underlying event in the video, enriching the factual description provided by a caption. Under this framework a system is expected to generate captions as well as three types of commonsense <ref type="bibr">descriptions (intention, effect, attribute)</ref> directly from an input video. The V2C model can also be used as a building block for downstream tasks such as video question answering for questions requiring commonsense. Inspired by <ref type="bibr">(Bosselut et al., 2019)</ref>, our model, the "V2C-Transformer", utilizes: (1) a video encoder to extract global representations of the video, (2) a transformer decoder that generates captions and commonsense descriptions, and (3) a cross-modal self-attention module that exploits joint visual-textual embeddings.</p><p>We curate the V2C dataset for training and benchmarking models on this task. We adopt the MSR-VTT video description dataset <ref type="bibr">(Xu et al., 2016)</ref> as a source of videos and captions. We first utilize the ATOMIC machine commonsense dataset <ref type="bibr">(Sap et al., 2018)</ref> to get a list of candidate commonsense texts (intentions, effects, and attributes), and rank these using a <ref type="bibr">BERT-based (Devlin et al., 2019)</ref> model. Since these candidates are retrieved without using the video and may not be accurate, we instruct humans to watch the videos and select, remove, or rewrite the texts retrieved from ATOMIC. The text retrieved from ATOMIC helps our human annotators to understand the format of desired annotations, and also gives them a list of suggestions. 
The human component in our annotation procedure makes our data visually grounded and relevant, linguistically diverse, and natural.</p><p>We additionally explore the use of our V2C-Transformer architecture for an open-ended video question answering task, where the questions are about commonsense aspects from the video. For this, we create a QA addendum of the V2C dataset called V2C-QA. By asking questions about the latent aspects in the video, our models are able to enrich caption generation with three specific types of commonsense knowledge.</p><p>Our contributions are summarized below: 1. We formulate the "V2C" task for enriching video captioning by generating descriptions of commonsense aspects. 2. We curate a video dataset annotated with captions and commonsense descriptions.</p><p>3. We present our V2C-Transformer architecture that generates relevant commonsense descriptions, and serves as a strong baseline. 4. We pose V2C as a video question answering task and show that it can assist commonsense caption generation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Video to Commonsense (V2C)</head><p>Problem Formulation: Consider a video V consisting of N_v frames described by sentence S. Our Video-to-Commonsense (V2C) framework can be used for generating commonsense descriptions C under two settings. In the first setting (V2C-Completion), we use ground-truth captions to guide commonsense-enriched caption generation. This task can be viewed as providing supplementary explanations to the caption. In the second setting (V2C-Generation), we first learn to generate captions from videos, g(V), and then use them to generate commonsense descriptions.</p><p>\[ C = \text{V2C-Completion}(V, S), \qquad C = \text{V2C-Generation}(V, g(V)) \tag{1} \]</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">V2C-Transformer</head><p>The proposed Video2Commonsense Transformer is a cross-modal model that generates captions and commonsense-enriched descriptions from videos. Our approach (Figure <ref type="figure">2</ref>) adopts the "encoder-decoder" design: a video encoder that extracts global representations of the input video, and a transformer decoder that produces relevant commonsense knowledge along with captions.</p><p>Video Encoder: We obtain per-frame ResNet-152 <ref type="bibr">(He et al., 2016)</ref> features for video V and process them using an LSTM model (Sundermeyer  We concatenate all previous hidden states from each LSTM module as a final global video encoding v, to provide the model with explicit context using the temporal attention mechanism.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decoder:</head><p>The video encoding is used as input to two decoder networks that use a transformer language model <ref type="bibr">(Radford et al., 2018)</ref> to generate a caption and commonsense description, using an inference mechanism similar to <ref type="bibr">Bosselut et al. (2019)</ref>.</p><p>Our model is a two-stage process that first predicts the current events directly from videos, and then produces the corresponding commonsense captions.</p><p>During training, the caption decoder D_CAP takes the video encoding (v) and ground truth caption (s) as input to generate caption encoding (&#349;), while the commonsense decoder D_CMS uses the concatenation of video and caption encoding to obtain the commonsense description (c), as shown in Figure <ref type="figure">1</ref> (b). This arrangement enables the attention module in the commonsense decoder to attend to both the video and caption context.</p><p>The Transformer Decoder is composed of a stack of transformer blocks (dashed area in (c) Figure <ref type="figure">2</ref>), whose main component is a self-attention architecture. It takes as input the summation of word embedding and the positional encoding offset by 1 position through masked multi-head attention, which prevents future words from being seen. In our model, we deploy two stacked decoder architectures for both caption decoding and commonsense knowledge decoding. The Transformer Block consists of consecutive linear transformations: a multi-head attention module (denoted as H_M-ATT), a two-layer feed forward network (H_FFN), a layer normalization operation, and a residual connection.</p></div>
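<div xmlns="http://www.tei-c.org/ns/1.0"><head>Illustrative sketch: masked self-attention</head><p>The masking described for the transformer decoder can be illustrated with a minimal, single-head sketch in plain Python. It is not the authors' implementation: it omits the learned projections, multiple heads, feed-forward layers, and normalization, and the function names are hypothetical. It only shows how scaled dot-product attention can be restricted so that position t attends to positions 0..t and never to future words.</p>
<p>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def masked_self_attention(queries, keys, values):
    """Scaled dot-product attention where position t only sees positions 0..t."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for t, q in enumerate(queries):
        # Causal mask: keep only keys at positions up to and including t.
        visible = range(t + 1)
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d_k)
            for j in visible
        ]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        out = [
            sum(w * values[j][d] for w, j in zip(weights, visible))
            for d in range(len(values[0]))
        ]
        outputs.append(out)
        # Pad with zeros so each row has one weight per position.
        all_weights.append(weights + [0.0] * (len(keys) - t - 1))
    return outputs, all_weights
```
</p>
<p>Calling masked_self_attention(x, x, x) on a short sequence of embedding vectors x returns one output vector per position, together with a lower-triangular weight matrix in which every entry for a future position is zero.</p></div>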
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Multi-head Attention module</head><p>To enable our transformer decoder to generate commonsense descriptions by using both the visual and textual content, we modify the multi-head attention module (which acts as the basic unit in recent transformer-based language generation models <ref type="bibr">(Radford et al., 2018</ref><ref type="bibr">, 2019)</ref>) as a cross-modal module. H_M-ATT takes as input the embeddings of the key (K), value (V), and query (Q). The key and value in a transformer block are the video encoding (caption decoder) or the concatenation of the video and caption encodings (commonsense decoder), while the query is the output from the previous transformer block. In the masked multi-head attention module, K, V, and Q are identical vectors of the input embedding. For a self-attention block with h heads,</p><p>where x_i is computed by the scaled dot-product attention operation, for head-index i, key-dimension d k n, and transformation parameters W_i.</p><p>for D_CAP, x_i = SOFTMAX(</p><p>Caption: A soldier fights with his enemy.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">The V2C Dataset</head><p>For the V2C task we need video clips annotated with commonsense descriptions about the agents in the video, as shown in Figure <ref type="figure">1</ref>. While there are video captioning datasets such as MSR-VTT <ref type="bibr">(Xu et al., 2016)</ref>, the captions in these datasets describe only the observable objects in the image, but do not describe latent and commonsense aspects. We are the first to curate such a dataset with annotations describing the intention of the agent to perform an action, the effect of the action, and the attribute of the agent given the action.</p><p>video, thus making it inappropriate to just evaluate caption generation using BLEU scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MSR-VTT</head><p>ATOMIC <ref type="bibr">(Sap et al., 2018)</ref> is an atlas of everyday commonsense knowledge and contains 880k triplets about causes and effects of human activities, organized as if-then relations, annotated by crowd-sourced workers. This data can be categorized based on causal relations, thereby giving us the categories "cause", "effect" and "attribute", e.g., "if X wants to relax, then he will play video game."</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Querying from ATOMIC and Re-ranking</head><p>Since inferential knowledge in ATOMIC only covers human activities, we first retain only those captions in MSR-VTT that describe human activities. We then select the three queries from ATOMIC most similar to the caption, and extract the commonsense descriptions corresponding to these queries. In order to select a more reasonable subset of commonsense descriptions, we first train a ranking model. We use the BERT <ref type="bibr">(Devlin et al., 2019)</ref> architecture for the ranking model, trained on the ATOMIC dataset for a binary classification task, to predict the relevance of a candidate commonsense description with respect to the event. We select the top three relevant intentions, effects, and attributes for each caption. This allows us to obtain a preliminary set of 9 commonsense annotations per video directly from the ATOMIC dataset, relevant to the caption, albeit with noise and annotations that are not relevant to the video.</p></div>
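<div xmlns="http://www.tei-c.org/ns/1.0"><head>Illustrative sketch: top-3 re-ranking per category</head><p>The selection step described above (score each candidate against the caption, then keep the top three intentions, effects, and attributes) can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, and word_overlap_score is a toy stand-in for the trained BERT ranker, used only so that the example runs without a model.</p>
<p>
```python
from collections import defaultdict


def top_k_per_category(candidates, score_fn, caption, k=3):
    """Keep the k highest-scoring candidates for each commonsense category.

    candidates is a list of (category, text) pairs; score_fn(caption, text)
    returns a relevance score, higher meaning more relevant.
    """
    grouped = defaultdict(list)
    for category, text in candidates:
        grouped[category].append((score_fn(caption, text), text))
    selected = {}
    for category, scored in grouped.items():
        # Sort by descending score; the sort is stable, so ties keep input order.
        scored.sort(key=lambda pair: -pair[0])
        selected[category] = [text for _, text in scored[:k]]
    return selected


def word_overlap_score(caption, text):
    """Toy stand-in for a trained ranker: share of candidate words found in the caption."""
    caption_words = set(caption.lower().split())
    words = text.lower().split()
    return sum(w in caption_words for w in words) / len(words)
```
</p>
<p>With three categories and k=3, this yields at most nine candidate descriptions per caption, matching the size of the preliminary annotation set described above.</p></div>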
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Detailed Human Annotation</head><p>Table <ref type="table">2</ref>: Evaluation of V2C completion task using CIDER, BLEU, Perplexity, Rouge, and Meteor metrics. We use only BLEU-1 to evaluate the attribute generation since the average length of the ground truth is just less than 2.</p><p>Since we do not use the video to retrieve commonsense descriptions from ATOMIC, we employ human workers to annotate our dataset. We recruit two sets of human workers to watch the video, read the caption, and select or annotate the relevant commonsense descriptions for each video. The first set is Amazon Mechanical Turk (AMT) workers who select relevant descriptions. The second set is skilled human annotators, screened from a set of university students proficient in English, who are asked to provide annotations in their own words, and remove or edit irrelevant annotations that were provided by ATOMIC and AMT workers. This makes our annotations not only grounded in the video, but also more descriptive, linguistically diverse, and of higher quality (see Figure <ref type="figure">3</ref>). The descriptions from ATOMIC, although not relevant to the video in some cases, give our workers an idea about the format of annotations desired. The skilled humans reported that 95% of the captions were relevant, and 65% of the ATOMIC descriptions were useful in understanding the annotation task. Through this procedure, we obtain 6819 videos for training and 2906 videos for testing, a total of 121,651 captions (&#8764;12 captions/video), each caption accompanied by 5 commonsense knowledge annotations (V2C-Raw set). In our experiments, we use video captioning techniques to conduct the V2C completion task on the V2C-Raw set. In addition, we instruct human annotators to select and rewrite one raw phrase into complete sentences that complement the captions. 
In total we have 3 complete sentences per video for intention/effect/attribute respectively, and this yields a subset that allows our model to generate complete story-like sentences (V2C-Clean Set). Table <ref type="table">1</ref> shows examples from the newly compiled dataset. We conduct rigorous human evaluation to evaluate the quality of our V2C dataset ("Gold Annotations" in Table <ref type="table">3</ref>). Details about the dataset creation process and quality control mechanisms can be found in the Appendix.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section we describe the loss function used for training our model, additional details about video pre-processing, hyper-parameters, and baseline models, and the metrics used for evaluation.</p><p>Loss Function: The decoder parameters &#920; are trained to maximize the log-likelihood over the training set given by L = L_cap + L_cms, where</p><p>\[ L_{cap} = \sum_{t=1}^{N_S} \log \Pr(y_t \mid y_{t-1}, v; \Theta), \quad \text{and} \quad L_{cms} = \sum_{t=1}^{N_C} \log \Pr(y_t \mid y_{t-1}, [v, \hat{s}]; \Theta). \]</p><p>Here y_t denotes the one-hot vector probability of each word at time t, and N_S and N_C denote the lengths of the caption and the commonsense description, respectively.</p><p>Setting: In order to obtain video representations, we uniformly sample 40 frames from each video and extract features using ResNet <ref type="bibr">(He et al., 2016)</ref> pre-trained on the ImageNet ILSVRC12 dataset <ref type="bibr">(Deng et al., 2009)</ref>, obtaining a 2048-d output from the last layer. We use a one-hot (1-of-N) encoding of the text input and pass it through an embedding layer to produce a 1028-d hidden vector. We use independent vocabularies for captioning and commonsense generation with sizes 27,603 and 24,010 respectively. Note that, as the generated</p><p>Hyperparameters: Our decoder is a lightweight transformer decoder consisting of 6 transformer blocks with 8 attention heads each. We use the Adam optimizer with 5000 warm-up steps, a learning rate initialized at 1e-4, and a dropout probability of 0.1 after the residual layer. Our model is trained on a machine with a single NVIDIA 1080-Ti GPU.</p><p>Metrics: We report both automatic scores and human evaluations, following the protocols from <ref type="bibr">(Bosselut et al., 2019;</ref><ref type="bibr">Sap et al., 2018)</ref>. We evaluate our method using BLEU (n=1-4) <ref type="bibr">(Papineni et al., 2002)</ref>, Meteor <ref type="bibr">(Banerjee and Lavie, 2005)</ref>, Rouge <ref type="bibr">(Lin, 2004)</ref>, and the perplexity score of the generation on its corpus. 
We further conduct human evaluations using AMT workers, who are asked to identify whether the generated commonsense justifiably completes the events (V2C-completion). We follow the setup in <ref type="bibr">(Sap et al., 2018)</ref> and randomly sample 100 videos from the test set and collect 10 generations for each. To guarantee the objectiveness of the human evaluations, we hire 5 workers for each sample, yielding 30k ratings in total for each model.</p></div>
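<div xmlns="http://www.tei-c.org/ns/1.0"><head>Illustrative sketch: the training objective</head><p>The objective L = L_cap + L_cms described in the Loss Function paragraph is a sum of token-level log-probabilities. The sketch below shows that arithmetic only, assuming the decoders have already produced a probability distribution over the vocabulary at each time step. The function names and the list-based representation are hypothetical and are not taken from the authors' code.</p>
<p>
```python
import math


def sequence_log_likelihood(step_distributions, target_ids):
    """Sum of log-probabilities assigned to the ground-truth token at each step.

    step_distributions[t] is the predicted distribution over the vocabulary at
    time t (a list of probabilities); target_ids[t] is the ground-truth index.
    """
    return sum(
        math.log(dist[token])
        for dist, token in zip(step_distributions, target_ids)
    )


def v2c_objective(caption_dists, caption_ids, commonsense_dists, commonsense_ids):
    """L = L_cap + L_cms, the quantity that training maximizes."""
    l_cap = sequence_log_likelihood(caption_dists, caption_ids)
    l_cms = sequence_log_likelihood(commonsense_dists, commonsense_ids)
    return l_cap + l_cms
```
</p>
<p>In practice a framework would minimize the negative of this value (a cross-entropy loss) over mini-batches; the sketch keeps the sign convention of the text, where the log-likelihood is maximized.</p></div>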
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Results</head><p>Natural Language Generation Metrics: We show evaluation of the commonsense completion task in Table <ref type="table">2</ref>. Compared to the baseline model, our method exhibits a consistent and overall improvement on almost all metrics. Our V2C-Transformer significantly outperforms the LSTM based model in <ref type="bibr">(Gao et al., 2017)</ref> by 7.7% at BLEU-4 for the intention prediction. Because the V2C-Transformer and the LSTM model share a similar video encoder, our performance improvement could be attributed to the use of self-attention mechanisms in the transformer block in decoding phase. This observation is consistent with the conclusion from <ref type="bibr">(Bosselut et al., 2019)</ref>, and yields further support to the transformer architecture being suited for commonsense inference tasks. Moreover, when compared with DenseCap which has a similar transformer architecture and parameters, our model exhibits better evaluation scores, verifying it as a strong baseline model for the V2C task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Human Evaluation</head><p>In Table 3, E2C (Event to Commonsense) is the task of commonsense completion given only textual events <ref type="bibr">(Sap et al., 2018;</ref><ref type="bibr">Bosselut et al., 2019)</ref> as opposed to V2C which uses both text and video. 9ENC9DEC <ref type="bibr">(Sap et al., 2018)</ref> is composed of nine GRU based encoder-decoders as a baseline model for commonsense completion on text, and COMET <ref type="bibr">(Bosselut et al., 2019</ref>) is a large-scale generative pre-trained transformer (GPT) model <ref type="bibr">(Radford et al., 2018)</ref>. We would like to highlight that our transformer model is light-weight with only half of the parameters in GPT without any pre-training. We evaluate our model on the tasks of caption generation with human evaluations, and also compare it with the gold annotations. Our gold annotation for ground-truth captions (sourced from the MSR-VTT dataset) points to the fact that a small percentage of captions from MSR-VTT are not relevant to the video, and this is amended by our human workers.</p><p>For the V2C-Completion task, our V2C-Transformer model is substantially better (by 7.73%) than the LSTM-based model from <ref type="bibr">(Gao et al., 2017)</ref>, and shows a consistent lead on each dimension. Thus, when the ground-truth caption is given, our model is able to generate much more relevant commonsense descriptions, thereby consolidating its ability of commonsense generation.</p><p>For the task of V2C-Generation, the difference between human scores for LSTM vs V2C-Transformer is reduced, but our V2C-Transformer outperforms on average by 2.98%. This may be attributed to the fact that the LSTM-based model is slightly better at generating captions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Intention Effect Attribute</head><p>Panel labels: Completion, Generation, Failure Example.</p><p>GT Caption: A woman making fish shaped food with bean paste.</p><p>Because she wants to serve healthy meals, , and she will have food ready to eat soon. The person is seen as skilled with their hands.</p><p>Because she wants to express themselves, the woman is singing a song and playing piano, she will enjoy playing piano. The woman is artistic.</p><p>To know how to play soccer, a man is playing a soccer game, and he will cautiously dribble the ball. The man is enthused.</p><p>To catch a fish, a baby is talking about a fish in the ocean, and he will know more about the ocean. The person is seen as knowledgeable.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Generating Textual Stories with Commonsense</head><p>In order to generate story-like textual descriptions that complement the factual captions, we additionally train our model to exploit our diverse complete-sentence annotations. Specifically, instead of producing the commonsense knowledge given the videos and captions, we finetune our pre-trained V2C-Transformer model on predicting the human rewritten texts, and generate complete story-like captions. Since we do not have enough annotations per sample to compute a fair BLEU score for comparisons, we showcase some sample generated descriptions for qualitative analysis (see Figure <ref type="figure">4</ref>). We observe that the V2C-Transformer is able to produce complete stories that contain simple yet logically consistent storylines that complement both the visual content and the factual descriptions. We believe that collecting a set of story-like sentences will further enrich our models, and allow us to generate much more contextual, creative, and natural commonsense descriptions from a video.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">V2C-QA</head><p>Another way of generating commonsense descriptions about the video is by asking pointed questions. Consider the example in Figure 1 where we ask the question "What happens next to the runners", about the effect of the action "prepare" performed by the agents "group of runners" observed in the video. We propose V2C-QA, an open-ended commonsense video question-answering task, where we ask questions about the intents, effects and attributes of the agents in the video. Dataset: We use the caption and commonsense annotations in the V2C dataset to create question-answer pairs for each video. We first extract the action and subject from the caption using SpaCy linguistic features <ref type="bibr">(Honnibal and Johnson, 2015)</ref>. For each intention, attribute and effect for a video, we use template-based generation to get 7 types of questions, yielding 21 questions per sample, including negative questions as in <ref type="bibr">Gokhale et al. (2020)</ref>. In total, we have 1,250 training videos and 250 test videos, and a total of 37k questions. We have a set of 5,555 unique answers for our questions. Each question can have multiple possible true answers as shown in the example in Figure <ref type="figure">5</ref>. The V2C-QA task asks questions that require commonsense reasoning about internal mental states, motivations, and latent aspects of agents in the video as opposed to the conventional video-QA questions about visible objects and actions.</p></div>
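<div xmlns="http://www.tei-c.org/ns/1.0"><head>Illustrative sketch: template-based question generation</head><p>The template-based generation described above (7 question types for each of the three commonsense aspects, giving 21 questions per sample) can be sketched as below. The templates are hypothetical placeholders, since the source does not list its actual question wording, and the subject and action are assumed to have been extracted from the caption beforehand. Two of the placeholder templates are negatively phrased to mirror the mention of negative questions.</p>
<p>
```python
# Hypothetical templates: the source does not list its actual question wording.
# A fixed answer of None means "use the annotated commonsense text as the answer".
TEMPLATES = [
    ("What is the {aspect} associated with the {subject} when they {action}?", None),
    ("Describe the {aspect} of the {subject}.", None),
    ("Given that the {subject} decided to {action}, what is the {aspect}?", None),
    ("Which {aspect} best fits the {subject}?", None),
    ("Is '{text}' a plausible {aspect} for the {subject}?", "yes"),
    ("Is '{text}' an implausible {aspect} for the {subject}?", "no"),
    ("Is it false that '{text}' describes the {aspect} of the {subject}?", "no"),
]


def build_questions(subject, action, annotations):
    """Return one question-answer record per (aspect, template) combination.

    annotations maps an aspect name (intention, effect, attribute) to the
    annotated commonsense text for that aspect.
    """
    records = []
    for aspect, text in annotations.items():
        for pattern, fixed_answer in TEMPLATES:
            question = pattern.format(
                aspect=aspect, subject=subject, action=action, text=text
            )
            answer = text if fixed_answer is None else fixed_answer
            records.append({"aspect": aspect, "question": question, "answer": answer})
    return records
```
</p>
<p>For a sample with one intention, one effect, and one attribute, build_questions returns 3 x 7 = 21 records, which is consistent with the per-sample count reported above.</p></div>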
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models:</head><p>We utilize our V2C-Encoder followed by an open-ended answering module. We jointly predict the type of the question and combine it with the V2C encoding using a feed-forward network. For textual features, we use embeddings from BERT-base <ref type="bibr">(Devlin et al., 2019)</ref>. Our models are trained on the open-ended QA task, which is set up as a multi-label classification task similar to VQA <ref type="bibr">(Antol et al., 2015)</ref>, with an answering module design inspired by LXMERT <ref type="bibr">(Tan and Bansal, 2019)</ref>. Our loss function includes the classification loss for answering, the attention loss for the question type, and a label-ranking loss.</p><p>Results: MSR-VTT QA <ref type="bibr">(Xu et al., 2017)</ref> is a good baseline since it is trained on a conventional video-QA task on the MSR-VTT videos, and only takes video and query as input, unlike recent video understanding models <ref type="bibr">(Lei et al., 2018)</ref> that take additional supervision, such as subtitles. However, this model is trained for a multiple-choice QA scheme, so we modify it with our open-ended answering module. We compare our models when we use our encoder pretrained on the V2C caption generation task, and then finetune it on the V2C-QA task. We also train models with ground-truth factual captions as input. Our results are shown in Table <ref type="table">4</ref>, where we evaluate on prediction of the top-k (1, 3, 5) answers, and report precision and recall.</p><p>Our encoder pre-trained on the V2C task outperforms all other models. Attribute-related questions are easier to answer, while the models struggle the most for questions about intention. Captions help in questions about effects. The overall text-only baseline shows an insignificant bias between the question and answer-options.</p></div>
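<div xmlns="http://www.tei-c.org/ns/1.0"><head>Illustrative sketch: precision and recall over top-k answers</head><p>Because each question can have several correct answers, evaluation over the top-k predictions can be computed per question as sketched below. This uses the common definitions (hits divided by k for precision, hits divided by the number of true answers for recall). The source does not spell out its exact formula, so this is an assumption, and the function name is hypothetical.</p>
<p>
```python
def precision_recall_at_k(ranked_answers, true_answers, k):
    """Precision and recall of the k highest-ranked answers for one question.

    ranked_answers is ordered from most to least confident; true_answers is
    the set of answers accepted as correct for the question.
    """
    top_k = ranked_answers[:k]
    hits = sum(1 for answer in top_k if answer in true_answers)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_answers) if true_answers else 0.0
    return precision, recall
```
</p>
<p>Averaging these per-question values over the test set for k in (1, 3, 5) gives numbers of the kind reported in a top-k table. Note that recall can only grow with k, while precision usually falls once k exceeds the number of true answers.</p></div>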
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Related Work</head><p>Video Captioning: Captioning is crucial for understanding visuals; however, it is typically limited to describing observable objects and events <ref type="bibr">(Yang et al., 2011;</ref><ref type="bibr">Thomason et al., 2014;</ref><ref type="bibr">Gan et al., 2017)</ref>, or to generating paragraphs or multi-sentence captions about the image or video <ref type="bibr">(Krause et al., 2017;</ref><ref type="bibr">Krishna et al., 2017)</ref>. However, for detailed video understanding, one needs to obtain descriptions that go beyond observable visual entities and use background knowledge and commonsense to reason about objects and actions. Work on inferring the motivations of human actions in static images by incorporating commonsense knowledge is reflected in <ref type="bibr">Pirsiavash et al. (2014)</ref>; <ref type="bibr">Vondrick et al. (2016)</ref>. Commonsense caption generation has been approached on abstract scenes and clipart images in <ref type="bibr">Vedantam et al. (2015)</ref>. We present the first generative model for commonsense video captioning.</p><p>Video Question Answering: Since caption generation can only describe observable events, recent work seeks to move closer to comprehension, by learning to answer complex questions about videos. However, the datasets used for Video QA <ref type="bibr">(Yang et al., 2003;</ref><ref type="bibr">Xu et al., 2016;</ref><ref type="bibr">Zhu et al., 2017)</ref> focus only on directly evident visual concepts and construct the questions mostly about "where" and "what" aspects. Question answering on movie videos has been explored by <ref type="bibr">Tapaswi et al. (2016)</ref> who collect questions about "why" and "how" aspects. Recently <ref type="bibr">Lei et al. (2018)</ref>
<ref type="bibr">(Yu et al., 2015)</ref> as a "fill-in-the-blanks" task for single-image captioning that contains some categories which require reasoning about internal mental states and future events. <ref type="bibr">Kim et al. (2018)</ref> provide textual explanations for actions in a self-driving scene. <ref type="bibr">Zellers et al. (2019)</ref> propose a visual question answering task that requires commonsense reasoning to answer a question and to provide a rationale behind the answer. Spatial and compositional reasoning is required to answer questions about synthetic images in CLEVR <ref type="bibr">(Johnson et al., 2017)</ref>. Critical aspects of visual reasoning also include the model's ability to conduct object grounding by natural language descriptions <ref type="bibr">(Rohrbach et al., 2016;</ref><ref type="bibr">Fang et al., 2018</ref><ref type="bibr">, 2019)</ref>. Another aspect of visual reasoning is the ability to predict a sequence of actions (procedure planning), or to reason about intermediate video frames (walkthrough planning) between two frames, explored in <ref type="bibr">Gokhale et al. (2019)</ref>; <ref type="bibr">Chang et al. (2020)</ref>.</p><p>Textual Commonsense: Commonsense-based question answering is an area of active research with several datasets and challenges requiring reasoning about conceptual commonsense <ref type="bibr">(Talmor et al., 2019)</ref>, physical commonsense <ref type="bibr">(Bisk et al., 2020)</ref>, social commonsense <ref type="bibr">(Sap et al., 2019)</ref>, and abductive commonsense <ref type="bibr">(Bhagavatula et al., 2020)</ref>. On the other hand, challenges such as ProPara <ref type="bibr">(Mishra et al., 2018)</ref> and bAbI <ref type="bibr">(Weston et al., 2015)</ref> require tracking elements, actions, and effects of actions. 
Commonsense-based text generation has recently been explored via the ATOMIC dataset <ref type="bibr">(Sap et al., 2018)</ref>, a corpus of 877k textual descriptions of inferential knowledge organized as if-then relations. <ref type="bibr">Bosselut et al. (2019)</ref> adopt the ATOMIC dataset to learn a generative model of commonsense knowledge. To the best of our knowledge, ours is the first work on generating commonsense descriptions from visual inputs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Outlook</head><p>A video typically contains one or many objects (sometimes performing actions) in different backgrounds, scenes, or situations. Some objects may be "passive" such as trees or buildings, while some objects may be "active" such as people performing actions like walking, singing, and driving. This paper is focused on describing such active agents in terms of their intentions, effects of their actions, and attributes that characterize these agents. We distinguish V2C from the traditional video captioning task. Video captions describe observable objects, background, and actions, while commonsense descriptions in our task seek to describe the unobservable intentions of the agent (pre-conditions or mental conditions), effects of the action (that happen in the future), and attributes which characterize the agent. Thus commonsense generation goes beyond the visible. Ours is the first attempt at developing a generative video-based commonsense model. We anticipate that our framework can be utilized for many applications in video understanding, comprehension, human-robot interaction, and learning commonsense in a multimodal setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusion</head><p>In this paper, we explore a novel and challenging task: generating video descriptions with rich commonsense descriptions that complement the factual captions. We expand an existing video captioning dataset for the V2C task through automated retrieval from a textual commonsense corpus followed by human labeling, and present a novel V2C-Transformer model to serve as a strong baseline method for the V2C task. Our evaluation verifies the effectiveness of our method, while also indicating scope for further study, enhancement, and extensions in the future. Our experiments on using the V2C-Transformer as a component for the V2C-QA task show that the model has transfer learning capabilities that can be applied to other vision-and-language tasks, such as question answering, that require commonsense reasoning.</p><p>Our dataset creation methodology is a three-step procedure as shown in Figure <ref type="figure">9</ref>. In the first step, we use the caption to query ATOMIC <ref type="bibr">(Sap et al., 2018)</ref> and retrieve the top-3 intentions, effects, and attributes. These are re-ranked by a BERT-based model in the second step. The final step involves humans in the annotation process. We ask human annotators to select the most relevant descriptions, and to provide additional descriptions in their own words. The annotators also convert a subset of our dataset into complete sentence descriptions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.1 Querying from ATOMIC</head><p>For every video-caption pair in the MSR-VTT dataset, we select the 3 most similar events from ATOMIC. These are then used to retrieve textual descriptions of three types (intentions, effects, and attributes) from ATOMIC.</p></div>
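<div xmlns="http://www.tei-c.org/ns/1.0"><p>The retrieval step above can be sketched as follows. This is only an illustration and not the authors' implementation: this section does not specify the similarity measure used to match captions to ATOMIC events, so the sketch assumes a simple token-overlap (Jaccard) score. The function names, the toy events, and the toy caption are all hypothetical.</p>

```python
import heapq


def jaccard(text_a, text_b):
    """Token-overlap similarity (an assumed stand-in for the unspecified measure)."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    union = tokens_a.union(tokens_b)
    if not union:
        return 0.0
    return len(tokens_a.intersection(tokens_b)) / len(union)


def top_k_events(caption, events, k=3):
    """Return the k events most similar to the caption, best match first."""
    return heapq.nlargest(k, events, key=lambda event: jaccard(caption, event))


# Hypothetical ATOMIC-style events and a hypothetical video caption.
toy_events = [
    "PersonX plays the guitar",
    "PersonX cooks dinner",
    "PersonX sings on stage",
    "PersonX drives a car",
]
top = top_k_events("a man plays the guitar on stage", toy_events)
print(top)
```

<p>The selected events would then serve as keys for looking up the intention, effect, and attribute descriptions attached to them in ATOMIC.</p></div>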
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.2 BERT Ranking Model</head><p>We use a Bidirectional Encoder Representations from Transformers (BERT) model <ref type="bibr">(Devlin et al., 2019)</ref> as a ranking model to rank the candidates and retrieve the top-3 most plausible commonsense aspects that complement the ground-truth caption. This is done by treating the ranking task as a binarized next sentence prediction (NSP) task, trained on the ATOMIC <ref type="bibr">(Sap et al., 2018)</ref> dataset. When choosing the sentences A and B for each training pair, for 50% of the training pairs we choose the actual next sentence that follows A, and for the remaining pairs we choose a random sentence from ATOMIC as a negative sentence. This setting is consistent with the NSP task of <ref type="bibr">Devlin et al. (2019)</ref>. We train our model on ATOMIC, and use it to expand video captions from MSR-VTT <ref type="bibr">(Xu et al., 2016)</ref>. Our BERT model consists of 12 transformer blocks, 12 attention heads, and 768 hidden dimensions (110M parameters in total). See Table <ref type="table">5</ref>. In addition, we also conduct human evaluations to measure the overall quality of the expanded V2C dataset (see "gold annotations" in Table 3, main paper).</p></div>
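<div xmlns="http://www.tei-c.org/ns/1.0"><p>A minimal sketch of how such binarized training pairs might be built, assuming ATOMIC is available as a mapping from an event (sentence A) to its commonsense descriptions (candidate sentences B). The name build_nsp_pairs, the toy data, and the seed are hypothetical, and the sketch covers only pair construction, not BERT fine-tuning or the final top-3 ranking.</p>

```python
import random


def build_nsp_pairs(event_to_descriptions, seed=0):
    """Build (sentence_a, sentence_b, label) triples for binarized NSP.

    label 1: sentence_b is a true description of the event sentence_a.
    label 0: sentence_b is a random description drawn from other events.
    Each pair is positive with probability 0.5, as in the standard NSP setup.
    """
    rng = random.Random(seed)
    all_descriptions = sorted(
        {d for ds in event_to_descriptions.values() for d in ds}
    )
    pairs = []
    for event, descriptions in event_to_descriptions.items():
        for positive in descriptions:
            # Negatives must not be valid descriptions of this event.
            candidates = [d for d in all_descriptions if d not in descriptions]
            if rng.random() >= 0.5 or not candidates:
                pairs.append((event, positive, 1))
            else:
                pairs.append((event, rng.choice(candidates), 0))
    return pairs


# Hypothetical ATOMIC-style entries, for illustration only.
toy_atomic = {
    "PersonX plays the guitar": ["to entertain people", "musical"],
    "PersonX cooks dinner": ["to feed the family", "gets tired"],
}
pairs = build_nsp_pairs(toy_atomic)
for sentence_a, sentence_b, label in pairs:
    print(label, "|", sentence_a, "|", sentence_b)
```

<p>At inference time, a model trained on such pairs can score each retrieved description against the caption, and the highest-scoring descriptions are kept.</p></div>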
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.3 Human Labeling</head><p>With querying from ATOMIC and BERT re-ranking, we obtain commonsense descriptions that are relevant to the caption. However, we want to make sure that these descriptions are also relevant to the video. Thus we employ human workers from Amazon Mechanical Turk (AMT) to select the most relevant commonsense descriptions. Our annotation interface is shown in Figure <ref type="figure">10</ref>. We ask the annotators to select descriptions that are most relevant to the video and to the caption, and also encourage them to add their own commonsense descriptions. This makes our dataset more natural and human-like. This also allows us to remove noisy annotations that may be produced due to text-only ATOMIC querying. We show additional samples from our V2C dataset in Figure 11, a word cloud in Figure 7, and word frequencies in Figure 8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A.4 Benefits of the Three-Step Pipeline</head><p>Since our videos are annotated with captions, we use the captions to retrieve commonsense descriptions from ATOMIC. The ATOMIC dataset has comprehensive annotations for human activities, actions, and events and as such covers most of the events in MSR-VTT. Thus using these two datasets together is a natural step for creating our V2C dataset. This purely caption-based retrieval unfortunately does not incorporate the latent aspects of the video, but only those from the caption. Moreover, since the video is not used for retrieving these, the commonsense annotations may be out of context. Thus, we bring in human annotators to watch the video, read the caption, and then use the set of descriptions from ATOMIC to select the relevant ones and to discard the irrelevant or out-of-context descriptions. The human annotators then provide annotations about intention, effect, and attribute in their own words. The ATOMIC-retrieved descriptions help the human annotators to get an idea about the task and also get a glimpse of the format of the desired annotations. This significantly reduces the noise in human annotations.</p><p>To guarantee and measure the overall quality of our V2C dataset, we have conducted human evaluations on the V2C annotations. Our results show that 86.29% of the video-caption-commonsense triples are labeled as reasonable samples (see "Gold Annotations" in Table 3 of the main paper), verifying the quality of our dataset.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B V2C-QA Dataset</head><p>For the V2C Question Answering task, we repurpose our V2C dataset and convert it to a question-answering dataset. We choose a subset of 1500 videos: 1250 for training and 250 for testing, following the same train-test split as MSR-VTT. We use SpaCy linguistic features <ref type="bibr">(Honnibal and Montani, 2017)</ref>   combinations of the slots in the template. Thus we get 21 types of questions (7 each for intention, effect, and attribute) as shown in Table <ref type="table">6</ref>. Since our task is open-ended question answering, our questions are annotated with all possible correct answers for that question. To get answers for the "negative" questions as shown in Table <ref type="table">6</ref>, we use an adversarial matching strategy similar to that of <ref type="bibr">Zellers et al. (2019)</ref>, based on RoBERTa <ref type="bibr">(Liu et al., 2019)</ref> similarity. We will release our V2C-QA question and  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Qualitative Generation Results</head><p>We show additional V2C-Completion samples generated by our V2C-Transformer model in Table 7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Human Evaluation</head><p>Human evaluation is an important part of verifying the performance of our model and the quality of the V2C dataset. In this section we describe our setup for human evaluation of the captions and commonsense descriptions in our dataset as well as those generated by our models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.1 Amazon Mechanical Turk Interface</head><p>We conduct our human evaluations by crowdsourcing ratings from workers on Amazon Mechanical Turk (AMT). We run these human evaluations on the same test set used for our automated metrics. Figures <ref type="figure">12</ref> and 13 show screenshots of the rating task as seen by the workers. The workers are given explicit instructions about this rating task, and depending on the task are asked to rate the commonsense descriptions and the caption.</p><p>For the V2C-Completion task, the workers are provided with the video and the ground-truth caption and asked to rate only the generated commonsense (intention, effect, or attribute) on a scale of 1 to 5. The workers are asked to provide this rating on the basis of whether the generated text is relevant to the video, i.e., whether the caption/commonsense can plausibly complete the given event.</p><p>For the V2C-Generation task, the workers are asked to rate the caption as well as the commonsense texts with respect to the video. The workers are also asked to conduct identical tasks for the gold (ground-truth) annotations in our new V2C dataset. ratings of 4 and 5 are close to each other but 1 and 5 are opposite. So to avoid this, we replace the indicator function with a smooth exponential term. The smooth inter-rater agreement score is given by:</p><p>1 2</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D.3.3 Results</head><p>Table <ref type="table">8</ref> shows our analysis in terms of the three metrics described above. Our V2C-Transformer architecture consistently outperforms the baseline model ATTENCDEC <ref type="bibr">(Gao et al., 2017)</ref> in all three metrics for each type of commonsense. This means that raters are more consistent with their ratings (in terms of deviation or agreement) for commonsense descriptions generated by our model.     </p></div></body>
		</text>
</TEI>
