skip to main content

Title: COSIM: Commonsense Reasoning for Counterfactual Scene Imagination
As humans, we can modify our assumptions about a scene by imagining alternative objects or concepts in our minds. For example, we can easily anticipate the implications of the sun being overcast by rain clouds (e.g., the street will get wet) and accordingly prepare for that. In this paper, we introduce a new task/dataset called Commonsense Reasoning for Counterfactual Scene Imagination (COSIM) which is designed to evaluate the ability of AI systems to reason about scene change imagination. In this task/dataset, models are given an image and an initial question-response pair about the image. Next, a counterfactual imagined scene change (in textual form) is applied, and the model has to predict the new response to the initial question based on this scene change. We collect 3.5K high-quality and challenging data instances, with each instance consisting of an image, a commonsense question with a response, a description of a counterfactual change, a new response to the question, and three distractor responses. Our dataset contains various complex scene change types (such as object addition/removal/state change, event description, environment change, etc.) that require models to imagine many different scenarios and reason about the changed scenes. We present a baseline model based on a vision-language Transformer (i.e., LXMERT) and ablation studies. Through human evaluation, we demonstrate a large human-model performance gap, suggesting room for promising future work on this challenging counterfactual, scene imagination task.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
NAACL 2022: Proceedings of the 2022 Annual Conference of the North American Chapter of the Association for Computational Linguistics
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Interest in physical therapy and individual exercises such as yoga/dance has increased alongside the well-being trend, and people globally enjoy such exercises at home/office via video streaming platforms. However, such exercises are hard to follow without expert guidance. Even if experts can help, it is almost impossible to give personalized feedback to every trainee remotely. Thus, automated pose correction systems are required more than ever, and we introduce a new captioning dataset named FixMyPose to address this need. We collect natural language descriptions of correcting a “current” pose to look like a “target” pose. To support a multilingual setup, we collect descriptions in both English and Hindi. The collected descriptions have interesting linguistic properties such as egocentric relations to the environment objects, analogous references, etc., requiring an understanding of spatial relations and commonsense knowledge about postures. Further, to avoid ML biases, we maintain a balance across characters with diverse demographics, who perform a variety of movements in several interior environments (e.g., homes, offices). From our FixMyPose dataset, we introduce two tasks: the pose-correctional-captioning task and its reverse, the target-pose-retrieval task. During the correctional-captioning task, models must generate the descriptions of how to move from the current to the target pose image, whereas in the retrieval task, models should select the correct target pose given the initial pose and the correctional description. We present strong cross-attention baseline models (uni/multimodal, RL, multilingual) and also show that our baselines are competitive with other models when evaluated on other image-difference datasets. We also propose new task-specific metrics (object-match, body-part-match, direction-match) and conduct human evaluation for more reliable evaluation, and we demonstrate a large human-model performance gap suggesting room for promising future work. Finally, to verify the sim-to-real transfer of our FixMyPose dataset, we collect a set of real images and show promising performance on these images. Data and code are available: 
    more » « less
  2. null (Ed.)
    Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent{'}s actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene, are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating \textit{commonsense} captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset {``}Video-to-Commonsense (V2C){''} that contains {\textasciitilde}9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions. 
    more » « less
  3. Reasoning is an important ability that we learn from a very early age. Yet, reasoning is extremely hard for algorithms. Despite impressive recent progress that has been reported on tasks that necessitate reasoning, such as visual question answering and visual dialog, models often exploit biases in datasets. To develop models with better reasoning abilities, recently, the new visual commonsense reasoning (VCR) task has been introduced. Not only do models have to answer questions, but also do they have to provide a reason for the given answer. The proposed baseline achieved compelling results, leveraging a meticulously designed model composed of LSTM modules and attention nets. Here we show that a much simpler model obtained by ablating and pruning the existing intricate baseline can perform better with half the number of trainable parameters. By associating visual features with attribute information and better text to image grounding, we obtain further improvements for our simpler & effective baseline, TAB-VCR. We show that this approach results in a 5.3%, 4.4% and 6.5% absolute improvement over the previous state-of-the-art [103] on question answering, answer justification and holistic VCR. 
    more » « less
  4. Neuroimaging studies of human memory have consistently found that univariate responses in parietal cortex track episodic experience with stimuli (whether stimuli are 'old' or 'new'). More recently, pattern-based fMRI studies have shown that parietal cortex also carries information about the semantic content of remembered experiences. However, it is not well understood how memory-based and content-based signals are integrated within parietal cortex. Here, in humans (males and females), we used voxel-wise encoding models and a recognition memory task to predict the fMRI activity patterns evoked by complex natural scene images based on (1) the episodic history and (2) the semantic content of each image. Models were generated and compared across distinct subregions of parietal cortex and for occipitotemporal cortex. We show that parietal and occipitotemporal regions each encode memory and content information, but they differ in how they combine this information. Among parietal subregions, angular gyrus was characterized by robust and overlapping effects of memory and content. Moreover, subject-specific semantic tuning functions revealed that successful recognition shifted the amplitude of tuning functions in angular gyrus but did not change the selectivity of tuning. In other words, effects of memory and content were additive in angular gyrus. This pattern of data contrasted with occipitotemporal cortex where memory and content effects were interactive: memory effects were preferentially expressed by voxels tuned to the content of a remembered image. Collectively, these findings provide unique insight into how parietal cortex combines information about episodic memory and semantic content.

    SIGNIFICANCE STATEMENTNeuroimaging studies of human memory have identified multiple brain regions that not only carry information about “whether” a visual stimulus is successfully recognized but also “what” the content of that stimulus includes. However, a fundamental and open question concerns how the brain integrates these two types of information (memory and content). Here, using a powerful combination of fMRI analysis methods, we show that parietal cortex, particularly the angular gyrus, robustly combines memory- and content-related information, but these two forms of information are represented via additive, independent signals. In contrast, memory effects in high-level visual cortex critically depend on (and interact with) content representations. Together, these findings reveal multiple and distinct ways in which the brain combines memory- and content-related information.

    more » « less
  5. Understanding narratives requires reasoning about implicit world knowledge related to the causes, effects, and states of situations described in text. At the core of this challenge is how to access contextually relevant knowledge on demand and reason over it. In this paper, we present initial studies toward zero-shot commonsense question answering by formulating the task as inference over dynamically generated commonsense knowledge graphs. In contrast to previous studies for knowledge integration that rely on retrieval of existing knowledge from static knowledge graphs, our study requires commonsense knowledge integration where contextually relevant knowledge is often not present in existing knowledge bases. Therefore, we present a novel approach that generates contextually-relevant symbolic knowledge structures on demand using generative neural commonsense knowledge models. Empirical results on two datasets demonstrate the efficacy of our neuro-symbolic approach for dynamically constructing knowledge graphs for reasoning. Our approach achieves significant performance boosts over pretrained language models and vanilla knowledge models, all while providing interpretable reasoning paths for its predictions. 
    more » « less