Videos convey rich information: dynamic spatio-temporal relationships between people and objects, and diverse multimodal events, are present in a single video clip. Hence, it is important to develop automated models that can accurately extract such information from videos. Answering questions about videos is one task that can evaluate such AI abilities. In this paper, we propose a video question answering model that effectively integrates multimodal input sources and finds the temporally relevant information to answer questions. Specifically, we first employ dense image captions to help identify objects and their detailed salient regions and actions, giving the model useful extra information (in explicit textual format, for easier matching) for answering questions. Moreover, our model comprises dual-level attention (word/object and frame level), multi-head self/cross-integration for different sources (video and dense captions), and gates that pass the more relevant information to the classifier. Finally, we cast the frame selection problem as a multi-label classification task and introduce two loss functions, In-and-Out Frame Score Margin (IOFSM) and Balanced Binary Cross-Entropy (BBCE), to better supervise the model with human importance annotations. We evaluate our model on the challenging TVQA dataset, where each of our model components provides significant gains, and our overall model outperforms the state-of-the-art by a large margin (74.09% versus 70.52%). We also present several word-, object-, and frame-level visualization studies.
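The abstract names BBCE but does not spell out its exact form; below is a minimal sketch, assuming the common balancing scheme of averaging the positive ("in") and negative ("out") frame terms separately so the abundant out-of-scope frames cannot dominate the loss. The function name and tensor shapes are illustrative, not the paper's actual code.

```python
import torch

def balanced_bce(frame_scores, frame_labels, eps=1e-6):
    """Balanced binary cross-entropy over per-frame relevance logits.

    frame_scores: (batch, num_frames) raw logits from the frame selector.
    frame_labels: (batch, num_frames) 0/1 human importance annotations.
    Positive ("in") and negative ("out") terms are averaged separately,
    so the many out-frames cannot drown out the few annotated in-frames.
    """
    probs = torch.sigmoid(frame_scores)
    pos = frame_labels.float()
    neg = 1.0 - pos
    pos_loss = -(torch.log(probs + eps) * pos).sum() / (pos.sum() + eps)
    neg_loss = -(torch.log(1.0 - probs + eps) * neg).sum() / (neg.sum() + eps)
    return pos_loss + neg_loss
```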
Model Transfer for Event Tracking as Transcript Understanding for Videos of Small Group Interaction
Videos of group interactions contain a wealth of information beyond what is directly communicated in a transcript of the discussion. Tracking who has participated throughout an extended interaction, and what each participant's trajectory has been in relation to the others, is the foundation for joint activity understanding, though it comes with unique challenges in videos of tightly coupled group work. Motivated by insights into the properties of such scenarios, including group composition and the nature of task-oriented, goal-directed work, we present a successful proof-of-concept. In particular, we present a transfer experiment to a dyadic robot construction task, an ablation study, and a qualitative analysis.
- PAR ID: 10356813
- Date Published:
- Journal Name: International Conference on Computational Linguistics
- Volume: 29
- Issue: 12
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Captioning is a crucial and challenging task for video understanding. In videos that involve active agents such as humans, the agent's actions can bring about myriad changes in the scene. Observable changes such as movements, manipulations, and transformations of the objects in the scene are reflected in conventional video captioning. Unlike images, actions in videos are also inherently linked to social aspects such as intentions (why the action is taking place), effects (what changes due to the action), and attributes that describe the agent. Thus for video understanding, such as when captioning videos or when answering questions about videos, one must have an understanding of these commonsense aspects. We present the first work on generating commonsense captions directly from videos, to describe latent aspects such as intentions, effects, and attributes. We present a new dataset, "Video-to-Commonsense (V2C)", that contains ~9k videos of human agents performing various actions, annotated with 3 types of commonsense descriptions. Additionally we explore the use of open-ended video-based commonsense question answering (V2C-QA) as a way to enrich our captions. Both the generation task and the QA task can be used to enrich video captions.
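The abstract names three commonsense description types per video; a small sketch of what one such record might look like is below. The field names are hypothetical, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class V2CAnnotation:
    """One V2C-style record (field names are illustrative only):
    a video of an agent's action plus the three commonsense
    description types named in the abstract."""
    video_id: str
    action_caption: str                                   # observable-action caption
    intentions: List[str] = field(default_factory=list)   # why the action happens
    effects: List[str] = field(default_factory=list)      # what changes as a result
    attributes: List[str] = field(default_factory=list)   # traits of the agent
```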
Relational reasoning is a complex form of human cognition involving the evaluation of relations between mental representations of information. Prior studies have modified stimulus properties of relational reasoning problems and examined differences in difficulty between different problem types. While subsets of these stimulus properties have been addressed in separate studies, there has not been a comprehensive study, to our knowledge, which investigates all of these properties in the same set of stimuli. This investigative gap has resulted in different findings across studies which vary in task design, making it challenging to determine what stimulus properties make relational reasoning—and the putative formation of mental models underlying reasoning—difficult. In this article, we present the Multidimensional Relational Reasoning Task (MRRT), a task which systematically varied an array of stimulus properties within a single set of relational reasoning problems. Using a mixed-effects framework, we demonstrate that reasoning problems containing a greater number of premises as well as multidimensional relations led to greater task difficulty. The MRRT has been made publicly available for use in future research, along with normative data regarding the relative difficulty of each problem.
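The abstract does not give the exact model specification; the sketch below, assuming a linear mixed model with a per-participant random intercept over simulated trial data (column names and effect sizes are hypothetical), shows the kind of mixed-effects analysis described. A logistic mixed model would be the more standard choice for binary accuracy, but statsmodels' `mixedlm` keeps the sketch self-contained.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n_subj, n_trials = 20, 40
df = pd.DataFrame({
    "participant": np.repeat(np.arange(n_subj), n_trials),
    "n_premises": rng.integers(2, 6, n_subj * n_trials),  # premises per problem
    "multidim": rng.integers(0, 2, n_subj * n_trials),    # multidimensional relations?
})
# Simulated accuracy that degrades with more premises and multidimensionality.
p = 0.95 - 0.05 * df["n_premises"] - 0.15 * df["multidim"]
df["correct"] = rng.binomial(1, p.clip(0.05, 0.95))

# Fixed effects for the stimulus properties; random intercept per participant.
model = smf.mixedlm("correct ~ n_premises + multidim", df, groups=df["participant"])
print(model.fit().summary())
```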
Demonstrating veracity of videos is a longstanding problem that has recently become more urgent and acute. It is extremely hard to accurately detect manipulated videos using content analysis, especially in the face of subtle, yet effective, manipulations, such as frame rate changes or skin tone adjustments. In this paper, we present Vronicle, a method for generating provenance information for videos captured by mobile devices and using that information to verify authenticity of videos. A key feature of Vronicle is the use of Trusted Execution Environments (TEEs) for video capture and post-processing. This aids in constructing fine-grained provenance information that allows the consumer to verify various aspects of the video, thereby defeating numerous fake-video creation methods. Another important feature is the use of fixed-function post-processing units that facilitate verification of provenance information. These units can be deployed in any TEE, either in the mobile device that captures the video or in powerful servers. We present a prototype of Vronicle, which uses ARM TrustZone and Intel SGX for on-device and server-side post-processing, respectively. Moreover, we introduce two methods (and prototype the latter) for secure video capture on mobile devices: one using ARM TrustZone, and another using Google SafetyNet, providing a trade-off between security and immediate deployment. Our evaluation demonstrates that: (1) Vronicle's performance is well-suited for non-real-time use-cases, and (2) offloading post-processing significantly improves Vronicle's performance, matching that of uploading videos to YouTube.
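The abstract does not detail the record format; below is a minimal sketch of the chaining idea behind such provenance, under stated assumptions: each post-processing step appends a record binding the output video's hash and the operation to the previous record. The real system signs inside ARM TrustZone / Intel SGX with attested keys; a symmetric HMAC is used here only to keep the sketch self-contained, and all names are hypothetical.

```python
import hashlib, hmac, json

def append_record(chain, video_bytes, operation, key):
    """Append a provenance record for one post-processing step.

    Each entry binds the output video's hash and the operation to the
    previous record's tag, so dropping or reordering a step breaks
    verification.
    """
    prev = chain[-1]["mac"] if chain else "genesis"
    body = {"prev": prev, "op": operation,
            "video_sha256": hashlib.sha256(video_bytes).hexdigest()}
    msg = json.dumps(body, sort_keys=True).encode()
    return chain + [{**body, "mac": hmac.new(key, msg, hashlib.sha256).hexdigest()}]

def verify_chain(chain, key):
    """Replay the chain and check every link and tag."""
    prev = "genesis"
    for rec in chain:
        body = {"prev": rec["prev"], "op": rec["op"],
                "video_sha256": rec["video_sha256"]}
        msg = json.dumps(body, sort_keys=True).encode()
        if rec["prev"] != prev or not hmac.compare_digest(
                rec["mac"], hmac.new(key, msg, hashlib.sha256).hexdigest()):
            return False
        prev = rec["mac"]
    return True

# Usage: chain = append_record([], raw_frames, "capture", b"device-key")
#        chain = append_record(chain, blurred, "blur-faces", b"device-key")
#        assert verify_chain(chain, b"device-key")
```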
We address the problem of human action classification in drone videos. Due to the high cost of capturing and labeling large-scale drone videos with diverse actions, we present unsupervised and semi-supervised domain adaptation approaches that leverage both the existing fully annotated action recognition datasets and unannotated (or only a few annotated) videos from drones. To study the emerging problem of drone-based action recognition, we create a new dataset, NEC-DRONE, containing 5,250 videos to evaluate the task. We tackle both problem settings with 1) same and 2) different action label sets for the source (e.g., Kinetics dataset) and target domains (drone videos). We present a combination of video and instance-based adaptation methods, paired with either a classifier or an embedding-based framework to transfer the knowledge from source to target. Our results show that the proposed adaptation approach substantially improves the performance on these challenging and practical tasks. We further demonstrate the applicability of our method for learning cross-view action recognition on the Charades-Ego dataset. We provide qualitative analysis to understand the behaviors of our approaches.
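The abstract does not specify the adaptation objective; below is a minimal sketch of one standard ingredient of such approaches, adversarial feature alignment via a gradient reversal layer. This is not necessarily the paper's exact method, and the module names and dimensions are illustrative.

```python
import torch
from torch import nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_out):
        return -ctx.lam * grad_out, None

class DomainDiscriminator(nn.Module):
    """Predicts source vs. target domain from shared video features."""
    def __init__(self, feat_dim=512, lam=1.0):
        super().__init__()
        self.lam = lam
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feats):
        # The reversed gradient pushes the upstream encoder toward
        # domain-invariant features while the discriminator itself
        # learns to tell source (e.g., Kinetics) from target (drone) clips.
        return self.net(GradReverse.apply(feats, self.lam))
```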