

Title: Learning 2D Temporal Adjacent Networks for Moment Localization with Natural Language
We address the problem of retrieving a specific moment from an untrimmed video given a query sentence. This is a challenging problem because a target moment may take place in relation to other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well, since they consider temporal moments individually and neglect their temporal dependencies. In this paper, we model the temporal relations between video moments with a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates its ending time. This 2D temporal map covers diverse video moments of different lengths while representing their adjacent relations. Based on the 2D map, we propose the 2D Temporal Adjacent Network (2D-TAN), a single-shot framework for moment localization. It encodes adjacent temporal relations while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed 2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where 2D-TAN outperforms the state of the art.
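As a rough illustration of the 2D temporal map described above, the sketch below (not the authors' implementation) builds a map whose entry (i, j) holds a feature for the moment starting at clip i and ending at clip j; the max-pooling of clip features and the clip count are illustrative assumptions.

```python
# Minimal sketch of a 2D temporal moment map: cell (start, end) represents
# the moment spanning clips start..end.  Pooling choice is an assumption.
import numpy as np

def build_2d_moment_map(clip_feats: np.ndarray) -> np.ndarray:
    """clip_feats: (num_clips, feat_dim) features of uniformly sampled clips."""
    num_clips, feat_dim = clip_feats.shape
    moment_map = np.zeros((num_clips, num_clips, feat_dim), dtype=clip_feats.dtype)
    for start in range(num_clips):
        for end in range(start, num_clips):
            # Only the upper triangle (start <= end) encodes valid moments;
            # neighboring cells correspond to temporally adjacent moments.
            moment_map[start, end] = clip_feats[start:end + 1].max(axis=0)
    return moment_map

# Usage: a matching head could then score every (start, end) cell against the
# encoded query sentence in a single shot.
feats = np.random.rand(16, 512).astype(np.float32)  # 16 clips, 512-d features
print(build_2d_moment_map(feats).shape)             # (16, 16, 512)
```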
Award ID(s):
1813709 1722847
NSF-PAR ID:
10168539
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
ISSN:
2159-5399
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The state of the art in fully-supervised temporal action localization from untrimmed videos has achieved impressive results. It remains unsatisfactory, however, for weakly-supervised temporal action localization, where only video-level action labels are given without timestamp annotations of when the actions occur. The main reason is that weakly-supervised networks focus only on highly discriminative frames, while ambiguous frames exist in both the background and action classes. Ambiguous frames in the background class are very similar to real actions and may be treated as target actions, resulting in false positives. On the other hand, ambiguous frames in the action class, which may contain action instances, are prone to be false negatives for weakly-supervised networks, resulting in coarse localization. To solve these problems, we introduce a novel weakly-supervised Action Completeness Modeling with Background Aware Networks (ACM-BANets). Our Background Aware Network (BANet) contains a weight-sharing two-branch architecture, with an action-guided Background-aware Temporal Attention Module (B-TAM) and an asymmetrical training strategy, to suppress both highly discriminative and ambiguous background frames and remove false positives. Our action completeness modeling contains multiple BANets, which are forced to discover different but complementary action instances so as to completely localize the action instances in both highly discriminative and ambiguous action frames. In the i-th iteration, the i-th BANet discovers discriminative features, which are then erased from the feature map. The partially erased feature map is fed into the (i+1)-th BANet of the next iteration, forcing it to discover discriminative features different from those found by the i-th BANet. Evaluated on two challenging untrimmed video datasets, THUMOS14 and ActivityNet1.3, our approach outperforms all current weakly-supervised methods for temporal action localization.
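The erase-and-rediscover loop described in this abstract can be sketched roughly as follows (under stated assumptions, not the paper's code): each hypothetical BANet scores frames, its most discriminative frames are erased from the feature map, and the partially erased features are passed to the next BANet so it must find complementary action evidence. The erase ratio and the dummy scoring function are placeholders.

```python
# Rough sketch of iterative feature erasing across multiple branches.
import numpy as np

def iterative_erasing(frame_feats, banets, erase_ratio=0.1):
    """frame_feats: (num_frames, feat_dim); banets: list of callables mapping
    features to per-frame action scores in [0, 1]."""
    feats = frame_feats.copy()
    all_scores = []
    for banet in banets:
        scores = banet(feats)                       # per-frame discriminativeness
        all_scores.append(scores)
        k = max(1, int(len(scores) * erase_ratio))  # frames to erase this round
        top_frames = np.argsort(scores)[-k:]
        feats[top_frames] = 0.0                     # erase what this branch found
    return np.stack(all_scores)                     # complementary detections

# Usage with a dummy scoring function standing in for a trained BANet.
dummy_banet = lambda f: f.mean(axis=1) / (f.mean(axis=1).max() + 1e-8)
detections = iterative_erasing(np.random.rand(100, 64), [dummy_banet] * 3)
print(detections.shape)  # (3, 100)
```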
  2. Video anomaly detection (VAD), commonly formulated as a multiple-instance learning problem and tackled in a weakly-supervised manner due to its labor-intensive annotation, is a challenging problem in video surveillance where anomalous frames must be localized in an untrimmed video. In this paper, we first propose to utilize ViT-encoded visual features from CLIP, in contrast to the conventional C3D or I3D features used in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). An ablation study confirms the effectiveness of the TSA module and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms existing state-of-the-art (SOTA) methods by a large margin on three commonly used VAD benchmark datasets (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
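A minimal, illustrative sketch of temporal self-attention over snippet features follows; the exact TSA design is in the linked repository, so the projection sizes and the softmax-based snippet nomination below are assumptions.

```python
# Scaled dot-product self-attention over a sequence of snippet features,
# used here to mix temporal context and rank snippets by received attention.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(snippet_feats, w_q, w_k, w_v):
    """snippet_feats: (num_snippets, d); w_q/w_k/w_v: (d, d_k) projections."""
    q, k, v = snippet_feats @ w_q, snippet_feats @ w_k, snippet_feats @ w_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (num_snippets, num_snippets)
    context = attn @ v                              # temporally mixed features
    saliency = attn.sum(axis=0)                     # how much attention each snippet receives
    return context, saliency

d, d_k = 512, 64
rng = np.random.default_rng(0)
feats = rng.standard_normal((32, d))                # 32 snippet features
ctx, sal = temporal_self_attention(feats, *(rng.standard_normal((d, d_k)) for _ in range(3)))
print(ctx.shape, sal.shape)                         # (32, 64) (32,)
```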
  3. Geysers are rare geologic features that intermittently discharge liquid water and steam driven by heating and decompression boiling. The causes of variability in eruptive styles and the associated seismic signals are not well understood. Data collected from five broadband seismometers at Lone Star Geyser, Yellowstone National Park, are used to determine the properties, location, and temporal patterns of hydrothermal tremor. The tremor is harmonic at some stages of the eruption cycle and is caused by near-periodic repetition of discrete seismic events. Using the polarization of ground motion, we identify the location of tremor sources throughout several eruption cycles. During preplay episodes (smaller eruptions preceding the more vigorous major eruption), tremor occurs at depths of 7–10 m and is laterally offset from the geyser's cone by ~5 m. At the onset of the main eruption, tremor sources migrate laterally and become shallower. As the eruption progresses, tremor sources migrate along the same path but in the opposite direction, ending where the preplay tremor originates. The upward and then downward migration of tremor sources during eruptions is consistent with warming of the conduit followed by evacuation of water during the main eruption. We identify systematic relations among the two types of preplays, discharge, and the main eruption. A point-source moment tensor fit to low-frequency waveforms of an individual tremor event, using half-space velocity models, indicates an average shear-wave velocity V_S of ~0.8 km/s, source depths of ~4–20 m, and moment tensors with primarily positive isotropic and compensated linear vector dipole moments.
  4. Wang, N. ; Rebolledo-Mendez, G. ; Matsuda, N. ; Santos, O.C. ; Dimitrova, V. (Ed.)
    Research indicates that teachers play an active and important role in classrooms with AI tutors. Yet our scientific understanding of how teacher practices around AI tutors mediate student learning is far from complete. In this paper, we investigate spatiotemporal factors of student-teacher interactions by analyzing student engagement and learning with an AI tutor ahead of teacher visits (defined as episodes of a teacher being in close physical proximity to a student) and immediately following teacher visits. To conduct such integrated, temporal analysis around the moments when teachers visit students, we collect fine-grained, time-synchronized data on teacher positions in the physical classroom and on student interactions with the AI tutor. Our case study in a K-12 math classroom with a veteran math teacher provides some indication of the factors that might affect a teacher's decision to allocate their limited classroom time to particular students, and of the effects these interactions have on students. For instance, teacher visits were associated more with students' in-the-moment behavioral indicators (e.g., idleness) than with broader, static measures of student need such as low prior knowledge. While teacher visits were often associated with positive changes in student behavior afterward (e.g., decreased idleness), there could be a mismatch between the students visited by the teacher and those who may have needed a visit more at that time (e.g., students who had been disengaged for much longer). Overall, our findings indicate that teacher visits may yield immediate benefits for students, but also that it is challenging for teachers to meet all needs, suggesting the need for better tool support.
  5. Video Paragraph Captioning aims to generate a multi-sentence description of an untrimmed video containing multiple temporal event locations, in a coherent storytelling manner. Following the human perception process, in which a scene is effectively understood by decomposing it into visual components (e.g., humans, animals) and non-visual components (e.g., actions, relations) under the mutual influence of vision and language, we first propose a visual-linguistic (VL) feature. In the proposed VL feature, the scene is modeled by three modalities: (i) a global visual environment; (ii) local visual main agents; and (iii) linguistic scene elements. We then introduce an autoregressive Transformer-in-Transformer (TinT) to simultaneously capture the semantic coherence of intra- and inter-event contents within a video. Finally, we present a new VL contrastive loss function to guarantee that the learned embedding features are consistent with the caption semantics. Comprehensive experiments and extensive ablation studies on the ActivityNet Captions and YouCookII datasets show that the proposed Visual-Linguistic Transformer-in-Transformer (VLTinT) outperforms previous state-of-the-art methods in terms of accuracy and diversity. The source code is made publicly available at https://github.com/UARK-AICV/VLTinT.
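The sketch below shows a contrastive alignment loss in the spirit of the VL contrastive objective mentioned above; the paper's exact formulation is in the linked repository, so the InfoNCE-style form, the temperature, and the function name are assumptions.

```python
# Hedged sketch: pull matched event/caption embeddings together and push
# mismatched pairs apart via an InfoNCE-style cross-entropy on similarities.
import numpy as np

def vl_contrastive_loss(event_embs, caption_embs, temperature=0.07):
    """event_embs, caption_embs: (batch, dim); matched pairs share an index."""
    e = event_embs / np.linalg.norm(event_embs, axis=1, keepdims=True)
    c = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    logits = (e @ c.T) / temperature             # pairwise cosine similarities
    labels = np.arange(len(logits))              # the diagonal holds true pairs
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[labels, labels].mean()     # cross-entropy on matched pairs

rng = np.random.default_rng(1)
loss = vl_contrastive_loss(rng.standard_normal((8, 256)), rng.standard_normal((8, 256)))
print(float(loss))
```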