


Title: CLIP-TSA: Clip-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
Video anomaly detection (VAD) – commonly formulated as a multiple-instance learning problem in a weakly supervised manner due to its labor-intensive nature – is a challenging problem in video surveillance, where the anomalous frames need to be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations in our novel technique. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). The ablation study confirms the effectiveness of TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used benchmark datasets in the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
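Below is a minimal sketch of the two ingredients the abstract highlights: CLIP ViT snippet features fed to a temporal self-attention block whose per-snippet scores are trained with a standard top-k multiple-instance learning objective. This is not the authors' released implementation; the module and hyperparameter names (SnippetTSA, topk_mil_loss, k=3, dim=512) are illustrative assumptions.

```python
# Sketch only: temporal self-attention over CLIP snippet features plus a
# top-k MIL ranking objective, as commonly used in weakly supervised VAD.
import torch
import torch.nn as nn

class SnippetTSA(nn.Module):
    """Self-attention over the temporal axis of CLIP snippet features."""
    def __init__(self, dim=512, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.scorer = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(),
                                    nn.Linear(128, 1), nn.Sigmoid())

    def forward(self, feats):                 # feats: (batch, T, dim) CLIP features
        ctx, _ = self.attn(feats, feats, feats)
        ctx = self.norm(feats + ctx)          # residual temporal context
        return self.scorer(ctx).squeeze(-1)   # per-snippet anomaly scores, (batch, T)

def topk_mil_loss(scores_anom, scores_norm, k=3):
    """Top-k MIL objective: using only video-level labels, push the k highest
    snippet scores of an anomalous video toward 1 and of a normal video toward 0."""
    pos = scores_anom.topk(k, dim=1).values.mean(dim=1)
    neg = scores_norm.topk(k, dim=1).values.mean(dim=1)
    bce = nn.functional.binary_cross_entropy
    return bce(pos, torch.ones_like(pos)) + bce(neg, torch.zeros_like(neg))
```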
Award ID(s):
2223793 1946391
NSF-PAR ID:
10466818
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
ISBN:
978-1-7281-9835-4
Page Range / eLocation ID:
3230 to 3234
Format(s):
Medium: X
Location:
Kuala Lumpur, Malaysia
Sponsoring Org:
National Science Foundation
More Like this
  1. Acharya, Bipul R. (Ed.)
    Proper regulation of microtubule (MT) dynamics is critical for cellular processes including cell division and intracellular transport. Plus-end tracking proteins (+TIPs) dynamically track growing MTs and play a key role in MT regulation. +TIPs participate in a complex web of intra- and inter-molecular interactions known as the +TIP network. Hypotheses addressing the purpose of +TIP:+TIP interactions include relieving +TIP autoinhibition and localizing MT regulators to growing MT ends. In addition, we have proposed that the web of +TIP:+TIP interactions has a physical purpose: creating a dynamic scaffold that constrains the structural fluctuations of the fragile MT tip and thus acts as a polymerization chaperone. Here we examine the possibility that this proposed scaffold is a biomolecular condensate (i.e., liquid droplet). Many animal +TIP network proteins are multivalent and have intrinsically disordered regions, features commonly found in biomolecular condensates. Moreover, previous studies have shown that overexpression of the +TIP CLIP-170 induces large “patch” structures containing CLIP-170 and other +TIPs; we hypothesized that these structures might be biomolecular condensates. To test this hypothesis, we used video microscopy, immunofluorescence staining, and Fluorescence Recovery After Photobleaching (FRAP). Our data show that the CLIP-170-induced patches have hallmarks indicative of a biomolecular condensate, one that contains +TIP proteins and excludes other known condensate markers. Moreover, bioinformatic studies demonstrate that the presence of intrinsically disordered regions is conserved in key +TIPs, implying that these regions are functionally significant. Together, these results indicate that the CLIP-170-induced patches in cells are phase-separated liquid condensates and raise the possibility that the endogenous +TIP network might form a liquid droplet at MT ends or other +TIP locations.
    We consider the ability of CLIP features to support text-driven image retrieval. Traditional image-based queries sometimes misalign with user intentions due to their focus on irrelevant image components. To overcome this, we explore the potential of text-based image retrieval, specifically using Contrastive Language-Image Pretraining (CLIP) models. CLIP models, trained on large datasets of image-caption pairs, offer a promising approach by allowing natural language descriptions for more targeted queries. We explore the effectiveness of text-driven image retrieval based on CLIP features by evaluating the image similarity for progressively more detailed queries. We find that there is a sweet spot of detail in the text that gives the best results, and find that words describing the “tone” of a scene (such as messy, dingy) are quite important in maximizing text-image similarity.
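As a minimal sketch of the text-driven retrieval setup in item 2, the snippet below embeds candidate images and a natural-language query with a CLIP model and ranks the images by query-image similarity. The Hugging Face checkpoint and the helper name rank_images are illustrative choices, not the paper's code.

```python
# Rank a folder of images against a natural-language query with CLIP.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_images(query: str, image_paths):
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[query], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # logits_per_text: similarity of the query to every image; higher = better match
    sims = out.logits_per_text.squeeze(0)
    order = sims.argsort(descending=True)
    return [(image_paths[i], sims[i].item()) for i in order]

# Progressively more detailed queries, as the study varies:
# rank_images("a kitchen", paths)  vs.  rank_images("a messy, dingy kitchen at night", paths)
```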
  3. Ruiz, Francisco; Dy, Jennifer; van de Meent, Jan-Willem (Eds.)
    We propose CLIP-Lite, an information-efficient method for visual representation learning by feature alignment with textual annotations. Compared to the previously proposed CLIP model, CLIP-Lite requires only one negative image-text sample pair for every positive image-text sample during the optimization of its contrastive learning objective. We accomplish this by taking advantage of an information-efficient lower bound to maximize the mutual information between the two input modalities. This allows CLIP-Lite to be trained with significantly reduced amounts of data and batch sizes while obtaining better performance than CLIP at the same scale. We evaluate CLIP-Lite by pretraining on the COCO-Captions dataset and testing transfer learning to other datasets. CLIP-Lite obtains a +14.0 mAP absolute gain in performance on Pascal VOC classification, and a +22.1 top-1 accuracy gain on ImageNet, while being comparable or superior to other, more complex, text-supervised models. CLIP-Lite is also superior to CLIP on image and text retrieval, zero-shot classification, and visual grounding. Finally, we show that CLIP-Lite can leverage language semantics to encourage bias-free visual representations that can be used in downstream tasks. Implementation: https://github.com/4m4n5/CLIP-Lite
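The sketch below illustrates the kind of objective item 3 describes: a Jensen-Shannon-style lower bound on image-text mutual information (in the spirit of Deep InfoMax) that needs only one negative image-text pair per positive, instead of CLIP's large in-batch negative sets. The simple dot-product critic and the function names are illustrative assumptions rather than CLIP-Lite's exact formulation.

```python
# JSD-based mutual-information lower bound with a single negative pair.
import torch
import torch.nn.functional as F

def jsd_mi_lower_bound(img_emb, txt_emb_pos, txt_emb_neg):
    """img_emb, txt_emb_*: (batch, dim) L2-normalized embeddings.
    Maximizing this quantity tightens a lower bound on I(image; text)."""
    score_pos = (img_emb * txt_emb_pos).sum(dim=-1)   # paired (positive) score
    score_neg = (img_emb * txt_emb_neg).sum(dim=-1)   # one mismatched (negative) score
    # E_pos[-softplus(-T)] - E_neg[softplus(T)] is the Jensen-Shannon bound
    return (-F.softplus(-score_pos)).mean() - F.softplus(score_neg).mean()

def contrastive_loss(img_emb, txt_emb_pos, txt_emb_neg):
    # Minimize the negative of the bound during training.
    return -jsd_mi_lower_bound(img_emb, txt_emb_pos, txt_emb_neg)
```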
    We address the problem of retrieving a specific moment from an untrimmed video by a query sentence. This is a challenging problem because a target moment may take place in relation to other temporal moments in the untrimmed video. Existing methods cannot tackle this challenge well since they consider temporal moments individually and neglect the temporal dependencies. In this paper, we model the temporal relations between video moments by a two-dimensional map, where one dimension indicates the starting time of a moment and the other indicates the end time. This 2D temporal map can cover diverse video moments with different lengths, while representing their adjacent relations. Based on the 2D map, we propose a Temporal Adjacent Network (2D-TAN), a single-shot framework for moment localization. It is capable of encoding the adjacent temporal relation, while learning discriminative features for matching video moments with referring expressions. We evaluate the proposed 2D-TAN on three challenging benchmarks, i.e., Charades-STA, ActivityNet Captions, and TACoS, where our 2D-TAN outperforms the state-of-the-art.
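A small sketch of the 2D temporal map from item 4: cell (i, j) of the map stores a pooled feature for the candidate moment starting at clip i and ending at clip j, so moments of every length, together with their adjacency, live on a single grid that can then be scored against the query sentence. Max-pooling is used here as one simple pooling choice; this is not the authors' implementation.

```python
# Build a (T, T, dim) map of candidate-moment features from per-clip features.
import torch

def build_2d_map(clip_feats):                    # clip_feats: (T, dim) per-clip features
    T, dim = clip_feats.shape
    tmap = torch.zeros(T, T, dim)
    for i in range(T):
        pooled = clip_feats[i]
        tmap[i, i] = pooled
        for j in range(i + 1, T):
            pooled = torch.maximum(pooled, clip_feats[j])   # max-pool over clips [i, j]
            tmap[i, j] = pooled
    return tmap   # upper triangle (j >= i) enumerates all candidate moments
```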
    The use of video-imaging data for in-line process monitoring applications has become popular in industry. In this framework, spatio-temporal statistical process monitoring methods are needed to capture the relevant information content and signal possible out-of-control states. Video-imaging data are characterized by a spatio-temporal variability structure that depends on the underlying phenomenon, and typical out-of-control patterns are related to events that are localized both in time and space. In this article, we propose an integrated spatio-temporal decomposition and regression approach for anomaly detection in video-imaging data. Out-of-control events are typically sparse, spatially clustered and temporally consistent. The goal is not only to detect the anomaly as quickly as possible (“when”) but also to locate it in space (“where”). The proposed approach works by decomposing the original spatio-temporal data into random natural events, sparse, spatially clustered, and temporally consistent anomalous events, and random noise. Recursive estimation procedures for spatio-temporal regression are presented to enable the real-time implementation of the proposed methodology. Finally, a likelihood ratio test procedure is proposed to detect when and where the anomaly happens. The proposed approach was applied to the analysis of high-speed video-imaging data to detect and locate local hot-spots during a metal additive manufacturing process.
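To make the decompose-then-test idea in item 5 concrete, here is a heavily simplified sketch, not the article's estimator: each pixel's in-control behaviour is tracked with a recursive, EWMA-style update, and a Gaussian likelihood-ratio statistic on the residuals flags where (the spatial mask) and when (the first frame with a large flagged cluster) a hot-spot appears. The smoothing constant and threshold are illustrative assumptions.

```python
# Simplified decompose-and-test anomaly detection for video-imaging data.
import numpy as np

def detect_hotspots(frames, alpha=0.05, lr_threshold=25.0):
    """frames: (T, H, W) array. Returns one boolean anomaly mask per frame t >= 1."""
    background = frames[0].astype(float)          # in-control spatio-temporal pattern
    noise_var = np.full(frames.shape[1:], 1e-3)   # per-pixel noise variance estimate
    masks = []
    for t in range(1, frames.shape[0]):
        resid = frames[t] - background            # remove the in-control component
        # Gaussian log likelihood ratio of "shifted mean" vs "in control", per pixel
        llr = resid**2 / (2.0 * noise_var)
        masks.append(llr > lr_threshold)          # "where": spatial anomaly map
        ok = ~masks[-1]                           # update statistics with unflagged pixels only
        background[ok] += alpha * resid[ok]
        noise_var[ok] = (1 - alpha) * noise_var[ok] + alpha * resid[ok]**2
    return masks   # "when": first frame whose mask contains a large connected cluster
```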