

This content will become publicly available on October 8, 2024

Title: CLIP-TSA: Clip-Assisted Temporal Self-Attention for Weakly-Supervised Video Anomaly Detection
Video anomaly detection (VAD), commonly formulated as a multiple-instance learning problem in a weakly-supervised manner due to the labor-intensive nature of frame-level annotation, is a challenging problem in video surveillance in which the anomalous frames must be localized in an untrimmed video. In this paper, we first propose to utilize the ViT-encoded visual features from CLIP, in contrast with the conventional C3D or I3D features in the domain, to efficiently extract discriminative representations. We then model temporal dependencies and nominate the snippets of interest by leveraging our proposed Temporal Self-Attention (TSA). An ablation study confirms the effectiveness of both TSA and the ViT features. Extensive experiments show that our proposed CLIP-TSA outperforms the existing state-of-the-art (SOTA) methods by a large margin on three commonly used benchmark datasets for the VAD problem (UCF-Crime, ShanghaiTech Campus, and XD-Violence). Our source code is available at https://github.com/joos2010kj/CLIP-TSA.
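The abstract names two ingredients: a temporal self-attention module over per-snippet features and a multiple-instance learning (MIL) readout over snippet scores. As a rough illustrative sketch only (the paper's actual architecture is not given here; the single-head attention, the top-k readout, and all names below are assumptions, with top-k mean scoring being the common weakly-supervised VAD recipe):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_self_attention(feats):
    """Single-head scaled dot-product self-attention over T snippet features.

    feats: (T, D) per-snippet visual features (e.g. CLIP embeddings).
    Returns (T, D) temporally contextualized features.
    """
    T, D = feats.shape
    attn = softmax(feats @ feats.T / np.sqrt(D), axis=-1)  # (T, T) weights
    return attn @ feats

def mil_video_score(snippet_scores, k=3):
    """MIL readout: mean of the top-k per-snippet anomaly scores."""
    top_k = np.sort(snippet_scores)[-k:]
    return top_k.mean()

rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))             # 16 snippets, toy 8-dim features
ctx = temporal_self_attention(feats)         # contextualized snippet features
scores = 1 / (1 + np.exp(-ctx @ rng.normal(size=8)))  # toy snippet scores in (0, 1)
video_score = mil_video_score(scores, k=3)   # weak (video-level) prediction
```

In training, the video-level score would be compared against the video-level label, which is the only supervision available in the weakly-supervised setting.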
Award ID(s):
2223793 1946391
NSF-PAR ID:
10466818
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
3230 to 3234
Format(s):
Medium: X
Location:
Kuala Lumpur, Malaysia
Sponsoring Org:
National Science Foundation
More Like this
  1. Acharya, Bipul R. (Ed.)
    Proper regulation of microtubule (MT) dynamics is critical for cellular processes including cell division and intracellular transport. Plus-end tracking proteins (+TIPs) dynamically track growing MTs and play a key role in MT regulation. +TIPs participate in a complex web of intra- and inter-molecular interactions known as the +TIP network. Hypotheses addressing the purpose of +TIP:+TIP interactions include relieving +TIP autoinhibition and localizing MT regulators to growing MT ends. In addition, we have proposed that the web of +TIP:+TIP interactions has a physical purpose: creating a dynamic scaffold that constrains the structural fluctuations of the fragile MT tip and thus acts as a polymerization chaperone. Here we examine the possibility that this proposed scaffold is a biomolecular condensate (i.e., a liquid droplet). Many animal +TIP network proteins are multivalent and have intrinsically disordered regions, features commonly found in biomolecular condensates. Moreover, previous studies have shown that overexpression of the +TIP CLIP-170 induces large “patch” structures containing CLIP-170 and other +TIPs; we hypothesized that these structures might be biomolecular condensates. To test this hypothesis, we used video microscopy, immunofluorescence staining, and fluorescence recovery after photobleaching (FRAP). Our data show that the CLIP-170-induced patches have hallmarks indicative of a biomolecular condensate, one that contains +TIP proteins and excludes other known condensate markers. Moreover, bioinformatic studies demonstrate that the presence of intrinsically disordered regions is conserved in key +TIPs, implying that these regions are functionally significant. Together, these results indicate that the CLIP-170-induced patches in cells are phase-separated liquid condensates and raise the possibility that the endogenous +TIP network might form a liquid droplet at MT ends or other +TIP locations.
    The unicellular green alga Chlamydomonas reinhardtii is an excellent model organism to investigate many essential cellular processes in photosynthetic eukaryotes. Two commonly used background strains of Chlamydomonas are CC-1690 and CC-5325. CC-1690, also called 21gr, has been used for the Chlamydomonas genome project and several transcriptome analyses. CC-5325 is the background strain for the Chlamydomonas Library Project (CLiP). Photosynthetic performance in CC-5325 has not been evaluated in comparison with CC-1690. Additionally, CC-5325 is often considered to be cell-wall deficient, although detailed analysis is missing. The circadian rhythms in CC-5325 are also unclear. To fill these knowledge gaps and facilitate the use of the CLiP mutant library for various screens, we performed phenotypic comparisons between CC-1690 and CC-5325. Our results showed that CC-5325 grew faster heterotrophically in the dark and equally well in mixotrophic liquid medium as compared to CC-1690. CC-5325 had lower photosynthetic efficiency and was more heat-sensitive than CC-1690. Furthermore, CC-5325 had an intact cell wall with comparable integrity to that of CC-1690 but apparently reduced thickness. Additionally, CC-5325 could perform phototaxis, but could not maintain a sustained circadian rhythm of phototaxis as CC-1690 did. Finally, in comparison to CC-1690, CC-5325 had longer cilia in medium with acetate but slower swimming speed in medium without nitrogen and acetate. Our results will be useful for researchers in the Chlamydomonas community to choose suitable background strains for mutant analysis and to employ the CLiP mutant library for genome-wide mutant screens under appropriate conditions, especially in the areas of photosynthesis, thermotolerance, cell wall, and circadian rhythms.

     
  3. Avidan, S. (Ed.)
    Despite the success of fully-supervised human skeleton sequence modeling, utilizing self-supervised pre-training for skeleton sequence representation learning has been an active field because acquiring task-specific skeleton annotations at large scales is difficult. Recent studies focus on learning video-level temporal and discriminative information using contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Different from such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) to explicitly capture spatial, short-term, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model in the pre-training stage has strong transfer capability across different downstream tasks.
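The frame/clip/video hierarchy described in the Hi-TRS abstract can be pictured as successive levels of aggregation. A minimal sketch, assuming mean pooling stands in for the per-level Transformer encoders (the function name and `clip_len` parameter are illustrative, not the paper's API):

```python
import numpy as np

def hierarchical_encode(frames, clip_len=4):
    """Toy frame -> clip -> video aggregation in the spirit of Hi-TRS.

    frames: (T, D) per-frame skeleton features; T must be divisible by clip_len.
    Returns (clip_feats, video_feat) of shapes (T // clip_len, D) and (D,).
    In Hi-TRS each level is a Transformer; mean pooling is a stand-in here.
    """
    T, D = frames.shape
    clips = frames.reshape(T // clip_len, clip_len, D)
    clip_feats = clips.mean(axis=1)       # short-term (clip-level) summaries
    video_feat = clip_feats.mean(axis=0)  # long-term (video-level) summary
    return clip_feats, video_feat

frames = np.arange(32, dtype=float).reshape(16, 2)  # 16 frames, 2-dim features
clip_feats, video_feat = hierarchical_encode(frames, clip_len=4)
```

The point of the hierarchy is that pre-training objectives can be attached at each level (spatial at the frame level, short-term temporal at the clip level, long-term temporal at the video level) rather than only at the video level.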
  4. Abstract

    One of the key features of intrinsically disordered regions (IDRs) is facilitation of protein–protein and protein–nucleic acid interactions. These disordered binding regions include molecular recognition features (MoRFs), short linear motifs (SLiMs), and longer binding domains. The vast majority of current predictors of disordered binding regions target MoRFs, with a handful of methods that predict SLiMs and disordered protein-binding domains. A new and broader class of disordered binding regions, linear interacting peptides (LIPs), was introduced recently and applied in the MobiDB resource. LIPs are segments in protein sequences that undergo disorder-to-order transition upon binding to a protein or a nucleic acid, and they cover MoRFs, SLiMs, and disordered protein-binding domains. Although current predictors of MoRFs and disordered protein-binding regions could be used to identify some LIPs, there are no dedicated sequence-based predictors of LIPs. To this end, we introduce CLIP, a new predictor of LIPs that utilizes a robust logistic regression model to combine three complementary types of inputs: co-evolutionary information derived from multiple sequence alignments, physicochemical profiles, and disorder predictions. Ablation analysis suggests that the co-evolutionary information is particularly useful for this prediction and that combining the three inputs provides substantial improvements when compared to using these inputs individually. Comparative empirical assessments using low-similarity test datasets reveal that CLIP secures an area under the receiver operating characteristic curve (AUC) of 0.8 and substantially improves over the results produced by the closest current tools that predict MoRFs and disordered protein-binding regions. The webserver of CLIP is freely available at http://biomine.cs.vcu.edu/servers/CLIP/ and the standalone code can be downloaded from http://yanglab.qd.sdu.edu.cn/download/CLIP/.

     
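The core modeling step the CLIP abstract describes — concatenating three per-residue input profiles and scoring them with logistic regression — can be sketched as follows. This is a minimal illustration under stated assumptions: the profile dimensions, function names, and random weights are all hypothetical, not CLIP's actual feature set or trained model.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def combine_profiles(coevo, physchem, disorder):
    """Concatenate the three per-residue input profiles along the feature axis.

    Each input is (L, k_i) for a protein of length L; shapes are illustrative.
    """
    return np.concatenate([coevo, physchem, disorder], axis=1)

def predict_lip_propensity(features, weights, bias):
    """Logistic-regression propensity in (0, 1), one score per residue."""
    return sigmoid(features @ weights + bias)

rng = np.random.default_rng(1)
L = 10                                           # toy protein length
X = combine_profiles(rng.normal(size=(L, 3)),    # co-evolutionary features
                     rng.normal(size=(L, 4)),    # physicochemical profile
                     rng.uniform(size=(L, 1)))   # disorder predictions
w = rng.normal(size=X.shape[1])                  # stand-in for trained weights
p = predict_lip_propensity(X, w, bias=0.0)
```

Residues with propensity above a tuned threshold would then be labeled as part of a predicted LIP segment.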
  5. We propose TubeR: a simple solution for spatio-temporal video action detection. Different from existing methods that depend on either an off-line actor detector or hand-designed actor-positional hypotheses like proposals or anchors, we propose to directly detect an action tubelet in a video by simultaneously performing action localization and recognition from a single representation. TubeR learns a set of tubelet queries and utilizes a tubelet-attention module to model the dynamic spatio-temporal nature of a video clip, which effectively reinforces the model capacity compared to using actor-positional hypotheses in the spatio-temporal space. For videos containing transitional states or scene changes, we propose a context-aware classification head that utilizes short-term and long-term context to strengthen action classification, and an action switch regression head for detecting the precise temporal action extent. TubeR directly produces action tubelets with variable lengths and maintains good results even for long video clips. TubeR outperforms the previous state-of-the-art on the commonly used action detection datasets AVA, UCF101-24, and JHMDB51-21. Code will be available on GluonCV (https://cv.gluon.ai/).
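The "set of tubelet queries" in the TubeR abstract follows the query-based detection idea: learned query vectors attend over the video's features to produce one representation per candidate tubelet. A minimal sketch, assuming single-head cross-attention (the function name and shapes are illustrative, not TubeR's actual module):

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tubelet_cross_attention(queries, video_feats):
    """Tubelet queries attend over per-frame features.

    queries: (Q, D) learned tubelet queries; video_feats: (T, D) frame features.
    Returns (Q, D), one representation per candidate tubelet, from which
    localization and classification heads would both read.
    """
    D = queries.shape[1]
    attn = softmax(queries @ video_feats.T / np.sqrt(D), axis=-1)  # (Q, T)
    return attn @ video_feats

rng = np.random.default_rng(2)
queries = rng.normal(size=(5, 8))   # 5 hypothetical tubelet queries
video = rng.normal(size=(20, 8))    # 20 frames of toy 8-dim features
tubelets = tubelet_cross_attention(queries, video)
```

Because each query aggregates evidence across all frames, a single tubelet representation can support both the per-frame box regression and the action classification that the abstract describes, without separate actor proposals or anchors.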