skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Deep Learning-based Action Detection in Untrimmed Videos: A Survey
Understanding human behavior and activity facilitates advancement of numerous real-world applications, and is critical for video analysis. Despite the progress of action recognition algorithms in trimmed videos, the majority of real-world videos are lengthy and untrimmed with sparse segments of interest. The task of temporal activity detection in untrimmed videos aims to localize the temporal boundary of actions and classify the action categories. Temporal activity detection task has been investigated in full and limited supervision settings depending on the availability of action annotations. This paper provides an extensive overview of deep learning-based algorithms to tackle temporal action detection in untrimmed videos with different supervision levels including fully-supervised, weakly-supervised, unsupervised, self-supervised, and semi-supervised. In addition, this paper reviews advances in spatio-temporal action detection where actions are localized in both temporal and spatial dimensions. Action detection in online setting is also reviewed where the goal is to detect actions in each frame without considering any future context in a live video stream. Moreover, the commonly used action detection benchmark datasets and evaluation metrics are described, and the performance of the state-of-the-art methods are compared. Finally, real-world applications of temporal action detection in untrimmed videos and a set of future directions are discussed.  more » « less
Award ID(s):
2041307
PAR ID:
10428813
Author(s) / Creator(s):
;
Date Published:
Journal Name:
IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume:
45
Issue:
4
ISSN:
0162-8828
Page Range / eLocation ID:
4302 - 4320
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    The state-of-the-art of fully-supervised methods for temporal action localization from untrimmed videos has achieved impressive results. Yet, it remains unsatisfactory for the weakly-supervised temporal action localization, where only video-level action labels are given without the timestamp annotation on when the actions occur. The main reason comes from that, the weakly-supervised networks only focus on the highly discriminative frames, but there are some ambiguous frames in both background and action classes. The ambiguous frames in background class are very similar to the real actions, which may be treated as target actions and result in false positives. On the other hand, the ambiguous frames in action class which possibly contain action instances, are prone to be false negatives by the weakly-supervised networks and result in a coarse localization. To solve these problems, we introduce a novel weakly-supervised Action Completeness Modeling with Back- ground Aware Networks (ACM-BANets). Our Background Aware Network (BANet) contains a weight-sharing two-branch architecture, with an action guided Background aware Temporal Attention Module (B-TAM) and an asymmetrical training strategy, to suppress both highly discriminative and ambiguous background frames to remove the false positives. Our action completeness modeling contains multiple BANets, and the BANets are forced to discover different but complementary action instances to completely localize the action instances in both highly discriminative and ambiguous action frames. In the 𝑖-th iteration, the 𝑖-th BANet discovers the discriminative features, which are then erased from the feature map. The partially-erased feature map is fed into the (i+1)-th BANet of the next iteration to force this BANet to discover discriminative features different from the 𝑖-th BANet. Evaluated on two challenging untrimmed video datasets, THUMOS14 and ActivityNet1.3, our approach outperforms all the current weakly-supervised methods for temporal action localization. 
    more » « less
  2. This paper introduces a novel transformer network tailored to skeleton-based action detection in untrimmed long video streams. Our approach centers around three innovative mechanisms that collectively enhance the network’s temporal analysis capabilities. First, a new predictive attention mechanism incorporates future frame data into the sequence analysis during the training phase. This mechanism addresses the essential issue of the current action detection models: incomplete temporal modeling in long action sequences, particularly for boundary frames that lie outside the network’s immediate temporal receptive field, while maintaining computational efficiency. Second, we integrate a new adaptive weighted temporal attention system that dynamically evaluates the importance of each frame within an action sequence. In contrast to the existing approaches, the proposed weighting strategy is both adaptive and interpretable, making it highly effective in handling long sequences with numerous non-informative frames. Third, the network incorporates an advanced regression technique. This approach independently identifies the start and end frames based on their relevance to different frames. Unlike existing homogeneous regression methods, the proposed regression method is heterogeneous and based on various temporal relationships, including those in future frames in actions, making it more effective for action detection. Extensive experiments on prominent untrimmed skeleton-based action datasets, PKU-MMD, OAD, and the Charade dataset demonstrate the effectiveness of this network. 
    more » « less
  3. This paper tackles the challenge of point-supervised temporal action detection, wherein only a single frame is annotated for each action instance in the training set. Most of the current methods, hindered by the sparse nature of annotated points, struggle to effectively represent the continuous structure of actions or the inherent temporal and semantic dependencies within action instances. Consequently, these methods frequently learn merely the most distinctive segments of actions, leading to the creation of incomplete action proposals. This paper proposes POTLoc, a Pseudo-label Oriented Transformer for weakly-supervised Action Localization utilizing only point-level annotation. POTLoc is designed to identify and track continuous action structures via a self-training strategy. The base model begins by generating action proposals solely with point-level supervision. These proposals undergo refinement and regression to enhance the precision of the estimated action boundaries, which subsequently results in the production of ‘pseudo-labels’ to serve as supplementary supervisory signals. The architecture of the model integrates a transformer with a temporal feature pyramid to capture video snippet dependencies and model actions of varying duration. The pseudo-labels, providing information about the coarse locations and boundaries of actions, assist in guiding the transformer for enhanced learning of action dynamics. POTLoc outperforms the state-of-the-art point-supervised methods on THUMOS’14 and ActivityNet-v1.2 datasets. 
    more » « less
  4. Lee, Kyoung Mu (Ed.)
    This paper presents a novel spatiotemporal transformer network that introduces several original components to detect actions in untrimmed videos. First, the multi-feature selective semantic attention model calculates the correlations between spatial and motion features to model the spatiotemporal interactions between different action semantics properly. Second, the motion-aware network encodes the locations of action semantics in video frames utilizing the motion-aware 2D positional encoding algorithm. Such a motion-aware mechanism memorizes the dynamic spatiotemporal variations in action frames that current methods cannot exploit. Third, the sequence-based temporal attention model captures the heterogeneous temporal dependencies in action frames. In contrast to standard temporal attention used in natural language processing, primarily aimed at finding similarities between linguistic words, the proposed sequence-based temporal attention is designed to determine both the differences and similarities between video frames that jointly define the meaning of actions. The proposed approach outperforms the state-of-the-art solutions on four spatiotemporal action datasets: AVA 2.2, AVA 2.1, UCF101-24, and EPIC-Kitchens. 
    more » « less
  5. Weakly-supervised Temporal Action Localization (WTAL) aims to classify and localize action instances in untrimmed videos with only video-level labels. Existing methods typically use snippet-level RGB and optical flow features extracted from pre-trained extractors directly. Because of two limitations: the short temporal span of snippets and the inappropriate initial features, these WTAL methods suffer from the lack of effective use of temporal information and have limited performance. In this paper, we propose the Temporal Feature Enhancement Dilated Convolution Network (TFE-DCN) to address these two limitations. The proposed TFE-DCN has an enlarged receptive field that covers a long temporal span to observe the full dynamics of action instances, which makes it powerful to capture temporal dependencies between snippets. Furthermore, we propose the Modality Enhancement Module that can enhance RGB features with the help of enhanced optical flow features, making the overall features appropriate for the WTAL task. Experiments conducted on THUMOS’14 and ActivityNet v1.3 datasets show that our proposed approach far outperforms state-of-the-art WTAL methods. 
    more » « less