Title: Video Object Detection with an Aligned Spatial-Temporal Memory
We introduce Spatial-Temporal Memory Networks for video object detection. At its core, a novel Spatial-Temporal Memory module (STMM) serves as the recurrent computation unit to model long-term temporal appearance and motion dynamics. The STMM's design enables full integration of pretrained backbone CNN weights, which we find to be critical for accurate detection. Furthermore, in order to tackle object motion in videos, we propose a novel MatchTrans module to align the spatial-temporal memory from frame to frame. Our method produces state-of-the-art results on the benchmark ImageNet VID dataset, and our ablative studies clearly demonstrate the contribution of our different design choices.
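The recurrent memory update described above can be sketched as a gated recurrence over per-frame features. The toy below is a dense, single-vector stand-in: the real STMM operates convolutionally on spatial feature maps and is designed so pretrained backbone weights transfer, with MatchTrans aligning the memory between frames. The sigmoid gates and all names and shapes here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def stmm_step(feat, mem, params):
    """One recurrent memory update, sketched per feature vector.

    Illustrative only: the paper's STMM is convolutional and uses a
    ReLU-compatible gating so that pretrained weights carry over;
    sigmoid gates are used here just to keep the toy self-contained.
    """
    Wz, Uz, Wr, Ur, W, U = params
    z = sigmoid(Wz @ feat + Uz @ mem)                  # update gate
    r = sigmoid(Wr @ feat + Ur @ mem)                  # reset gate
    cand = np.maximum(0.0, W @ feat + U @ (r * mem))   # candidate memory (ReLU)
    return (1.0 - z) * mem + z * cand                  # blended new memory

rng = np.random.default_rng(0)
d = 8
params = tuple(rng.standard_normal((d, d)) * 0.1 for _ in range(6))
mem = np.zeros(d)
for _ in range(5):                 # run the recurrence over 5 synthetic frames
    mem = stmm_step(rng.standard_normal(d), mem, params)
print(mem.shape)  # prints (8,)
```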
Award ID(s):
1751206
PAR ID:
10082237
Journal Name:
Proceedings of the European Conference on Computer Vision (ECCV)
Sponsoring Org:
National Science Foundation
More Like this
  1. Camera-based systems are increasingly used for collecting information on intersections and arterials. Unlike inductive loop detectors, which can generally be used only to detect vehicle presence and movement, cameras provide rich information about traffic behavior. Vision-based frameworks for multiple-object detection, object tracking, and near-miss detection have been developed to derive this information. However, much of this work currently addresses offline video processing. In this article, we propose an integrated two-stream convolutional network architecture that performs real-time detection, tracking, and near-accident detection of road users in traffic video data. The two-stream model consists of a spatial stream network for object detection and a temporal stream network that leverages motion features for multiple-object tracking. We detect near-accidents by combining appearance features and motion features from these two networks. Further, we demonstrate that our approach runs in real time, at a frame rate higher than the video frame rate, on a variety of videos collected from fisheye and overhead cameras.
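As a minimal sketch of fusing the two streams for near-accident flagging: the `Track` fields, the thresholds, and the overlap-plus-speed rule below are illustrative assumptions, not the article's actual criterion, which combines learned appearance and motion features.

```python
from dataclasses import dataclass

@dataclass
class Track:
    box: tuple    # (x1, y1, x2, y2) from the spatial (appearance) stream
    speed: float  # pixels/frame estimated by the temporal (motion) stream

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def near_accident(t1, t2, iou_thresh=0.1, speed_thresh=5.0):
    # flag when two tracked road users overlap while at least one moves fast;
    # thresholds are illustrative placeholders
    return iou(t1.box, t2.box) > iou_thresh and max(t1.speed, t2.speed) > speed_thresh
```

For example, `near_accident(Track((0, 0, 10, 10), 8.0), Track((5, 5, 15, 15), 1.0))` flags the overlapping fast-moving pair.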
  2. The role of perceptual organization in motion analysis has heretofore been minimal. In this work we demonstrate that the use of perceptual organization principles of temporal coherence (common fate) and spatial proximity can result in a robust motion segmentation algorithm that is able to handle drastic illumination changes, occlusion events, and multiple moving objects, without the use of object models. The adopted algorithm does not employ the traditional frame-by-frame motion analysis, but rather treats the image sequence as a single 3D spatio-temporal block of data. We describe motion using spatio-temporal surfaces, which we, in turn, describe as compositions of finite planar patches. These planar patches, referred to as temporal envelopes, capture the local nature of the motions. We detect these temporal envelopes using 3D edge detection followed by a Hough transform, and represent them with convex hulls. We present a graph-based method to group the temporal envelopes arising from one object based on Gestalt organizational principles. A probabilistic Bayesian network quantifies the saliencies of the relationships between temporal envelopes. We present results on sequences with multiple moving persons, significant occlusions, and scene illumination changes.
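The Gestalt-based grouping step can be sketched as connected components over a pairwise-affinity graph. The scalar `affinity` callback below is a stand-in for the Bayesian-network saliency between two temporal envelopes (proximity and common-fate cues); the threshold and envelope representation are illustrative assumptions.

```python
def group_envelopes(envelopes, affinity, thresh=0.5):
    """Connect envelopes whose pairwise affinity exceeds a threshold,
    then return the connected components (one group per moving object).

    `affinity(a, b)` stands in for the probabilistic saliency computed
    by the Bayesian network in the text; this greedy threshold rule is
    a simplification for illustration.
    """
    n = len(envelopes)
    parent = list(range(n))          # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if affinity(envelopes[i], envelopes[j]) > thresh:
                parent[find(i)] = find(j)  # union the two components

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

With 1-D "velocities" as toy envelopes and similarity `1 - |a - b|`, the envelopes `[0.0, 0.1, 5.0]` split into groups `[0, 1]` and `[2]`.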
  3.
    In this paper, we introduce a practical system for interactive video object mask annotation, which can support multiple back-end methods. To demonstrate the generality of our system, we introduce a novel approach for video object annotation. Our proposed system takes scribbles at a chosen key-frame from the end-users via a user-friendly interface and produces masks of the corresponding objects at the key-frame via the Control-Point-based Scribbles-to-Mask (CPSM) module. The object masks at the key-frame are then propagated to other frames and refined through the Multi-Referenced Guided Segmentation (MRGS) module. Last but not least, the user can correct erroneous segmentations at individual frames, and the corrected masks are continuously propagated to other frames in the video via the MRGS module to produce object masks for all video frames.
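The key-frame-to-all-frames propagation can be sketched as forward and backward passes from the key-frame, with a `refine` callback standing in for the MRGS module. The function name and per-frame refinement signature are assumptions for illustration, not the system's actual interface.

```python
def propagate_masks(n_frames, key_idx, key_mask, refine):
    """Propagate a key-frame mask to every frame in the video.

    `refine(prev_mask, frame_idx)` is a placeholder for per-frame guided
    segmentation (MRGS in the text); here it simply maps the neighboring
    mask to the current frame.
    """
    masks = [None] * n_frames
    masks[key_idx] = key_mask
    for i in range(key_idx + 1, n_frames):      # forward pass
        masks[i] = refine(masks[i - 1], i)
    for i in range(key_idx - 1, -1, -1):        # backward pass
        masks[i] = refine(masks[i + 1], i)
    return masks
```

A user correction at any frame can be treated as a new key-frame and re-propagated the same way.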
  4. Integral imaging has proven useful for three-dimensional (3D) object visualization in adverse environmental conditions such as partial occlusion and low light. This paper considers the problem of 3D object tracking. Two-dimensional (2D) object tracking within a scene is an active research area. Several recent algorithms use object detection methods to obtain 2D bounding boxes around objects of interest in each frame. Then, one bounding box can be selected out of many for each object of interest using motion prediction algorithms. Many of these algorithms rely on images obtained using traditional 2D imaging systems. A growing literature demonstrates the advantage of using 3D integral imaging instead of traditional 2D imaging for object detection and visualization in adverse environmental conditions. Integral imaging's depth sectioning ability has also proven beneficial for object detection and visualization. Integral imaging captures an object's depth in addition to its 2D spatial position in each frame. A recent study uses integral imaging for the 3D reconstruction of the scene for object classification and utilizes the mutual information between the object's bounding box in this 3D reconstructed scene and the 2D central perspective to achieve passive depth estimation. We build on this method by using Bayesian optimization to track the object's depth with as few 3D reconstructions as possible. We study the performance of our approach on laboratory scenes with occluded objects moving in 3D and show that the proposed approach outperforms 2D object tracking. In our experimental setup, mutual information-based depth estimation with Bayesian optimization achieves depth tracking with as few as two 3D reconstructions per frame, which corresponds to the theoretical minimum number of 3D reconstructions required for depth estimation. To the best of our knowledge, this is the first report on 3D object tracking using the proposed approach.
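A minimal sketch of depth tracking under a tight reconstruction budget: `score(z)` stands in for the mutual information between the 3D reconstruction at depth `z` and the 2D central perspective, and the hold-or-probe rule below (two score evaluations per frame) is an illustrative local search, not the paper's exact Bayesian-optimization procedure.

```python
def track_depth(score, z_prev, direction=1, step=0.1):
    """Two score evaluations per frame: hold the previous depth, or move
    one step along the last successful direction.

    `score(z)` is a placeholder for the mutual-information objective
    computed from a 3D reconstruction at depth z; step size and search
    rule are illustrative assumptions.
    """
    s_hold = score(z_prev)
    z_probe = z_prev + direction * step
    if score(z_probe) > s_hold:
        return z_probe, direction      # keep moving the same way
    return z_prev, -direction          # stay put, try the other way next frame
```

Iterating this on a smooth unimodal score converges to the depth of maximum mutual information.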
  5. Efficient and low-energy camera signal processing is critical for battery-supported sensing and surveillance applications. In this research, we develop a video object detection and tracking framework which adaptively down-samples frame pixels to minimize computation and memory costs, and thereby the energy consumed, while maintaining a high level of accuracy. Instead of always operating at the highest sensor pixel resolution (compute-intensive), video frame (pixel) content is down-sampled spatially to adapt to changing camera environments (size of the tracked object, peak signal-to-noise ratio (PSNR) of video frames). Object detection and tracking are supported by a novel video resolution-aware adaptive hyperdimensional computing framework. This leverages a low memory overhead non-linear hypervector encoding scheme specifically tailored for handling multiple degrees of resolution. Previous classification decisions of a moving object, based on its tracking label, are used to improve tracking robustness. Energy savings of up to 1.6 orders of magnitude and compute speedups of up to an order of magnitude are obtained across a range of experiments performed on benchmark systems.
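The resolution-adaptation policy can be sketched as choosing the coarsest down-sampling factor that keeps the tracked object resolvable, while backing off to full resolution on noisy frames. All thresholds and the PSNR rule below are illustrative assumptions, not the framework's tuned policy.

```python
def choose_downsample(obj_height_px, psnr_db,
                      min_obj_px=16, low_psnr_db=28, factors=(8, 4, 2, 1)):
    """Pick a spatial down-sampling factor for the next frame.

    Illustrative policy: noisy frames (low PSNR) are kept at full
    resolution; otherwise take the coarsest factor that leaves the
    tracked object at least `min_obj_px` pixels tall. All thresholds
    are placeholder values.
    """
    if psnr_db < low_psnr_db:                # noisy frame: full resolution
        return 1
    for f in factors:                        # coarsest acceptable factor
        if obj_height_px / f >= min_obj_px:
            return f
    return 1                                 # tiny object: full resolution
```

For example, a 128-pixel-tall object in a clean frame tolerates 8x down-sampling, while a 40-pixel object only tolerates 2x.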