It is common practice to think of a video as a sequence of images (frames) and to re-use deep neural network models that are trained only on images for similar analytics tasks on videos. In this paper, we show that this “leap of faith”, namely that deep learning models which work well on images will also work well on videos, is actually flawed. We show that even when a video camera is viewing a scene that is not changing in any human-perceptible way, and we control for external factors such as video compression and environment (lighting), the accuracy of video analytics applications fluctuates noticeably. These fluctuations occur because successive frames produced by the video camera may look similar visually, but are perceived quite differently by the video analytics applications. We observe that the root cause of these fluctuations is the dynamic camera parameter changes that a video camera makes automatically in order to capture and produce a visually pleasing video. The camera inadvertently acts as an “unintentional adversary” because these slight changes in image pixel values across consecutive frames, as we show, have a noticeably adverse impact on the accuracy of insights from video analytics tasks that re-use image-trained deep learning models. To address this inadvertent adversarial effect from the camera, we explore the use of transfer learning techniques to improve learning in video analytics tasks by transferring knowledge from learning on image analytics tasks. Our experiments with a number of different cameras and a variety of video analytics tasks show that the camera's inadvertent adversarial effect can be noticeably offset by quickly re-training the deep learning models using transfer learning. In particular, we show that our newly trained Yolov5 model reduces fluctuation in object detection across frames, which leads to better tracking of objects (∼40% fewer mistakes in tracking). Our paper also provides new directions and techniques to mitigate the camera's adversarial effect on deep learning models used for video analytics applications.
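The mitigation described above rests on a quick transfer-learning pass over frames from the target camera. The sketch below is a generic illustration of that recipe, not the paper's actual pipeline: it freezes an image-pretrained backbone and fine-tunes only the detection head, using torchvision's Faster R-CNN as a stand-in for Yolov5. The frame batch, box annotations, and hyperparameters are placeholders.

```python
# Minimal transfer-learning sketch (assumptions: torchvision stands in for
# Yolov5; frames and boxes are dummy placeholders, not real camera data).
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Image-pretrained detector acting as a stand-in for the paper's Yolov5 model.
model = fasterrcnn_resnet50_fpn(weights="DEFAULT")

# Freeze the image-trained backbone so only the detection head adapts
# to frames produced by the target camera.
for p in model.backbone.parameters():
    p.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in model.parameters() if p.requires_grad], lr=0.005, momentum=0.9
)

# Placeholder batch standing in for annotated frames from the camera:
# two 3x480x640 frames, each with one labelled bounding box.
frames = [torch.rand(3, 480, 640) for _ in range(2)]
targets = [
    {"boxes": torch.tensor([[50.0, 60.0, 200.0, 220.0]]),
     "labels": torch.tensor([1])}
    for _ in range(2)
]

model.train()
for step in range(10):                  # brief re-training pass
    loss_dict = model(frames, targets)  # train mode returns a dict of losses
    loss = sum(loss_dict.values())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```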
Cloud-Accelerated Analysis of Subsea High-Definition Camera Data
The seafloor high-definition camera (CamHD) installed on the Ocean Observatories Initiative (OOI) Cabled Array (CA) provides real-time video of the Mushroom vent at the ASHES hydrothermal field in the Axial Volcano caldera on the Juan de Fuca spreading zone (Figure 1). CamHD performs a pre-programmed 13-minute motion sequence every 3 hours. The video captured during this sequence is stored as a 13 GB HD video file in the OOI Cyber-Infrastructure (CI) at Rutgers University. As of July 2017, there are approximately 6700 videos in the CI, all of which are publicly accessible through a conventional HTTP interface. Unfortunately, it is impractical for a researcher (and taxing on the CI bandwidth) to download, store, and process the full extent of the video archive for analysis. We describe two elements of our efforts to accelerate CamHD video analysis: a cloud-hosted application which provides a simplified interface for extracting individual frames from CamHD videos in a time- and bandwidth-efficient manner; and a tool for the automatic isolation and identification of video subsets showing a sequence of known camera positions. Automatic identification of these video segments allows rapid and automatic development of, e.g., time-lapse videos.
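In the same spirit as the frame-extraction service described above, the following sketch pulls a single frame from a remote CamHD video by seeking before decoding, so only a small portion of the 13 GB file needs to be transferred. The URL and timestamp are placeholders, ffmpeg must be installed locally, and the bandwidth savings depend on the server honoring HTTP range requests.

```python
# Hedged sketch: time-seeked single-frame extraction over HTTP (not the
# paper's cloud application). VIDEO_URL and TIMESTAMP are placeholders.
import subprocess

VIDEO_URL = "https://example.org/ooi/camhd/CAMHDA301-sample.mov"  # placeholder URL
TIMESTAMP = "00:04:30"   # seek point within the 13-minute sequence

subprocess.run(
    [
        "ffmpeg",
        "-ss", TIMESTAMP,    # input-side seek: jump close to the frame before decoding
        "-i", VIDEO_URL,     # ffmpeg reads the remote file via HTTP range requests
        "-frames:v", "1",    # decode exactly one frame
        "-q:v", "2",         # high-quality JPEG output
        "frame_0430.jpg",
    ],
    check=True,
)
```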
- Award ID(s): 1700850
- PAR ID: 10047264
- Date Published:
- Journal Name: IEEE/MTS Oceans 2017 - Anchorage
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
In recent years, streamed 360° videos have gained popularity within Virtual Reality (VR) and Augmented Reality (AR) applications. However, they are of much higher resolution than 2D videos, causing greater bandwidth consumption when streamed. This increased bandwidth utilization puts tremendous strain on the network capacity of the cloud providers streaming these videos. In this paper, we introduce L3BOU, a novel, three-tier distributed software framework that reduces cloud-edge bandwidth in the backhaul network and lowers average end-to-end latency for 360° video streaming applications. The L3BOU framework achieves low bandwidth and low latency by leveraging edge-based, optimized upscaling techniques. L3BOU accomplishes this by utilizing down-scaled MPEG-DASH-encoded 360° video data, known as Ultra Low Resolution (ULR) data, to which the L3BOU edge applies distributed super-resolution (SR) techniques, providing a high-quality video to the client. L3BOU is able to reduce the cloud-edge backhaul bandwidth by up to a factor of 24, and the optimized super-resolution multi-processing of ULR data provides a 10-fold latency decrease in super-resolution upscaling at the edge.
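As a rough illustration of edge-side upscaling of a decoded ULR frame (not the actual L3BOU pipeline), the sketch below runs a pretrained ESPCN super-resolution model via OpenCV's contrib module. It assumes opencv-contrib-python is installed, the model file has been downloaded separately, and the input frame path is a placeholder.

```python
# Illustrative edge-side super-resolution of a low-resolution frame.
# Assumptions: opencv-contrib-python installed; ESPCN_x4.pb obtained separately.
import cv2

sr = cv2.dnn_superres.DnnSuperResImpl_create()
sr.readModel("ESPCN_x4.pb")      # pretrained SR model file (placeholder path)
sr.setModel("espcn", 4)          # 4x upscaling factor

low_res = cv2.imread("ulr_frame.png")   # decoded ULR frame (placeholder)
high_res = sr.upsample(low_res)         # SR upscaling performed at the edge
cv2.imwrite("upscaled_frame.png", high_res)
```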
-
We propose a simple, cost-effective method for synchronized phase contrast and fluorescence video acquisition in live samples. Counter-phased pulses of phase contrast illumination and fluorescence excitation light are synchronized with the exposure of the two fields of an interlaced camera sensor. This results in a video sequence in which each frame contains both exposure modes, each in half of its pixels. The method allows real-time acquisition and display of synchronized and spatially aligned phase contrast and fluorescence image sequences that can be separated by de-interlacing into two independent videos. The method can be implemented on any fluorescence microscope with a camera port without needing to modify the optical path.
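A minimal sketch of the de-interlacing step described above: each interlaced frame is split into its even and odd rows, i.e., the two sensor fields. Which field carries phase contrast and which carries fluorescence depends on the trigger wiring, so the assignment here is an assumption, and the frame is a random placeholder.

```python
# Field separation by de-interlacing (assumed field-to-modality assignment).
import numpy as np

def split_fields(frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (even_rows, odd_rows), each with half the vertical resolution."""
    return frame[0::2, :], frame[1::2, :]

frame = np.random.randint(0, 255, (480, 640), dtype=np.uint8)  # placeholder interlaced frame
phase_contrast, fluorescence = split_fields(frame)             # assumed field order
```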
-
Bulterman, Dick; Kankanhalli, Mohan; Muehlhaueser, Max; Persia, Fabio; Sheu, Philip; Tsai, Jeffrey (Eds.)
The emergence of 360-video streaming systems has brought about new possibilities for immersive video experiences while requiring significantly higher bandwidth than traditional 2D video streaming. Viewport prediction is used to address this problem, but interesting storylines outside the viewport are ignored. To address this limitation, we present SAVG360, a novel viewport guidance system that utilizes global content information available on the server side to enhance streaming with the best saliency-captured storyline of 360-videos. The saliency analysis is performed offline on the media server with a powerful GPU, and the saliency-aware guidance information is encoded and shared with clients through the Saliency-aware Guidance Descriptor. This enables the system to proactively guide users to switch between storylines of the video and allows users to follow or break guided storylines through a novel user interface. Additionally, we present a viewing mode prediction algorithm to enhance video delivery in SAVG360. Evaluation of user viewport traces in 360-videos demonstrates that SAVG360 outperforms existing tiled streaming solutions in terms of overall viewport prediction accuracy and the ability to stream high-quality 360° videos under bandwidth constraints. Furthermore, a user study highlights the advantages of our proactive guidance approach over merely predicting and streaming where users look.
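As a toy illustration of saliency-driven guidance (not SAVG360's actual descriptor format), the sketch below picks the tile of an equirectangular saliency map with the highest mean saliency as the point a guidance system might steer the viewport toward. The tile grid and saliency map are placeholders.

```python
# Toy saliency-guided viewport suggestion (placeholder saliency map and grid).
import numpy as np

def suggest_viewport(saliency: np.ndarray, rows: int = 4, cols: int = 8):
    h, w = saliency.shape
    th, tw = h // rows, w // cols
    tiles = saliency[: rows * th, : cols * tw].reshape(rows, th, cols, tw)
    scores = tiles.mean(axis=(1, 3))              # mean saliency per tile
    r, c = np.unravel_index(scores.argmax(), scores.shape)
    return (r + 0.5) * th, (c + 0.5) * tw         # tile centre in pixel coordinates

saliency_map = np.random.rand(960, 1920)          # placeholder equirectangular saliency
print(suggest_viewport(saliency_map))
```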
-
Pervasive deployment of surveillance cameras today poses enormous scalability challenges to video analytics systems operating over many camera feeds. Currently, there are few indexing tools to organize video feeds beyond what is provided by a standard file system. Recent video analytics systems implement application-specific frame profiling and sampling techniques to reduce the number of raw videos processed, leveraging frame-level redundancy or manually labeled spatial-temporal correlation between cameras. This paper presents Video-zilla, a standalone indexing layer between video query systems and a video store to organize video data. We propose a video data unit abstraction, the semantic video stream (SVS), based on a notion of distance between objects in the video. An SVS implicitly captures scenes, which are missing from current video content characterization, and offers a middle ground between individual frames and an entire camera feed. We then build a hierarchical index that exposes the semantic similarity both within and across camera feeds, such that Video-zilla can quickly cluster video feeds based on their content semantics without manual labeling. We implement and evaluate Video-zilla in three use cases: object identification queries, clustering for training specialized DNNs, and archival services. In all three cases, Video-zilla reduces the time complexity of inter-camera video analytics from linear in the number of cameras to sublinear, and reduces query resource usage by up to 14x compared to using frame-level or spatial-temporal similarity built into existing query systems.
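To make the idea of clustering feeds by content semantics concrete (this is a rough sketch, not Video-zilla's SVS index), the example below summarizes each feed as a normalized histogram of detected object classes and groups similar feeds with hierarchical clustering. The detection counts are made-up placeholders.

```python
# Rough illustration of content-semantic clustering of camera feeds.
# Assumption: per-feed object-class counts come from some upstream detector.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Rows: camera feeds; columns: counts of detected classes (person, car, bike, truck).
feed_histograms = np.array([
    [120,  10,  5,  2],   # pedestrian-heavy feed
    [115,  12,  8,  1],
    [  8, 200,  3, 60],   # traffic-heavy feed
    [  5, 180,  2, 75],
], dtype=float)
feed_histograms /= feed_histograms.sum(axis=1, keepdims=True)  # normalize per feed

labels = AgglomerativeClustering(n_clusters=2).fit_predict(feed_histograms)
print(labels)   # feeds with similar content semantics share a cluster id
```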