Title: FFNet: Video Fast-Forwarding via Reinforcement Learning
For many applications with limited computation, communication, storage, and energy resources, there is an imperative need for computer vision methods that can select an informative subset of the input video for efficient processing at or near real time. In the literature, there are two relevant groups of approaches: generating a “trailer” for a video, or fast-forwarding while watching/processing the video. The first group is supported by video summarization techniques, which require processing the entire video to select an important subset to show to users. In the second group, current fast-forwarding methods depend on either manual control or automatic adaptation of playback speed, which often do not present an accurate representation and may still require processing of every frame. In this paper, we introduce FastForwardNet (FFNet), a reinforcement learning agent that takes inspiration from video summarization and does fast-forwarding differently. It is an online framework that automatically fast-forwards a video and presents a representative subset of frames to users on the fly. It does not require processing the entire video, only the portion selected by the fast-forward agent, which makes the process very computationally efficient. The online nature of our proposed method also enables users to begin fast-forwarding at any point in the video. Experiments on two real-world datasets demonstrate that our method provides a better representation of the input video (about 6%-20% improvement in coverage of important frames) with much less processing (more than 80% reduction in the number of frames processed).
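To make the idea concrete, here is a minimal sketch of the kind of online fast-forwarding loop described above, assuming a learned Q-function over per-frame features that picks a discrete skip size at each processed frame. The feature extractor, the skip-size set, and the random linear Q-network stand-in are illustrative placeholders, not the authors' FFNet implementation.

```python
import numpy as np

# Illustrative skip-size action set; the paper's actual action space may differ.
SKIP_SIZES = [1, 2, 4, 8, 16, 32]

rng = np.random.default_rng(0)
FEAT_DIM = 128
# Stand-in for a trained Q-network: a random linear map from frame features
# to one Q-value per candidate skip size (hypothetical, for illustration only).
q_weights = rng.normal(size=(FEAT_DIM, len(SKIP_SIZES)))

def frame_features(frame):
    """Placeholder feature extractor; a real system would use a CNN."""
    return rng.normal(size=FEAT_DIM)

def fast_forward(video, start=0):
    """Online fast-forwarding: process only the frames the agent jumps to."""
    selected, t = [], start
    while t < len(video):
        feats = frame_features(video[t])
        q_values = feats @ q_weights          # one value per candidate skip size
        skip = SKIP_SIZES[int(np.argmax(q_values))]
        selected.append(t)                    # show this frame to the user
        t += skip                             # jump ahead without decoding the gap
    return selected

# Toy usage: a "video" of 1000 dummy frames; only the visited ones are processed.
video = [None] * 1000
kept = fast_forward(video)
print(f"processed {len(kept)} of {len(video)} frames")
```

Because the agent only touches the frames it jumps to, the per-video cost scales with the number of selected frames rather than the video length, and the loop can start from any position in the stream.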
Award ID(s):
1724341
PAR ID:
10067975
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE Conf. on Computer Vision and Pattern Recognition
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Real-time video analytics typically require video frames to be processed by a query to identify objects or activities of interest while adhering to an end-to-end frame processing latency constraint. This imposes a continuous and heavy load on backend compute and network infrastructure. Video data has inherent redundancy and does not always contain an object of interest for a given query. We leverage this property of video streams to propose a lightweight Load Shedder that can be deployed on edge servers or on inexpensive edge devices co-located with cameras. The proposed Load Shedder uses pixel-level color-based features to calculate a utility score for each ingress video frame and a minimum utility threshold to select interesting frames to send for query processing. Dropping unnecessary frames enables the video analytics query in the backend to meet the end-to-end latency constraint with fewer compute and network resources. To guarantee a bounded end-to-end latency at runtime, we introduce a control loop that monitors the backend load and dynamically adjusts the utility threshold. Performance evaluations show that the proposed Load Shedder selects a large portion of frames containing each object of interest while meeting the end-to-end frame processing latency constraint. Furthermore, it does not impose a significant latency overhead when running on edge devices with modest compute resources. 
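A rough sketch of the utility-threshold idea in the abstract above, assuming a color-range utility score and a simple additive control loop keyed to backend queue length; the color range, step size, and class names are hypothetical, not the paper's actual design.

```python
import numpy as np

# Hypothetical query-specific color range (e.g., "reddish" objects), expressed
# as inclusive lower/upper bounds per RGB channel.
COLOR_LO = np.array([150, 0, 0])
COLOR_HI = np.array([255, 80, 80])

def utility(frame):
    """Pixel-level color utility: fraction of pixels inside the query's color range."""
    in_range = np.all((frame >= COLOR_LO) & (frame <= COLOR_HI), axis=-1)
    return in_range.mean()

class LoadShedder:
    """Drop frames whose utility falls below a threshold that tracks backend load."""
    def __init__(self, threshold=0.01, target_queue=10, step=0.005):
        self.threshold = threshold
        self.target_queue = target_queue   # desired backend queue length
        self.step = step                   # how aggressively to adapt

    def admit(self, frame, backend_queue_len):
        # Control loop: raise the bar when the backend is falling behind,
        # relax it when there is spare capacity.
        if backend_queue_len > self.target_queue:
            self.threshold = min(1.0, self.threshold + self.step)
        else:
            self.threshold = max(0.0, self.threshold - self.step)
        return utility(frame) >= self.threshold

# Toy usage with random frames and a synthetic backend queue length.
rng = np.random.default_rng(1)
shedder = LoadShedder()
frames = rng.integers(0, 256, size=(20, 48, 64, 3), dtype=np.uint8)
sent = sum(shedder.admit(f, backend_queue_len=rng.integers(0, 20)) for f in frames)
print(f"forwarded {sent} of {len(frames)} frames")
```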
  2. We introduce the concept of semantic fast-forwarding of video streams for efficient labeling of training data for activity recognition. We show that this concept can be realized by combining deep learning within individual frames with spatial and temporal entity-relationship reasoning about detected objects. We describe a prototype that implements this concept, and present preliminary experimental results on its feasibility and value.
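As a hedged illustration of the entity-relationship reasoning mentioned above, the sketch below keeps a frame for labeling only when a detected hand lies near a detected tool; the detection format and the "hand near tool" rule are assumptions for illustration, not the prototype's actual logic.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    box: tuple  # (x1, y1, x2, y2) in pixels

def centers_close(a: Detection, b: Detection, max_dist=50.0):
    """Simple spatial relation: are the two boxes' centers within max_dist pixels?"""
    ax = ((a.box[0] + a.box[2]) / 2, (a.box[1] + a.box[3]) / 2)
    bx = ((b.box[0] + b.box[2]) / 2, (b.box[1] + b.box[3]) / 2)
    return ((ax[0] - bx[0]) ** 2 + (ax[1] - bx[1]) ** 2) ** 0.5 <= max_dist

def frame_is_interesting(detections):
    """Keep a frame for labeling if a hand is detected near a tool."""
    hands = [d for d in detections if d.label == "hand"]
    tools = [d for d in detections if d.label == "tool"]
    return any(centers_close(h, t) for h in hands for t in tools)

# Toy usage: two frames' worth of (hypothetical) detector output.
frame_a = [Detection("hand", (100, 100, 160, 180)), Detection("tool", (150, 120, 220, 200))]
frame_b = [Detection("hand", (10, 10, 60, 80))]
print(frame_is_interesting(frame_a), frame_is_interesting(frame_b))  # True False
```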
  3. Optical flow-based video interpolation is a commonly used method to enrich video details by generating new intermediate frames from existing ones. However, it cannot accurately reproduce the trajectories of irregular and fast-moving objects. To achieve accurate reconstruction of high-speed scenes, incorporating event information during the interpolation process is an effective method. However, existing methods are not suitable for scenarios where events are sparse. To achieve high-quality video interpolation in these scenarios, this study incorporates event information into the interpolation process through three ideas: a) treating the moving target as the foreground and using events to delineate the target area, b) matching the foreground to the background based on the event information to generate a temporal frame between keyframes, and c) generating intermediate frames between the keyframe and the temporal frame with optical flow interpolation. Empirical studies indicate that, owing to its efficient incorporation of event information, the proposed framework outperforms state-of-the-art methods in generating high-quality frames for video interpolation.
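The three steps a)-c) above could be sketched roughly as follows, assuming a per-pixel event-count map; the final step substitutes a simple linear cross-fade as a stand-in for true optical flow-based warping, so this is a structural outline rather than the study's method.

```python
import numpy as np

def foreground_mask_from_events(event_counts, min_events=3):
    """(a) Delineate the moving target: pixels with enough events are foreground."""
    return event_counts >= min_events

def compose_temporal_frame(background, foreground, mask):
    """(b) Paste the event-delineated foreground onto the background keyframe."""
    out = background.copy()
    out[mask] = foreground[mask]
    return out

def interpolate(frame_a, frame_b, n_intermediate=3):
    """(c) Stand-in for flow-based interpolation: linear blends between the frames.
    A real system would warp pixels along the estimated optical flow instead."""
    alphas = np.linspace(0, 1, n_intermediate + 2)[1:-1]
    return [(1 - a) * frame_a + a * frame_b for a in alphas]

# Toy usage on synthetic 8x8 grayscale data.
rng = np.random.default_rng(2)
key = rng.random((8, 8))
fg = np.ones((8, 8))
events = rng.integers(0, 6, size=(8, 8))
mask = foreground_mask_from_events(events)
temporal = compose_temporal_frame(key, fg, mask)
mids = interpolate(key, temporal)
print(len(mids), mids[0].shape)
```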
  4. Rewrite rules are critical in equality saturation, an increasingly popular technique in optimizing compilers, synthesizers, and verifiers. Unfortunately, developing high-quality rulesets is difficult and error-prone. Recent work on automatically inferring rewrite rules does not scale to large terms or grammars, and existing rule inference tools are monolithic and opaque. Equality saturation users therefore struggle to guide inference and incrementally construct rulesets. As a result, most users still manually develop and maintain rulesets. This paper proposes Enumo, a new domain-specific language for programmable theory exploration. Enumo provides a small set of core operators that enable users to strategically guide rule inference and incrementally build rulesets. Short Enumo programs easily replicate results from state-of-the-art tools, but Enumo programs can also scale to infer deeper rules from larger grammars than prior approaches. Its composable operators even facilitate developing new strategies for ruleset inference. We introduce a new fast-forwarding strategy that does not require evaluating terms in the target language, and can thus support domains that were out of scope for prior work. We evaluate Enumo and fast-forwarding across a variety of domains. Compared to state-of-the-art techniques, Enumo can synthesize better rulesets over a diverse set of domains, in some cases matching the effects of manually-developed rulesets in systems driven by equality saturation.
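For readers unfamiliar with rule inference, the sketch below shows the baseline idea that candidate rewrite rules are validated by evaluating both sides on random inputs. This is generic Python written purely for illustration, not Enumo syntax; Enumo's fast-forwarding strategy is notable precisely because it avoids this kind of term evaluation.

```python
import random

# Two candidate rewrite rules over a tiny arithmetic language, written as
# plain Python lambdas returning (lhs, rhs); this is not Enumo syntax.
candidates = {
    "(x + y) * z => x*z + y*z": lambda x, y, z: ((x + y) * z, x * z + y * z),
    "x - y => y - x":           lambda x, y, z: (x - y, y - x),  # unsound
}

def validate(rule, trials=1000):
    """Random-testing validation: a rule survives only if both sides agree
    on every sampled input (the evaluation-based baseline that
    fast-forwarding is designed to avoid)."""
    for _ in range(trials):
        args = [random.randint(-50, 50) for _ in range(3)]
        lhs, rhs = rule(*args)
        if lhs != rhs:
            return False
    return True

for name, rule in candidates.items():
    print(name, "->", "kept" if validate(rule) else "discarded")
```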
  5.
    First-person-view videos of hands interacting with tools are widely used in the computer vision industry. However, creating a dataset with pixel-wise segmentation of hands is challenging since most videos are captured with fingertips occluded by the hand dorsum and grasped tools. Current methods often rely on manually segmenting hands to create annotations, which is inefficient and costly. To address this challenge, we create a method that utilizes thermal information of hands for efficient pixel-wise hand segmentation to create a multi-modal activity video dataset. Our method is not affected by fingertip and joint occlusions and does not require hand pose ground truth. We show our method to be 24 times faster than the traditional polygon labeling method while maintaining high quality. With the segmentation method, we propose a multi-modal hand activity video dataset with 790 sequences and 401,765 frames of "hands using tools" videos captured by thermal and RGB-D cameras with hand segmentation data. We analyze multiple models for hand segmentation performance and benchmark four segmentation networks. We show that fusing Long-Wave InfraRed (LWIR) and RGB-D frames in our multi-modal dataset achieves 5% better hand IoU performance than using RGB frames alone.
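A small sketch of the hand IoU metric used for benchmarking above, together with one plausible early-fusion layout for LWIR plus RGB-D input; the 5-channel stacking order is an assumption, not the dataset's documented format.

```python
import numpy as np

def hand_iou(pred_mask, gt_mask):
    """Intersection over union between predicted and ground-truth hand masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0

def fuse_lwir_rgbd(rgb, depth, lwir):
    """Early fusion (assumed layout): stack RGB, depth, and thermal into a
    5-channel array that a segmentation network could consume."""
    return np.concatenate([rgb, depth[..., None], lwir[..., None]], axis=-1)

# Toy usage on synthetic 64x64 data.
rng = np.random.default_rng(3)
rgb = rng.random((64, 64, 3))
depth = rng.random((64, 64))
lwir = rng.random((64, 64))
fused = fuse_lwir_rgbd(rgb, depth, lwir)   # shape (64, 64, 5)
pred = rng.random((64, 64)) > 0.5
gt = rng.random((64, 64)) > 0.5
print(fused.shape, round(hand_iou(pred, gt), 3))
```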