This content will become publicly available on June 21, 2026

Title: Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
Video Frame Interpolation aims to recover realistic missing frames between observed frames, generating a high-frame-rate video from a low-frame-rate video. However, without additional guidance, the large motion between frames makes this problem ill-posed. Event-based Video Frame Interpolation (EVFI) addresses this challenge by using sparse, high-temporal-resolution event measurements as motion guidance. This guidance allows EVFI methods to significantly outperform frame-only methods. However, to date, EVFI methods have relied on a limited set of paired event-frame training data, severely limiting their performance and generalization capabilities. In this work, we overcome the limited data challenge by adapting pre-trained video diffusion models trained on internet-scale datasets to EVFI. We experimentally validate our approach on real-world EVFI datasets, including a new one that we introduce. Our method outperforms existing methods and generalizes across cameras far better than existing approaches.
Award ID(s):
2020624
PAR ID:
10650607
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Edition / Version:
1
Page Range / eLocation ID:
12456-12466
Format(s):
Medium: X Other: pdf
Sponsoring Org:
National Science Foundation
More Like this
  1. Optical flow-based video interpolation is a commonly used method to enrich video details by generating new intermediate frames from existing ones. However, it cannot accurately reproduce the trajectories of irregular and fast-moving objects. Incorporating event information into the interpolation process is an effective way to reconstruct high-speed scenes accurately, but existing methods are not suited to scenarios where events are sparse. To achieve high-quality video interpolation in these scenarios, this study incorporates event information in three ways: a) treating the moving target as the foreground and using events to delineate the target area, b) matching foreground to background based on the event information to generate a temporal frame between keyframes, and c) generating intermediate frames between a keyframe and the temporal frame with optical flow interpolation. Empirical studies indicate that, owing to its efficient incorporation of event information, the proposed framework outperforms state-of-the-art methods in generating high-quality frames for video interpolation.
  2. Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address these challenges, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
  3. Smartphones have recently become a popular platform for deploying computation-intensive virtual reality (VR) applications, such as immersive video streaming (a.k.a. 360-degree video streaming). One specific challenge involving the smartphone-based head mounted display (HMD) is to reduce the potentially huge power consumption caused by the immersive video. To address this challenge, we first conduct an empirical power measurement study on a typical smartphone immersive streaming system, which identifies the major power consumption sources. Then, we develop QuRate, a quality-aware and user-centric frame rate adaptation mechanism to tackle the power consumption issue in immersive video streaming. QuRate optimizes the immersive video power consumption by modeling the correlation between the perceivable video quality and the user behavior. Specifically, QuRate builds on the user's reduced level of concentration on the video frames during view switching and dynamically adjusts the frame rate without impacting the perceivable video quality. We evaluate QuRate with a comprehensive set of experiments involving 5 smartphones, 21 users, and 6 immersive videos using empirical user head movement traces. Our experimental results demonstrate that QuRate is capable of extending the smartphone battery life by up to 1.24X while maintaining the perceivable video quality during immersive video streaming. We also conduct an Institutional Review Board (IRB)-approved subjective user study to further validate the minimal video quality impact caused by QuRate.
  4. Deep learning models have significantly improved object detection, which is essential for visual sensing. However, their increasing complexity results in higher latency and resource consumption, making real-time object detection challenging. In order to address the challenge, we propose a new lightweight filtering method called L-filter to predict empty video frames that include no object of interest (e.g., vehicles) with high accuracy via hybrid time series analysis. L-filter drops those frames deemed empty and conducts object detection for nonempty frames only, significantly enhancing the frame processing rate and scalability of real-time object detection. Our evaluation demonstrates that L-filter improves the frame processing rate by 31–47% for a single traffic video stream compared to three standalone state-of-the-art object detection models without L-filter. Additionally, L-filter significantly enhances scalability; it can process up to six concurrent video streams in one commodity GPU, supporting over 57 fps per stream, by working alongside the fastest object detection model among the three models. 
  5. Accurately capturing dynamic scenes with wide-ranging motion and light intensity is crucial for many vision applications. However, acquiring high-speed high dynamic range (HDR) video is challenging because the camera's frame rate restricts its dynamic range. Existing methods sacrifice speed to acquire multi-exposure frames. Yet, misaligned motion in these frames can still pose complications for HDR fusion algorithms, resulting in artifacts. Instead of frame-based exposures, we sample the videos using individual pixels at varying exposures and phase offsets. Implemented on a monochrome pixel-wise programmable image sensor, our sampling pattern captures fast motion at a high dynamic range. We then transform pixel-wise outputs into an HDR video using end-to-end learned weights from deep neural networks, achieving high spatiotemporal resolution with minimized motion blurring. We demonstrate aliasing-free HDR video acquisition at 1000 FPS, resolving fast motion under low-light conditions and against bright backgrounds, both challenging conditions for conventional cameras. By combining the versatility of pixel-wise sampling patterns with the strength of deep neural networks at decoding complex scenes, our method greatly enhances the vision system's adaptability and performance in dynamic conditions. Index Terms: High-dynamic-range video, high-speed imaging, CMOS image sensors, programmable sensors, deep learning, convolutional neural networks.