Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against six baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality and probabilistic frame forecasting ability for all datasets. 
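As a reading of that sampling scheme, a deterministic next-frame prediction corrected by a diffusion-sampled residual, the sketch below may help. The module names, the single-frame conditioning, and the plain DDPM-style update are illustrative assumptions, not the authors' implementation.

```python
import torch

@torch.no_grad()
def rollout(predictor, residual_denoiser, context_frames, num_new_frames, betas):
    """Illustrative autoregressive rollout: each new frame is a deterministic
    prediction plus a stochastic residual sampled by reverse diffusion."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    frames = list(context_frames)
    for _ in range(num_new_frames):
        mu = predictor(frames[-1])                  # deterministic next-frame estimate
        r = torch.randn_like(mu)                    # residual starts as pure noise
        for t in reversed(range(len(betas))):       # reverse diffusion on the residual
            eps = residual_denoiser(r, t, mu)       # noise prediction, conditioned on mu
            r = (r - betas[t] / torch.sqrt(1.0 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
            if t > 0:
                r = r + torch.sqrt(betas[t]) * torch.randn_like(r)
        frames.append(mu + r)                       # corrected frame = prediction + residual
    return torch.stack(frames[len(context_frames):])
```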
This content will become publicly available on July 17, 2026

History-Guided Video Diffusion
Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address this, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
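Read through a CFG lens, the simplest form of history guidance amounts to extrapolating between the model's denoising prediction with the history frames attached and the prediction with them dropped. The sketch below is only that reading; the model interface and the guidance weight w are assumptions, and DFoT's actual training objective and its time/frequency variants are more involved.

```python
import torch

def vanilla_history_guidance(model, x_t, t, history, w=1.5):
    """Illustrative CFG-style combination of predictions with and without history:
        eps = eps_uncond + w * (eps_cond - eps_uncond)
    A weight w > 1 pushes samples toward consistency with the conditioning history."""
    eps_cond = model(x_t, t, history=history)   # condition on the available history frames
    eps_uncond = model(x_t, t, history=None)    # same model with the history dropped
    return eps_uncond + w * (eps_cond - eps_uncond)
```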
- Award ID(s): 2211260
- PAR ID: 10635290
- Publisher / Repository: 2025 Forty-Second International Conference on Machine Learning
- Date Published:
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Kehtarnavaz, Nasser; Shirvaikar, Mukul V (Ed.) Recent diffusion-based generative models employ methods such as one-shot fine-tuning of an image diffusion model for video generation. However, this leads to long video generation times and suboptimal efficiency. To avoid this long generation time, zero-shot text-to-video models eliminate fine-tuning entirely and can generate novel videos from a text prompt alone. While zero-shot generation greatly reduces generation time, many models rely on inefficient cross-frame attention processors, hindering the diffusion model's use for real-time video generation. We address this issue by introducing more efficient attention processors to a video diffusion model. Specifically, we use attention processors (i.e., xFormers, FlashAttention, and HyperAttention) that are highly optimized for efficiency and hardware parallelization. We then apply these processors to a video generator and test with both older diffusion models such as Stable Diffusion 1.5 and newer, high-quality models such as Stable Diffusion XL. Our results show that using efficient attention processors alone can reduce generation time by around 25%, with no change in video quality. Combined with higher-quality models, this use of efficient attention processors in zero-shot generation yields a substantial efficiency and quality increase, greatly expanding the video diffusion model's applicability to real-time video generation. (A minimal sketch of the processor swap appears after this list.)
- In the era of generative AI, integrating video generation models into robotics opens new possibilities for the general-purpose robot agent. This paper introduces imitation learning with latent video planning (VILP). We propose a latent video diffusion model to generate predictive robot videos that adhere to temporal consistency to a good degree. Our method is able to generate highly time-aligned videos from multiple views, which is crucial for robot policy learning. Our video generation model is highly time-efficient. For example, it can generate videos from two distinct perspectives, each consisting of six frames with a resolution of 96x160 pixels, at a rate of 5 Hz. In the experiments, we demonstrate that VILP outperforms the existing video generation robot policy across several metrics: training costs, inference speed, temporal consistency of generated videos, and the performance of the policy. We also compare our method with other imitation learning methods. Our findings indicate that VILP can rely less on extensive high-quality task-specific robot action data while still maintaining robust performance. In addition, VILP possesses robust capabilities in representing multi-modal action distributions. Our paper provides a practical example of how to effectively integrate video generation models into robot policies, potentially offering insights for related fields and directions. For more details, please refer to our open-source repository https://github.com/ZhengtongXu/VILP.
- This paper addresses the challenge of precisely swapping objects in videos, particularly those involved in hand-object interactions (HOI), using a single user-provided reference object image. While diffusion models have advanced video editing, they struggle with the complexities of HOI, often failing to generate realistic edits when object swaps involve changes in shape or functionality. To overcome this, the authors propose HOI-Swap, a novel diffusion-based video editing framework trained in a self-supervised manner. The framework operates in two stages: (1) single-frame object swapping with HOI awareness, where the model learns to adjust interaction patterns (e.g., hand grasp) based on object property changes; and (2) sequence-wide extension, where motion alignment is achieved by warping a sequence from the edited frame using sampled motion points and conditioning generation on the warped sequence. Extensive qualitative and quantitative evaluations demonstrate that HOI-Swap significantly outperforms prior methods, producing high-quality, realistic HOI video edits.
- Neural Radiance Field (NeRF) has emerged as a powerful technique for 3D scene representation due to its high rendering quality. Among its applications, mobile NeRF video-on-demand (VoD) is especially promising, benefiting from both the scalability of mobile devices and the immersive experience offered by NeRF. However, streaming NeRF videos over real-world networks presents significant challenges, particularly due to limited bandwidth and temporal dynamics. To address these challenges, we propose NeRFlow, a novel framework that enables adaptive streaming for NeRF videos through both bitrate and viewpoint adaptation. NeRFlow solves three fundamental problems: first, it employs a rendering-adaptive pruning technique to determine voxel importance, selectively reducing data size without sacrificing rendering quality. Second, it introduces a viewpoint-aware adaptation module that efficiently compensates for uncovered regions in real time by combining pre-encoded master and sub-frames. Third, it incorporates a QoE-aware bitrate ladder generation framework, leveraging a genetic algorithm to optimize the number and configuration of bitrates while accounting for bandwidth dynamics and ABR algorithms. Through extensive experiments, NeRFlow is demonstrated to effectively improve user Quality of Experience (QoE) by 31.3% to 41.2%, making it an efficient solution for NeRF video streaming. (A toy sketch of such a genetic-algorithm ladder search appears after this list.)
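For the attention-processor study in the first item above, the swap itself is a few lines with the diffusers library. The sketch below uses a plain Stable Diffusion 1.5 image pipeline for brevity; the model ID and prompt are placeholders, xFormers must be installed for the first option, and the same processor swap applies to the UNet inside a zero-shot video generator.

```python
import torch
from diffusers import StableDiffusionPipeline
from diffusers.models.attention_processor import AttnProcessor2_0

# Load a base diffusion model (placeholder model ID).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Option 1: memory-efficient attention via xFormers (requires the xformers package).
pipe.enable_xformers_memory_efficient_attention()

# Option 2: PyTorch 2.x scaled_dot_product_attention, which dispatches to
# FlashAttention-style kernels where the hardware supports them.
pipe.unet.set_attn_processor(AttnProcessor2_0())

# Generate with the more efficient attention processor in place.
image = pipe("a timelapse of clouds over a city", num_inference_steps=25).images[0]
image.save("sample.png")
```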
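The QoE-aware bitrate-ladder search in the last item can be pictured with a toy genetic algorithm over candidate ladders. Everything below, the candidate bitrates, the stand-in QoE score, and the synthetic bandwidth trace, is a hypothetical illustration of the general technique, not NeRFlow's optimizer, which also models ABR behavior and rendering quality.

```python
import random

CANDIDATE_BITRATES = [2, 4, 8, 12, 20, 35, 50]   # Mbps, placeholder values
LADDER_SIZE = 4                                   # number of rungs per ladder

def qoe(ladder, bandwidth_trace):
    """Toy QoE proxy: reward the highest sustainable rung per bandwidth sample,
    penalize samples where no rung fits (a crude stall penalty)."""
    rungs = sorted(ladder)
    score = 0.0
    for bw in bandwidth_trace:
        usable = [b for b in rungs if b <= bw]
        score += max(usable) if usable else -5.0
    return score / len(bandwidth_trace)

def evolve(bandwidth_trace, pop_size=30, generations=50, mutation_rate=0.2):
    """Simple genetic algorithm: keep the fittest half, one-point crossover, mutate."""
    pop = [random.sample(CANDIDATE_BITRATES, LADDER_SIZE) for _ in range(pop_size)]
    for _ in range(generations):
        scored = sorted(pop, key=lambda l: qoe(l, bandwidth_trace), reverse=True)
        parents = scored[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randint(1, LADDER_SIZE - 1)
            child = a[:cut] + b[cut:]
            if random.random() < mutation_rate:
                child[random.randrange(LADDER_SIZE)] = random.choice(CANDIDATE_BITRATES)
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda l: qoe(l, bandwidth_trace))

if __name__ == "__main__":
    trace = [random.uniform(3, 40) for _ in range(200)]   # synthetic bandwidth samples (Mbps)
    print("best ladder (Mbps):", sorted(evolve(trace)))
```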