NSF PAR Search | NSF Public Access Repository

History-Guided Video Diffusion

Song, Kiwhan; Chen, Boyuan; Simchowitz, Max; Du, Yilun; Tedrake, Russ; Sitzmann, Vincent (July 2025, 2025 Forty-Second International Conference on Machine Learning)

Classifier-free guidance (CFG) is a key technique for improving conditional generation in diffusion models, enabling more accurate control while enhancing sample quality. It is natural to extend this technique to video diffusion, which generates video conditioned on a variable number of context frames, collectively referred to as history. However, we find two key challenges to guiding with variable-length history: architectures that only support fixed-size conditioning, and the empirical observation that CFG-style history dropout performs poorly. To address these, we propose the Diffusion Forcing Transformer (DFoT), a video diffusion architecture and theoretically grounded training objective that jointly enable conditioning on a flexible number of history frames. We then introduce History Guidance, a family of guidance methods uniquely enabled by DFoT. We show that its simplest form, vanilla history guidance, already significantly improves video generation quality and temporal consistency. A more advanced method, history guidance across time and frequency, further enhances motion dynamics, enables compositional generalization to out-of-distribution history, and can stably roll out extremely long videos.
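For readers unfamiliar with CFG, the following is a minimal sketch of the standard guidance combination rule that the abstract builds on. It is not the paper's History Guidance implementation: the function name `cfg_combine` and the plain-list representation of noise predictions are assumptions made for illustration. In a history-guided setting, the conditional prediction would presumably be computed with history frames and the unconditional one without, but the abstract does not specify how DFoT produces either.

```python
from typing import List, Sequence


def cfg_combine(eps_uncond: Sequence[float],
                eps_cond: Sequence[float],
                w: float) -> List[float]:
    """Standard CFG rule: eps = eps_uncond + w * (eps_cond - eps_uncond).

    w = 0 gives the unconditional prediction, w = 1 the conditional one,
    and w > 1 extrapolates past the conditional prediction.
    """
    if len(eps_uncond) != len(eps_cond):
        raise ValueError("predictions must have the same length")
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    # Hypothetical per-element noise predictions from one denoising step.
    print(cfg_combine([0.0, 1.0], [1.0, 3.0], w=2.0))
```

The guidance weight `w` trades adherence to the conditioning against sample diversity; its useful range is model-dependent.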
Free, publicly-accessible full text available July 17, 2026
FlowMap: High-Quality Camera Poses, Intrinsics, and Depth via Gradient Descent

https://doi.org/10.1109/3DV66043.2025.00041

Smith, Cameron; Charatan, David; Tewari, Ayush; Sitzmann, Vincent (March 2025, IEEE)

Free, publicly-accessible full text available March 25, 2026
Score Distillation via Reparametrized DDIM

Lukoianov, Artem; Borde, Haitz; Greenewald, Kristjan; Guizilini, Vitor; Bagautdinov, Timur; Sitzmann, Vincent; Solomon, Justin (December 2024, NeurIPS Proceedings)

While 2D diffusion models generate realistic, high-detail images, 3D shape generation methods like Score Distillation Sampling (SDS) built on these 2D diffusion models produce cartoon-like, over-smoothed shapes. To help explain this discrepancy, we show that the image guidance used in Score Distillation can be understood as the velocity field of a 2D denoising generative process, up to the choice of a noise term. In particular, after a change of variables, SDS resembles a high-variance version of Denoising Diffusion Implicit Models (DDIM) with a differently-sampled noise term: SDS introduces noise i.i.d. randomly at each step, while DDIM infers it from the previous noise predictions. This excessive variance can lead to over-smoothing and unrealistic outputs. We show that a better noise approximation can be recovered by inverting DDIM in each SDS update step. This modification makes SDS's generative process for 2D images almost identical to DDIM. In 3D, it removes over-smoothing, preserves higher-frequency detail, and brings the generation quality closer to that of 2D samplers. Experimentally, our method achieves better or similar 3D generation quality compared to other state-of-the-art Score Distillation methods, all without training additional neural networks or multi-view supervision, and providing useful insights into the relationship between 2D and 3D asset generation with diffusion models.
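The contrast between freshly sampled and inferred noise can be made concrete with the deterministic DDIM update. The sketch below is an illustration under stated assumptions, not the paper's algorithm: `ddim_step` is a hypothetical helper, samples are plain lists of floats, and the noise estimate `eps` is passed in rather than predicted by a network. It shows that applying the same update in the opposite direction with the same noise estimate recovers the starting point, which is the idea behind DDIM inversion. In practice the noise estimate changes between steps, so inversion is only approximate.

```python
import math
from typing import List, Sequence


def ddim_step(x_t: Sequence[float], eps: Sequence[float],
              ab_t: float, ab_s: float) -> List[float]:
    """Deterministic DDIM (eta = 0) move between two noise levels.

    ab_t and ab_s are the cumulative alpha products (in (0, 1]) at the
    current and target levels. The clean-sample estimate is formed from
    x_t and eps, then re-noised to the target level with the same eps.
    """
    x0_pred = [(x - math.sqrt(1.0 - ab_t) * e) / math.sqrt(ab_t)
               for x, e in zip(x_t, eps)]
    return [math.sqrt(ab_s) * x0 + math.sqrt(1.0 - ab_s) * e
            for x0, e in zip(x0_pred, eps)]


if __name__ == "__main__":
    x_t, eps = [0.5, -1.2], [0.3, 0.1]        # hypothetical values
    x_s = ddim_step(x_t, eps, ab_t=0.5, ab_s=0.8)   # denoise one step
    back = ddim_step(x_s, eps, ab_t=0.8, ab_s=0.5)  # invert with same eps
    print(x_s, back)
```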
Full Text Available
Diffusion Forcing: Next-token Prediction Meets Full-Sequence Diffusion

Chen, Boyuan; Martí Monsó, Diego; Du, Yilun; Simchowitz, Max; Tedrake, Russ; Sitzmann, Vincent (September 2024, Neural Information Processing Systems 2024)

This paper presents Diffusion Forcing, a new training paradigm where a diffusion model is trained to denoise a set of tokens with independent per-token noise levels. We apply Diffusion Forcing to sequence generative modeling by training a causal next-token prediction model to generate one or several future tokens without fully diffusing past ones. Our approach is shown to combine the strengths of next-token prediction models, such as variable-length generation, with the strengths of full-sequence diffusion models, such as the ability to guide sampling to desirable trajectories. Our method offers a range of additional capabilities, such as (1) rolling out sequences of continuous tokens, such as video, with lengths past the training horizon, where baselines diverge, and (2) new sampling and guiding schemes that uniquely profit from Diffusion Forcing's variable-horizon and causal architecture, and which lead to marked performance gains in decision-making and planning tasks. In addition to its empirical success, our method is proven to optimize a variational lower bound on the likelihoods of all subsequences of tokens drawn from the true joint distribution.
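The phrase "independent per-token noise levels" can be illustrated with a small sketch of the data-corruption step. Everything specific here is an assumption for illustration: scalar tokens, a linear beta schedule, uniform sampling of levels, the standard DDPM forward process, and the helper names `make_alpha_bars` and `noise_tokens_independently`. The abstract does not state which schedule or sampling distribution the authors use.

```python
import math
import random
from typing import List, Sequence, Tuple


def make_alpha_bars(num_levels: int, beta_start: float = 1e-4,
                    beta_end: float = 0.02) -> List[float]:
    """Cumulative products of (1 - beta) for an assumed linear schedule.

    Index 0 corresponds to a clean token (no noise).
    """
    alpha_bars = [1.0]
    for k in range(num_levels):
        beta = beta_start + (beta_end - beta_start) * k / max(num_levels - 1, 1)
        alpha_bars.append(alpha_bars[-1] * (1.0 - beta))
    return alpha_bars


def noise_tokens_independently(
    tokens: Sequence[float], alpha_bars: Sequence[float], rng: random.Random
) -> Tuple[List[float], List[int], List[float]]:
    """Corrupt each token with its own, independently drawn noise level."""
    noisy, levels, noises = [], [], []
    for x in tokens:
        k = rng.randrange(len(alpha_bars))  # per-token level, not shared
        eps = rng.gauss(0.0, 1.0)
        ab = alpha_bars[k]
        noisy.append(math.sqrt(ab) * x + math.sqrt(1.0 - ab) * eps)
        levels.append(k)
        noises.append(eps)
    return noisy, levels, noises


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [0.1 * i for i in range(8)]  # hypothetical scalar sequence
    noisy, levels, _ = noise_tokens_independently(tokens, make_alpha_bars(10), rng)
    print(levels)
```

A denoiser trained on such inputs would receive the per-token levels alongside the noisy tokens. By contrast, full-sequence diffusion shares one level across the whole sequence.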
Full Text Available
pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction

Charatan, David; Li, Sizhe; Tagliasacchi, Andrea; Sitzmann, Vincent (June 2024, The Conference on Computer Vision and Pattern Recognition)

We introduce pixelSplat, a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations, we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick, allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets, where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field.
Full Text Available
FlowCam: Training generalizable 3D radiance fields without camera poses via pixel-aligned scene flow

Smith, Cameron; Du, Yilun; Tewari, Ayush; Sitzmann, Vincent (December 2023, Neural Information Processing Systems)

Reconstruction of 3D neural fields from posed images has emerged as a promising method for self-supervised representation learning. The key challenge preventing the deployment of these 3D scene learners on large-scale video data is their dependence on precise camera poses from structure-from-motion, which is prohibitively expensive to run at scale. We propose a method that jointly reconstructs camera poses and 3D neural scene representations online and in a single forward pass. We estimate poses by first lifting frame-to-frame optical flow to 3D scene flow via differentiable rendering, preserving locality and shift-equivariance of the image processing backbone. SE(3) camera pose estimation is then performed via a weighted least-squares fit to the scene flow field. This formulation enables us to jointly supervise pose estimation and a generalizable neural scene representation via re-rendering the input video, and thus, train end-to-end and fully self-supervised on real-world video datasets. We demonstrate that our method performs robustly on diverse, real-world video, notably on sequences traditionally challenging to optimization-based pose estimation techniques.
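The weighted least-squares pose fit has a compact closed form in two dimensions, which the sketch below uses as a simplified analogue. It is not the authors' SE(3) solver: it fits a planar rotation and translation (SE(2)) to a flow field, and the function name `fit_rigid_2d` and the point, flow and weight inputs are hypothetical. The 3D case is typically solved with a weighted Procrustes (SVD-based) fit instead.

```python
import math
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]


def fit_rigid_2d(points: Sequence[Vec2], flows: Sequence[Vec2],
                 weights: Sequence[float]) -> Tuple[float, Vec2]:
    """Return (theta, t) minimizing sum_i w_i * ||R(theta) p_i + t - (p_i + f_i)||^2."""
    total = sum(weights)
    targets: List[Vec2] = [(p[0] + f[0], p[1] + f[1])
                           for p, f in zip(points, flows)]
    # Weighted centroids of the source points and the flowed points.
    pcx = sum(w * p[0] for w, p in zip(weights, points)) / total
    pcy = sum(w * p[1] for w, p in zip(weights, points)) / total
    qcx = sum(w * q[0] for w, q in zip(weights, targets)) / total
    qcy = sum(w * q[1] for w, q in zip(weights, targets)) / total
    # Weighted dot and cross terms of the centered point pairs.
    s_cos = s_sin = 0.0
    for w, p, q in zip(weights, points, targets):
        ax, ay = p[0] - pcx, p[1] - pcy
        bx, by = q[0] - qcx, q[1] - qcy
        s_cos += w * (ax * bx + ay * by)
        s_sin += w * (ax * by - ay * bx)
    theta = math.atan2(s_sin, s_cos)
    c, s = math.cos(theta), math.sin(theta)
    # Translation aligns the rotated source centroid with the target centroid.
    t = (qcx - (c * pcx - s * pcy), qcy - (s * pcx + c * pcy))
    return theta, t


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (2.0, 3.0)]
    c, s = math.cos(0.3), math.sin(0.3)
    flow = [(c * x - s * y + 1.0 - x, s * x + c * y - 2.0 - y) for x, y in pts]
    print(fit_rigid_2d(pts, flow, [1.0, 2.0, 0.5, 1.0]))
```

Because the solution is a closed-form, differentiable function of the flow and the weights, a fit of this kind can sit inside a network and be trained end-to-end, which matches the role the abstract describes.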
Full Text Available
Variational Barycentric Coordinates

Dodik, Ana; Stein, Oded; Sitzmann, Vincent; Solomon, Justin (December 2023, ACM Transactions on Graphics)

We propose a variational technique to optimize for generalized barycentric coordinates that offers additional control compared to existing models. Prior work represents barycentric coordinates using meshes or closed-form formulae, limiting the choice of objective function. In contrast, we directly parameterize the continuous function mapping any coordinate in a polytope’s interior to its barycentric coordinates using a neural field. Enabled by our theoretical characterization of barycentric coordinates, we construct neural fields parameterizing valid coordinates. We demonstrate flexibility using various objective functions, validate our algorithm, and present several applications.
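As a reminder of what "valid coordinates" means, the sketch below checks the three defining properties (non-negativity, partition of unity, and reproduction of the query point) in the simplest case of a triangle, where barycentric coordinates are unique. It does not reflect the paper's neural-field parameterization, which targets general polytopes where many valid choices exist. The helper names `triangle_barycentric` and `is_valid` are hypothetical.

```python
from typing import Sequence, Tuple

Vec2 = Tuple[float, float]


def triangle_barycentric(p: Vec2, a: Vec2, b: Vec2, c: Vec2) -> Tuple[float, float, float]:
    """Barycentric coordinates of p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    if det == 0:
        raise ValueError("degenerate triangle")
    l1 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    l2 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return l1, l2, 1.0 - l1 - l2


def is_valid(coords: Sequence[float], p: Vec2, vertices: Sequence[Vec2],
             tol: float = 1e-9) -> bool:
    """Check non-negativity, sum-to-one, and reproduction of p."""
    nonneg = all(l >= -tol for l in coords)
    unity = abs(sum(coords) - 1.0) <= tol
    rx = sum(l * v[0] for l, v in zip(coords, vertices))
    ry = sum(l * v[1] for l, v in zip(coords, vertices))
    reproduces = abs(rx - p[0]) <= tol and abs(ry - p[1]) <= tol
    return nonneg and unity and reproduces


if __name__ == "__main__":
    tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
    inside = (0.25, 0.25)
    coords = triangle_barycentric(inside, *tri)
    print(coords, is_valid(coords, inside, tri))
```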
Full Text Available
