This content will become publicly available on May 1, 2025
- Award ID(s):
- 2212418
- PAR ID:
- 10536743
- Publisher / Repository:
- International Conference on Learning Representations
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Diffusion models are powerful, but they require substantial time and data to train. We propose Patch Diffusion, a generic patch-wise training framework, to significantly reduce training costs while improving data efficiency, thus helping democratize diffusion model training for a broader audience. At the core of our innovations is a new conditional score function at the patch level, where the patch location in the original image is included as additional coordinate channels, and the patch size is randomized and diversified throughout training to encode cross-region dependency at multiple scales. Sampling with our method is as easy as in the original diffusion model. With Patch Diffusion, we achieve ≥2× faster training while maintaining comparable or better generation quality. Patch Diffusion also improves the performance of diffusion models trained on relatively small datasets, e.g., as few as 5,000 images to train from scratch. We achieve outstanding FID scores in line with state-of-the-art benchmarks: 1.77 on CelebA-64×64, 1.93 on AFHQv2-Wild-64×64, and 2.72 on ImageNet-256×256. We share our code and pre-trained models on GitHub.
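The coordinate-channel conditioning can be illustrated with a minimal PyTorch sketch: crop a random patch at a random scale and append its normalized location in the original image as two extra input channels. The function name, patch sizes, and normalization are illustrative assumptions, not the authors' released code.

```python
import torch

def sample_patch_with_coords(images, patch_sizes=(16, 32, 64)):
    """Crop a random patch and append its normalized (y, x) location
    in the original image as two extra coordinate channels."""
    b, c, h, w = images.shape
    # Randomize the patch size to encode cross-region dependency at multiple scales.
    p = patch_sizes[torch.randint(len(patch_sizes), (1,)).item()]
    top = torch.randint(0, h - p + 1, (1,)).item()
    left = torch.randint(0, w - p + 1, (1,)).item()
    patch = images[:, :, top:top + p, left:left + p]

    # Pixel positions in the *original* image, normalized to [-1, 1],
    # so the score model knows where the patch sits.
    ys = torch.linspace(top, top + p - 1, p) / (h - 1) * 2 - 1
    xs = torch.linspace(left, left + p - 1, p) / (w - 1) * 2 - 1
    yy, xx = torch.meshgrid(ys, xs, indexing="ij")
    coords = torch.stack([yy, xx]).expand(b, 2, p, p)
    return torch.cat([patch, coords], dim=1)  # (b, c + 2, p, p)
```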
-
Training competitive deep video models is an order of magnitude slower than training their counterpart image models. Slow training causes long research cycles, which hinders progress in video understanding research. Following standard practice for training image models, video model training has used a fixed mini-batch shape: a specific number of clips, frames, and spatial size. However, what is the optimal shape? High-resolution models perform well but train slowly. Low-resolution models train faster but are less accurate. Inspired by multigrid methods in numerical optimization, we propose to use variable mini-batch shapes with different spatial-temporal resolutions that are varied according to a schedule. The different shapes arise from resampling the training data on multiple sampling grids. Training is accelerated by scaling up the mini-batch size and learning rate when shrinking the other dimensions. We empirically demonstrate a general and robust grid schedule that yields a significant out-of-the-box training speedup without a loss in accuracy for different models (I3D, non-local, SlowFast), datasets (Kinetics, Something-Something, Charades), and training settings (with and without pre-training, 128 GPUs or 1 GPU). As an illustrative example, the proposed multigrid method trains a ResNet-50 SlowFast network 4.5× faster (wall-clock time, same hardware) while also improving accuracy (+0.8% absolute) on Kinetics-400 compared to baseline training. Code is available online.
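The schedule idea can be sketched in a few lines: cycle from coarse to fine mini-batch shapes while scaling the batch size (and, linearly with it, the learning rate) so the cost per mini-batch stays roughly constant. The specific shape factors and scaling rule below are illustrative assumptions, not the published schedule.

```python
# Baseline mini-batch shape and learning rate (illustrative values).
BASE = {"batch": 8, "frames": 32, "size": 224, "lr": 0.1}

# (frame_factor, spatial_factor) pairs, ordered coarse to fine.
GRIDS = [(4, 2), (2, 2), (2, 1), (1, 1)]

def multigrid_schedule(base=BASE, grids=GRIDS):
    for tf, sf in grids:
        shrink = tf * sf * sf               # reduction in per-clip compute
        yield {
            "batch": base["batch"] * shrink,  # keep FLOPs per mini-batch ~constant
            "frames": base["frames"] // tf,
            "size": base["size"] // sf,
            "lr": base["lr"] * shrink,        # linear LR scaling with batch size
        }

for phase in multigrid_schedule():
    print(phase)
```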
-
The inference stage of diffusion models involves running a reverse-time diffusion stochastic differential equation, transforming samples from a Gaussian latent distribution into samples from a target distribution on a low-dimensional manifold. The intermediate values can be interpreted as noisy images, with the amount of noise determined by the forward diffusion process noise schedule. Boomerang is an approach for local sampling of image manifolds, which involves adding noise to an input image, moving it closer to the latent space, and mapping it back to the image manifold through a partial reverse diffusion process. Boomerang can be used with any pretrained diffusion model without adjustments to the reverse diffusion process, and we present three applications: constructing privacy-preserving datasets with controllable anonymity, increasing generalization performance with Boomerang for data augmentation, and enhancing resolution with a perceptual image enhancement framework.
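A minimal sketch of this local-sampling loop, assuming a generic noise-prediction network `denoise(x, t)` and a standard DDPM-style reverse update; the update rule and noise schedule here are common defaults, not necessarily the exact ones used with Boomerang.

```python
import torch

def boomerang(x0, denoise, alphas_cumprod, t_start):
    """Noise a clean image x0 partway toward the Gaussian latent
    (forward process up to t_start), then run only the tail of the
    reverse diffusion back to t = 0."""
    a_bar = alphas_cumprod[t_start]
    x = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * torch.randn_like(x0)

    for t in range(t_start, 0, -1):
        a_bar_t, a_bar_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
        alpha_t = a_bar_t / a_bar_prev
        eps = denoise(x, t)
        # DDPM posterior mean; add fresh noise except at the final step.
        x = (x - (1 - alpha_t) / (1 - a_bar_t).sqrt() * eps) / alpha_t.sqrt()
        if t > 1:
            x = x + (1 - alpha_t).sqrt() * torch.randn_like(x)
    return x

# Toy usage with a dummy denoiser and a linear beta schedule.
T = 1000
betas = torch.linspace(1e-4, 0.02, T + 1)
alphas_cumprod = torch.cumprod(1 - betas, dim=0)
out = boomerang(torch.randn(1, 3, 32, 32),
                lambda x, t: torch.zeros_like(x),
                alphas_cumprod, t_start=300)
```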
-
Diffusion models, emerging as powerful deep generative tools, excel in various applications. They operate through a two-step process: introducing noise into training samples and then employing a model to convert random noise into new samples (e.g., images). However, their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories, and employing a large model with numerous parameters across multiple timesteps (i.e., noise levels). To tackle these challenges, we present a multi-stage framework inspired by our empirical findings. These observations indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all timesteps. Our approach segments the time interval into multiple stages, where we employ a custom multi-decoder U-Net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference, which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework, showcasing significant training and sampling efficiency enhancements on three state-of-the-art diffusion models, including large-scale latent diffusion models. Furthermore, our ablation studies illustrate the impact of two important components of our framework: (i) a novel timestep clustering algorithm for stage division, and (ii) an innovative multi-decoder U-Net architecture that seamlessly integrates universal and stage-customized parameters.
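A minimal PyTorch sketch of the shared-encoder, per-stage-decoder routing described above; the stage boundaries here are fixed by hand, whereas the paper derives them with a timestep clustering algorithm, and the tiny conv layers stand in for real U-Net blocks.

```python
import torch
import torch.nn as nn

class MultiDecoderNet(nn.Module):
    """One encoder shared across all timesteps; one decoder per stage."""
    def __init__(self, channels=3, stage_bounds=(250, 500, 1000)):
        super().__init__()
        self.stage_bounds = stage_bounds
        self.encoder = nn.Conv2d(channels, 64, 3, padding=1)  # universal, shared
        self.decoders = nn.ModuleList([
            nn.Conv2d(64, channels, 3, padding=1)             # time-dependent
            for _ in stage_bounds
        ])

    def stage_of(self, t):
        # Route the timestep to the first stage whose upper bound exceeds it.
        return next(i for i, b in enumerate(self.stage_bounds) if t < b)

    def forward(self, x, t):
        return self.decoders[self.stage_of(t)](self.encoder(x))

net = MultiDecoderNet()
out = net(torch.randn(2, 3, 32, 32), t=100)  # routed to the stage-0 decoder
print(out.shape)  # torch.Size([2, 3, 32, 32])
```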
-
Artificial intelligence has recently been widely used in computational imaging. Deep neural networks (DNNs) improve the signal-to-noise ratio of retrieved images whose quality would otherwise be corrupted by low sampling ratios or noisy environments. This work proposes a new computational imaging scheme based on the sequence transduction mechanism of the transformer network. A simulation database assists the network in achieving signal-translation ability: the experimental single-pixel detector's signal is ‘translated’ into a 2D image in an end-to-end manner. High-quality images with no background noise can be retrieved at a sampling ratio as low as 2%. The illumination patterns can be either well-designed speckle patterns for sub-Nyquist imaging or random speckle patterns. Moreover, our method is robust to noise interference. This translation mechanism opens a new direction for DNN-assisted ghost imaging and can be used in various computational imaging scenarios.
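One plausible shape for such a ‘translation’ network, sketched in PyTorch: a transformer encoder reads the 1D measurement sequence and a linear head reshapes the result into a 2D image. The token embedding, depth, and head design are assumptions for illustration, not the paper's architecture.

```python
import torch
import torch.nn as nn

class SignalToImage(nn.Module):
    """Translate a 1D single-pixel measurement sequence into a 2D image."""
    def __init__(self, n_meas=82, img_size=64, d_model=128):
        super().__init__()
        self.img_size = img_size
        self.embed = nn.Linear(1, d_model)            # one token per measurement
        self.pos = nn.Parameter(torch.zeros(1, n_meas, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(n_meas * d_model, img_size * img_size)

    def forward(self, signal):                        # signal: (b, n_meas)
        tok = self.embed(signal.unsqueeze(-1)) + self.pos
        h = self.encoder(tok).flatten(1)
        return self.head(h).view(-1, 1, self.img_size, self.img_size)

model = SignalToImage()
y = model(torch.randn(4, 82))  # 2% of a 64x64 image is ~82 measurements
print(y.shape)                 # torch.Size([4, 1, 64, 64])
```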