Title: Hierarchical Autoregressive Modeling for Neural Video Compression
Recent work by Marino et al. (2020) showed improved performance in sequential density estimation by combining masked autoregressive flows with hierarchical latent variable models. We draw a connection between such autoregressive generative models and the task of lossy video compression. Specifically, we view recent neural video compression methods (Lu et al., 2019; Yang et al., 2020b; Agustsson et al., 2020) as instances of a generalized stochastic temporal autoregressive transform, and propose avenues for enhancement based on this insight. Comprehensive evaluations on large-scale video data show improved rate-distortion performance over both state-of-the-art neural and conventional video compression methods.
Award ID(s):
2007719 2003237
PAR ID:
10301042
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Denoising diffusion probabilistic models are a promising new class of generative models that mark a milestone in high-quality image generation. This paper showcases their ability to sequentially generate video, surpassing prior methods in perceptual and probabilistic forecasting metrics. We propose an autoregressive, end-to-end optimized video diffusion model inspired by recent advances in neural video compression. The model successively generates future frames by correcting a deterministic next-frame prediction using a stochastic residual generated by an inverse diffusion process. We compare this approach against six baselines on four datasets involving natural and simulation-based videos. We find significant improvements in terms of perceptual quality and probabilistic frame forecasting ability for all datasets. 
  2. This work introduces a transformer-based image and video tokenizer leveraging Binary Spherical Quantization (BSQ). The method projects high-dimensional visual embeddings onto a lower-dimensional hypersphere followed by binary quantization. BSQ offers three key benefits: (1) parameter efficiency without requiring an explicit codebook, (2) scalability to arbitrary token dimensions, and (3) high compression capability—up to 100× compression of visual data with minimal distortion. The tokenizer architecture includes a transformer encoder-decoder with block-wise causal masking to handle variable-length video inputs. The resulting model, BSQ-ViT, achieves state-of-the-art visual reconstruction performance on image and video benchmarks while delivering 2.4× higher throughput compared to previous best methods. Additionally, BSQ-ViT supports video compression via autoregressive priors for adaptive arithmetic coding, achieving results comparable to leading video compression standards. Furthermore, it enables masked language models to achieve competitive image synthesis quality relative to GAN- and diffusion-based approaches. 
  3. We consider the problem of lossy image compression with deep latent variable models. State-of-the-art methods [Ballé et al., 2018, Minnen et al., 2018, Lee et al., 2019] build on hierarchical variational autoencoders (VAEs) and learn inference networks to predict a compressible latent representation of each data point. Drawing on the variational inference perspective on compression [Alemi et al., 2018], we identify three approximation gaps which limit performance in the conventional approach: an amortization gap, a discretization gap, and a marginalization gap. We propose remedies for each of these three limitations based on ideas related to iterative inference, stochastic annealing for discrete optimization, and bits-back coding, resulting in the first application of bits-back coding to lossy compression. In our experiments, which include extensive baseline comparisons and ablation studies, we achieve new state-of-the-art performance on lossy image compression using an established VAE architecture, by changing only the inference method. 
  4. Time series forecasting with additional spatial information has attracted a tremendous amount of attention in recent research, due to its importance in various real-world applications in social studies, such as conflict prediction and pandemic forecasting. Conventional machine learning methods either consider temporal dependencies only, or treat spatial and temporal relations as two separate autoregressive models, namely, space-time autoregressive models. Such methods suffer when it comes to long-term forecasting or predictions for large-scale areas, due to the high nonlinearity and complexity of spatio-temporal data. In this paper, we propose to address these challenges using spatio-temporal graph neural networks. Empirical results on the Violence Early Warning System (ViEWS) dataset and a U.S. Covid-19 dataset indicate that our method significantly improves performance over the baseline approaches. 
  5. With the ever-increasing amount of 3D data being captured and processed, multi-view image compression is essential to various applications, including virtual reality and 3D modeling. Despite the considerable success of learning-based compression models on single images, limited progress has been made in multi-view image compression. In this paper, we propose an efficient approach to multi-view image compression by leveraging the redundant information across different viewpoints without explicitly using warping operations or camera parameters. Our method builds upon the recent advancements in Multi-Reference Entropy Models (MEM), which were initially proposed to capture correlations within an image. We extend MEM to exploit cross-view correlations in addition to within-image correlations. Specifically, we generate latent representations for each view independently and integrate a cross-view context module within the entropy model. The entropy parameters for each view are estimated autoregressively, leveraging correlations with the previous views. We show that adding this view context module further enhances the compression performance when jointly trained with the autoencoder. Experimental results demonstrate superior performance compared to both traditional and learning-based multi-view compression methods. 