Image and Video Tokenization with Binary Spherical Quantization

Zhao, Yue; Xiong, Yuanjun; Krähenbühl, Philipp

Citation Details

This work introduces a transformer-based image and video tokenizer built on Binary Spherical Quantization (BSQ). BSQ projects a high-dimensional visual embedding onto a lower-dimensional hypersphere and then applies binary quantization. BSQ offers three key benefits: (1) parameter efficiency, since it requires no explicit codebook, (2) scalability to arbitrary token dimensions, and (3) high compression capability, with up to 100× compression of visual data at minimal distortion. The tokenizer uses a transformer encoder and decoder with block-wise causal masking to handle variable-length video inputs. The resulting model, BSQ-ViT, achieves state-of-the-art visual reconstruction performance on image and video benchmarks while delivering 2.4× higher throughput than the previous best methods. BSQ-ViT also supports video compression: with an autoregressive prior driving adaptive arithmetic coding, it achieves results comparable to leading video compression standards. Finally, it enables masked language models to achieve image synthesis quality competitive with GAN- and diffusion-based approaches.
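The quantization step described in the abstract (project to a lower-dimensional hypersphere, then binarize) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `bsq_quantize`, the random matrix `proj` standing in for a learned projection layer, and the 1/sqrt(L) scaling (chosen here so the quantized vector also has unit norm) are all assumptions made for illustration.

```python
import numpy as np

def bsq_quantize(z, proj):
    """Illustrative sketch only (hypothetical helper, not the paper's code).

    z:    (n, d) array of visual embeddings (rows assumed non-zero after projection).
    proj: (d, L) projection matrix, assumed here; a trained model would learn it.
    """
    L = proj.shape[1]
    v = z @ proj                                        # project d -> L dimensions
    u = v / np.linalg.norm(v, axis=-1, keepdims=True)   # place on the unit hypersphere
    bits = u >= 0                                       # binary quantization: one bit per dimension
    q = np.where(bits, 1.0, -1.0) / np.sqrt(L)          # quantized point, also unit norm
    # Pack the L bits into an integer token id; no codebook lookup is needed.
    weights = 1 << np.arange(L, dtype=np.int64)         # assumes L < 63
    index = bits.astype(np.int64) @ weights
    return q, bits, index

rng = np.random.default_rng(0)
z = rng.normal(size=(4, 16))      # four toy embeddings, d = 16
proj = rng.normal(size=(16, 8))   # L = 8 bits per token, so 2**8 possible tokens
q, bits, index = bsq_quantize(z, proj)
```

Because each token is fully determined by its L sign bits, the implicit vocabulary has 2^L entries without any stored codebook, which is one way to read the parameter-efficiency claim in the abstract.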

Award ID(s): 2505865

PAR ID: 10631957

Author(s) / Creator(s): Zhao, Yue; Xiong, Yuanjun; Krähenbühl, Philipp

Publisher / Repository: https://doi.org/10.48550/arXiv.2406.07548

Date Published: 2024-06-11

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text: Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
