<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Variational Bayesian Quantization</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020 Summer</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10272500</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno type="issn">2640-3498</idno>
<biblScope unit="volume">119</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yibo Yang</author><author>Robert Bamler</author><author>Stephan Mandt</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[We propose a novel algorithm for quantizing continuous latent representations in trained models. Our approach applies to deep probabilistic models, such as variational autoencoders (VAEs), and enables both data and model compression. Unlike current end-to-end neural compression methods that tailor the model to a fixed quantization scheme, our algorithm separates model design and training from quantization. Consequently, our algorithm enables “plug-and-play” compression with a variable rate-distortion trade-off, using a single trained model. Our algorithm can be seen as a novel extension of arithmetic coding to the continuous domain, and uses adaptive quantization accuracy based on estimates of posterior uncertainty. Our experimental results demonstrate the importance of taking posterior uncertainties into account, and show that image compression with the proposed algorithm outperforms JPEG over a wide range of bit rates using only a single standard VAE. Further experiments on Bayesian neural word embeddings demonstrate the versatility of the proposed method.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Latent-variable models have become a mainstay of modern machine learning. Scalable approximate Bayesian inference methods, in particular Black Box Variational Inference <ref type="bibr">(Ranganath et al., 2014;</ref><ref type="bibr">Rezende et al., 2014)</ref>, have spurred the development of increasingly large and expressive probabilistic models, including deep generative probabilistic models such as variational autoencoders <ref type="bibr">(Kingma &amp; Welling, 2014b)</ref> and Bayesian neural networks <ref type="bibr">(MacKay, 1992;</ref><ref type="bibr">Blundell et al., 2015)</ref>. One natural application of deep latent variable modeling is data compression, and recent work has focused on end-to-end procedures that optimize a model for a particular compression objective. Here, we study a related but different problem: given a trained model, what is the best way to encode the information contained in its continuous latent variables?</p><p>As we show, our proposed solution provides a new "plug-and-play" approach to lossy compression of both data instances (represented by local latent variables, e.g., in a VAE) and model parameters (represented by global latent variables that serve as parameters of a Bayesian statistical model). Our approach separates the compression task from model design and training, thus implementing variable-bitrate compression as an independent post-processing step in a wide class of existing latent variable models.</p><p>At the heart of our proposed method lies a novel quantization scheme that optimizes a rate-distortion trade-off by exploiting posterior uncertainty estimates. Quantization is central to lossy compression, as continuous-valued data like natural images, videos, and distributed representations ultimately need to be discretized to a finite number of bits for digital storage or transmission. 
Lossy compression algorithms therefore typically find a discrete approximation of some semantic representation of the data, which is then encoded with a lossless compression method.</p><p>In classical lossy compression methods such as JPEG or MP3, the semantic representation is carefully designed to support compression at variable bitrates. By contrast, state-of-the-art deep-learning-based approaches to lossy data compression <ref type="bibr">(Ball&#233; et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">Rippel &amp; Bourdev, 2017;</ref><ref type="bibr">Mentzer et al., 2018;</ref><ref type="bibr">Lombardo et al., 2019)</ref> are trained to minimize a distortion metric at a fixed bitrate. To support variable-bitrate compression in this approach, one has to train several models for different bitrates. While training several models may be viable in many cases, a bigger issue is the increase in decoder size, as the decoder has to store the parameters of not one but several deep neural networks, one for each bitrate setting. In applications like video streaming under fluctuating connectivity, the decoder further has to load a new deep learning model into memory every time a change in bandwidth requires adjusting the bitrate.</p><p>We instead propose a quantization method for latent variable models that decouples training from compression, and that enables variable-bitrate compression with a single model. We generalize a classical entropy coding algorithm, Arithmetic Coding <ref type="bibr">(Witten et al., 1987;</ref><ref type="bibr">MacKay, 2003)</ref>, from the discrete to the continuous domain. Our proposed algorithm, Variational Bayesian Quantization, exploits posterior uncertainty estimates to automatically reduce the quantization accuracy of latent variables for which the model is uncertain anyway. This strategy is analogous to the way humans communicate quantitative information. 
For example, Wikipedia lists the population of Rome in 2017 as the specific number 2,879,728. By contrast, its population in the year 500 AD is reported only as the round number 100,000 because the high uncertainty would make a more precise number meaningless. Our ablation studies show that this posterior-informed quantization scheme is crucial to obtaining competitive performance.</p><p>In detail, our contributions are as follows:</p><p>&#8226; A new discretization scheme. We present a novel approach to discretizing latent variables in a variational inference framework. Our approach generalizes arithmetic coding from discrete to continuous distributions and takes posterior uncertainty into account.</p><p>&#8226; Single-model compression at variable bitrates. The decoupling of modeling and compression allows us to adjust the trade-off between bitrate and distortion in post-processing. This is in contrast to existing approaches to both data and model compression, which often require specially trained models for each bitrate.</p><p>&#8226; Automatic self-pruning. Deep latent variable models often exhibit posterior collapse, i.e., the variational posterior collapses to the model prior. In our approach, latent dimensions with collapsed posteriors require close to zero bits and thus need no manual pruning.</p><p>&#8226; Competitive experimental performance. We show that our method outperforms JPEG over a wide range of bitrates using only a single model. We also show that we can successfully compress word embeddings with minimal loss, as evaluated on a semantic reasoning task.</p><p>The paper is structured as follows: Section 2 reviews related work in neural compression; Section 3 proposes our Variational Bayesian Quantization algorithm. We give empirical results in Section 4, and conclude in Section 5. Section 6 provides additional theoretical insight about our method.</p></div>
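The uncertainty-adaptive rounding strategy behind the Rome example above can be illustrated with a toy sketch. The helper `adaptive_round` and the illustrative input numbers are our own choices, not part of the proposed algorithm: the idea is simply that the rounding step should scale with the uncertainty, so highly uncertain quantities collapse to cheap round numbers.

```python
import math

def adaptive_round(value, uncertainty):
    """Round `value` to a step size matched to its `uncertainty`:
    quantities known only roughly get coarser, cheaper representations."""
    if uncertainty <= 0:
        return value
    # Use the order of magnitude of the uncertainty as the rounding step.
    step = 10 ** math.floor(math.log10(uncertainty))
    return round(value / step) * step

# A precise modern census survives rounding essentially unchanged...
adaptive_round(2_879_728, uncertainty=1)      # -> 2879728
# ...while a highly uncertain ancient estimate collapses to a round number.
adaptive_round(104_372, uncertainty=50_000)   # -> 100000
```

The proposed algorithm applies the same principle in the latent space, with the posterior standard deviation playing the role of the uncertainty.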
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Work</head><p>Compressing continuous-valued data is a classical problem in the signal processing community. Typically, a distortion measure (often the squared error) and a source distribution are assumed, and the goal is to design a quantizer that optimizes the rate-distortion (R-D) performance <ref type="bibr">(Lloyd, 1982;</ref><ref type="bibr">Berger, 1972;</ref><ref type="bibr">Chou et al., 1989)</ref>. Optimal vector quantization, although theoretically well-motivated <ref type="bibr">(Gallager, 1968)</ref>, is not tractable in high-dimensional spaces <ref type="bibr">(Gersho &amp; Gray, 2012)</ref> and not scalable in practice. Therefore, most classical lossy compression algorithms map data to a suitably designed semantic representation, in such a way that coordinate-wise scalar quantization can be fruitfully applied.</p><p>Recent machine-learning-based data compression methods learn such representations from data rather than hand-designing them, but similar to classical methods, most such ML methods directly take quantization into account in the generative model design or training. Various approaches have been proposed to approximate the non-differentiable quantization operation during training, such as stochastic binarization <ref type="bibr">(Toderici et al., 2016;</ref><ref type="bibr">2017)</ref>, additive uniform noise <ref type="bibr">(Ball&#233; et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">Habibian et al., 2019)</ref>, or other differentiable approximations <ref type="bibr">(Agustsson et al., 2017;</ref><ref type="bibr">Theis et al., 2017;</ref><ref type="bibr">Mentzer et al., 2018;</ref><ref type="bibr">Rippel &amp; Bourdev, 2017)</ref>; many such schemes result in quantization with a uniformly-spaced grid, with the exception of <ref type="bibr">(Agustsson et al., 2017)</ref>, which optimizes the quantization grid points. <ref type="bibr">Yang et al. 
(2020)</ref> considers optimal quantization at compression time, but assumes the fixed quantization scheme of <ref type="bibr">(Ball&#233; et al., 2017)</ref> during training.</p><p>We depart from such approaches by treating quantization as a post-processing step decoupled from model design and training. Crucial to our approach is a new quantization scheme that automatically adapts to different length scales in the representation space, based on posterior uncertainty estimates. To the best of our knowledge, the only prior work that uses posterior uncertainty for compression is in the context of bits-back coding <ref type="bibr">(Honkela &amp; Valpola, 2004;</ref><ref type="bibr">Townsend et al., 2019)</ref>, but these works focus on lossless compression, with the recent exception of <ref type="bibr">(Yang et al., 2020)</ref>.</p><p>Most existing neural image compression methods require training a separate machine learning model for each desired bitrate setting <ref type="bibr">(Ball&#233; et al., 2017;</ref><ref type="bibr">2018;</ref><ref type="bibr">Mentzer et al., 2018;</ref><ref type="bibr">Theis et al., 2017;</ref><ref type="bibr">Lombardo et al., 2019)</ref>. In fact, <ref type="bibr">Alemi et al. (2018)</ref> showed that any particular fitted VAE model only targets one specific point on the rate-distortion curve.</p><p>Our approach has the same benefit of variable-bitrate, single-model compression as methods based on recurrent VAEs <ref type="bibr">(Gregor et al., 2016;</ref><ref type="bibr">Toderici et al., 2016;</ref><ref type="bibr">2017;</ref><ref type="bibr">Johnston et al., 2018)</ref>; but unlike these methods, which use dedicated model architectures for progressive image reconstruction, we instead focus more broadly on quantizing latent representations in a given generative model, designed and trained for a specific application (possibly other than compression, e.g., modeling complex scientific observations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Posterior-Informed Variable-Bitrate Compression</head><p>We now propose an algorithm for quantizing latent variables in trained models. After describing the problem setup and assumptions (Subsection 3.1), we briefly review Arithmetic Coding (Subsection 3.2). Subsection 3.3 describes our proposed lossy compression algorithm, which generalizes Arithmetic Coding to the continuous domain.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Problem Setup</head><p>Generative Model and Variational Inference. We consider a wide class of generative probabilistic models with data x and unknown (or "latent") variables z &#8712; R K from some continuous latent space with dimension K. The generative model is defined by a joint probability distribution,</p><p>p(x, z) = p(z) p(x|z), (1)</p><p>with a prior p(z) and a likelihood p(x|z). Although our presentation focuses on unsupervised representation learning, our framework also captures the supervised setup.<ref type="foot">foot_0</ref> </p><p>Our proposed compression method uses z as a proxy to describe the data x. This requires "solving" Eq. 1 for z given x, i.e., inferring the posterior p(z|x) = p(x, z) / &#8747; p(x, z) dz.</p><p>Since exact Bayesian inference is often intractable, we resort to Variational Inference (VI) <ref type="bibr">(Jordan et al., 1999;</ref><ref type="bibr">Blei et al., 2017;</ref><ref type="bibr">Zhang et al., 2019)</ref>, which approximates the posterior with a so-called variational distribution q &#966; (z|x), obtained by minimizing the Kullback-Leibler divergence D KL (q &#966; (z|x) || p(z|x)) over a set of variational parameters &#966;.</p><p>Factorization Assumptions. We assume that both the prior p(z) and the variational distribution q &#966; (z|x) are fully factorized (mean-field assumption). For concreteness, our examples use a Gaussian variational distribution. Thus,</p><p>p(z) = &#8719; i p(z i ), (2)</p><p>q &#966; (z|x) = &#8719; i N(z i ; &#181; i , &#963; i 2 ), (3)</p><p>where p(z i ) is a prior for the i-th component of z, and the means &#181; i and standard deviations &#963; i together comprise the variational parameters &#966; over which VI optimizes.<ref type="foot">foot_1</ref> </p><p>Prominently, the model class defined by Eqs. 
1-3 includes variational autoencoders (VAEs) <ref type="bibr">(Kingma &amp; Welling, 2014a)</ref> for data compression, but we stress that the class is much wider, also capturing Bayesian neural nets <ref type="bibr">(MacKay, 2003)</ref>, probabilistic word embeddings <ref type="bibr">(Barkan, 2017;</ref><ref type="bibr">Bamler &amp; Mandt, 2017)</ref>, matrix factorization <ref type="bibr">(Mnih &amp; Salakhutdinov, 2008)</ref>, and topic models <ref type="bibr">(Blei et al., 2003)</ref>.</p><p>Protocol Overview. We consider two parties in communication, a sender and a receiver. In the case of data compression, both parties have access to the model, but only the sender has access to the data point x, which it uses to fit a variational distribution q &#966; (z|x). It then uses the algorithm proposed below to select a latent variable vector &#7825; that has high probability under q &#966;, and that can be encoded into a compressed bitstring, which gets transmitted to the receiver. The receiver losslessly decodes the compressed bitstring back into &#7825; and uses the likelihood p(x|&#7825;) to generate a reconstructed data point x&#770;, typically setting x&#770; = arg max x p(x|&#7825;). In the case of model compression, the sender infers a distribution q &#966; (z|x) over model parameters z given training data x, and uses our algorithm to select a suitable vector &#7825; of quantized model parameters. The receiver receives &#7825; and uses it to reconstruct the model.</p><p>The rest of this section describes how the proposed algorithm selects &#7825; and encodes it into a compressed bitstring.</p></div>
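As a deliberately tiny, concrete instance of the setup in Eqs. 1-3, the sketch below evaluates a Monte Carlo estimate of the ELBO for a factorized standard-normal prior, a Gaussian likelihood with mean z, and a mean-field Gaussian variational distribution. All modeling choices and function names here are illustrative, not the paper's experimental models; maximizing this ELBO over (&#181;, &#963;) is equivalent to minimizing D KL (q || p(z|x)).

```python
import numpy as np

rng = np.random.default_rng(0)

def log_prior(z):
    # Factorized standard-normal prior p(z), summed over dimensions.
    return np.sum(-0.5 * z**2 - 0.5 * np.log(2 * np.pi), axis=-1)

def log_likelihood(x, z, noise_std=0.1):
    # Illustrative Gaussian likelihood p(x|z) with mean z.
    return np.sum(-0.5 * ((x - z) / noise_std) ** 2
                  - np.log(noise_std) - 0.5 * np.log(2 * np.pi), axis=-1)

def log_q(z, mu, sigma):
    # Mean-field Gaussian variational distribution q_phi(z|x), as in Eq. 3.
    return np.sum(-0.5 * ((z - mu) / sigma) ** 2
                  - np.log(sigma) - 0.5 * np.log(2 * np.pi), axis=-1)

def elbo(x, mu, sigma, n_samples=1000):
    # Monte Carlo estimate of E_q[log p(x, z) - log q(z|x)];
    # maximizing it over (mu, sigma) minimizes KL(q || p(z|x)).
    z = mu + sigma * rng.standard_normal((n_samples, mu.size))
    return np.mean(log_prior(z) + log_likelihood(x, z) - log_q(z, mu, sigma))
```

For this conjugate toy model the exact posterior is Gaussian with mean x/(1 + noise_std&#178;) per dimension, so the ELBO at those parameters exceeds the ELBO of a mismatched q, which is the quantity VI exploits.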
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Background: Arithmetic Coding</head><p>Our quantization algorithm, introduced in Section 3.3 below, is inspired by a lossless compression algorithm, arithmetic coding (AC) <ref type="bibr">(Witten et al., 1987;</ref><ref type="bibr">MacKay, 2003)</ref>, which we generalize from discrete data to the continuous space of latent variables z &#8712; R K . To get there, we first review the main idea of AC that our proposed algorithm borrows.</p><p>AC is an instance of so-called entropy coding. It uniquely maps messages m &#8712; M from a discrete set M to a compressed bitstring of some length R m (the "bitrate"). Entropy coding exploits prior knowledge of the distribution p(m) of messages to map probable messages to short bitstrings while spending more bits on improbable messages. This way, entropy coding algorithms aim to minimize the expected rate</p><p>E p(m) [R m ] = &#8721; m p(m) R m .</p><p>For lossless compression, the expected rate has a fundamental lower bound, the entropy</p><p>H[p(m)] = E p(m) [h(m)],</p><p>where h(m) = &#8722;log 2 p(m) is the Shannon information content of m. AC provides near-optimal lossless compression as it maps each message m &#8712; M to a bitstring of length R m = &#8968;h(m)&#8969;, where &#8968;&#8226;&#8969; denotes the ceiling function.</p><p>AC is usually discussed in the context of streaming compression, where m is a sequence of symbols from a finite alphabet, as AC improves on this task over the more widely known Huffman coding <ref type="bibr">(Huffman, 1952)</ref>. In our work, we focus on a different aspect of AC: its use of a cumulative probability distribution function to map a nonuniformly distributed random variable m &#8764; p(m) to a number &#958; that is nearly uniformly distributed over the interval [0, 1).</p><p>[Figure 4: MNIST and Frey Faces reconstructions at decreasing bitrate and increasing distortion.]</p></div>
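The interval picture behind AC that the preceding section describes can be sketched in a few lines. This is illustrative code under our own naming, not the finite-precision streaming implementation of Witten et al. (1987): each message owns the sub-interval of [0, 1) spanned by its cumulative probabilities, and its codeword is the shortest binary fraction whose dyadic interval fits entirely inside.

```python
from fractions import Fraction
import math

def interval(probs, m):
    """AC assigns message m the sub-interval [F(m-1), F(m)) of [0, 1),
    whose width equals its probability probs[m]."""
    lo = sum(probs[:m], Fraction(0))
    return lo, lo + probs[m]

def shortest_codeword(lo, hi):
    """Shortest bitstring b1...bR whose dyadic interval
    [0.b1...bR, 0.b1...bR + 2**-R) fits entirely inside [lo, hi);
    its length R is close to the information content -log2(hi - lo)."""
    R = 1
    while True:
        c = math.ceil(lo * 2**R)         # smallest dyadic point >= lo
        if Fraction(c + 1, 2**R) <= hi:  # the whole dyadic cell fits
            return format(c, f"0{R}b")
        R += 1

# p(m) = (1/2, 1/4, 1/4): codeword lengths match the Shannon contents h(m).
p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 4)]
codewords = [shortest_codeword(*interval(p, m)) for m in range(3)]
# codewords == ['0', '10', '11']
```

The proposed algorithm reuses exactly this cumulative-distribution mapping, with the prior CDF playing the role of F.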
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.">Image Compression</head><p>While Section 4.1 demonstrated the proposed VBQ method for model compression, we now apply the same method to data compression using a variational autoencoder (VAE).</p><p>We first provide qualitative insight on small-scale images, and then quantitative results on full-resolution color images.</p><p>Model. For simplicity, we consider regular VAEs with a standard normal prior and a Gaussian variational posterior.</p><p>The generative network parameterizes a factorized categorical or Gaussian likelihood model in the experiments of Sec. 4.2.1 and 4.2.2, respectively. Network architectures are described below and in more detail in the Supplementary Material.</p><p>Baselines. We consider the following baselines:</p><p>&#8226; Uniform quantization: for a given image x, we quantize each dimension of the posterior mean vector &#181;(x) to a uniform grid. We report the bitrate for encoding the resulting quantized latent representation via standard entropy coding (e.g., arithmetic coding). Entropy coding requires prior knowledge of the probabilities of each grid point; here, we use the empirical frequencies of grid points over a subset of the training set;</p><p>&#8226; k-means quantization: similar to "uniform quantization", but with the placement of grid points optimized via k-means clustering on a subset of the training data;</p><p>&#8226; Quantization with the generalized Lloyd algorithm: similar to the above, but with the grid points optimized using the generalized Lloyd algorithm <ref type="bibr">(Chou et al., 1989)</ref>, a widely-used state-of-the-art classical quantization method;</p><p>&#8226; JPEG: we used the libjpeg implementation packaged with the Python Pillow library with its default configuration (e.g., 4:2:0 chroma subsampling), and adjusted the quality parameter to vary the rate-distortion trade-off; &#8226; Deep learning baseline: we compare to <ref type="bibr">Ball&#233; et al. 
(2017)</ref>, who directly optimized for rate and distortion, training a separate model for each point on the R-D curve. In our large-scale experiment, we adopted their model architecture, so their performance essentially represents the end-to-end optimized performance upper bound for our method (which uses a single model).</p></div>
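The "uniform quantization" baseline above can be sketched as follows. Function names and the Laplace smoothing for grid points unseen in training are our own choices (the text does not specify how unseen grid points are handled); the bitrate is the idealized entropy-coding cost &#8722;log 2 p&#770;(k) under the empirical grid-point frequencies.

```python
import numpy as np

def uniform_quantize(mu, step):
    """Map each dimension of a posterior mean vector to the nearest
    point of a uniform grid with the given step size (grid indices)."""
    return np.round(mu / step).astype(int)

def empirical_bitrate(train_indices, test_indices):
    """Estimate the entropy-coding cost (in bits) of test grid indices,
    using empirical frequencies on training data as the coding prior.
    Laplace smoothing (our choice) covers unseen grid points."""
    symbols, counts = np.unique(train_indices, return_counts=True)
    total = counts.sum() + len(symbols)
    prob = {s: (c + 1) / total
            for s, c in zip(symbols.tolist(), counts.tolist())}
    unseen = 1 / total  # fallback probability mass for unseen symbols
    return float(sum(-np.log2(prob.get(k, unseen))
                     for k in test_indices.ravel().tolist()))
```

In contrast to VBQ, this baseline spends the same precision on every latent dimension regardless of its posterior uncertainty, which is the gap the ablations in the text quantify.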
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1.">QUALITATIVE ANALYSIS ON TOY DATASETS</head><p>We trained a VAE on the MNIST dataset and the Frey Faces dataset, using 5- and 4-dimensional latent spaces, respectively. See the Supplementary Material for experimental details.</p><p>Figure <ref type="figure">4</ref> shows example image reconstructions from our VBQ algorithm with increasing &#955;, and thus decreasing bitrate. The right-most column is the extreme case &#955; &#8594; &#8734;, resulting in the shortest possible bitstring encoding &#958; i = (0.1) 2 = 1/2 (i.e., &#7825; i being the median of the prior p(z i )) for every dimension i. As the bitrate decreases (as R(&#958;) &#8594; 0), our method gradually "confuses" the original image with a generic image (roughly in the center of the embedding space), while preserving approximately the same level of sharpness. This is in contrast to JPEG, which typically introduces blocky and/or pixel-level artifacts at lower bitrates.</p></div>
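The &#955; &#8594; &#8734; limit above can be reproduced with a simplified sketch of the underlying mechanism. This sketch fixes the bit budget per dimension, whereas the actual algorithm chooses the precision of each coordinate from the posterior &#963; i via the rate-distortion trade-off; the helper names are ours, and the prior is assumed standard normal as in the paper's VAE setup.

```python
import math

def std_normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def std_normal_quantile(xi, lo=-10.0, hi=10.0):
    # Inverse CDF by bisection (accurate enough for a sketch).
    for _ in range(80):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if std_normal_cdf(mid) < xi else (lo, mid)
    return (lo + hi) / 2

def truncate_to_bits(xi, n_bits):
    """Keep n_bits of xi's binary expansion and append a final 1:
    the center of the dyadic interval of width 2**-n_bits containing xi."""
    return (math.floor(xi * 2**n_bits) + 0.5) / 2**n_bits

def quantize_latent(mu, n_bits):
    """Posterior mean -> prior-CDF space -> truncated binary code -> grid.
    With n_bits = 0 the codeword is (0.1)_2 = 1/2, i.e., the quantized
    latent is the prior median, matching the lambda -> inf limit above."""
    xi_hat = truncate_to_bits(std_normal_cdf(mu), n_bits)
    return std_normal_quantile(xi_hat)
```

With a large bit budget the round trip recovers the latent almost exactly; with zero budget every dimension collapses to the prior median, which produces the generic reconstruction described above.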
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2.">FULL-RESOLUTION COLOR IMAGES</head><p>We apply our VBQ method to a VAE trained on color images, and obtain practical image compression performance rivaling JPEG, while outperforming baselines that ignore posterior uncertainty and directly quantize latent variables.</p><p>Model and Dataset. The inference and generative networks of our VAE are identical to the analysis and synthesis networks of <ref type="bibr">Ball&#233; et al. (2017)</ref>, using 3 layers of 256 filters each in a convolutional architecture. We used a diagonal Gaussian likelihood model, whose mean is computed by the generative net and the variance &#963; 2 is fixed as a hyper-parameter, similar to a &#946;-VAE <ref type="bibr">(Higgins et al., 2017)</ref> approach (&#963; 2 was tuned to 0.001 to ensure the VAE achieved overall good R-D trade-off; see <ref type="bibr">(Alemi et al., 2018)</ref>). We trained the model on the same subset of the ImageNet dataset as used in <ref type="bibr">(Ball&#233; et al., 2017)</ref>. We evaluated performance on the standard Kodak (Kodak) dataset, a separate set of 24 uncompressed color images. As in the word embedding experiment, we also observed that using an empirical prior for our method improved the bitrate; for this, we used the flexible density model of <ref type="bibr">Ball&#233; et al. (2018)</ref>, fitting a different distribution per latent channel, on samples of posterior means &#181; (treating spatial dimensions as i.i.d.).</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>For supervised learning with labels y, we would consider a conditional generative model p(y, z|x) = p(y|z, x) p(z) with conditional likelihood p(y|z, x), where z are the model parameters, treated as a Bayesian latent variable with associated prior p(z).</p></note>
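The role of the fixed likelihood variance &#963; 2 above can be made explicit with a standard identity (notation ours: &#181; &#952; denotes the generative network's mean output and D the data dimension):

```latex
-\log p(x \mid \hat{z})
  = \frac{1}{2\sigma^2}\,\bigl\lVert x - \mu_\theta(\hat{z}) \bigr\rVert_2^2
  + \frac{D}{2}\,\log\bigl(2\pi\sigma^2\bigr).
```

Since the second term does not depend on the latents, the likelihood term of the ELBO weights the squared reconstruction error by 1/(2&#963; 2 ) against the KL term; rescaling the objective shows this is equivalent to a &#946;-VAE objective with &#946; proportional to &#963; 2 , which is why the small value &#963; 2 = 0.001 favors low distortion in the overall R-D trade-off.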
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>These parameters are often amortized by a neural network (in which case &#181;i and &#963;i depend on x), but don't have to (in which case &#181;i and &#963;i do not depend on x and are directly optimized).</p></note>
		</body>
		</text>
</TEI>
