<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Optimizing JPEG Quantization for Classification Networks</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10192532</idno>
					<idno type="doi"></idno>
					<title level='j'>Resource-Constrained Machine Learning (ReCoML)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>ZhijingDe Sa Li</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Deep learning for computer vision depends on lossy image compression: it reduces the storage required for training and test data and lowers transfer costs in deployment. Mainstream datasets and imaging pipelines all rely on standard JPEG compression. In JPEG, the degree of quantization of frequency coefficients controls the lossiness: an 8x8 quantization table (Q-table) decides both the quality of the encoded image and the compression ratio. While a long history of work has sought better Q-tables, existing work seeks either to minimize image distortion or to optimize for models of the human visual system. This work asks whether JPEG Q-tables exist that are “better” for specific vision networks and can offer better quality–size trade-offs than ones designed for human perception or minimal distortion. We reconstruct an ImageNet test set with higher resolution to explore the effect of JPEG compression under novel Q-tables. We attempt several approaches to tune a Q-table for a vision task. We find that a simple sorted random sampling method can exceed the performance of the standard JPEG Q-table. We also use hyper-parameter tuning techniques including bounded random search, Bayesian optimization, and composite heuristic optimization methods. The new Q-tables we obtained can improve the compression rate by 10% to 200% when the accuracy is fixed, or improve accuracy by up to 2% at the same compression rate.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Deep neural networks for vision have extreme storage requirements. For example, ImageNet is 150 GB <ref type="bibr">(Deng et al., 2009)</ref> and the Open Images Dataset requires 500 GB <ref type="bibr">(Kuznetsova et al., 2018)</ref>. If a model can achieve equivalent performance with images at a higher compression ratio, it can be cheaper to acquire, transmit, and store the necessary datasets.</p><p>Nearly all DNN datasets use the ubiquitous JPEG compression standard, including ImageNet <ref type="bibr">(Deng et al., 2009)</ref>, PASCAL VOC <ref type="bibr">(Everingham et al.)</ref>, and COCO <ref type="bibr">(Lin et al., 2014)</ref>. Because JPEG is a lossy compression format, DNNs must cope with some quality distortion in these images <ref type="bibr">(Dodge &amp; Karam, 2016)</ref>. However, the JPEG standard is tuned for the human visual system (HVS) and preserves the components that are most relevant to human observers, rather than to computer vision algorithms. This work asks whether machine learning pipelines would perform better if JPEG were instead tuned specifically for a given DNN.</p><p>Correspondence to: Zhijing Li &lt;zl679@cornell.edu&gt;.</p><p>In the JPEG compression algorithm, each spatial component (i.e., each YCbCr channel) is first partitioned into 8&#215;8 non-overlapping blocks. Each block is then transformed from spatial components to frequency components with the 2D discrete cosine transform (DCT). Next, the algorithm quantizes the frequency components using a quantization table, or Q-table. The Q-table's scale is determined by a quality factor ranging from 1 to 100 <ref type="bibr">(LuaDist, 2015)</ref>. For instance, a quality factor of 100 scales every Q-table coefficient down to the minimum value of 1, so the frequency components are simply rounded to integers. Finally, the quantized coefficients are serialized using a zig-zag ordering and losslessly compressed with run-length and Huffman (entropy) coding.</p></div>
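<div xmlns="http://www.tei-c.org/ns/1.0"><p>The quality-factor scaling and quantization steps described above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes the integer scaling rule used by libjpeg and the example luminance table from Annex K of the JPEG standard, and the names scale_q_table and quantize_block are hypothetical.</p>
<ab><![CDATA[
```python
# Example luminance Q-table from Annex K of the JPEG standard (row-major).
STANDARD_LUMA = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]


def scale_q_table(base, quality):
    """Scale a base Q-table by a quality factor in [1, 100] (libjpeg-style rule)."""
    if not 1 <= quality <= 100:
        raise ValueError("quality must be in [1, 100]")
    # Percentage scale: 100 at quality 50, larger below 50, 0 at quality 100.
    scale = 5000 // quality if quality < 50 else 200 - 2 * quality
    # Round to nearest integer and clamp to the 8-bit range [1, 255].
    return [[min(255, max(1, (q * scale + 50) // 100)) for q in row] for row in base]


def quantize_block(dct_block, q_table):
    """Divide each DCT coefficient by its Q-table entry and round to an integer.

    Python's round() uses round-half-to-even; real encoders may round differently.
    """
    return [
        [int(round(c / q)) for c, q in zip(c_row, q_row)]
        for c_row, q_row in zip(dct_block, q_table)
    ]
```
]]></ab>
<p>Under this rule, quality 50 returns the base table unchanged and quality 100 sets every entry to 1, so quantization reduces to rounding each coefficient to an integer.</p></div>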
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Quantization is the lossy part of the algorithm. The Q-table decides how many bits to use for each frequency bucket, so it potentially plays a critical role for DNNs that use JPEG-compressed images. However, existing research on designing Q-tables mostly prioritizes human-perceptible distortion as the compression target <ref type="bibr">(Liu et al., 2018)</ref>, which may compress and quantize away features important to DNNs <ref type="bibr">(Wright et al., 2009)</ref>. Some prior work instead targets DNN accuracy <ref type="bibr">(Liu et al., 2018)</ref>, but it only considers recompressing existing JPEG-compressed datasets rather than changing the way data is compressed in the first place.</p><p>In this work, we use a high-resolution dataset to simulate the effect of compressing raw pixels with novel, task-specific quantization. We use a range of optimization methods to tune a customized Q-table for a specific DNN that max[...] dataset to simulate raw pixel data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Constructing a Dataset</head><p>ImageNet is already downsized and lossily compressed. For instance, the ImageNet 2013 classification dataset has an average resolution of 482&#215;415 pixels <ref type="bibr">(Russakovsky et al., 2015)</ref>. Instead, we turn to ImageNetV2, a new test dataset for ImageNet <ref type="bibr">(Recht et al., 2019)</ref>. In ImageNetV2, the images are also downsized and compressed, but the authors have open-sourced the code to generate the dataset as well as the IDs that let us re-download the original, large images from Flickr <ref type="bibr">(Fli)</ref>. The large images are also JPEG-compressed, but they have an average resolution of 1933&#215;1592 pixels. We then downsize them to a smaller size compatible with ImageNet. The resizing removes JPEG compression artifacts and simulates uncompressed images at those sizes.</p><p>ImageNetV2 contains three test sets representing different sampling strategies: MatchedFrequency, TopImages, and Threshold0.7. Each test set has 1000 classes with 10 images per class. We use MatchedFrequency as our training dataset; it is sampled to match the MTurk selection frequency distribution of the original ImageNet validation set for each class.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Tuning Methodology</head><p>Hyper-parameter tuning for all kinds of DNNs is unrealistic. We focus on one particular DNN: ResNet50 <ref type="bibr">(He et al., 2016)</ref> as implemented in PyTorch <ref type="bibr">(Paszke et al., 2017)</ref>. We use 500 classes with 5 images each from ImageNetV2's MatchedFrequency dataset to speed up compression. The optimization targets are top-1 accuracy and the compression rate, calculated as the ratio of the size of raw bitmap data to compressed JPEG images.</p></div>
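<div xmlns="http://www.tei-c.org/ns/1.0"><p>The two optimization targets can be written down as a short sketch. It assumes that the raw bitmap size is width &#215; height &#215; 3 bytes (8-bit, three channels) and that the compression rate is aggregated over the whole dataset rather than averaged per image; neither detail is stated in the text, and the function names are hypothetical.</p>
<ab><![CDATA[
```python
def compression_rate(image_sizes, jpeg_byte_counts, channels=3):
    """Ratio of total raw bitmap bytes to total compressed JPEG bytes.

    image_sizes: list of (width, height) pairs, one per image.
    jpeg_byte_counts: list of compressed file sizes in bytes, same order.
    Assumes 8 bits per channel, so a raw image takes width * height * channels bytes.
    """
    raw_bytes = sum(w * h * channels for w, h in image_sizes)
    return raw_bytes / sum(jpeg_byte_counts)


def top1_accuracy(predicted_labels, true_labels):
    """Fraction of images whose highest-scoring predicted class is the true class."""
    if len(predicted_labels) != len(true_labels):
        raise ValueError("label lists must have the same length")
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)
```
]]></ab>
<p>A higher compression rate means smaller files, so a candidate Q-table is preferable when it raises either quantity without lowering the other.</p></div>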
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">SORTED RANDOM SEARCH</head><p>We start with a simple method that randomly samples Q-tables. The idea is to establish a baseline that shows how hard it is to find, among the space of all Q-tables, new ones that match the performance of the standard JPEG tables, or that outperform standard JPEG for a specific DNN on a specific vision task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Method</head><p>The search space for a uniform random search is 256^64 &#8776; 1.34 &#215; 10^154, so we need to reduce it. Lower frequencies probably matter more than higher frequencies. That is why the entries in the upper left of the standard table are small (fine quantization) and those in the lower right are large (coarse quantization), as shown in Fig. <ref type="figure">1</ref>. A simple sorting strategy can restrict sampling to tables that preserve this ordering. We generate 64 random numbers in the range [s, e], where s, e &#8712; Z and 1 &#8804; s &lt; e &#8804; 255, sort them, and build an 8&#215;8 table using the zig-zag ordering. We refer to this Q-table sampling strategy as sorted random search.</p></div>
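<div xmlns="http://www.tei-c.org/ns/1.0"><p>The sampling procedure above can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' implementation: the values are sorted in ascending order so that the smallest entries land on the lowest frequencies (matching the standard table), sampling is uniform over the integers in [s, e], and the helper names are hypothetical.</p>
<ab><![CDATA[
```python
import random


def zigzag_positions(n=8):
    """Return the (row, col) positions of an n x n block in JPEG zig-zag order."""
    order = []
    for s in range(2 * n - 1):  # s = row + col indexes each anti-diagonal
        rows = range(max(0, s - n + 1), min(s, n - 1) + 1)
        if s % 2 == 0:
            rows = reversed(rows)  # even diagonals run from bottom-left to top-right
        order.extend((r, s - r) for r in rows)
    return order


def sample_sorted_q_table(s, e, rng=random):
    """Draw 64 integers uniformly from [s, e], sort them, and lay them out in zig-zag order."""
    if not 1 <= s < e <= 255:
        raise ValueError("require 1 <= s < e <= 255")
    values = sorted(rng.randint(s, e) for _ in range(64))  # ascending (assumed)
    table = [[0] * 8 for _ in range(8)]
    for value, (r, c) in zip(values, zigzag_positions()):
        table[r][c] = value
    return table
```
]]></ab>
<p>Passing a seeded random.Random instance as rng makes a sampled table reproducible, which helps when a promising table has to be re-evaluated later.</p></div>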
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Performance</head><p>Fig. <ref type="figure">2</ref> compares, in terms of compression rate and accuracy, 4000 Q-tables sampled using sorted random search, 1000 from uniform random search, and the standard Q-tables with quality factors from 10 to 100 at an interval of 5. We also plot the Pareto frontier for sorted random search. With uniform random search, we cannot find any Q-table that outperforms the standard JPEG Q-table. Sorted random search finds Pareto-optimal points that outperform standard JPEG by 1% to 2% in top-1 accuracy at the same compression rate when the quality factor of standard JPEG is in the range [15, 90]. Given the same accuracy, sorted random search can find Q-tables with compression rates 10% to 200% higher than standard JPEG, depending on the baseline compression rate. We further take a close look at the points whose compression rate and accuracy are closest to those of the quality-50 Q-table of standard JPEG, where the standard table is scaled at 100%. The improvements in accuracy and compression rate are 1.5% and 12.9%, respectively. Fig. <ref type="figure">3</ref> shows the distribution of sorted random points with compression rates between 22 and 22.2. Among 52 points, 48 perform at least as well as the standard JPEG Q-table, of which 93.75% perform better. Given the strong randomness of sorted random search, this result indicates that it can easily find many Q-tables that outperform standard JPEG.</p><p>To demonstrate the difference between the PSNR and DNN optimization goals, Fig. <ref type="figure">4</ref> plots the PSNR of the sorted random search points and highlights the Pareto frontier in terms of compression rate and accuracy. Although PSNR is closely related to DNN accuracy, it cannot be taken as the sole indicator of high accuracy.</p></div>
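<div xmlns="http://www.tei-c.org/ns/1.0"><p>The Pareto frontier used in this comparison can be extracted from the sampled points with a short routine. The sketch below assumes each point is a (compression rate, accuracy) pair where higher is better in both coordinates; the function name is hypothetical.</p>
<ab><![CDATA[
```python
def pareto_frontier(points):
    """Return the non-dominated (compression_rate, accuracy) points.

    A point is dominated if another point is at least as good in both
    coordinates and strictly better in one. The result is sorted by
    ascending compression rate.
    """
    frontier = []
    best_acc = float("-inf")
    # Scan from the highest compression rate down; ties are ordered by accuracy
    # so the best point at a given rate is seen first.
    for cr, acc in sorted(points, key=lambda p: (-p[0], -p[1])):
        if acc > best_acc:  # strictly better accuracy than anything with rate >= cr
            frontier.append((cr, acc))
            best_acc = acc
    return frontier[::-1]
```
]]></ab>
<p>Because the scan is a single pass over sorted points, the cost is dominated by the sort, which is negligible next to compressing the dataset for each trial.</p></div>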
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">BOUNDED SEARCH</head><p>The random search method from the previous section samples from a tremendously large search space but uses an ordering constraint to make the search tractable. In this section, we hypothesize that, if we limit the search space of Q-table components to a smaller range, we can eliminate the ordering constraint and find better tables using more sophisticated optimization methods.</p><p>To limit the search space, we need to focus on a particular compression rate range. We focus on a regime close to standard JPEG at quality 50, where the compression rate is approximately 22. Let P be the set of Q-tables on the Pareto frontier and P&#8242; &#8838; P be the set of Q-tables with com[...] bounded random search. Overall, after cross validation, the complex algorithms do not show any advantage over our simple baseline algorithm, sorted random search.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Significance of Improvement</head><p>To check whether the improvement of our methods over standard JPEG is significant, we perform a statistical analysis of the accuracy improvement at the closest compression rate. We test one Q-table obtained by each method against standard JPEG with quality set to 50. To avoid unintentionally favoring our tables, we choose the point with the closest compression rate that is larger than that of the quality-50 standard Q-table, i.e., we allow accuracy to be a little worse. For each Q-table, we randomly sample 100 datasets from the original ones, each with 700 classes and 4 images per class. Then we compare the difference in mean accuracy and check significance using Student's t-test.</p><p>Table <ref type="table">1</ref> shows the difference in mean top-1 accuracy between standard JPEG and the Q-tables obtained by each method, on the MatchedFrequency and ImageNet Validation datasets. All methods find Q-tables with statistically significant accuracy improvement.</p></div>
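<div xmlns="http://www.tei-c.org/ns/1.0"><p>The resampling and test statistic described above might look like the following sketch. It makes several assumptions that the text does not confirm: classes and images are drawn without replacement, the test is the unpaired two-sample form with pooled variance, and only the t statistic and degrees of freedom are computed (a p-value would additionally need the t distribution). All names are hypothetical.</p>
<ab><![CDATA[
```python
import math
import random
import statistics


def resample_dataset(images_by_class, n_classes, per_class, rng):
    """Pick n_classes classes and per_class images from each, without replacement."""
    classes = rng.sample(sorted(images_by_class), n_classes)
    return {c: rng.sample(images_by_class[c], per_class) for c in classes}


def student_t_statistic(sample_a, sample_b):
    """Two-sample Student's t statistic with pooled variance.

    Returns (t, degrees_of_freedom). Assumes equal variances in both samples.
    """
    na, nb = len(sample_a), len(sample_b)
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    pooled = ((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2)
    t = (mean_a - mean_b) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2
```
]]></ab>
<p>In this setting, sample_a and sample_b would hold the 100 per-dataset accuracies for a tuned Q-table and for the quality-50 standard table, respectively.</p></div>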
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Efficiency</head><p>Dataset compression is expensive. For instance, the time to compress our training dataset and measure accuracy is approximately 3 minutes. A good method should find optimal Q-tables in as few trials as possible and the decision time for each trial should be short.</p><p>Table <ref type="table">2</ref> shows the results of profiling each method on a server with two Intel Xeon(R) E5-2620 v4 CPUs with 8 cores per socket and 2 threads per core. The table lists two factors affecting efficiency: the decision time for an algorithm to decide the next point to search, and the number of trials required to generate the first 10 good points. "Good" points are defined as those with Acc &#8722; fitness(CR) &gt; &#8722;0.001. The decision time is given as the average over 100 trials, rounded to two significant digits. Among all methods, </p></div>
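<div xmlns="http://www.tei-c.org/ns/1.0"><p>The trial-count metric can be illustrated with a small helper. The definition of fitness(CR) is not available in this part of the text, so the sketch takes it as a caller-supplied function; that parameter and the helper name are hypothetical, and only the threshold of &#8722;0.001 and the count of 10 good points come from the text.</p>
<ab><![CDATA[
```python
def trials_until_k_good(trials, fitness, k=10, threshold=-0.001):
    """Return the 1-based index of the trial that yields the k-th good point.

    trials: iterable of (compression_rate, accuracy) pairs in search order.
    fitness: hypothetical reference curve mapping a compression rate to an accuracy.
    A point is good when accuracy - fitness(compression_rate) > threshold.
    Returns None if fewer than k good points are found.
    """
    good = 0
    for index, (cr, acc) in enumerate(trials, start=1):
        if acc - fitness(cr) > threshold:
            good += 1
            if good == k:
                return index
    return None
```
]]></ab>
<p>Multiplying the returned count by the per-trial cost (roughly 3 minutes of compression and evaluation plus the method's decision time) gives a rough estimate of the wall-clock time a method needs.</p></div>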
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION AND FUTURE WORK</head><p>Our research shows the potential for accuracy and compression rate improvement by redesigning the JPEG Q-table, a topic that lacks attention from researchers in both the image compression and DNN communities. All our proposed methods obtain better Q-tables than the standard one, with an accuracy gain of 1% to 2% when the compression rate is fixed and a compression rate increase of 10% to 200% at the same accuracy. Although a composite heuristic optimizer stands out in efficiency and accuracy, we recommend a simple sampling technique, sorted random search, which can find good quantization tables across a wide range of compression-accuracy trade-offs. The resulting improved Q-tables outperform standard JPEG for our vision task in cross validation.</p><p>In this work, we consider the DNN fixed and only tune compression. But the network was trained on data compressed with standard JPEG and therefore might perform worse on images compressed with our new Q-tables. Fine-tuning the network might help it adjust to the new image characteristics, further improving accuracy with nonstandard quantization. We have done a preliminary retraining experiment, which shows a positive result. Avenues for future work include further work on DNN retraining and applying our techniques to more vision applications such as object detection and semantic segmentation.</p></div></body>
		</text>
</TEI>
