Does Quantization Improve Inference Speed? It Depends

Farrukh, Ahmed; Saeed, Mohamed; Fund, Fraida

This content will become publicly available on May 19, 2026

Quantization is often cited as a technique for reducing model size and accelerating deep learning. However, past literature suggests that the effect of quantization on latency varies significantly across different settings, in some cases even increasing inference time rather than reducing it. To address this discrepancy, we conduct a series of systematic experiments on the Chameleon testbed to investigate the impact of three key variables on the effect of post-training quantization: the machine learning framework, the compute hardware, and the model itself. Our experiments demonstrate that each of these has a substantial impact on the overall inference time of a quantized model. Furthermore, we make experiment materials and artifacts publicly available so that others can validate our findings on the same hardware using Chameleon, and we share open educational resources on this topic that may be adopted in formal and informal education settings.

Award ID(s): 2230079

PAR ID: 10636811

Author(s) / Creator(s): Farrukh, Ahmed; Saeed, Mohamed; Fund, Fraida

Publisher / Repository: IEEE

Date Published: 2025-05-19

ISBN: 979-8-3315-0938-5

Page Range / eLocation ID: 187-193

Format(s): Medium: X

Location: Tromsø, Norway

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Conference Paper:
The DOI is not currently available.
