Efficient LLM Inference via Chunked Prefills

Agrawal, Arney; Kedia, Nitin; Panwar, Ashish; Mohan, Jayashree; Kwatra, Nipun; Gulavani, Bhargav S; Tumanov, Alexey; Ramjee, Ramachandran

doi:10.1145/3759441.3759444

This content will become publicly available on August 4, 2026

Large Language Model (LLM) inference serving faces a fundamental challenge due to the distinct characteristics of its two phases: compute-intensive prefill and memory-intensive decode. Existing scheduling strategies often prioritize one phase over the other, leading to a difficult tradeoff between system throughput and request latency. Prefill-prioritizing schedulers improve throughput but introduce significant latency jitter (generation stalls) by interfering with ongoing decodes. Conversely, decode-prioritizing schedulers maintain low latency but underutilize GPU resources, resulting in low throughput. This paper revisits the technique of chunked prefills, demonstrating its efficacy in mitigating this tradeoff. By splitting large prefill computations into smaller, manageable chunks and interleaving them with decode operations using stall-free batching, we can leverage the compute slack inherent in the decode phase. This approach significantly improves serving capacity under strict latency constraints, minimizes generation stalls, and reduces pipeline bubbles in distributed deployments, enabling efficient and responsive inference.
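
The batching idea in the abstract can be illustrated with a minimal sketch. It is not taken from the paper: the `Request` class, the `build_batch` function, and the `token_budget` parameter are hypothetical names chosen for illustration, and a fixed per-iteration token budget is assumed as the way to bound iteration time. Each iteration first admits one token for every request already in the decode phase, then fills whatever budget is left with a chunk of a pending prefill.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record, for illustration only.
    rid: str
    prompt_len: int      # total prompt tokens that must be prefilled
    prefilled: int = 0   # prompt tokens processed so far

    @property
    def in_decode(self) -> bool:
        # A request starts decoding once its whole prompt is prefilled.
        return self.prefilled >= self.prompt_len


def build_batch(requests, token_budget):
    """Build one iteration's batch as a list of (rid, kind, num_tokens).

    Assumption: token_budget caps the tokens processed per iteration.
    """
    batch = []
    budget = token_budget

    # Step 1: ongoing decodes are admitted first, one token each,
    # so a newly arrived prompt cannot stall them.
    for r in requests:
        if r.in_decode and budget > 0:
            batch.append((r.rid, "decode", 1))
            budget -= 1

    # Step 2: the remaining budget is filled with prefill chunks.
    for r in requests:
        if budget == 0:
            break
        if not r.in_decode:
            chunk = min(budget, r.prompt_len - r.prefilled)
            batch.append((r.rid, "prefill", chunk))
            r.prefilled += chunk  # record progress for the next iteration
            budget -= chunk

    return batch
```

With two decoding requests, one 10-token prompt, and a budget of 6, this sketch spreads the prompt over three iterations in chunks of 4, 4, and 2 tokens, and both decodes advance in every one of those iterations.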

Award ID(s): 2420977

PAR ID: 10656320

Author(s) / Creator(s): Agrawal, Arney; Kedia, Nitin; Panwar, Ashish; Mohan, Jayashree; Kwatra, Nipun; Gulavani, Bhargav S; Tumanov, Alexey; Ramjee, Ramachandran

Publisher / Repository: Association for Computing Machinery

Date Published: 2025-08-04

Journal Name: ACM SIGOPS Operating Systems Review

Volume: 59

Issue: 1

ISSN: 0163-5980

Page Range / eLocation ID: 9 to 16

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Journal Article:
https://doi.org/10.1145/3759441.3759444
