Title: Improving the Performance of Out-of-Core LLM Inference Using Heterogeneous Host Memory
The memory footprint of modern applications like large language models (LLMs) far exceeds the memory capacity of the accelerators they run on and often spills over to host memory. As model sizes continue to grow, DRAM-based memory is no longer sufficient to contain these models, resulting in further spill-over to storage and necessitating the use of technologies like Intel Optane and CXL-enabled memory expansion. While such technologies provide more capacity, their higher latency and lower bandwidth have given rise to heterogeneous memory configurations that attempt to strike a balance between capacity and performance. This paper evaluates the impact of such memory configurations on a GPU running out-of-core LLMs. Starting with basic host/device bandwidth measurements on a NUMA system equipped with Optane memory and an NVIDIA A100, we present a comprehensive performance analysis of serving OPT-30B and OPT-175B models using FlexGen, a state-of-the-art serving framework. Our characterization shows that FlexGen's weight placement algorithm is a key bottleneck limiting performance. Based on this observation, we evaluate two alternate weight placement strategies, one optimizing for inference latency and the other for throughput. When combined with model quantization, our strategies improve latency and throughput by 27% and 5×, respectively. These figures are within 9% and 6% of an all-DRAM system, demonstrating how careful data placement can effectively enable the substitution of DRAM with high-capacity but slower memory, improving overall system energy efficiency.
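The abstract does not spell out the placement strategies themselves. As a rough illustration of the general idea, the sketch below greedily pins each layer's weights to the fastest memory tier with capacity to spare, so the slowest tier is touched as little as possible. Tier names, capacities, and layer sizes are hypothetical round numbers; this is not FlexGen's actual algorithm or the paper's measured configuration.

```c
/* Illustrative sketch (not FlexGen's algorithm): greedily pin each
 * layer's weights to the fastest memory tier that still has room.
 * Tiers are ordered fastest-first; all numbers are hypothetical. */
#include <stdio.h>

enum { N_TIERS = 3 };
static const char *tier_name[N_TIERS] = { "gpu_hbm", "dram", "optane" };

/* Returns the chosen tier index for one layer, or -1 if nothing fits. */
static int place_layer(double size_gib, double *free_gib) {
    for (int t = 0; t < N_TIERS; t++) {
        if (free_gib[t] >= size_gib) {
            free_gib[t] -= size_gib;
            return t;
        }
    }
    return -1;
}

int main(void) {
    double free_gib[N_TIERS] = { 40.0, 128.0, 512.0 };  /* GiB per tier */
    /* ~48 decoder layers of ~1.2 GiB each, roughly OPT-30B in FP16 */
    for (int layer = 0; layer < 48; layer++) {
        int t = place_layer(1.2, free_gib);
        printf("layer %2d -> %s\n", layer, t < 0 ? "(no fit)" : tier_name[t]);
    }
    return 0;
}
```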
Award ID(s):
1900803
PAR ID:
10648251
Author(s) / Creator(s):
Publisher / Repository:
2025 IEEE International Symposium on Workload Characterization
Date Published:
Subject(s) / Keyword(s):
Compute Express Link; heterogeneous memory; GPU inference; large language models; non-volatile memory
Format(s):
Medium: X
Location:
Irvine, California
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Storing data structures in high-capacity byte-addressable persistent memory instead of DRAM or a storage device offers the opportunity to (1) reduce cost and power consumption compared with DRAM, (2) decrease the latency and CPU resources needed for an I/O operation compared with storage, and (3) allow for fast recovery as the data structure remains in memory after a machine failure. The first commercial offering in this space is Intel® Optane™ Direct Connect (Optane™ DC) Persistent Memory. Optane™ DC promises access time within a constant factor of DRAM, with larger capacity, lower energy consumption, and persistence. We present an experimental evaluation of persistent transactional memory performance, and explore how Optane™ DC durability domains affect the overall results. Given that neither of the two available durability domains can deliver performance competitive with DRAM, we introduce and emulate a new durability domain, called PDRAM, in which the memory controller tracks enough information (and has enough reserve power) to make DRAM behave like a persistent cache of Optane™ DC memory. In this paper, we compare the performance of these durability domains on several configurations of five persistent transactional memory applications. We find a large throughput difference, which emphasizes the importance of choosing the best durability domain for each application and system. At the same time, our results confirm that recently published persistent transactional memory algorithms are able to scale, and that recent optimizations for these algorithms lead to strong performance, with speedups as high as 6× at 16 threads.
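For context on what a durability domain means to software, here is a minimal sketch, under common x86 assumptions, of how the domain choice changes the persist path a persistent-transactional-memory runtime must execute. The function names are illustrative, and this is not the paper's PDRAM mechanism, which lives in the memory controller rather than in user code.

```c
/* Minimal sketch (x86, illustrative names): how the durability domain
 * changes the persist path. Compile with -mclwb. */
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64

/* ADR-style domain: CPU caches are NOT in the persistence domain, so
 * every store to persistent memory must be written back and fenced
 * before it can be considered durable. */
static void persist_adr(const void *addr, size_t len) {
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    for (; p < (uintptr_t)addr + len; p += CACHELINE)
        _mm_clwb((void *)p);  /* write back one dirty cache line */
    _mm_sfence();             /* order flushes before later stores */
}

/* eADR- or PDRAM-style domain: the platform guarantees that cached
 * (or DRAM-cached) data reaches persistence on power failure, so no
 * per-line flush is needed; a fence for store ordering suffices. */
static void persist_eadr(const void *addr, size_t len) {
    (void)addr; (void)len;
    _mm_sfence();
}
```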
  2.
    Non-volatile memory (NVRAM) based on phase-change memory (such as Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications that have a huge memory footprint. These systems have both DRAM and NVRAM on the same memory channel, with the smaller-capacity DRAM serving as a cache to the larger-capacity NVRAM in the so-called 2LM mode. In this work we analyze the performance of such DRAM caches on real hardware using a broad range of synthetic and real-world benchmarks. We identify three key limitations of DRAM caches in these emerging systems that prevent large-scale, bandwidth-bound applications from taking full advantage of NVRAM read and write bandwidth. We show that software-based techniques are necessary for orchestrating the data movement between DRAM and PMM for such workloads to take full advantage of these new heterogeneous memory systems.
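The 2LM DRAM cache is hardware-managed, so software-orchestrated movement typically means running the memory in flat (App Direct) mode, where the PMM appears as a separate NUMA node, and migrating pages explicitly. Below is a minimal sketch using Linux's move_pages(2); the target node ID and error handling are illustrative, and DRAM/PMM node numbers are platform-specific.

```c
/* Sketch: migrate a buffer's pages to a chosen NUMA node explicitly,
 * as a software-managed alternative to the hardware 2LM cache.
 * Linux-specific; link with -lnuma. Node IDs are platform-specific. */
#define _GNU_SOURCE
#include <numaif.h>    /* move_pages(2) */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

static void migrate_to_node(void *buf, size_t len, int target_node) {
    long pagesz = sysconf(_SC_PAGESIZE);
    size_t n = (len + pagesz - 1) / pagesz;
    void **pages  = malloc(n * sizeof *pages);
    int   *nodes  = malloc(n * sizeof *nodes);
    int   *status = malloc(n * sizeof *status);
    for (size_t i = 0; i < n; i++) {
        pages[i] = (char *)buf + i * pagesz;
        nodes[i] = target_node;  /* e.g., 0 for the local DRAM node */
    }
    /* pid 0 = calling process; MPOL_MF_MOVE moves only pages exclusive
     * to this process. Per-page results land in status[]. */
    if (move_pages(0, n, pages, nodes, status, MPOL_MF_MOVE) < 0)
        perror("move_pages");
    free(pages); free(nodes); free(status);
}
```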
  3.
    The emergence of Intel's Optane DC persistent memory (Optane Pmem) draws much interest in building persistent key-value (KV) stores to take advantage of its high throughput and low latency. A major challenge in these efforts stems from the fact that Optane Pmem is essentially a hybrid storage device with two distinct properties. On one hand, it is a high-speed byte-addressable device similar to DRAM. On the other hand, writes to the Optane media are conducted in units of 256 bytes, much like a block storage device. Existing KV store designs for persistent memory do not take the latter property into account, leading to high write amplification and constraining both write and read throughput. Meanwhile, directly reusing a KV store design intended for block devices, such as an LSM-based one, would incur much higher read latency due to the former property. In this paper, we propose ChameleonDB, a KV store designed specifically for this important hybrid memory/storage device by considering and exploiting both properties in one design. It uses an LSM tree structure to efficiently admit writes with low write amplification. It uses an in-DRAM hash table to bypass the LSM tree's multiple levels for fast reads. At the same time, ChameleonDB may choose to opportunistically maintain the LSM multi-level structure in the background to achieve short recovery time after a system crash. ChameleonDB's hybrid structure is designed to absorb sudden bursts of a write workload, which helps avoid long-tail read latency. Our experimental results show that ChameleonDB improves write throughput by 3.3× and reduces read latency by around 60% compared with a legacy LSM-tree based KV store design. ChameleonDB delivers performance competitive even with KV stores that keep their index fully in DRAM, while using much less DRAM space. Compared with CCEH, a persistent hash table design, ChameleonDB provides 6.4× higher write throughput.
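As a toy illustration of the hybrid structure described above (not ChameleonDB's actual implementation), the sketch below appends records to a log in 256-byte units, matching the Optane write granularity, and keeps a DRAM-resident hash table mapping keys straight to log offsets so reads bypass any multi-level search. Collision chaining and persistence ordering are deliberately omitted.

```c
/* Toy sketch of the hybrid idea: append writes in 256-byte grains
 * (Optane's internal write unit) and keep a DRAM hash table of
 * key -> log offset so reads skip any LSM level search.
 * Not ChameleonDB's implementation; no chaining, no persist barriers. */
#include <stdint.h>
#include <string.h>

#define LOG_GRAIN 256u
#define N_BUCKETS 4096u

static uint8_t  log_area[1u << 20];    /* stands in for the pmem log */
static uint32_t log_tail;              /* bytes used so far */
static uint32_t dram_index[N_BUCKETS]; /* key -> offset + 1 (0 = empty) */

static uint32_t hash(uint64_t key) {
    return (uint32_t)(key * 0x9E3779B97F4A7C15ull) % N_BUCKETS;
}

/* Append a record padded to the 256-byte grain; index it in DRAM. */
static int put(uint64_t key, const void *val, uint32_t len) {
    uint32_t need = (12 + len + LOG_GRAIN - 1) / LOG_GRAIN * LOG_GRAIN;
    if (log_tail + need > sizeof log_area) return -1;  /* log full */
    memcpy(log_area + log_tail, &key, 8);
    memcpy(log_area + log_tail + 8, &len, 4);
    memcpy(log_area + log_tail + 12, val, len);
    dram_index[hash(key)] = log_tail + 1;              /* fast-read path */
    log_tail += need;
    return 0;
}

/* Read via the DRAM index: one hash probe, one log access. */
static const void *get(uint64_t key, uint32_t *len) {
    uint32_t slot = dram_index[hash(key)];
    if (!slot) return NULL;
    uint8_t *rec = log_area + (slot - 1);
    if (memcmp(rec, &key, 8) != 0) return NULL;  /* toy: no chaining */
    memcpy(len, rec + 8, 4);
    return rec + 12;
}
```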
  4.
    Emerging hybrid memory systems that comprise technologies such as Intel's Optane DC Persistent Memory exhibit disparities in the access speeds and capacity ratios of their heterogeneous memory components. This breaks many assumptions and heuristics designed for traditional DRAM-only platforms. High application performance is feasible via dynamic data movement across memory units, which maximizes the capacity use of DRAM while ensuring efficient use of the aggregate system resources. Newly proposed solutions use performance models and machine intelligence to optimize which and how much data to move dynamically. However, the decision of when to move this data is based on empirical selection of time intervals, or left to the applications. Our experimental evaluation shows that failure to properly configure the data movement frequency can lead to 10%–100% performance degradation for a given data movement policy; yet, there is no established methodology for properly configuring this value for a given workload, platform, and policy. We propose Cori, a system-level tuning solution that identifies and extracts the necessary application-level data reuse information, and guides the selection of data movement frequency to deliver gains in application performance and system resource efficiency. Experimental evaluation shows that Cori configures data movement frequencies that provide application performance within 3% of the optimal one, and that it can achieve this up to 5× more quickly than random or brute-force approaches. System-level validation of Cori on a platform with DRAM and Intel's Optane DC PMEM confirms its practicality and tuning efficiency.
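The core intuition, that the migration interval should track how often the application reuses the migrated data, can be put in back-of-the-envelope form. The sketch below is a hypothetical amortization model with made-up bandwidth and reuse numbers; it is not Cori's tuning algorithm.

```c
/* Hypothetical amortization model (not Cori's): weigh the one-time
 * cost of moving a region to the fast tier against the time saved on
 * each subsequent reuse within one migration period. Assumes every
 * reuse re-reads the whole region; all numbers are made up. */
#include <stdio.h>

static double net_benefit_s(double bytes, double period_s,
                            double reuses_per_s,
                            double fast_gbps, double slow_gbps) {
    double move_cost = bytes / (slow_gbps * 1e9);        /* one migration */
    double saved_per_reuse = bytes / (slow_gbps * 1e9)
                           - bytes / (fast_gbps * 1e9);  /* per touch */
    return reuses_per_s * period_s * saved_per_reuse - move_cost;
}

int main(void) {
    double bytes = 2e9, reuses = 5.0, fast = 100.0, slow = 30.0;
    for (double period = 0.125; period <= 8.0; period *= 2)
        printf("period %6.3f s -> net benefit %+.3f s per interval\n",
               period, net_benefit_s(bytes, period, reuses, fast, slow));
    return 0;
}
```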
  5. Die-stacked DRAM (a.k.a. on-chip DRAM) provides much higher bandwidth and lower latency than off-chip DRAM. It is a promising technology to break the “memory wall”. Die-stacked DRAM can be used either as a cache (i.e., DRAM cache) or as a part of memory (PoM). A DRAM cache design would suffer from more page faults than a PoM design as the DRAM cache cannot contribute towards the capacity of main memory. At the same time, obtaining high performance requires PoM systems to swap requested data to the die-stacked DRAM. Existing PoM designs fall into two categories: line-based and page-based. The former ensures low off-chip bandwidth utilization but suffers from a low hit ratio of on-chip memory due to limited temporal locality. In contrast, page-based designs achieve a high hit ratio of on-chip memory albeit at the cost of moving large amounts of data between on-chip and off-chip memories, leading to increased off-chip bandwidth utilization and significant system performance degradation. To achieve a similarly high hit ratio of on-chip memory as page-based designs, and eliminate the excessive off-chip traffic involved, we propose SELF, a high-performance and bandwidth-efficient approach. The key idea is to SElectively swap Lines in a requested page that are likely to be accessed according to page Footprint, instead of blindly swapping an entire page. In doing so, SELF allows incoming requests to be serviced from the on-chip memory as much as possible, while avoiding swapping unused lines to reduce memory bandwidth consumption. We evaluate a memory system which consists of 4GB on-chip DRAM and 12GB off-chip DRAM. Compared to a baseline system that has the same total capacity of 16GB off-chip DRAM, SELF improves the performance in terms of instructions per cycle by 26.9%, and reduces the energy consumption per memory access by 47.9% on average. In contrast, state-of-the-art line-based and page-based PoM designs can only improve the performance by 9.5% and 9.9%, respectively, against the same baseline system.
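To make the selective-swap idea concrete, here is a small sketch in which a 64-bit footprint mask records which 64-byte lines of a 4KB page were touched, and a swap-in fetches only those lines into on-chip memory. This illustrates the policy, not SELF's hardware implementation.

```c
/* Sketch of footprint-driven selective swapping: a 64-bit mask marks
 * which 64B lines of a 4KB page were touched during profiling; on a
 * swap-in, only the marked lines move on-chip. Policy illustration
 * only, not SELF's hardware design. */
#include <stdint.h>
#include <string.h>

#define LINE_SIZE  64u
#define PAGE_LINES 64u   /* 4KB page = 64 lines of 64B */

/* Record a touch at byte offset `off` within the page. */
static void footprint_touch(uint64_t *mask, uint32_t off) {
    *mask |= 1ull << (off / LINE_SIZE);
}

/* Swap in only previously-used lines; untouched lines stay off-chip.
 * Returns the bytes actually transferred (vs. 4096 for a full page). */
static uint32_t swap_in_selected(uint8_t *onchip, const uint8_t *offchip,
                                 uint64_t mask) {
    uint32_t moved = 0;
    for (uint32_t line = 0; line < PAGE_LINES; line++) {
        if (mask & (1ull << line)) {
            memcpy(onchip + line * LINE_SIZE,
                   offchip + line * LINE_SIZE, LINE_SIZE);
            moved += LINE_SIZE;
        }
    }
    return moved;
}
```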