This content will become publicly available on May 4, 2027

Title: ZipLLM: Efficient LLM Storage via Model-Aware Synergistic Data Deduplication and Compression
Modern model hubs, such as Hugging Face, store tens of petabytes of LLMs, with fine-tuned variants vastly outnumbering base models and dominating storage consumption. Existing storage reduction techniques---such as deduplication and compression---are either LLM-oblivious or incompatible with each other, limiting their data reduction effectiveness. Our large-scale characterization study across all publicly available Hugging Face LLM repositories reveals several key insights: (1) fine-tuned models within the same family exhibit highly structured, sparse parameter differences well suited to delta compression; (2) bitwise similarity enables LLM family clustering; and (3) tensor-level deduplication aligns better with model storage workloads, achieving high data reduction with low metadata overhead. Building on these insights, we design BitX, an effective, fast, lossless delta compression algorithm that compresses the XORed difference between fine-tuned and base LLMs. We build ZipLLM, a model storage reduction pipeline that unifies tensor-level deduplication and lossless BitX compression. By synergizing deduplication and compression around LLM family clustering, ZipLLM reduces model storage consumption by 54%, over 20% higher than state-of-the-art deduplication and compression approaches.
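To make the XOR-delta idea concrete, here is a minimal sketch under stated assumptions: same-shape, same-dtype tensors, and zlib standing in for whatever lossless backend the actual pipeline uses. The function names are hypothetical illustrations, not the paper's code.

```python
# Minimal sketch of XOR-based delta compression in the spirit of BitX
# (hypothetical helper names; zlib stands in for the real lossless backend).
import zlib
import numpy as np

def bitx_compress(base: np.ndarray, finetuned: np.ndarray) -> bytes:
    """XOR the raw bit patterns of two same-shaped tensors, then
    losslessly compress the (mostly-zero) difference."""
    assert base.shape == finetuned.shape and base.dtype == finetuned.dtype
    # Reinterpret the floats as bytes so XOR is well-defined.
    delta = np.bitwise_xor(base.view(np.uint8), finetuned.view(np.uint8))
    return zlib.compress(delta.tobytes(), level=6)

def bitx_decompress(base: np.ndarray, blob: bytes) -> np.ndarray:
    delta = np.frombuffer(zlib.decompress(blob), dtype=np.uint8)
    restored = np.bitwise_xor(base.view(np.uint8).ravel(), delta)
    return restored.view(base.dtype).reshape(base.shape)

# Usage: a small, localized fine-tuning change XORs to a sparse delta.
base = np.random.rand(1024, 1024).astype(np.float32)
ft = base.copy()
ft[:4, :4] += 1e-3
blob = bitx_compress(base, ft)
assert np.array_equal(bitx_decompress(base, blob), ft)  # lossless round trip
print(f"{ft.nbytes} bytes -> {len(blob)} compressed")
```

Because fine-tuned weights differ from the base in few, structured bit positions, the XORed stream is dominated by zero bytes and compresses far better than the raw tensors would.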
Award ID(s):
2403313, 2411009, 2322860
PAR ID:
10642981
Publisher / Repository:
23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI '26)
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With the growing adoption of privacy-preserving machine learning algorithms, such as Differentially Private Stochastic Gradient Descent (DP-SGD), training or fine-tuning models on private datasets has become increasingly prevalent. This shift has led to the need for models offering varying privacy guarantees and utility levels to satisfy diverse user requirements. Managing numerous versions of large models introduces significant operational challenges, including increased inference latency, higher resource consumption, and elevated costs. Model deduplication is widely used by model serving and database systems to support high-performance, low-cost inference and model diagnosis queries. However, none of the existing model deduplication approaches considers privacy, leading to unbounded aggregation of privacy costs for certain deduplicated models and to inefficiencies when deduplicating DP-trained models. We formalize the problem of deduplicating DP-trained models for the first time and propose a novel privacy- and accuracy-aware deduplication mechanism to address it. We develop a greedy strategy to select and assign base models to target models so as to minimize storage and privacy costs. When deduplicating a target model, we dynamically schedule accuracy validations and apply the Sparse Vector Technique to reduce the privacy costs associated with private validation data. Compared to baselines, our approach improves the compression ratio by up to 35× for individual models (including large language models and vision transformers). We also observe up to 43× inference speedup due to the reduction of I/O operations.
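For context, the Sparse Vector Technique mentioned above answers a stream of threshold queries while paying privacy cost only for the (few) above-threshold answers. Below is a textbook-style sketch of the classic AboveThreshold variant, with hypothetical parameter names; it is not the paper's implementation.

```python
# Illustrative Sparse Vector Technique (AboveThreshold) as it might gate
# accuracy validations during deduplication; names are hypothetical.
import numpy as np

def svt_above_threshold(queries, threshold, epsilon, max_releases=1):
    """Answer a stream of queries with above/below flags, stopping after
    max_releases above-threshold answers (the only ones that cost privacy)."""
    rng = np.random.default_rng()
    noisy_t = threshold + rng.laplace(scale=2.0 / epsilon)
    released, answers = 0, []
    for q in queries:
        noisy_q = q + rng.laplace(scale=4.0 * max_releases / epsilon)
        if noisy_q >= noisy_t:
            answers.append(True)
            released += 1
            if released >= max_releases:
                break
        else:
            answers.append(False)
    return answers

# Usage: each query is the accuracy drop (in points) from deduplicating one
# more block; SVT flags the first drop that exceeds the tolerance. A large
# epsilon keeps this toy demo legible despite the Laplace noise.
drops = [0.1, 0.2, 0.4, 3.0, 5.0]
print(svt_above_threshold(drops, threshold=2.0, epsilon=10.0))
```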
  2. Scientific simulations generate large amounts of floating-point data that are often not very compressible with traditional reduction schemes such as deduplication or lossless compression. The emergence of lossy floating-point compression holds promise to satisfy the data reduction demands of HPC applications; however, lossy compression has not been widely adopted in science production. We believe a fundamental reason is a lack of understanding of the benefits, pitfalls, and performance of lossy compression on scientific data. In this paper, we conduct a comprehensive study of state-of-the-art lossy compressors, including ZFP, SZ, and ISABELA, using real and representative HPC datasets. Our evaluation reveals the complex interplay between compressor design, data features, and compression performance. The impact of reduced accuracy on data analytics is also examined through a case study of fusion blob detection, offering domain scientists insight into what to expect from fidelity loss. Furthermore, the trial-and-error approach to understanding compression performance involves substantial compute and storage overhead. To this end, we propose a sampling-based estimation method that extrapolates the reduction ratio from data samples to guide domain scientists toward more informed data reduction decisions.
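The sampling-based estimation can be illustrated schematically: compress a few sampled blocks and extrapolate the ratio to the full dataset. In the sketch below, zlib stands in for lossy compressors such as SZ or ZFP so the example stays dependency-free; the block size and sample count are arbitrary choices, not the paper's.

```python
# Sketch of sampling-based compression-ratio estimation: compress a few
# random blocks and extrapolate. zlib is a stand-in for SZ/ZFP.
import zlib
import numpy as np

def estimate_ratio(data: np.ndarray, n_samples=16, block=4096, seed=0):
    raw = data.tobytes()
    rng = np.random.default_rng(seed)
    starts = rng.integers(0, max(1, len(raw) - block), size=n_samples)
    # Per-block ratios, averaged as the estimate for the whole dataset.
    ratios = [block / len(zlib.compress(raw[s:s + block])) for s in starts]
    return float(np.mean(ratios))

# Usage on a smooth, simulation-like field (a random walk).
field = np.cumsum(np.random.default_rng(1).normal(size=1_000_000)).astype(np.float32)
est = estimate_ratio(field)
true = field.nbytes / len(zlib.compress(field.tobytes()))
print(f"estimated {est:.2f}x vs actual {true:.2f}x")
```

The point of the design is that the sampled blocks cost a tiny fraction of the compute and storage of a full trial compression while still tracking the achievable ratio.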
  3. Large language models (LLMs) with billions of parameters, pretrained on massive amounts of data, now achieve performance near or beyond the state of the art on a variety of downstream natural language processing tasks. Neural machine translation (NMT) is one such task to which LLMs have been applied with great success. However, little research has focused on applying LLMs to the more difficult subset of NMT called simultaneous translation (SimulMT), where translation begins before the entire source context is available to the model. In this paper, we address key challenges facing LLMs fine-tuned for SimulMT, validate classical SimulMT concepts and practices in the context of LLMs, explore adapting LLMs fine-tuned for NMT to the task of SimulMT, and introduce Simul-LLM, the first open-source fine-tuning and evaluation pipeline development framework for LLMs focused on SimulMT.
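One of the classical SimulMT policies such work validates for LLMs is wait-k: read k source tokens, then alternate one read with one write. A schematic sketch follows, with `translate_prefix` as a hypothetical stand-in for an LLM decoding call; it is not the Simul-LLM API.

```python
# Sketch of the classical wait-k simultaneous-translation policy.
from typing import Callable, List

def wait_k_decode(source_stream: List[str], k: int,
                  translate_prefix: Callable[[List[str], List[str]], str]) -> List[str]:
    target: List[str] = []
    read = 0
    while True:
        # READ: stay at most k source tokens ahead of the target.
        while read < len(source_stream) and read < len(target) + k:
            read += 1
        # WRITE: emit one target token conditioned on the prefix read so far.
        token = translate_prefix(source_stream[:read], target)
        if token == "<eos>":
            break
        target.append(token)
    return target

# Usage with a toy "translator" that simply copies source tokens.
toy = lambda src, tgt: src[len(tgt)] if len(tgt) < len(src) else "<eos>"
print(wait_k_decode("wir sehen uns morgen".split(), k=2, translate_prefix=toy))
```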
  4. Smart contracts underpin decentralized applications but face significant security risks from vulnerabilities, while traditional analysis methods have limitations. Large Language Models (LLMs) offer promise for vulnerability detection, yet adapting these powerful models efficiently, particularly generative ones, remains challenging. This paper investigates two key strategies for efficiently adapting LLMs to Solidity smart contract vulnerability detection: (1) replacing token-level generation with a dedicated classification head during fine-tuning, and (2) selectively freezing lower transformer layers when fine-tuning with Low-Rank Adaptation (LoRA). Our empirical evaluation demonstrates that the classification-head approach enables models like Llama 3.2 3B to achieve high accuracy (77.5%), rivaling the performance of significantly larger models such as fine-tuned GPT-3.5. Furthermore, we show that selectively freezing bottom layers reduces training time and memory usage by approximately 10-20% with minimal impact on accuracy. Notably, larger models (3B vs. 1B parameters) exhibit greater resilience to layer freezing, maintaining high accuracy even with a large proportion of layers frozen, which suggests that general code understanding is localized in lower layers while task-specific vulnerability patterns reside in upper layers. These findings offer practical insights for developing and deploying performant LLM-based vulnerability detection systems efficiently, particularly in resource-constrained settings.
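A hedged sketch of how the two strategies might be combined with Hugging Face transformers and peft: a sequence-classification head instead of token generation, and LoRA adapters injected only into upper layers so the lower layers stay frozen. The model name, layer split, and hyperparameters are illustrative assumptions, not the paper's exact configuration.

```python
# Sketch: classification head + LoRA restricted to upper layers.
# Model choice and hyperparameters are illustrative (Llama 3.2 3B has 28
# decoder layers; the checkpoint is gated on the Hugging Face Hub).
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

model = AutoModelForSequenceClassification.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    num_labels=2,  # vulnerable vs. not vulnerable
)

config = LoraConfig(
    task_type="SEQ_CLS",
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    # Adapt only the upper half of the stack; lower layers remain frozen,
    # which the paper reports costs little accuracy for larger models.
    layers_to_transform=list(range(14, 28)),
)
model = get_peft_model(model, config)
model.print_trainable_parameters()  # LoRA adapters + classifier head only
```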
  5. Large language models (LLMs) have demonstrated revolutionary capabilities in understanding complex contexts and performing a wide range of tasks. However, LLMs can also answer questions that are unethical or harmful, raising concerns about their applications. To regulate LLMs' responses to such questions, a training strategy called alignment can help. Yet alignment can be unexpectedly compromised when fine-tuning an LLM for downstream tasks. This paper focuses on recovering the alignment lost during fine-tuning. We observe that an aligned LLM has two distinct inherent directions: an aligned direction and a harmful direction. An LLM is inclined to answer questions in the aligned direction while refusing queries in the harmful direction. We therefore propose to recover the harmful direction of the fine-tuned model that has been compromised. Specifically, we restore a small subset of the fine-tuned model's weight parameters from the original aligned model using gradient descent. We also introduce a rollback mechanism to avoid overly aggressive recovery and to maintain downstream task performance. Our evaluation on 125 fine-tuned LLMs demonstrates that our method reduces their harmful rate (the percentage of harmful questions answered) from 33.25% to 1.74% without sacrificing much task performance. In contrast, existing methods either reduce the harmful rate only to a limited extent or significantly impact normal functionality. Our code is available at https://github.com/kangyangWHU/LLMAlignment
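Schematically, the recovery step selects the most-drifted small subset of weights and pulls them back toward the aligned originals, rolling back if task accuracy regresses. The sketch below uses simple interpolation as a gradient-free proxy for the paper's gradient-descent restoration; the function names, the 1% budget, and the regression threshold are all illustrative.

```python
# Schematic weight-restoration-with-rollback sketch (not the paper's code).
import numpy as np

def recover(aligned: np.ndarray, finetuned: np.ndarray, eval_task,
            budget=0.01, step=0.5, max_rounds=10):
    w = finetuned.copy()
    drift = np.abs(finetuned - aligned).ravel()
    k = max(1, int(budget * drift.size))    # restore ~1% of the weights
    idx = np.argpartition(drift, -k)[-k:]   # the most-changed parameters
    baseline = eval_task(w)
    for _ in range(max_rounds):
        snapshot = w.copy()
        flat = w.ravel()                    # view: edits write through to w
        # Interpolate the selected weights back toward the aligned values
        # (a gradient-free proxy for gradient-descent restoration).
        flat[idx] += step * (aligned.ravel()[idx] - flat[idx])
        if eval_task(w) < baseline - 0.02:  # rollback on task regression
            return snapshot
    return w

# Toy usage: "task accuracy" prefers weights near the fine-tuned solution.
aligned = np.zeros(1000)
finetuned = aligned + 0.1
acc = lambda w: 1.0 - float(np.mean(np.abs(w - finetuned)))
recovered = recover(aligned, finetuned, acc)
```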