This content will become publicly available on February 1, 2026

Title: CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems
Data deduplication relies on a chunk index to identify redundancy among incoming chunks. As backup data scales, it is impractical to keep the entire chunk index in memory, so an index lookup must search a portion of the on-storage index, causing a dramatic drop in index lookup throughput. Existing studies propose searching only a subset of the whole index (a partial index) to limit storage I/Os and sustain high index lookup throughput. However, several core design factors of partial indexing have not been fully explored. In this paper, we first comprehensively investigate the trade-offs of using different meta-groups, sampling methods, and meta-group selection policies for a partial index. We then propose a Collaborative Partial Index (CPI), which leverages two meta-groups, recipe-segment and container-catalog, to identify unique chunks more efficiently and effectively. CPI further introduces a hook-entry sharing technique and a two-stage eviction policy to reduce memory usage without hurting the deduplication ratio. In our evaluation, under the same memory usage and storage I/O constraints, CPI achieves a 1.21x-2.17x higher deduplication ratio than state-of-the-art partial indexing schemes. Alternatively, when the same deduplication ratio is targeted, CPI achieves 1.8x-4.98x higher index lookup throughput than the alternatives. Compared with full indexing, CPI's maximum deduplication ratio is only 4.07% lower, while its throughput is 37.1x-122.2x that of full indexing, depending on the storage I/O constraint.
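To make the partial-indexing idea concrete, here is a minimal Python sketch of a sampled ("hook") in-memory index that prefetches an on-storage container catalog on a hook hit. The class and method names (PartialIndex, lookup), the sampling mask, and the LRU catalog cache are illustrative assumptions, not CPI's actual design or code.

```python
import hashlib
from collections import OrderedDict

class PartialIndex:
    """Toy partial index: sampled hook entries in memory, full catalogs on storage."""

    def __init__(self, catalogs, sample_mask=0x0F, cache_limit=4):
        self.catalogs = catalogs          # container_id -> set of fingerprints ("on storage")
        self.hooks = {}                   # sampled fingerprint -> container_id (in memory)
        self.cache = OrderedDict()        # recently fetched catalogs, LRU order
        self.cache_limit = cache_limit
        for cid, fps in catalogs.items():
            for fp in fps:
                if (fp[0] & sample_mask) == 0:     # content-based sampling (~1/16 of chunks)
                    self.hooks[fp] = cid

    def lookup(self, chunk):
        """Return the container id holding this chunk, or None if it looks unique."""
        fp = hashlib.sha1(chunk).digest()
        for cid, fps in self.cache.items():        # 1) cached catalogs: no storage I/O
            if fp in fps:
                self.cache.move_to_end(cid)
                return cid
        cid = self.hooks.get(fp)
        if cid is not None:                        # 2) hook hit: one storage I/O
            self.cache[cid] = self.catalogs[cid]   #    prefetch the whole catalog
            if len(self.cache) > self.cache_limit:
                self.cache.popitem(last=False)     #    evict the coldest catalog
            return cid
        return None                                # 3) no hook: treated as unique
```

A missed hook makes a duplicate chunk look unique, so the sampling rate, the choice of meta-group to prefetch, and the eviction policy together determine how much deduplication ratio is traded for memory and storage I/O; this is the design space the paper explores.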
Award ID(s):
2412436
PAR ID:
10612119
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE Transactions on Computers
Date Published:
Journal Name:
IEEE Transactions on Computers
Volume:
74
Issue:
2
ISSN:
0018-9340
Page Range / eLocation ID:
483 to 494
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. In-memory data management systems, such as key-value stores, have become an essential infrastructure in today's big-data processing and cloud computing. They rely on efficient index structures to access data. While unordered indexes, such as hash tables, can perform point search in O(1) time, they cannot be used in many scenarios where range queries must be supported. Many ordered indexes, such as the B+ tree and the skip list, have an O(log N) lookup cost, where N is the number of keys in the index. For an ordered index hosting billions of keys, a lookup may take more than 30 key comparisons, which is an order of magnitude more expensive than a hash-table lookup. With the availability of large memory and fast networks in today's data centers, this O(log N) time is taking a heavy toll on applications that rely on ordered indexes. In this paper we introduce a new ordered index structure, named Wormhole, that takes O(log L) worst-case time to look up a key of length L. The low cost is achieved by simultaneously leveraging the strengths of three indexing structures, namely the hash table, the prefix tree, and the B+ tree, to orchestrate a single fast ordered index. Wormhole's range operations can be performed by a linear scan of a list after an initial lookup. This improvement in access efficiency does not come at the price of compromised space efficiency; instead, Wormhole's index space is comparable to that of the B+ tree and the skip list. Experimental results show that Wormhole outperforms the skip list, B+ tree, ART, and Masstree by up to 8.4x, 4.9x, 4.3x, and 6.6x in terms of key lookup throughput, respectively.
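As a rough illustration of how lookup cost can depend on key length L rather than key count N, the hypothetical Python sketch below stores every prefix of every key in a hash set and binary-searches on prefix length. This is only a simplification of the idea, not Wormhole's actual structure, and the names (PrefixHashIndex, seek) are made up.

```python
import bisect

class PrefixHashIndex:
    """Toy ordered index: hash set of key prefixes plus a sorted key list."""

    def __init__(self, keys):
        self.keys = sorted(keys)                 # ordered view for range scans
        self.prefixes = set()                    # every prefix of every key
        for k in keys:
            for i in range(len(k) + 1):
                self.prefixes.add(k[:i])

    def longest_existing_prefix(self, key):
        """Binary search on prefix length: O(log L) hash probes."""
        lo, hi = 0, len(key)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if key[:mid] in self.prefixes:
                lo = mid
            else:
                hi = mid - 1
        return key[:lo]

    def seek(self, key):
        """Return the first stored key >= key; the matched prefix narrows the range."""
        lo = bisect.bisect_left(self.keys, self.longest_existing_prefix(key))
        pos = bisect.bisect_left(self.keys, key, lo)
        return self.keys[pos] if pos < len(self.keys) else None
```

The prefix probe costs O(log L) hash lookups; in the paper's design, range operations then proceed by linearly scanning a list from the position found by the initial lookup.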
2. The freshness of web page indices is key to improving the search quality of a search engine. At Baidu, the major search engine in China, we have developed DirectLoad, an index updating system for efficiently delivering web-scale indices to nationwide data centers. However, web-scale index updating suffers from increasingly high data volumes during network transmission and inefficient I/O transactions due to slow disk operations. DirectLoad accelerates the index updating streams in two ways: 1) DirectLoad effectively cuts down the overwhelmingly high volume of indices in transmission by removing redundant data across versions, and adapts the regular operations of a key-value storage system so that the deduplicated datasets can still be accessed. 2) DirectLoad significantly improves I/O efficiency by replacing the LSM-tree with a memory-resident table (memtable) and append-only files (AOFs) on disk. Specifically, the write amplification stemming from sorting operations on disk is eliminated, and a lazy garbage collection policy further improves I/O performance at the software level. In addition, DirectLoad directly manipulates the SSD native interfaces to remove write amplification at the hardware level. In practice, 63% of the updating bandwidth has been saved through deduplication, and the write throughput to SSDs is increased by 3x. The index updating cycle of our production workloads has been compressed from 15 days to 3 days after deploying DirectLoad. In this paper, we show the effectiveness and efficiency of an in-memory index updating system, which departs from the framework of a conventional memory hierarchy. We hope this work contributes a strong case study to the systems research literature.
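The memtable-plus-append-only-file idea can be sketched in a few lines of Python. This is a hypothetical, simplified store (the class AppendOnlyStore and its JSON-lines segment format are assumptions, not DirectLoad's code), meant only to show why appending flushed batches avoids the sorting-induced write amplification of an LSM-tree.

```python
import json

class AppendOnlyStore:
    """Toy key-value store: in-memory table flushed to an append-only file."""

    def __init__(self, path, flush_threshold=1024):
        self.path = path
        self.memtable = {}                    # key -> value, newest wins
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Append the whole memtable as one segment; no on-disk sorting is needed.
        # Old versions of a key become garbage for a later, lazy collection pass.
        with open(self.path, "a", encoding="utf-8") as f:
            for key, value in self.memtable.items():
                f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.memtable.clear()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        latest = None
        try:
            with open(self.path, "r", encoding="utf-8") as f:   # newest record wins
                for line in f:
                    rec = json.loads(line)
                    if rec["k"] == key:
                        latest = rec["v"]
        except FileNotFoundError:
            pass
        return latest
```

Reads that miss the memtable scan the appended segments (newest record wins), and stale records are reclaimed later by lazy garbage collection rather than by sorted compaction.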
3. Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantial degradation of the deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods, which achieve only up to a 7.7x speedup on an 8-core processor while degrading the deduplication ratio by up to 40%, SS-CDC achieves up to a 25.6x speedup with no loss of deduplication ratio.
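For readers unfamiliar with CDC, here is a minimal single-threaded chunker in Python using a gear-style rolling hash. The table and mask values are illustrative, and SS-CDC's actual contribution, the two-stage parallel and AVX-512-accelerated boundary detection, is not reproduced here.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte -> hash table
MASK = (1 << 13) - 1                                   # cut probability ~1/8192 per byte

def cdc_chunks(data, min_size=2048, max_size=65536):
    """Split `data` (bytes) at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF   # roll the hash forward
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])               # declare a chunk boundary
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                        # trailing partial chunk
    return chunks
```

Because the masked bits of the rolling hash depend only on the most recent few bytes, boundaries re-synchronize shortly after a small edit; naively chunking fixed file segments in parallel forces artificial boundaries at segment edges, which is what hurts the deduplication ratio in earlier parallel CDC methods.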
4. The InterPlanetary File System (IPFS) is a pioneering effort for Web 3.0, well known for its decentralized infrastructure. However, recent studies have shown that IPFS exhibits a high degree of centralization and has integrated centralized components for improved performance. While this change contradicts the core decentralized ethos of IPFS and risks hurting the data replication level, and thus availability, it also opens opportunities for better data management and cost savings through deduplication. To explore these challenges and opportunities, we start by collecting an extensive dataset of IPFS internal traffic spanning the last three years, with more than 20 billion messages. By analyzing this long-term trace, we obtain a more complete and accurate view of how centralization has evolved over an extended period. In particular, our study reveals that (1) IPFS shows a low replication level, with only 2.71% of data files replicated more than 5 times; while increasing replication enhances lookup performance and data availability, it adversely affects downloading throughput due to the overhead of managing peer connections; (2) centralization within IPFS has clearly grown over the last three years, with just 5% of peers now hosting over 80% of the content, down significantly from 21.38% three years ago, a trend largely driven by the increase in cloud nodes; (3) the default deduplication strategy of IPFS, Fixed-Size Chunking (FSC), is largely inefficient, especially with the default 256KB chunk size, with near-zero duplication detected. Although Content-Defined Chunking (CDC) with smaller chunks could save ∼1.8 petabytes (PB) of storage space, it could negatively impact user performance. We thus design and evaluate a new metadata format that optimizes deduplication without compromising performance.
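The near-zero duplication under fixed-size chunking is easy to reproduce in a toy experiment. The Python snippet below uses made-up chunk sizes and a deliberately crude content-defined cut rule (not IPFS's actual chunkers) to show how a single inserted byte shifts every later fixed-size boundary while content-defined boundaries re-align.

```python
import hashlib
import os

def fsc_chunks(data, size=256):
    """Fixed-size chunking: boundaries at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0x3F):
    """Crude content-defined chunking: cut after any byte value matching the mask."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if (b & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def duplicate_fraction(chunker, old, new):
    """Fraction of the new version's chunks already present in the old version."""
    seen = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    return sum(hashlib.sha256(c).digest() in seen for c in new_chunks) / len(new_chunks)

old = os.urandom(1 << 16)
new = old[:100] + b"X" + old[100:]      # one inserted byte near the front
print("FSC duplicates:", duplicate_fraction(fsc_chunks, old, new))   # ~0.0
print("CDC duplicates:", duplicate_fraction(cdc_chunks, old, new))   # close to 1.0
```

Smaller content-defined chunks raise the chance of matches but add per-chunk metadata and lookup overhead, which is the performance trade-off the proposed metadata format targets.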
5. We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms suffer from rigid interfaces that do not give applications any information on prefetch effectiveness, suffer from high concurrency bottlenecks, and are inefficient in utilizing available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and the runtime, minimizing overhead, and achieving low cache misses, low lock contention, and higher I/O performance. CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It also addresses scalability and concurrency bottlenecks by distinguishing between regular I/O and prefetch operation paths, and it introduces fine-grained prefetch indexing for shared files. Finally, CrossPrefetch provides low-interference access-pattern prediction combined with support for adaptive and aggressive techniques to exploit memory capacity and storage bandwidth. Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, shows performance gains of 1.22x-3.7x in I/O throughput. We also evaluate CrossPrefetch across different file systems and both local and remote storage configurations.
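As a loose illustration of cross-layer cooperation, the hypothetical Python helper below has a user-level reader predict a forward-moving access pattern and hand the kernel an asynchronous readahead hint via posix_fadvise. It is not CrossPrefetch's interface, omits the exported cache state, prefetch-effectiveness feedback, and concurrency handling, and assumes a POSIX system.

```python
import os

class SequentialPrefetcher:
    """Toy user-level reader that hints the kernel to prefetch the next window."""

    def __init__(self, path, window=8 * 1024 * 1024):
        self.fd = os.open(path, os.O_RDONLY)
        self.window = window
        self.next_hint = 0

    def read(self, offset, length):
        data = os.pread(self.fd, length, offset)
        # Very simple access-pattern "prediction": if reads keep moving forward,
        # ask the kernel to prefetch the following window asynchronously.
        if offset + length >= self.next_hint:
            self.next_hint = offset + length + self.window
            os.posix_fadvise(self.fd, offset + length, self.window,
                             os.POSIX_FADV_WILLNEED)
        return data

    def close(self):
        os.close(self.fd)
```

The point of the real system is precisely what this sketch lacks: feedback from the OS on whether the hints actually produced cache hits, separate paths for regular and prefetch I/O, and fine-grained prefetch indexing for files shared across concurrent readers.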