This content will become publicly available on February 1, 2026

Title: CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems
Data deduplication relies on a chunk index to identify redundancy among incoming chunks. As backup data scales, it is impractical to keep the entire chunk index in memory, so an index lookup must search a portion of the on-storage index, causing a dramatic drop in index lookup throughput. Existing studies propose searching only a subset of the whole index (a partial index) to limit storage I/Os and sustain high index lookup throughput. However, several core design factors of partial indexing have not been fully explored. In this paper, we first comprehensively investigate the trade-offs of using different meta-groups, sampling methods, and meta-group selection policies for a partial index. We then propose a Collaborative Partial Index (CPI), which leverages two meta-groups, recipe-segment and container-catalog, to identify unique chunks more efficiently and effectively. CPI further introduces a hook-entry sharing technique and a two-stage eviction policy to reduce memory usage without hurting the deduplication ratio. In our evaluation, under the same memory usage and storage I/O constraints, CPI achieves a 1.21x-2.17x higher deduplication ratio than state-of-the-art partial indexing schemes. Alternatively, when the same deduplication ratio is targeted, CPI achieves 1.8x-4.98x higher index lookup throughput than the alternatives. Compared with full indexing, CPI's maximum deduplication ratio is only 4.07% lower, while its throughput is 37.1x-122.2x that of full indexing, depending on the storage I/O constraint.
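To make the partial-indexing idea concrete, here is a minimal Python sketch of a sampled ("hook") in-memory index that prefetches an on-storage container catalog on a hook hit. The class and method names (PartialIndex, lookup), the sampling mask, and the LRU catalog cache are illustrative assumptions, not CPI's actual design or code.

```python
import hashlib
from collections import OrderedDict

class PartialIndex:
    """Toy partial index: sampled hook entries in memory, full catalogs on storage."""

    def __init__(self, catalogs, sample_mask=0x0F, cache_limit=4):
        self.catalogs = catalogs          # container_id -> set of fingerprints ("on storage")
        self.hooks = {}                   # sampled fingerprint -> container_id (in memory)
        self.cache = OrderedDict()        # recently fetched catalogs, LRU order
        self.cache_limit = cache_limit
        for cid, fps in catalogs.items():
            for fp in fps:
                if (fp[0] & sample_mask) == 0:     # content-based sampling (~1/16 of chunks)
                    self.hooks[fp] = cid

    def lookup(self, chunk):
        """Return the container id holding this chunk, or None if it looks unique."""
        fp = hashlib.sha1(chunk).digest()
        for cid, fps in self.cache.items():        # 1) cached catalogs: no storage I/O
            if fp in fps:
                self.cache.move_to_end(cid)
                return cid
        cid = self.hooks.get(fp)
        if cid is not None:                        # 2) hook hit: one storage I/O
            self.cache[cid] = self.catalogs[cid]   #    prefetch the whole catalog
            if len(self.cache) > self.cache_limit:
                self.cache.popitem(last=False)     #    evict the coldest catalog
            return cid
        return None                                # 3) no hook: treated as unique
```

A missed hook makes a duplicate chunk look unique, so the sampling rate, the choice of meta-group to prefetch, and the eviction policy together determine how much deduplication ratio is traded for memory and storage I/O; this is the design space the paper explores.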
Award ID(s):
2412436
PAR ID:
10612119
Author(s) / Creator(s):
; ;
Publisher / Repository:
IEEE Transactions on Computers
Date Published:
Journal Name:
IEEE Transactions on Computers
Volume:
74
Issue:
2
ISSN:
0018-9340
Page Range / eLocation ID:
483 to 494
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. In-memory data management systems, such as key-value stores, have become an essential infrastructure in today's big-data processing and cloud computing. They rely on efficient index structures to access data. While unordered indexes, such as hash tables, can perform point search in O(1) time, they cannot be used in many scenarios where range queries must be supported. Many ordered indexes, such as the B+ tree and the skip list, have an O(log N) lookup cost, where N is the number of keys in the index. For an ordered index hosting billions of keys, a lookup may take more than 30 key comparisons, which is an order of magnitude more expensive than a hash-table lookup. With the availability of large memory and fast networks in today's data centers, this O(log N) time is taking a heavy toll on applications that rely on ordered indexes. In this paper we introduce a new ordered index structure, named Wormhole, that takes O(log L) worst-case time to look up a key of length L. The low cost is achieved by simultaneously leveraging the strengths of three indexing structures, namely the hash table, the prefix tree, and the B+ tree, to orchestrate a single fast ordered index. Wormhole's range operations can be performed by a linear scan of a list after an initial lookup. This improvement in access efficiency does not come at the price of compromised space efficiency; instead, Wormhole's index space is comparable to that of the B+ tree and the skip list. Experimental results show that Wormhole outperforms the skip list, B+ tree, ART, and Masstree by up to 8.4x, 4.9x, 4.3x, and 6.6x in terms of key lookup throughput, respectively.
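As a rough illustration of how lookup cost can depend on key length L rather than key count N, the hypothetical Python sketch below stores every prefix of every key in a hash set and binary-searches on prefix length. This is only a simplification of the idea, not Wormhole's actual structure, and the names (PrefixHashIndex, seek) are made up.

```python
import bisect

class PrefixHashIndex:
    """Toy ordered index: hash set of key prefixes plus a sorted key list."""

    def __init__(self, keys):
        self.keys = sorted(keys)                 # ordered view for range scans
        self.prefixes = set()                    # every prefix of every key
        for k in keys:
            for i in range(len(k) + 1):
                self.prefixes.add(k[:i])

    def longest_existing_prefix(self, key):
        """Binary search on prefix length: O(log L) hash probes."""
        lo, hi = 0, len(key)
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if key[:mid] in self.prefixes:
                lo = mid
            else:
                hi = mid - 1
        return key[:lo]

    def seek(self, key):
        """Return the first stored key >= key; the matched prefix narrows the range."""
        lo = bisect.bisect_left(self.keys, self.longest_existing_prefix(key))
        pos = bisect.bisect_left(self.keys, key, lo)
        return self.keys[pos] if pos < len(self.keys) else None
```

The prefix probe costs O(log L) hash lookups; in the paper's design, range operations then proceed by linearly scanning a list from the position found by the initial lookup.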
2. The freshness of web page indices is key to improving the search quality of a search engine. At Baidu, the major search engine in China, we have developed DirectLoad, an index updating system for efficiently delivering web-scale indices to nationwide data centers. However, web-scale index updating suffers from increasingly high data volumes during network transmission and inefficient I/O transactions due to slow disk operations. DirectLoad accelerates the index updating streams in two ways: 1) DirectLoad effectively cuts down the overwhelmingly high volume of indices in transmission by removing redundant data across versions, and adapts the regular operations of a key-value storage system so that the deduplicated datasets can still be accessed. 2) DirectLoad significantly improves I/O efficiency by replacing the LSM-tree with a memory-resident table (memtable) and append-only files (AOFs) on disk. Specifically, the write amplification stemming from sorting operations on disk is eliminated, and a lazy garbage collection policy further improves I/O performance at the software level. In addition, DirectLoad directly manipulates the SSD native interfaces to remove write amplification at the hardware level. In practice, 63% of the updating bandwidth has been saved through deduplication, and the write throughput to SSDs is increased by 3x. The index updating cycle of our production workloads has been compressed from 15 days to 3 days after deploying DirectLoad. In this paper, we show the effectiveness and efficiency of an in-memory index updating system, which departs from the framework of a conventional memory hierarchy. We hope this work contributes a strong case study to the systems research literature.
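The memtable-plus-append-only-file idea can be sketched in a few lines of Python. This is a hypothetical, simplified store (the class AppendOnlyStore and its JSON-lines segment format are assumptions, not DirectLoad's code), meant only to show why appending flushed batches avoids the sorting-induced write amplification of an LSM-tree.

```python
import json

class AppendOnlyStore:
    """Toy key-value store: in-memory table flushed to an append-only file."""

    def __init__(self, path, flush_threshold=1024):
        self.path = path
        self.memtable = {}                    # key -> value, newest wins
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Append the whole memtable as one segment; no on-disk sorting is needed.
        # Old versions of a key become garbage for a later, lazy collection pass.
        with open(self.path, "a", encoding="utf-8") as f:
            for key, value in self.memtable.items():
                f.write(json.dumps({"k": key, "v": value}) + "\n")
        self.memtable.clear()

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        latest = None
        try:
            with open(self.path, "r", encoding="utf-8") as f:   # newest record wins
                for line in f:
                    rec = json.loads(line)
                    if rec["k"] == key:
                        latest = rec["v"]
        except FileNotFoundError:
            pass
        return latest
```

Reads that miss the memtable scan the appended segments (newest record wins), and stale records are reclaimed later by lazy garbage collection rather than by sorted compaction.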
3. Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantial degradation of the deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods, which achieve only up to a 7.7x speedup on an 8-core processor while degrading the deduplication ratio by up to 40%, SS-CDC achieves up to a 25.6x speedup with no loss of deduplication ratio.
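For readers unfamiliar with CDC, here is a minimal single-threaded chunker in Python using a gear-style rolling hash. The table and mask values are illustrative, and SS-CDC's actual contribution, the two-stage parallel and AVX-512-accelerated boundary detection, is not reproduced here.

```python
import random

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]   # fixed random byte -> hash table
MASK = (1 << 13) - 1                                   # cut probability ~1/8192 per byte

def cdc_chunks(data, min_size=2048, max_size=65536):
    """Split `data` (bytes) at content-defined boundaries."""
    chunks, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF   # roll the hash forward
        length = i - start + 1
        if (length >= min_size and (h & MASK) == 0) or length >= max_size:
            chunks.append(data[start:i + 1])               # declare a chunk boundary
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(data[start:])                        # trailing partial chunk
    return chunks
```

Because the masked bits of the rolling hash depend only on the most recent few bytes, boundaries re-synchronize shortly after a small edit; naively chunking fixed file segments in parallel forces artificial boundaries at segment edges, which is what hurts the deduplication ratio in earlier parallel CDC methods.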
4. The InterPlanetary File System (IPFS) is a pioneering effort for Web 3.0, well known for its decentralized infrastructure. However, recent studies have shown that IPFS exhibits a high degree of centralization and has integrated centralized components for improved performance. While this change contradicts the core decentralized ethos of IPFS and risks hurting the data replication level, and thus availability, it also opens opportunities for better data management and cost savings through deduplication. To explore these challenges and opportunities, we start by collecting an extensive dataset of IPFS internal traffic spanning the last three years, with more than 20 billion messages. By analyzing this long-term trace, we obtain a more complete and accurate view of how centralization has evolved over an extended period. In particular, our study reveals that (1) IPFS shows a low replication level, with only 2.71% of data files replicated more than 5 times; while increasing replication enhances lookup performance and data availability, it adversely affects downloading throughput due to the overhead of managing peer connections; (2) centralization within IPFS has clearly grown over the last three years, with just 5% of peers now hosting over 80% of the content, down significantly from 21.38% three years ago, a trend largely driven by the increase in cloud nodes; (3) the default deduplication strategy of IPFS, Fixed-Size Chunking (FSC), is largely inefficient, especially with the default 256KB chunk size, with near-zero duplication detected. Although Content-Defined Chunking (CDC) with smaller chunks could save ∼1.8 petabytes (PB) of storage space, it could negatively impact user performance. We thus design and evaluate a new metadata format that optimizes deduplication without compromising performance.
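The near-zero duplication under fixed-size chunking is easy to reproduce in a toy experiment. The Python snippet below uses made-up chunk sizes and a deliberately crude content-defined cut rule (not IPFS's actual chunkers) to show how a single inserted byte shifts every later fixed-size boundary while content-defined boundaries re-align.

```python
import hashlib
import os

def fsc_chunks(data, size=256):
    """Fixed-size chunking: boundaries at fixed offsets."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def cdc_chunks(data, mask=0x3F):
    """Crude content-defined chunking: cut after any byte value matching the mask."""
    chunks, start = [], 0
    for i, b in enumerate(data):
        if (b & mask) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        chunks.append(data[start:])
    return chunks

def duplicate_fraction(chunker, old, new):
    """Fraction of the new version's chunks already present in the old version."""
    seen = {hashlib.sha256(c).digest() for c in chunker(old)}
    new_chunks = chunker(new)
    return sum(hashlib.sha256(c).digest() in seen for c in new_chunks) / len(new_chunks)

old = os.urandom(1 << 16)
new = old[:100] + b"X" + old[100:]      # one inserted byte near the front
print("FSC duplicates:", duplicate_fraction(fsc_chunks, old, new))   # ~0.0
print("CDC duplicates:", duplicate_fraction(cdc_chunks, old, new))   # close to 1.0
```

Smaller content-defined chunks raise the chance of matches but add per-chunk metadata and lookup overhead, which is the performance trade-off the proposed metadata format targets.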
5. We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms suffer from rigid interfaces that do not give applications any information on prefetch effectiveness, suffer from high concurrency bottlenecks, and are inefficient in utilizing available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and the runtime, minimizing overhead, and achieving low cache misses, low lock contention, and higher I/O performance. CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It also addresses scalability and concurrency bottlenecks by distinguishing between regular I/O and prefetch operation paths, and it introduces fine-grained prefetch indexing for shared files. Finally, CrossPrefetch provides low-interference access-pattern prediction combined with support for adaptive and aggressive techniques to exploit memory capacity and storage bandwidth. Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, shows performance gains of 1.22x-3.7x in I/O throughput. We also evaluate CrossPrefetch across different file systems and both local and remote storage configurations.
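As a loose illustration of cross-layer cooperation, the hypothetical Python helper below has a user-level reader predict a forward-moving access pattern and hand the kernel an asynchronous readahead hint via posix_fadvise. It is not CrossPrefetch's interface, omits the exported cache state, prefetch-effectiveness feedback, and concurrency handling, and assumes a POSIX system.

```python
import os

class SequentialPrefetcher:
    """Toy user-level reader that hints the kernel to prefetch the next window."""

    def __init__(self, path, window=8 * 1024 * 1024):
        self.fd = os.open(path, os.O_RDONLY)
        self.window = window
        self.next_hint = 0

    def read(self, offset, length):
        data = os.pread(self.fd, length, offset)
        # Very simple access-pattern "prediction": if reads keep moving forward,
        # ask the kernel to prefetch the following window asynchronously.
        if offset + length >= self.next_hint:
            self.next_hint = offset + length + self.window
            os.posix_fadvise(self.fd, offset + length, self.window,
                             os.POSIX_FADV_WILLNEED)
        return data

    def close(self):
        os.close(self.fd)
```

The point of the real system is precisely what this sketch lacks: feedback from the OS on whether the hints actually produced cache hits, separate paths for regular and prefetch I/O, and fine-grained prefetch indexing for files shared across concurrent readers.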