SS-CDC: a two-stage parallel content-defined chunking for deduplicating backup storage

Ni, Fan; Lin, Xing; Jiang, Song

doi:10.1145/3319647.3325834

Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its ability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts to accelerate it by segmenting a file and running the algorithm independently on each segment in parallel come at the cost of a substantially degraded deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC that enables (almost) full parallelism in chunking a file without compromising the deduplication ratio. Further, SS-CDC exploits the instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods, which achieve at most a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio.
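
The abstract does not spell out the two stages, so the following is only a minimal sketch of one way such a split can work, under stated assumptions: stage 1 marks candidate cut points from a window-local hash, which can be computed for every segment independently, and stage 2 walks the candidates once in order to enforce minimum and maximum chunk sizes. The hash function, `WINDOW`, `MASK`, `MIN_SIZE`, `MAX_SIZE`, and all function names are hypothetical choices for illustration, not the paper's implementation. Threads are used only to show the structure; the SIMD (AVX-512) part is not modeled.

```python
import random
from concurrent.futures import ThreadPoolExecutor

# All constants below are illustrative assumptions, not values from the paper.
WINDOW = 16                    # bytes covered by the hash window
MASK = (1 << 6) - 1            # candidate when the low 6 bits of the hash are zero
MIN_SIZE, MAX_SIZE = 32, 256   # chunk size limits in bytes


def window_hash(data, end):
    """Hash of the (up to) WINDOW bytes ending at index `end`, inclusive."""
    h = 0
    for b in data[max(0, end - WINDOW + 1): end + 1]:
        h = (h * 31 + b) & 0xFFFFFFFF
    return h


def stage1_candidates(data, start, stop):
    """Stage 1: candidate positions in [start, stop). Depends only on local
    bytes, so each segment can be scanned independently of the others."""
    return [i for i in range(start, stop) if (window_hash(data, i) & MASK) == 0]


def stage2_select(candidates, length):
    """Stage 2: one ordered pass that applies the size limits.
    Returns the exclusive end offset of every chunk."""
    cuts, last = [], 0
    for i in candidates:
        end = i + 1
        while end - last > MAX_SIZE:      # forced cut at the maximum size
            last += MAX_SIZE
            cuts.append(last)
        if end - last >= MIN_SIZE:        # accept candidate if chunk is long enough
            cuts.append(end)
            last = end
    while length - last > MAX_SIZE:       # forced cuts in the tail
        last += MAX_SIZE
        cuts.append(last)
    if last < length:
        cuts.append(length)
    return cuts


def two_stage_cdc(data, workers=4):
    n = len(data)
    step = max(1, -(-n // workers))       # ceiling division
    ranges = [(s, min(s + step, n)) for s in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: stage1_candidates(data, *r), ranges))
    return stage2_select([i for part in parts for i in part], n)


def sequential_cdc(data):
    """Reference: ordinary single-pass CDC with the same hash and limits."""
    cuts, last = [], 0
    for i in range(len(data)):
        size = i + 1 - last
        if size < MIN_SIZE:
            continue
        if (window_hash(data, i) & MASK) == 0 or size == MAX_SIZE:
            cuts.append(i + 1)
            last = i + 1
    if last < len(data):
        cuts.append(len(data))
    return cuts
```

In this sketch the chunk boundaries do not depend on how the input is segmented, because the segment layout only affects where the candidate scan is split, not which candidates are found or how they are filtered. That is the property the abstract describes as parallelism without loss of deduplication ratio.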

Award ID(s): 1815303

NSF-PAR ID: 10119149

Author(s) / Creator(s): Ni, Fan; Lin, Xing; Jiang, Song

Date Published: 2019-06-03

Journal Name: SYSTOR '19 Proceedings of the 12th ACM International Conference on Systems and Storage

Page Range / eLocation ID: 86 to 96

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1145/3319647.3325834