Title: Near-Duplicate Text Alignment with One Permutation Hashing
This paper studies the near-duplicate text alignment problem under the constraint of Jaccard similarity. Specifically, given a collection of long texts and a short query text, this problem finds all the subsequences in each text whose Jaccard similarities to the query are no smaller than a given threshold. Near-duplicate text alignment is computationally intensive, because there are O(n²) subsequences in a text with n tokens. To remedy this issue, a few recent studies propose to first generate the min-hash sketch of every subsequence in each text and then find all the subsequences whose min-hash sketches are similar to that of the query. They introduce the concept of compact windows and show that the O(n²k) min-hashes in a text with n tokens can be losslessly compressed into compact windows using O(nk) space, where k is the sketch size. However, the space cost O(nk) is still too high for long texts, especially when the sketch size k is large. To address this issue, we propose to use One Permutation Hashing (OPH) to generate the min-hash sketch and introduce the concept of OPH compact windows. Although the size of each sketch remains O(k), we prove that all the O(n²k) min-hashes generated by OPH in a text with n tokens can be losslessly compressed into OPH compact windows using only O(n+k) space. Note that generating OPH compact windows does not require enumerating the O(n²k) min-hashes. Moreover, we develop an algorithm that finds all the sketches in a text similar to that of the query directly from OPH compact windows, along with three optimizations. We conduct extensive experiments on three real-world datasets. Empirical results show that our proposed algorithms significantly outperform existing methods in terms of index cost and query latency, and that they scale well.
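As background for the sketching step, here is a minimal Python sketch of One Permutation Hashing for a single token sequence, under my own naming and hash choice (blake2b, k equal-width bins, empty bins left unfilled): every token is hashed once, the hash space is split into k bins, and only the per-bin minimum is kept, so each sketch costs O(k) space while each token is hashed a single time. The paper's contribution, losslessly compressing the sketches of all O(n²) subsequences into OPH compact windows, is not reproduced here.

```python
import hashlib

def oph_sketch(tokens, k, seed=0):
    """One Permutation Hashing: hash every token once, split the hash
    space into k equal-width bins, and keep the minimum hash per bin.
    Empty bins are left as None (densification is omitted for brevity)."""
    bin_width = (1 << 64) // k
    sketch = [None] * k
    for tok in tokens:
        h = int.from_bytes(
            hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest(),
            "big",
        )
        b = min(h // bin_width, k - 1)       # which of the k bins this hash falls into
        if sketch[b] is None or h < sketch[b]:
            sketch[b] = h                    # keep the per-bin minimum
    return sketch

def estimated_jaccard(s1, s2):
    """Estimate Jaccard similarity from two OPH sketches, ignoring bins
    that are empty in both."""
    matches = usable = 0
    for a, b in zip(s1, s2):
        if a is None and b is None:
            continue
        usable += 1
        matches += (a == b)
    return matches / usable if usable else 0.0

# Example: compare a short query against one subsequence of a longer text.
query = "the quick brown fox".split()
window = "a quick brown fox jumps".split()
print(estimated_jaccard(oph_sketch(query, 64), oph_sketch(window, 64)))
```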
Award ID(s):
2212629
PAR ID:
10636938
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data
Volume:
2
Issue:
4
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., differing by only a couple of tokens out of 100) in the training corpus. A major challenge in conducting this evaluation is the huge computational cost of near-duplicate sequence search, because modern LLMs are trained on larger and larger corpora with up to 1 trillion tokens, and the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least t tokens (as very short near-duplicates are often irrelevant noise) in the corpus, in time linear in the corpus size. We formally prove that only 2(n+1)/(t+1) − 1 min-hash values are generated in expectation for a text with n tokens, so the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on several large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.
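To illustrate the candidate-filtering step mentioned in the item above, the sketch below builds an inverted index from min-hash values to sequence ids and keeps only the sequences that share at least a chosen number of values with the query. The function names, dictionary layout, and threshold are illustrative assumptions; the paper's actual index, sequence grouping, and prefix-filtering details are more involved.

```python
from collections import defaultdict

def build_index(minhash_sets):
    """Map each min-hash value to the ids of the sequences that contain it.
    minhash_sets: dict of sequence id -> set of min-hash values."""
    index = defaultdict(set)
    for seq_id, values in minhash_sets.items():
        for v in values:
            index[v].add(seq_id)
    return index

def candidates(index, query_values, min_shared):
    """Return sequence ids that share at least min_shared min-hash values
    with the query; everything else is filtered out without being touched."""
    counts = defaultdict(int)
    for v in query_values:
        for seq_id in index.get(v, ()):
            counts[seq_id] += 1
    return {s for s, c in counts.items() if c >= min_shared}

# Tiny example with made-up min-hash values.
corpus = {"seq-a": {3, 17, 42, 58}, "seq-b": {5, 17, 99}, "seq-c": {3, 42, 58, 77}}
idx = build_index(corpus)
print(candidates(idx, {3, 42, 58, 100}, min_shared=3))   # {'seq-a', 'seq-c'}
```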
  2. This paper is about semantic regular expressions (SemREs), a concept recently proposed in Smore (Chen et al. 2023) in which classical regular expressions are extended with a primitive to query external oracles such as databases and large language models (LLMs). SemREs can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus of that paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string w matches a SemRE r in O(|r|²|w|² + |r||w|³) time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in O(|r|²|w|²) time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis and show that the procedure massively outperforms a dynamic-programming baseline, incurring roughly a 2× overhead over the time needed for interaction with the oracle. Second, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an Ω(|w|²) lower bound on the number of oracle queries necessary to make this determination.
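To give a feel for what membership testing against a SemRE involves, below is a toy Python sketch in which a "semantic regex" is a concatenation of pieces, each either an ordinary regex or an oracle predicate over a candidate substring; membership asks whether the input string can be split so that every piece accepts its chunk. The Piece representation, the is_city oracle, and the naive backtracking matcher are illustrative assumptions of mine; they are neither Smore's API nor the paper's two-pass NFA algorithm.

```python
import re
from typing import Callable, List, Union

# A piece is either a plain regex (str) or an oracle predicate over a substring.
Piece = Union[str, Callable[[str], bool]]

def semre_match(pieces: List[Piece], w: str) -> bool:
    """Return True if w can be split into consecutive chunks, one per piece,
    such that regex pieces fully match their chunk and oracle pieces accept
    theirs. Naive backtracking: exponential in the worst case, fine for a demo."""
    def go(i: int, pos: int) -> bool:
        if i == len(pieces):
            return pos == len(w)
        for end in range(pos, len(w) + 1):
            chunk = w[pos:end]
            piece = pieces[i]
            ok = (re.fullmatch(piece, chunk) is not None
                  if isinstance(piece, str) else piece(chunk))
            if ok and go(i + 1, end):
                return True
        return False
    return go(0, 0)

# Hypothetical oracle: in practice this could query a database or an LLM.
def is_city(s: str) -> bool:
    return s.strip() in {"Paris", "Tokyo", "Bogota"}

# "moved to <city> in <4-digit year>"
pieces = [r"moved to ", is_city, r" in \d{4}"]
print(semre_match(pieces, "moved to Tokyo in 2021"))   # True
print(semre_match(pieces, "moved to Mars in 2021"))    # False
```

The naive matcher also makes the cost concern visible: it probes the oracle on many candidate substrings, which is exactly the quantity the paper's Ω(|w|²) lower bound on oracle queries speaks to.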
  3. The min-hash sketch is a well-known technique for low-communication approximation of the Jaccard index between two input sets. Moreover, there is a folklore belief that min-hash sketch-based protocols protect the privacy of the inputs. In this paper, we consider variants of private min-hash sketch-based protocols and investigate this folklore belief to quantify the privacy of the min-hash sketch. We begin our investigation by presenting a highly efficient two-party protocol for estimating the Jaccard index while ensuring differential privacy. This protocol adds Laplacian noise to the min-hash sketch counts to provide privacy protection. Then, we aim to understand what privacy, if any, is guaranteed if the results of the min-hash are released without any additional noise, such as in the case of historical data. We first consider the privacy of min-hash in a centralized setting where the hash functions are chosen by the min-hash functionality and are unknown to the participants. We show that in this case the min-hash output satisfies the standard definition of differential privacy (DP) without any additional noise. We next consider a more practical distributed setting, where the hash function must be shared among all parties and is typically public. Unfortunately, we show that in this public hash function setting, the min-hash output is no longer DP. We therefore consider the notion of distributional differential privacy (DDP) introduced by Bassily et al. (FOCS 2013). We show that if the honest party's set has sufficiently high min-entropy, the min-hash output achieves DDP without requiring noise. Our findings provide guidance on how to use the min-hash sketch for private Jaccard index estimation and clarify the extent to which min-hash protocols protect input privacy, refining the common belief in their privacy guarantees.
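To make the shape of the first protocol above concrete, here is a minimal sketch of releasing a noisy Jaccard estimate from min-hash sketches: count the positions where the two sketches agree, add Laplace noise to that count, and divide by the sketch size. The noise scale shown (1/ε, i.e., treating the matching-position count as a sensitivity-1 statistic) and the function names are assumptions for illustration only; the actual mechanism, sensitivity analysis, and two-party protocol are specified in the paper.

```python
import hashlib
import numpy as np

def minhash_sketch(items, k, seed=0):
    """Classical k-hash-function min-hash: keep one minimum per hash function."""
    def h(i, x):
        d = hashlib.blake2b(f"{seed}:{i}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "big")
    return [min(h(i, x) for x in items) for i in range(k)]

def noisy_jaccard_estimate(sk1, sk2, epsilon, rng=None):
    """Count agreeing sketch positions, perturb that count with Laplace noise,
    and return a clipped Jaccard estimate."""
    rng = rng or np.random.default_rng()
    k = len(sk1)
    matches = sum(int(x == y) for x, y in zip(sk1, sk2))
    noisy = matches + rng.laplace(0.0, 1.0 / epsilon)   # assumed sensitivity of 1
    return float(np.clip(noisy / k, 0.0, 1.0))

s1 = minhash_sketch({"a", "b", "c", "d"}, k=128)
s2 = minhash_sketch({"b", "c", "d", "e"}, k=128)
print(noisy_jaccard_estimate(s1, s2, epsilon=1.0))   # about 0.6 plus noise
```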
  4. We study the query complexity of finding the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) in two-player zero-sum matrix games. Fearnley and Savani [18] showed that for any randomized algorithm, there exists an n × n input matrix for which it needs to query Ω(n²) entries in expectation to compute a single Nash equilibrium. On the other hand, Bienstock et al. [5] showed that there is a special class of matrices for which one can query O(n) entries and compute its set of all Nash equilibria. However, these results do not fully characterize the query complexity of finding the set of all Nash equilibria in two-player zero-sum matrix games. In this work, we characterize the query complexity of finding the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) in terms of the number of rows n of the input matrix \(A \in \mathbb{R}^{n \times n}\), the row support size \(k_1 := |\bigcup_{x \in \mathcal{X}_\ast} \text{supp}(x)|\), and the column support size \(k_2 := |\bigcup_{y \in \mathcal{Y}_\ast} \text{supp}(y)|\). We design a simple yet non-trivial randomized algorithm that returns the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) by querying at most O(nk⁵ · polylog(n)) entries of the input matrix \(A \in \mathbb{R}^{n \times n}\) in expectation, where k := max{k₁, k₂}. This upper bound is tight up to a factor of poly(k), as we show that for any randomized algorithm, there exists an n × n input matrix with min{k₁, k₂} = 1 for which it needs to query Ω(nk) entries in expectation in order to find the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\).
  5. Understanding variations in the routes by which wild animals gain and lose water is challenging, and common methods require longitudinal sampling, which can be prohibitive. However, a new approach uses Δ′¹⁷O_BW (the Δ′¹⁷O of animal body water), calculated from measurements of δ′¹⁷O and δ′¹⁸O in a single sample, as a natural tracer of water flux. Δ′¹⁷O_BW is promising, but its relationship to organismal variables such as metabolic rate and water intake has not been validated. Here, we continuously measured oxygen influxes and effluxes of captive deer mice (Peromyscus maniculatus) and manipulated their water intake and metabolic rate. We used these oxygen flux data to predict Δ′¹⁷O_BW for the mice and compared these model predictions with Δ′¹⁷O_BW measured in blood plasma samples. As expected, Δ′¹⁷O_BW correlated positively with drinking water intake and negatively with metabolic rate. All predicted Δ′¹⁷O_BW values (based on measured oxygen fluxes) differed from measured Δ′¹⁷O_BW values by <30 per meg (mean absolute difference: 11 ± 9 per meg), suggesting high accuracy for this modelling approach, because studies currently report a range of 300 per meg for Δ′¹⁷O_BW among mammals, birds and fish.
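For readers unfamiliar with the triple-oxygen-isotope notation used in the item above, the snippet below shows one common way Δ′¹⁷O is computed from paired δ¹⁷O and δ¹⁸O measurements: logarithmic "prime" delta values combined with a reference slope (here 0.528) and reported in per meg. The slope, the function names, and the example numbers are illustrative assumptions, not necessarily the exact convention or data used in this study.

```python
import math

def delta_prime(delta_permil):
    """Convert a conventional delta value (per mil) to its logarithmic
    'prime' form, also in per mil."""
    return 1000.0 * math.log(delta_permil / 1000.0 + 1.0)

def cap_delta_17O_per_meg(d17O_permil, d18O_permil, lambda_ref=0.528):
    """Δ′17O = δ′17O − λ_ref · δ′18O, converted from per mil to per meg.
    lambda_ref = 0.528 is a commonly used reference slope (an assumption here)."""
    d17p = delta_prime(d17O_permil)
    d18p = delta_prime(d18O_permil)
    return 1000.0 * (d17p - lambda_ref * d18p)

# Illustrative (made-up) body-water values, not data from the study.
print(round(cap_delta_17O_per_meg(-2.50, -4.80), 1))   # ≈ 37.4 per meg
```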