Title: Near-Duplicate Text Alignment with One Permutation Hashing
This paper studies the near-duplicate text alignment problem under the constraint of Jaccard similarity. Specifically, given a collection of long texts and a short query text, this problem finds all the subsequences in each text whose Jaccard similarities to the query are no smaller than a given threshold. Near-duplicate text alignment is computationally intensive, because there are O(n²) subsequences in a text with n tokens. To remedy this issue, a few recent studies propose to first generate the min-hash sketch of every subsequence in each text and then find all the subsequences whose min-hash sketches are similar to that of the query. They introduce the concept of compact windows and show that the O(n²k) min-hashes in a text with n tokens can be losslessly compressed into compact windows using O(nk) space, where k is the sketch size. However, the space cost O(nk) is still too high for long texts, especially when the sketch size k is large. To address this issue, we propose to use One Permutation Hashing (OPH) to generate the min-hash sketch and introduce the concept of OPH compact windows. Although the size of each sketch remains O(k), we prove that all the O(n²k) min-hashes generated by OPH in a text with n tokens can be losslessly compressed into OPH compact windows using only O(n+k) space. Note that generating OPH compact windows does not require enumerating the O(n²k) min-hashes. Moreover, we develop an algorithm that finds all the sketches in a text similar to that of the query directly from OPH compact windows, along with three optimizations. We conduct extensive experiments on three real-world datasets. Empirical results show that our proposed algorithms significantly outperform existing methods in terms of index cost and query latency, and that they scale well.
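As background for the sketching step, here is a minimal Python sketch of One Permutation Hashing for a single token sequence, under my own naming and hash choice (blake2b, k equal-width bins, empty bins left unfilled): every token is hashed once, the hash space is split into k bins, and only the per-bin minimum is kept, so each sketch costs O(k) space while each token is hashed a single time. The paper's contribution, losslessly compressing the sketches of all O(n²) subsequences into OPH compact windows, is not reproduced here.

```python
import hashlib

def oph_sketch(tokens, k, seed=0):
    """One Permutation Hashing: hash every token once, split the hash
    space into k equal-width bins, and keep the minimum hash per bin.
    Empty bins are left as None (densification is omitted for brevity)."""
    bin_width = (1 << 64) // k
    sketch = [None] * k
    for tok in tokens:
        h = int.from_bytes(
            hashlib.blake2b(f"{seed}:{tok}".encode(), digest_size=8).digest(),
            "big",
        )
        b = min(h // bin_width, k - 1)       # which of the k bins this hash falls into
        if sketch[b] is None or h < sketch[b]:
            sketch[b] = h                    # keep the per-bin minimum
    return sketch

def estimated_jaccard(s1, s2):
    """Estimate Jaccard similarity from two OPH sketches, ignoring bins
    that are empty in both."""
    matches = usable = 0
    for a, b in zip(s1, s2):
        if a is None and b is None:
            continue
        usable += 1
        matches += (a == b)
    return matches / usable if usable else 0.0

# Example: compare a short query against one subsequence of a longer text.
query = "the quick brown fox".split()
window = "a quick brown fox jumps".split()
print(estimated_jaccard(oph_sketch(query, 64), oph_sketch(window, 64)))
```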
Award ID(s):
2212629
PAR ID:
10636938
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data
Volume:
2
Issue:
4
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent studies show that large language models (LLMs) unintentionally memorize part of their training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., differing by only a couple of tokens out of 100) in the training corpus. A major challenge in conducting this evaluation is the huge computational cost of near-duplicate sequence search, because modern LLMs are trained on larger and larger corpora with up to 1 trillion tokens, and the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least t tokens (as very short near-duplicates are often irrelevant noise) in the corpus, in time linear in the corpus size. We formally prove that only 2(n+1)/(t+1) − 1 min-hash values are generated in expectation for a text with n tokens, so the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on several large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.
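To illustrate the candidate-filtering step mentioned in the item above, the sketch below builds an inverted index from min-hash values to sequence ids and keeps only the sequences that share at least a chosen number of values with the query. The function names, dictionary layout, and threshold are illustrative assumptions; the paper's actual index, sequence grouping, and prefix-filtering details are more involved.

```python
from collections import defaultdict

def build_index(minhash_sets):
    """Map each min-hash value to the ids of the sequences that contain it.
    minhash_sets: dict of sequence id -> set of min-hash values."""
    index = defaultdict(set)
    for seq_id, values in minhash_sets.items():
        for v in values:
            index[v].add(seq_id)
    return index

def candidates(index, query_values, min_shared):
    """Return sequence ids that share at least min_shared min-hash values
    with the query; everything else is filtered out without being touched."""
    counts = defaultdict(int)
    for v in query_values:
        for seq_id in index.get(v, ()):
            counts[seq_id] += 1
    return {s for s, c in counts.items() if c >= min_shared}

# Tiny example with made-up min-hash values.
corpus = {"seq-a": {3, 17, 42, 58}, "seq-b": {5, 17, 99}, "seq-c": {3, 42, 58, 77}}
idx = build_index(corpus)
print(candidates(idx, {3, 42, 58, 100}, min_shared=3))   # {'seq-a', 'seq-c'}
```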
  2. This paper is about semantic regular expressions (SemREs), a concept recently proposed in Smore (Chen et al. 2023) in which classical regular expressions are extended with a primitive to query external oracles such as databases and large language models (LLMs). SemREs can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus of that paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study the membership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a string w matches a SemRE r in O(|r|²|w|² + |r||w|³) time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs in O(|r|²|w|²) time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis and show that the procedure massively outperforms a dynamic-programming baseline, incurring roughly a 2× overhead over the time needed for interaction with the oracle. Second, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms that are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an Ω(|w|²) lower bound on the number of oracle queries necessary to make this determination.
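To give a feel for what membership testing against a SemRE involves, below is a toy Python sketch in which a "semantic regex" is a concatenation of pieces, each either an ordinary regex or an oracle predicate over a candidate substring; membership asks whether the input string can be split so that every piece accepts its chunk. The Piece representation, the is_city oracle, and the naive backtracking matcher are illustrative assumptions of mine; they are neither Smore's API nor the paper's two-pass NFA algorithm.

```python
import re
from typing import Callable, List, Union

# A piece is either a plain regex (str) or an oracle predicate over a substring.
Piece = Union[str, Callable[[str], bool]]

def semre_match(pieces: List[Piece], w: str) -> bool:
    """Return True if w can be split into consecutive chunks, one per piece,
    such that regex pieces fully match their chunk and oracle pieces accept
    theirs. Naive backtracking: exponential in the worst case, fine for a demo."""
    def go(i: int, pos: int) -> bool:
        if i == len(pieces):
            return pos == len(w)
        for end in range(pos, len(w) + 1):
            chunk = w[pos:end]
            piece = pieces[i]
            ok = (re.fullmatch(piece, chunk) is not None
                  if isinstance(piece, str) else piece(chunk))
            if ok and go(i + 1, end):
                return True
        return False
    return go(0, 0)

# Hypothetical oracle: in practice this could query a database or an LLM.
def is_city(s: str) -> bool:
    return s.strip() in {"Paris", "Tokyo", "Bogota"}

# "moved to <city> in <4-digit year>"
pieces = [r"moved to ", is_city, r" in \d{4}"]
print(semre_match(pieces, "moved to Tokyo in 2021"))   # True
print(semre_match(pieces, "moved to Mars in 2021"))    # False
```

The naive matcher also makes the cost concern visible: it probes the oracle on many candidate substrings, which is exactly the quantity the paper's Ω(|w|²) lower bound on oracle queries speaks to.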
  3. The min-hash sketch is a well-known technique for low-communication approximation of the Jaccard index between two input sets. Moreover, there is a folklore belief that min-hash sketch-based protocols protect the privacy of the inputs. In this paper, we consider variants of private min-hash sketch-based protocols and investigate this folklore belief to quantify the privacy of the min-hash sketch. We begin our investigation by presenting a highly efficient two-party protocol for estimating the Jaccard index while ensuring differential privacy. This protocol adds Laplacian noise to the min-hash sketch counts to provide privacy protection. Then, we aim to understand what privacy, if any, is guaranteed if the results of the min-hash are released without any additional noise, such as in the case of historical data. We first consider the privacy of min-hash in a centralized setting where the hash functions are chosen by the min-hash functionality and are unknown to the participants. We show that in this case the min-hash output satisfies the standard definition of differential privacy (DP) without any additional noise. We next consider a more practical distributed setting, where the hash function must be shared among all parties and is typically public. Unfortunately, we show that in this public hash function setting, the min-hash output is no longer DP. We therefore consider the notion of distributional differential privacy (DDP) introduced by Bassily et al. (FOCS 2013). We show that if the honest party's set has sufficiently high min-entropy, the min-hash output achieves DDP without requiring noise. Our findings provide guidance on how to use the min-hash sketch for private Jaccard index estimation and clarify the extent to which min-hash protocols protect input privacy, refining the common belief in their privacy guarantees.
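To make the shape of the first protocol above concrete, here is a minimal sketch of releasing a noisy Jaccard estimate from min-hash sketches: count the positions where the two sketches agree, add Laplace noise to that count, and divide by the sketch size. The noise scale shown (1/ε, i.e., treating the matching-position count as a sensitivity-1 statistic) and the function names are assumptions for illustration only; the actual mechanism, sensitivity analysis, and two-party protocol are specified in the paper.

```python
import hashlib
import numpy as np

def minhash_sketch(items, k, seed=0):
    """Classical k-hash-function min-hash: keep one minimum per hash function."""
    def h(i, x):
        d = hashlib.blake2b(f"{seed}:{i}:{x}".encode(), digest_size=8).digest()
        return int.from_bytes(d, "big")
    return [min(h(i, x) for x in items) for i in range(k)]

def noisy_jaccard_estimate(sk1, sk2, epsilon, rng=None):
    """Count agreeing sketch positions, perturb that count with Laplace noise,
    and return a clipped Jaccard estimate."""
    rng = rng or np.random.default_rng()
    k = len(sk1)
    matches = sum(int(x == y) for x, y in zip(sk1, sk2))
    noisy = matches + rng.laplace(0.0, 1.0 / epsilon)   # assumed sensitivity of 1
    return float(np.clip(noisy / k, 0.0, 1.0))

s1 = minhash_sketch({"a", "b", "c", "d"}, k=128)
s2 = minhash_sketch({"b", "c", "d", "e"}, k=128)
print(noisy_jaccard_estimate(s1, s2, epsilon=1.0))   # about 0.6 plus noise
```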
  4. We study the query complexity of finding the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) in two-player zero-sum matrix games. Fearnley and Savani [18] showed that for any randomized algorithm, there exists an n × n input matrix for which it needs to query Ω(n²) entries in expectation to compute a single Nash equilibrium. On the other hand, Bienstock et al. [5] showed that there is a special class of matrices for which one can query O(n) entries and compute its set of all Nash equilibria. However, these results do not fully characterize the query complexity of finding the set of all Nash equilibria in two-player zero-sum matrix games. In this work, we characterize the query complexity of finding the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) in terms of the number of rows n of the input matrix \(A \in \mathbb{R}^{n \times n}\), the row support size \(k_1 := |\bigcup_{x \in \mathcal{X}_\ast} \text{supp}(x)|\), and the column support size \(k_2 := |\bigcup_{y \in \mathcal{Y}_\ast} \text{supp}(y)|\). We design a simple yet non-trivial randomized algorithm that returns the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\) by querying at most O(nk⁵ · polylog(n)) entries of the input matrix \(A \in \mathbb{R}^{n \times n}\) in expectation, where k := max{k₁, k₂}. This upper bound is tight up to a factor of poly(k), as we show that for any randomized algorithm, there exists an n × n input matrix with min{k₁, k₂} = 1 for which it needs to query Ω(nk) entries in expectation in order to find the set of all Nash equilibria \(\mathcal{X}_\ast \times \mathcal{Y}_\ast\).
  5. Understanding variations in the routes by which wild animals gain and lose water is challenging, and common methods require longitudinal sampling, which can be prohibitive. However, a new approach uses Δ′¹⁷O_BW (the Δ′¹⁷O of animal body water), calculated from measurements of δ′¹⁷O and δ′¹⁸O in a single sample, as a natural tracer of water flux. Δ′¹⁷O_BW is promising, but its relationship to organismal variables such as metabolic rate and water intake has not been validated. Here, we continuously measured oxygen influxes and effluxes of captive deer mice (Peromyscus maniculatus) and manipulated their water intake and metabolic rate. We used these oxygen flux data to predict Δ′¹⁷O_BW for the mice and compared these model predictions with Δ′¹⁷O_BW measured in blood plasma samples. As expected, Δ′¹⁷O_BW correlated positively with drinking water intake and negatively with metabolic rate. All predicted Δ′¹⁷O_BW values (based on measured oxygen fluxes) differed from measured Δ′¹⁷O_BW values by <30 per meg (mean absolute difference: 11 ± 9 per meg), suggesting high accuracy for this modelling approach, because studies currently report a range of 300 per meg for Δ′¹⁷O_BW among mammals, birds and fish.
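For readers unfamiliar with the triple-oxygen-isotope notation used in the item above, the snippet below shows one common way Δ′¹⁷O is computed from paired δ¹⁷O and δ¹⁸O measurements: logarithmic "prime" delta values combined with a reference slope (here 0.528) and reported in per meg. The slope, the function names, and the example numbers are illustrative assumptions, not necessarily the exact convention or data used in this study.

```python
import math

def delta_prime(delta_permil):
    """Convert a conventional delta value (per mil) to its logarithmic
    'prime' form, also in per mil."""
    return 1000.0 * math.log(delta_permil / 1000.0 + 1.0)

def cap_delta_17O_per_meg(d17O_permil, d18O_permil, lambda_ref=0.528):
    """Δ′17O = δ′17O − λ_ref · δ′18O, converted from per mil to per meg.
    lambda_ref = 0.528 is a commonly used reference slope (an assumption here)."""
    d17p = delta_prime(d17O_permil)
    d18p = delta_prime(d18O_permil)
    return 1000.0 * (d17p - lambda_ref * d18p)

# Illustrative (made-up) body-water values, not data from the study.
print(round(cap_delta_17O_per_meg(-2.50, -4.80), 1))   # ≈ 37.4 per meg
```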