This paper studies the near-duplicate text alignment problem under the constraint of Jaccard similarity. Specifically, given a collection of long texts and a short query text, this problem finds all thesubsequencesin each text whose Jaccard similarities to the query are no smaller than a given threshold. Near-duplicate text alignment is computationally intensive. This is because there are O(n2) subsequences in a text withntokens. To remedy this issue, a few recent studies propose to first generate the min-hash sketch of every subsequence in each text and then find all the subsequences whose min-hash sketches are similar to that of the query. They introduce the concept of compact windows and show that the O(n2k) min-hashes in a text withntokens can be losslessly compressed in compact windows using O(nk) space, wherekis the sketch size. However, the space cost O(nk) is still too high for long texts, especially when the sketch sizekis large. To address this issue, we propose to use One Permutation Hashing (OPH) to generate the min-hash sketch and introduce the concept of OPH compact windows. Although the size of each sketch remains the same, which is O(k), we prove that all the O(n2k) min-hashes generated by OPH in a text withntokens can be losslessly compressed in OPH compact windows using only O(n+k) space. Note the generation of OPH compact windows does not necessitate the enumeration of the O(n2k) min-hashes. Moreover, we develop an algorithm to find all the sketches in a text similar to that of the query directly from OPH compact windows, along with three optimizations.We conduct extensive experiments on three real-world datasets. Empirical results show our proposed algorithms significantly outperformed existing methods in terms of index cost and query latency and scaled well.
more »
« less
This content will become publicly available on June 10, 2026
Membership Testing for Semantic Regular Expressions
This paper is about semantic regular expressions (SemREs). This is a concept that was recently proposed by Smore (Chen et al. 2023) in which classical regular expressions are extended with a primitive to query external oracles such as databases and large language models (LLMs). SemREs can be used to identify lines of text containing references to semantic concepts such as cities, celebrities, political entities, etc. The focus in their paper was on automatically synthesizing semantic regular expressions from positive and negative examples. In this paper, we study themembership testing problem. First, we present a two-pass NFA-based algorithm to determine whether a stringwmatches a SemRErinO(|r|2|w|2+ |r| |w|3) time, assuming the oracle responds to each query in unit time. In common situations, where oracle queries are not nested, we show that this procedure runs inO(|r|2|w|2) time. Experiments with a prototype implementation of this algorithm validate our theoretical analysis, and show that the procedure massively outperforms a dynamic programming-based baseline, and incurs a ≈ 2 × overhead over the time needed for interaction with the oracle. Second, we establish connections between SemRE membership testing and the triangle finding problem from graph theory, which suggest that developing algorithms which are simultaneously practical and asymptotically faster might be challenging. Furthermore, algorithms for classical regular expressions primarily aim to optimize their time and memory consumption. In contrast, an important consideration in our setting is to minimize the cost of invoking the oracle. We demonstrate an Ω(|w|2) lower bound on the number of oracle queries necessary to make this determination.
more »
« less
- PAR ID:
- 10625805
- Publisher / Repository:
- ACM
- Date Published:
- Journal Name:
- Proceedings of the ACM on Programming Languages
- Volume:
- 9
- Issue:
- PLDI
- ISSN:
- 2475-1421
- Page Range / eLocation ID:
- 1245 to 1268
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Range-filtering approximate nearest neighbor search (RFANNS) has gained significant attention recently. Consider a setDof high-dimensional vectors, each associated with a numeric attribute value, e.g., price or timestamp. An RFANNS query consists of a query vectorqand a query range, reporting the approximate nearest neighbors ofqamong data vectors whose attributes fall in the query range. Existing work on RFANNS only considers a static setDof data vectors while in many real-world scenarios, vectors arrive in the system in an arbitrary order. This paper studies dynamic RFANNS where both data vectors and queries arrive in a mixed stream: a query is posed on all the data vectors that have already arrived in the system. Existing work on RFANNS is difficult to be extended to the streaming setting as they construct the index in the order of the attribute values while the vectors arrive in the system in an arbitrary order. The main challenge to the dynamic RFANNS lies in the difference between the two orders. A naive approach to RFANNS maintains multiple hierarchical navigable small-world (HNSW) graphs, one for each of theO(|D|2) possible query ranges - too expensive to construct and maintain. To design an index structure that can integrate new data vectors with a low index size increment for efficient and effective query processing, we propose a structure calleddynamic segment graph.It compresses the set of HNSW graphs of the naive approach, proven to be lossless under certain conditions, with only a linear to log |D| new edges in expectation when inserting a new vector. This dramatically reduces the index size while largely preserving the search performance. We further propose heuristics to significantly reduce the index cost of our dynamic segment graph in practice. Extensive experimental results show that our approach outperforms existing methods for static RFANNS and is scalable in handling dynamic RFANNS.more » « less
-
Lightweight syntactic analysis tools like Semgrep and Comby leverage the tree structure of code, making them more expressive than string and regex search. Unlike traditional language frameworks (e.g., ESLint) that analyze codebases via explicit syntax tree manipulations, these tools use query languages that closely resemble the source language. However, state-of-the-art matching techniques for these tools require queries to be complete and parsable snippets, which makes in-progress query specifications useless. We propose a new search architecture that relies only on tokenizing (not parsing) a query. We introduce a novel language and matching algorithm to support tree-aware wildcards on this architecture by building on tree automata. We also presentstsearch, a syntactic search tool leveraging our approach. In contrast to past work, our approach supports syntactic searcheven for previously unparsable queries.We show empirically that stsea rch can support all tokenizable queries, while still providing results comparable to Semgrep for existing queries. Our work offers evidence that lightweight syntactic code search can accept in-progress specifications, potentially improving support for interactive settings. CCS Concepts: •Software and its engineering→Formal language definitions;Software maintenance tools;•Information systems→Query representation;•Theory of computation→ Tree languages.more » « less
-
We present a novel symbolic reasoning engine for SQL which can efficiently generate an inputIfornqueriesP1, ⋯,Pn, such that their outputs onIsatisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of eachPi— that is, a subset ofPi’s input-output behaviors. While it makes our approach both semantics-aware and lightweight, this idea alone is incomplete (as a fixed under-approximation might miss some behaviors of interest). Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques.more » « less
-
Understanding variations in the routes by which wild animals gain and lose water is challenging, and common methods require longitudinal sampling, which can be prohibitive. However, a new approach usesΔ′17OBW(Δ′17O of animal body water), calculated from measurements ofδ′17O andδ′18O in a single sample, as a natural tracer of water flux.Δ′17OBWis promising, but its relationship to organismal variables such as metabolic rate and water intake have not been validated. Here, we continuously measured oxygen influxes and effluxes of captive deer mice (Peromyscus maniculatus), and manipulated their water intake and metabolic rate. We used these oxygen flux data to predictΔ′17OBWfor the mice and compared these model predictions withΔ′17OBWmeasured in blood plasma samples. As expected,Δ′17OBWpositively correlated with drinking water intake and negatively correlated with metabolic rate. All predictedΔ′17OBW(based on measured oxygen fluxes) values differed from measuredΔ′17OBWvalues by <30 per meg (mean absolute difference: 11 ± 9 per meg), suggesting high accuracy for this modelling approach because studies currently report a range of 300 per meg forΔ′17OBWamong mammals, birds and fish.more » « less
An official website of the United States government
