NSF PAR Search | NSF Public Access Repository

Tag-Filtered Approximate Nearest Neighbor Search

https://doi.org/10.1109/ICDE65448.2025.00272

Luo, Jiarui; Qiao, Miao; Zuo, Chaoji; Deng, Dong (May 2025, IEEE)

Free, publicly-accessible full text available May 19, 2026
Dynamic Range-Filtering Approximate Nearest Neighbor Search

https://doi.org/10.14778/3748191.3748193

Peng, Zhencan; Qiao, Miao; Zhou, Wenchao; Li, Feifei; Deng, Dong (June 2025, Proceedings of the VLDB Endowment)

Range-filtering approximate nearest neighbor search (RFANNS) has gained significant attention recently. Consider a set D of high-dimensional vectors, each associated with a numeric attribute value, e.g., price or timestamp. An RFANNS query consists of a query vector q and a query range, and reports the approximate nearest neighbors of q among data vectors whose attributes fall in the query range. Existing work on RFANNS only considers a static set D of data vectors, while in many real-world scenarios vectors arrive in the system in an arbitrary order. This paper studies dynamic RFANNS, where both data vectors and queries arrive in a mixed stream: a query is posed on all the data vectors that have already arrived in the system. Existing work on RFANNS is difficult to extend to the streaming setting because it constructs the index in the order of the attribute values, while the vectors arrive in the system in an arbitrary order. The main challenge of dynamic RFANNS lies in the difference between the two orders. A naive approach to RFANNS maintains multiple hierarchical navigable small-world (HNSW) graphs, one for each of the O(|D|²) possible query ranges, which is too expensive to construct and maintain. To design an index structure that can integrate new data vectors with a low index size increment for efficient and effective query processing, we propose a structure called the dynamic segment graph. It compresses the set of HNSW graphs of the naive approach, proven to be lossless under certain conditions, adding only a number of new edges linear in log |D| in expectation when inserting a new vector. This dramatically reduces the index size while largely preserving the search performance. We further propose heuristics to significantly reduce the index cost of our dynamic segment graph in practice. Extensive experimental results show that our approach outperforms existing methods for static RFANNS and is scalable in handling dynamic RFANNS.
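To make the dynamic query semantics concrete, here is a minimal sketch of an exact, linear-scan baseline. It is not the dynamic segment graph from the paper. The class name `StreamingRangeFilterBaseline`, its methods, and the choice of Euclidean distance are all assumptions made for illustration. It only shows what a correct answer looks like when vectors and queries arrive in a mixed stream.

```python
import math


class StreamingRangeFilterBaseline:
    """Exact linear-scan reference for dynamic range-filtered search (hypothetical helper)."""

    def __init__(self):
        # (vector, attribute) pairs kept in arrival order, not attribute order.
        self.items = []

    def insert(self, vector, attribute):
        """Add a newly arrived vector and return its arrival index."""
        self.items.append((tuple(vector), attribute))
        return len(self.items) - 1

    def query(self, q, lo, hi, k):
        """Return arrival indices of the k nearest vectors whose attribute lies in [lo, hi]."""
        in_range = [
            (math.dist(vec, q), idx)
            for idx, (vec, attr) in enumerate(self.items)
            if lo <= attr <= hi
        ]
        in_range.sort()
        return [idx for _, idx in in_range[:k]]


index = StreamingRangeFilterBaseline()
index.insert((0.0, 0.0), 30)   # attribute outside the query range used below
index.insert((1.0, 0.0), 10)
first = index.query((0.0, 0.0), 5, 20, k=1)   # only sees vectors that have arrived so far
index.insert((0.1, 0.0), 15)
second = index.query((0.0, 0.0), 5, 20, k=2)  # now also sees the later arrival
```

Each query in this sketch costs time linear in the number of arrived vectors, which is the cost an index is meant to avoid.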
Free, publicly-accessible full text available June 1, 2026
Near-Duplicate Text Alignment with One Permutation Hashing

https://doi.org/10.1145/3677136

Peng, Zhencan; Zhang, Yuheng; Deng, Dong (October 2024, Proceedings of the ACM on Management of Data)

This paper studies the near-duplicate text alignment problem under the constraint of Jaccard similarity. Specifically, given a collection of long texts and a short query text, this problem finds all the subsequences in each text whose Jaccard similarities to the query are no smaller than a given threshold. Near-duplicate text alignment is computationally intensive. This is because there are O(n²) subsequences in a text with n tokens. To remedy this issue, a few recent studies propose to first generate the min-hash sketch of every subsequence in each text and then find all the subsequences whose min-hash sketches are similar to that of the query. They introduce the concept of compact windows and show that the O(n²k) min-hashes in a text with n tokens can be losslessly compressed in compact windows using O(nk) space, where k is the sketch size. However, the space cost O(nk) is still too high for long texts, especially when the sketch size k is large. To address this issue, we propose to use One Permutation Hashing (OPH) to generate the min-hash sketch and introduce the concept of OPH compact windows. Although the size of each sketch remains the same, which is O(k), we prove that all the O(n²k) min-hashes generated by OPH in a text with n tokens can be losslessly compressed in OPH compact windows using only O(n+k) space. Note that the generation of OPH compact windows does not necessitate the enumeration of the O(n²k) min-hashes. Moreover, we develop an algorithm to find all the sketches in a text similar to that of the query directly from OPH compact windows, along with three optimizations. We conduct extensive experiments on three real-world datasets. Empirical results show our proposed algorithms significantly outperformed existing methods in terms of index cost and query latency and scaled well.
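The problem statement can be illustrated with a brute-force sketch that enumerates all O(n²) contiguous subsequences of one text. This is the expensive baseline that motivates sketching, not the OPH-based method of the paper. The function name `jaccard_alignments` is hypothetical, and treating each subsequence as a set of tokens (rather than a multiset) is an assumption.

```python
def jaccard_alignments(text_tokens, query_tokens, threshold):
    """Return half-open (start, end) spans whose token set has Jaccard >= threshold with the query."""
    query = set(query_tokens)
    matches = []
    n = len(text_tokens)
    for start in range(n):
        seen = set()   # distinct tokens in text_tokens[start:end + 1]
        shared = 0     # how many of those tokens also occur in the query
        for end in range(start, n):
            tok = text_tokens[end]
            if tok not in seen:
                seen.add(tok)
                if tok in query:
                    shared += 1
            union = len(seen) + len(query) - shared
            if shared / union >= threshold:
                matches.append((start, end + 1))
    return matches


text = ["a", "b", "c", "d"]
query = ["b", "c"]
exact = jaccard_alignments(text, query, 1.0)
loose = jaccard_alignments(text, query, 0.5)
```

The nested loops visit every (start, end) pair, so the running time is quadratic in the text length even before any sketching cost is counted.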
Full Text Available
SeRF: Segment Graph for Range-Filtering Approximate Nearest Neighbor Search

https://doi.org/10.1145/3639324

Zuo, Chaoji; Qiao, Miao; Zhou, Wenchao; Li, Feifei; Deng, Dong (March 2024, Proceedings of the ACM on Management of Data)

Effective vector representation models, e.g., word2vec and node2vec, embed real-world objects such as images and documents in high dimensional vector space. Meanwhile, the objects are often associated with attributes such as timestamps and prices. Many scenarios need to jointly query the vector representations of the objects together with their attributes. These queries can be formalized as range-filtering approximate nearest neighbor search (ANNS) queries. Specifically, given a collection of data vectors, each associated with an attribute value whose domain has a total order, a range-filtering ANNS query consists of a query range and a query vector. It finds the approximate nearest neighbors of the query vector among all the data vectors whose attribute values fall in the query range. Existing approaches suffer from a rapidly degrading query performance when the query range width shifts. The query performance can be optimized by a solution that builds an ANNS index for every possible query range; however, the index time and index size become prohibitive, since the number of query ranges is quadratic in the number n of data vectors. To overcome these challenges, for query ranges that contain all attribute values smaller than a user-provided threshold, we design a structure called the segment graph whose index time and size are the same as a single ANNS index, yet can losslessly compress the n ANNS indexes, reducing the indexing cost by a factor of Ω(n). To handle general range queries, we propose a 2D segment graph with average-case index size O(n log n) to compress n segment graphs, breaking the quadratic barrier. Extensive experiments conducted on real-world datasets show that our proposed structures outperformed existing methods significantly; our index also exhibits superior scalability.
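The counting argument behind the "quadratic barrier" can be checked with a small sketch. It assumes the n attribute values are distinct, so that every distinct query result corresponds to a pair of ranks in sorted attribute order. The function name `count_query_ranges` is hypothetical.

```python
def count_query_ranges(n):
    """Count distinct non-empty rank ranges over n distinct attribute values."""
    # General ranges [i, j] with 0 <= i <= j < n, enumerated explicitly.
    general = sum(1 for i in range(n) for j in range(i, n))
    # Ranges that contain everything below a threshold are fixed by the right end alone.
    below_threshold = n
    return general, below_threshold


general, below_threshold = count_query_ranges(4)
```

The explicit enumeration agrees with the closed form n(n+1)/2, which is why one index per general range is quadratic while one index per threshold is only linear.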
Full Text Available
Neural Locality Sensitive Hashing for Entity Blocking

Wang, Runhui; Kong, Luyang; Tao, Yefan; Borthwick, Andrew; Golac, Davor; Johnson, Henrik; Hijazi, Shadie; Deng, Dong; Zhang, Yongfeng (April 2024, Proceedings of the 2024 SIAM International Conference on Data Mining (SDM))

Full Text Available
ARKGraph: All-Range Approximate K-Nearest-Neighbor Graph

https://doi.org/10.14778/3603581.3603601

Zuo, Chaoji; Deng, Dong (June 2023, Proceedings of the VLDB Endowment)

Given a collection of vectors, the approximate K-nearest-neighbor graph (KGraph for short) connects every vector to its approximate K-nearest-neighbors (KNN for short). KGraph plays an important role in high dimensional data visualization, semantic search, manifold learning, and machine learning. The vectors are typically vector representations of real-world objects (e.g., images and documents), which often come with a few structured attributes, such as timestamps and locations. In this paper, we study the all-range approximate K-nearest-neighbor graph (ARKGraph) problem. Specifically, given a collection of vectors, each associated with a numerical search key (e.g., a timestamp), we aim to build an index that takes a search key range as the query and returns the KGraph of vectors whose search keys are within the query range. ARKGraph can facilitate interactive high dimensional data visualization, data mining, etc. A key challenge of this problem is the huge index size. This is because, given n vectors, a brute-force index stores a KGraph for every search key range, which results in O(Kn³) index size as there are O(n²) search key ranges and each KGraph takes O(Kn) space. We observe that the KNN of a vector in nearby ranges are often the same, which can be grouped together to save space. Based on this observation, we propose a series of novel techniques that reduce the index size significantly to just O(Kn log n) in the average case. Furthermore, we develop an efficient indexing algorithm that constructs the optimized ARKGraph index directly without exhaustively calculating the distance between every pair of vectors. To process a query, for each vector in the query range, we only need O(log log n + K log K) time to restore its KNN in the query range from the optimized ARKGraph index. We conducted extensive experiments on real-world datasets. Experimental results show that our optimized ARKGraph index achieved a small index size, low query latency, and good scalability. Specifically, our approach was 1000x faster than the baseline method that builds a KGraph for all the vectors in the query range on-the-fly.
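The on-the-fly baseline mentioned above can be sketched as follows. This version computes an exact K-nearest-neighbor graph by comparing every pair of in-range vectors, whereas the abstract refers to approximate graphs. The function name `knn_graph_in_range`, the Euclidean distance, and the tie-breaking by index are assumptions for illustration.

```python
import math


def knn_graph_in_range(vectors, keys, lo, hi, k):
    """Build an exact KNN graph over the vectors whose search key lies in [lo, hi]."""
    in_range = [i for i, key in enumerate(keys) if lo <= key <= hi]
    graph = {}
    for i in in_range:
        # Rank all other in-range vectors by distance to vector i (ties broken by index).
        neighbors = sorted(
            (j for j in in_range if j != i),
            key=lambda j: (math.dist(vectors[i], vectors[j]), j),
        )
        graph[i] = neighbors[:k]
    return graph


vectors = [(0.0,), (1.0,), (3.0,), (10.0,)]
keys = [1, 2, 3, 4]
graph = knn_graph_in_range(vectors, keys, lo=1, hi=3, k=1)
```

Every query repeats the pairwise distance computation for its range, which is the per-query work a precomputed index is designed to avoid.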
Full Text Available
Near-Duplicate Sequence Search at Scale for Large Language Model Memorization Evaluation

https://doi.org/10.1145/3589324

Peng, Zhencan; Wang, Zhizhi; Deng, Dong (June 2023, Proceedings of the ACM on Management of Data)

Recent studies show that large language models (LLMs) unintentionally memorize part of the training data, which brings serious privacy risks. For example, it has been shown that over 1% of tokens generated unprompted by an LLM are part of sequences in the training data. However, current studies mainly focus on the exact memorization behaviors. In this paper, we propose to evaluate how many generated texts have near-duplicates (e.g., only differ by a couple of tokens out of 100) in the training corpus. A major challenge of conducting this evaluation is the huge computation cost incurred by near-duplicate sequence searches. This is because modern LLMs are trained on larger and larger corpora with up to 1 trillion tokens. What's worse is that the number of sequences in a text is quadratic in the text length. To address this issue, we develop an efficient and scalable near-duplicate sequence search algorithm in this paper. It can find (almost) all the near-duplicate sequences of the query sequence in a large corpus with guarantees. Specifically, the algorithm generates and groups the min-hash values of all the sequences with at least t tokens (as very short near-duplicates are often irrelevant noise) in the corpus in time linear in the corpus size. We formally prove that only 2(n+1)/(t+1) - 1 min-hash values are generated for a text with n tokens in expectation. Thus the index time and size are reasonable. When a query arrives, we find all the sequences sharing enough min-hash values with the query using inverted indexes and prefix filtering. Extensive experiments on a few large real-world LLM training corpora show that our near-duplicate sequence search algorithm is efficient and scalable.
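For readers unfamiliar with min-hash, the following is a minimal sketch of a generic min-hash sketch and the similarity estimate obtained by counting shared min-hash values. It does not reproduce the paper's grouping, inverted indexes, or prefix filtering. The helper names, the use of `hashlib.blake2b` with a seed prefix as the hash family, and the sketch size are assumptions.

```python
import hashlib


def _hash(token, seed):
    """Deterministic 64-bit hash of a token under a given seed (assumed hash family)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(tokens, k):
    """Return k min-hash values for a non-empty token sequence, one per seed."""
    distinct = set(tokens)
    return [min(_hash(tok, seed) for tok in distinct) for seed in range(k)]


def estimate_jaccard(sketch_a, sketch_b):
    """Estimate Jaccard similarity as the fraction of positions with equal min-hash values."""
    return sum(x == y for x, y in zip(sketch_a, sketch_b)) / len(sketch_a)


a = minhash_sketch("the quick brown fox jumps".split(), k=64)
b = minhash_sketch("jumps fox brown quick the".split(), k=64)
c = minhash_sketch("an entirely different sentence here".split(), k=64)
same = estimate_jaccard(a, b)      # same token set, different order
disjoint = estimate_jaccard(a, c)  # no tokens in common
```

Two sequences with the same token set always receive identical sketches, while the estimate for partially overlapping sets becomes more accurate as k grows.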
Full Text Available
Long-lived topological time-crystalline order on a quantum processor

https://doi.org/10.1038/s41467-024-53077-9

Xiang, Liang; Jiang, Wenjie; Bao, Zehang; Song, Zixuan; Xu, Shibo; Wang, Ke; Chen, Jiachen; Jin, Feitong; Zhu, Xuhao; Zhu, Zitian; et al (December 2024, Nature Communications)

Topologically ordered phases of matter elude Landau’s symmetry-breaking theory, featuring a variety of intriguing properties such as long-range entanglement and intrinsic robustness against local perturbations. Their extension to periodically driven systems gives rise to exotic new phenomena that are forbidden in thermal equilibrium. Here, we report the observation of signatures of such a phenomenon -- a prethermal topologically ordered time crystal -- with programmable superconducting qubits arranged on a square lattice. By periodically driving the superconducting qubits with a surface code Hamiltonian, we observe discrete time-translation symmetry breaking dynamics that is only manifested in the subharmonic temporal response of nonlocal logical operators. We further connect the observed dynamics to the underlying topological order by measuring a nonzero topological entanglement entropy and studying its subsequent dynamics. Our results demonstrate the potential to explore exotic topologically ordered nonequilibrium phases of matter with noisy intermediate-scale quantum processors.
Free, publicly-accessible full text available December 1, 2025
Extracting Quantum Many-Body Scarred Eigenstates with Matrix Product States

https://doi.org/10.1103/PhysRevLett.131.020402

Zhang, Shun-Yao; Yuan, Dong; Iadecola, Thomas; Xu, Shenglong; Deng, Dong-Ling (July 2023, Physical Review Letters)
Long-lived topological time-crystalline order on a quantum processor

Xiang, Liang; Jiang, Wenjie; Bao, Zehang; Song, Zixuan; Xu, Shibo; Wang, Ke; Chen, Jiachen; Jin, Feitong; Zhu, Xuhao; Zhu, Zitian; et al (January 2024, arXiv)

Topologically ordered phases of matter elude Landau's symmetry-breaking theory, featuring a variety of intriguing properties such as long-range entanglement and intrinsic robustness against local perturbations. Their extension to periodically driven systems gives rise to exotic new phenomena that are forbidden in thermal equilibrium. Here, we report the observation of signatures of such a phenomenon -- a prethermal topologically ordered time crystal -- with programmable superconducting qubits arranged on a square lattice. By periodically driving the superconducting qubits with a surface-code Hamiltonian, we observe discrete time-translation symmetry breaking dynamics that is only manifested in the subharmonic temporal response of nonlocal logical operators. We further connect the observed dynamics to the underlying topological order by measuring a nonzero topological entanglement entropy and studying its subsequent dynamics. Our results demonstrate the potential to explore exotic topologically ordered nonequilibrium phases of matter with noisy intermediate-scale quantum processors.
Full Text Available
