Effective vector representation models, e.g., word2vec and node2vec, embed real-world objects such as images and documents in high-dimensional vector space. Meanwhile, the objects are often associated with attributes such as timestamps and prices. Many scenarios need to jointly query the vector representations of the objects together with their attributes. These queries can be formalized as range-filtering approximate nearest neighbor search (ANNS) queries. Specifically, given a collection of data vectors, each associated with an attribute value whose domain has a total order, a range-filtering ANNS query consists of a query range and a query vector; it finds the approximate nearest neighbors of the query vector among all the data vectors whose attribute values fall in the query range. Existing approaches suffer from rapidly degrading query performance when the query range width shifts. Query performance could be optimized by building an ANNS index for every possible query range; however, the index time and index size become prohibitive, since the number of query ranges is quadratic in the number n of data vectors. To overcome these challenges, for query ranges that contain all attribute values smaller than a user-provided threshold, we design a structure called the segment graph, whose index time and size are the same as those of a single ANNS index, yet which losslessly compresses the n ANNS indexes, reducing the indexing cost by a factor of Ω(n). To handle general range queries, we propose a 2D segment graph with average-case index size O(n log n) to compress n segment graphs, breaking the quadratic barrier. Extensive experiments conducted on real-world datasets show that our proposed structures significantly outperform existing methods; our index also exhibits superior scalability.
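To make the query semantics concrete, the following is a minimal exact baseline for range-filtering nearest neighbor search (a linear scan, not the paper's segment-graph index): filter the data vectors by the attribute range, then rank the survivors by distance to the query vector. All names here are illustrative.

```python
import math

def range_filtered_knn(data, query_vec, lo, hi, k=1):
    """Exact baseline for a range-filtering NN query: keep only vectors
    whose attribute lies in [lo, hi], then rank the survivors by
    Euclidean distance to the query vector."""
    candidates = [(vec, attr) for vec, attr in data if lo <= attr <= hi]
    candidates.sort(key=lambda p: math.dist(p[0], query_vec))
    return candidates[:k]

# Example: vectors tagged with a timestamp-like attribute.
data = [((0.0, 0.0), 5), ((1.0, 1.0), 15), ((0.2, 0.1), 25)]
# Only attributes in [10, 30] qualify, so (1.0, 1.0) and (0.2, 0.1) compete.
print(range_filtered_knn(data, (0.0, 0.0), 10, 30, k=1))  # [((0.2, 0.1), 25)]
```

This scan costs O(n) per query; the paper's contribution is answering the same query via a precomputed graph index without paying that per-query cost or the quadratic cost of one index per range.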
Award ID(s): 2212629
NSF-PAR ID: 10447278
Date Published:
Journal Name: Proceedings of the VLDB Endowment
Volume: 16
Issue: 10
ISSN: 2150-8097
Page Range / eLocation ID: 2645 to 2658
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


We consider the (1+ϵ)-approximate nearest neighbor search problem: given a set X of n points in a d-dimensional space, build a data structure that, given any query point y, finds a point x∈X whose distance to y is at most (1+ϵ)·min_{x'∈X} ‖x'−y‖ for an accuracy parameter ϵ∈(0,1). Our main result is a data structure that occupies only O(ϵ^{−2} n log(n) log(1/ϵ)) bits of space, assuming all point coordinates are integers in the range {−n^{O(1)}, …, n^{O(1)}}, i.e., the coordinates have O(log n) bits of precision. This improves over the best previously known space bound of O(ϵ^{−2} n log^2(n)), obtained via the randomized dimensionality reduction method of Johnson and Lindenstrauss (1984). We also consider the more general problem of estimating all distances from a collection of query points to all data points X, and provide almost tight upper and lower bounds for the space complexity of this problem.
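The (1+ϵ) guarantee above is easy to state as a checkable predicate. This sketch (all names are illustrative, not from the cited paper) tests whether a candidate answer satisfies the guarantee against an exact linear scan:

```python
import math
import random

def brute_force_nn(points, y):
    """Exact nearest neighbor of query y by linear scan."""
    return min(points, key=lambda x: math.dist(x, y))

def is_valid_ann(points, y, candidate, eps):
    """True iff `candidate` satisfies the (1+eps)-approximate
    nearest-neighbor guarantee for query y over `points`."""
    best = math.dist(brute_force_nn(points, y), y)
    return math.dist(candidate, y) <= (1 + eps) * best

random.seed(0)
# Integer coordinates, matching the bounded-precision assumption above.
pts = [(random.randint(-10, 10), random.randint(-10, 10)) for _ in range(50)]
q = (0, 0)
exact = brute_force_nn(pts, q)
assert is_valid_ann(pts, q, exact, eps=0.1)  # the exact answer always qualifies
```

The cited result is about compressing the data structure that produces such candidates, not about this verification step.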

We present a set of parallel algorithms for computing exact k-nearest neighbors in low dimensions. Many k-nearest neighbor algorithms use either a k-d tree or the Morton ordering of the point set; our algorithms combine these approaches using a data structure we call the zd-tree. We show that this combination is both theoretically efficient under common assumptions and fast in practice. For point sets of size n with bounded expansion constant and bounded ratio, the zd-tree can be built in O(n) work with O(n^ε) span for constant ε < 1, and searching for the k-nearest neighbors of a point takes expected O(k log k) time. We benchmark our k-nearest neighbor algorithms against existing parallel k-nearest neighbor algorithms, showing that our implementations are generally faster than the state of the art as well as achieving 75x speedup on 144 hyperthreads. Furthermore, the zd-tree supports parallel batch-dynamic insertions and deletions; to our knowledge, it is the first k-nearest neighbor data structure to support such updates. On point sets with bounded expansion constant and bounded ratio, a batch-dynamic update of size k requires O(k log n/k) work with O(k^ε + polylog(n)) span.

We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration, where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of O(n + k log k), i.e., the best known complexity for equi-joins, and even of O(n + k), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first nontrivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
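The core idea of ranked enumeration, producing the k best answers without materializing and sorting the full output, can be illustrated with a frontier heap over two weight-sorted lists (a toy stand-in for the paper's factorized representation; the ranking function here is the sum of the two weights, and all names are illustrative):

```python
import heapq

def ranked_join(r, s, k):
    """Lazily enumerate the k lowest-weight results of the cross
    product of two lists, ranked by sum of weights. Each answer is
    popped from a small frontier heap instead of sorting all |r|*|s|
    combinations up front."""
    r, s = sorted(r), sorted(s)
    heap = [(r[0] + s[0], 0, 0)]   # (weight, index into r, index into s)
    seen = {(0, 0)}
    out = []
    while heap and len(out) < k:
        w, i, j = heapq.heappop(heap)
        out.append(w)
        # Expand the frontier: the two successors of (i, j).
        for ni, nj in ((i + 1, j), (i, j + 1)):
            if ni < len(r) and nj < len(s) and (ni, nj) not in seen:
                seen.add((ni, nj))
                heapq.heappush(heap, (r[ni] + s[nj], ni, nj))
    return out

print(ranked_join([1, 3, 7], [2, 4], k=3))  # [3, 5, 5]
```

The heap holds O(k) entries when k answers are requested, mirroring how ranked enumeration avoids the n^ℓ blow-up; the paper's contribution is extending this incremental style to inequality join predicates with the stated guarantees.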

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T_1, T_2, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log^*(n/B)) I/Os, where B is the block size. The second one takes O(n log^*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.