
Ranked Document Retrieval in External Memory

https://doi.org/10.1145/3559763

Shah, Rahul; Sheng, Cheng; Thankachan, Sharma; Vitter, Jeffrey (January 2023, ACM Transactions on Algorithms)

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T₁, T₂, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in the sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log^*(n/B)) I/Os, where B is the block size. The second one takes O(n log^*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in the unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
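To make the query semantics concrete, the following is a minimal brute-force sketch of a top-k query (P, k), not any of the data structures from the paper. It assumes relevance is measured by term frequency (the number of occurrences of P in a document), which the abstract does not specify; the function names `count_occurrences` and `top_k_documents` and the 0-based document identifiers are illustrative choices only. It scans every document on each query, so it has none of the space or I/O guarantees stated above.

```python
import heapq
from typing import List


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, including overlapping ones."""
    count, start = 0, 0
    while True:
        pos = text.find(pattern, start)
        if pos == -1:
            return count
        count += 1
        start = pos + 1  # advance by one so overlapping matches are counted


def top_k_documents(documents: List[str], pattern: str, k: int) -> List[int]:
    """Return up to k document ids, most relevant first.

    Assumption: relevance = term frequency of the pattern in the document.
    Documents with no occurrence are never reported; ties go to the
    smaller document id.
    """
    if not pattern:
        raise ValueError("pattern must have length p >= 1")
    if k < 1:
        raise ValueError("k must be at least 1")
    matches = []
    for doc_id, doc in enumerate(documents):
        score = count_occurrences(doc, pattern)
        if score > 0:
            matches.append((score, doc_id))
    # nlargest returns results in descending key order (sorted top-k).
    best = heapq.nlargest(k, matches, key=lambda m: (m[0], -m[1]))
    return [doc_id for _, doc_id in best]


if __name__ == "__main__":
    docs = ["abab", "ab", "ba", "ababab"]
    print(top_k_documents(docs, "ab", 2))
```

This baseline returns answers in sorted order of relevance, which corresponds to the sorted variant of the problem; the unsorted variant would only require the same set of identifiers in any order.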

