Ranked Document Retrieval in External Memory

Shah, Rahul; Sheng, Cheng; Thankachan, Sharma; Vitter, Jeffrey

doi:10.1145/3559763

The ranked (or top-k) document retrieval problem is defined as follows: preprocess a collection {T₁, T₂, …, T_d} of d strings (called documents) of total length n into a data structure, such that for any given query (P, k), where P is a string (called pattern) of length p ≥ 1 and k ∈ [1, d] is an integer, the identifiers of those k documents that are most relevant to P can be reported, ideally in sorted order of their relevance. The seminal work by Hon et al. [FOCS 2009 and Journal of the ACM 2014] presented an O(n)-space (in words) data structure with O(p + k log k) query time. The query time was later improved to O(p + k) [SODA 2012] and further to O(p/log_σ n + k) [SIAM Journal on Computing 2017] by Navarro and Nekrich, where σ is the alphabet size. We revisit this problem in the external memory model and present three data structures. The first one takes O(n) space and answers queries in O(p/B + log_B n + k/B + log^*(n/B)) I/Os, where B is the block size. The second one takes O(n log^*(n/B)) space and answers queries in optimal O(p/B + log_B n + k/B) I/Os. In both cases, the answers are reported in unsorted order of relevance. To handle sorted top-k document retrieval, we present an O(n log(d/B))-space data structure with optimal query cost.
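To make the problem statement concrete, the following is a minimal brute-force sketch of a top-k query. It is not one of the data structures from the paper: it scans every document on each query and has none of the stated space or I/O guarantees. The abstract does not fix a relevance measure, so the sketch assumes relevance is the number of (possibly overlapping) occurrences of P in a document; the function names `count_occurrences` and `top_k_documents` are hypothetical, as is the choice to break ties by smaller identifier and to skip documents that do not contain P.

```python
from typing import List


def count_occurrences(text: str, pattern: str) -> int:
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position so overlapping matches are counted.
        start = text.find(pattern, start + 1)
    return count


def top_k_documents(documents: List[str], pattern: str, k: int) -> List[int]:
    """Return 1-based identifiers of up to k documents, most relevant first.

    Assumption: relevance = occurrence count of the pattern (term frequency).
    Documents with zero occurrences are not reported.
    """
    scored = []
    for index, doc in enumerate(documents):
        score = count_occurrences(doc, pattern)
        if score > 0:
            scored.append((score, index + 1))
    # Sort by decreasing relevance; ties go to the smaller identifier.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


if __name__ == "__main__":
    docs = ["abracadabra", "banana", "cabana", "xyz"]
    print(top_k_documents(docs, "a", 2))
    print(top_k_documents(docs, "ana", 1))
```

Each query in this baseline costs time proportional to the total length n of the collection, which is the cost the indexed structures described above are designed to avoid.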

Award ID(s): 2137057

PAR ID: 10493865

Author(s) / Creator(s): Shah, Rahul; Sheng, Cheng; Thankachan, Sharma; Vitter, Jeffrey

Publisher / Repository: ACM

Date Published: 2023-01-31

Journal Name: ACM Transactions on Algorithms

Volume: 19

Issue: 1

ISSN: 1549-6325

Page Range / eLocation ID: 1 to 12

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article: https://doi.org/10.1145/3559763
