Query-driven Segment Selection for Ranking Long Documents

Kim, Youngwoo; Rahimi, Razieh; Bonab, Hamed; Allan, James

doi:10.1145/3459637.3482101

Transformer-based rankers have shown state-of-the-art performance, but their self-attention operation is mostly unable to process long sequences. One common approach to training these rankers is to heuristically select some segments of each document, such as the first segment, as training data. However, these segments may not contain the query-related parts of the documents. To address this problem, we propose query-driven segment selection from long documents to build training data for transformer-based rankers. The segment selector provides relevant samples with more accurate labels and non-relevant samples that are harder to predict. The experimental results show that a basic BERT-based ranker trained with the proposed segment selector significantly outperforms the same ranker trained on heuristically selected segments, and performs on par with the state-of-the-art model with localized self-attention that can process longer input sequences. We also demonstrate that, when training with our segment selector, there is little gain from feeding input sequences longer than 200 words. Our findings open up new opportunities to design efficient transformer-based rankers.
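The contrast drawn in the abstract, between taking the first segment and taking a segment chosen with respect to the query, can be illustrated with a minimal sketch. The abstract does not say how the selector scores segments, so the term-overlap scorer below is only a stand-in assumption and not the authors' method. The function names `split_into_segments`, `overlap_score`, and `select_segment` are hypothetical, and the default window of 200 words is borrowed from the abstract's observation about input length.

```python
from typing import List, Tuple


def split_into_segments(text: str, segment_len: int = 200) -> List[str]:
    """Split a document into consecutive windows of at most `segment_len` words."""
    words = text.split()
    return [
        " ".join(words[i:i + segment_len])
        for i in range(0, len(words), segment_len)
    ]


def overlap_score(query: str, segment: str) -> float:
    """Hypothetical scorer: fraction of distinct query terms found in the segment.

    This is an assumption made for illustration; the record does not describe
    the scoring function used by the proposed selector.
    """
    query_terms = set(query.lower().split())
    if not query_terms:
        return 0.0
    segment_terms = set(segment.lower().split())
    return len(query_terms & segment_terms) / len(query_terms)


def select_segment(query: str, document: str, segment_len: int = 200) -> Tuple[int, str]:
    """Return (index, text) of the highest-scoring segment; ties go to the earliest."""
    segments = split_into_segments(document, segment_len)
    if not segments:
        return 0, ""
    best = max(range(len(segments)), key=lambda i: overlap_score(query, segments[i]))
    return best, segments[best]


# Toy document whose query-related text appears only near the end.
doc = "intro text about cooking " * 3 + "neural ranking of long documents"
first_segment = split_into_segments(doc, segment_len=4)[0]
index, chosen = select_segment("long documents ranking", doc, segment_len=4)
```

In this toy case the first-segment heuristic yields text unrelated to the query, while the query-driven choice lands on a later window. A training pipeline along the lines the abstract describes would pair the chosen segment with the document's relevance label.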

Award ID(s): 1813662

PAR ID: 10357770

Author(s) / Creator(s): Kim, Youngwoo; Rahimi, Razieh; Bonab, Hamed; Allan, James

Date Published: 2021-10-26

Journal Name: Proceedings of The 30th ACM International Conference on Information and Knowledge Management (CIKM '21)

Page Range / eLocation ID: 3147 to 3151

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1145/3459637.3482101