- Award ID(s):
- 1815528
- PAR ID:
- 10337220
- Date Published:
- Journal Name:
- CIKM '21: Proceedings of the 30th ACM International Conference on Information & Knowledge Management
- Page Range / eLocation ID:
- 3592 to 3596
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- While dense retrieval has been shown to be effective and efficient across tasks and languages, it remains difficult to create effective fully zero-shot dense retrieval systems when no relevance labels are available. In this paper, we recognize the difficulty of zero-shot learning and encoding relevance. Instead, we propose to pivot through Hypothetical Document Embeddings (HyDE). Given a query, HyDE first zero-shot prompts an instruction-following language model (e.g., InstructGPT) to generate a hypothetical document. The document captures relevance patterns but is “fake” and may contain hallucinations. Then, an unsupervised contrastively learned encoder (e.g., Contriever) encodes the document into an embedding vector. This vector identifies a neighborhood in the corpus embedding space, from which similar real documents are retrieved based on vector similarity. This second step grounds the generated document to the actual corpus, with the encoder’s dense bottleneck filtering out the hallucinations. Our experiments show that HyDE significantly outperforms the state-of-the-art unsupervised dense retriever Contriever and shows strong performance comparable to fine-tuned retrievers across various tasks (e.g., web search, QA, fact verification) and in non-English languages (e.g., sw, ko, ja, bn).
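The retrieval flow described above (generate a hypothetical document, encode it, then search the real corpus) is simple to sketch end to end. The snippet below is a rough illustration rather than the authors' released code; the placeholder LLM call, the facebook/contriever checkpoint with mean pooling, and the FAISS inner-product search are assumptions made for the example.

```python
# Minimal HyDE-style retrieval sketch (illustrative, not the authors' code).
# Assumes: an instruction-following LLM behind `generate_hypothetical_document`,
# the facebook/contriever checkpoint from Hugging Face, and FAISS for search.
import faiss
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/contriever")
encoder = AutoModel.from_pretrained("facebook/contriever")

def mean_pool(last_hidden_state, attention_mask):
    # Contriever uses mean pooling over token embeddings.
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1)

def embed(texts):
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = encoder(**batch)
    return mean_pool(out.last_hidden_state, batch["attention_mask"]).numpy()

def generate_hypothetical_document(query: str) -> str:
    # Placeholder for a zero-shot call to an instruction-following LLM,
    # e.g. prompted with "Write a passage that answers the question: {query}".
    raise NotImplementedError

def hyde_search(query, corpus, k=10):
    # Index the real corpus once (inner product on Contriever vectors).
    doc_vecs = embed(corpus)
    index = faiss.IndexFlatIP(doc_vecs.shape[1])
    index.add(doc_vecs)
    # Encode the hypothetical document instead of the query, then ground it
    # in the corpus: the nearest real documents filter out hallucinated details.
    hypo = generate_hypothetical_document(query)
    _, ids = index.search(embed([hypo]), k)
    return [corpus[i] for i in ids[0]]
```

Pre-encoding and indexing the corpus once is what keeps the approach practical; only the generation step and one encoder call happen per search.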
- Ganguly, Debasis; Gangopadhyay, Surupendu; Mitra, Mandar; Majumder, Prasenjit (Ed.) For most queries, the set of relevant documents spans multiple subtopics. Inspired by neural ranking models and query-specific neural clustering models, we develop Topic-Mono-BERT, which performs both tasks jointly. Based on the text embeddings of BERT, our model learns a shared embedding that is optimized for both tasks. The clustering hypothesis suggests that embeddings which place topically similar text in close proximity will also perform better on ranking tasks. Our model is trained with the Wikimarks approach to obtain training signals for relevance and subtopics on the same queries. Our task is to identify overview passages that can be used to construct a succinct answer to the query. Our empirical evaluation on two publicly available passage retrieval datasets suggests that including the clustering supervision in the ranking model leads to about a 16% improvement in identifying text passages that summarize different subtopics within a query.
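A schematic of a joint objective may help make the idea concrete. The following sketch is not the paper's implementation; the two heads on a shared BERT encoder, the pointwise ranking loss, and the contrastive surrogate for the clustering objective are all assumptions made for illustration.

```python
# Schematic joint ranking + clustering objective in the spirit of Topic-Mono-BERT.
# Heads, loss weighting, and the pointwise ranking loss are assumptions.
import torch
import torch.nn.functional as F
from torch import nn
from transformers import AutoModel

class JointRankClusterModel(nn.Module):
    def __init__(self, name="bert-base-uncased", emb_dim=128):
        super().__init__()
        self.bert = AutoModel.from_pretrained(name)
        hidden = self.bert.config.hidden_size
        self.rank_head = nn.Linear(hidden, 1)         # query-passage relevance score
        self.embed_head = nn.Linear(hidden, emb_dim)  # shared embedding used for clustering

    def forward(self, input_ids, attention_mask):
        cls = self.bert(input_ids=input_ids,
                        attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.rank_head(cls).squeeze(-1), F.normalize(self.embed_head(cls), dim=-1)

def joint_loss(scores, rel_labels, embeddings, subtopic_ids, alpha=0.5):
    # Ranking term: binary relevance supervision (Wikimarks-style labels assumed).
    rank_loss = F.binary_cross_entropy_with_logits(scores, rel_labels.float())
    # Clustering term: pull passages of the same subtopic together and push
    # others apart; a simple contrastive surrogate for the clustering objective.
    sim = embeddings @ embeddings.T
    same = (subtopic_ids.unsqueeze(0) == subtopic_ids.unsqueeze(1)).float()
    cluster_loss = F.binary_cross_entropy_with_logits(sim, same)
    return rank_loss + alpha * cluster_loss
```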
- This paper studies a category of visual question answering tasks in which accessing external knowledge is necessary for answering the questions. This category is called outside-knowledge visual question answering (OK-VQA). A major step in developing OK-VQA systems is to retrieve relevant documents for the given multimodal query. The current state-of-the-art dense retrieval model for this task uses an asymmetric architecture with a multi-modal query encoder and a uni-modal document encoder. Such an architecture requires a large amount of training data for effective performance. We propose an automatic data generation pipeline for pre-training passage retrieval models for OK-VQA tasks. The proposed approach leads to a 26.9% Precision@5 improvement over the current state-of-the-art. Additionally, the proposed pre-training approach exhibits good ability in zero-shot retrieval scenarios.
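The asymmetric dual-encoder setup mentioned above can be sketched as follows. The specific backbones (CLIP for the image side, BERT for text), the fusion by concatenation, and the projection dimension are illustrative assumptions, not the model described in the paper.

```python
# Sketch of an asymmetric dual encoder for OK-VQA retrieval: a multi-modal
# query encoder (image + question) paired with a text-only passage encoder,
# scored by inner product so passages can be indexed offline.
import torch
from torch import nn
from transformers import AutoModel, CLIPVisionModel

class MultiModalQueryEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.image_enc = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
        self.text_enc = AutoModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.image_enc.config.hidden_size
                              + self.text_enc.config.hidden_size, dim)

    def forward(self, pixel_values, input_ids, attention_mask):
        img = self.image_enc(pixel_values=pixel_values).pooler_output
        txt = self.text_enc(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state[:, 0]
        # Simple fusion by concatenation followed by a linear projection.
        return self.proj(torch.cat([img, txt], dim=-1))

class PassageEncoder(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        self.text_enc = AutoModel.from_pretrained("bert-base-uncased")
        self.proj = nn.Linear(self.text_enc.config.hidden_size, dim)

    def forward(self, input_ids, attention_mask):
        cls = self.text_enc(input_ids=input_ids,
                            attention_mask=attention_mask).last_hidden_state[:, 0]
        return self.proj(cls)

def score(query_vec, passage_vecs):
    # Relevance is the inner product between query and passage embeddings.
    return query_vec @ passage_vecs.T
```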
- Neural networks provide new possibilities to automatically learn complex language patterns and query-document relations. Neural IR models have achieved promising results in learning query-document relevance patterns, but few explorations have been done on understanding the text content of a query or a document. This paper studies leveraging a recently proposed contextual neural language model, BERT, to provide deeper text understanding for IR. Experimental results demonstrate that the contextual text representations from BERT are more effective than traditional word embeddings. Compared to bag-of-words retrieval models, the contextual language model can better leverage language structures, bringing large improvements on queries written in natural languages. Combining the text understanding ability with search knowledge leads to an enhanced pre-trained BERT model that can benefit related search tasks where training data are limited.
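As a concrete example of using BERT's contextual representations for ranking, below is a minimal cross-encoder re-ranker in which the query and document are encoded jointly and the [CLS] vector is mapped to a relevance score. The checkpoint name and the single linear scoring head are assumptions for the sketch, not the exact model from the paper.

```python
# Minimal BERT cross-encoder re-ranker sketch (illustrative assumptions only).
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

class BertReRanker(nn.Module):
    def __init__(self, name="bert-base-uncased"):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(name)
        self.bert = AutoModel.from_pretrained(name)
        self.scorer = nn.Linear(self.bert.config.hidden_size, 1)

    @torch.no_grad()
    def rerank(self, query, documents):
        # Encode each (query, document) pair jointly so self-attention can model
        # query-document interactions, unlike bag-of-words matching.
        batch = self.tokenizer([query] * len(documents), documents,
                               padding=True, truncation=True, return_tensors="pt")
        cls = self.bert(**batch).last_hidden_state[:, 0]
        scores = self.scorer(cls).squeeze(-1)
        order = torch.argsort(scores, descending=True)
        return [(documents[i], scores[i].item()) for i in order]
```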
- Previous work on privacy-aware ranking has addressed the minimization of information leakage when scoring the top-k documents, but has not studied how to retrieve these top documents and their features for ranking. This paper proposes a privacy-aware document retrieval scheme with a two-level inverted index structure. In this scheme, posting records are grouped with bucket tags, and runtime query processing produces query-specific tags in order to gather encoded features of matched documents with privacy protection during index traversal. To thwart leakage-abuse attacks, our design minimizes the chance that a server processes unauthorized queries or identifies document sharing across posting lists through index inspection or across-query association. This paper presents the evaluation and analytic results of the proposed scheme to demonstrate the tradeoffs in its design considerations for privacy, efficiency, and relevance.
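The two-level layout can be sketched structurally, leaving out the parts that actually provide the privacy guarantees. In the toy code below, the keyed hash used to derive bucket tags, the bucket size, and the shape of the posting records are all stand-ins; the scheme's feature encoding and tag derivation are not reproduced here.

```python
# Structural sketch of a two-level inverted index: posting records are grouped
# into tagged buckets, and a query only touches buckets whose tags it is
# authorized to derive. Illustrative only; no cryptographic protections here.
import hashlib
from collections import defaultdict

def bucket_tag(term: str, bucket_id: int, key: bytes = b"demo-key") -> str:
    # Stand-in for a keyed, query-specific tag; a real scheme derives tags so
    # the server cannot link buckets across unauthorized queries.
    return hashlib.sha256(key + term.encode() + bucket_id.to_bytes(4, "big")).hexdigest()

class TwoLevelIndex:
    def __init__(self, bucket_size: int = 4):
        self.bucket_size = bucket_size
        # First level: term -> list of (tag, bucket) pairs.
        # Second level: each bucket is a list of (doc_id, encoded_features) records.
        self.index: dict[str, list[tuple[str, list[tuple[int, bytes]]]]] = defaultdict(list)

    def add_postings(self, term: str, postings: list[tuple[int, bytes]]) -> None:
        # Split the posting list into fixed-size buckets, each labeled with a tag.
        for b, start in enumerate(range(0, len(postings), self.bucket_size)):
            chunk = postings[start:start + self.bucket_size]
            self.index[term].append((bucket_tag(term, b), chunk))

    def lookup(self, term: str, allowed_tags: set[str]) -> list[tuple[int, bytes]]:
        # Return only records from buckets whose tags the query could produce,
        # instead of exposing the entire posting list during traversal.
        return [rec for tag, chunk in self.index[term]
                if tag in allowed_tags for rec in chunk]
```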