

Title: CSurF: Sparse Lexical Retrieval through Contextualized Surface Forms
Lexical exact-match systems perform text retrieval efficiently, using sparse matching signals and fast lookup through inverted lists, but they suffer from the mismatch between lexical surface forms and implicit term semantics. This paper proposes to directly bridge the surface-form space and the term-semantics space in lexical exact-match retrieval via contextualized surface forms (CSFs). Each CSF pairs a lexical surface form with a context source, and is represented by a lexical form weight and a contextualized semantic vector representation. The framework performs sparse lexicon-based retrieval by learning to represent each query and document as a "bag of CSFs", simultaneously addressing two key factors in sparse retrieval: vocabulary expansion of surface forms and semantic representation of term meaning. At retrieval time, it efficiently matches CSFs through exact match of learned surface forms, and effectively scores each CSF pair via contextualized semantic representations, yielding joint improvements in both term matching and term scoring. Multiple experiments show that this approach resolves the main mismatch issues in lexical exact-match retrieval and outperforms state-of-the-art lexical exact-match systems, reaching accuracy comparable to lexical all-to-all soft-match systems while retaining exact-match efficiency.
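To make the retrieval mechanics concrete, below is a minimal sketch of the "bag-of-CSFs" scoring the abstract describes: exact match on learned surface forms selects candidate postings from an inverted index, and each matched pair is then scored with the lexical weights and contextual vectors. The data layout and the exact way weights and dot products combine are illustrative assumptions, not the paper's formulation.

```python
import numpy as np
from collections import defaultdict

class CSF:
    """One contextualized surface form: lexical form, form weight, semantic vector."""
    def __init__(self, form: str, weight: float, vec: np.ndarray):
        self.form, self.weight, self.vec = form, weight, vec

def build_inverted_index(docs):
    # Posting lists keyed by surface form, as in classical lexical retrieval,
    # but each posting also carries a weight and a contextual vector.
    index = defaultdict(list)
    for doc_id, csfs in docs.items():
        for c in csfs:
            index[c.form].append((doc_id, c.weight, c.vec))
    return index

def search(query_csfs, index, top_k=10):
    scores = defaultdict(float)
    for q in query_csfs:
        for doc_id, d_weight, d_vec in index.get(q.form, []):
            # Exact match on the surface form gates the pair;
            # the contextual vectors then score it.
            scores[doc_id] += q.weight * d_weight * float(q.vec @ d_vec)
    return sorted(scores.items(), key=lambda kv: -kv[1])[:top_k]
```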
Award ID(s):
1815528
PAR ID:
10479604
Author(s) / Creator(s):
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400700736
Page Range / eLocation ID:
65 to 75
Format(s):
Medium: X
Location:
Taipei Taiwan
Sponsoring Org:
National Science Foundation
More Like this
  1. Lexical exact-match systems that use inverted lists are a fundamental text retrieval architecture. A recent advance in neural IR, COIL, extends this approach with contextualized inverted lists from a deep language model backbone and performs retrieval by comparing contextualized query-document term representations, which is effective but computationally expensive. This paper explores the effectiveness-efficiency tradeoff in COIL-style systems, aiming to reduce the computational complexity of retrieval while preserving term semantics. It proposes COILcr, which explicitly factorizes COIL into intra-context term importance weights and cross-context semantic representations. At indexing time, COILcr further maps term semantic representations to a smaller set of canonical representations. Experiments demonstrate that canonical representations can efficiently preserve term semantics, reducing the storage and computational cost of COIL-based retrieval while maintaining model performance. The paper also compares multiple heuristics for canonical representation selection and examines their performance in different retrieval settings.
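As a rough illustration of the factorization described above, the sketch below splits each contextualized term vector into a scalar importance weight and a unit-norm semantic direction, then snaps the directions to a small set of canonical vectors via k-means (one possible selection heuristic; the paper compares several). The function names and the norm-based split are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def factorize(term_vecs: np.ndarray):
    # Split each contextualized vector into a scalar importance weight
    # (here, its L2 norm) and a direction carrying cross-context semantics.
    weights = np.linalg.norm(term_vecs, axis=1, keepdims=True)
    directions = term_vecs / np.clip(weights, 1e-9, None)
    return weights.squeeze(1), directions

def canonicalize(directions: np.ndarray, n_canonical: int = 8):
    # Map each direction to its nearest canonical vector, so the inverted
    # list can store (weight, canonical_id) per posting instead of a full
    # float vector.
    km = KMeans(n_clusters=n_canonical).fit(directions)
    return km.cluster_centers_, km.labels_
```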
  2. Classical information retrieval systems such as BM25 rely on exact lexical match and carry out search efficiently with an inverted list index. Recent neural IR models shift towards soft semantic matching of all query-document terms, but they lose the computational efficiency of exact-match systems. This paper presents COIL, a contextualized exact-match retrieval architecture that brings in semantic lexical matching. COIL scoring is based on overlapping query-document tokens' contextualized representations. The new architecture stores contextualized token representations in inverted lists, bringing together the efficiency of exact match and the representation power of deep language models. Our experimental results show COIL outperforms classical lexical retrievers and state-of-the-art deep LM retrievers with similar or smaller latency.
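A minimal sketch of the token-level scoring described above: for each query token that also appears in the document, take the best dot product between the query token's contextualized vector and the document-side vectors stored for that token, then sum over query tokens. Full COIL also adds a CLS-level match, omitted here, and the data structures are simplified assumptions.

```python
import numpy as np

def coil_score(query_tok_vecs, doc_inverted) -> float:
    """query_tok_vecs: token -> (dim,) contextualized vector per query token.
    doc_inverted: token -> (occurrences, dim) vectors for one document."""
    score = 0.0
    for tok, q_vec in query_tok_vecs.items():
        d_vecs = doc_inverted.get(tok)
        if d_vecs is not None:
            # Max over this token's occurrences in the document.
            score += float(np.max(d_vecs @ q_vec))
    return score
```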
  3. This paper presents CLEAR, a retrieval model that seeks to complement classical lexical exact-match models such as BM25 with semantic matching signals from a neural embedding matching model. With a novel residual-based embedding learning method, CLEAR explicitly trains the neural embedding to encode language structures and semantics that lexical retrieval fails to capture. Empirical evaluations demonstrate the advantages of CLEAR over state-of-the-art retrieval models, and that it can substantially improve the end-to-end accuracy and efficiency of reranking pipelines.
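One way to read "residual-based embedding learning" is as a hinge loss whose margin shrinks when the lexical retriever already separates the positive from the negative document, pushing the embedding model to focus on what exact match misses. The sketch below is a hedged interpretation in that spirit; the constants and exact margin form are assumptions, not CLEAR's published loss.

```python
import torch

def residual_hinge_loss(emb_pos, emb_neg, lex_pos, lex_neg,
                        base_margin: float = 1.0, lam: float = 0.5):
    # Margin is reduced by the score gap the lexical model (e.g., BM25)
    # already achieves, so the embedding learns the residual signal.
    margin = base_margin - lam * (lex_pos - lex_neg)
    return torch.clamp(margin - (emb_pos - emb_neg), min=0.0).mean()
```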
  4. Tulving characterized semantic memory as a vast repository of meaning that underlies language and many other cognitive processes. This perspective on lexical and conceptual knowledge galvanized a new era of research undertaken by numerous fields, each with their own idiosyncratic methods and terminology. For example, "concept" has different meanings in philosophy, linguistics, and psychology. As such, many fundamental constructs used to delineate semantic theories remain underspecified and/or opaque. Weak construct specificity is among the leading causes of the replication crisis now facing psychology and related fields. Term ambiguity hinders cross-disciplinary communication, falsifiability, and incremental theory-building. Numerous cognitive subdisciplines (e.g., vision, affective neuroscience) have recently addressed these limitations via the development of consensus-based guidelines and definitions. The project to follow represents our effort to produce a multidisciplinary semantic glossary consisting of succinct definitions, background, principled dissenting views, ratings of agreement, and subjective confidence for 17 target constructs (e.g., abstractness, abstraction, concreteness, concept, embodied cognition, event semantics, lexical-semantic, modality, representation, semantic control, semantic feature, simulation, semantic distance, semantic dimension). We discuss potential benefits and pitfalls (e.g., implicit bias, prescriptiveness) of these efforts to specify a common nomenclature that other researchers might index in specifying their own theoretical perspectives (e.g., They said X, but I mean Y).
  5. Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While edge-aware graph neural networks exist, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node's ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating their efficacy in learning edge and node representations, respectively.
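As an illustration of the node-side aggregation described above, the sketch below pools the contextualized representations of a node's incident edges with a learned attention query. Module names and dimensions are assumptions, not the released Edgeformers code.

```python
import torch
import torch.nn as nn

class EgoEdgeAttention(nn.Module):
    """Attention-weighted pooling of edge representations in a node's ego-graph."""
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))  # learned node-level query

    def forward(self, edge_reprs: torch.Tensor) -> torch.Tensor:
        # edge_reprs: (num_edges, dim) contextualized edge representations.
        attn = torch.softmax(edge_reprs @ self.query, dim=0)
        return attn @ edge_reprs  # (dim,) node representation
```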