Table retrieval is the task of extracting the most relevant tables to answer a user's query. Table retrieval is an important task because many domains have tables that contain useful information in a structured form. Given a user's query, the goal is to obtain a relevance ranking for query-table pairs, such that higher ranked tables should be more relevant to the query. In this paper, we present a context-aware table retrieval method that is based on a novel embedding for attribute tokens. We find that differentiated types of contexts are useful in building word embeddings. We also find that including a specialized representation of numerical cell values in our model improves table retrieval performance. We use the trained model to predict different contexts of every table. We show that the predicted contexts are useful in ranking tables against a query using a multi-field ranking approach. We evaluate our approach using public WikiTables data, and we demonstrate improvements in terms of NDCG over unsupervised baseline methods in the table retrieval task.
more »
« less
WTR: A Test Collection for Web Table Retrieval
We describe the development, characteristics and availability of a test collection for the task of Web table retrieval, which uses a large-scale Web Table Corpora extracted from the Common Crawl. Since a Web table usually has rich context information such as the page title and surrounding paragraphs, we not only provide relevance judgments of query-table pairs, but also the relevance judgments of query-table context pairs with respect to a query, which are ignored by previous test collections. To facilitate future research with this benchmark, we provide details about how the dataset is pre-processed and also baseline results from both traditional and recently proposed table retrieval methods. Our experimental results show that proper usage of context labels can benefit previous table retrieval methods.
more »
« less
- Award ID(s):
- 1816325
- PAR ID:
- 10279625
- Date Published:
- Journal Name:
- Proceedings of 44th International ACM SIGIR Conference on Research and Development in Information Retrieval
- Page Range / eLocation ID:
- 2514 to 2520
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
A search engine's ability to retrieve desirable datasets is important for data sharing and reuse. Existing dataset search engines typically rely on matching queries to dataset descriptions. However, a user may not have enough prior knowledge to write a query using terms that match with description text. We propose a novel schema label generation model which generates possible schema labels based on dataset table content. We incorporate the generated schema labels into a mixed ranking model which not only considers the relevance between the query and dataset metadata but also the similarity between the query and generated schema labels. To evaluate our method on real-world datasets, we create a new benchmark specifically for the dataset retrieval task. Experiments show that our approach can effectively improve the precision and NDCG scores of the dataset retrieval task compared with baseline methods. We also test on a collection of Wikipedia tables to show that the features generated from schema labels can improve the unsupervised and supervised web table retrieval task as well.more » « less
-
null (Ed.)We address the problem of ad hoc table retrieval via a new neural architecture that incorporates both semantic and relevance matching. Understanding the connection between the structured form of a table and query tokens is an important yet neglected problem in information retrieval. We use a learning- to-rank approach to train a system to capture semantic and relevance signals within interactions between the structured form of candidate tables and query tokens. Convolutional filters that extract contextual features from query/table interactions are combined with a feature vector based on the distributions of term similarity between queries and tables. We propose using row and column summaries to incorporate table content into our new neural model. We evaluate our approach using two datasets, and we demonstrate substantial improvements in terms of retrieval metrics over state-of-the-art methods in table retrieval and document retrieval, and neural architectures from sentence, document, and table type classification adapted to the table retrieval task. Our ablation study supports the importance of both semantic and relevance matching in the table retrieval.more » « less
-
Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle- in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. Instead, table search tasks often arise in response to the need for retrieving new datasets or augment- ing existing ones, e.g., for data augmentation within data science or machine learning pipelines. Existing table repositories and bench- marks are limited in their ability to test retrieval methods for table search tasks. Thus, to close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. This novel dataset con- sists of two snapshots of the large-scale Wikipedia tables collection from 2013 and 2019 with two important additions: (1) a page and topic aware ground truth relevance judgment and (2) a large-scale DBpedia entity linking annotation. Moreover, we generate a novel set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource consists of 9,296 novel queries, 610,553 query- table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot. Similarly, on the 2019 snapshot, the resource consists of 2,560 queries, 958,214 relevance annotations, and 457,714 total tables. This makes our resource the largest annotated table- search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). We perform a user study among domain experts and prove that these annotators agree with the automatically generated relevance annotations. As a re- sult, we can re-evaluate some basic assumptions behind existing table search approaches identifying their shortcomings along with promising novel research directions.more » « less
-
Relevance feedback techniques assume that users provide relevance judgments for the top k (usually 10) documents and then re-rank using a new query model based on those judgments. Even though this is effective, there has been little research recently on this topic because requiring users to provide substantial feedback on a result list is impractical in a typical web search scenario. In new environments such as voice-based search with smart home devices, however, feedback about result quality can potentially be obtained during users' interactions with the system. Since there are severe limitations on the length and number of results that can be presented in a single interaction in this environment, the focus should move from browsing result lists to iterative retrieval and from retrieving documents to retrieving answers. In this paper, we study iterative relevance feedback techniques with a focus on retrieving answer passages. We first show that iterative feedback can be at least as effective as the top-k approach on standard TREC collections, and more effective on answer passage collections. We then propose an iterative feedback model for answer passages based on semantic similarity at passage level and show that it can produce significant improvements compared to both word-based iterative feedback models and those based on term-level semantic similarity.more » « less
An official website of the United States government

