Title: ARTS: A System for Aggregate Related Table Search
Existing table search techniques define table relatedness with unionability and/or joinability. While these are valuable, they do not suffice for most data analysis tasks that involve numerical data, which is often aggregated over geographical, temporal, or other groups. In this demonstration, we showcase ARTS, a novel table search system centered on the unique concept of aggregate relatedness. By leveraging pre-trained language models, ARTS offers a superior column semantics understanding capability, with good labels created for both textual and numerical columns. This demonstration will offer attendees hands-on interaction with our system, revealing its potential in effectively addressing real-world data analysis challenges.
Award ID(s):
2312931 2106176 1934565
PAR ID:
10612919
Publisher / Repository:
IEEE
ISBN:
979-8-3503-1715-2
Page Range / eLocation ID:
5461 to 5464
Format(s):
Medium: X
Location:
Utrecht, Netherlands
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider the table union search problem which has emerged as an important data discovery problem in data lakes. Semantic problems like table union search cannot be benchmarked using only synthetic data. Our current methods for creating benchmarks for this problem involve the manual curation and human labeling of real data. These methods are not robust or scalable and perhaps more importantly, it is not clear how comprehensive the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable, but related. We use this benchmark to provide new insights into the strengths and weaknesses of existing methods. We evaluate state-of-the-art table union search methods over both existing benchmarks and our new benchmarks. We also present and evaluate a new table search method based on large language models over all benchmarks. We show that the new benchmarks are more challenging for all methods than hand-curated benchmarks. We examine why this is the case and show that our new methodology for creating benchmarks permits more detailed analysis and comparison of methods. We discuss how our generation method (and benchmarks created using it) sheds more light into the successes and failures of table union search methods sparking new insights that can help advance the field. We also discuss how our benchmark generation methodology can be applied to other semantic problems including entity matching and related table search.
  2. EDBT (Ed.)
    In data lakes, one of the core challenges remains finding relevant tables. We introduce the notion of semantic data lakes, i.e., repositories where datasets are linked to concepts and entities described in a knowledge graph (KG). We formalize the problem of semantic table search, i.e., retrieving tables containing information semantically related to a given set of entities, and provide the first formal definition of semantic relatedness of a dataset to tuples of entities. Our solution offers the first general framework to compute the semantic relevance of the contents of a table w.r.t. entity tuples, as well as efficient algorithms (exploiting semantic signals, such as entity types and embeddings) to scale the semantic search to repositories with hundreds of thousands of distinct tables. Our extensive experiments on both real-world and synthetic benchmarks show that our approach is able to retrieve more relevant tables (up to 5.4 times higher recall) in comparison to existing methods while ensuring fast response times (up to 17 times faster with LSH).
  3. EDBT (Ed.)
    Unionable table search techniques input a query table from a user and search for data lake tables that can contribute additional rows to the query table. The definition of unionability is generally based on similarity measures which may include similarity between columns (e.g., value overlap or semantic similarity of the values in the columns) or tables (e.g., similarity of table embeddings). Due to this and the large redundancy in many data lakes (which can contain many copies and versions of the same table), the most unionable tables may be identical or nearly identical to the query table and may contain little new information. Hence, we introduce the problem of identifying unionable tuples from a data lake that are diverse with respect to the tuples already present in a query table. We perform an extensive experimental analysis of well-known diversity algorithms applied to this novel problem and identify a gap that we address with a novel, clustering-based tuple diversity algorithm called DUST. DUST uses a novel embedding model to represent unionable tuples that outperforms other tuple representation models by at least 15% when representing unionable tuples. Using real data lake benchmarks, we show that our diversification algorithm is more than six times faster than the most efficient diversification baseline. We also show that it is more effective in diversifying unionable tuples than existing diversification algorithms.
  4. Table search aims to answer a query with a ranked list of tables. Unfortunately, current test corpora have focused mostly on needle-in-the-haystack tasks, where only a few tables are expected to exactly match the query intent. Instead, table search tasks often arise in response to the need for retrieving new datasets or augmenting existing ones, e.g., for data augmentation within data science or machine learning pipelines. Existing table repositories and benchmarks are limited in their ability to test retrieval methods for table search tasks. Thus, to close this gap, we introduce a novel dataset for query-by-example Semantic Table Search. This novel dataset consists of two snapshots of the large-scale Wikipedia tables collection from 2013 and 2019 with two important additions: (1) a page and topic aware ground truth relevance judgment and (2) a large-scale DBpedia entity linking annotation. Moreover, we generate a novel set of entity-centric queries that allows testing existing methods under a novel search scenario: semantic exploratory search. The resulting resource consists of 9,296 novel queries, 610,553 query-table relevance annotations, and 238,038 entity-linked tables from the 2013 snapshot. Similarly, on the 2019 snapshot, the resource consists of 2,560 queries, 958,214 relevance annotations, and 457,714 total tables. This makes our resource the largest annotated table-search corpus to date (97 times more queries and 956 times more annotated tables than any existing benchmark). We perform a user study among domain experts and prove that these annotators agree with the automatically generated relevance annotations. As a result, we can re-evaluate some basic assumptions behind existing table search approaches identifying their shortcomings along with promising novel research directions.
  5. Websites are malleable: users can run code in the browser to customize them. However, this malleability is typically only accessible to programmers with knowledge of HTML and JavaScript. Previously, we developed a tool called Wildcard which empowers end-users to customize websites through a spreadsheet-like table interface without doing traditional programming. However, there is a limit to end-user agency with Wildcard, because programmers need to first create site-specific adapters mapping website data to the table interface. This means that end-users can only customize a website if a programmer has written an adapter for it, and cannot extend or repair existing adapters. In this paper, we extend Wildcard with a new system for end-user web scraping for customization. It enables end-users to create, extend and repair adapters, by performing concrete demonstrations of how the website user interface maps to a data table. We describe three design principles that guided our system’s development and are applicable to other end-user web scraping and customization systems: (a) users should be able to scrape data and use it in a single, unified environment, (b) users should be able to extend and repair the programs that scrape data via demonstration and (c) users should receive live feedback during their demonstrations. We have successfully used our system to create, extend and repair adapters by demonstration on a variety of websites and we provide example usage scenarios that showcase each of our design principles. Our ultimate goal is to empower end-users to customize websites in the course of their daily use in an intuitive and flexible way, thus making the web more malleable for all of its users.