Title: Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning
Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We utilize the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices to compute the unionability score between two tables accordingly. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
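As a rough illustration of the scoring idea in the abstract, the sketch below uses cosine similarity between column embedding vectors as the column unionability score and aggregates it into a table-level score. The aggregation shown here (greedy one-to-one column matching, averaged over the query columns) is only one possible design choice and is an assumption of this sketch, not necessarily the method used in Starmie. The function names `cosine` and `table_unionability`, the `threshold` parameter, and the toy vectors are all hypothetical; in practice the embeddings would come from a trained column encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def table_unionability(query_cols, cand_cols, threshold=0.0):
    """Table-level score from column embeddings.

    Assumption of this sketch: columns are matched greedily, one-to-one,
    in descending order of cosine similarity, and the matched scores are
    averaged over the number of query columns.
    """
    if not query_cols:
        return 0.0
    # Score every (query column, candidate column) pair.
    pairs = sorted(
        (
            (cosine(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(cand_cols)
        ),
        reverse=True,
    )
    used_q, used_c, total = set(), set(), 0.0
    for score, i, j in pairs:
        if score < threshold:
            break  # remaining pairs are even less similar
        if i not in used_q and j not in used_c:
            used_q.add(i)
            used_c.add(j)
            total += score
    return total / len(query_cols)


# Toy embeddings (hypothetical values, not produced by any real encoder).
query_table = [[1.0, 0.0], [0.0, 1.0]]
candidate_table = [[0.0, 2.0], [3.0, 0.0]]
score = table_unionability(query_table, candidate_table)
```

In this toy case each query column has a perfectly aligned candidate column, so the matching pairs them up regardless of column order. An optimal bipartite matching could replace the greedy step if exact assignment quality matters more than simplicity.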
Award ID(s):
2107248
PAR ID:
10432224
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
16
Issue:
7
ISSN:
2150-8097
Page Range / eLocation ID:
1726 to 1739
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a “synthesized KB”) uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
  2. Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search. 
  3. We consider the table union search problem, which has emerged as an important data discovery problem in data lakes. Semantic problems like table union search cannot be benchmarked using only synthetic data. Our current methods for creating benchmarks for this problem involve the manual curation and human labeling of real data. These methods are not robust or scalable and, perhaps more importantly, it is not clear how comprehensive the created benchmarks are. We propose to use generative AI models to create structured data benchmarks for table union search. We present a novel method for using generative models to create tables with specified properties. Using this method, we create a new benchmark containing pairs of tables that are both unionable and non-unionable, but related. We use this benchmark to provide new insights into the strengths and weaknesses of existing methods. We evaluate state-of-the-art table union search methods over both existing benchmarks and our new benchmarks. We also present and evaluate a new table search method based on large language models over all benchmarks. We show that the new benchmarks are more challenging for all methods than hand-curated benchmarks. We examine why this is the case and show that our new methodology for creating benchmarks permits more detailed analysis and comparison of methods. We discuss how our generation method (and benchmarks created using it) sheds more light on the successes and failures of table union search methods, sparking new insights that can help advance the field. We also discuss how our benchmark generation methodology can be applied to other semantic problems including entity matching and related table search.
  4. EDBT (Ed.)
    Unionable table search techniques input a query table from a user and search for data lake tables that can contribute additional rows to the query table. The definition of unionability is generally based on similarity measures which may include similarity between columns (e.g., value overlap or semantic similarity of the values in the columns) or tables (e.g., similarity of table embeddings). Due to this and the large redundancy in many data lakes (which can contain many copies and versions of the same table), the most unionable tables may be identical or nearly identical to the query table and may contain little new information. Hence, we introduce the problem of identifying unionable tuples from a data lake that are diverse with respect to the tuples already present in a query table. We perform an extensive experimental analysis of well-known diversity algorithms applied to this novel problem and identify a gap that we address with a novel, clustering-based tuple diversity algorithm called DUST. DUST uses a novel embedding model to represent unionable tuples that outperforms other tuple representation models by at least 15% when representing unionable tuples. Using real data lake benchmarks, we show that our diversification algorithm is more than six times faster than the most efficient diversification baseline. We also show that it is more effective in diversifying unionable tuples than existing diversification algorithms.
  5. EDBT (Ed.)
    In data lakes, one of the core challenges remains finding relevant tables. We introduce the notion of semantic data lakes, i.e., repositories where datasets are linked to concepts and entities described in a knowledge graph (KG). We formalize the problem of semantic table search, i.e., retrieving tables containing information semantically related to a given set of entities, and provide the first formal definition of semantic relatedness of a dataset to tuples of entities. Our solution offers the first general framework to compute the semantic relevance of the contents of a table w.r.t. entity tuples, as well as efficient algorithms (exploiting semantic signals, such as entity types and embeddings) to scale the semantic search to repositories with hundreds of thousands of distinct tables. Our extensive experiments on both real-world and synthetic benchmarks show that our approach is able to retrieve more relevant tables (up to 5.4 times higher recall) in comparison to existing methods while ensuring fast response times (up to 17 times faster with LSH).