Semantics-Aware Dataset Discovery from Data Lakes with Contextualized Column-Based Representation Learning

Fan, Grace; Wang, Jin; Li, Yuliang; Zhang, Dan; Miller, Renée J.

doi:10.14778/3587136.3587146

Dataset discovery from data lakes is essential in many real application scenarios. In this paper, we propose Starmie, an end-to-end framework for dataset discovery from data lakes (with table union search as the main use case). Our proposed framework features a contrastive learning method to train column encoders from pre-trained language models in a fully unsupervised manner. The column encoder of Starmie captures the rich contextual semantic information within tables by leveraging a contrastive multi-column pre-training strategy. We use the cosine similarity between column embedding vectors as the column unionability score and propose a filter-and-verification framework that allows exploring a variety of design choices for computing the unionability score between two tables. Empirical results on real table benchmarks show that Starmie outperforms the best-known solutions in the effectiveness of table union search by 6.8 in MAP and recall. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search, which provides a 3,000X performance gain over the linear scan baseline and a 400X performance gain over an LSH index (the state-of-the-art solution for data lake indexing).
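
The abstract names cosine similarity between column embeddings as the column unionability score but leaves the table-level aggregation open. The following is a minimal sketch of that idea, not Starmie's implementation. The function names are hypothetical, the embeddings are assumed to be plain lists of floats already produced by some column encoder, and the greedy one-to-one pairing is only one assumed choice among the design options the abstract alludes to.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Column unionability score: cosine similarity of two embedding vectors."""
    if len(u) != len(v):
        raise ValueError("embedding vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # an all-zero vector has no direction; treat as no similarity
    return dot / (norm_u * norm_v)


def greedy_table_score(
    query_cols: Sequence[Sequence[float]],
    candidate_cols: Sequence[Sequence[float]],
) -> float:
    """Hypothetical table-level score (an assumption, not the paper's method).

    Pairs columns greedily by descending similarity, uses each column at most
    once, and sums the scores of the chosen pairs.
    """
    pairs = sorted(
        (
            (cosine_similarity(q, c), i, j)
            for i, q in enumerate(query_cols)
            for j, c in enumerate(candidate_cols)
        ),
        reverse=True,
    )
    used_query, used_candidate = set(), set()
    total = 0.0
    for score, i, j in pairs:
        if score <= 0.0:
            break  # remaining pairs cannot increase the total
        if i in used_query or j in used_candidate:
            continue
        used_query.add(i)
        used_candidate.add(j)
        total += score
    return total
```

Greedy pairing is cheap but can miss the best overall assignment, so it is better suited to a quick filtering pass than to an exact final score.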

Award ID(s): 2107248

PAR ID: 10432224

Author(s) / Creator(s): Fan, Grace; Wang, Jin; Li, Yuliang; Zhang, Dan; Miller, Renée J.

Date Published: 2023-03-01

Journal Name: Proceedings of the VLDB Endowment

Volume: 16

Issue: 7

ISSN: 2150-8097

Page Range / eLocation ID: 1726 to 1739

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article: https://doi.org/10.14778/3587136.3587146
