Integrating Data Lake Tables

Khatiwada, Aamod; Shraga, Roee; Gatterbauer, Wolfgang; Miller, Renée J.

doi:10.14778/3574245.3574274

We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980s, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables.
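As a rough illustration of the Full Disjunction semantics named in the abstract, the brute-force Python sketch below combines every connected, join-consistent set of tuples (at most one per table), pads missing attributes with None, and drops tuples that are subsumed by a more complete one. It is not the ALITE algorithm. It assumes the classical setting that ALITE relaxes: tables share attribute names and contain no nulls. The function names and sample tables are hypothetical, and the enumeration is exponential in the number of tables, so it is only suitable for tiny inputs.

```python
from itertools import combinations, product


def _consistent(rows):
    # Join-consistent: every pair of rows agrees on all shared attributes.
    return all(
        r[a] == s[a]
        for r, s in combinations(rows, 2)
        for a in r.keys() & s.keys()
    )


def _connected(rows):
    # Connected: rows form one component when linked by shared attribute names.
    seen, stack = {0}, [0]
    while stack:
        i = stack.pop()
        for j in range(len(rows)):
            if j not in seen and rows[i].keys() & rows[j].keys():
                seen.add(j)
                stack.append(j)
    return len(seen) == len(rows)


def _subsumes(u, t):
    # u subsumes t if u agrees with t wherever t is non-null.
    return all(tv is None or tv == uv for uv, tv in zip(u, t))


def full_disjunction(tables):
    """tables: list of tables, each a list of dicts (attribute -> non-null value)."""
    attrs = sorted({a for table in tables for row in table for a in row})
    candidates = set()
    for k in range(1, len(tables) + 1):
        for idxs in combinations(range(len(tables)), k):
            for rows in product(*(tables[i] for i in idxs)):
                if _consistent(rows) and _connected(rows):
                    merged = {}
                    for r in rows:
                        merged.update(r)
                    candidates.add(tuple(merged.get(a) for a in attrs))
    # Keep only tuples that no other candidate strictly subsumes.
    result = [
        t for t in candidates
        if not any(u != t and _subsumes(u, t) for u in candidates)
    ]
    return attrs, sorted(result, key=repr)


# Hypothetical sample tables, for illustration only.
cities = [{"city": "Boston", "country": "USA"},
          {"city": "Toronto", "country": "Canada"}]
rates = [{"city": "Boston", "vaccinated": 60},
         {"city": "Berlin", "vaccinated": 70}]
cases = [{"country": "USA", "cases": 100}]

attrs, rows = full_disjunction([cities, rates, cases])
print(attrs)
for row in rows:
    print(row)
```

In this example the Boston facts from all three tables merge into one complete tuple, while the Toronto and Berlin rows survive as partial tuples padded with None, which is the information-preserving behavior that distinguishes Full Disjunction from a chain of inner joins.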

Award ID(s): 1956096, 1762268, 2107248

PAR ID: 10432219

Author(s) / Creator(s): Khatiwada, Aamod; Shraga, Roee; Gatterbauer, Wolfgang; Miller, Renée J.

Date Published: 2022-12-01

Journal Name: Proceedings of the VLDB Endowment

Volume: 16

Issue: 4

ISSN: 2150-8097

Page Range / eLocation ID: 932 to 945

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Journal Article:
https://doi.org/10.14778/3574245.3574274
