Fuzzy Integration of Data Lake Tables. In EDBT 2026.

Khatiwada, Aamod; Shraga, Roee; Miller, Renée J

doi:10.48786/edbt.2026.08

Citation Details


Data integration is an important step in any data science pipeline where the objective is to unify the information available in different datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms consider only the equi-join scenario, in which only tuples having the same value on joining columns are integrated. This, however, does not reflect many realistic scenarios where datasets come from diverse sources with inconsistent values (e.g., synonyms and abbreviations) and with limited metadata. Joining only on equal values therefore severely limits the ability of Full Disjunction to fully combine datasets. In this work, we propose an extension of Full Disjunction that also accounts for “fuzzy” matches among tuples. We present a novel data-driven approach to enable the joining of approximate or fuzzy matches within Full Disjunction. Experimentally, we show that fuzzy Full Disjunction does not add significant time overhead over a state-of-the-art Full Disjunction implementation and that it enhances the accuracy of a downstream data quality task.
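
To illustrate the gap between equi-join and fuzzy matching described in the abstract, the following is a minimal sketch, not the authors' algorithm. It covers only the two-table case, where Full Disjunction reduces to a full outer join, and it swaps in a plain character-similarity test from Python's standard library in place of the paper's data-driven matching. The function names, the 0.8 threshold, and the toy rows are all hypothetical.

```python
from difflib import SequenceMatcher


def similar(a, b, threshold):
    """Hypothetical matcher: case-insensitive character similarity (assumption)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def fuzzy_outer_join(left, right, key, threshold=0.8):
    """Full outer join of two lists of dicts on `key`.

    Rows whose key values are similar enough are merged; rows without a
    partner on either side are kept as-is, so no input row is lost.
    """
    result = []
    matched_right = set()
    for l_row in left:
        hit = False
        for j, r_row in enumerate(right):
            if similar(l_row[key], r_row[key], threshold):
                # Merge the two rows; the left spelling of the key wins.
                result.append({**r_row, **l_row})
                matched_right.add(j)
                hit = True
        if not hit:
            result.append(dict(l_row))
    # Keep right-hand rows that never matched anything.
    for j, r_row in enumerate(right):
        if j not in matched_right:
            result.append(dict(r_row))
    return result


# Toy data (made up for illustration): "Tornto" is a misspelling of "Toronto".
places = [{"city": "Toronto", "country": "Canada"},
          {"city": "Ottawa", "country": "Canada"}]
visits = [{"city": "Tornto", "visitors": 120},
          {"city": "Boston", "visitors": 75}]

fuzzy_rows = fuzzy_outer_join(places, visits, "city", threshold=0.8)
exact_rows = fuzzy_outer_join(places, visits, "city", threshold=1.0)
```

With the threshold at 1.0 the join behaves like an equi-join modulo letter case, so the misspelled row stays separate and `exact_rows` has four entries. At 0.8 the two Toronto rows merge and `fuzzy_rows` has three entries, which is the kind of additional integration the abstract attributes to fuzzy matching.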

Award ID(s): 2325632; 2107248; 2348121

PAR ID: 10614595

Author(s) / Creator(s): Khatiwada, Aamod; Shraga, Roee; Miller, Renée J

Editor(s): EDBT

Publisher / Repository: OpenProceedings.org

Date Published: 2026-01-01

Subject(s) / Keyword(s): Data Management Database Technology

Format(s): Medium: X

Institution: EDBT International Conference on Extending Database Technology

Sponsoring Org: National Science Foundation

Dataset:
https://doi.org/10.48786/edbt.2026.08
