skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Fuzzy Integration of Data Lake Tables. In EDBT 2026.
Data integration is an important step in any data science pipeline where the objective is to unify the information available in differ- ent datasets for comprehensive analysis. Full Disjunction, which is an associative extension of the outer join operator, has been shown to be an effective operator for integrating datasets. It fully preserves and combines the available information. Existing Full Disjunction algorithms only consider the equi-join scenario where only tuples having the same value on joining columns are integrated. This, however, does not realistically represent many realistic scenarios where datasets come from diverse sources with inconsistent values (e.g., synonyms, abbreviations, etc.) and with limited metadata. So, joining just on equal values severely limits the ability of Full Disjunction to fully combine datasets. Thus, in this work, we propose an extension of Full Disjunction to also account for “fuzzy” matches among tuples. We present a novel data-driven approach to enable the joining of approximate or fuzzy matches within Full Disjunction. Experimentally, we show that fuzzy Full Disjunction does not add significant time over- head over a state-of-the-art Full Disjunction implementation and also that it enhances the accuracy of a downstream data quality task.  more » « less
Award ID(s):
2325632 2107248 2348121
PAR ID:
10614595
Author(s) / Creator(s):
; ;
Editor(s):
EDBT
Publisher / Repository:
OpenProceedings.org
Date Published:
Subject(s) / Keyword(s):
Data Management Database Technology
Format(s):
Medium: X
Institution:
EDBT International Conference on Extending Data Base Techology
Sponsoring Org:
National Science Foundation
More Like this
  1. We have made tremendous strides in providing tools for data scientists to discover new tables useful for their analyses. But despite these advances, the proper integration of discovered tables has been under-explored. An interesting semantics for integration, called Full Disjunction, was proposed in the 1980's, but there has been little progress in using it for data science to integrate tables culled from data lakes. We provide ALITE, the first proposal for scalable integration of tables that may have been discovered using join, union or related table search. We empirically show that ALITE can outperform previous algorithms for computing the Full Disjunction. ALITE relaxes previous assumptions that tables share common attribute names (which completely determine the join columns), are complete (without null values), and have acyclic join patterns. To evaluate ALITE, we develop and share three new benchmarks for integration that use real data lake tables. 
    more » « less
  2. We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(Nw(Q)), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the amortized time per single-tuple insert is constant for acyclic full conjunctive queries. In contrast, we show that a sequence of N inserts and deletes can be executed in total time Õ(Nw(Q')), where Q' is obtained from Q by extending every relational atom with extra variables that represent the lifespans of tuples in the database. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time for the output of Q. Our approach achieves amortized optimal update times for the hierarchical and Loomis-Whitney join queries. 
    more » « less
  3. Federated learning enables users to collaboratively train a machine learning model over their private datasets. Secure aggregation protocols are employed to mitigate information leakage about the local datasets from user-submitted model updates. This setup, however, still leaks the user participation in training, which can also be sensitive. Protecting user anonymity is even more challenging in dynamic environments where users may (re)join or leave the training process at any point of time. This paper introduces AnoFel, the first framework to support private and anonymous dynamic participation in federated learning (FL). AnoFel leverages several cryptographic primitives, the concept of anonymity sets, differential privacy, and a public bulletin board to support anonymous user registration, as well as unlinkable and confidential model update submission. Our system allows dynamic participation, where users can join or leave at any time without needing any recovery protocol or interaction. To assess security, we formalize a notion for privacy and anonymity in FL, and formally prove that AnoFel satisfies this notion. To the best of our knowledge, our system is the first solution with provable anonymity guarantees. To assess efficiency, we provide a concrete implementation of AnoFel, and conduct experiments showing its ability to support learning applications scaling to a large number of clients. For a TinyImageNet classification task with 512 clients, the client setup to join is less than 3 sec, and the client runtime for each training iteration takes a total of 8 sec, where the added overhead of AnoFel is 46% of the total runtime. We also compare our system with prior work and demonstrate its practicality. AnoFel client runtime is up to 5x faster than Truex et al., despite the added anonymity guarantee and dynamic user joining in AnoFel. Compared to Bonawitz et al., AnoFel is only 2x slower for added support for privacy in output, dynamic user joining, and anonymity. 
    more » « less
  4. In this work, we demonstrate CaJaDE (Context-Aware Join-Augmented Deep Explanations), a system that explains query results by augmenting provenance with contextual information from other related tables in the database. Given two query results whose difference the user wants to understand, we enumerate possible ways of joining the provenance (i.e., contributing input tuples) of these two query results with tuples from other relevant tables in the database that were not used in the query. We use patterns to concisely explain the difference between the augmented provenance of the two query results. CaJaDE, through a comprehensive UI, enables the user to formulate questions and explore explanations interactively. 
    more » « less
  5. Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need for pre-computing the join query results. For the relational k-median clustering, we propose the first efficient relative approximation algorithm. For the relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer either from large constant approximation factors, or expensive running time. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k2Nfhw)+T_γ(k2) time, where ε ∈ (0,1) is a constant parameter decided by the user, \fhw is the fractional hyper-tree width of Q, while γ and T_γ(x) represent the approximation factor and the running time, respectively, of a traditional clustering algorithm in the standard computational setting over x points. 
    more » « less