The Fast and the Private: Task-based Dataset Search

Huang, Zezou; Liu, Jiaxiang; Wang, Haonan; Wu, Eugene

Citation Details

Recent platforms search large data corpora using ML task performance metrics rather than metadata keywords. A requester provides an initial dataset, and the platform searches for additional datasets that augment it (by join or union) so as to most improve the performance of a model (e.g., linear regression). Although effective, current task-based data searches are stymied by (1) high latency, which deters users, (2) privacy concerns arising from regulatory standards, and (3) low data quality, which limits utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Building on semi-rings, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate the early promise of using LLM-based agents for automatic data transformation and of applying semi-rings to support causal discovery and treatment effect estimation.
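To illustrate why pre-computed sketches can make task-based search fast, the following is a minimal sketch for the union case only, assuming a single-feature linear regression. It is not Mileena's implementation: the class name `Sketch` and its fields are hypothetical, and the join case (which needs the semi-ring's multiplication operator) and the privacy mechanism are not shown. The idea is that each dataset is summarized once by a few sums, and the sketch of a union is the element-wise sum of the sketches, so a candidate augmentation can be evaluated without rescanning raw rows.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sketch:
    """Hypothetical sufficient statistics for fitting y = slope * x + intercept."""
    n: float    # number of rows
    sx: float   # sum of x
    sy: float   # sum of y
    sxx: float  # sum of x * x
    sxy: float  # sum of x * y

    @staticmethod
    def of(rows):
        """Build the sketch with one pass over (x, y) rows."""
        rows = list(rows)
        return Sketch(
            n=len(rows),
            sx=sum(x for x, _ in rows),
            sy=sum(y for _, y in rows),
            sxx=sum(x * x for x, _ in rows),
            sxy=sum(x * y for x, y in rows),
        )

    def __add__(self, other):
        """Union of two datasets: add the statistics element-wise."""
        return Sketch(
            self.n + other.n,
            self.sx + other.sx,
            self.sy + other.sy,
            self.sxx + other.sxx,
            self.sxy + other.sxy,
        )

    def fit(self):
        """Ordinary least squares from the sketch alone (no raw rows needed)."""
        denom = self.n * self.sxx - self.sx ** 2
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept


# Illustrative data only: a requester's rows and one candidate dataset.
requester = [(0.0, 1.0), (1.0, 3.0)]
candidate = [(2.0, 5.0), (3.0, 7.0)]

# Sketches are computed once per dataset and combined at search time.
merged = Sketch.of(requester) + Sketch.of(candidate)
slope, intercept = merged.fit()
```

Combining the two sketches yields the same statistics as sketching the concatenated rows, which is why the platform side of such a design could score many candidate unions from stored sketches alone.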

Award ID(s): 2312991, 2008295

PAR ID: 10515099

Author(s) / Creator(s): Huang, Zezou; Liu, Jiaxiang; Wang, Haonan; Wu, Eugene

Publisher / Repository: Conference on Innovative Data Systems Research

Date Published: 2024-01-01

Location: Santa Cruz, California

Sponsoring Org: National Science Foundation