skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: The Fast and the Private: Task-based Dataset Search
Recent platforms utilize ML task performance metrics, not metadata keywords, to search large data corpus. Requesters provide an initial dataset, and the platform searches for additional datasets that augment---join or union---requester's dataset to most improve the model (e.g., linear regression) performance. Although effective, current task-based data searches are stymied by (1) high latency which deters users, (2) privacy concerns for regulatory standards, and (3) low data quality which provides low utility. We introduce Mileena, a fast, private, and high-quality task-based dataset search platform. At its heart, Mileena is built on pre-computed semi-ring sketches for efficient ML training and evaluation. Based on semi-ring, we develop a novel Factorized Privacy Mechanism that makes the search differentially private and scales to arbitrary corpus sizes and numbers of requests without major quality degradation. We also demonstrate the early promise in using LLM-based agents for automatic data transformation and applying semi-rings to support causal discovery and treatment effect estimation.  more » « less
Award ID(s):
2312991 2008295
PAR ID:
10515099
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Conference on Innovative Data Systems Research
Date Published:
Format(s):
Medium: X
Location:
Santa Cruz, California
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent data search platforms use ML task-based utility measures rather than metadata-based keywords, to search large dataset corpora. Requesters submit a training dataset, and these platforms search foraugmentations---join or union-compatible datasets---that, when used to augment the requester's dataset, most improve model (e.g., linear regression) performance. Although effective, providers that manage personally identifiable data demand differential privacy (DP) guarantees before granting these platforms data access. Unfortunately, making data search differentially private is nontrivial, as a single search can involve training and evaluating datasets hundreds or thousands of times, quickly depleting privacy budgets. We presentSaibot, a differentially private data search platform that employs Factorized Privacy Mechanism (FPM), a novel DP mechanism, to calculate sufficient semi-ring statistics for ML over different combinations of datasets. These statistics are privatized once, and can be freely reused for the search. This allows Saibot to scale to arbitrary numbers of datasets and requests, while minimizing the amount that DP noise affects search results. We optimize the sensitivity of FPM for common augmentation operations, and analyze its properties with respect to linear regression. Specifically, we develop an unbiased estimator for many-to-many joins, prove its bounds, and develop an optimization to redistribute DP noise to minimize the impact on the model. Our evaluation on a real-world dataset corpus of 329 datasets demonstrates thatSaibotcan return augmentations that achieve model accuracy within 50--90% of non-private search, while the leading alternative DP mechanisms (TPM, APM, shuffling) are several orders of magnitude worse. 
    more » « less
  2. Successful supervised learning models rely on predictive features, which rarely come from a single dataset. As a result, relevant datasets need to be integrated before training the actual model. This raises one natural question: \textit{``how can one efficiently search for predictive features from relevant datasets for integration with responsible AI guarantees?"}. This paper formalizes the question as the \textit{data augmentation search problem} with an objective of minimizing the search latency. We propose \sys, an interactive system that intakes a supervised learning task and searches for a set of join-compatible datasets that optimally improve the performance of the task. Specifically, \sys manages a corpus of relational datasets, uses linear regression as a \textit{proxy model} to evaluate augmentation candidates, and applies \textit{factorized machine learning} to accelerate model training and evaluation algorithmically. Furthermore, \sys leverages system and hardware optimizations to maximize parallelism across augmentation searches. These allow \sys to search for a good augmentation plan over 1 million datasets with a latency of $1.4$ seconds. 
    more » « less
  3. A large amount of high-dimensional and heterogeneous data appear in practical applications, which are often published to third parties for data analysis, recommendations, targeted advertising, and reliable predictions. However, publishing these data may disclose personal sensitive information, resulting in an increasing concern on privacy violations. Privacy-preserving data publishing has received considerable attention in recent years. Unfortunately, the differentially private publication of high dimensional data remains a challenging problem. In this paper, we propose a differentially private high-dimensional data publication mechanism (DP2-Pub) that runs in two phases: a Markov-blanket-based attribute clustering phase and an invariant post randomization (PRAM) phase. Specifically, splitting attributes into several low-dimensional clusters with high intra-cluster cohesion and low inter-cluster coupling helps obtain a reasonable allocation of privacy budget, while a double-perturbation mechanism satisfying local differential privacy facilitates an invariant PRAM to ensure no loss of statistical information and thus significantly preserves data utility. We also extend our DP2-Pub mechanism to the scenario with a semi-honest server which satisfies local differential privacy. We conduct extensive experiments on four real-world datasets and the experimental results demonstrate that our mechanism can significantly improve the data utility of the published data while satisfying differential privacy. 
    more » « less
  4. Few studies have compared Large Language Models (LLMs) to traditional Machine Learning (ML)-based automated scoring methods in terms of accuracy, ethics, and economics. Using a corpus of 1000 expert-scored and interview-validated scientific explanations derived from the ACORNS instrument, this study employed three LLMs and the ML-based scoring engine, EvoGrader. We measured scoring reliability (percentage agreement, kappa, precision, recall, F1), processing time, and explored contextual factors like ethics and cost. Results showed that with very basic prompt engineering, ChatGPT-4o achieved the highest performance across LLMs. Proprietary LLMs outperformed open-weight LLMs for most concepts. GPT-4o achieved robust but less accurate scoring than EvoGrader (~500 additional scoring errors). Ethical concerns over data ownership, reliability, and replicability over time were LLM limitations. EvoGrader offered superior accuracy, reliability, and replicability, but required, in its development a large, high-quality, human-scored corpus, domain expertise, and restricted assessment items. These findings highlight the diversity of considerations that should be used when considering LLM and ML scoring in science education. Despite impressive LLM advances, ML approaches may remain valuable in some contexts, particularly those prioritizing precision, reliability, replicability, privacy, and controlled implementation. 
    more » « less
  5. Analog integrated circuit (IC) placement is a heavily manual and time-consuming task that has a significant impact on chip quality. Several recent studies apply machine learning (ML) techniques to directly predict the impact of placement on circuit performance or even guide the placement process. However, the significant diversity in analog design topologies can lead to different impacts on performance metrics (e.g., common-mode rejection ratio (CMRR) or offset voltage). Thus, it is unlikely that the same ML model structure will achieve the best performance for all designs and metrics. In addition, customizing ML models for different designs require more tremendous engineering efforts and longer development cycles. In this work, we leverage Neural Architecture Search (NAS) to automatically develop customized neural architectures for different analog circuit designs and metrics. Our proposed NAS methodology supports an unconstrained DAG-based search space containing a wide range of ML operations and topological connections. Our search strategy can efficiently explore this flexible search space and provide every design with the best-customized model to boost the model performance. We make unprejudiced comparisons with the claimed performance of the previous representative work on exactly the same dataset. After fully automated development within only 0.5 days, generated models give 3.61% superior accuracy than the prior art. 
    more » « less