skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: ZIP: Lazy Imputation during Query Processing
This paper develops a query-time missing value imputation framework, entitled ZIP, that modifies relational operators to be imputation aware in order to minimize the joint cost of imputing and query processing. The modified operators use a cost-based decision function to determine whether to invoke imputation or to defer to downstream operators to resolve missing values. The modified query processing logic ensures results with deferred imputations are identical to those produced if all missing values were imputed first. ZIP includes a novel outer-join based approach to preserve missing values during execution, and a bloom filter based index to optimize the space and running overhead. Extensive experiments on both real and synthetic data sets demonstrate 10 to 25 times improvement when augmenting the state-of-the-art technology, ImputeDB, with ZIP-based deferred imputation. ZIP also outperforms the offline approach by up to 19607 times in a real data set.  more » « less
Award ID(s):
2008993
PAR ID:
10562504
Author(s) / Creator(s):
;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
17
Issue:
1
ISSN:
2150-8097
Page Range / eLocation ID:
28 to 40
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Missing value (MV) imputation is a critical preprocessing means for data mining. Nevertheless, existing MV imputation methods are mostly designed for batch processing, and thus are not applicable to streaming data, especially those with poor quality. In this article, we propose a framework, called Real-time and Error-tolerant Missing vAlue ImputatioN (REMAIN), to impute MVs in poor-quality streaming data. Instead of imputing MVs based on all the observed data, REMAIN first initializes the MV imputation model based on a-RANSAC which is capable of detecting and rejecting anomalies in an efficient manner, and then incrementally updates the model parameters upon the arrival of new data to support real-time MV imputation. As the correlations among attributes of the data may change over time in unforseenable ways, we devise a deterioration detection mechanism to capture the deterioration of the imputation model to further improve the imputation accuracy. Finally, we conduct an extensive evaluation on the proposed algorithms using real-world and synthetic datasets. Experimental results demonstrate that REMAIN achieves significantly higher imputation accuracy over existing solutions. Meanwhile, REMAIN improves up to one order of magnitude in time cost compared with existing approaches. 
    more » « less
  2. Real-world data is often incomplete and contains missing values. To train accurate models over real-world datasets, users need to spend a substantial amount of time and resources imputing and finding proper values for missing data items. In this paper, we demonstrate that it is possible to learn accurate models directly from data with missing values for certain training data and target models. We propose a unified approach for checking the necessity of data imputation to learn accurate models across various widely-used machine learning paradigms. We build efficient algorithms with theoretical guarantees to check this necessity and return accurate models in cases where imputation is unnecessary. Our extensive experiments indicate that our proposed algorithms significantly reduce the amount of time and effort needed for data imputation without imposing considerable computational overhead. 
    more » « less
  3. We investigate the fairness concerns of training a machine learning model using data with missing values. Even though there are a number of fairness intervention methods in the literature, most of them require a complete training set as input. In practice, data can have missing values, and data missing patterns can depend on group attributes (e.g. gender or race). Simply applying off-the-shelf fair learning algorithms to an imputed dataset may lead to an unfair model. In this paper, we first theoretically analyze different sources of discrimination risks when training with an imputed dataset. Then, we propose an integrated approach based on decision trees that does not require a separate process of imputation and learning. Instead, we train a tree with missing incorporated as attribute (MIA), which does not require explicit imputation, and we optimize a fairness-regularized objective function. We demonstrate that our approach outperforms existing fairness intervention methods applied to an imputed dataset, through several experiments on real-world datasets. 
    more » « less
  4. Social science approaches to missing values predict avoided, unrequested, or lost information from dense data sets, typically surveys. The authors propose a matrix factorization approach to missing data imputation that (1) identifies underlying factors to model similarities across respondents and responses and (2) regularizes across factors to reduce their overinfluence for optimal data reconstruction. This approach may enable social scientists to draw new conclusions from sparse data sets with a large number of features, for example, historical or archival sources, online surveys with high attrition rates, or data sets created from Web scraping, which confound traditional imputation techniques. The authors introduce matrix factorization techniques and detail their probabilistic interpretation, and they demonstrate these techniques’ consistency with Rubin’s multiple imputation framework. The authors show via simulations using artificial data and data from real-world subsets of the General Social Survey and National Longitudinal Study of Youth cases for which matrix factorization techniques may be preferred. These findings recommend the use of matrix factorization for data reconstruction in several settings, particularly when data are Boolean and categorical and when large proportions of the data are missing. 
    more » « less
  5. We present Breadcrumb, a system that aids developers in debugging queries through query-based explanations for missing answers. Given as input a query and an expected, but missing, query result, Breadcrumb identifies operators in the input query that are responsible for the failure to derive the missing answer. These operators form explanations that guide developers who can then focus their debugging efforts on fixing these parts of the query. Breadcrumb is implemented on top of Apache Spark. Our approach is the first that scales to big data dimensions and is capable of finding explanations for common errors in queries over nested and de-normalized data, e.g., errors based on misinterpreting schema semantics. 
    more » « less