ADQuaTe2: A Data Quality Test Approach for Automated Constraint Discovery and Fault Detection

Homayouni, Hajar; Ghosh, Sudipto; Ray, Indrakshi; Kahn, Michael G.

Data quality is critical for data analytics. Data quality tests typically check constraints specified by domain experts. Existing approaches detect only trivial constraint violations, or identify outliers without explaining which constraints were violated. Moreover, domain experts may specify constraints in an ad hoc manner and miss important ones. We describe an automated data quality test approach, ADQuaTe2, which uses an autoencoder to (1) discover constraints that experts may have missed, (2) label records that violate those constraints as suspicious, and (3) explain the violations. An interactive learning technique incorporates expert feedback to improve accuracy. We evaluate the effectiveness of ADQuaTe2 on real-world datasets from the health and plant domains. We also use datasets from the UCI repository to evaluate the improvement in accuracy after incorporating ground truth knowledge.
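The idea of flagging records by how poorly an autoencoder reconstructs them can be illustrated with a minimal sketch. This is not the ADQuaTe2 implementation: it swaps in a one-unit, tied-weight linear autoencoder (equivalent to the first principal component) fitted by power iteration, and the temperature records, the attribute names, the helper functions `fit` and `attribute_errors`, and the mean-plus-two-standard-deviations threshold are all assumptions made for illustration. Expert feedback and interactive learning are not covered.

```python
import math

# Hypothetical records: (temp_c, temp_f, temp_k). The last row breaks the
# implicit constraint temp_f = 1.8 * temp_c + 32.
ATTRS = ["temp_c", "temp_f", "temp_k"]
rows = [(float(c), 1.8 * c + 32.0, c + 273.15) for c in range(0, 50, 5)]
rows.append((20.0, 150.0, 293.15))


def fit(rows, iters=100):
    """Fit a one-unit, tied-weight linear autoencoder by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) or 1.0
            for j in range(d)]
    # Standardize so every attribute contributes on the same scale.
    z = [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]
    w = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        # Encode each row, then update w toward the dominant direction.
        codes = [sum(wj * zj for wj, zj in zip(w, zr)) for zr in z]
        w = [sum(c * zr[j] for c, zr in zip(codes, z)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in w)) or 1.0
        w = [v / norm for v in w]
    return means, stds, w


def attribute_errors(row, model):
    """Squared reconstruction error of one record, split per attribute."""
    means, stds, w = model
    z = [(x - m) / s for x, m, s in zip(row, means, stds)]
    code = sum(wj * zj for wj, zj in zip(w, z))       # encoder
    return [(zj - code * wj) ** 2 for zj, wj in zip(z, w)]  # decoder residual


model = fit(rows)
per_attr = [attribute_errors(r, model) for r in rows]
scores = [sum(errs) for errs in per_attr]

# Assumed rule: a record is suspicious if its error is more than two
# standard deviations above the mean error.
mu = sum(scores) / len(scores)
sigma = math.sqrt(sum((s - mu) ** 2 for s in scores) / len(scores))
threshold = mu + 2.0 * sigma
suspicious = [i for i, s in enumerate(scores) if s > threshold]

# A crude explanation: the attribute with the largest share of the error.
explanations = {
    i: ATTRS[max(range(len(ATTRS)), key=lambda j: per_attr[i][j])]
    for i in suspicious
}

for i in suspicious:
    print(f"record {i} is suspicious; largest error on {explanations[i]}")
```

The per-attribute split of the reconstruction error is what makes a flagged record explainable here: it points at the attribute that disagrees most with the relationship learned from the rest of the data. A nonlinear autoencoder, as named in the abstract, could capture constraints that this linear stand-in cannot.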

Award ID(s): 1931324

PAR ID: 10185896

Author(s) / Creator(s): Homayouni, Hajar; Ghosh, Sudipto; Ray, Indrakshi; Kahn, Michael G.

Date Published: 2019-01-01

Journal Name: ACM TAPIA

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.