Title: Putting Things into Context: Rich Explanations for Query Answers using Join Graphs
In many data analysis applications there is a need to explain why a surprising or interesting result was produced by a query. Previous approaches to explaining results have directly or indirectly relied on data provenance, i.e., the input tuples contributing to the result(s) of interest. However, some information that is relevant for explaining an answer may not be contained in the provenance. We propose a new approach for explaining query results by augmenting provenance with information from other related tables in the database. We develop a suite of optimization techniques, and we demonstrate experimentally using real datasets and through a user study that our approach produces meaningful results and is efficient.
Award ID(s):
1956123 1640864
NSF-PAR ID:
10278465
Journal Name:
Proceedings of the 46th International Conference on Management of Data
Page Range / eLocation ID:
1051 to 1063
Sponsoring Org:
National Science Foundation
More Like this
  1. Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and a Datalog query as input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answering the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach, demonstrating its efficiency.
  2. Going Beyond Provenance: Explaining Query Answers with Pattern-based Counterbalances. Authors: Zhengjie Miao, Qitian Zeng, Boris Glavic, Sudeepa Roy. Abstract: Provenance and intervention-based techniques have been used to explain surprisingly high or low outcomes of aggregation queries. However, such techniques may miss interesting explanations emerging from data that is not in the provenance. For instance, an unusually low number of publications of a prolific researcher in a certain venue and year can be explained by an increased number of publications in another venue in the same year. We present a novel approach for explaining outliers in aggregation queries through counterbalancing. That is, explanations are outliers in the opposite direction of the outlier of interest. Outliers are defined w.r.t. patterns that hold over the data in aggregate. We present efficient methods for mining such aggregate regression patterns (ARPs), discuss how to use ARPs to generate and rank explanations, and experimentally demonstrate the efficiency and effectiveness of our approach.
  3. We present the first query-based approach for explaining missing answers to queries over nested relational data, a common data format used by big data systems such as Apache Spark. Our main contributions are a novel way to define query-based why-not provenance based on repairs to queries, an implementation, and preliminary experiments for answering such queries in Spark.
  4. Semiring provenance is a successful approach, originating in database theory, to providing detailed information on how atomic facts combine to yield the result of a query. In particular, general provenance semirings of polynomials or formal power series provide precise descriptions of the evaluation strategies or “proof trees” for the query. By evaluating these descriptions in specific application semirings, one can extract practical information, for instance about the confidence of a query or the cost of its evaluation. This paper develops semiring provenance for very general logical languages featuring the full interaction between negation and fixed-point inductions or, equivalently, arbitrary interleavings of least and greatest fixed points. This also opens the door to provenance analysis applications for modal μ-calculus and temporal logics, as well as for finite and infinite model-checking games. Interestingly, the common approach based on Kleene’s Fixed-Point Theorem for ω-continuous semirings is not sufficient for these general languages. We show that an adequate framework for the provenance analysis of full fixed-point logics is provided by semirings that are (1) fully continuous, and (2) absorptive. Full continuity guarantees that provenance values of least and greatest fixed points are well-defined. Absorptive semirings provide a symmetry between least and greatest fixed points and make sure that provenance values of greatest fixed points are informative. We identify semirings of generalized absorptive polynomials S∞[X] and prove universal properties that make them the most general appropriate semirings for our framework. These semirings have the further property of being (3) chain-positive, which is responsible for having truth-preserving interpretations that give non-zero values to all true formulae. We relate the provenance analysis of fixed-point formulae with provenance values of plays and strategies in the associated model-checking games. Specifically, we prove that the provenance value of a fixed-point formula gives precise information on the evaluation strategies in these games.
  5. As new laws governing management of personal data are introduced, e.g., the European Union’s General Data Protection Regulation of 2016 and the California Consumer Privacy Act of 2018, compliance with data governance legislation is becoming an increasingly important aspect of data management. An important component of many data privacy laws is that they require companies to use an individual’s data only for a purpose the individual has explicitly consented to. Prior methods for enforcing consent for aggregate queries either use access control to eliminate data without consent from query evaluation or apply differential privacy algorithms to inject synthetic noise into the outcomes of queries (or input data) to ensure that the anonymity of non-consenting individuals is preserved with high probability. Both approaches return query results that differ from the ground truth results corresponding to the full input containing data from both consenting and non-consenting individuals. We present an alternative framework for group-by aggregate queries, tailored for applications, e.g., medicine, where even a small deviation from the correct answer to a query cannot be tolerated. Our approach uses provenance to determine, for each output tuple of a group-by aggregate query, which individuals’ data was used to derive the result for this group. We then use statistical tests to determine how likely it is that the presence of data for a non-consenting individual will be revealed by such an output tuple. We filter out tuples for which this test fails, i.e., which are deemed likely to reveal non-consenting data. Thus, our approach always returns a subset of the ground truth query answers. In our experiments, our approach returns only fully accurate results in instances where access control or differential privacy would have returned either fewer results or less accurate results.