skip to main content


Title: PUG: a framework and practical implementation for why and why-not provenance
Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency.  more » « less
Award ID(s):
1640864
NSF-PAR ID:
10082094
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
The VLDB Journal
ISSN:
1066-8888
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We present the first query-based approach for explaining missing answers to queries over nested relational data which is a common data format used by big data systems such as Apache Spark. Our main contributions are a novel way to define query-based why-not provenance based on repairs to queries and presenting an implementation and preliminary experiments for answering such queries in Spark. 
    more » « less
  2. Range aggregate queries (RAQs) are an integral part of many real-world applications, where, often, fast and approximate answers for the queries are desired. Recent work has studied answering RAQs using machine learning (ML) models, where a model of the data is learned to answer the queries. However, there is no theoretical understanding of why and when the ML based approaches perform well. Furthermore, since the ML approaches model the data, they fail to capitalize on any query specific information to improve performance in practice. In this paper, we focus on modeling "queries" rather than data and train neural networks to learn the query answers. This change of focus allows us to theoretically study our ML approach to provide a distribution and query dependent error bound for neural networks when answering RAQs. We confirm our theoretical results by developing NeuroSketch, a neural network framework to answer RAQs in practice. Extensive experimental study on real-world, TPC-benchmark and synthetic datasets show that NeuroSketch answers RAQs multiple orders of magnitude faster than state-of-the-art and with better accuracy. 
    more » « less
  3. null (Ed.)
    Logical queries constitute an important subset of questions posed in knowledge graph question answering systems. Yet, effectively answering logical queries on large knowledge graphs remains a highly challenging problem. Traditional subgraph matching based methods might suffer from the noise and incompleteness of the underlying knowledge graph, often with a prolonged online response time. Recently, an alternative type of method has emerged whose key idea is to embed knowledge graph entities and the query in an embedding space so that the embedding of answer entities is close to that of the query. Compared with subgraph matching based methods, it can better handle the noisy or missing information in knowledge graph, with a faster online response. Promising as it might be, several fundamental limitations still exist, including the linear transformation assumption for modeling relations and the inability to answer complex queries with multiple variable nodes. In this paper, we propose an embedding based method (NewLook) to address these limitations. Our proposed method offers three major advantages. First (Applicability), it supports four types of logical operations and can answer queries with multiple variable nodes. Second (Effectiveness), the proposed NewLook goes beyond the linear transformation assumption, and thus consistently outperforms the existing methods. Third (Efficiency), compared with subgraph matching based methods, NewLook is at least 3 times faster in answering the queries; compared with the existing embed-ding based methods, NewLook bears a comparable or even faster online response and offline training time. 
    more » « less
  4. Database systems use static analysis to determine upfront which data is needed for answering a query and use indexes and other physical design techniques to speed-up access to that data. However, for important classes of queries, e.g., HAVING and top-k queries, it is impossible to determine up-front what data is relevant. To overcome this limitation, we develop provenance-based data skipping (PBDS), a novel approach that generates provenance sketches to concisely encode what data is relevant for a query. Once a provenance sketch has been captured it is used to speed up subsequent queries. PBDS can exploit physical design artifacts such as indexes and zone maps. 
    more » « less
  5. Knowledge graph question answering aims to identify answers of the query according to the facts in the knowledge graph. In the vast majority of the existing works, the input queries are considered perfect and can precisely express the user’s query intention. However, in reality, input queries might be ambiguous and elusive which only contain a limited amount of information. Directly answering these ambiguous queries may yield unwanted answers and deteriorate user experience. In this paper, we propose PReFNet which focuses on answering ambiguous queries with pseudo relevance feedback on knowledge graphs. In order to leverage the hidden (pseudo) relevance information existed in the results that are initially returned from a given query, PReFNet treats the top-k returned candidate answers as a set of most relevant answers, and uses variational Bayesian inference to infer user’s query intention. To boost the quality of the inferred queries, a neighborhood embedding based VGAE model is used to prune inferior inferred queries. The inferred high quality queries will be returned to the users to help them search with ease. Moreover, all the high-quality candidate nodes will be re-ranked according to the inferred queries. The experiment results show that our proposed method can recommend high-quality query graphs to users and improve the question answering accuracy. 
    more » « less