skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Debugging missing answers for spark queries over nested data with breadcrumb
We present Breadcrumb, a system that aids developers in debugging queries through query-based explanations for missing answers. Given as input a query and an expected, but missing, query result, Breadcrumb identifies operators in the input query that are responsible for the failure to derive the missing answer. These operators form explanations that guide developers who can then focus their debugging efforts on fixing these parts of the query. Breadcrumb is implemented on top of Apache Spark. Our approach is the first that scales to big data dimensions and is capable of finding explanations for common errors in queries over nested and de-normalized data, e.g., errors based on misinterpreting schema semantics.  more » « less
Award ID(s):
2107107 1956123
PAR ID:
10358641
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
12
ISSN:
2150-8097
Page Range / eLocation ID:
2731 to 2734
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanations has proven useful, e.g., to debug complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify. 
    more » « less
  2. SQL is the most commonly used front-end language for data-intensive scalable computing (DISC) applications due to its broad presence in new and legacy workflows and shallow learning curve. However, DISC-backed SQL introduces several layers of abstraction that significantly reduce the visibility and transparency of workflows, making it challenging for developers to find and fix errors in a query. When a query returns incorrect outputs, it takes a non-trivial effort to comprehend every stage of the query execution and find the root cause among the input data and complex SQL query. We aim to bring the benefits of step-through interactive debugging to DISC-powered SQL with DeSQL. Due to the declarative nature of SQL, there are no ordered atomic statements to place a breakpoint to monitor the flow of data. DeSQL’s automated query decomposition breaks a SQL query into its constituent sub queries, offering natural locations for setting breakpoints and monitoring intermediate data. However, due to advanced query optimization and translation in DISC systems, a user query rarely matches the physical execution, making it challenging to associate subqueries with their intermediate data. DeSQL performs fine-grained taint analysis to dynamically map the subqueries to their intermediate data, while also recognizing subqueries removed by the optimizers. For such subqueries, DeSQL efficiently regenerates the intermediate data from a nearby subquery’s data. On the popular TPC-DC benchmark, DeSQL provides a complete debugging view in 13% less time than the original job time while incurring an average overhead of 10% in addition to retaining Apache Spark’s scalability. In a user study comprising 15 participants engaged in two debugging tasks, we find that participants utilizing DeSQL identify the root cause behind a wrong query output in 74% less time than the de-facto, manual debugging. 
    more » « less
  3. Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency. 
    more » « less
  4. An increasing number of applications in all aspects of society rely on data. Despite the long line of research in data cleaning and repairs, data correctness has been an elusive goal. Errors in the data can be extremely disruptive, and are detrimental to the effectiveness and proper function of data-driven applications. Even when data is cleaned, new errors can be introduced by applications and users who interact with the data. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Any discovered errors tend to be corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder. In this demo proposal, we outline the design of QFix, a query-centric framework that derives explanations and repairs for discrepancies in relational data based on potential errors in the queries that operated on the data. This is a marked departure from traditional data-centric techniques that directly fix the data. We then describe how users will use QFix in a demonstration scenario. Participants will be able to select from a number of transactional benchmarks, introduce errors into the queries that are executed, and compare the fixes to the queries proposed by QFix as well as existing alternative algorithms such as decision trees. 
    more » « less
  5. Data-driven applications rely on the correctness of their data to function properly and effectively. Errors in data can be incredibly costly and disruptive, leading to loss of revenue, incorrect conclusions, and misguided policy decisions. While data cleaning tools can purge datasets of many errors before the data is used, applications and users interacting with the data can introduce new errors. Subsequent valid updates can obscure these errors and propagate them through the dataset causing more discrepancies. Even when some of these discrepancies are discovered, they are often corrected superficially, on a case-by-case basis, further obscuring the true underlying cause, and making detection of the remaining errors harder. In this paper, we propose QFix, a framework that derives explanations and repairs for discrepancies in relational data, by analyzing the effect of queries that operated on the data and identifying potential mistakes in those queries. QFix is flexible, handling scenarios where only a subset of the true discrepancies is known, and robust to different types of update workloads. We make four important contributions: (a) we formalize the problem of diagnosing the causes of data errors based on the queries that operated on and introduced errors to a dataset; (b) we develop exact methods for deriving diagnoses and fixes for identified errors using state-of-the-art tools; (c) we present several optimization techniques that improve our basic approach without compromising accuracy, and (d) we leverage a tradeoff between accuracy and performance to scale diagnosis to large datasets and query logs, while achieving near-optimal results. We demonstrate the effectiveness of QFix through extensive evaluation over benchmark and synthetic data. 
    more » « less