Less is More: How Fewer Results Improve Progressive Join Query Processing
The need for interactive and efficient data analytics and exploration has made progressive data processing, and progressive join in particular, essential to data science. Join queries are particularly challenging because correlation between the input datasets biases the results towards some join keys. Existing methods carefully control which parts of the input to process in order to improve the quality of progressive results. If the quality is not satisfactory, they process more data to improve the result. In this paper, we propose an alternative approach that initially seems counter-intuitive but surprisingly works very well. After query processing, we intentionally report fewer results to the user with the goal of improving the quality. The key idea is that if the output deviates from the correct distribution, we temporarily hide some results to correct the bias. As we process more data, the hidden results are inserted back until the full dataset is processed. The main challenge is that we do not know the correct output distribution while the progressive query is running. In this work, we formally define the progressive join problem with quality and progressive result rate constraints. We propose an input&output quality-aware progressive join framework (QPJ) that (1) provides input control that decides which parts of the input to process; (2) estimates the final result distribution progressively; (3) automatically controls the quality of the progressive output rate; and (4) combines input&output control to enable quality control of the progressive results. We compare QPJ with existing methods and show that QPJ's progressive output represents the final answer better than theirs.
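The idea of hiding results to correct bias can be illustrated with a minimal sketch. This is not the QPJ algorithm itself: it assumes the estimated final distribution is already available as an input (`target`), ignores the result rate constraint, and uses hypothetical names (`visible_counts`, `produced`, `target`). It shows the largest visible output whose per-key shares match the estimate, with the surplus held back.

```python
from math import floor

def visible_counts(produced, target):
    """Return how many results per join key to show so that the visible
    output follows `target`; the rest stay hidden for now.

    produced: dict key -> number of join results generated so far
    target:   dict key -> estimated share of the final result (sums to 1)
    Both arguments are assumptions made for this illustration.
    """
    # Largest total T such that T * target[k] <= produced[k] for every key.
    total = min(produced.get(k, 0) / p for k, p in target.items() if p > 0)
    # Never show more than was actually produced for a key.
    return {
        k: min(produced[k], floor(total * target.get(k, 0.0)))
        for k in produced
    }

# Key "a" is over-represented in the results produced so far.
produced = {"a": 80, "b": 10, "c": 10}
target = {"a": 0.5, "b": 0.25, "c": 0.25}

shown = visible_counts(produced, target)
hidden = {k: produced[k] - shown[k] for k in produced}
```

In this toy input, the surplus for key "a" is hidden while everything for "b" and "c" is shown. As more input is processed and `produced` grows for the under-represented keys, hidden results can be released again. Strict proportionality like this may hide a large share of the output, which is presumably why the abstract pairs quality with a progressive result rate constraint.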
- Award ID(s): 2046236
- PAR ID: 10438961
- Date Published:
- Journal Name: the 35th International Conference on Scientific and Statistical Database Management, SSDBM 2023
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Progressive query processing enables data scientists to efficiently analyze and explore large datasets. Data scientists can start further analyses earlier if the progressive result represents the complete result well. Most progressive processing frameworks carefully control which parts of the input to process in order to improve the quality of progressive results. These input control strategies work well when the data are processed uniformly. However, the progressive results will be biased towards some join keys if the processed data are not uniform. A recently proposed input&output framework named QPJ corrects the bias by temporarily hiding some results. The framework dynamically estimates the distribution of the complete result and outputs progressive results with a distribution similar to the estimate. This demo presents QPJVis, a progressive query processing system designed to process progressive queries natively with the QPJ framework. We also implement an input control framework, Prism, in QPJVis so that users can compare the input&output framework with a pure input control framework.
- Progressive visual analytics enables data scientists to efficiently explore large datasets and examine progressive results with low latency. Most progressive visualization frameworks use a progressive query processing module that controls the quality of the results and then feeds these results into a visualization module. The goal is to avoid poor-quality progressive results that could mislead data scientists. This method misses some optimization opportunities because it improves the quality of the intermediate result while ignoring how that result affects the final visualization. This work presents a work-in-progress quality-aware progressive visualization input control component, named QPV. The key idea of the proposed framework is to integrate the visualization module into progressive query processing so that the quality control takes the final visualization into account. With limited computational resources, QPV solves an optimization problem to allocate resources and alleviate misleading effects in the progressive plots.
- SkinnerDB uses reinforcement learning for reliable join ordering, exploiting an adaptive processing engine with specialized join algorithms and data structures. It maintains no data statistics and uses no cost or cardinality models. Also, it uses no training workloads nor does it try to link the current query to seemingly similar queries in the past. Instead, it uses reinforcement learning to learn optimal join orders from scratch during the execution of the current query. To that purpose, it divides the execution of a query into many small time slices. Different join orders are tried in different time slices. SkinnerDB merges result tuples generated according to different join orders until a complete query result is obtained. By measuring execution progress per time slice, it identifies promising join orders as execution proceeds. Along with SkinnerDB, we introduce a new quality criterion for query execution strategies. We upper-bound expected execution cost regret, i.e., the expected amount of execution cost wasted due to sub-optimal join order choices. SkinnerDB features multiple execution strategies that are optimized for that criterion. Some of them can be executed on top of existing database systems. For maximal performance, we introduce a customized execution engine, facilitating fast join order switching via specialized multi-way join algorithms and tuple representations. We experimentally compare SkinnerDB’s performance against various baselines, including MonetDB, Postgres, and adaptive processing methods. We consider various benchmarks, including the join order benchmark, TPC-H, and JCC-H, as well as benchmark variants with user-defined functions. Overall, the overheads of reliable join ordering are negligible compared to the performance impact of the occasional, catastrophic join order choice.
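The time-slice idea in the SkinnerDB abstract can be sketched as a bandit loop. This is a simplification under stated assumptions, not SkinnerDB's actual algorithm: it applies the standard UCB1 rule to a flat list of candidate join orders, and the per-slice progress is simulated by a fixed, made-up rate per order (`rates`). All names here are hypothetical.

```python
import math

def run_time_slices(orders, progress_per_slice, num_slices):
    """Choose a join order for each time slice with UCB1.

    orders: list of candidate join orders (hashable labels)
    progress_per_slice: callable(order) -> reward in [0, 1], standing in
        for the execution progress measured during one slice
    Returns a dict mapping each order to the number of slices it received.
    """
    pulls = {o: 0 for o in orders}
    reward_sum = {o: 0.0 for o in orders}
    for t in range(1, num_slices + 1):
        untried = [o for o in orders if pulls[o] == 0]
        if untried:
            choice = untried[0]  # give every order one slice first
        else:
            # UCB1: average observed progress plus an exploration bonus.
            choice = max(
                orders,
                key=lambda o: reward_sum[o] / pulls[o]
                + math.sqrt(2 * math.log(t) / pulls[o]),
            )
        reward_sum[choice] += progress_per_slice(choice)
        pulls[choice] += 1
    return pulls

# Simulated progress rates for three join orders (illustrative values).
rates = {("R", "S", "T"): 0.9, ("S", "T", "R"): 0.2, ("T", "R", "S"): 0.1}
pulls = run_time_slices(list(rates), rates.__getitem__, num_slices=200)
best = max(pulls, key=pulls.get)
```

With these simulated rates, most slices end up going to the order with the highest measured progress, while the slower orders still receive an occasional slice for exploration. A real engine would also have to merge the result tuples produced under different orders, which this sketch leaves out.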