Title: Revisiting Runtime Dynamic Optimization for Join Queries in Big Data Management Systems
Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages separated by re-optimization points, and it begins by executing all predicates local to a single dataset. The intermediate result created by each stage is then used to re-optimize the remaining query. Because each decision relies on materialized results rather than on estimates, this re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits.
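The staged approach described in the abstract can be illustrated with a small sketch. This is not AsterixDB code and not the authors' algorithm: it is a toy, in-memory Python model that assumes equi-joins, a tree-shaped (acyclic) join graph, and a simple greedy rule that picks the next join by the observed sizes of its inputs. All function names (`run_local_predicates`, `hash_join`, `dynamic_join`) and the sample data are hypothetical.

```python
def run_local_predicates(datasets, predicates):
    """First stage: apply each dataset's local filter and materialize the result."""
    keep_all = lambda row: True
    return {
        name: [row for row in rows if predicates.get(name, keep_all)(row)]
        for name, rows in datasets.items()
    }


def hash_join(left, right, key):
    """Equi-join two lists of dict rows on `key`, building on the smaller input."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table = {}
    for row in build:
        table.setdefault(row[key], []).append(row)
    return [{**match, **row} for row in probe for match in table.get(row[key], [])]


def dynamic_join(datasets, predicates, joins):
    """Run joins one stage at a time, re-planning from observed cardinalities.

    `joins` is a list of (left_name, right_name, key) tuples that must form
    a connected, acyclic join graph (an assumption of this sketch).
    """
    materialized = run_local_predicates(datasets, predicates)
    pending = [tuple(j) for j in joins]
    order = []
    while pending:
        # Re-optimization point: sizes here are measured, not estimated.
        choice = min(
            pending,
            key=lambda j: len(materialized[j[0]]) + len(materialized[j[1]]),
        )
        pending.remove(choice)
        left, right, key = choice
        merged = f"({left} JOIN {right})"
        # Materialize the intermediate result of this stage.
        materialized[merged] = hash_join(
            materialized.pop(left), materialized.pop(right), key
        )
        order.append(merged)
        # Point the remaining joins at the new intermediate result.
        pending = [
            tuple(merged if n in (left, right) else n for n in (a, b)) + (k,)
            for a, b, k in pending
        ]
    result = next(iter(materialized.values()))
    return result, order


# Hypothetical sample data: the selective product filter makes the
# orders/products join the cheaper first stage.
datasets = {
    "orders": [
        {"oid": 1, "cid": 10, "pid": 100},
        {"oid": 2, "cid": 11, "pid": 101},
        {"oid": 3, "cid": 10, "pid": 101},
    ],
    "customers": [
        {"cid": 10, "country": "GR"},
        {"cid": 11, "country": "US"},
        {"cid": 12, "country": "GR"},
    ],
    "products": [{"pid": 100, "price": 5}, {"pid": 101, "price": 50}],
}
predicates = {
    "customers": lambda r: r["country"] == "GR",
    "products": lambda r: r["price"] > 10,
}
joins = [("orders", "customers", "cid"), ("orders", "products", "pid")]

result, order = dynamic_join(datasets, predicates, joins)
print(order)
print(result)
```

In this sketch the join listed first in the query (orders with customers) is not the one executed first, because the filtered `products` dataset turns out to be smaller than the filtered `customers` dataset once the local predicates have actually run. A real system would also weigh the cost of materializing each intermediate result, which this toy model ignores.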
Award ID(s):
1954962 1954644 1924694 1838248 1838222
NSF-PAR ID:
10466687
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
ACM SIGMOD Record
Volume:
52
Issue:
1
ISSN:
0163-5808
Page Range / eLocation ID:
104 to 113
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Query optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate in the presence of filtering conditions that involve undetected correlations between multiple predicates local to a single dataset, predicates with query parameters, or predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate those settings, thus leading to suboptimal execution plans. Given the volume of today’s data, a suboptimal plan can quickly become very inefficient. In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting by executing all predicates local to a single dataset. The intermediate result created by each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing these intermediate results, our experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans than the current default AsterixDB plans, plans produced by static cost-based optimization (i.e., based on the initial dataset statistics), and other state-of-the-art approaches.
  2. Query-based explanations for missing answers identify which operators of a query are responsible for the failure to return a missing answer of interest. This type of explanation has proven useful, e.g., to debug complex analytical queries. Such queries are frequent in big data systems such as Apache Spark. We present a novel approach to produce query-based explanations. It is the first to support nested data and to consider operators that modify the schema and structure of the data (e.g., nesting, projections) as potential causes of missing answers. To efficiently compute explanations, we propose a heuristic algorithm that applies two novel techniques: (i) reasoning about multiple schema alternatives for a query and (ii) re-validating at each step whether an intermediate result can contribute to the missing answer. Using an implementation on Spark, we demonstrate that our approach is the first to scale to large datasets while often finding explanations that existing techniques fail to identify.
  3. To process real-world datasets, modern data-parallel systems often require extremely large amounts of memory, which are both costly and energy inefficient. Emerging non-volatile memory (NVM) technologies offer high capacity compared to DRAM and low energy compared to SSDs. Hence, NVMs have the potential to fundamentally change the dichotomy between DRAM and durable storage in Big Data processing. However, most Big Data applications are written in managed languages and executed on top of a managed runtime that already performs various dimensions of memory management. Supporting hybrid physical memories adds a new dimension, creating unique challenges in data replacement. This article proposes Panthera, a semantics-aware, fully automated memory management technique for Big Data processing over hybrid memories. Panthera analyzes user programs on a Big Data system to infer their coarse-grained access patterns, which are then passed to the Panthera runtime for efficient data placement and migration. For Big Data applications, the coarse-grained data division information is accurate enough to guide the GC for data layout, which hardly incurs overhead in data monitoring and moving. We implemented Panthera in OpenJDK and Apache Spark. Based on Big Data applications’ memory access pattern, we also implemented a new profiling-guided optimization strategy, which is transparent to applications. With this optimization, our extensive evaluation demonstrates that Panthera reduces energy by 32–53% at less than 1% time overhead on average. To show Panthera’s applicability, we extend it to QuickCached, a pure Java implementation of Memcached. Our evaluation results show that Panthera reduces energy by 28.7% at 5.2% time overhead on average. 
  4. In the last few years, the field of data science has been growing rapidly as various businesses have adopted statistical and machine learning techniques to empower their decision-making and applications. Scaling data analyses to large volumes of data requires the utilization of distributed frameworks. This can lead to serious technical challenges for data analysts and reduce their productivity. AFrame, a data analytics library, is implemented as a layer on top of Apache AsterixDB. It addresses these issues by providing data scientists with a familiar interface, the Pandas DataFrame, and by transparently scaling out the evaluation of analytical operations through a Big Data management system. While AFrame is able to leverage data management facilities (e.g., indexes and query optimization) and allows users to interact with a large volume of data, the initial version only generated SQL++ queries and only operated against AsterixDB. In this work, we describe a new design that retargets AFrame's incremental query formation to other query-based database systems, making it more flexible for deployment against other data management systems with composable query languages.
  5. We present Breadcrumb, a system that aids developers in debugging queries through query-based explanations for missing answers. Given as input a query and an expected, but missing, query result, Breadcrumb identifies operators in the input query that are responsible for the failure to derive the missing answer. These operators form explanations that guide developers who can then focus their debugging efforts on fixing these parts of the query. Breadcrumb is implemented on top of Apache Spark. Our approach is the first that scales to big data dimensions and is capable of finding explanations for common errors in queries over nested and de-normalized data, e.g., errors based on misinterpreting schema semantics. 