Sketches are single-pass small-space data summaries that can quickly estimate the cardinality of join queries. However, sketches are not directly applicable to join queries with dynamic filter conditions --- where arbitrary selection predicate(s) are applied --- since a sketch is limited to a fixed selection. While multiple sketches for various selections can be used in combination, they each incur individual storage and maintenance costs. Alternatively, exact sketches can be built during runtime for every selection. To make this process scale, a high-degree of parallelism --- available in hardware accelerators such as GPUs --- is required. Therefore, sketch usage for cardinality estimation in query optimization is limited. Following recent work that applies transformers to cardinality estimation, we design a novel learning-based method to approximate the sketch of any arbitrary selection, enabling sketches for join queries with filter conditions. We train a transformer on each table to estimate the sketch of any subset of the table, i.e., any arbitrary selection. Transformers achieve this by learning the joint distribution amongst table attributes, which is equivalent to a multidimensional sketch. Subsequently, transformers can approximate any sketch, enabling sketches for join cardinality estimation. In turn, estimating joins via approximate sketches allows tables to be modeled individually and thus scales linearly with the number of tables. We evaluate the accuracy and efficacy of approximate sketches on queries with selection predicates consisting of conjunctions of point and range conditions. Approximate sketches achieve similar accuracy to exact sketches with at least one order of magnitude less overhead. 
                        more » 
                        « less   
                    
                            
                            COMPASS: Online Sketch-based Query Optimization for In-Memory Databases
                        
                    
    
            Cost-based query optimization remains a critical task in relational databases even after decades of research and industrial development. Query optimizers rely on a large range of statistical synopses for accurate cardinality estimation. As the complexity of selections and the number of join predicates increase, two problems arise. First, statistics cannot be incrementally composed to effectively estimate the cost of the sub-plans generated in plan enumeration. Second, small errors are propagated exponentially through joins, which can lead to severely sub-optimal plans. In this paper, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics---Fast-AGMS sketches. In COMPASS, query optimization and execution are intertwined. Selection predicates and sketch updates are pushed-down and evaluated online during query optimization. This allows Fast-AGMS sketches to be computed only over the relevant tuples---which enhances cardinality estimation accuracy. Plan enumeration is performed over the query join graph by incrementally composing attribute-level sketches---not by building a separate sketch for every sub-plan. We prototype COMPASS in MapD -- an open-source parallel database -- and perform extensive experiments over the complete JOB benchmark. The results prove that COMPASS generates better execution plans -- both in terms of cardinality and runtime -- compared to four other database systems. Overall, COMPASS achieves a speedup ranging from 1.35X to 11.28X in cumulative query execution time over the considered competitors. Supplementary Material Read me (3448016.3452840_readme.pdf) Download 472.23 KB Source Code (3448016.3452840_source_code.zip) Download 6.94 MB MP4 File (3448016.3452840.mp4) Cost-based query optimization remains a critical task in relational databases even after decades of research and industrial development. Query optimizers rely on a large range of statistical synopses -- including attribute-level histograms and table-level samples -- for accurate cardinality estimation. As the complexity of selection predicates and the number of join predicates increase, two problems arise. First, statistics cannot be incrementally composed to effectively estimate the cost of the sub-plans generated in plan enumeration. Second, small errors are propagated exponentially through joins, which can lead to severely sub-optimal plans. In this paper, we introduce COMPASS, a novel query optimization paradigm for in-memory databases based on a single type of statistics---Fast-AGMS sketches. In COMPASS, query optimization and execution are intertwined. Selection predicates and sketch updates are pushed-down and evaluated online during query optimization. This allows Fast-AGMS sketches to be computed only over the relevant tuples---which enhances cardinality estimation accuracy. Plan enumeration is performed over the query join graph by incrementally composing attribute-level sketches---not by building a separate sketch for every sub-plan.We prototype COMPASS in MapD -- an open-source parallel database -- and perform extensive experiments over the complete JOB benchmark. The results prove the reduced overhead COMPASS incurs, while generating better execution plans -- both in terms of cardinality and runtime -- compared to four other database systems. Overall, COMPASS achieves a speedup ranging from 1.89X to 7.09X in cumulative query execution time over the considered competitors. Moreover, COMPASS is the only optimizer that consistently generates effective plans for complex queries with 10 or more joins. 
        more » 
        « less   
        
    
                            - Award ID(s):
- 2008815
- PAR ID:
- 10545562
- Publisher / Repository:
- ACM SIGMOD
- Date Published:
- ISBN:
- 9781450383431
- Page Range / eLocation ID:
- 804 to 816
- Format(s):
- Medium: X
- Location:
- Virtual Event China
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Most query optimizers rely on cardinality estimates to optimize their execution plans. Traditional databases such as PostgreSQL, Oracle, and Db2 utilize synopses, such as histograms, samples, and sketches. Recent main-memory databases like DuckDB and Heavy.AI often operate with minimal or even without estimates, yet their performance does not necessarily suffer. To the best of our knowledge, no analytical comparison has been conducted between optimizers with and without cardinality estimates. In this paper, we present a comprehensive analysis of optimizers that use cardinality estimates and those that do not. To represent optimizers that don’t use cardinality estimates, we design a simple graph-based optimizer that only utilizes join types and table sizes. Our evaluation on the Join Order Benchmark reveals that cardinality estimates have a marginal impact in non-indexed settings, whereas inaccuracies in estimates can be detrimental in indexed settings. Furthermore, the impact of cardinality estimates is negligible in highly parallel main-memory databases.more » « less
- 
            null (Ed.)Query Optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate because of filtering conditions caused either from undetected correlations between multiple predicates local to a single dataset, predicates with query parameters, or predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate those settings, thus leading to suboptimal execution plans. Given the volume of today’s data, a suboptimal plan can quickly become very inefficient. In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created from each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimations, thus leading to much better execution plans. While it introduces the overhead for materializing these intermediate results, our experiments show that this overhead is relatively small and it is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans as compared to the current default AsterixDB plans as well as to plans produced by static cost-based optimization (i.e. based on the initial dataset statistics) and other state-of-the-art approaches.more » « less
- 
            Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of a query optimizer, and is often the main culprit when the optimizer chooses a poor plan. This paper introduces LpBound, a pessimistic cardinality estimator for multi-join queries (acyclic or cyclic) with selection predicates and group-by clauses.LpBoundcomputes a guaranteed upper bound on the size of the query output using simple statistics on the input relations, consisting of ℓp-norms of degree sequences. The bound is the optimal solution of a linear program whose constraints encode data statistics and Shannon inequalities. We introduce two optimizations that exploit the structure of the query in order to speed up the estimation time and makeLpBoundpractical. We experimentally evaluateLpBoundagainst a range of traditional, pessimistic, and machine learning-based estimators on the JOB, STATS, and subgraph matching benchmarks. Our main finding is thatLpBoundcan be orders of magnitude more accurate than traditional estimators used in mainstream open-source and commercial database systems. Yet it has comparable low estimation time and space requirements. When injected the estimates ofLpBound, Postgres derives query plans at least as good as those derived using the true cardinalities.more » « less
- 
            We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k , the k top-ranked answers are returned in O ( n polylog n + k log k ) time. This is within a polylogarithmic factor of O ( n + k log k ), i.e., the best known complexity for equi-joins, and even of O ( n + k ), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n ℓ for a join of ℓ relations. The key ingredient is a novel O ( n polylog n )-size factorized representation of the query output , which is constructed on-the-fly for a given query and database. In addition to providing the first nontrivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    