We will demonstrate a prototype query-processing engine that exploits correlations among predicates to accelerate machine learning (ML) inference queries on unstructured data. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates, or PP) build a cheap proxy model for each predicate offline and, under an independence assumption among the predicates, inject the proxy models in front of the expensive ML UDFs. Input records that do not satisfy the query predicates are filtered early by the proxy models and bypass the ML UDFs. Enforcing the independence assumption, however, may result in sub-optimal plans. We use correlative proxy models to better exploit predicate correlations and accelerate ML queries. We will demonstrate our query optimizer, CORE, which builds proxy models online, allocates parameters to each model, and reorders them. We will also show end-to-end query processing with and without proxy models.
Optimizing machine learning inference queries with correlative proxy models
We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates, or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in front of the expensive ML UDFs. In this manner, inputs that are unlikely to satisfy the query predicates are filtered early and bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits predicate correlations to accelerate ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image, and video datasets show that CORE improves query throughput by up to 63% compared to PP and up to 80% compared to running the queries as-is.
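To make the filtering scheme concrete, the sketch below shows a cheap proxy model injected in front of an expensive ML UDF. This is a minimal illustration of the PP/CORE idea rather than either system's actual implementation; the function names and the `threshold` parameter are assumptions for exposition (in the real systems, thresholds are tuned so the rewritten plan meets a query-level accuracy target).

```python
def run_with_proxy(records, proxy_score, expensive_udf, threshold):
    """Filter records with a cheap proxy model before invoking an
    expensive ML UDF that evaluates the predicate exactly.

    proxy_score(rec) -> float: cheap estimate that the predicate holds.
    threshold: records scoring below it are dropped early; in PP/CORE
    this knob is tuned to meet a target accuracy (an assumption here).
    """
    for rec in records:
        if proxy_score(rec) < threshold:
            continue                      # unlikely input: ML UDF bypassed
        if expensive_udf(rec):            # exact but costly predicate check
            yield rec
```

With several correlated predicates, CORE's additional job is to decide online which proxies to build, how to order them, and how to allocate a threshold to each; this is where the branch-and-bound search over candidate plans comes in.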
- Award ID(s): 2107150
- PAR ID: 10442819
- Date Published:
- Journal Name: Proceedings of the VLDB Endowment
- Volume: 15
- Issue: 10
- ISSN: 2150-8097
- Page Range / eLocation ID: 2032–2044
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Predicate push-down is a key optimization used to speed up query processing. Much of the existing practice is restricted to pushing down predicates explicitly listed in the query. In this paper, we consider the challenge of learning predicates during query execution, which are then exploited to accelerate execution. Prior approaches with a similar goal are restricted (e.g., they learn only from join columns or from specific data statistics). We significantly expand the realm of predicates that can be learned from different query operators (aggregations, joins, grouping, etc.) and develop a system, entitled PLAQUE, that learns such predicates during query execution. Comprehensive evaluations on both synthetic and real datasets demonstrate that the learned-predicate approach adopted by PLAQUE can accelerate query execution by up to 33x, and this improvement increases to up to 100x when user-defined functions (UDFs) are utilized in queries.
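As a toy illustration of learning a predicate from an operator at run time (the names below are hypothetical, and PLAQUE learns a far richer class of predicates from aggregations, joins, grouping, and more), consider a hash join that derives a min/max range predicate on the join key from its build side and applies it to the probe side:

```python
def hash_join_with_learned_filter(build_rows, probe_rows, key):
    """Hash join that learns a range predicate on the join key from the
    build side and pushes it into the probe side as an early filter."""
    table = {}
    for r in build_rows:                   # build phase (assumes non-empty input)
        table.setdefault(r[key], []).append(r)
    lo, hi = min(table), max(table)        # learned predicate: lo <= key <= hi
    for p in probe_rows:
        k = p[key]
        if k < lo or k > hi:               # guaranteed non-match: skip the probe
            continue
        for r in table.get(k, ()):
            yield {**r, **p}               # emit merged matching row pairs
```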
Recent work has reemphasized the importance of cardinality estimates for query optimization. While new techniques have continuously improved in accuracy over time, they still generally allow underestimates, which often lead optimizers to make overly optimistic decisions. This can be very costly for expensive queries. An alternative approach to estimation is cardinality bounding, also called pessimistic cardinality estimation, where the cardinality estimator provides guaranteed upper bounds on the true cardinality. By never underestimating, this approach allows the optimizer to avoid potentially inefficient plans. However, existing pessimistic cardinality estimators are not yet practical: they use very limited statistics on the data and cannot handle predicates. In this paper, we introduce SafeBound, the first practical system for generating cardinality bounds. SafeBound builds on recent theoretical work that uses degree sequences on join attributes to compute cardinality bounds, extends this framework with predicates, introduces a practical compression method for the degree sequences, and implements an efficient inference algorithm. Across four workloads, SafeBound achieves up to 80% lower end-to-end runtimes than PostgreSQL, and is on par with or better than state-of-the-art ML-based estimators and pessimistic cardinality estimators by improving the runtime of the expensive queries. It also saves up to 500x in query planning time and uses up to 6.8x less space compared to state-of-the-art cardinality estimation methods.
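To give a flavor of the degree-sequence idea that SafeBound builds on, here is a minimal sketch under simplifying assumptions (a single equi-join key, uncompressed degree sequences, and hypothetical function names; the actual system adds predicate handling, compression, and an efficient inference algorithm). The true join size is the sum of d_R(v) * d_S(v) over matching key values v; pairing the two degree sequences in descending order can only overestimate any such value-to-value matching, so the result is a guaranteed upper bound:

```python
from collections import Counter

def degree_sequence(rows, key):
    """Multiplicity of each join-key value, sorted in descending order."""
    return sorted(Counter(r[key] for r in rows).values(), reverse=True)

def join_upper_bound(deg_r, deg_s):
    """Guaranteed upper bound on |R JOIN S| from degree sequences alone.

    By the rearrangement inequality, pairing both sorted sequences
    position by position never underestimates the true cardinality.
    """
    return sum(dr * ds for dr, ds in zip(deg_r, deg_s))

# R has key 1 twice and key 2 once; S has key 1 once and key 2 twice.
R = [{"a": 1}, {"a": 1}, {"a": 2}]
S = [{"a": 1}, {"a": 2}, {"a": 2}]
print(join_upper_bound(degree_sequence(R, "a"), degree_sequence(S, "a")))
# prints 5; the true join size is 2*1 + 1*2 = 4, so the bound holds
```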
Predicate-centric rules for rewriting queries are a key technique in query optimization. These include pushing a predicate down below join and aggregation operators, or optimizing the order in which predicates are evaluated. However, many of these rules are applicable only when the predicate uses a certain set of columns. For example, to move a predicate below a join operator, the predicate must use columns from only one of the joined tables. By generating a predicate that satisfies these column constraints and preserves the semantics of the original query, the optimizer can leverage additional predicate-centric rules that were not applicable before. Researchers have proposed syntax-driven rewrite rules and machine learning algorithms for inferring such predicates. However, these techniques suffer from two limitations. First, they do not let the optimizer constrain the set of columns that may be used in the learned predicate. Second, machine learning algorithms do not guarantee that the learned predicate preserves semantics. In this paper, we present SIA, a system for learning predicates guided by counter-examples and a verification technique, which addresses these limitations. The key idea is to leverage satisfiability modulo theories (SMT) to generate counter-examples and use them to iteratively learn a valid, optimal predicate. We formalize this problem by proving the key properties of synthesized predicates. We implement our approach in SIA and evaluate its efficacy and efficiency. We demonstrate that it synthesizes a larger set of valid predicates than prior approaches. On a collection of 200 queries derived from the TPC-H benchmark, SIA successfully rewrites 114 queries with learned predicates; 66 of these rewritten queries exhibit more than a 2x speedup.
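The sketch below illustrates a counter-example-guided loop on a tiny hypothetical example using the Z3 SMT solver; it is not SIA's actual algorithm, and the WHERE clause and candidate shape `x > c` are assumptions for exposition. Given a predicate over columns x and y, it learns the strongest predicate of the form `x > c` that every satisfying tuple obeys, so the learned predicate can legally be pushed to the side of the plan that holds x alone:

```python
from z3 import Ints, Solver, And, Not, sat

x, y = Ints("x y")
original = And(x + y > 10, y >= 0, y <= 5)   # hypothetical WHERE clause

def verify(c):
    """Return a tuple violating 'original implies x > c', or None if valid."""
    s = Solver()
    s.add(original, Not(x > c))
    if s.check() == sat:
        m = s.model()
        return m[x].as_long(), m[y].as_long()
    return None

c = 10**6                                    # optimistic initial candidate
while (cex := verify(c)) is not None:
    c = cex[0] - 1                           # weaken just enough to admit it
print(f"learned valid predicate on x alone: x > {c}")   # converges to x > 5
```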
Gridded datasets occur in several domains. These datasets comprise (un)structured grid points, where each grid point is characterized by XY(Z) coordinates in a spatial referencing system. The data available at individual grid points are high-dimensional, encapsulating multiple variables of interest. This study has two thrusts. The first targets effective management of voluminous gridded datasets while reconciling challenges relating to colocation and dispersion. The second is to support sliding (temporal) window queries over the gridded dataset. Such queries involve sliding a temporal window over the data to identify spatial locations and chronological time points where a specified predicate evaluates to true. Our methodology includes a space-efficient data structure for organizing information within the data, query decomposition based on dyadic intervals, support for temporal anchoring, query transformations, and effective evaluation of query predicates. Our empirical benchmarks are conducted on representative voluminous high-dimensional datasets such as gridMET (historical meteorological data) and MACA (future climate datasets based on the RCP 8.5 greenhouse gas trajectory). In our benchmarks, our system can sustain throughputs of over 3,000 multi-predicate sliding window queries per second.
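As a small illustration of the dyadic-interval decomposition mentioned above (a sketch with a hypothetical function name, not the system's implementation), the routine below covers an arbitrary half-open time range with O(log n) dyadic blocks; a window query can then combine precomputed per-block summaries instead of scanning every time point:

```python
def dyadic_cover(lo, hi):
    """Cover the half-open integer range [lo, hi) with dyadic intervals,
    i.e. intervals of the form [k * 2**j, (k + 1) * 2**j)."""
    out = []
    while lo < hi:
        # largest power-of-two block size that is aligned at lo ...
        size = lo & -lo if lo else 1 << ((hi - lo).bit_length() - 1)
        # ... shrunk until it also fits inside the remaining range
        while size > hi - lo:
            size >>= 1
        out.append((lo, lo + size))
        lo += size
    return out

print(dyadic_cover(3, 11))   # [(3, 4), (4, 8), (8, 10), (10, 11)]
```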