skip to main content


This content will become publicly available on July 1, 2024

Title: SlabCity: Whole-Query Optimization Using Program Synthesis
Query rewriting is often a prerequisite for effective query optimization, particularly for poorly-written queries. Prior work on query rewriting has relied on a set of "rules" based on syntactic pattern-matching. Whether relying on manual rules or auto-generated ones, rule-based query rewriters are inherently limited in their ability to handle new query patterns. Their success is limited by the quality and quantity of the rules provided to them. To our knowledge, we present the first synthesis-based query rewriting technique, SlabCity, capable of whole-query optimization without relying on any rewrite rules. SlabCity directly searches the space of SQL queries using a novel query synthesis algorithm that leverages a new concept called query dataflows. We evaluate SlabCity on four workloads, including a newly curated benchmark with more than 1000 real-life queries. We show that not only can SlabCity optimize more queries than state-of-the-art query rewriting techniques, but interestingly, it also leads to queries that are significantly faster than those generated by rule-based systems.  more » « less
Award ID(s):
2210832
NSF-PAR ID:
10451715
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
16
Issue:
11
ISSN:
2150-8097
Page Range / eLocation ID:
3151 to 3164
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Predicate-centric rules for rewriting queries is a key technique in optimizing queries. These include pushing down the predicate below the join and aggregation operators, or optimizing the order of evaluating predicates. However, many of these rules are only applicable when the predicate uses a certain set of columns. For example, to move the predicate below the join operator, the predicate must only use columns from one of the joined tables. By generating a predicate that satisfies these column constraints and preserves the semantics of the original query, the optimizer may leverage additional predicate-centric rules that were not applicable before. Researchers have proposed syntax-driven rewrite rules and machine learning algorithms for inferring such predicates. However, these techniques suffer from two limitations. First, they do not let the optimizer constrain the set of columns that may be used in the learned predicate. Second, machine learning algorithms do not guarantee that the learned predicate preserves semantics. In this paper, we present SIA, a system for learning predicates while being guided by counter-examples and a verification technique, that addresses these limitations. The key idea is to leverage satisfiability modulo theories to generate counter-examples and use them to iteratively learn a valid, optimal predicate. We formalize this problem by proving the key properties of synthesized predicates. We implement our approach in SIA and evaluate its efficacy and efficiency. We demonstrate that it synthesizes a larger set of valid predicates compared to prior approaches. On a collection of 200 queries derived from the TPC-H benchmark, SIA successfully rewrites 114 queries with learned predicates. 66 of these rewritten queries exhibit more than 2X speed up. 
    more » « less
  2. Regular expressions are pervasive in modern systems. Many real world regular expressions are inefficient, sometimes to the extent that they are vulnerable to complexity-based attacks, and while much research has focused on detecting inefficient regular expressions or accelerating regular expression matching at the hardware level, we investigate automatically transforming regular expressions to remove inefficiencies. We reduce this problem to general expression optimization, an important task necessary in a variety of domains even beyond compilers, e.g., digital logic design, etc. Syntax-guided synthesis (SyGuS) with a cost function can be used for this purpose, but ordered enumeration through a large space of candidate expressions can be prohibitively expensive. Equality saturation is an alternative approach which allows efficientconstruction and maintenance of expression equivalence classes generated by rewrite rules, but the procedure may not reach saturation, meaning global minimality cannot be confirmed. We present a new approach called rewrite-guided synthesis (ReGiS), in which a unique interplay between SyGuS and equality saturation-based rewriting helps to overcome these problems, resulting in an efficient, scalable framework for expression optimization. 
    more » « less
  3. Query driven cardinality estimation models learn from a historical log of queries. They are lightweight, having low storage requirements, fast inference and training, and are easily adaptable for any kind of query. Unfortunately, such models can suffer unpredictably bad performance under workload drift, i.e., if the query pattern or data changes. This makes them unreliable and hard to deploy. We analyze the reasons why models become unpredictable due to workload drift, and introduce modifications to the query representation and neural network training techniques to make query-driven models robust to the effects of workload drift. First, we emulate workload drift in queries involving some unseen tables or columns by randomly masking out some table or column features during training. This forces the model to make predictions with missing query information, relying more on robust features based on up-to-date DBMS statistics that are useful even when query or data drift happens. Second, we introduce join bitmaps, which extends sampling-based features to be consistent across joins using ideas from sideways information passing. Finally, we show how both of these ideas can be adapted to handle data updates.

    We show significantly greater generalization than past works across different workloads and databases. For instance, a model trained with our techniques on a simple workload (JOBLight-train), with 40ksynthetically generated queries of at most 3 tables each, is able to generalize to the much more complex Join Order Benchmark, which include queries with up to 16 tables, and improve query runtimes by 2× over PostgreSQL. We show similar robustness results with data updates, and across other workloads. We discuss the situations where we expect, and see, improvements, as well as more challenging workload drift scenarios where these techniques do not improve much over PostgreSQL. However, even in the most challenging scenarios, our models never perform worse than PostgreSQL, while standard query driven models can get much worse than PostgreSQL.

     
    more » « less
  4. This paper focuses on a newly challenging setting in hard-label adversarial attacks on text data by taking the budget information into account. Although existing approaches can successfully generate adversarial examples in the hard-label setting, they follow an ideal assumption that the victim model does not restrict the number of queries. However, in real-world applications the query budget is usually tight or limited. Moreover, existing hard-label adversarial attack techniques use the genetic algorithm to optimize discrete text data by maintaining a number of adversarial candidates during optimization, which can lead to the problem of generating low-quality adversarial examples in the tight-budget setting. To solve this problem, in this paper, we propose a new method named TextHoaxer by formulating the budgeted hard-label adversarial attack task on text data as a gradient-based optimization problem of perturbation matrix in the continuous word embedding space. Compared with the genetic algorithm-based optimization, our solution only uses a single initialized adversarial example as the adversarial candidate for optimization, which significantly reduces the number of queries. The optimization is guided by a new objective function consisting of three terms, i.e., semantic similarity term, pair-wise perturbation constraint, and sparsity constraint. Semantic similarity term and pair-wise perturbation constraint can ensure the high semantic similarity of adversarial examples from both comprehensive text-level and individual word-level, while the sparsity constraint explicitly restricts the number of perturbed words, which is also helpful for enhancing the quality of generated text. We conduct extensive experiments on eight text datasets against three representative natural language models, and experimental results show that TextHoaxer can generate high-quality adversarial examples with higher semantic similarity and lower perturbation rate under the tight-budget setting. 
    more » « less
  5. State-of-the-art password guessing tools, such as HashCat and John the Ripper, enable users to check billions of passwords per second against password hashes. In addition to performing straightforward dictionary attacks, these tools can expand password dictionaries using password generation rules, such as concatenation of words (e.g., “password123456”) and leet speak (e.g., “password” becomes “p4s5w0rd”). Although these rules work well in practice, creating and expanding them to model further passwords is a labor-intensive task that requires specialized expertise. To address this issue, in this paper we introduce PassGAN, a novel approach that replaces human-generated password rules with theory-grounded machine learning algorithms. Instead of relying on manual password analysis, PassGAN uses a Generative Adversarial Network (GAN) to autonomously learn the distribution of real passwords from actual password leaks, and to generate high-quality password guesses. Our experiments show that this approach is very promising. When we evaluated PassGAN on two large password datasets, we were able to surpass rule-based and state-of-the-art machine learning password guessing tools. However, in contrast with the other tools, PassGAN achieved this result without any a-priori knowledge on passwords or common password structures. Additionally, when we combined the output of PassGAN with the output of HashCat, we were able to match 51%–73% more passwords than with HashCat alone. This is remarkable, because it shows that PassGAN can autonomously extract a considerable number of password properties that current state-of-the art rules do not encode. 
    more » « less