skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints
The task of SQL query equivalence checking is important in various real-world applications (including query rewriting and automated grading) that involve complex queries with integrity constraints; yet, state-of-the-art techniques are very limited in their capability of reasoning about complex features (e.g., those that involve sorting, case statement, rich integrity constraints, etc.) in real-life queries. To the best of our knowledge, we propose the first SMT-based approach and its implementation, VeriEQL, capable of proving and disproving bounded equivalence of complex SQL queries. VeriEQL is based on a new logical encoding that models query semantics over symbolic tuples using the theory of integers with uninterpreted functions. It is simple yet highly practical -- our comprehensive evaluation on over 20,000 benchmarks shows that VeriEQL outperforms all state-of-the-art techniques by more than one order of magnitude in terms of the number of benchmarks that can be proved or disproved. VeriEQL can also generate counterexamples that facilitate many downstream tasks (such as finding serious bugs in systems like MySQL and Apache Calcite).  more » « less
Award ID(s):
2210832
PAR ID:
10543008
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
8
Issue:
OOPSLA1
ISSN:
2475-1421
Page Range / eLocation ID:
1071 to 1099
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Checking query equivalence is of great significance in database systems. Prior work in automated query equivalence checking sets the first steps in formally modeling and reasoning about query optimization rules, but only supports a limited number of query features. In this paper, we present Qed, a new framework for query equivalence checking based on bag semantics. Qed uses a new formalism called Q-expressions that models queries using different normal forms for efficient equivalence checking, and models features such as integrity constraints and NULLs in a principled way unlike prior work. Our formalism also allows us to define a new query fragment that encompasses many real-world queries with a complete equivalence checking algorithm, assuming a complete first-order theory solver. Empirically, Qed can verify 299 out of 444 query pairs extracted from the Calcite framework and 979 out of 1287 query pairs extracted from CockroachDB, which is more than 2× the number of cases proven by prior state-of-the-art solver. 
    more » « less
  2. Testing database-backed web applications is chal- lenging because their behaviors (e.g., control flow) are highly dependent on data returned from SQL queries. Without a database containing sufficient and realistic data, it is challenging to reach potentially vulnerable code snippets, limiting various existing dynamic-based security testing approaches. However, obtaining such a database for testing is difficult in practice as it often contains sensitive information. Sharing it can lead to data leaks and privacy issues. In this paper, we present SYNTHDB, a program analysis- based database generation technique for database-backed PHP applications. SYNTHDB leverages a concolic execution engine to identify interactions between PHP codebase and the SQL queries. It then collects and solves various constraints to reconstruct a database that can enable exploring uncovered program paths without violating database integrity. Our evaluation results show that the database generated by SYNTHDB outperforms state-of- the-arts database generation techniques in terms of code and query coverage in 17 real-world PHP applications. Specifically, SYNTHDB generated databases achieve 62.9% code and 77.1% query coverages, which are 14.0% and 24.2% more in code and query coverages than the state-of-the-art techniques. Fur- thermore, our security analysis results show that SYNTHDB effectively aids existing security testing tools: Burp Suite, Wfuzz, and webFuzz. Burp Suite aided by SYNTHDB detects 76.8% of vulnerabilities while other existing techniques cover 55.7% or fewer. Impressively, with SYNTHDB, Burp Suite discovers 33 pre- viously unknown vulnerabilities from 5 real-world applications. 
    more » « less
  3. Proving the equivalence between SQL queries is a fundamental problem in database research. Existing solvers model queries using algebraic representations and convert such representations into first-order logic formulas so that query equivalence can be verified by solving a satisfiability problem. The main challenge lies in unbounded summations, which appear commonly in a query's algebraic representation in order to model common SQL features, such as projection and aggregate functions. Unfortunately, existing solvers handle unbounded summations in an ad-hoc manner based on heuristics or syntax comparison, which severely limits the set of queries that can be supported. This paper develops a new SQL equivalence prover called SQLSolver, which can handle unbounded summations in a principled way. Our key insight is to use the theory of LIA^*, which extends linear integer arithmetic formulas with unbounded sums and provides algorithms to translate a LIA^* formula to a LIA formula that can be decided using existing SMT solvers. We augment the basic LIA^* theory to handle several complex scenarios (such as nested unbounded summations) that arise from modeling real-world queries. We evaluate SQLSolver with 359 equivalent query pairs derived from the SQL rewrite rules in Calcite and Spark SQL. SQLSolver successfully proves 346 pairs of them, which significantly outperforms existing provers. 
    more » « less
  4. We present a novel symbolic reasoning engine for SQL which can efficiently generate an inputIfornqueriesP1, ⋯,Pn, such that their outputs onIsatisfy a given property (expressed in SMT). This is useful in different contexts, such as disproving equivalence of two SQL queries and disambiguating a set of queries. Our first idea is to reason about an under-approximation of eachPi— that is, a subset ofPi’s input-output behaviors. While it makes our approach both semantics-aware and lightweight, this idea alone is incomplete (as a fixed under-approximation might miss some behaviors of interest). Therefore, our second idea is to perform search over an expressive family of under-approximations (which collectively cover all program behaviors of interest), thereby making our approach complete. We have implemented these ideas in a tool, Polygon, and evaluated it on over 30,000 benchmarks across two tasks (namely, SQL equivalence refutation and query disambiguation). Our evaluation results show that Polygon significantly outperforms all prior techniques. 
    more » « less
  5. Query rewriting is often a prerequisite for effective query optimization, particularly for poorly-written queries. Prior work on query rewriting has relied on a set of "rules" based on syntactic pattern-matching. Whether relying on manual rules or auto-generated ones, rule-based query rewriters are inherently limited in their ability to handle new query patterns. Their success is limited by the quality and quantity of the rules provided to them. To our knowledge, we present the first synthesis-based query rewriting technique, SlabCity, capable of whole-query optimization without relying on any rewrite rules. SlabCity directly searches the space of SQL queries using a novel query synthesis algorithm that leverages a new concept called query dataflows. We evaluate SlabCity on four workloads, including a newly curated benchmark with more than 1000 real-life queries. We show that not only can SlabCity optimize more queries than state-of-the-art query rewriting techniques, but interestingly, it also leads to queries that are significantly faster than those generated by rule-based systems. 
    more » « less