<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>SPES: A Symbolic Approach to Proving Query Equivalence Under Bag Semantics</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/01/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10436871</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings  International Conference on Data Engineering</title>
<idno>1084-4627</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>J. Arulraj Q. Zhou</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[In database-as-a-service platforms, automated ver-ification of query equivalence helps eliminate redundant computation in the form of overlapping sub-queries. Researchers have proposed two pragmatic techniques to tackle this problem. The first approach consists of reducing the queries to algebraic expressions and proving their equivalence using an algebraic theory. The limitations of this technique are threefold. It cannot prove the equivalence of queries with significant differences in the attributes of their relational operators (e.g., predicates in the filter operator). It does not support certain widely-used SQL features (e.g., NULL values). Its verification procedure is computationally intensive. The second approach transforms this problem to a constraint satisfaction problem and leverages a general-purpose solver to determine query equivalence. This technique consists of deriving the symbolic representation of the queries and proving their equivalence by determining the query containment relationship between the symbolic expressions. While the latter approach addresses all the limitations of the former technique, it only proves the equivalence of queries under set semantics (i.e., output tables must not contain duplicate tuples). However, in practice, database applications use bag semantics (i.e., output tables may contain duplicate tuples) In this paper, we introduce a novel symbolic approach for proving query equivalence under bag semantics. We transform the problem of proving query equivalence under bag semantics to that of proving the existence of a bijective, identity map between tuples returned by the queries on all valid inputs. We classify SQL queries into four categories, and propose a set of novel category-specific verification algorithms. We implement this symbolic approach in SPES and demonstrate that it proves the equivalence of a larger set of query pairs (95/232) under bag semantics compared to the SOTA tools based on algebraic (30/232) and symbolic approaches (67/232) under set and bag semantics, respectively. Furthermore, SPES is 3X faster than the symbolic tool that proves equivalence under set semantics.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">INTRODUCTION</head><p>Database-as-a-service (DBaaS) platforms enable users to quickly deploy complex data processing pipelines consisting of SQL queries <ref type="bibr">[1,</ref><ref type="bibr">9,</ref><ref type="bibr">11]</ref>. These pipelines often exhibit a significant amount of computational overlap (i.e., semantically equivalent sub-queries) <ref type="bibr">[17,</ref><ref type="bibr">50]</ref>. This results in higher resource usage and longer query execution times. Researchers have developed techniques for minimizing redundant computation by materializing the overlapping sub-queries as views and rewriting the original queries to operate on these materialized views <ref type="bibr">[43,</ref><ref type="bibr">33]</ref>. All of these techniques rely on an effective and efficient algorithm for automatically deciding the equivalence of a pair of SQL queries. Two queries are equivalent if they always return the same output table for any given input tables.</p><p>Prior Work: Although proving query equivalence (QE) is an undecidable problem <ref type="bibr">[14,</ref><ref type="bibr">18]</ref>, researchers have recently formulated two pragmatic approaches for automatically proving QE.</p><p>The first approach is based on an algebraic representation of queries <ref type="bibr">[23,</ref><ref type="bibr">22]</ref>. UDP, the state-of-the-art prover based on this algebraic approach, determines QE using three steps. First, it transforms the queries from an abstract syntax tree (AST) representation to an algebraic representation. Next, it applies a set of normalization rules to homogenize the algebraic representations (ARs) of these queries. Lastly, it attempts to find an isomorphism between the tuple vocabularies of two normalized ARs to determine their syntactical equivalence using substitution. This pragmatic algebraic approach works well on real-world SQL queries. However, it suffers from three limitations. First, UDP cannot prove the equivalence of queries when the attributes in their relational operators are syntactically different (e.g., predicates in the filter operator). This is because it relies on a set of pre-defined syntax-driven rewrite rules to normalize the algebraic representation. Second, it does not support certain widely used SQL features (e.g., NULL values). Third, its verification procedure is computationally intensive due to the large number of possible normalized ARs.</p><p>EQUITAS addresses these limitations by reducing the problem of proving QE to a constraint satisfaction problem <ref type="bibr">[50]</ref>. EQUITAS determines QE using two steps. First, it transforms the queries from an AST representation to a symbolic representation (SR) (i.e., a set of first-order logic (FOL) formulae). A query's SR symbolically represents an arbitrary tuple that it returns. Next, it leverages a general-purpose solver based on satisfiability modulo theory (SMT) to determine the containment relationship between two SRs 1 . If the containment relationship holds in both directions, then the queries are equivalent. EQUITAS is empirically more effective and efficient than UDP <ref type="bibr">[50]</ref>.</p><p>While this symbolic approach addresses the drawbacks of the algebraic approach, it has one significant limitation. It only proves the equivalence of queries under set semantics (i.e., output tables must not contain duplicate tuples <ref type="bibr">[41]</ref>). In practice, database applications use bag semantics (i.e., output tables may contain duplicate tuples <ref type="bibr">[16]</ref>) Proving QE under bag semantics is a strictly harder problem than doing so under set semantics. This is because if two queries are equivalent under bag semantics, then they are also equivalent under set semantics. However, the converse does not hold. It is infeasible to extend EQUITAS to prove query equivalence under bag semantics. To prove QE under set semantics, it verifies the containment relationship between queries. This technique only works if all the tuples in the output tables of the queries are distinct. So, it does not work if duplicate tuples are present in the tables.</p><p>Our Approach: In this paper, we present a novel technique for proving QE under bag semantics. We reduce the problem of proving that two queries are equivalent under bag semantics to the problem of proving the existence of a bijective, identity map between tuples returned by the queries across all valid inputs. We introduce a novel query pair symbolic representation (QPSR) to symbolically represent the bijective map between tuples returned by two cardinally equivalent queries and leverage an SMT solver to efficiently verify that the bijective map is an identity map <ref type="foot">2</ref> . We classify SQL queries into four categories, and propose a set of novel category-specific verification algorithms to verify cardinal equivalence and construct the QPSR.</p><p>We implemented this symbolic approach in SPES, a tool for automatically verifying the equivalence of SQL queries under bag semantics. We evaluate SPES using a collection of 232 pairs of equivalent SQL queries available in the Apache CALCITE framework <ref type="bibr">[3]</ref>. Each query pair is constructed by applying various optimization rules on complex SQL queries with diverse features (e.g., arithmetic operations, three-valued logic for supporting NULL, subqueries, grouping, and aggregate functions). Our evaluation shows that SPES proves the semantic equivalence of a larger set of query pairs (95/232) compared to UDP (34/232) and EQUITAS (67/232). Furthermore, SPES is 3&#8677; faster than EQUITAS on this benchmark. In addition to the CALCITE benchmark, we evaluate the efficacy of SPES on a cloud-scale workload comprising 9, 486 real-world SQL queries from Ant Financial Services Group <ref type="bibr">[2]</ref>. SPES automatically found that 27% of the queries in this workload contain overlapping computation. Furthermore, 48% overlapping queries contain either aggregate or join operations, which are expensive to evaluate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Contributions:</head><p>We make the following contributions:</p><p>&#8226; We present a novel query pair symbolic representation and a set of verification algorithms for determining QE under bag semantics in &#167;5. &#8226; We classify SQL queries into four categories, and present a set of normalization rules to enhance the ability of SPES to prove query equivalence under bag semantics in &#167;4. &#8226; We implement this approach in SPES and evaluate its efficacy.</p><p>We demonstrate that SPES proves the equivalence of a larger set of query pairs in CALCITE benchmark compared to the stateof-the-art tools in &#167;7. More importantly, unlike EQUITAS, SPES proves QE under bag semantics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">BACKGROUND &amp; MOTIVATION</head><p>In this section, we present an overview of the symbolic approach for proving QE under set semantics used in EQUITAS <ref type="bibr">[50]</ref>. We then explain why this approach cannot be used to prove QE under bag semantics.</p><p>A pair of queries Q1 and Q2 are equivalent under set semantics if and only if for all valid input tables, Q1 and Q2 return the same set of tuples. EQUITAS proves QE by verifying whether the containment relationship holds in both directions (i.e., Q1 contains Q2 and Q2 contains Q1). Q1 is contained by Q2 under set semantics if and only if the following condition is satisfied: for all valid input tables, if any tuple is returned by Q1, then there is always a corresponding, identical tuple returned by Q2. EQUITAS first constructs the SRs of Q1 and Q2, and then uses an SMT solver to verify the relational properties between two SRs to prove the containment relationship.</p><p>We present an example to demonstrate how EQUITAS proves QE. The queries are based on these two tables:</p><p>&#8226; EMP Q1 selects the department id and location columns of all employees whose department id is greater than 10. Q2 first selects the same columns grouped by the department id and location columns. Q2 then filters out tuples whose department id plus five is not greater than 15. Q1 and Q2 are equivalent under set semantics because they return the same set of tuples.</p><p>EQUITAS first constructs the SRs for Q1 and Q2: Each SR contains two parts: COND and COLS. COND is an SMT formula that each tuple needs to satisfy (i.e., the condition) in order to be returned by the given query. For example, in Q1's SR, COND is v3 &gt; 10 and !n3, which means that each returned tuple's department id must be greater than ten and must not be NULL. COLS is a vector of pairs of SMT formula that symbolically represents an arbitrary tuple returned by the query. Each pair of SMT formula represents a column, where the first formula represents the value of the column and the second boolean formula indicates if the value is NULL. For example, in Q1's SR, COLS has two pairs of SMT formulae, where the first pair of SMT formulae symbolically represents the department id column, and the second pair of SMT formulae represents the location column. EQUITAS leverages the SMT solver to verify two relational properties between the SRs to prove that Q1 is contained by Q2. It first verifies that COND1 =) COND2. EQUITAS checks this property to show that if there is an unique tuple returned by Q1, then there is always a corresponding unique tuple returned by Q2. It then verifies if (COND1 ^COND2) =) ( COLS1 = COLS2). EQUITAS checks this property to show that the two unique tuples returned by Q1 and Q2 are always equivalent. If these two properties hold, then Q1 is contained by Q2. EQUITAS uses the same technique to prove that Q2 is contained by Q1. Taken together, these containment relationships mean than Q1 and Q2 are equivalent.</p><p>Bag Semantics: Q1 and Q2 are not equivalent under bag semantics. Under bag semantics, a table may contain duplicate tuples. Q1 and Q2 would return different output tables if there were two employees working in the same department and location. As shown in Figure <ref type="figure">1</ref>, three employees working in the same department and location show up in the return table of Q1. But, they only show up once in the return table of Q2. Since database applications tend to assume bag semantics, it is critical to prove QE under bag semantics.</p><p>It is not feasible to generalize the containment approach used in EQUITAS to support bag semantics. EQUITAS derives the SR of one arbitrary tuple in the returned table. It proves the containment relationship by proving there is an identical tuple in the output table of the other query. This is sufficient for proving QE under set semantics because each tuple occurs only once in the output table. However, it is not sufficient for proving QE under bag semantics, since each distinct tuple may appear multiple times in the output table. It cannot track and compare the frequencies of occurrence of these tuples. As shown in Figure <ref type="figure">1</ref>, the first three tuples in Q1's output table are proved to be equal to one tuple in Q2's output. So, EQUITAS incorrectly concludes that Q1 is contained by Q2 under bag semantics.</p><p>It is not feasible to construct an SR with an unbounded number of symbolic tuples, where each bag of tuples represented by the SR is different. As shown in Figure <ref type="figure">1</ref>, the maximum number of times a tuple appears in Q1's output table may be arbitrarily large. It is infeasible to construct an SR with a fixed number of symbolic tuples such that each bag of tuples represented by the SR is guaranteed to be different. Therefore, proving that two queries are equivalent under bag semantics requires a novel approach.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">OVERVIEW</head><p>In this section, we first present an overview of how SPES proves QE under bag semantics in &#167;3.1. We then illustrate this approach using an example in &#167;3.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Problem Formulation</head><p>To prove QE under bag semantics, we present a novel problem formulation. Based on the definition of bag semantics, two queries are equivalent under bag semantics if and only if for all valid input tables, both queries return the same multi-valued set of tuples. So, for all valid input tables, if there always exists a bijective, identity map between the output tables, then the two queries are equivalent. As shown in Figure <ref type="figure">2b</ref>, a bijective, identity map is a one-to-one map that maps a tuple in one output table to an unique, identical tuple in the other output table. Even if the output table has duplicate tuples, each of them would show up once and exactly once in the bijective, identity map. Thus, we transform the problem of proving QE under bag semantics to proving the existence of a bijective, identity map for all valid input tables.</p><p>SPES tackles the latter problem into two steps.</p><p>&#8706; Cardinal Equivalence: In the first step, SPES proves that for all valid inputs, there exists a bijective map between tuples in the output tables of two given queries. SPES first verifies if the given pair of queries are cardinally equivalent under bag semantics to verify the existence of a bijective map. Two queries are cardinally equivalent if and only if for all valid inputs, their output tables contain the  same number of tuples. We defer a formal definition of cardinal equivalence to &#167;5.1. If two queries are cardinally equivalent, then there exists a bijective map between the tuples returned by these two queries for all valid inputs, as shown in Figure <ref type="figure">2a</ref>. In this map, each tuple in the first table is mapped to a unique tuple in the second table, and all tuples in second table are covered by the map. We note that the contents of the output tables of two cardinally equivalent queries may not be identical (i.e., they are not fully equivalent). SPES then constructs a Query Pair Symbolic Representation (QPSR) for two cardinally equivalent queries to symbolically represent the bijective map between the returned tuples for all valid inputs. The QPSR contains a pair of symbolic tuples to represent the bijective map. The first tuple represents an arbitrary tuple returned by the first query. The second tuple represents the corresponding tuple returned by the second query as defined by the bijective map. Given a set of concrete input tables, each pair of tuples in the bijective map that are returned by two queries can be obtained by substituting the two symbolic tuples with a model (i.e., a set of concrete values for the symbolic variables). We defer a discussion of QPSR to &#167;5.2. SPES proves the cardinal equivalence of two queries by recursively constructing the QPSR of their sub-queries and using the SMT solver to verify specific properties of construed sub-QPSRs based on the semantics of different types of queries. We defer a discussion of how SPES proves cardinal equivalence and constructs QPSR to Sections 5.3 to 5.7.</p><p>&#8721; Full Equivalence: In the second step, SPES proves that the symbolic bijective map is always an identity map to prove that the given pair of queries are fully equivalent under bag semantics (i.e., their contents are identical as well). Two queries are fully equivalent if and only if for all valid input tables, their output tables contain the same tuples (ignoring the order of the tuples). We defer a formal definition of full equivalence to &#167;5.1. If there exists a bijective, identity map between the tuples returned by these two queries for all valid inputs, then the two queries are fully equivalent. Figure <ref type="figure">2b</ref> shows an example of one bijective, identity map for one possible set of output tables. In this map, each tuple in the first table is mapped to a unique, identical tuple in the second table. All tuples in the second table are covered by this map.</p><p>SPES uses the constructed QPSR to verify that the given pair of queries are fully equivalent. It proves full equivalence by showing that the bijective map is an identity map. Because the QPSR uses a pair of symbolic tuples to represent the bijective tuple, SPES leverages the SMT solver to prove that the two symbolic tuples are always equivalent for all valid models. We defer a discussion of how to prove the full equivalence using QPSR to &#167;5.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>SPES vs EQUITAS:</head><p>The key differences between SPES and EQUITAS lies in how they prove QE.</p><p>&#8226; SPES converts the QE problem to a problem of proving the existence of a bijective, identity map between tuples in the output tables of the two queries for all valid input tables. EQUITAS reduces the QE problem to a query containment problem by showing that an arbitrary tuple returned by one query is also returned by the other query. &#8226; SPES constructs a QPSR for a pair of queries after verifying that the queries are cardinally equivalent. The QPSR symbolically represents the bijective map between the tuples in their output tables. In contrast, EQUITAS separately constructs an SR for each individual query that represents the tuples in its output table. &#8226; SPES decomposes the problem of proving equivalence of queries into smaller proofs of equivalence of their sub-queries (i.e.sub-QPSRs). It constructs the bijective map between tuples in the final output tables by recursively constructing the bijective maps between tuples in all of the intermediate output tables. EQUITAS directly proves the containment relationship for the entire queries. So, SPES scales well to larger queries.</p><p>SMT Solver: SPES leverages an SMT solver to prove cardinal and full equivalence of queries <ref type="bibr">[30]</ref>. An SMT solver determines if a given FOL formula is satisfiable. For example, the solver decides that the following formula can be satisfied: x + 5 &gt; 10 ^x &gt; 3 when x is six. Similarly, it determines that the following formula cannot be satisfied: x + 5 &gt; 10 ^x &lt; 4 since there is no integral value of x for which this formula holds. A detailed description of solvers is available in <ref type="bibr">[29]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Illustrative Example</head><p>We now use an example to show how SPES proves QE under bag semantics. These queries are based on the EMP and DEPT tables. Q1 is an aggregation query that calculates the sum of salaries of employees in the same location, whose department id plus five is 15. Q2 is an aggregation query that calculates the sum of salaries of employees in the same location and department, whose department id is 10. Q1 and Q2 are semantically equivalent under bag semantics, since the inner query retrieves tuples whose department id is 10 (and 10 + 5 = 15), and the two GROUP BY sets cluster tuples in the same manner (department id is constant).</p><p>Tree Representation: SPES first normalizes each queries into a tree representation. This representation is similar to a logical execution plan, which captures the semantics of the original queries. The only difference is that each node in this tree represents a subquery that belongs to a specific category (e.g., SPJ query). We classify SQL queries into four categories based on their constructors. We defer a discussion of these four categories to &#167;4.</p><p>Figure <ref type="figure">3</ref> shows the tree representation of queries Q1 and Q2. The tree representation of Q1 is an aggregate query that takes a SELECT-PROJECT-JOIN (SPJ) sub-query as input, the department id as the group set, and the sum of salaries as the aggregate operation. The SPJ node takes two table sub-queries (EMP and DEPT) as inputs, and has a filter predicate on DEPT. SPES constructs the tree representation of Q2 in a similar manner.</p><p>Proving QE Under Bag Semantics: To prove Q1 and Q2 are equivalent under bag semantics, SPES first verifies the cardinal equivalence of two queries. In order to verify the cardinal equivalence of two aggregate queries, SPES recursively constructs the QPSR of two SPJ sub-queries that the aggregate queries take as inputs. To verify the cardinal equivalence of two SPJ sub-queries, SPES constructs a bijective map between their input sub-queries and checks if each pair of input sub-queries are cardinally equivalent. If that is the case, then it constructs a QPSR for each pair of input sub-queries. SPES pairs the EMP and the DEPT tables in Q1 with those in Q2, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QPSR-1:</head><p>The QPSR for the pair of EMP tables is given by:</p><p>Here, COLS1 and COLS2 symbolically represent two corresponding tuples present in the two cardinally equivalent tables, respectively. Each symbolic tuple is a vector of pairs of FOL terms. We present the formal definitions of COLS1 and COLS2 in &#167;5.2. This pair of symbolic tuples COLS1 and COLS2 defines a bijective map between the tuples returned by the table. Since both tables refer to EMP, the bijective map is an identity map.</p><p>{(v1, n1), (v2, n2), (v3, n3), (v4, n4)} symbolically represents a tuple returned by the EMP table. Each pair of symbolic variables represents a column. For instance, (v1, n1) denotes EMP_ID in this symbolic tuple. v1 represents the value of EMP_ID, n1 indicates if the value is NULL. The encoding scheme is similar to the one used by EQUITAS <ref type="bibr">[50]</ref>. COND is an FOL formula that represents the filter conditions. It must be satisfied for the tuples to be present in the output table of this query. COND is TRUE because the table query ( &#167;4.1) simply returns all tuples present in the table.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>QPSR-2:</head><p>The QPSR for the pair of DEPT tables is given by:</p><p>Here, {(v5, n5), (v6, n6)} symbolically represents a tuple returned by the DEPT table.</p><p>QPSR-3: SPES uses QPSR-1 and QPSR-2 to construct two symbolic predicates in two SPJ queries. It then leverages the SMT solver to verify that two predicates always return the same boolean results for pairs of corresponding tuples in the bijective map between two join tables. In this way, it proves that the two SPJ queries are cardinally equivalent. It then constructs a QPSR for the SPJ queries:</p><p>COLS1 and COLS2 symbolically represent a bijective map between tuples in the output tables of two SPJ queries. This bijective map preserves the two bijective maps in the two sub-QPSRs between their input table queries. In other words, if a tuple t1 is mapped to another tuple t2 in QPSR-1, and a tuple t3 is mapped to another tuple t4 in QPSR-2, then the join tuple of t1 and t2 maps to that of t3 and t4 in QPSR-3. In this manner, the mapping in the lower-level QPSRs is preserved in the higher-level QPSR. COND is obtained by combining the filter predicates.</p><p>QPSR-4: SPES uses QPSR-3 and the SMT solver to verify that two aggregate queries always return the same number of groups of tuples based on the group by set to prove the two aggregate queries are cardinally equivalent. If and only if they are cardinally equivalent, it constructs a QPSR for the entire aggregate queries (i.e., Q1 and Q2):</p><p>COND: (v3 + 5 = 15 and !n3) and (v3 = 10 and !n3) COLS1: {(v1,n1),(v7,n7)} COLS2: {(v1,n1),(v7,n7)}</p><p>Here, COLS1 and COLS2 symbolically represent a bijective map between tuples returned by Q1 and Q2, respectively. (v7, n7) represents the sum of salaries column.</p><p>Full Equivalence: After determining cardinal equivalence of Q1 and Q2, SPES uses QPSR-4 to verify the full equivalence of these queries by showing that the bijective map is an identity map. It uses an SMT solver to verify the following property of QPSR-4:</p><p>It feeds the negation of the property to the solver. The solver determines that it cannot be satisfied, thereby showing that the paired symbolic tuples are always equivalent when COND holds. Thus, the bijective map between the tuples returned by the queries is an identity map. So, Q1 and Q2 are fully equivalent under bag semantics.</p><p>Summary: SPES first constructs QPSR-1 for EMP table and QPSR-2 for DEPT table. It then uses these QPSRs to determine the cardinal equivalence of SPJ queries. Next, it constructs QPSR-3 for the SPJ queries. SPES then uses QPSR-3 to determine the cardinal equivalence of aggregate queries. Lastly, it uses QPSR-4 to check the full equivalence of Q1 and Q2. Thus, SPES always checks for cardinal equivalence before constructing the QPSRs. It only checks the full equivalence property for the top-level QPSR (i.e., QPSR-4). This is because two queries may be fully equivalent even if their sub-queries are not fully equivalent (e.g., due to aggregation)</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">QUERY REPRESENTATION</head><p>In this section, we first describe the syntax and semantics of four different categories of queries in &#167;4.1. SPES uses a a tree representation for processing queries. We then introduce the Union Normal Form (UNF) of a normalized query representation and a minimal set of normalization rules in &#167;4.2. These normalization rules enhance the ability of SPES in proving QE for a larger set of queries under bag semantics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Syntax and Semantics</head><p>We first present the syntax of the four categories. The reasons for defining these categories are twofold. First, it allows SPES to cover most of the frequently-observed SQL queries. Second, since different types of queries have different semantics, SPES can leverage category-specific QE verification algorithms. A query e is defined as:</p><p>In SPES, a query can be: (1) a table query, (2) an SPJ query, (3) an aggregate query, or (4) a union query. Except for the table query, all other types of queries take sub-queries as its inputs. Thus, SPES represents the whole query as a tree, where each node is a sub-query that belongs to one of the four categories. We consider a table to be a bag (i.e., multi-valued set) of tuples as it best represents real-world databases. SPES supports the DISTINCT keyword for discarding duplicate tuples in a bag. So, it also supports set semantics.</p><p>We next describe the semantics of the four categories based on the relationships between the input and output tables. &#8721; SPJ Query: This type of query contains three fields: (1) a vector of input sub-queries (&#7869;), (2) a predicate that determines whether a tuple is selected (P), and (3) a vector of projection expressions that transform each selected tuple (&#245;). A predicate may contain arithmetic operators, logical operators, and functions that check if a term is NULL. SPES supports higher-order predicates (e.g., EXISTS) which are encoded as uninterpreted functions. A projection expression may contain columns, constant values, NULL, arithmetic operations, user-defined functions, and the CASE keyword. The SPJ query first selects the tuples in the cartesian product of the vector of input sub-queries that satisfy the predicate p, and it then emits the transformed tuples obtained by applying the vector of projection expressions on each tuple.</p><p>&#8719; Aggregate Query: The aggregate query contains three fields:</p><p>(1) an input sub-query (e), (2) a group by set (g), and (3) a vector of aggregate functions ( &#227; g g ). The aggregate query first groups the tuples returned by the input sub-query based on the set of columns in the group by set, such that tuples in each group take the same values for the columns in g. It then applies the vector of aggregate functions to each group of tuples, and returns one aggregate tuple per group. Each aggregate function generates a column in the tuple emitted by this query.</p><p>&#960; Union Query: The union query contains one field: a vector of input sub-queries (&#7869;). It returns all the tuples present in the input tables (without discarding duplicate tuples). The union query captures the semantics of the UNION ALL operator <ref type="bibr">[35]</ref>.</p><p>Complex SQL Constructs: SPES also supports certain complex SQL constructs that do not directly map to these four categories by reducing them to a combination of these categories. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Normalization Rules</head><p>The syntax of an UNF query is defined as follows:</p><p>The UNF query is a union query that takes a vector of normalized SPJ sub-queries as input ( S PJE). Each normalized SPJ query takes a vector of sub-queries as input ( &#7868;). These sub-queries are either a table query or a normalized aggregate queries. Each normalized aggregate queries can only take a UNF query as input.</p><p>Conversion to UNF: An query can be normalized to UNF query by repeatedly applying a set of normalization rules. The number of rule applications is finite and the rules are not applied in a specific order. Due to space constraints, we only present the most significant rule. If a query is SPJ(e0 :: &#7869;1, p1, &#245;1) and e0 = SPJ( &#7869;2, p2, &#245;2), then SPES transforms the query to SPJ( &#7869;2 :: &#7869;1, p1 ^p2, &#245;1 &#245;2). Here, :: denotes concatenation of two vectors and represents elementwise composition of two vectors of projection expressions. There are two additional rules to merge Union and SPJ queries, respectively.</p><p>Normalization Rules: SPES uses a minimal set of pre-defined rules to further simplify the UNF queries. These rules allow SPES to prove the equivalence of a larger set of SQL queries. &#960; Integrity Constraints: SPES supports integrity constraints by encoding them as normalization rules. If a table is joined with itself on its primary key, then SPES normalizes the join operation to a projection. If the input query of an aggregate query is a single table, and the group by set only contains the primary key, and there is no aggregation functions, then the aggregate query may be removed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">EQUIVALENCE VERIFICATION</head><p>In this section, we discuss how SPES verifies the equivalence of two queries. We first present the formal definitions of two types of equivalence under bag semantics in &#167;5.1. We then describe how SPES proves the full equivalence of a pair of cardinally equivalent queries using their QPSR in &#167;5.2. Lastly, we discuss how SPES verifies if a pair of queries are cardinally equivalent, and constructs their QPSR when they are cardinally equivalent in &#167;5.3, &#167;5.4, &#167;5.5, &#167;5.6, and &#167;5.7.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Equivalence Definitions</head><p>To define the full equivalence relationship under bag semantics between two queries, we first define the cardinal equivalence relationship. DEF 1. CARDINAL EQUIVALENCE: Given a pair of queries Q1 and Q2, Q1 and Q2 are cardinally equivalent under bag semantics if and only if (iff), for all valid input tables, the output tables T1 and T2 of Q1 and Q2 contain the same number of tuples.</p><p>If Q1 and Q2 are cardinally equivalent under bag semantics, for all valid inputs, each tuple in T1 can be mapped to a unique tuple in T2, and all tuples in T2 are in the map. Thus, it is a bijective (one-to-one) map between tuples in T1 and T2. However, the two mapped tuples may differ in their values, as shown in Figure <ref type="figure">2a</ref>. DEF 2. FULL EQUIVALENCE: Given a pair of queries Q1 and Q2, Q1 and Q2 are fully equivalent under bag semantics iff, for all valid input tables Ts, the output tables T1 and T2 of Q1 and Q2 are identical.</p><p>Based on the definition, Q1 and Q2 are fully equivalent under bag semantics, iff there exists a bijective map between tuples in T1 and T2, and this bijective map is an identity map. In other words, each tuple in T1 can always be mapped to a unique, identical tuple in T2, and all tuples in T2 are in the map, as shown in Figure <ref type="figure">2b</ref>. Thus, by proving the existence of a bijective, identity map between tuples in T1 and T2 for all valid inputs, we prove Q1 and Q2 are fully equivalent under bag semantics.</p><p>Motivation: SPES first quickly checks for cardinal equivalence before checking for full equivalence. This is because if Q1 and Q2 are fully equivalent, then they must be cardinally equivalent. If Q1 and Q2 are cardinally equivalent, then there always exists a bijective map between tuples in output tables for all valid inputs. To prove Q1 and Q2 are fully equivalent, we only need to prove that the bijective map between tuples in the output tables is always an identity map for all valid inputs. In the rest of the paper, equivalent queries without any qualifier refer to fully-equivalent queries.</p><p>SPES can prove two queries are fully equivalent even if their sub-queries are only cardinally equivalent. For example, while Q1 and Q2 in Example 1 are fully equivalent, their sub-queries are not fully equivalent. This is because the inner query in Q1 and Q2 returns two and three columns, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Query Pair Symbolic Representation</head><p>We now define the QPSR of two cardinally equivalent queries that SPES uses for proving QE. QPSR is used to symbolically represent a bijective map between the tuples that are returned by two cardinally equivalent queries for all valid inputs. QPSR of a pair of cardinally equivalent queries Q1 and Q2 is a tuple of the form:</p><p>COLS1 is a vector of pairs of FOL terms that represent an arbitrary tuple returned by Q1. Each element of this vector represents a column and is of the form: (Val, Is-Null), where Val represents the value of the column and Is-Null denotes the nullability of the column.</p><p>COLS2 is another vector of pairs of FOL terms that represents the corresponding tuple returned by Q2. Since Q1 and Q2 must be cardinally equivalent before SPES constructs their QPSR, a bijective map exists between the returned tuples for all valid inputs, which is symbolically represented by the two tuples COLS1 and COLS2. COND is an FOL formula that represents the constraints that must be satisfied for the symbolic tuples COLS1 and COLS2 to be returned by Q1 and Q2, respectively. They encode the semantics of the predicates in the queries. ASSIGN is another FOL formula that specifies the relational constraints between symbolic variables used in COLS1, COLS2 and COND. This formula is used for supporting complex SQL operators, such as CASE expression.  Verifying Full Equivalence: To prove that two cardinally equivalent queries Q1 and Q2 are fully equivalent, SPES needs to prove that the bijective map between returned tuples is always an identity map. In other words, SPES needs to prove that, for an arbitrary tuple t returned by Q1, the bijective map associates t to an identical tuple returned by Q2 with the same values. SPES verifies this property using the QPSR of Q1 and Q2. When both symbolic tuples satisfy the predicate (i.e., COND), it must verify that COLS1 is equivalent to COLS2. This property is formalized as:</p><p>SPES verifies this property using an SMT solver <ref type="bibr">[30]</ref>. If the property does not hold, then the negation of this property is satisfiable. SPES feeds the negation of this property into the SMT solver. If the solver determines that this formula is unsatisfiable, then it determines that COLS1 and COLS2 are always identical. In this manner, SPES leverages the QPSR to prove full equivalence. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Construction of QPSR</head><p>Alg. 1 presents a recursive procedure VeriCard for verifying the cardinal equivalence of two queries. The VeriCard procedure takes a pair of queries as inputs (i.e., Q1 and Q2). SPES first checks the types of the given queries. If they are of the same type, then it invokes the appropriate sub-procedure for that particular type. We describe these four sub-procedures in Sections 5.4 to 5.7. If Q1 and Q2 are cardinally equivalent, then VeriCard returns their QPSR. If these queries are of different types, it returns NULL to indicate that it cannot determine their cardinal equivalence. This is because each type of queries has different semantics ( &#167;4.1). Some sub-procedures recursively invoke VeriCard to verify the cardinal equivalence between their sub-queries. SPES applies the normalization rules defined in &#167;4 to transform the given two queries so that they are of the same type (and the sub-queries are also of the same types recursively). This normalization process is incomplete (i.e., SPES may conclude that two queries are not cardinally  equivalent since they cannot be normalized to the same type, even if they are actually cardinally equivalent). We discuss this limitation in &#167;7.4. Each sub-procedure takes a pair of queries of the same type as inputs. It first attempts to determine if they are cardinally equivalent. If they are cardinally equivalent, then it constructs the QPSR of Q1 and Q2. Otherwise, it returns NULL to indicate that it cannot determine their cardinal equivalence. In each of the following sub-sections, we first describe the conditions that are sufficient for proving cardinal equivalence based on the semantics of the query. We then describe how each sub-procedure verifies these conditions to prove cardinal equivalence. We then discuss how SPES constructs the QPSR if they are cardinally equivalent. Lastly, we describe their soundness and completeness properties 3 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Table Query</head><p>Alg. 2 illustrates the VeriTable procedure for table query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>LEMMA 2. A pair of table queries TABLE(n1) and TABLE(n2)</head><p>are cardinally equivalent iff their input tables are the same. (i.e., n1 = n2).</p><p>Cardinal Equivalence: Since the table query returns all tuples from the input table, thus if two table queries have the same input table, then they will always have the same number of tuples. So VeriTable compares the names of the two input tables. QPSR: We define the QPSR of the two cardinally equivalent table queries using an identity map between the returned tuples (e.g., QPSR-1 in Section 3.2). VeriTable first constructs the symbolic tuple COLS1 using a vector of pairs of variables based on n1's table schema T-Schema(n1), and then sets the symbolic tuple COLS2 to be the same as COLS1. These two equivalent tuples COLS1 and COLS2 define a bijective map between the returned tuples. VeriTable sets the COND and ASSIGN fields as TRUE since there are no additional constraints that the tuples in the table must satisfy.</p><p>Properties: VeriTable is sound and complete. These two properties directly follow from Lemma 2. We present a formal proof in the appendix &#167;C.2.1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">SPJ Queries</head><p>Alg. 3 illustrates the VeriSPJ procedure for SPJ queries. If this procedure determines that the two input SPJ queries SPJ( &#7869;1, p1, &#245;1) and SPJ( &#7869;2, p2, &#245;2) are cardinally equivalent, then it returns their QPSR. Otherwise, it returns NULL. VeriSPJ leverages two procedures from <ref type="bibr">[50]</ref>: ConstExpr and ConstPred .</p><p>ConstExpr takes a vector of projection expressions and a symbolic tuple as inputs, and returns a new symbolic tuple with additional constraints ASSIGN that models the relation between variables. 3 A sub-procedure P is sound if whenever it returns a QPSR, the given two queries are cardinally equivalent and the two symbolic tuples symbolically represents a bijective map for all valid inputs. A sub-procedure P is complete if whenever it returns NULL, the given two queries are not cardinally equivalent. This new symbolic tuple represents the modified tuple based on the vector of projection expressions. ConstPred takes a predicate and a symbolic tuple as the input and returns a boolean formula COND with additional constraints ASSIGN. COND symbolically represents the result of evaluating the predicate on the symbolic tuples. ConstPred supports higher-order predicates, such as EXISTS, by encoding them as an uninterpreted function.</p><p>Cardinal Equivalence: As covered in &#167;4. are cardinally equivalent if there is a bijective map m between tuples in intermediate join tables, such that the predicates p1 and p2 always return the same result for the corresponding tuples in m.</p><p>To prove that there is a bijective map between the tuples in the two intermediate join tables, VeriSPJ first uses the VeriVec procedure to find a bijective map between input sub-queries such that each pair of sub-queries are cardinally equivalent. VeriVec exhaustively examines all possible maps and recursively uses VeriCard to verify the cardinal equivalence between two sub-queries. VeriVec returns all possible candidate maps wherein each pair of sub-queries are cardinally equivalent ({ QP SR}). Each candidate map is represented by a vector of QPSR ( QP SR), wherein each QPSR defines a bijective map between tuples returned by a pair of cardinally equivalent sub-queries.</p><p>VeriSPJ then uses the Compose procedure to construct two symbolic tuples COLS1 and COLS2 (line 4) that represent a bijective map between the tuples in the two intermediate join tables. These two symbolic tuples are constructed by concatenating symbolic tuples from the QPSRs of sub-queries based on the order of subqueries in the input vectors. Compose also constructs COND and ASSIGN by taking the conjunction of COND and ASSIGN from the QPSRs of sub-queries, respectively.</p><p>VeriSPJ then tries to prove that the two predicates always return the same result for the two symbolic tuples. VeriSPJ first leverages the ConstPred procedure to encode predicates p1 and p2 on COLS1 and COLS2, respectively (line 6). VeriSPJ uses an SMT solver to prove this property under sub-conditions COND and all relational constraints: ASSIGN, ASSIGN1, ASSIGN2 (line 7). If the property holds, then negation of this property is unsatisfiable: VeriSPJ feeds this formula to an SMT solver. If the solver determines that this formula is unsatisfiable, then we prove COND1 and COND2 are always equivalent when the relational constraints ASSIGN0, ASSIGN1, and ASSIGN2 and sub-conditions COND hold.</p><p>Consider the cardinally equivalent SPJ queries shown in Figure <ref type="figure">4</ref>. In this case, VeriSPJ first verifies that sub-query E11 is cardinally equivalent to sub-query E22, and sub-query E12 is cardinally equivalent to sub-query E21. Thus, the two intermediate join tables (i.e., cartesian product of sub-tables) are cardinally equivalent. VeriSPJ constructs two symbolic tuples to represent the bijective map between these intermediate join tables by leveraging the two bijective maps between the underlying tables. VeriSPJ then verifies that two corresponding tuples in the map either both satisfy the predicate or do not satisfy the predicate. Thus, the bijective map between the tuples in the intermediate join tables is the bijective map between the tuples in the output tables before projection.</p><p>QPSR: Since VeriSPJ verifies that the given pair of SPJ queries are cardinally equivalent, the two symbolic tuples COLS1 and COLS2 define a bijective map between tuples in the output tables before projection. Projection does not change the bijective map between tuples as it is applied separately on each tuple. Thus, VeriSPJ leverages ConstExpr to construct new symbolic tuples COLS 0 1 and COLS 0 2 based on the vector of projection expressions and the given symbolic tuples. The QPSR consists of the derived symbolic tuples COLS 0 1 , COLS 0 2 , the conjunction of COND1, COND2 and COND, and the conjunction of all the relational constraints.</p><p>Properties: VeriSPJ is sound. Based on Lemma 3, if VeriSPJ returns the QPSR, then the given SPJ queries are cardinally equivalent. We present a formal proof in &#167;C.2.2.</p><p>In general, VeriSPJ is not complete. The reasons are threefold. First, the SMT solver is only complete for linear operators. If the predicates have non-linear operators (e.g., multiplication between columns), then the solver may return UNKNOWN when it should return UNSAT <ref type="bibr">[50]</ref>. Second, SPES encodes all user-defined functions, string operations, and higher-order predicates as uninterpreted functions. These encodings do not preserve the semantics of these operations. Third, VeriCard is not complete ( &#167;5.3).</p><p>VeriSPJ procedure is complete if all input queries for the given two SPJ queries are table queries, and the SMT solver can determine the satisfiability of the predicates. This is because the problem of deciding equivalence of two conjunctive (i.e., SPJ) queries is decidable <ref type="bibr">[26]</ref>. LEMMA 4. For a given pair of cardinally equivalent SPJ queries SPJ( &#7869;1, p1, &#245;1) and SPJ( &#7869;2, p2, &#245;2), if &#7869;1 and &#7869;1 only have table sub-queries, and the SMT solver can determine the satisfiability of the predicates and projection expressions, then VeriSPJ procedure returns the QPSR.</p><p>PROOF. We prove this theorem using the method of contraposition. If VeriSPJ returns NULL, then there are two cases: Case 1: There is no bijective map between ! e1 and ! e2, such that each pair of table sub-queries are cardinally equivalent. There are two possible sub-cases.</p><p>In the first sub-case, ! e1 has more input table queries than ! e2. Here, we can always construct the input such that the intermediate join table of &#7869;1 has more tuples than the intermediate join table of &#7869;2. SPES has eliminated the case where the predicates are F alse ( &#167;4.2). Thus, the constructed tuples in intermediate join table can all satisfy the predicate. So, the number of tuples returned by the two SPJ queries is different, and are hence not cardinally equivalent.</p><p>In the second sub-case, ! e1 have different table queries than ! e2. Here, each input table in ! e2 contains only one tuple. And a different table in ! e1 has two tuples that satisfy the predicate. So, the number of tuples returned by the two SPJ queries is different, and are hence not cardinally equivalent.</p><p>Case 2: VeriSPJ cannot verify that the two predicates always return the same result for two corresponding tuples in a bijective map between tuples in the intermediate table. In this case, since both predicates are decidable, the solver will generate a model m such that one symbolic tuple satisfies the predicate and the other one does not. We then construct inputs such that each intermediate table only contains one tuple that matches the values in m. Then the first SPJ query returns a table that contains one tuple, while the other one returns an empty table. Thus, the two SPJ queries are not cardinally equivalent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.6">Aggregate Queries</head><p>Alg. 4 illustrates the VeriAgg procedure for aggregate queries. If this procedure determines that the two input aggregate queries AGG(e1, g1, &#227;gg1) and AGG(e2, g2, &#227;gg2) are cardinally equivalent, then it returns their QPSR. Otherwise, it returns NULL.</p><p>Cardinal Equivalence: An aggregate query groups the tuples in the input table based on the GROUP BY column set, and then returns a tuple by applying the aggregate function on each group. LEMMA 5. Two aggregate queries AGG(e1, g1, &#227;gg1) and AGG(e2, g2, &#227;gg2) are cardinally equivalent if two conditions are satisfied: (1) the two input sub-queries e1 and e2 are cardinally equivalent; (2) for any two pairs of corresponding tuples in a bijective map of the QPSR of e1 and e2, two tuples in e1 belong to the same group as defined by g1 iff their associated tuples in e2 belong to the same group as defined by g2.</p><p>VeriAgg first recursively invokes the VeriCard procedure to determine the cardinal equivalence of the two input sub-queries e1 and e2 (line 2). If VeriCard returns the QPSR of e1 and e2, then VeriAgg has proved the first condition in Lemma 5.</p><p>To prove the second condition, VeriAgg collects the symbolic tuples COLS1 and COLS2 from the QPSR. Since these two symbolic tuples represent a bijective map between tuples returned by e1 and e2, VeriAgg replaces all variables in COLS1 and COLS2 by a set of fresh variables to generate a second pair of symbolic tuples COLS 0 1 and COLS 0 2 that represents the same bijective map with different tuples.</p><p>We decompose the proof for the second condition into two stages (line 5). In the first stage, we want to prove that if COLS1 and COLS 0 1 belong to the same group, then COLS2 and COLS 0 2 also belong to the same group. To prove this, VeriAgg extracts the GROUP BY column sets g1, g0</p><p>1 , g2 and g0 2 from COLS1, COLS 0 1 , COLS2 and COLS 0 2 , respectively. It then attempts to prove the property:</p><p>VeriAgg sends the negation of this property to the solver. If the solver decides that this formula is unsatisfiable, then it is impossible to find two tuples returned by e1 that are assigned to the same group by g1, such that their corresponding tuples returned by e2 are assigned to different groups by g2. In the second stage, we use the same technique in the reverse direction of the implication. Consider the cardinally equivalent aggregate queries shown in Figure <ref type="figure">5</ref>. VeriAgg first verifies that the two input queries E1 and E2 are cardinally equivalent, and then constructs the QPSR to represent the bijective map between their returned tuples. VeriAgg then verifies that if two arbitrary tuples in E1 belong to same group (e.g., first two tuples), then the two corresponding tuples in E2 also belong to the same group. It also verifies that if two arbitrary tuples in E1 belong to different groups (e.g., first and third tuples), then the two corresponding tuples in E2 also belong to different groups. VeriAgg verifies two aggregate queries are cardinally equivalent by verifying that they emit the same number of groups.</p><p>QPSR: VeriAgg constructs the QPSR of two given aggregate queries after proving they are cardinally equivalent. COLS1 and COLS2 define a bijective map between tuples returned by input queries, and can also be used to define a bijective map between groups in two aggregate queries. If two aggregate functions in &#227; g g 1 and &#227; g g 2 are the same and operate on same values (i.e., input columns of the symbolic tuples are the same), then the aggregate values in the output tuples are the same, since each group contains the same number of tuples.</p><p>VeriAgg invokes the InitAgg procedure on &#227; g g 1 to construct a vector of pairs of new symbolic variables as the symbolic tuples for aggregate functions. In each pair of symbolic variables, the first variable represents the aggregate value. The second variable indicates if the aggregate value is NULL. VeriAgg concatenates the GROUP BY column set g1 with the symbolic tuple COLS1. VeriAgg then invokes the CtrAgg procedure to construct the symbolic columns for &#227; g g 2, and then concatenates with the GROUP BY column set g2. CtrAgg uses the same pairs of symbolic variables for all aggregation operations in &#227; g g 2, where the aggregation function type and operand columns are the same in &#227; g g 1. VeriAgg propagates COND and ASSIGN into the sub-QPSRs. nally equivalent. This is because the two symbolic tuples COLS1 and COLS2 are constructed from corresponding groups. Thus, COLS1 and COLS2 define a bijective map between tuples returned by the two aggregate queries. We present a formal proof in Appendix C.2.3.</p><p>VeriAgg is not complete. The sources of incompleteness are threefold: (1) incompleteness of VeriCard, (2) limitations of the SMT solver, and (3) when VeriCard returns the QPSR of two input sub-queries, the symbolic tuples in the QPSR define only one possible bijective map between tuples in the input tables. If VeriAgg fails to prove the second condition in Lemma 5, it is still possible that there exists another bijective map that satisfies the second condition.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.7">Union Queries</head><p>Alg. 5 illustrates the VeriUnion procedure for union queries. If it determines that the union expressions are cardinally equivalent, then it returns their QPSR. Otherwise, it returns NULL. LEMMA 6. Two union queries UNION( &#7869;1) and UNION( &#7869;2) are cardinally equivalent if there exists a bijective map between the two input sub-queries &#7869;1 and &#7869;2, such that each pair of queries are cardinally equivalent.</p><p>Cardinal Equivalence: The lemma directly follows from the semantics of the union queries. VeriUnion procedure invokes VeriVec ( &#167;5.5) to find a bijective map between &#7869;1 and &#7869;2 (line 2), such that each pair of queries are cardinally equivalent. QPSR: VeriVec finds all candidate bijective maps ({ QP SR}) between two input sub-queries &#7869;1 and &#7869;2, such that each pair of subqueries are cardinally equivalent. In each candidate bijective map ( QP SR), a vector of QPSRs is constructed such that each QPSR defines a bijective map between tuples returned by a pair of subqueries. VeriUnion gets an arbitrary QP SR (i.e., one candidate bijective map between the sub-queries). It seeks to construct a bijective map between tuples returned by two union queries that preserves all of the bijective maps between tuples returned by subqueries in that QP SR. It first constructs two fresh symbolic tuples COLS1 and COLS2. It then invokes the ConstAssign procedure to set ASSIGN such that both COLS1 and COLS2 are always equivalent to the symbolic tuples in one sub-QPSR returned by VeriVec, and COND such that COND in one sub-QPSR holds when symbolic tuples equal to the tuples in that sub-QPSR. ConstAssign creates a vector of boolean variables to set these constraints. VeriUnion returns these two symbolic tuples, COND, and ASSIGN as the QPSR of the given union queries.</p><p>Properties: VeriUnion is sound. Based on Lemma 6, if VeriUnion returns the QPSR, then the two union queries are cardinally equivalent. The symbolic tuples COLS1 and COLS2 define a bijective map between tuples returned by two union queries that preserves all of the bijective maps between tuples in their cardinally equivalent sub-queries. The formal proof is given in &#167;C.2.4.</p><p>VeriUnion is incomplete. The sources of incompleteness are threefold: (1) incompleteness of VeriCard, (2) limitations of the SMT solver, and (3) two union queries may be cardinally equivalent even if there is no bijective map between their sub-queries such that each pair of sub-queries is cardinally equivalent.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">SOUNDNESS AND COMPLETENESS</head><p>We now discuss the soundness and completeness of SPES for verifying the equivalence of two queries.</p><p>Soundness: SPES is sound. LEMMA 7. Given a pair of queries Q1 and Q2, if VeriCard returns a QPSR of Q1 and Q2, then Q1 and Q2 are cardinally equivalent and the pair of symbolic tuples ( COLS1, COLS2) symbolically represents a bijective map between tuples returned by Q1 and Q2.</p><p>PROOF. Proof is by induction on Q1 (the choice of Q1 versus Q2 is arbitrary). Each case on the form Q1 is proved by the soundness of four sub-procedure and the inductive hypothesis in cases when Q1 is composite of sub-queries. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Completeness:</head><p>In general, SPES is not complete. Sources of incompleteness are discussed in Sections 5.3 to 5.7, and discuss their practical impact in &#167;7.4. However, SPES is complete for a pair of SPJ queries Q1 and Q2 that do not have predicates or projection expressions whose satisfiability cannot be determined by the SMT solver and the input sub-queries are only table queries. PROOF. Based on Lemma 4, if Q1 and Q2 are fully equivalent, then VeriCard returns an QPSR. Based on Lemma 7, COLS1 and COLS2 symbolically represent the bijective map between tuples returned by Q1 and Q2 for all valid inputs. Since Q1 and Q2 do not have predicates and projection expressions whose satisfiability cannot be determined by the SMT solver, If the SMT solver determines COND^ASSIGN^&#172;( COLS1 = COLS2) is satisfiable, then it generates a model m. The model m defines input tables from which Q1 and Q2 each return non-identical tuples equal to the interpretation of their respective symbolic tuples under m. Thus, Q1 and Q2 are not fully equivalent. By contradiction, SPES is complete.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">EVALUATION</head><p>In this section, we describe our implementation and evaluation of SPES. We begin with a description of our implementation in &#167;7.1. We next report the results of a comparative analysis of SPES against EQUITAS <ref type="bibr">[50]</ref> and UDP <ref type="bibr">[23]</ref>, state-of-the-art automated QE verifiers based on SR and AR, respectively. We then quantify the efficacy of SPES in identifying overlapping queries across production SQL queries in &#167;7.3. We conclude with the limitations of the current implementation of SPES in &#167;7. <ref type="bibr">4</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.1">Implementation</head><p>The architecture of SPES is illustrated in Figure <ref type="figure">6</ref>. SPES takes a pair of SQL queries as inputs and returns a boolean decision that indicates whether they are fully equivalent. The QE verification pipeline consists of three components: &#8706; The compiler converts the given queries to logical query execution plans. We use the open-sourced CALCITE framework <ref type="bibr">[3]</ref>. &#8721; SPES operates on these logical plans in two stages. First, it converts them to the categories of queries that it supports and normalizes them. Next, it uses the third component (i.e., an SMT solver) to verify the cardinal equivalence of two queries and then constructs their QPSR. It also uses the solver for verifying the properties of QPSR to determine full equivalence. We implement this component in Java (2,065 lines of code). &#8719; The third component is an SMT solver Z3 that SPES leverages for determining the satisfiability of FOL formulae <ref type="bibr">[12]</ref>. We have published the source code of SPES on GitHub [?].</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.2">Comparative Analysis</head><p>Comparison Tools: We compare SPES against two QE verifiers: EQUITAS <ref type="bibr">[50]</ref> and UDP <ref type="bibr">[23]</ref>. EQUITAS takes a symbolic approach and only supports set semantics. UDP takes an algebraic approach and does support bag semantics.</p><p>Benchmark: We use queries in the test suite of Apache CAL-CITE <ref type="bibr">[3]</ref> as our benchmark. This test suite contains 232 semantically equivalent query pairs. The reasons for using this benchmark are twofold. First, the CALCITE optimizer is widely used in data processing engines <ref type="bibr">[4,</ref><ref type="bibr">5,</ref><ref type="bibr">6,</ref><ref type="bibr">7,</ref><ref type="bibr">8]</ref>. So, it covers a wide range of SQL features <ref type="foot">4</ref> . Second, since EQUITAS and UDP are both evaluated on this query pair benchmark <ref type="bibr">[23,</ref><ref type="bibr">50]</ref>, we can quantitatively and qualitatively compare the efficacy of these tools. We send every query pair with the schemata of their input tables to SPES and ask it to check their QE. We conduct this experiment on a commodity server (Intel Core i7-860 processor and 16 GB RAM).</p><p>Automated SQL QE Verifiers: The results of this experiment are shown in Table <ref type="table">1</ref>. We compare SPES against EQUITAS in the same environment. We present the results reported in the UDP paper <ref type="bibr">[23]</ref>  <ref type="foot">5</ref> .</p><p>We evaluate SPES under two configurations to delineate the impact of the normalization rules. &#8706; SPES (w/o normalization) naively converts the original queries to a tree representation without normalizing them, and then applies the verification algorithms. &#8721; SPES additionally leverages a minimal set of normalization rules.</p><p>SPES proves the equivalence of a larger set of query pairs (95/232) compared to UDP (34/232) and EQUITAS (67/232). SPES currently supports 120 out of 232 pairs. The un-supported queries either: (1) contain SQL features that are not yet supported (e.g., CAST), or (2) cannot be compiled by CALCITE due to syntax errors. Among the 120 pairs supported by SPES, it proves that 95 pairs (80%) are equivalent under bag semantics. In contrast, UDP proves the equivalence of 34 pairs under bag semantics. EQUITAS proves the equivalence of 67 pairs, but only under set semantics. We group the proved query pairs into three categories:</p><p>&#8226; USPJ: Queries that are union of SELECT-PROJECT-JOIN.</p><p>&#8226; Aggregate: Queries containing at least one aggregate.</p><p>&#8226; Outer-Join: Queries containing at least one outer JOIN.</p><p>Table <ref type="table">1</ref> reports the number of pairs proved by UDP and EQUI-TAS in each category. The number of proved pairs containing outer JOIN is not known in case of UDP. SPES outperforms the other tools on queries containing aggregate and outer JOIN operators. SPES proves QE of a larger set of query pairs (95/232) compared to SPES (w/o normalization) (56/232). This illustrates the importance of the normalization stage of SPES for real-world applications.</p><p>Efficiency: We next compare the average time taken by SPES and EQUITAS to prove the equivalence of a pair of queries in each category. This is an important metric for a cloud-scale tool that must be deployed in a DBaaS platform. We only compute this metric for the pairs that these tools can prove. SPES and EQUITAS take 0.05 s and 0.15 s on average to prove QE, respectively. So, SPES is 3&#8677; faster than EQUITAS on this benchmark. SPES consistently outperforms EQUITAS across all categories of queries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.3">Efficacy on Production Queries</head><p>In this experiment, we quantify the efficacy of SPES in detecting overlap in production SQL queries. We leverage three sets of real production queries from Ant Financial <ref type="bibr">[2]</ref>, a financial technology company. These queries are used to detect fraud in business transactions. In each set, we run SPES on each pair of queries that operate on the same set of input tables. If SPES decides that a given pair of queries are not equivalent, then we check any constituent sub-queries that operate on the same input tables. We skip checking queries containing only table scans and those that only differ in the parameters passed on to their predicates. This is because SPES trivially proves their equivalence and the computational resources needed for evaluating such queries are negligible.   Table <ref type="table">2</ref> presents the results of this experiment. SPES effectively identifies overlap between complex analytical queries. Among 9486 queries, SPES finds overlapping computation between 2591 (27%) queries, while EQUITAS only finds overlapping computation between 1126 (12%) queries. These tools report overlapping computation if the query or a constituent sub-query is equivalent to another query or sub-query within the set.</p><note type="other">Query Set Number of Queries Queries with</note><p>We also report the highest frequency of queries present in these pairs that are repeatedly executed in the workload. In practice, most of the computational resources are expended on executing queries containing aggregate functions or different types of join. Among 12090 equivalent pairs, 5831 (48%) contain join and aggregate operations. This illustrates that SPES works well on queries containing these operators.</p><p>Query Complexity: Figure <ref type="figure">7</ref> illustrates the complexity of queries in this workload. We compute the distribution of the number of sub-queries in a given query (complex queries will have a larger set of sub-queries). We found that the average number of algebraic expressions in the Ant Financial workload <ref type="bibr">(45.38)</ref> is 8&#8677; larger than that in the CALCITE benchmark (5.37).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.4">Limitations</head><p>In general, the problem of deciding QE is undecidable <ref type="bibr">[15]</ref>. Among the 120 query pairs supported by SPES, it cannot prove the QE of 25 pairs. We classify them into three categories: (1) lack of normalization rules <ref type="bibr">(21)</ref>, and (2) support for additional integrity constraints (4) Normalization Rules: SPES verifies the cardinal equivalence of two queries only if it is able to normalize them into the same type of queries using a set of pre-defined semantically-equivalent rewrite rules ( &#167;5.3). We will need to introduce additional normalization rules for queries with: (1) union and aggregate (15/21), and (2) join and aggregate (6/21), Adding these rewrite rules in the normalization stage will enable SPES to prove the QE of these 21 pairs. However, it will also increase the average QE verification time. Furthermore, these rules are not required for supporting production queries discussed in &#167;7.3.</p><p>Integrity Constraints: SPES currently only has partial support for integrity constraints in the form of normalization rules ( &#167;4.2): <ref type="bibr">(1)</ref> self join on primary key, and (2) grouping based on primary key. We plan to add additional rules to handle join on primary and foreign keys in the future. For example, we may normalize an OUTER JOIN operation based on a foreign key to an INNER JOIN operation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8.">RELATED WORK</head><p>Query Equivalence: The state-of-the-art QE verification tools are based on either symbolic reasoning <ref type="bibr">[50]</ref> or algebraic reasoning <ref type="bibr">[24,</ref><ref type="bibr">22,</ref><ref type="bibr">25]</ref> . Tools based on algebraic representation include COSETTE and UDP that rely on K-relation and U-semi-ring theories, respectively. Tools based on symbolic reasoning include EQUITAS. We highlighted the differences between SPES and these tools in &#167;2.</p><p>Prior efforts have examined the theoretical aspects of equivalence and containment relationships between queries. Since it is an undecidable problem <ref type="bibr">[14,</ref><ref type="bibr">18]</ref>, these efforts focused on determining categories of queries for which it is a decidable problem: (1) conjunctive queries <ref type="bibr">[21]</ref>, (2) conjunctive queries with additional constraints <ref type="bibr">[19,</ref><ref type="bibr">36,</ref><ref type="bibr">28]</ref>, and (3) conjunctive queries under bag semantics <ref type="bibr">[37]</ref>. The problem of deciding the containment relationship between conjunctive queries can be reduced to a constraint satisfiability problem <ref type="bibr">[40]</ref>. Other proposals include decision procedures for proving the equivalence of a subset of queries under set <ref type="bibr">[20,</ref><ref type="bibr">45,</ref><ref type="bibr">44]</ref> and bag semantics <ref type="bibr">[27,</ref><ref type="bibr">31,</ref><ref type="bibr">32,</ref><ref type="bibr">42]</ref>. There is a large body of work on optimizing queries using materialized views <ref type="bibr">[33, 43, 17, ?, ?]</ref>. These systems internally use a QE verification algorithm that relies on syntactical equivalence and that work on a subset of SQL queries. These systems may leverage SPES to better detect QE and deliver more optimized execution plans.</p><p>Symbolic Reasoning in DBMSs: Researchers have leveraged symbolic reasoning in DBMSs by reducing the given problem to an FOL satisfiability problem and then using an SMT solver to solve it. These efforts include: (1) automatically generating test cases for database applications <ref type="bibr">[46,</ref><ref type="bibr">47,</ref><ref type="bibr">13]</ref>, (2) verifying the correctness of database applications <ref type="bibr">[48,</ref><ref type="bibr">38,</ref><ref type="bibr">34]</ref>, (3) disproving the equivalence of SQL queries <ref type="bibr">[24]</ref>, and (4) finding the best application-aware memory layout <ref type="bibr">[49]</ref>. SPES differs from these efforts in that it uses symbolic reasoning to prove QE under bag semantics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="9.">CONCLUSION</head><p>In this paper, we presented the design and implementation of SPES, an automated tool for proving query equivalence under bag semantics using symbolic reasoning. We reduced the problem of proving equivalence under bag semantics to that of proving the existence of a bijective, identity map between the output tables of the queries for all valid input tables. SPES classifies queries into four categories, and leverages a set of category-specific verification algorithms to effectively determine their equivalence. Our evaluation shows that SPES proves the equivalence of a larger set of query pairs under bag semantics compared to state-of-the-art tools based on symbolic and algebraic approaches. Furthermore, it is 3&#8677; faster than EQUITAS that only proves equivalence under set semantics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>APPENDIX A. SEMANTICS OF AR</head><p>We now formally define the semantics of queries, using the following formal notation. + is the evaluation symbol. The left side of this symbol is an algebraic expression that is evaluated on valid input tables Ts. The right side of this symbol is the evaluation result, which is the output table. All output tables are bags (i.e., can contain duplicate tuples). A horizontal line separates the pre-and the postconditions. The pre-conditions on the top of the line include a set of evaluation relations. The post-condition on the bottom side of the line is an evaluation relation. If all the relations in the pre-conditions hold, then the relation in the post-condition holds.  &#8226; Given a set of valid input tables Ts, the table query returns all the tuples in table n. &#8226; Given a set of valid input tables Ts, the SPJ query first evaluates the vector of input sub-queries on Ts to obtain a vector of input tables. For each tuple t in the cartesian product of the vector of input tables, if t satisfies the given predicate p, it then applies the vector of expressions ! o on the selected tuple t and emits the transformed tuple. &#8226; Given a set of valid input tables Ts, this aggregate query first evaluates the input sub-query on Ts to get an input table <ref type="table">T0</ref>.</p><p>Then, it uses part to partition the input table T0 into a set of bags of tuples as defined by a set of group set g (tuples in each bag take the same values for the grouping attributes). Lastly, for each bag of tuples, it applies the vector of aggregate functions and returns one tuple. &#8226; Given a set of valid input tables Ts, this union query first evaluates the vector of input sub-queries on Ts to get a vector of input tables. It then returns all the tuples present in the input tables, which does not eliminate duplicate tuples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. PREDICATE &amp; PROJ. EXPRESSION</head><p>SPES supports the predicate and project expressions shown in Figure <ref type="figure">B</ref>. It uses the same encoding scheme as the one employed in EQUITAS (described in Section 3.4 of <ref type="bibr">[50]</ref>).</p><p>A projection expression E can either be a column that refer to a specific column, a constant value, NULL, a binary expression, an uninterpreted function, or an CASE expression (Eqn. 1). A predicate P can either be a binary comparison between two projection expression, a binary predicate that is composed by two predicates,  a not predicate and a predicate decide if a projection expression is NULL (Eqn. 5).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. SOUNDNESS OF VERIFICATION</head><p>In this section, we give the formal proof of the soundness of SPES. The overall proof is structured as follows: we introduce symbolic representations of bijections over tuples in ( &#167;C.1), prove correctness of the procedure for generating symbolic representations ( &#167;C.2), and then prove correctness of the procedure for determining equivalence ( &#167;C.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C.1 Symbolic bijections between queries</head><p>All definitions in this section are given with respect to arbitrary queries Q1 and Q2, whose columns are denoted COLS1 and COLS2, respectively.</p><p>A cardinality-preserving binary relation between Q1 and Q2 is a relation</p><p>such that for each input table set I and all tuples (t, u) 2 R, it holds that |t| Q1(I) = |u| Q2(I)</p><p>where |u|T denotes the number of occurrences of tuple u in table T .</p><p>Cardinality-preserving binary relations can act as witnesses of full equivalence (see Def 2) between queries. LEMMA 9. If the identity function is a cardinality-preserving binary relation between Q1 and Q2, then Q1 and Q2 are fully equivalent (denoted Q1 &#8984; Q2).</p><p>PROOF. Let I be an arbitrary table set and let t be an arbitrary tuple over columns COLS. Then where m( COLS1) is the tuple of interpretations of each column name in COLS1 (and similarly for m( COLS2)). Symbolic representations of cardinality-preserving bijections can be viewed as QPSRs (defined in Section 5.2), collapsed into single</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>If every tuple returned by a query Y is returned by query X on all possible input databases, then X contains Y . If X contains Y and vice versa, then X and Y are equivalent.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>Two queries are cardinally equivalent if and only if they return the same number of tuples for all valid inputs.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>The test cases were obtained from the open-sourced COSETTE repository<ref type="bibr">[10]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>We were unable to conduct a comparative performance analysis under the same environment since UDP is currently not open-sourced.</p></note>
		</body>
		</text>
</TEI>
