skip to main content


Title: UDF to SQL translation through compositional lazy inductive synthesis
Many data processing systems allow SQL queries that call user-defined functions (UDFs) written in conventional programming languages. While such SQL extensions provide convenience and flexibility to users, queries involving UDFs are not as efficient as their pure SQL counterparts that invoke SQL’s highly-optimized built-in functions. Motivated by this problem, we propose a new technique for translating SQL queries with UDFs to pure SQL expressions. Unlike prior work in this space, our method is not based on syntactic rewrite rules and can handle a much more general class of UDFs. At a high-level, our method is based on counterexample-guided inductive synthesis (CEGIS) but employs a novel compositional strategy that decomposes the synthesis task into simpler sub-problems. However, because there is no universal decomposition strategy that works for all UDFs, we propose a novel lazy inductive synthesis approach that generates a sequence of decompositions that correspond to increasingly harder inductive synthesis problems. Because most realistic UDF-to-SQL translation tasks are amenable to a fine-grained decomposition strategy, our lazy inductive synthesis method scales significantly better than traditional CEGIS. We have implemented our proposed technique in a tool called CLIS for optimizing Spark SQL programs containing Scala UDFs. To evaluate CLIS, we manually study 100 randomly selected UDFs and find that 63 of them can be expressed in pure SQL. Our evaluation on these 63 UDFs shows that CLIS can automatically synthesize equivalent SQL expressions in 92% of the cases and that it can solve 2.4× more benchmarks compared to a baseline that does not use our compositional approach. We also show that CLIS yields an average speed-up of 3.5× for individual UDFs and 1.3× to 3.1× in terms of end-to-end application performance.  more » « less
Award ID(s):
1918889
NSF-PAR ID:
10320423
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
5
Issue:
OOPSLA
ISSN:
2475-1421
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges in exhaustive, automatic testing because they consist of dataflow operators, and complex user-defined functions (UDF) are prevalent unlike SQL queries. We design a new white-box testing approach, called BigTest to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large scale input data size, real world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by 10^5 to 10^8 orders of magnitude while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only few of the data records (order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data. 
    more » « less
  2. In database-as-a-service platforms, automated ver-ification of query equivalence helps eliminate redundant computation in the form of overlapping sub-queries. Researchers have proposed two pragmatic techniques to tackle this problem. The first approach consists of reducing the queries to algebraic expressions and proving their equivalence using an algebraic theory. The limitations of this technique are threefold. It cannot prove the equivalence of queries with significant differences in the attributes of their relational operators (e.g., predicates in the filter operator). It does not support certain widely-used SQL features (e.g., NULL values). Its verification procedure is computationally intensive. The second approach transforms this problem to a constraint satisfaction problem and leverages a general-purpose solver to determine query equivalence. This technique consists of deriving the symbolic representation of the queries and proving their equivalence by determining the query containment relationship between the symbolic expressions. While the latter approach addresses all the limitations of the former technique, it only proves the equivalence of queries under set semantics (i.e., output tables must not contain duplicate tuples). However, in practice, database applications use bag semantics (i.e., output tables may contain duplicate tuples) In this paper, we introduce a novel symbolic approach for proving query equivalence under bag semantics. We transform the problem of proving query equivalence under bag semantics to that of proving the existence of a bijective, identity map between tuples returned by the queries on all valid inputs. We classify SQL queries into four categories, and propose a set of novel category-specific verification algorithms. We implement this symbolic approach in SPES and demonstrate that it proves the equivalence of a larger set of query pairs (95/232) under bag semantics compared to the SOTA tools based on algebraic (30/232) and symbolic approaches (67/232) under set and bag semantics, respectively. Furthermore, SPES is 3X faster than the symbolic tool that proves equivalence under set semantics. 
    more » « less
  3. We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable with classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in the front of the expensive ML UDFs. In such a manner, unlikely inputs that do not satisfy query predicates are filtered early to bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits the predicate correlations and accelerates ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image and video datasets show that CORE improves the query throughput by up to 63% compared to PP and up to 80% compared to running the queries as it is. 
    more » « less
  4. null (Ed.)
    Syntax-guided synthesis (SyGuS) aims to find a program satisfying semantic specification as well as user-provided structural hypotheses. There are two main synthesis approaches: enumerative synthesis, which repeatedly enumerates possible candidate programs and checks their correctness, and deductive synthesis, which leverages a symbolic procedure to construct implementations from specifications. Neither approach is strictly better than the other: automated deductive synthesis is usually very efficient but only works for special grammars or applications; enumerative synthesis is very generally applicable but limited in scalability. In this paper, we propose a cooperative synthesis technique for SyGuS problems with the conditional linear integer arithmetic (CLIA) background theory, as a novel integration of the two approaches, combining the best of the two worlds. The technique exploits several novel divide-and-conquer strategies to split a large synthesis problem to smaller subproblems. The subproblems are solved separately and their solutions are combined to form a final solution. The technique integrates two synthesis engines: a pure deductive component that can efficiently solve some problems, and a height-based enumeration algorithm that can handle arbitrary grammar. We implemented the cooperative synthesis technique, and evaluated it on a wide range of benchmarks. Experiments showed that our technique can solve many challenging synthesis problems not possible before, and tends to be more scalable than state-of-the-art synthesis algorithms. 
    more » « less
  5. Enhancing battery energy storage capability and reducing the cost per average energy capacity is urgent to satisfy the increasing energy demand in modern society. The lithium-sulfur (Li-S) battery is especially attractive because of its high theoretical specific energy (around 2600 W h kg-1), low cost, and low toxicity.1 Despite these advantages, the practical utilization of lithium-sulfur (Li-S) batteries to date has been hindered by a series of obstacles, including low active material loading, shuttle effects, and sluggish sulfur conversion kinetics.2 The traditional 2D planer thick electrode is considered as a general approach to enhance the mass loading of the Li-S battery.3 However, the longer diffusion length of lithium ions, which resulted in high tortuosity in the compact stacking thick electrode, decreases the penetration ability of the electrolyte into the entire cathode.4 Although an effort to induce catalysts in the cathode was made to promote sulfur conversion kinetic conditions, catalysts based on transition metals suffered from the low electronic conductivity, and some elements (i.e.: Co, Mn) may even absorb and restrict polysulfides for further reaction. 5 To mitigate the issues listed above, herein we propose a novel sulfur cathode design strategy enabled by additive manufacturing and oxidative chemical vapor deposition (oCVD). 6,7 Specifically, the cathode is designed to have a hierarchal hollow structure via a stereolithography technique to increase sulfur usage. Microchannels are constructed on the tailored sulfur cathode to further fortify the wettability of the electrolyte. The as-printed cathode is then sintered at 700 °C in an N2 atmosphere in order to generate a carbon skeleton (i.e.: carbonization of resin) with intrinsic carbon defects. The intrinsic carbon defects are expected to create favorable sulfur conversion conditions with sufficient electronic conductivity. In this study, the oCVD technique is leveraged to produce a conformal coating layer to eliminate shuttle effects. Identified by scanning electron microscopy and energy-dispersive X-ray spectroscopy mapping characterizations, the oCVD PEDOT is not only covered on the surface of the cathode but also on the inner surface of the microchannels. High-resolution x-ray photoelectron spectroscopy analyses (C 1s and S 2p orbitals) between pristine and modified samples demonstrate that a high concentration of the defects has been produced on the sulfur matrix after sintering and posttreatment. In-operando XRD diffractograms show that the Li2S is generated in the oCVD PEDOT-coated sample during the charge and discharge process even with a high current density, confirming an eminent sulfur conversion kinetic condition. In addition, ICP-OES results of lithium metal anode at different states of charge (SoC) verify that the shuttle effects are excellently restricted by oCVD PEDOT. Overall, the high mass loading (> 5 mg cm-2) with an elevated sulfur utilization ratio, accelerated reaction kinetics and stabilized electrochemical process have been achieved on the sulfur cathode by implementing this innovative cathode design strategy. The results of this study demonstrate significant promises of employing pure sulfur powder with high electrochemical performance and suggest a pathway to the higher energy and power density battery. References: 1 Chen, Y. Adv Mater 33, e2003666. 2 Bhargav, A. Joule 4, 285-291. 3 Liu, S. Nano Energy 63, 103894. 4 Chu, T. Carbon Energy 3. 5 Li, Y. Matter 4, 1142-1188. 6 John P. Lock. Macromolecules 39, 4 (2006). 7 Zekoll, S. Energy & Environmental Science 11, 185-201. 
    more » « less