skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: UDF to SQL translation through compositional lazy inductive synthesis
Many data processing systems allow SQL queries that calluser-defined functions (UDFs)written in conventional programming languages. While such SQL extensions provide convenience and flexibility to users, queries involving UDFs are not as efficient as their pure SQL counterparts that invoke SQL’s highly-optimized built-in functions. Motivated by this problem, we propose a new technique for translating SQL queries with UDFs to pure SQL expressions. Unlike prior work in this space, our method is not based on syntactic rewrite rules and can handle a much more general class of UDFs. At a high-level, our method is based on counterexample-guided inductive synthesis (CEGIS) but employs a novel compositional strategy that decomposes the synthesis task into simpler sub-problems. However, because there is no universal decomposition strategy that works for all UDFs, we propose a novellazy inductive synthesisapproach that generates a sequence of decompositions that correspond to increasingly harder inductive synthesis problems. Because most realistic UDF-to-SQL translation tasks are amenable to a fine-grained decomposition strategy, our lazy inductive synthesis method scales significantly better than traditional CEGIS. We have implemented our proposed technique in a tool called CLIS for optimizing Spark SQL programs containing Scala UDFs. To evaluate CLIS, we manually study 100 randomly selected UDFs and find that 63 of them can be expressed in pure SQL. Our evaluation on these 63 UDFs shows that CLIS can automatically synthesize equivalent SQL expressions in 92% of the cases and that it can solve 2.4× more benchmarks compared to a baseline that does not use our compositional approach. We also show that CLIS yields an average speed-up of 3.5× for individual UDFs and 1.3× to 3.1× in terms of end-to-end application performance.  more » « less
Award ID(s):
1918889
PAR ID:
10603577
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Association for Computing Machinery (ACM)
Date Published:
Journal Name:
Proceedings of the ACM on Programming Languages
Volume:
5
Issue:
OOPSLA
ISSN:
2475-1421
Format(s):
Medium: X Size: p. 1-26
Size(s):
p. 1-26
Sponsoring Org:
National Science Foundation
More Like this
  1. User-defined functions (UDFs) are widely used to enhance the ca- pabilities of DBMSs. However, using UDFs comes with a significant performance penalty because DBMSs treat UDFs as black boxes, which hinders their ability to optimize queries that invoke such UDFs. To mitigate this problem, in this paper we present LAMBDA, a technique and framework for improving DBMSs’ performance in the presence of UDFs. The core idea of LAMBDA is to statically infer properties of UDFs that facilitate UDF processing. Taking one such property as an example, if DBMSs know that a UDF is pure, that is it returns the same result given the same arguments, they can leverage a cache to avoid repetitive UDF invocations that have the same call arguments. We reframe the problem of analyzing UDF properties as a data flow problem. We tackle the data flow problem by building LAMBDA on top of an extensible abstract interpretation framework and de- veloping an analysis model that is tailored for UDFs. Currently, LAMBDA supports inferring four properties from UDFs that are widely used across DBMSs. We evaluate LAMBDA on a benchmark that is derived from production query workloads and UDFs. Our evaluation results show that (1) LAMBDA conservatively and ef- ficiently infers the considered UDF properties, and (2) inferring such properties improves UDF performance, with a time reduction ranging from 10% to 99%. In addition, when applied to 20 produc- tion UDFs, LAMBDA caught five instances in which developers provided incorrect UDF property annotations. We qualitatively compare LAMBDA against Froid, a state-of-the-art framework for improving UDF performance, and explain how LAMBDA can opti- mize UDFs that are not supported by Froid. 
    more » « less
  2. Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets. 
    more » « less
  3. We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable with classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in the front of the expensive ML UDFs. In such a manner, unlikely inputs that do not satisfy query predicates are filtered early to bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits the predicate correlations and accelerates ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image and video datasets show that CORE improves the query throughput by up to 63% compared to PP and up to 80% compared to running the queries as it is. 
    more » « less
  4. We will demonstrate a prototype query-processing engine, which utilizes correlations among predicates to accelerate machine learning (ML) inference queries on unstructured data. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable by classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates or PP) build a cheap proxy model for each predicate offline, and inject proxy models in the front of expensive ML UDFs under the independence assumption in queries. Input records that do not satisfy query predicates are filtered early by proxy models to bypass ML UDFs. But enforcing the independence assumption may result in sub-optimal plans. We use correlative proxy models to better exploit predicate correlations and accelerate ML queries. We will demonstrate our query optimizer called CORE, which builds proxy models online, allocates parameters to each model, and reorders them. We will also show end-to-end query processing with or without proxy models. 
    more » « less
  5. null (Ed.)
    Relational database management systems (RDBMS) have limited iterative processing support. Recursive queries were added to ANSI SQL, however, their semantics do not allow aggregation functions, which disqualifies their use for several applications, such as PageRank and shortest path computations. Recently, another SQL extension, iterative Common Table Expressions (CTEs), is proposed to enable users to perform general iterative computations on RDBMSs.In this work 1 , we demonstrate how iterative CTEs can be efficiently incorporated into a production RDBMS without major intrusion to the system. We have prototyped our approach on Futurewei's MPPDB, a shared nothing relational parallel database engine. The implementation is based on a functional rewrite that translates iterative CTEs to other existing SQL operators. Thus, query plans of iterative CTEs can be optimized and executed by the engine with minimal modification to the code base. We have also applied several optimizations specifically for iterative CTEs to i) minimize data movement, ii) reuse results that remain constant and iii) push down predicates to avoid unnecessary data processing. We verified our implementation through extensive experimental evaluation using real world datasets and queries. The results show the feasibility of the rewrite approach and the effectiveness of the optimizations, which improve performance by an order of magnitude in some cases. 
    more » « less