Complex video queries can be answered by decomposing them into modular subtasks. However, existing video data management systems assume the existence of predefined modules for each subtask. We introduce VOCAL-UDF, a novel self-enhancing system that supports compositional queries over videos without the need for predefined modules. VOCAL-UDF automatically identifies and constructs missing modules and encapsulates them as user-defined functions (UDFs), thus expanding its querying capabilities. To achieve this, we formulate a unified UDF model that leverages large language models (LLMs) to aid in new UDF generation. VOCAL UDF handles a wide range of concepts by supporting both program-based UDFs (i.e., Python functions generated by LLMs) and distilled-model UDFs (lightweight vision models distilled from strong pretrained models). To resolve the inherent ambiguity in user intent, VOCAL-UDF generates multiple candidate UDFs and uses active learning to efficiently select the best one. With the self-enhancing capability, VOCAL-UDF significantly improves query performance across three video datasets. 
                        more » 
                        « less   
                    
                            
                            A Framework For Inferring Properties of User-Defined Functions
                        
                    
    
            User-defined functions (UDFs) are widely used to enhance the ca- pabilities of DBMSs. However, using UDFs comes with a significant performance penalty because DBMSs treat UDFs as black boxes, which hinders their ability to optimize queries that invoke such UDFs. To mitigate this problem, in this paper we present LAMBDA, a technique and framework for improving DBMSs’ performance in the presence of UDFs. The core idea of LAMBDA is to statically infer properties of UDFs that facilitate UDF processing. Taking one such property as an example, if DBMSs know that a UDF is pure, that is it returns the same result given the same arguments, they can leverage a cache to avoid repetitive UDF invocations that have the same call arguments. We reframe the problem of analyzing UDF properties as a data flow problem. We tackle the data flow problem by building LAMBDA on top of an extensible abstract interpretation framework and de- veloping an analysis model that is tailored for UDFs. Currently, LAMBDA supports inferring four properties from UDFs that are widely used across DBMSs. We evaluate LAMBDA on a benchmark that is derived from production query workloads and UDFs. Our evaluation results show that (1) LAMBDA conservatively and ef- ficiently infers the considered UDF properties, and (2) inferring such properties improves UDF performance, with a time reduction ranging from 10% to 99%. In addition, when applied to 20 produc- tion UDFs, LAMBDA caught five instances in which developers provided incorrect UDF property annotations. We qualitatively compare LAMBDA against Froid, a state-of-the-art framework for improving UDF performance, and explain how LAMBDA can opti- mize UDFs that are not supported by Froid. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10538583
- Publisher / Repository:
- ACM
- Date Published:
- ISBN:
- 9798400702174
- Page Range / eLocation ID:
- 1 to 11
- Subject(s) / Keyword(s):
- UDF properties, DBMSs, static analysis
- Format(s):
- Medium: X
- Location:
- Lisbon Portugal
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            Many big data systems are written in languages such as C, C++, Java, and Scala to process large amounts of data efficiently, while data analysts often use Python to conduct data wrangling, statistical analysis, and machine learning. User-defined functions (UDFs) are commonly used in these systems to bridge the gap between the two ecosystems. In this paper, we propose Udon, a novel debugger to support fine-grained debugging of UDFs. Udon encapsulates the modern line-by-line debugging primitives, such as the ability to set breakpoints, perform code inspections, and make code modifications while executing a UDF on a single tuple. It includes a novel debug-aware UDF execution model to ensure the responsiveness of the operator during debugging. It utilizes advanced state-transfer techniques to satisfy breakpoint conditions that span across multiple UDFs. It incorporates various optimization techniques to reduce the runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show its high efficiency and scalability.more » « less
- 
            Many data processing systems allow SQL queries that calluser-defined functions (UDFs)written in conventional programming languages. While such SQL extensions provide convenience and flexibility to users, queries involving UDFs are not as efficient as their pure SQL counterparts that invoke SQL’s highly-optimized built-in functions. Motivated by this problem, we propose a new technique for translating SQL queries with UDFs to pure SQL expressions. Unlike prior work in this space, our method is not based on syntactic rewrite rules and can handle a much more general class of UDFs. At a high-level, our method is based on counterexample-guided inductive synthesis (CEGIS) but employs a novel compositional strategy that decomposes the synthesis task into simpler sub-problems. However, because there is no universal decomposition strategy that works for all UDFs, we propose a novellazy inductive synthesisapproach that generates a sequence of decompositions that correspond to increasingly harder inductive synthesis problems. Because most realistic UDF-to-SQL translation tasks are amenable to a fine-grained decomposition strategy, our lazy inductive synthesis method scales significantly better than traditional CEGIS. We have implemented our proposed technique in a tool called CLIS for optimizing Spark SQL programs containing Scala UDFs. To evaluate CLIS, we manually study 100 randomly selected UDFs and find that 63 of them can be expressed in pure SQL. Our evaluation on these 63 UDFs shows that CLIS can automatically synthesize equivalent SQL expressions in 92% of the cases and that it can solve 2.4× more benchmarks compared to a baseline that does not use our compositional approach. We also show that CLIS yields an average speed-up of 3.5× for individual UDFs and 1.3× to 3.1× in terms of end-to-end application performance.more » « less
- 
            Data-intensive scalable computing (DISC) systems such as Google’s MapReduce, Apache Hadoop, and Apache Spark are being leveraged to process massive quantities of data in the cloud. Modern DISC applications pose new challenges in exhaustive, automatic testing because they consist of dataflow operators, and complex user-defined functions (UDF) are prevalent unlike SQL queries. We design a new white-box testing approach, called BigTest to reason about the internal semantics of UDFs in tandem with the equivalence classes created by each dataflow and relational operator. Our evaluation shows that, despite ultra-large scale input data size, real world DISC applications are often significantly skewed and inadequate in terms of test coverage, leaving 34% of Joint Dataflow and UDF (JDU) paths untested. BigTest shows the potential to minimize data size for local testing by 10^5 to 10^8 orders of magnitude while revealing 2X more manually-injected faults than the previous approach. Our experiment shows that only few of the data records (order of tens) are actually required to achieve the same JDU coverage as the entire production data. The reduction in test data also provides CPU time saving of 194X on average, demonstrating that interactive and fast local testing is feasible for big data analytics, obviating the need to test applications on huge production data.more » « less
- 
            In the recent literature on machine learning and decision making, calibration has emerged as a desirable and widely-studied statistical property of the outputs of binary prediction models. However, the algorithmic aspects of measuring model calibration have remained relatively less well-explored. Motivated by [BGHN23], which proposed a rigorous framework for measuring distances to calibration, we initiate the algorithmic study of calibration through the lens of property testing. We define the problem of calibration testing from samples where given n draws from a distribution  on (predictions,binaryoutcomes), our goal is to distinguish between the case where  is perfectly calibrated, and the case where  is ε-far from calibration. We make the simple observation that the empirical smooth calibration linear program can be reformulated as an instance of minimum-cost flow on a highly-structured graph, and design an exact dynamic programming-based solver for it which runs in time O(nlog2(n)), and solves the calibration testing problem information-theoretically optimally in the same time. This improves upon state-of-the-art black-box linear program solvers requiring Ω(nω) time, where ω>2 is the exponent of matrix multiplication. We also develop algorithms for tolerant variants of our testing problem improving upon black-box linear program solvers, and give sample complexity lower bounds for alternative calibration measures to the one considered in this work. Finally, we present experiments showing the testing problem we define faithfully captures standard notions of calibration, and that our algorithms scale efficiently to accommodate large sample sizes.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    