Query optimization is a key component in database management systems (DBMS) and distributed data processing platforms. Recent research in the database community incorporated techniques from artificial intelligence to enhance query optimization. Various learning models have been extended and applied to the query optimization tasks, including query execution plan, query rewriting, and cost estimation. The tasks involved in query optimization differ based on the type of data being processed, such as relational data or spatial geometries. This tutorial reviews recent learning-based approaches for spatial query optimization tasks. We go over methods designed specifically for spatial data, as well as solutions proposed for high-dimensional data. Additionally, we present learning-based spatial indexing and spatial partitioning methods, which are also vital components in spatial data processing. We also identify several open research problems in these fields.
more »
« less
This content will become publicly available on September 19, 2025
Spatial Query Optimization With Learning
Query optimization is a key component in database management systems (DBMS) and distributed data processing platforms. Re- cent research in the database community incorporated techniques from artificial intelligence to enhance query optimization. Various learning models have been extended and applied to the query optimization tasks, including query execution plan, query rewriting, and cost estimation. The tasks involved in query optimization differ based on the type of data being processed, such as relational data or spatial geometries. This tutorial reviews recent learning-based approaches for spatial query optimization tasks. We go over methods designed specifically for spatial data, as well as solutions proposed for high-dimensional data. Additionally, we present learning-based spatial indexing and spatial partitioning methods, which are also vital components in spatial data processing. We also identify several open research problems in these fields.
more »
« less
- Award ID(s):
- 1924694
- PAR ID:
- 10550228
- Publisher / Repository:
- VLDB Endowment
- Date Published:
- Journal Name:
- Proceedings of the VLDB Endowment
- ISSN:
- 2150-8097
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
There have been many decades of work on optimizing query processing in database management systems. Recently, modern machine learning (ML), and specifically reinforcement learning (RL), has gained increased attention as a means to develop a query optimizer (QO). In this work, we take a closer look at two recent state-of-the-art (SOTA) RL-based QO methods to better understand their behavior. We find that these RL-based methods do not generalize as well as it seems at first glance. Thus, we ask a simple question:How do SOTA RL-based QOs compare to a simple, modern, adaptive query processing approach?To answer this question, we choose two simple adaptive query processing techniques and implemented them in PostgreSQL. The first adapts an individual join operation on-the-fly and switches between a Nested Loop Join algorithm and a Hash Join algorithm to avoid sub-optimal join algorithm decisions. The second is a technique calledLookahead Information Passing(LIP), in which adaptive semijoin techniques are used to make a pipeline of join operations execute efficiently. To our surprise, we find that this simple adaptive query processing approach is not only competitive to the SOTA RL-based approaches but, in some cases, outperforms the RL-based approaches. The adaptive approach is also appealing because it does not require an expensive training step, and it is fully interpretable compared to the RL-based QO approaches. Further, the adaptive method works across complex query constructs that RL-based QO methods currently cannot optimize.more » « less
-
null (Ed.)Increasingly, individuals and companies adopt a cloud service provider as a primary data and IT infrastructure platform. The remote access of the data inevitably brings the issue of trust. Data encryption is necessary to keep sensitive information secure and private on the cloud. Yet adversaries can still learn valuable information regarding encrypted data by observing data access patterns. To solve such problem, Oblivious RAMs (ORAMs) are proposed to completely hide access patterns. However, most ORAM constructions are expensive and not suitable to deploy in a database for supporting query processing over large data. Furthermore, an ORAM processes queries synchronously, hence, does not provide high throughput for concurrent query processing. In this work, we design a practical oblivious query processing framework to enable efficient query processing over a cloud database. In particular, we focus on processing multiple range and kNN queries asynchronously and concurrently with high throughput. The key idea is to integrate indices into ORAM which leverages a suite of optimization techniques (e.g., oblivious batch processing and caching). The effectiveness and efficiency of our oblivious query processing framework is demonstrated through extensive evaluations over large datasets. Our construction shows an order of magnitude speedup in comparison with other baselines.more » « less
-
null (Ed.)Modern database management systems employ sophisticated query optimization techniques that enable the generation of efficient plans for queries over very large data sets. A variety of other applications also process large data sets, but cannot leverage database-style query optimization for their code. We therefore identify an opportunity to enhance an open-source programming language compiler with database-style query optimization. Our system dynamically generates execution plans at query time, and runs those plans on chunks of data at a time. Based on feedback from earlier chunks, alternative plans might be used for later chunks. The compiler extension could be used for a variety of data-intensive applications, allowing all of them to benefit from this class of performance optimizations.more » « less
-
Join operations are crucial in data analysis, but can suffer inefficiency with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to realize new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code.more » « less