<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Spatial Query Optimization With Learning</title></titleStmt>
			<publicationStmt>
				<publisher>VLDB Endowment</publisher>
				<date>09/19/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10550228</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the VLDB Endowment</title>
<idno>2150-8097</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Xin Zhang</author><author>Ahmed Eldawy</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Query optimization is a key component in database management systems (DBMS) and distributed data processing platforms. Re- cent research in the database community incorporated techniques from artificial intelligence to enhance query optimization. Various learning models have been extended and applied to the query optimization tasks, including query execution plan, query rewriting, and cost estimation. The tasks involved in query optimization differ based on the type of data being processed, such as relational data or spatial geometries. This tutorial reviews recent learning-based approaches for spatial query optimization tasks. We go over methods designed specifically for spatial data, as well as solutions proposed for high-dimensional data. Additionally, we present learning-based spatial indexing and spatial partitioning methods, which are also vital components in spatial data processing. We also identify several open research problems in these fields.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>In database management systems (DBMSs) and data processing platforms, query optimization involves two main steps: logical transformation and cost estimation. Modern DBMSs and data processing platforms can process heterogeneous data. Each type of data requires specific considerations in its query optimization process. However, the main steps of query optimizations are always the same, regardless of the type of data being processed. The query optimizer returns a query plan with the lowest cost. A query plan is represented by a set of operators that form a query tree. The query transformer maps a query tree into equivalent query trees by following a set of rules inherited from relational algebra. The query optimizer evaluates the overall cost of each candidate plan to determine the optimal plan. The cost function considers access cost, storage cost, computation cost, and communication cost <ref type="bibr">[31]</ref>. To estimate the computation cost, the query estimators compute the selectivity and cardinality through data statistics. The data statistics are extracted and compacted into data synopses, such as histograms, samples, sketches, and wavelets <ref type="bibr">[9]</ref>. The I/O costs of This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit <ref type="url">https://creativecommons.org/licenses/by-nc-nd/4.0/</ref> to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing info@vldb.org. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. Proceedings of the VLDB Endowment, Vol. 17, No. 12 ISSN 2150-8097. doi:10.14778/3685800.3685846 the query execution are affected by data storage, such as partitioning and indexing. Over the past decades, extensive efforts have been made to incorporate different query optimization tasks and various query operation requirements.</p><p>AI4DB <ref type="bibr">[18]</ref> becomes a very hot direction in the database community. Researchers spend efforts to improve query optimization by Artificial Intelligence (AI) techniques. Learned query optimizers beat the traditional heuristic solutions on generating more efficient query plans among the large search space <ref type="bibr">[8,</ref><ref type="bibr">21,</ref><ref type="bibr">23,</ref><ref type="bibr">40,</ref><ref type="bibr">44,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref>. Traditional synopses-based estimators rely on the Attribute Value Independence (AVI) assumption and cannot capture the attribute correlations <ref type="bibr">[11,</ref><ref type="bibr">32]</ref>. To overcome this limitation, researchers propose learned-based estimators including data-driven and querydriven models. The data-driven models <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">26,</ref><ref type="bibr">30,</ref><ref type="bibr">34,</ref><ref type="bibr">36,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">46,</ref><ref type="bibr">52]</ref> learn the joint data distribution over all attributes. The query-driven models <ref type="bibr">[16,</ref><ref type="bibr">20,</ref><ref type="bibr">25,</ref><ref type="bibr">28,</ref><ref type="bibr">38,</ref><ref type="bibr">41,</ref><ref type="bibr">48]</ref> learn a mapping function from queries to the corresponding cardinalities.</p><p>Existing learned-based query optimization techniques for highdimensional data cannot be directly applied to spatial data because the spatial query operators are more complicated than highdimensional vector data. Spatial data includes raster and vector formats. Raster data is represented by a grid of regularly sized pixels, while vector data uses geometry, such as points, linestrings, and polygons, defined by a set of numerical values. Due to the complexity of spatial attributes, expressing a spatial query in SQL often requires multiple inequality predicates in the WHERE clause. The classification of spatial queries is not unified. One group categorizes spatial queries into five types: basic, join, computational geometry, data mining, and raster operations <ref type="bibr">[13]</ref>. The other group classifies spatial queries on vector data into five types: topologybased, metric-based, and direction-based <ref type="bibr">[7,</ref><ref type="bibr">24]</ref>. Techniques for spatial query optimization differ significantly from those used for relational data query optimization. For query rewriting, due to the complex data types and user-defined functions, several heuristic rules by relational query rewriter are no longer unconditional for spatial query rewriter <ref type="bibr">[31]</ref>. For cost estimators, spatial query estimations are also more complex than relational query estimations. Spatial queries involve both spatial operators and nonspatial operators. Estimating spatial operators is often associated with inequality relations, such as OVERLAP, DISTANCE, etc.</p><p>Previous tutorials present comprehensive studies about learnedbased DBMS <ref type="bibr">[18]</ref>, query optimizer <ref type="bibr">[15]</ref>, and learned-based query optimizers <ref type="bibr">[33,</ref><ref type="bibr">43]</ref>; however, they missed the spatial query operators and requirements. Several tutorials discuss spatial data management <ref type="bibr">[12]</ref> and spatial data applications <ref type="bibr">[6,</ref><ref type="bibr">29]</ref>, but none of them link to spatial query optimization in DBMS and data processing platforms. In this tutorial, we aim to review the existing methods in learning-based query plan generators and cost estimators and discuss the open problems in spatial query optimization. Additionally, we will also cover learned solutions for spatial data management, including learning spatial index and spatial partitioning, to highlight the I/O cost of spatial query execution. A tutorial <ref type="bibr">[1]</ref> reviews learned multi-dimensional indexes. As a supplement to <ref type="bibr">[1]</ref>, we discuss newly learned spatial indexes from recent years and learned spatial partitioning techniques in this tutorial.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">TUTORIAL OUTLINE</head><p>Figure <ref type="figure">1</ref> shows the outline of this tutorial. We plan to spend 90 minutes discussing techniques for spatial query optimization with learning. This tutorial targets students, researchers, and practitioners who are interested in exploring problems in spatial query optimization. In the first part, we will introduce the background of query optimization components and spatial data processing. No prior knowledge of spatial data is required for the audience. We aim to inspire the audience with the following three parts: (1) Key features of spatial data management and spatial query optimizations;</p><p>(2) Existing works on spatial query optimization; (3) Why techniques for relational query optimization cannot be directly applied to spatial query optimization; (4) Gaps for future research in spatial query optimization with learning. Figure <ref type="figure">2</ref> summarizes the works that will be discussed during the presentation and categorizes them based on query optimization goals.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Learned Solutions for Query Optimizer</head><p>We cover two topics about learned solutions for query optimizers: query plan optimizer <ref type="bibr">[5,</ref><ref type="bibr">8,</ref><ref type="bibr">23,</ref><ref type="bibr">34,</ref><ref type="bibr">40,</ref><ref type="bibr">44,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref> and query rewriter <ref type="bibr">[4,</ref><ref type="bibr">24,</ref><ref type="bibr">39,</ref><ref type="bibr">51]</ref>.</p><p>SJML <ref type="bibr">[34]</ref> designs a spatial join framework based on several learning models. The proposed models predict the best spatial join algorithm and features, such as the plane-sweep direction (along the x-or y-axis). SpatialEmbedding <ref type="bibr">[5]</ref> proposes a framework based on three learning models, which include an unsupervised model to capture the features of spatial datasets and two supervised models for the cost estimation of spatial operations. Optimizing the query execution plan is an important step in the DBMS and distributed data processing platforms. There are also several works <ref type="bibr">[8,</ref><ref type="bibr">23,</ref><ref type="bibr">40,</ref><ref type="bibr">44,</ref><ref type="bibr">49,</ref><ref type="bibr">50]</ref> that use learned models to evaluate the query plan and select the optimal query execution plan. These works are marked with grey color in Figure <ref type="figure">1</ref> because they are proposed for relational data and queries. Existing spatial optimizers focus on query execution and lack of supporting in considering query plans.</p><p>Maliva <ref type="bibr">[4]</ref> applies the Markov Decision Process model to rewrite the queries. The proposed model can be applied to spatial aggregation queries. SemanticQueryOpt <ref type="bibr">[24]</ref> proposes a strategy for the semantic query optimization of spatial join queries. This technique aims to eliminate unnecessary spatial joins or replace expensive spatial joins with cheaper thematic joins. Since SemanticQueryOpt is not a learning solution, it is marked with a dashed border in Figure <ref type="figure">1</ref>. We mark LearnedRewriter <ref type="bibr">[51]</ref> and WeTune <ref type="bibr">[39]</ref> with grey color because they are rule-based learned query rewriters for relational data. Although these works in grey are not learning-based spatial query optimization techniques, we highlight them to guide future research on rule-based spatial query rewriters.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Learned Solutions for Cost Estimator</head><p>Extensive efforts have been dedicated to learned-based cost estimation. In this subsection, we cover works related to computation cost estimation. The computation cost of the query plan relates to selectivity and cardinality. Selectivity refers to the percentage of tuples among the whole dataset that satisfies the query predicates <ref type="bibr">[3]</ref>. Cardinality refers to the number of results returned by each operation <ref type="bibr">[3]</ref>.</p><p>Researchers usually summarize computation cost estimation learning models into two categories: query-driven and data-driven. Query-driven models treat cardinality estimation as a regression problem and learn a mapping function between the query and its cardinality on a database <ref type="bibr">[16,</ref><ref type="bibr">20,</ref><ref type="bibr">25,</ref><ref type="bibr">28,</ref><ref type="bibr">38,</ref><ref type="bibr">41,</ref><ref type="bibr">48]</ref>. They use query workload as labeled training data to learn supervised query models. Data-driven models learn the join data distribution of attributes directly from the dataset <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">26,</ref><ref type="bibr">30,</ref><ref type="bibr">34,</ref><ref type="bibr">36,</ref><ref type="bibr">42,</ref><ref type="bibr">46,</ref><ref type="bibr">52]</ref>. Some models take data as unsupervised information to learn unsupervised data models <ref type="bibr">[2,</ref><ref type="bibr">17,</ref><ref type="bibr">26,</ref><ref type="bibr">30,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">46,</ref><ref type="bibr">52]</ref> Some models are unsupervised data models that are directly learned from the data. Some models are supervised data models. LearningToSample <ref type="bibr">[36]</ref> learns a probabilistic classifier by queries. SJML <ref type="bibr">[34]</ref> learns data statistics similar to data synopses.</p><p>In spatial data processing, the statistics used to complete cardinality estimation tasks vary for different types of data. For example, estimating the cardinality of a spatial overlap-join for two sets of polygon datasets requires knowledge of the average volume and other features of the polygons. On the other hand, the cardinality of a distance-join for two sets of point datasets can be estimated using data distribution. In this part, we will highlight the types of cardinality estimation tasks each work addresses and the types of data each model can handle. SJML <ref type="bibr">[34]</ref> is designed for polygon datasets and spatial join selectivity estimation. Some works are designed for high-dimensional datasets and can be applied to range query cardinality estimation for spatial point data <ref type="bibr">[2,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">20,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">28,</ref><ref type="bibr">32,</ref><ref type="bibr">36,</ref><ref type="bibr">38,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">46,</ref><ref type="bibr">48,</ref><ref type="bibr">52]</ref>. Learning-ToSample <ref type="bibr">[36]</ref> leans a classifier based on the sample of the data and can be applied to spatial point datasets. TurboReg <ref type="bibr">[30]</ref> proposes a regression model for predicting the presence or absence of spatial  phenomena. PivNet <ref type="bibr">[2]</ref> uses a regression-based solution to estimate the k-NN queries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">Learned Solutions for Spatial Management</head><p>Data indexing is the most important factor that affects the I/O cost of spatial data processing in DBMSs. In distributed DBMSs and platforms, the I/O cost of spatial data processing is also related to data partitioning. Most recently proposed learned spatial indexes are designed for spatial point data <ref type="bibr">[10,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">22,</ref><ref type="bibr">27,</ref><ref type="bibr">37,</ref><ref type="bibr">47]</ref>. RW-Tree <ref type="bibr">[10]</ref> and RLR-Tree <ref type="bibr">[14]</ref> leverage learning-based methods from query workload to help with R-tree construction. LISA <ref type="bibr">[19]</ref>, SLBRIN <ref type="bibr">[37]</ref>, and SPRIG <ref type="bibr">[47]</ref> learn spatial index based on grid cells partitioning. RSMI <ref type="bibr">[27]</ref> and ELSI <ref type="bibr">[22]</ref> learn spatial index based on the z-order model. There are two recent works on learned spatial partitioning. ClusterPar <ref type="bibr">[45]</ref> uses K-Means clustering to partition the spatial point datasets. SpatialJoinFramework <ref type="bibr">[35]</ref> uses regression models to learn the features of spatial join. The models can decide the number of partitions and choose the partitioning strategy for spatial polygon datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4">Open Problems for Future Research</head><p>After reviewing recently learned solutions for query optimization, we want to highlight several gaps in spatial query optimization:</p><p>(1) Query rewriting rules for optimizing the spatial operations;</p><p>(2) Models and frameworks that also consider the spatial query execution;</p><p>(3) More focus on types of spatial data beyond point datasets, such as linestrings, polygons, and spatial raster data, along with their cost estimation tasks; (4) How to apply current learned partitioning models to existing distributed platforms and integrate these learned solutions into the query optimization components of distributed platforms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">PRESENTERS</head><note type="other">Xin</note><p>Zhang is a Ph.D. candidate in the Department of Computer Science and Engineering at the University of California, Riverside. She received her B.S. in Software Engineering from Shandong University in 2016 and her M.S. in Computer Science from Washington State University in 2018. She has interned at IBM Research, Almaden, and at Amazon AWS Redshift. Her research interests include big data management and approximate query processing. Ahmed Eldawy is an Associate Professor in Computer Science at UC Riverside. His research focuses big data management and spatial data processing. Ahmed led the research and development in many open-source projects for big spatial data exploration and visualization including UCR-Star, an interactive repository with nearly four terabytes of publicly available geospatial data. He delivered a series of tutorials on big spatial data in VLDB, ICDE, IEEE BigData, and MDM. He is a recipient of the NSF CAREER award as well as the best demo award in SIGSPATIAL 2020.</p></div></body>
		</text>
</TEI>
