NSF PAR Search | NSF Public Access Repository


A learning-based framework for spatial join processing: estimation, optimization and tuning

https://doi.org/10.1007/s00778-024-00836-1

Vu, Tin; Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (February 2024, The VLDB Journal)

Abstract: The importance and complexity of the spatial join operation have resulted in the availability of many join algorithms, some of which are tailored for big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization which can accommodate both the characteristics of spatial datasets and the complexity of the different algorithms. The main challenge is how to develop portable cost models that, once trained, can be applied to any pair of input datasets, because they are able to extract the important input characteristics, such as data distribution and spatial partitioning, the logic of spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently for the data to capture the intricate aspects of spatial join. Then, it uses these features to train five machine learning models that are used to identify the best spatial join algorithm. The first two are regression models that estimate two important measures of spatial join performance, and they act as the cost model. The third model chooses the best partitioning strategy to use with spatial join. The fourth and fifth models further tune two important parameters, number of partitions and plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.
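To make the "plane-sweep direction" parameter mentioned in the abstract concrete, the following is a minimal sketch of a plane-sweep join over axis-aligned rectangles in which the sweep axis is an argument. It is an illustration only, not the authors' implementation: the function name `plane_sweep_join`, the `(xmin, ymin, xmax, ymax)` tuple layout, and the choice to count touching rectangles as intersecting are all assumptions.

```python
from typing import List, Tuple

# Assumed layout for illustration: (xmin, ymin, xmax, ymax)
Rect = Tuple[float, float, float, float]


def plane_sweep_join(r: List[Rect], s: List[Rect], axis: int = 0) -> List[Tuple[int, int]]:
    """Return index pairs (i, j) such that r[i] intersects s[j].

    axis=0 sweeps along x, axis=1 sweeps along y. The result set is the
    same for both axes; only the amount of work done may differ.
    """
    other = 1 - axis
    lo, hi = axis, axis + 2        # min/max positions along the sweep axis
    olo, ohi = other, other + 2    # min/max positions along the other axis

    # Sort indices of both inputs by their lower bound on the sweep axis.
    ri = sorted(range(len(r)), key=lambda i: r[i][lo])
    si = sorted(range(len(s)), key=lambda j: s[j][lo])

    out: List[Tuple[int, int]] = []
    a = b = 0
    while a < len(ri) and b < len(si):
        if r[ri[a]][lo] <= s[si[b]][lo]:
            # r[ri[a]] starts first: scan s while it still overlaps on the sweep axis.
            cur = r[ri[a]]
            k = b
            while k < len(si) and s[si[k]][lo] <= cur[hi]:
                cand = s[si[k]]
                if cur[olo] <= cand[ohi] and cand[olo] <= cur[ohi]:
                    out.append((ri[a], si[k]))
                k += 1
            a += 1
        else:
            # s[si[b]] starts first: scan r symmetrically.
            cur = s[si[b]]
            k = a
            while k < len(ri) and r[ri[k]][lo] <= cur[hi]:
                cand = r[ri[k]]
                if cur[olo] <= cand[ohi] and cand[olo] <= cur[ohi]:
                    out.append((ri[k], si[b]))
                k += 1
            b += 1
    return out
```

Both sweep directions return the same pairs, but the number of candidate comparisons depends on how the data is spread along the chosen axis, which is why the direction is worth tuning per input.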
Full Text Available
Towards a Learned Cost Model for Distributed Spatial Join: Data, Code & Models

https://doi.org/10.1145/3511808.3557712

Vu, Tin; Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (October 2022, ACM)

Geospatial data comprise around 60% of all the publicly available data. One of the essential and most complex operations that brings together multiple geospatial datasets is the spatial join operation. Due to its complexity, there are many partitioning techniques and parallel algorithms for the spatial join problem. This leads to a complex query optimization problem: which algorithm should be used for a given pair of input datasets that we want to join? With the rise of machine learning, there is promise in addressing this problem with the use of various learned models. However, one of the concerns is the lack of standard and publicly available data to train and test on, as well as the lack of accessible baseline models. This resource paper helps the research community solve this problem by providing synthetic and real datasets for spatial join, source code for constructing more datasets, and several baseline solutions that researchers can further extend and compare to.
Spatial data generators

https://doi.org/10.1145/3548732.3548736

Vu, Tin; Migliorini, Sara; Eldawy, Ahmed; Belussi, Alberto (August 2022, Spatial Gems)

Full Text Available
Incremental Partitioning for Efficient Spatial Data Analytics

https://doi.org/10.14778/3494124.349415

Vu, Tin; Eldawy, Ahmed; Hristidis, Vagelis; Tsotras, Vassilis (January 2022, PVLDB)

Full Text Available
A Learned Query Optimizer for Spatial Join

https://doi.org/10.1145/3474717.3484217

Vu, Tin; Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (November 2021, International Conference on Advances in Geographic Information Systems (SIGSPATIAL))

Full Text Available
Incremental partitioning for efficient spatial data analytics

https://doi.org/10.14778/3494124.3494150

Vu, Tin; Eldawy, Ahmed; Hristidis, Vagelis; Tsotras, Vassilis (November 2021, Proceedings of the VLDB Endowment)

Big spatial data has become ubiquitous, from mobile applications to satellite data. In most of these applications, data is continuously growing to huge volumes. Existing systems for big spatial data organize records at either the record-level or block-level. Systems that use record-level structures include key-value stores and LSM-Tree stores, which support insert and delete operations and are optimized for highly-selective queries. On the other hand, systems like GeoSpark that use block-level structures (e.g. 128 MB each) are more efficient for analytical queries, but they cannot incrementally maintain the partitioned data and do not support delete operations. This paper proposes a general framework that enables block-level systems to incrementally maintain spatial partitions, in the presence of bulk insertions and deletions, in distributed file system (DFS) blocks. We first formally study the incremental spatial partitioning problem for big data and demonstrate its NP-hardness. Then, we propose a cost model to estimate the performance of queries on the partitioned data and the effect of modifying it as the data grows. After that, we provide three different implementations of the incremental partitioning framework. Comprehensive experiments on large real datasets show that our proposed partitioning algorithms outperform state-of-the-art spatial partitioning methods.
Full Text Available
R*-Grove: Balanced Spatial Partitioning for Large-Scale Datasets

https://doi.org/10.3389/fdata.2020.00028

Vu, Tin; Eldawy, Ahmed (August 2020, Frontiers in Big Data)
Full Text Available
Beast: Scalable Exploratory Analytics on Spatio-temporal Data

https://doi.org/10.1145/3459637.3481897

Eldawy, Ahmed; Hristidis, Vagelis; Ghosh, Saheli; Saeedan, Majid; Sevim, Akil; Siddique, A.B.; Singla, Samriddhi; Sivaram, Ganesh; Vu, Tin; Zhang, Yaming (October 2021, Conference on Information and Knowledge Management (CIKM))

Full Text Available
Using Deep Learning for Big Spatial Data Partitioning

https://doi.org/10.1145/3402126

Vu, Tin; Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (October 2020, ACM Transactions on Spatial Algorithms and Systems)
Full Text Available
A brief introduction to geospatial big data analytics with Apache AsterixDB

https://doi.org/10.1145/3486189.3490018

Sevim, Akil; Mahin, Mehnaz Tabassum; Vu, Tin; Maxon, Ian; Eldawy, Ahmed; Carey, Michael; Tsotras, Vassilis (January 2021, SIGSPATIAL/GIS)

Full Text Available
