NSF PAR Search | NSF Public Access Repository

A learning-based framework for spatial join processing: estimation, optimization and tuning

https://doi.org/10.1007/s00778-024-00836-1

Vu, Tin; Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (February 2024, The VLDB Journal)

The importance and complexity of the spatial join operation have resulted in the availability of many join algorithms, some of which are tailored for big-data platforms like Hadoop and Spark. The choice among them is not trivial and depends on different factors. This paper proposes the first machine-learning-based framework for spatial join query optimization that can accommodate both the characteristics of spatial datasets and the complexity of the different algorithms. The main challenge is how to develop portable cost models that, once trained, can be applied to any pair of input datasets because they are able to extract the important input characteristics, such as data distribution and spatial partitioning, the logic of spatial join algorithms, and the relationship between the two input datasets. The proposed system defines a set of features that can be computed efficiently for the data to capture the intricate aspects of spatial join. Then, it uses these features to train five machine learning models that are used to identify the best spatial join algorithm. The first two are regression models that estimate two important measures of spatial join performance and act as the cost model. The third model chooses the best partitioning strategy to use with spatial join. The fourth and fifth models further tune two important parameters, the number of partitions and the plane-sweep direction, to get the best performance. Experiments on large-scale synthetic and real data show the efficiency of the proposed models over baseline methods.
Full Text Available
SynopsisLake: Quality-aware Approximate Spatial Query Processing Using Data Synopses

https://doi.org/10.1145/3748636.3762714

Zhang, Xin; Eldawy, Ahmed (November 2025, ACM)

Full Text Available
LASEK: LLM-Assisted Style Exploration Kit for Geospatial Data

Bahadori, Tarlan; Sarvepalli, Sai Sreekar; Eldawy, Ahmed (July 2025, The VLDB Endowment)

Visualizing geospatial data on a map is an essential feature of modern data exploration tools. However, these tools require users to manually configure the visualization style, including color scheme and attribute selection, a process that is both complex and domain-specific. Large Language Models (LLMs) provide an opportunity to intelligently assist in styling based on the underlying data distribution and characteristics. This paper demonstrates LASEK, an LLM-assisted visualization framework that automates attribute selection and styling in large-scale spatio-temporal datasets. The system leverages LLMs to determine which attributes should be highlighted for visual distinction and even suggests how to integrate them in styling options, improving interpretability and efficiency. We demonstrate our approach through interactive visualization scenarios, showing how LLM-driven attribute selection enhances clarity, reduces manual effort, and provides data-driven justifications for styling decisions.
Full Text Available
Geospatial Computing from Data Lakes to Deep Learning Applications

Saeedan, Majid (June 2025, UC Riverside)

This thesis explores geospatial vector data, including geometric shapes such as points, lines, and polygons. This data is crucial in navigation, urban planning, and many more applications. Geospatial computing is a multidisciplinary field that focuses on creating techniques and tools to handle large geospatial datasets. Given the reliance on data lakes to store large datasets in their raw formats, it is critical to have full support for geospatial datasets to enable scalable processing. To address this, we make two contributions in this area. First, we propose a column-oriented binary format called Spatial Parquet, which integrates geospatial vector data into Apache Parquet and enables significant data compression and efficient querying. Second, to improve support for semi-structured data, we introduce a distributed JSON processor for scalable SQL queries on large JSON datasets, including GeoJSON. It processes complex datasets like Open Street Map with features such as projection and filter push-down. Advances in Deep Learning (DL), including foundation models and Large Language Models (LLMs), offer opportunities for geospatial data analysis. We make three main contributions in this area. First, we study how to design DL models that can express a wide range of geospatial functions. We explore three representations: an image-based representation using geo-referenced histograms (GeoImg), a graph-based point-set representation (GeoGraph), and a vector-based representation using a Fourier encoder (GeoVec). We formalize these representations and design corresponding models: ResNet and UNet for the first, PointNet++ for the second, and Poly2Vec with Transformers for the third. We evaluate all approaches on four spatial problems, showing the accuracy and effectiveness of the three approaches. Second, we create a benchmark called GS-QA for evaluating spatial question-answering with LLMs. A semi-automated process generates diverse question-answer pairs that cover various spatial objects, predicates, and complexities. An evaluation methodology is suggested with some experiments. Finally, a prototype for generating geospatial vector data from text prompts, called GeoGen I, is proposed. It has potential for applications such as spatial interpolation, data augmentation, and change analysis. We adapt diffusion models, traditionally used for generating realistic images, as geospatial data generators. We also explore their use for similarity search through geospatial data embeddings, highlighting the potential of vector databases in this domain. This thesis advances geospatial data processing, storage, analysis, and generation, opening new research pathways in geospatial computing.
Full Text Available
RDPro: Distributed Processing of Big Raster Data [Scalable Data Science]

https://doi.org/10.14778/3712221.3712229

Shang, Zhuocheng; Singla, Samriddhi; Eldawy, Ahmed; Scudiero, Elia (November 2024, Proceedings of the VLDB Endowment)

Advancements in remote sensing technology have enabled the collection of vast amounts of satellite and aerial imagery with up to 1 cm pixel resolution, stored in a raster format that is crucial for various research fields. However, processing this data poses challenges, including resolving data dependencies when location, resolution, and coordinate systems do not align, and managing large datasets within memory constraints. This paper introduces RDPro, a novel Spark-based system that efficiently processes and analyzes large raster datasets. RDPro features a new data model tailored for data dependencies in a distributed, shared-nothing environment, complete with tools for loading and writing raster data. It also optimizes core raster operations within Spark, allowing users to integrate complex data science workflows. Comparative analysis shows RDPro outperforms existing systems by up to two orders of magnitude.
Full Text Available
DynoViz: Dynamic Visualization of Large Scale Satellite Data

https://doi.org/10.1145/3681763.3698475

Shang, Zhuocheng; Shivakumar, Suryaa Charan; Eldawy, Ahmed (October 2024, ACM Digital Library)

Full Text Available
QPV: An Input Control Component For Progressive Visualization Analytics [Work-in-progress]

Zhang, Xin; Eldawy, Ahmed (September 2024, Proceedings of the VLDB Endowment)

Full Text Available
FUDJ: Flexible User-Defined Distributed Joins

Sevim, Akil; Eldawy, Ahmed; Carman, Preston; Carey, Michael; Tsotras, Vassilis (May 2024, IEEE)

Join operations are crucial in data analysis, but can be inefficient with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions, since they are not integrated into a holistic query optimization engine as in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to implement new distributed join algorithms in the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations, with an order of magnitude fewer lines of code.
Full Text Available
A Generic Machine Learning Model for Spatial Query Optimization based on Spatial Embeddings

https://doi.org/10.1145/3657633

Belussi, Alberto; Migliorini, Sara; Eldawy, Ahmed (April 2024, ACM Transactions on Spatial Algorithms and Systems)

Machine learning (ML) and deep learning (DL) techniques are increasingly applied to produce efficient query optimizers, particularly for big data systems. The optimization of spatial operations is even more challenging due to the inherent complexity of such operations, like spatial join or range query, and the peculiarities of spatial data. Although a few ML-based spatial query optimizers have been proposed in the literature, their design limits their use, since each one is tailored for a specific collection of datasets, a specific operation, or a specific hardware setting. Changes to any of these require building and training a completely new model, which entails collecting a new, very large training dataset to obtain a good model. This paper proposes a different approach that exploits the novel notion of spatial embedding to overcome these limitations. In particular, a preliminary model is defined which captures the relevant features of spatial datasets, independently of the operation to be optimized and in an unsupervised manner. This model is trained with a large amount of both synthetic and real-world data, with the aim of producing meaningful spatial embeddings. The construction of an embedding model can be regarded as a preliminary step for the optimization of many different spatial operations, so the cost of building it can be recovered during the subsequent construction of specific models. Indeed, for each considered spatial operation, a specific tailored model is trained using spatial embeddings as input, so only a small number of training data points is required. Three specific operations are considered as proof of concept in this paper: range query, self-join, and binary spatial join. Finally, a comparison with an alternative technique, known as transfer learning, is provided, and the advantages of the proposed technique over it are highlighted.
Full Text Available
Polygonally Anchored Graph Drawing

https://doi.org/10.4230/LIPIcs.GD.2024.52

Chiu, Alvin; Eldawy, Ahmed; Goodrich, Michael T (January 2024, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Felsner, Stefan; Klein, Karsten (Ed.)
We investigate force-directed graph drawing techniques under the constraint that some nodes must be anchored to stay within given polygonal regions associated with them (i.e., some positional information is known). The low-energy layouts produced by such algorithms may reveal geographic information about nodes with no such knowledge a priori. Some applications of graph drawing with partial positional information include location-based social networks and rail networks, where the geographical locations need not be precise.
Full Text Available
