NSF PAR Search | NSF Public Access Repository

GeoGen I: Towards General Geospatial Point Data Generation from Text

https://doi.org/10.1145/3764921.3770154

Saeedan, Majid; Eldawy, Ahmed (November 2025, ACM)

Generating realistic geospatial vector data is important for evaluating algorithms, index structures, and systems under diverse conditions. Existing synthetic data generators typically rely on simple statistical or procedural models that fail to capture the complexity of real-world spatial patterns. This paper introduces GeoGen I, a generative framework that produces geospatial point distributions from natural language prompts. The system combines contrastive learning, region context, and a diffusion-based generator to create plausible datasets. In the experiments, we test variations of the model and provide both qualitative and quantitative evaluations. Our experiments show that it can generate spatial patterns aligned with different prompts. While the results are promising, many challenges still remain, including in dataset curation and quality, and the model’s ability to capture subtle geospatial constraints.
Free, publicly-accessible full text available November 2, 2026
Geospatial Computing from Data Lakes to Deep Learning Applications

Saeedan, Majid (June 2025, UC Riverside)

This thesis explores geospatial vector data, including geometric shapes such as points, lines, and polygons. This data is crucial in navigation, urban planning, and many more applications. Geospatial computing is a multidisciplinary field that focuses on creating techniques and tools to handle large geospatial datasets. Given the reliance on data lakes to store large data sets in their raw formats, it is critical to have full support for geospatial datasets to enable scalable processing. To address this, we make two contributions in this area. First, we propose a column-oriented binary format called Spatial Parquet, which integrates geospatial vector data into Apache Parquet that enables significant data compression and efficient querying. Second, to improve support for semi-structured data, we introduce a distributed JSON processor for scalable SQL queries on large JSON datasets, including GeoJSON. It processes complex datasets like Open Street Map with features such as projection and filter push-down. Advances in Deep Learning (DL), including foundation models and Large Language Models (LLMs), offer opportunities for geospatial data analysis. We make three main contributions in this area. First, we study how to design DL models that can express a wide range of geospatial functions. We explore three representations: an image-based representation using geo-referenced histograms (GeoImg), a graph-based point-set representation (GeoGraph), and a vector-based representation using a Fourier encoder (GeoVec). We formalize these representations and design corresponding models: ResNet and UNet for the first, PointNet++ for the second, and Poly2Vec with Transformers for the third. We evaluate all approaches on four spatial problems, showing the accuracy and effectiveness of the three approaches. Second, we create a benchmark called GS-QA for evaluating spatial question-answering with LLMs. A semi-automated process generates diverse question-answer pairs that cover various spatial objects, predicates, and complexities. An evaluation methodology is suggested with some experiments. Finally, a prototype for generating geospatial vector data from text prompts, called GeoGen I, is proposed. It has potential for applications such as spatial interpolation, data augmentation, and change analysis. We adapt diffusion models, traditionally used for generating realistic images, as geospatial data generators. We also explore their use for similarity search through geospatial data embeddings, highlighting the potential of vector databases in this domain. This thesis advances geospatial data processing, storage, analysis, and generation, opening new research pathways in geospatial computing.
Free, publicly-accessible full text available June 20, 2026
Spatial parquet: a column file format for geospatial data lakes

https://doi.org/10.1145/3557915.3561038

Saeedan, Majid; Eldawy, Ahmed (November 2022, The 30th International Conference on Advances in Geographic Information Systems)

Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for column data storage that provides several of these benefits out of the box. However, geospatial data is not readily supported by Parquet. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial geometries in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized to efficiently store geospatial coordinates stored in floating-point format. Third, it adds a light-weight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the light-weight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format.
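The abstract above does not spell out how FP-delta works, so the following is only a rough illustration of the general idea of lossless delta encoding over floating-point coordinates, not the paper's method. It assumes that consecutive coordinates are close in value, reinterprets each 64-bit float as an integer, and stores differences between neighbors. All function names are hypothetical.

```python
import struct

def float_to_bits(x):
    """Reinterpret a 64-bit float as a signed 64-bit integer (same bytes)."""
    return struct.unpack("<q", struct.pack("<d", x))[0]

def bits_to_float(b):
    """Inverse of float_to_bits."""
    return struct.unpack("<d", struct.pack("<q", b))[0]

def delta_encode(values):
    """Keep the first bit pattern, then store neighbor-to-neighbor differences."""
    bits = [float_to_bits(v) for v in values]
    return bits[:1] + [cur - prev for prev, cur in zip(bits, bits[1:])]

def delta_decode(deltas):
    """Rebuild the original floats with a running sum over the deltas."""
    values, acc = [], 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        values.append(bits_to_float(acc))
    return values

# Illustrative longitudes of nearby points (made-up sample values).
longitudes = [-117.3961, -117.3958, -117.3955, -117.3951]
deltas = delta_encode(longitudes)
restored = delta_decode(deltas)
```

Because nearby coordinates share their sign, exponent, and leading mantissa bits, the deltas after the first entry are much smaller integers than the raw bit patterns, which is what makes them cheaper to store. The round trip is exact since only integer arithmetic is applied to the bit patterns.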
dsJSON: A Distributed SQL JSON Processor

https://doi.org/10.1145/3588957

Saeedan, Majid; Eldawy, Ahmed; Zhao, Zhijia (May 2023, Proceedings of the ACM on Management of Data)

The popularity of JSON as a data interchange format resulted in big amounts of datasets available for processing. Users would like to analyze this data using SQL queries but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers.
Beast: Scalable Exploratory Analytics on Spatio-temporal Data

https://doi.org/10.1145/3459637.3481897

Eldawy, Ahmed; Hristidis, Vagelis; Ghosh, Saheli; Saeedan, Majid; Sevim, Akil; Siddique, A.B.; Singla, Samriddhi; Sivaram, Ganesh; Vu, Tin; Zhang, Yaming (October 2021, Conference on Information and Knowledge Management (CIKM))

Full Text Available
