This thesis explores geospatial vector data, which consists of geometric shapes such as points, lines, and polygons. This data is crucial in navigation, urban planning, and many other applications. Geospatial computing is a multidisciplinary field that focuses on creating techniques and tools to handle large geospatial datasets. Given the reliance on data lakes to store large datasets in their raw formats, full support for geospatial datasets is critical to enable scalable processing. To address this, we make two contributions in this area. First, we propose a column-oriented binary format called Spatial Parquet, which integrates geospatial vector data into Apache Parquet and enables significant data compression and efficient querying. Second, to improve support for semi-structured data, we introduce a distributed JSON processor for scalable SQL queries on large JSON datasets, including GeoJSON. It processes complex datasets such as OpenStreetMap and supports features such as projection and filter push-down. Advances in Deep Learning (DL), including foundation models and Large Language Models (LLMs), offer opportunities for geospatial data analysis. We make three main contributions in this area. First, we study how to design DL models that can express a wide range of geospatial functions. We explore three representations: an image-based representation using geo-referenced histograms (GeoImg), a graph-based point-set representation (GeoGraph), and a vector-based representation using a Fourier encoder (GeoVec). We formalize these representations and design corresponding models: ResNet and UNet for the first, PointNet++ for the second, and Poly2Vec with Transformers for the third. We evaluate all three approaches on four spatial problems and show their accuracy and effectiveness. Second, we create a benchmark called GS-QA for evaluating spatial question-answering with LLMs. A semi-automated process generates diverse question-answer pairs that cover various spatial objects, predicates, and complexities. We also suggest an evaluation methodology and support it with some experiments. Finally, we propose GeoGen I, a prototype for generating geospatial vector data from text prompts. It has potential for applications such as spatial interpolation, data augmentation, and change analysis. We adapt diffusion models, traditionally used for generating realistic images, as geospatial data generators. We also explore their use for similarity search through geospatial data embeddings, highlighting the potential of vector databases in this domain. This thesis advances geospatial data processing, storage, analysis, and generation, opening new research pathways in geospatial computing.
This content will become publicly available on November 2, 2026
GeoGen I: Towards General Geospatial Point Data Generation from Text
Generating realistic geospatial vector data is important for evaluating algorithms, index structures, and systems under diverse conditions. Existing synthetic data generators typically rely on simple statistical or procedural models that fail to capture the complexity of real-world spatial patterns. This paper introduces GeoGen I, a generative framework that produces geospatial point distributions from natural language prompts. The system combines contrastive learning, region context, and a diffusion-based generator to create plausible datasets. In the experiments, we test variations of the model and provide both qualitative and quantitative evaluations. Our experiments show that it can generate spatial patterns aligned with different prompts. While the results are promising, many challenges still remain, including in dataset curation and quality, and the model’s ability to capture subtle geospatial constraints.
- Award ID(s):
- 2215705
- PAR ID:
- 10656402
- Publisher / Repository:
- ACM
- Date Published:
- Page Range / eLocation ID:
- 70 to 73
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The Hawaiian Islands have been employed as a model system to reconstruct agroecological extents of traditional Polynesian agricultural production systems. However, the reliability of previously modeled agricultural extents is unknown due to limitations in empirical evidence to assess accuracy. Utilizing a geospatial database of 8,561 archaeological sites compiled by the Hawaiʻi State Historic Preservation Department (SHPD), this research assessed the accuracy and reliability of three spatial models that estimate the extents of traditional Hawaiian agricultural systems. The results of the model sensitivity assessment indicate the three geospatial models capture the spatial patterns and relative extents of intensive agricultural systems with substantial infrastructure, while additional work is needed to assess reliability of modeled agricultural systems with more indefinite infrastructure.
-
Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for column data storage that provides several of these benefits out of the box. However, geospatial data is not readily supported by Parquet. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial geometries in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized to store geospatial coordinates represented in floating-point format. Third, it adds a light-weight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the light-weight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format.
-
In landscape planning and design, geospatial technologies (GSTs) are used to aid in visualizing and interpreting geographic environments, identifying geospatial patterns, and making decisions around information based on maps and geospatial information. GSTs are related to the different tools and technologies used to represent the earth’s surface and have transformed the practice of landscape design and geospatial education. These technologies play an important role in promoting the development and application of STEM-relevant geospatial thinking. Curricula that incorporate GSTs have been used across educational levels, from elementary school through college, and have been shown to support the development of geospatial learning and understanding. The present work discusses the use of one type of GST, virtual globes, as a tool for developing geospatial thinking, with a specific focus on Google Earth. This review highlights outcomes of several studies using Google Earth in the context of disciplines related to landscape design, such as geography and earth science. Furthermore, the potential mechanisms underlying the effectiveness of this technology for supporting the development of geospatial knowledge, such as its role in facilitating data visualization and supporting students’ ability to think flexibly about spatial patterns and relations, are discussed. Finally, the limitations of the current research on Google Earth as a tool for supporting geospatial learning are discussed, and suggestions for future research are provided.
-
Humans subconsciously engage in geospatial reasoning when reading articles. We recognize place names and their spatial relations in text and mentally associate them with their physical locations on Earth. Although pretrained language models can mimic this cognitive process using linguistic context, they do not utilize valuable geospatial information in large, widely available geographical databases, e.g., OpenStreetMap. This paper introduces GeoLM, a geospatially grounded language model that enhances the understanding of geo-entities in natural language. GeoLM leverages geo-entity mentions as anchors to connect linguistic information in text corpora with geospatial information extracted from geographical databases. GeoLM connects the two types of context through contrastive learning and masked language modeling. It also incorporates a spatial coordinate embedding mechanism to encode distance and direction relations to capture geospatial context. In the experiment, we demonstrate that GeoLM exhibits promising capabilities in supporting toponym recognition, toponym linking, relation extraction, and geo-entity typing, which bridge the gap between natural language processing and geospatial sciences. The code is publicly available at https://github.com/knowledge-computing/geolm.
