With recent advancements, large language models (LLMs) such as ChatGPT and Bard have shown the potential to disrupt many industries, from customer service to healthcare. Traditionally, humans interact with geospatial data through software (e.g., ArcGIS 10.3) and programming languages (e.g., Python). As a pioneering study, we explore the possibility of using an LLM as an interface for interacting with geospatial datasets through natural language. To achieve this, we propose a framework to (1) train an LLM to understand the datasets, (2) generate geospatial SQL queries based on a natural language question, (3) send the SQL query to the backend database, and (4) parse the database response back into human language. As a proof of concept, a case study was conducted on real-world data to evaluate its performance on various queries. The results show that LLMs can generate accurate SQL code for most cases, including spatial joins, although there is still room for improvement. As all geospatial data can be stored in a spatial database, we hope that this framework can serve as a proxy to improve the efficiency of spatial data analyses and unlock the possibility of automated geospatial analytics.
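The four framework steps lend themselves to a compact pipeline. The sketch below is illustrative only: it assumes a PostGIS backend, a generic `llm` callable, and hypothetical prompts and table names, none of which come from the paper itself.

```python
# Illustrative sketch of the four-step framework above, assuming a PostGIS
# backend and a generic `llm` callable; prompt wording, schema, and table
# names are hypothetical, not the paper's actual implementation.
import psycopg2  # standard PostgreSQL/PostGIS driver

def ask_geospatial_question(question: str, llm, schema_description: str, conn) -> str:
    # (1)-(2): describe the dataset schema to the LLM and request SQL only.
    prompt = (
        "You write PostGIS SQL. Use only the tables described below.\n"
        f"Schema:\n{schema_description}\n"
        f"Question: {question}\n"
        "Return a single SQL query and nothing else."
    )
    sql = llm(prompt)

    # (3): run the generated query against the backend spatial database.
    with conn.cursor() as cur:
        cur.execute(sql)
        rows = cur.fetchall()

    # (4): ask the LLM to phrase the raw rows as a natural-language answer.
    return llm(f"Question: {question}\nQuery result: {rows}\nAnswer briefly.")

# conn = psycopg2.connect("dbname=gis user=gis")  # hypothetical connection
# A spatial-join question such as "How many bus stops are in each district?"
# would typically be translated into SQL like:
#   SELECT d.name, COUNT(*) FROM districts d
#   JOIN bus_stops b ON ST_Contains(d.geom, b.geom) GROUP BY d.name;
```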
This content will become publicly available on June 20, 2026
Geospatial Computing from Data Lakes to Deep Learning Applications
This thesis explores geospatial vector data, including geometric shapes such as points, lines, and polygons. This data is crucial in navigation, urban planning, and many more applications. Geospatial computing is a multidisciplinary field that focuses on creating techniques and tools to handle large geospatial datasets. Given the reliance on data lakes to store large datasets in their raw formats, it is critical to have full support for geospatial datasets to enable scalable processing. To address this, we make two contributions in this area. First, we propose a column-oriented binary format called Spatial Parquet, which integrates geospatial vector data into Apache Parquet and enables significant data compression and efficient querying. Second, to improve support for semi-structured data, we introduce a distributed JSON processor for scalable SQL queries on large JSON datasets, including GeoJSON. It processes complex datasets such as OpenStreetMap with features such as projection and filter push-down. Advances in Deep Learning (DL), including foundation models and Large Language Models (LLMs), offer opportunities for geospatial data analysis. We make three main contributions in this area. First, we study how to design DL models that can express a wide range of geospatial functions. We explore three representations: an image-based representation using geo-referenced histograms (GeoImg), a graph-based point-set representation (GeoGraph), and a vector-based representation using a Fourier encoder (GeoVec). We formalize these representations and design corresponding models: ResNet and UNet for the first, PointNet++ for the second, and Poly2Vec with Transformers for the third. We evaluate all approaches on four spatial problems, showing the accuracy and effectiveness of the three representations. Second, we create a benchmark called GS-QA for evaluating spatial question-answering with LLMs. A semi-automated process generates diverse question-answer pairs that cover various spatial objects, predicates, and complexities. An evaluation methodology is proposed and demonstrated with experiments. Finally, a prototype for generating geospatial vector data from text prompts, called GeoGen I, is proposed. It has potential for applications such as spatial interpolation, data augmentation, and change analysis. We adapt diffusion models, traditionally used for generating realistic images, as geospatial data generators. We also explore their use for similarity search through geospatial data embeddings, highlighting the potential of vector databases in this domain. This thesis advances geospatial data processing, storage, analysis, and generation, opening new research pathways in geospatial computing.
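To give a flavor of the vector-based (Fourier) representation mentioned above, the following sketch encodes 2-D coordinates with sinusoidal features that a Transformer could consume vertex-by-vertex. The frequency ladder and dimensions are assumptions for illustration, not the thesis's actual Poly2Vec design.

```python
# Hedged sketch of a Fourier-style coordinate encoder in the spirit of the
# GeoVec representation described above; parameters are illustrative only.
import numpy as np

def fourier_encode(coords: np.ndarray, num_freqs: int = 8) -> np.ndarray:
    """Map (N, 2) lon/lat pairs to (N, 4 * num_freqs) Fourier feature vectors."""
    freqs = 2.0 ** np.arange(num_freqs)                 # geometric frequency ladder
    angles = coords[:, :, None] * freqs[None, None, :]  # (N, 2, num_freqs)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(coords.shape[0], -1)           # flatten per vertex

# A polygon ring can then be encoded vertex-by-vertex and fed to a Transformer.
ring = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
print(fourier_encode(ring).shape)  # (4, 32)
```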
- Award ID(s):
- 2046236
- PAR ID:
- 10611925
- Publisher / Repository:
- UC Riverside
- Date Published:
- Format(s):
- Medium: X
- Institution:
- UC Riverside
- Sponsoring Org:
- National Science Foundation
More Like this
Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for column data storage that provides several of these benefits out of the box. However, geospatial data is not readily supported by Parquet. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial geometries in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized to efficiently store geospatial coordinates stored in floating-point format. Third, it adds a lightweight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the lightweight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format.
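The intuition behind delta-encoding floating-point coordinates can be sketched briefly: nearby vertices have nearly identical bit patterns, so consecutive differences are small and compress well, while the round trip stays exact. The sketch below conveys that idea only; the published FP-delta bit layout and per-column handling are not reproduced here.

```python
# Rough, lossless delta-style encoding of float64 coordinates, sketched to
# illustrate the general idea; not the actual Spatial Parquet FP-delta format.
import numpy as np

def fp_delta_encode(coords: np.ndarray) -> np.ndarray:
    bits = coords.view(np.int64)              # reinterpret float64 bit patterns
    return np.diff(bits, prepend=0)           # nearby coordinates give small deltas

def fp_delta_decode(deltas: np.ndarray) -> np.ndarray:
    return np.cumsum(deltas).view(np.float64)  # exact round trip (lossless)

xs = np.array([-117.3755, -117.3754, -117.3752, -117.3751])
assert np.array_equal(fp_delta_decode(fp_delta_encode(xs)), xs)
```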
Sun, Xiaoyong (Ed.)
Convolutional neural network (CNN)-based deep learning (DL) methods have transformed the analysis of geospatial, Earth observation, and geophysical data due to their ability to model spatial context information at multiple scales. Such methods are especially applicable to pixel-level classification or semantic segmentation tasks. A variety of R packages have been developed for processing and analyzing geospatial data. However, there are currently no packages available for implementing geospatial DL in the R language and data science environment. This paper introduces the geodl R package, which supports pixel-level classification applied to a wide range of geospatial or Earth science data that can be represented as multidimensional arrays where each channel or band holds a predictor variable. geodl is built on the torch package, which supports the implementation of DL using the R and C++ languages without the need for installing a Python/PyTorch environment. This greatly simplifies the software environment needed to implement DL in R. Using geodl, geospatial raster-based data with varying numbers of bands, spatial resolutions, and coordinate reference systems are read and processed using the terra package, which makes use of C++ and allows for processing raster grids that are too large to fit into memory. Training loops are implemented with the luz package. The geodl package provides utility functions for creating raster masks or labels from vector-based geospatial data and image chips and associated masks from larger files and extents. It also defines a torch dataset subclass for geospatial data for use with torch dataloaders. UNet-based models are provided with a variety of optional ancillary modules or modifications. Common assessment metrics (i.e., overall accuracy, class-level recalls or producer's accuracies, class-level precisions or user's accuracies, and class-level F1-scores) are implemented along with a modified version of the unified focal loss framework, which allows for defining a variety of loss metrics using one consistent implementation and set of hyperparameters. Users can assess models using standard geospatial and remote sensing metrics and methods and use trained models to predict to large spatial extents. This paper introduces the geodl workflow, design philosophy, and goals for future development.
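geodl itself is an R package, so the following is purely a language-agnostic illustration of the class-level assessment metrics it reports (overall accuracy, producer's/user's accuracies, F1), expressed here in Python from a confusion matrix; it is not geodl's API.

```python
# Hypothetical illustration: class-level metrics derived from a confusion matrix,
# mirroring the metrics listed above (overall accuracy, recall = producer's
# accuracy, precision = user's accuracy, F1). Not geodl code.
import numpy as np

def classwise_metrics(confusion: np.ndarray):
    """confusion[i, j] = pixels of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    recall = tp / confusion.sum(axis=1)     # producer's accuracy per class
    precision = tp / confusion.sum(axis=0)  # user's accuracy per class
    f1 = 2 * precision * recall / (precision + recall)
    overall = tp.sum() / confusion.sum()
    return overall, recall, precision, f1

cm = np.array([[50, 5], [10, 35]])
print(classwise_metrics(cm))
```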
Computational models of verbal analogy and relational similarity judgments can employ different types of vector representations of word meanings (embeddings) generated by machine-learning algorithms. An important question is whether human-like relational processing depends on explicit representations of relations (i.e., representations separable from those of the concepts being related), or whether implicit relation representations suffice. Earlier machine-learning models produced static embeddings for individual words, identical across all contexts. However, more recent Large Language Models (LLMs), which use transformer architectures applied to much larger training corpora, are able to produce contextualized embeddings that have the potential to capture implicit knowledge of semantic relations. Here we compare multiple models based on different types of embeddings to human data concerning judgments of relational similarity and solutions of verbal analogy problems. For two datasets, a model that learns explicit representations of relations, Bayesian Analogy with Relational Transformations (BART), captured human performance more successfully than either a model using static embeddings (Word2vec) or models using contextualized embeddings created by LLMs (BERT, RoBERTa, and GPT-2). These findings support the proposal that human thinking depends on representations that separate relations from the concepts they relate.
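For concreteness, the implicit-relation baseline discussed above is often operationalized as a difference vector between two word embeddings, with relational similarity measured as the cosine between difference vectors. The sketch below uses placeholder random vectors; BART's explicit relation representations work differently.

```python
# Minimal sketch of the implicit (static-embedding) relation baseline:
# relation(A, B) ~ emb(B) - emb(A); relational similarity = cosine of deltas.
# Embeddings here are random placeholders, not Word2vec or LLM vectors.
import numpy as np

def relational_similarity(a, b, c, d, emb):
    """Cosine similarity between the A->B and C->D relation vectors."""
    r1 = emb[b] - emb[a]
    r2 = emb[d] - emb[c]
    return float(r1 @ r2 / (np.linalg.norm(r1) * np.linalg.norm(r2)))

rng = np.random.default_rng(0)
emb = {w: rng.normal(size=50) for w in ["king", "queen", "man", "woman"]}
print(relational_similarity("man", "woman", "king", "queen", emb))
```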
Modern machine learning (ML) and deep learning (DL) techniques using high-dimensional data representations have helped accelerate the materials discovery process by efficiently detecting hidden patterns in existing datasets and linking input representations to output properties for a better understanding of the scientific phenomenon. While a deep neural network composed of fully connected layers has been widely used for materials property prediction, simply creating a deeper model with a large number of layers often faces the vanishing gradient problem, causing a degradation in performance and thereby limiting usage. In this paper, we study and propose architectural principles to address the question of improving the performance of model training and inference under fixed parametric constraints. Here, we present a general deep-learning framework based on branched residual learning (BRNet) with fully connected layers that can work with any numerical vector-based representation as input to build accurate models to predict materials properties. We perform model training for materials properties using numerical vectors representing different composition-based attributes of the respective materials and compare the performance of the proposed models against traditional ML and existing DL architectures. We find that the proposed models are significantly more accurate than the ML/DL models for all data sizes by using different composition-based attributes as input. Further, branched learning requires fewer parameters and results in faster model training due to better convergence during the training phase than existing neural networks, thereby efficiently building accurate models for predicting materials properties.
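The role of residual connections in mitigating vanishing gradients for deep fully connected stacks can be shown with a small sketch. Layer widths, depth, and the block structure below are assumptions for illustration, not the published BRNet architecture.

```python
# Hedged sketch of a residual fully connected regressor of the kind branched
# residual learning builds on; sizes and structure are illustrative only.
import torch
import torch.nn as nn

class ResidualFCBlock(nn.Module):
    def __init__(self, width: int):
        super().__init__()
        self.fc1 = nn.Linear(width, width)
        self.fc2 = nn.Linear(width, width)
        self.act = nn.ReLU()

    def forward(self, x):
        # The skip connection lets gradients bypass the stacked layers,
        # easing the vanishing-gradient problem in deep FC networks.
        return self.act(x + self.fc2(self.act(self.fc1(x))))

class PropertyRegressor(nn.Module):
    def __init__(self, in_dim: int, width: int = 256, depth: int = 8):
        super().__init__()
        self.stem = nn.Linear(in_dim, width)
        self.blocks = nn.Sequential(*[ResidualFCBlock(width) for _ in range(depth)])
        self.head = nn.Linear(width, 1)  # single material property

    def forward(self, x):
        return self.head(self.blocks(torch.relu(self.stem(x))))

model = PropertyRegressor(in_dim=86)   # e.g., an 86-dim composition vector
y = model(torch.randn(4, 86))          # (batch, 1) predicted property
```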