skip to main content


Title: GeoMatch: Efficient Large-scale Map Matching on Apache Spark
We develop GeoMatch as a novel, scalable, and efficient big-data pipeline for large-scale map matching on Apache Spark. GeoMatch improves existing spatial big-data solutions by utilizing a novel spatial partitioning scheme inspired by Hilbert space-filling curves. Thanks to its partitioning scheme, GeoMatch can effectively balance operations across different processing units and achieve significant performance gains. GeoMatch also incorporates a dynamically adjustable error-correction technique that provides robustness against positioning errors. We demonstrate the effectiveness of GeoMatch through rigorous and extensive empirical benchmarks that consider large-scale urban spatial datasets ranging from 166,253 to 3.78B location measurements. We separately assess execution performance and accuracy of map matching and develop a benchmark framework for evaluating large-scale map matching. Results of our evaluation show up to 27.25-fold performance improvements compared to previous works while achieving better processing accuracy than current solutions. We also showcase the practical potential of GeoMatch with two urban management applications. GeoMatch and our benchmark framework are open-source.  more » « less
Award ID(s):
1827505
NSF-PAR ID:
10286816
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
ACM/IMS Transactions on Data Science
Volume:
1
Issue:
3
ISSN:
2691-1922
Page Range / eLocation ID:
1 to 30
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Regional extent and spatiotemporal dynamics of Arctic permafrost disturbances remain poorly quantified. High spatial resolution commercial satellite imagery enables transformational opportunities to observe, map, and document the micro-topographic transitions occurring in Arctic polygonal tundra at multiple spatial and temporal frequencies. The entire Arctic has been imaged at 0.5 m or finer resolution by commercial satellite sensors. The imagery is still largely underutilized, and value-added Arctic science products are rare. Knowledge discovery through artificial intelligence (AI), big imagery, high performance computing (HPC) resources is just starting to be realized in Arctic science. Large-scale deployment of petabyte-scale imagery resources requires sophisticated computational approaches to automated image interpretation coupled with efficient use of HPC resources. In addition to semantic complexities, multitude factors that are inherent to sub-meter resolution satellite imagery, such as file size, dimensions, spectral channels, overlaps, spatial references, and imaging conditions challenge the direct translation of AI-based approaches from computer vision applications. Memory limitations of Graphical Processing Units necessitates the partitioning of an input satellite imagery into manageable sub-arrays, followed by parallel predictions and post-processing to reconstruct the results corresponding to input image dimensions and spatial reference. We have developed a novel high performance image analysis framework –Mapping application for Arctic Permafrost Land Environment (MAPLE) that enables the integration of operational-scale GeoAI capabilities into Arctic science applications. We have designed the MAPLE workflow to become interoperable across HPC architectures while utilizing the optimal use of computing resources.

     
    more » « less
  2. Big spatial data has become ubiquitous, from mobile applications to satellite data. In most of these applications, data is continuously growing to huge volumes. Existing systems for big spatial data organize records at either the record-level or block-level. Systems that use record-level structures include key-value stores and LSM-Tree stores, which support insert and delete operations and they are optimized for highly-selective queries. On the other hand, systems like GeoSpark that use block-level structures (e.g. 128 MB each) are more efficient for analytical queries, but they cannot incrementally maintain the partitioned data and do not support delete operations. This paper proposes a general framework that enables block-level systems to incrementally maintain spatial partitions, in the presence of bulk insertions and deletions, in distributed file system (DFS) blocks. We first formally study the incremental spatial partitioning problem for big data and demonstrate its NP-hardness. Then, we propose a cost model to estimate the performance of queries on the partitioned data and the effect of modifying it as the data grows. After that, we provide three different implementations of the incremental partitioning framework. Comprehensive experiments on large real datasets show that our proposed partitioning algorithms outperforms state-of-the-art spatial partitioning methods. 
    more » « less
  3. Urban and environmental researchers seek to obtain building features (e.g., building shapes, counts, and areas) at large scales. However, blurriness, occlusions, and noise from prevailing satellite images severely hinder the performance of image segmentation, super-resolution, or deep-learning-based translation networks. In this article, we combine globally available satellite images and spatial geometric feature datasets to create a generative modeling framework that enables obtaining significantly improved accuracy in per-building feature estimation and the generation of visually plausible building footprints. Our approach is a novel design that compensates for the degradation present in satellite images by using a novel deep network setup that includes segmentation, generative modeling, and adversarial learning for instance-level building features. Our method has proven its robustness through large-scale prototypical experiments covering heterogeneous scenarios from dense urban to sparse rural. Results show better quality over advanced segmentation networks for urban and environmental planning, and show promise for future continental-scale urban applications. 
    more » « less
  4. null (Ed.)
    We develop a new framework for designing online policies given access to an oracle providing statistical information about an off-line benchmark. Having access to such prediction oracles enables simple and natural Bayesian selection policies and raises the question as to how these policies perform in different settings. Our work makes two important contributions toward this question: First, we develop a general technique we call compensated coupling, which can be used to derive bounds on the expected regret (i.e., additive loss with respect to a benchmark) for any online policy and off-line benchmark. Second, using this technique, we show that a natural greedy policy, which we call the Bayes selector, has constant expected regret (i.e., independent of the number of arrivals and resource levels) for a large class of problems we refer to as “online allocation with finite types,” which includes widely studied online packing and online matching problems. Our results generalize and simplify several existing results for online packing and online matching and suggest a promising pathway for obtaining oracle-driven policies for other online decision-making settings. This paper was accepted by George Shanthikumar, big data analytics. 
    more » « less
  5. A skyline query searches the data points that are not dominated by others in the dataset. It is widely adopted for many applications which require multi-criteria decision making. However, skyline query processing is considerably time-consuming for a high-dimensional large scale dataset. Parallel computing techniques are therefore needed to address this challenge, among which MapReduce is one of the most popular frameworks to process big data. A great number of efficient MapReduce skyline algorithms have been proposed in the literature and most of their designs focus on partitioning and pruning the given dataset. However, there are still opportunities for further parallelism. In this study, we propose two parallel skyline processing algorithms using a novel LShape partitioning strategy and an effective Propagation Filtering method. These two algorithms are 2Phase LShape and 1Phase LShape, used for multiple reducers and single reducer, respectively. By extensive experiments, we verify that our algorithms outperformed the state-of-the-art approaches, especially for high-dimensional large scale datasets. 
    more » « less