skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: RDPro : Distributed Processing of Big Raster Data: [Scalable Data Science]
Advancements in remote sensing technology allowed for collecting vast amounts of satellite and aerial imagery with up to 1 cm pixel resolutions, stored in raster format crucial for various research fields. However, processing this data poses challenges, including resolving data dependencies when location, resolution, and coordinate systems do not align and managing large datasets within memory constraints. This paper introduces RDPro, a novel Spark-based system that efficiently processes and analyzes large raster datasets. RDPro features a new data model tailored for data dependencies in a distributed, shared-nothing environment, complete with tools for loading and writing raster data. It also optimizes core raster operations within Spark, allowing users to integrate complex data science workflows. Comparative analysis shows RDPro outperforms existing systems by up to two orders of magnitude.  more » « less
Award ID(s):
2046236
PAR ID:
10611919
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM Digital Library
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
18
Issue:
3
ISSN:
2150-8097
Page Range / eLocation ID:
613 to 622
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Processing Earth observation data modelled in a time-series of raster format is critical to solving some of the most complex problems in geospatial science ranging from climate change to public health. Researchers are increasingly working with these large raster datasets that are often terabytes in size. At this scale, traditional GIS methods may fail to handle the processing, and new approaches are needed to analyse these datasets. The objective of this work is to develop methods to interactively analyse big raster datasets with the goal of most efficiently extracting vector data over specific time periods from any set of raster data. In this paper, we describe RINX (Raster INformation eXtraction) which is an end-to-end solution for automatic extraction of information from large raster datasets. RINX heavily utilises open source geospatial techniques for information extraction. It also complements traditional approaches with state-of-the- art high-performance computing techniques. This paper discusses details of achieving big temporal data extraction with RINX, implemented on the use case of air quality and climate data extraction for long term health studies, which includes methods used, code developed, processing time statistics, project conclusions, and next steps. 
    more » « less
  2. We propose Clamor, a functional cluster computing framework that adds support for fine-grained, transparent access to global variables for distributed, data-parallel tasks. Clamor targets workloads that perform sparse accesses and updates within the bulk synchronous parallel execution model, a setting where the standard technique of broadcasting global variables is highly inefficient. Clamor implements a novel dynamic replication mechanism in order to enable efficient access to popular data regions on the fly, and tracks finegrained dependencies in order to retain the lineage-based fault tolerance model of systems like Spark. Clamor can integrate with existing Rust and C++ libraries to transparently distribute programs on the cluster. We show that Clamor is competitive with Spark in simple functional workloads and can improve performance significantly compared to custom systems on workloads that sparsely access large global variables: from 5x for sparse logistic regression to over 100x on distributed geospatial queries. 
    more » « less
  3. Abstract Spatial transcriptomic studies are becoming increasingly common and large, posing important statistical and computational challenges for many analytic tasks. Here, we present SPARK-X, a non-parametric method for rapid and effective detection of spatially expressed genes in large spatial transcriptomic studies. SPARK-X not only produces effective type I error control and high power but also brings orders of magnitude computational savings. We apply SPARK-X to analyze three large datasets, one of which is only analyzable by SPARK-X. In these data, SPARK-X identifies many spatially expressed genes including those that are spatially expressed within the same cell type, revealing new biological insights. 
    more » « less
  4. In the era of big data and cloud computing, large amounts of data are generated from user applications and need to be processed in the datacenter. Data-parallel computing frameworks, such as Apache Spark, are widely used to perform such data processing at scale. Specifically, Spark leverages distributed memory to cache the intermediate results, represented as Resilient Distributed Datasets (RDDs). This gives Spark an advantage over other parallel frameworks for implementations of iterative machine learning and data mining algorithms, by avoiding repeated computation or hard disk accesses to retrieve RDDs. By default, caching decisions are left at the programmer’s discretion, and the LRU policy is used for evicting RDDs when the cache is full. However, when the objective is to minimize total work, LRU is woefully inadequate, leading to arbitrarily suboptimal caching decisions. In this paper, we design an algorithm for multi-stage big data processing platforms to adaptively determine and cache the most valuable intermediate datasets that can be reused in the future. Our solution automates the decision of which RDDs to cache: this amounts to identifying nodes in a direct acyclic graph (DAG) representing computations whose outputs should persist in the memory. Our experiment results show that our proposed cache optimization solution can improve the performance of machine learning applications on Spark decreasing the total work to recompute RDDs by 12%. 
    more » « less
  5. Significant increase in high-resolution satellite data requires more productive analysis methods to benefit data scientists. Interactive exploration is essential to productivity since it keeps the user en- gaged by providing quick responses. This paper addresses the pro- gressive zonal statistics problem that given big satellite data, an aggregate function, and a set of query polygons, zonal statistics computes the aggregate function for each query polygon over raster data. Efficiently querying complex polygons, reading high resolu- tion pixels and process multiple polygons simultaneously are three main challenges. This work introduces Viper, an interactive explo- ration pipeline to overcome these challenges and achieve require- ments. Viper uses a raster-vector index to bootstrap the answer with an accurate result in a short time. Then, it progressively refines the answer using a priority processing algorithm to produce the final answer. Experiments on large-scale real data show that Viper can reach 90% accuracy or higher up-to two orders of magnitude faster than baseline algorithms. 
    more » « less