skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Griddle: Effective Query Support over Voluminous Gridded Spatial Datasets
Gridded datasets occur in several domains. These datasets comprise (un)structured grid points, where each grid point is characterized by XY(Z) coordinates in a spatial referencing system. The data available at individual grid points are high-dimensional encapsulating multiple variables of interest. This study has two thrusts. The first targets supporting effective management of voluminous gridded datasets while reconciling challenges relating to colocation and dispersion. The second thrust is to support sliding (temporal) window queries over the gridded dataset. Such queries involve sliding a temporal window over the data to identify spatial locations and chronological time points where the specified predicate evaluates to true. Our methodology includes support for a space-efficient data structure for organizing information within the data, query decomposition based on dyadic intervals, support for temporal anchoring, query transformations, and effective evaluation of query predicates. Our empirical benchmarks are conducted on representative voluminous high dimensional datasets such as gridMET (historical meteorological data) and MACA (future climate datasets based on the RCP 8.5 greenhouse gas trajectory). In our benchmarks, our system can handle throughputs of over 3000 multi-predicate sliding window queries per second.  more » « less
Award ID(s):
1931363
PAR ID:
10448813
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
2022 IEEE International Conference on Big Data (Big Data)
Page Range / eLocation ID:
792 to 799
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gridded spatial datasets arise naturally in environmental, climatic, meteorological, and ecological settings. Each grid point encapsulates a vector of variables representing different measures of interest. Gridded datasets tend to be voluminous since they encapsulate observations for long timescales. Visualizing such datasets poses significant challenges stemming from the need to preserve interactivity, manage I/O overheads, and cope with data volumes. Here we present our methodology to significantly alleviate I/O requirements by leveraging deep neural network-based models and a distributed, in-memory cache to facilitate interactive visualizations. Our benchmarks demonstrate that deploying our lightweight models coupled with back-end caching and prefetching schemes can reduce the client's query response time by 92.3% while maintaining a high perceptual quality with a PSNR (peak signal-to-noise ratio) of 38.7 dB. 
    more » « less
  2. Summary Remote sensing of plant traits and their environment facilitates non‐invasive, high‐throughput monitoring of the plant's physiological characteristics. However, voluminous observational data generated by such autonomous sensor networks overwhelms scientific users when they have to analyze the data. In order to provide a scalable and effective analysis environment, there is a need for storage and analytics that support high‐throughput data ingestion while preserving spatiotemporal and sensor‐specific characteristics. Also, the framework should enable modelers and scientists to run their analytics while coping with the fast and continuously evolving nature of the dataset. In this paper, we present Radix+ , a high‐throughput distributed data storage system for supporting scalable georeferencing, and interactive query‐based spatiotemporal analytics with trackable data integrity. We include empirical evaluations performed on a commodity machine cluster with up to 1 TB of data. Our benchmarks demonstrate subsecond latency for majority of our evaluated queries and improvement in data ingestion rate over systems such as Geomesa. 
    more » « less
  3. Gridded spatial datasets arise naturally in environmental, climatic, meteorological, and ecological settings. Each grid point encapsulates a vector of variables representing different measures of interest. Gridded datasets tend to be voluminous since they encapsulate observations for long timescales. Visualizing such datasets poses significant challenges stemming from the need to preserve interactivity, manage I/O overheads, and cope with data volumes. Here we present our methodology to significantly alleviate I/O requirements by leveraging deep neural network-based models. 
    more » « less
  4. We consider accelerating machine learning (ML) inference queries on unstructured datasets. Expensive operators such as feature extractors and classifiers are deployed as user-defined functions (UDFs), which are not penetrable with classic query optimization techniques such as predicate push-down. Recent optimization schemes (e.g., Probabilistic Predicates or PP) assume independence among the query predicates, build a proxy model for each predicate offline, and rewrite a new query by injecting these cheap proxy models in the front of the expensive ML UDFs. In such a manner, unlikely inputs that do not satisfy query predicates are filtered early to bypass the ML UDFs. We show that enforcing the independence assumption in this context may result in sub-optimal plans. In this paper, we propose CORE, a query optimizer that better exploits the predicate correlations and accelerates ML inference queries. Our solution builds the proxy models online for a new query and leverages a branch-and-bound search process to reduce the building costs. Results on three real-world text, image and video datasets show that CORE improves the query throughput by up to 63% compared to PP and up to 80% compared to running the queries as it is. 
    more » « less
  5. We are being constantly judged by automated decision systems that have been widely criticised for being discriminatory and unfair. Since an algorithm is only as good as the data it works with, biases in the data can significantly amplify unfairness issues. In this paper, we take initial steps towards integrating fairness conditions into database query processing and data management systems. Specifically, we focus on selection bias in range queries. We formally define the problem of fairness-aware range queries as obtaining a fair query which is most similar to the user's query. We propose a sub-linear time algorithm for single-predicate range queries and efficient algorithms for multi-predicate range queries. Our empirical evaluation on real and synthetic datasets confirms the effectiveness and efficiency of our proposal. 
    more » « less