Search for: All records

Award ID contains: 1954644

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

  1. Abstract The Doubly Connected Edge List (DCEL) is an edge-list structure widely used in spatial applications, primarily for planar topological and geometric computations. However, it is also applicable to various types of data, including 3D models and geographic data. An essential operation is the overlay operation, which combines the DCELs of two input polygon layers and can easily support spatial queries on polygons, such as the intersection, union, and difference between these layers. However, existing techniques for spatial overlay operations suffer from two main limitations. First, they fail to handle many of the large datasets used in real applications. Second, they cannot handle arbitrary spatial lines that in practice form polygons (e.g., city blocks) but are given as a set of scattered lines. This work proposes a distributed and scalable way to compute the overlay operation and its related supported queries. Our operations also support arbitrary spatial lines through a scalable polygonization process. We address the issues of efficiently distributing the lines and overlay operators and offer various optimizations that improve performance. Our experiments demonstrate that the proposed scalable solution can efficiently compute the overlay of large real datasets. 
    Free, publicly-accessible full text available July 1, 2026
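
As a rough illustration of the data structure this abstract refers to, the sketch below builds the half-edge records of a DCEL for a single simple polygon and walks a face boundary. It is a minimal, single-machine sketch and not the distributed overlay algorithm of the paper; the names `HalfEdge`, `build_polygon_dcel`, and `face_boundary` are hypothetical, and the input is assumed to be a simple polygon with vertices in counter-clockwise order.

```python
class HalfEdge:
    """One directed side of an edge (hypothetical minimal record)."""

    def __init__(self, origin, face):
        self.origin = origin  # (x, y) where this half-edge starts
        self.face = face      # label of the face on its left
        self.twin = None      # opposite half-edge of the same edge
        self.next = None      # next half-edge around the same face
        self.prev = None      # previous half-edge around the same face


def build_polygon_dcel(vertices, face_label):
    """Build half-edges for one simple polygon given in CCW order.

    Returns the list of inner half-edges; the outer ones are reachable
    through the `twin` pointers.
    """
    n = len(vertices)
    inner = [HalfEdge(vertices[i], face_label) for i in range(n)]
    # The twin of edge i runs from vertex i+1 back to vertex i.
    outer = [HalfEdge(vertices[(i + 1) % n], "outer") for i in range(n)]
    for i in range(n):
        inner[i].twin, outer[i].twin = outer[i], inner[i]
        inner[i].next = inner[(i + 1) % n]
        inner[i].prev = inner[(i - 1) % n]
        # Outer half-edges traverse the boundary in the opposite direction.
        outer[i].next = outer[(i - 1) % n]
        outer[i].prev = outer[(i + 1) % n]
    return inner


def face_boundary(start):
    """Collect the origins met while following `next` pointers once around."""
    points = []
    edge = start
    while True:
        points.append(edge.origin)
        edge = edge.next
        if edge is start:
            break
    return points


square = [(0, 0), (1, 0), (1, 1), (0, 1)]
edges = build_polygon_dcel(square, "block_A")
print(face_boundary(edges[0]))       # boundary of the polygon face
print(face_boundary(edges[0].twin))  # same boundary, seen from the outer face
```

An overlay of two layers would additionally split half-edges at intersection points and relabel faces with the pair of input faces they come from; that step is omitted here.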
  2. Abstract Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner. 
  3. Abstract We introduce the Reverse Spatial Top-k Keyword (RSK) query, which is defined as: given a query term q, an integer k, and a neighborhood size, find all the neighborhoods of that size where q is among the top-k most frequent terms in the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of a given cell size and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely more result neighborhoods by minimally shifting the square without including more posts in it. To address that, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency against baseline approaches. We also provide a theoretical model to choose the optimal cell size for the index to minimize query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficient approximate algorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smart load slicing technique. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets and crime report datasets shows the efficiency of our optimizations and the accuracy of the proposed theoretical model. 
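
The "obvious approach" mentioned in this abstract (a uniform grid whose cells are checked for the query term) can be sketched as follows. This is only the incomplete grid-aligned baseline, not the proposed RSK algorithm. The function name `grid_baseline_rsk`, the `(x, y, terms)` post format, counting every term occurrence, and the tie rule (q qualifies if fewer than k terms are strictly more frequent) are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def grid_baseline_rsk(posts, q, k, cell_size):
    """Return grid cells (col, row) in which q is among the top-k terms.

    posts: iterable of (x, y, terms) tuples, where terms is a list of strings.
    """
    cells = defaultdict(Counter)
    for x, y, terms in posts:
        key = (math.floor(x / cell_size), math.floor(y / cell_size))
        cells[key].update(terms)  # assumption: count every occurrence

    answer = []
    for key, freq in cells.items():
        if freq[q] == 0:
            continue  # q never appears in this cell
        # Assumed tie rule: q is in the top-k if fewer than k terms beat it.
        strictly_higher = sum(1 for count in freq.values() if count > freq[q])
        if strictly_higher < k:
            answer.append(key)
    return sorted(answer)


posts = [
    (0.5, 0.5, ["fire", "rain"]),
    (0.7, 0.2, ["fire"]),
    (1.5, 0.5, ["rain", "sun"]),
    (1.2, 0.9, ["rain"]),
    (1.1, 0.1, ["sun", "fire"]),
]
print(grid_baseline_rsk(posts, "fire", k=1, cell_size=1.0))
```

Because only grid-aligned squares are tested, a neighborhood that straddles two cells is never reported, which is the incompleteness the abstract points out.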
  4. We present a scalable approach for identifying moving flock patterns in large trajectory databases. A moving flock pattern refers to a group of entities that move closely together within a defined spatial radius for a minimum time interval. We focus on improving the state-of-the-art sequential algorithms, which suffer from high computational costs when dealing with large datasets. By leveraging distributed frameworks and utilizing spatial partitioning, the proposed solution aims to significantly reduce the time required to detect moving flock patterns. We highlight the bottlenecks of the sequential approaches and offer optimizations like partition-based parallelism and strategies for managing flock patterns that span multiple partitions. An experimental evaluation using synthetic trajectory datasets demonstrates that the proposed methods substantially improve scalability and performance compared to existing sequential algorithms. 
    Free, publicly-accessible full text available August 25, 2026
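
To make the flock definition in this abstract concrete, the sketch below checks whether a given group of entities stays together for a minimum number of consecutive time steps. It is a simplified illustration and not the paper's detection algorithm: it tests a disk centered at the group's centroid, which is a sufficient but not necessary condition (a full test would search for any disk of the given radius). The function name `is_flock` and the input layout (one position per entity per time step, with aligned time steps) are assumptions.

```python
import math


def is_flock(trajectories, group, radius, min_duration):
    """True if `group` stays within `radius` of its centroid for at least
    `min_duration` consecutive time steps (simplified centroid test)."""
    n_steps = min(len(trajectories[entity]) for entity in group)
    run = 0  # length of the current streak of "together" time steps
    for t in range(n_steps):
        pts = [trajectories[entity][t] for entity in group]
        cx = sum(p[0] for p in pts) / len(pts)
        cy = sum(p[1] for p in pts) / len(pts)
        if all(math.hypot(p[0] - cx, p[1] - cy) <= radius for p in pts):
            run += 1
            if run >= min_duration:
                return True
        else:
            run = 0  # the streak is broken; start counting again
    return False


trajectories = {
    "a": [(0, 0), (1, 0), (2, 0), (3, 0)],
    "b": [(0, 0.5), (1, 0.5), (2, 0.5), (3, 5)],
    "c": [(0, 10), (1, 10), (2, 10), (3, 10)],
}
print(is_flock(trajectories, ["a", "b"], radius=0.5, min_duration=3))
print(is_flock(trajectories, ["a", "c"], radius=0.5, min_duration=3))
```

In a partitioned setting, a group whose members fall on both sides of a partition border would not be visible to either partition alone, which is why the abstract singles out flocks that span multiple partitions.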
  5. Honeybees, as natural crop pollinators, play a significant role in biodiversity and food production for human civilization. Bees actively regulate hive temperature (homeostasis) to maintain a colony’s proper functionality. Deviations from usual thermoregulation behavior due to external stressors (e.g., extreme environmental temperature, parasites, pesticide exposure) indicate an impending colony collapse. Anticipating such threats by forecasting hive temperature and finding changes in temperature patterns would allow beekeepers to take early preventive measures and avoid critical issues. In that case, how can we model bees’ thermoregulation behavior for an interpretable and effective hive monitoring system? In this article, we propose the principled Electronic Bee-Veterinarian Plus (EBV+) method based on the thermal diffusion equation and a novel “sigmoid” feedback-loop (P) controller for analyzing hive health with the following properties: (i) it is effective on multiple, real-world beehive time sequences (recorded and streaming), (ii) it is explainable with only a few parameters (e.g., hive health factor) that beekeepers can easily quantify and trust, (iii) it issues proactive alerts to beekeepers before any potential issue affecting homeostasis becomes detrimental, and (iv) it is scalable with a time complexity of \(O(t)\) for reconstructing and \(O(t\times m)\) for finding \(m\) cuts of a sequence with \(t\) time-ticks. Experimental results on multiple real-world time sequences showcase the potential and practical feasibility of EBV+. Our method yields accurate forecasting (up to 72% improvement in RMSE) with up to 600 times fewer parameters compared to baselines (ARX, seasonal ARX, Holt-Winters, and DeepAR), and it detects discontinuities and raises alerts that coincide with domain experts’ opinions. Moreover, EBV+ is scalable and fast, taking less than 1 minute on a stock laptop to reconstruct 2 months of sensor data. 
    Free, publicly-accessible full text available June 30, 2026
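
The general idea of combining heat exchange with a saturating proportional controller can be sketched as below. This is not the EBV+ model: the update rule, the function name `simulate_hive`, and every parameter (`diffusion`, `gain`, `max_effort`, the 35-degree target) are hypothetical choices made for illustration. The sketch only shows why such a reconstruction runs in a single \(O(t)\) pass over the ambient temperature sequence.

```python
import math


def simulate_hive(ambient, target=35.0, start=35.0,
                  diffusion=0.1, gain=1.0, max_effort=2.0):
    """Simulate hive temperature in one pass over the ambient readings.

    Each step adds (a) heat exchange with the environment, proportional to
    the temperature difference, and (b) a control effort that is a sigmoid
    of the error, so it saturates at +/- max_effort.
    """
    temps = [start]
    for env in ambient:
        current = temps[-1]
        error = target - current
        # Sigmoid rescaled to the range (-1, 1), then scaled by max_effort.
        effort = max_effort * (2.0 / (1.0 + math.exp(-gain * error)) - 1.0)
        temps.append(current + diffusion * (env - current) + effort)
    return temps


cold_spell = [20.0] * 200  # hypothetical ambient readings
strong = simulate_hive(cold_spell, max_effort=2.0)
weak = simulate_hive(cold_spell, max_effort=0.5)
print(round(strong[-1], 2), round(weak[-1], 2))
```

In this toy model a colony with a small `max_effort` cannot hold its temperature near the target during a cold spell, which hints at how a single interpretable parameter could separate healthy from stressed hives.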
  6. Progressive visual analytics enable data scientists to efficiently explore large datasets and examine progressive results with low latency. Most progressive visualization frameworks use a progressive query processing module that controls the quality of the results and then feeds these results into a visualization module. The goal is to avoid poor-quality progressive results which could mislead data scientists. This method misses some optimization opportunities as it improves the quality of the intermediate result while ignoring how this result affects the final visualization. This work presents a work-in-progress quality-aware progressive visualization input control component, named QPV. The key idea of the proposed framework is to integrate the visualization module into the progressive query results so that the quality control takes into account the final visualization. With limited computational resources, QPV solves an optimization problem to allocate resources and alleviate the misleading effects in the progressive plots. 
  7. This paper demonstrates Pyneapple-G, an open-source library for scalable spatial grouping queries based on Apache Sedona (formerly known as GeoSpark). We demonstrate two modules, namely SGPAC and DDCEL, that support grouping points, grouping lines, and polygon overlays. The SGPAC module provides large-scale grouping of spatial points by highly complex polygon boundaries. The grouping results aggregate the number of spatial points within the boundaries of each polygon. The DDCEL module provides the first parallelized algorithm to group spatial lines into a DCEL data structure and discovers planar polygons from scattered line segments. Exploiting the scalable DCEL, we support scalable overlay operations over multiple polygon layers to compute the layers' intersection, union, or difference. To showcase Pyneapple-G, we have developed a frontend web application that enables users to interact with these modules, select their data layers or data points, and view results on an interactive map. We also provide interactive notebooks demonstrating the superiority and simplicity of Pyneapple-G to help social scientists and developers explore its full potential. 
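
The kind of result this abstract attributes to SGPAC (a count of points inside each polygon) can be illustrated on a single machine with a plain ray-casting test. This sketch does not use Pyneapple-G or Apache Sedona and says nothing about how SGPAC is implemented; the function names `point_in_polygon` and `count_points_per_polygon` are hypothetical, polygons are assumed to be simple rings without holes, and points exactly on a boundary are not treated specially.

```python
def point_in_polygon(point, polygon):
    """Ray casting: count crossings of a ray going right from `point`."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can cross.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def count_points_per_polygon(points, polygons):
    """Map each polygon name to the number of points that fall inside it."""
    return {
        name: sum(point_in_polygon(p, ring) for p in points)
        for name, ring in polygons.items()
    }


polygons = {
    "A": [(0, 0), (2, 0), (2, 2), (0, 2)],
    "B": [(3, 0), (5, 0), (4, 2)],
}
points = [(1, 1), (0.5, 1.5), (4, 0.5), (6, 6)]
print(count_points_per_polygon(points, polygons))
```

This brute-force version tests every point against every polygon edge; a scalable system would typically prune candidate pairs with a spatial index or partitioning before running the exact test.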
  8. Join operations are crucial in data analysis, but they can be inefficient with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. One popular choice for implementing join algorithms is distributed data processing frameworks, e.g., Hadoop and Spark, but each implementation is highly tailored for specific query types. As a result, they do not address join queries that involve diverse and complex conditions since they are not integrated into a holistic query optimization engine like in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to integrate new distributed join algorithms into the database without delving into the database internals. We show that an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code. 
  9. The increasing prevalence of large graph data has produced a variety of research and applications tailored toward graph data management. Users aiming to perform graph analytics will typically start by importing existing data into a separate graph-purposed storage engine. The cost of maintaining a separate system (e.g., the data copy and the associated queries) just for graph analytics may be prohibitive for users with Big Data. In this paper, we introduce Graphix and show how it enables property graph views of existing document data in AsterixDB, a Big Data management system boasting a partitioned-parallel query execution engine. We explain a) the graph view user model of Graphix, b) gSQL++, a novel query language extension for synergistic document-based navigational pattern matching, and c) how edge hops are evaluated in a parallel fashion. We then compare queries authored in gSQL++ against versions in other leading query languages. Finally, we evaluate our approach against a leading native graph database, Neo4j, and show that Graphix is appropriate for operational and analytical workloads, especially at scale. 