skip to main content


Search for: All records

Award ID contains: 1954644

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    We introduce theReverseSpatial Top-kKeyword (RSK)query, which is defined as:given a query term q, an integer k and a neighborhood size find all the neighborhoods of that size where q is in the top-k most frequent terms among the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of a given cell size and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely more result neighborhoods by minimally shifting the square without including more posts in it. To address that, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency against baseline approaches. We also provide a theoretical model to choose the optimal cell size for the index to minimize query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficientapproximatealgorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smartload slicingtechnique. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets and crime report datasets, shows the efficiency of our optimizations and the accuracy of the proposed theoretical model.

     
    more » « less
  2. This paper studies the spatial group-by query over complex polygons. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within the boundaries of each polygon. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query on highly skewed spatial data. We also propose an effective query optimizer that adaptively assigns the appropriate processing scheme based on the query polygons. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques. 
    more » « less
    Free, publicly-accessible full text available October 1, 2024
  3. With the requirements to enable data analytics and exploration interactively and efficiently, progressive data processing, especially progressive join, became essential to data science. Join queries are particularly challenging due to the correlation between input datasets which causes the results to be biased towards some join keys. Existing methods carefully control which parts of the input to process in order to improve the quality of progressive results. If the quality is not satisfactory, they will process more data to improve the result. In this paper, we propose an alternative approach that initially seems counter-intuitive but surprisingly works very well. After query processing, we intentionally report fewer results to the user with the goal of improving the quality. The key idea is that if the output is deviated from the correct distribution, we temporarily hide some results to correct the bias. As we process more data, the hidden results are inserted back until the full dataset is processed. The main challenge is that we do not know the correct output distribution while the progressive query is running. In this work, we formally define the progressive join problem with quality and progressive result rate constraints. We propose an input&output quality-aware progressive join framework (QPJ) that (1) provides input control that decides which parts of the input to process; (2) estimates the final result distribution progressively; (3) automatically controls the quality of the progressive output rate; and (4) combines input&output control to enable quality control of the progressive results. We compare QPJ with existing methods and show QPJ can provide the progressive output that can represent the final answer better than existing methods. 
    more » « less
  4. Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. 
    more » « less
    Free, publicly-accessible full text available June 7, 2024
  5. The popularity of JSON as a data interchange format resulted in big amounts of datasets available for processing. Users would like to analyze this data using SQL queries but existing distributed systems limit their users to only two specific formats, JSONLine and GeoJSON. The complexity of JSON schema makes it challenging to parse arbitrary files in a modern distributed system while producing records with unified schema that can be processed with SQL. To address these challenges, this paper introduces dsJSON, a state-of-the-art distributed JSON processor that overcomes limitations in existing systems and scales to big and complex data. dsJSON introduces the projection tree, a novel data structure that applies selective parsing of nested attributes to produce records that are ready for SQL processors. The key objective of the projection tree is to parse a big JSON file in parallel to produce records with a unified schema that can be processed with SQL. dsJSON is integrated into SparkSQL which enables users to run arbitrary SQL queries on complex JSON files. It also pushes projection and filter down into the parser for full integration between the parser and the processor. Experiments on up-to two terabytes of real data show that dsJSON performs several times faster than existing systems. It can also efficiently parse extremely large files not supported by existing distributed parsers 
    more » « less
  6. Modern data analytics applications prefer to use column-storage formats due to their improved storage efficiency through encoding and compression. Parquet is the most popular file format for col- umn data storage that provides several of these benefits out of the box. However, geospatial data is not readily supported by Parquet. This paper introduces Spatial Parquet, a Parquet extension that efficiently supports geospatial data. Spatial Parquet inherits all the advantages of Parquet for non-spatial data, such as rich data types, compression, and column/row filtering. Additionally, it adds three new features to accommodate geospatial data. First, it introduces a geospatial data type that can encode all standard spatial geome- tries in a column format compatible with Parquet. Second, it adds a new lossless and efficient encoding method, termed FP-delta, that is customized to efficiently store geospatial coordinates stored in floating-point format. Third, it adds a light-weight spatial index that allows the reader to skip non-relevant parts of the file for increased read efficiency. Experiments on large-scale real data showed that Spatial Parquet can reduce the data size by a factor of three even without compression. Compression can further reduce the storage size. Additionally, Spatial Parquet can reduce the reading time by two orders of magnitude when the light-weight index is applied. This initial prototype can open new research directions to further improve geospatial data storage in column format. 
    more » « less