skip to main content


Search for: All records

Creators/Authors contains: "Tsotras, Vassilis J."

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract

    Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner.

     
    more » « less
  2. This paper studies the spatial group-by query over complex polygons. Given a set of spatial points and a set of polygons, the spatial group-by query returns the number of points that lie within the boundaries of each polygon. Groups are selected from a set of non-overlapping complex polygons, typically in the order of thousands, while the input is a large-scale dataset that contains hundreds of millions or even billions of spatial points. This problem is challenging because real polygons (like counties, cities, postal codes, voting regions, etc.) are described by very complex boundaries. We propose a highly-parallelized query processing framework to efficiently compute the spatial group-by query on highly skewed spatial data. We also propose an effective query optimizer that adaptively assigns the appropriate processing scheme based on the query polygons. Our experimental evaluation with real data and queries has shown significant superiority over all existing techniques. 
    more » « less
    Free, publicly-accessible full text available October 1, 2024
  3. ABSTRACT The Doubly Connected Edge List (DCEL) is an edge-list structure that has been widely utilized in spatial applications for planar topological computations. An important operation is the overlay which combines the DCELs of two input layers and can easily support spatial queries like the intersection, union and difference between these layers. However, existing sequential implementations for computing the overlay do not scale and fail to complete for large datasets (for example the US census tracks). In this paper we propose a distributed and scalable way to compute the overlay operation and its related supported queries. We address the issues involved in efficiently distributing the overlay operator and over various optimizations that improve performance. Our scalable solution can compute the overlay of very large real datasets (32M edges) in few minutes. 
    more » « less
    Free, publicly-accessible full text available August 23, 2024
  4. Abstract—The Doubly Connected Edge List (DCEL) is a popular data structure for representing planar subdivisions and is used to accelerate spatial applications like map overlay, graph simplification, and subdivision traversal. Current DCEL imple- mentations assume a standalone machine environment, which does not scale when processing the large dataset sizes that abound in today’s spatial applications. This paper proposes a Distributed Doubly Connected Edge List (DDCEL) data structure extending the DCEL to a distributed environment. The DDCEL constructor undergoes a two-phase paradigm to generate the subdivision’s vertices, half-edges, and faces. After spatially partitioning the input data, the first phase runs the sequential DCEL construction algorithm on each data partition in parallel. The second phase then iteratively merges information from multiple data parti- tions to generate the shared data structure. Our experimental evaluation with real data of road networks of up to 563 million line segments shows significant performance advantages of the proposed approach over the existing techniques. 
    more » « less
    Free, publicly-accessible full text available July 1, 2024
  5. Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. 
    more » « less
    Free, publicly-accessible full text available June 7, 2024
  6. Abstract

    We introduce theReverseSpatial Top-kKeyword (RSK)query, which is defined as:given a query term q, an integer k and a neighborhood size find all the neighborhoods of that size where q is in the top-k most frequent terms among the social posts in those neighborhoods. An obvious approach would be to partition the dataset with a uniform grid structure of a given cell size and identify the cells where this term is in the top-k most frequent keywords. However, this answer would be incomplete since it only checks for neighborhoods that are perfectly aligned with the grid. Furthermore, for every neighborhood (square) that is an answer, we can define infinitely more result neighborhoods by minimally shifting the square without including more posts in it. To address that, we need to identify contiguous regions where any point in the region can be the center of a neighborhood that satisfies the query. We propose an algorithm to efficiently answer an RSK query using an index structure consisting of a uniform grid augmented by materialized lists of term frequencies. We apply various optimizations that drastically improve query latency against baseline approaches. We also provide a theoretical model to choose the optimal cell size for the index to minimize query latency. We further examine a restricted version of the problem (RSKR) that limits the scope of the answer and propose efficientapproximatealgorithms. Finally, we examine how parallelism can improve performance by balancing the workload using a smartload slicingtechnique. Extensive experimental performance evaluation of the proposed methods using real Twitter datasets and crime report datasets, shows the efficiency of our optimizations and the accuracy of the proposed theoretical model.

     
    more » « less
  7. Abstract

    Today, data is being actively generated by a variety of devices, services, and applications. Such data is important not only for the information that it contains, but also for its relationships to other data and to interested users. Most existing Big Data systems focus onpassivelyanswering queries from users, rather thanactivelycollecting data, processing it, and serving it to users. To satisfy both passive and active requests at scale, application developers need either to heavily customize an existing passive Big Data system or to glue one together with systems likeStreaming EnginesandPub-sub services. Either choice requires significant effort and incurs additional overhead. In this paper, we present the BAD (Big Active Data) system as an end-to-end, out-of-the-box solution for this challenge. It is designed to preserve the merits of passive Big Data systems and introduces new features for actively serving Big Data to users at scale. We show the design and implementation of the BAD system, demonstrate how BAD facilitates providing both passive and active data services, investigate the BAD system’s performance at scale, and illustrate the complexities that would result from instead providing BAD-like services with a “glued” system.

     
    more » « less
  8. null (Ed.)
    Query Optimization remains an open problem for Big Data Management Systems. Traditional optimizers are cost-based and use statistical estimates of intermediate result cardinalities to assign costs and pick the best plan. However, such estimates tend to become less accurate because of filtering conditions caused either from undetected correlations between multiple predicates local to a single dataset, predicates with query parameters, or predicates involving user-defined functions (UDFs). Consequently, traditional query optimizers tend to ignore or miscalculate those settings, thus leading to suboptimal execution plans. Given the volume of today’s data, a suboptimal plan can quickly become very inefficient. In this work, we revisit the old idea of runtime dynamic optimization and adapt it to a shared-nothing distributed database system, AsterixDB. The optimization runs in stages (re-optimization points), starting by first executing all predicates local to a single dataset. The intermediate result created from each stage is used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimations, thus leading to much better execution plans. While it introduces the overhead for materializing these intermediate results, our experiments show that this overhead is relatively small and it is an acceptable price to pay given the optimization benefits. In fact, our experimental evaluation shows that runtime dynamic optimization leads to much better execution plans as compared to the current default AsterixDB plans as well as to plans produced by static cost-based optimization (i.e. based on the initial dataset statistics) and other state-of-the-art approaches. 
    more » « less
  9. A region \(\mathcal {R} \) is a dwell region for a moving object O if, given a threshold distance r q and duration τ q , every point of \(\mathcal {R} \) remains within distance r q from O for at least time τ q . Points within \(\mathcal {R} \) are likely to be of interest to O , so identification of dwell regions has applications such as monitoring and surveillance. We first present a logarithmic-time online algorithm to find dwell regions in an incoming stream of object positions. Our method maintains the upper and lower bounds for the radius of the smallest circle enclosing the object positions, thereby greatly reducing the number of trajectory points needed to evaluate the query. It approximates the radius of the smallest circle enclosing a given subtrajectory within an arbitrarily small user-defined factor, and is also able to efficiently answer decision queries asking whether or not a dwell region exists. For the offline version of the dwell region problem, we first extend our online approach to develop the ρ -Index, which indexes subtrajectories using query radius ranges. We then refine this approach to obtain the τ -Index, which indexes subtrajectories using both query radius ranges and dwell durations. Our experiments using both real-world and synthetic datasets show that the online approach can scale up to hundreds of thousands of moving objects. For archived trajectories, our indexing approaches speed up queries by many orders of magnitude. 
    more » « less