Search for: All records

Award ID contains: 1954962


  1. Window queries are important analytical tools for ordered data and have been researched both in streaming and stored data environments. By incorporating ideas for window queries from existing streaming and stored data systems, we propose a new window syntax that makes a wide range of window queries easier to write and optimize. We have implemented this new window syntax in SQL++, an SQL extension that supports querying semistructured data, on top of AsterixDB, a Big Data Management System, thus allowing us to process window queries over large datasets in a parallel and efficient manner. 
  2. Efficient multi-join query processing is crucial but remains a complex, ongoing challenge for high-performance data management systems (DBMSs). This paper studies the impact of different memory distribution techniques among join operators on different classes of multi-join query plans under different assumptions regarding memory availability and storage devices such as HDD and SSD on Amazon Web Services (AWS). We re-evaluate the results of one of the early impactful studies from the 1990s that was originally done using a simulator for the Gamma database system. The main goal of our study is to scientifically re-evaluate and build upon previous studies whose results have become the basis for the design of past and modern database systems, and to provide a solid foundation for understanding basic "join physics", which is essential for eventually designing a resource-based scheduler for concurrent complex workloads. 
    Free, publicly-accessible full text available November 20, 2025
  3. Efficient multi-join query processing is crucial but remains a complex, ongoing challenge for high-performance data management systems (DBMSs). This paper studies the impact of different memory distribution techniques among join operators on different classes of multi-join query plans under different assumptions regarding memory availability and storage devices such as HDD and SSD on Amazon Web Services (AWS). We re-evaluate the results of one of the early impactful studies from the 1990s that was originally done using a simulator for the Gamma database system. The main goal of our study is to scientifically re-evaluate and build upon previous studies whose results have become the basis for the design of past and modern database systems, and to provide a solid foundation for understanding basic "join physics", which is essential for eventually designing a resource-based scheduler for concurrent complex workloads. 
    Free, publicly-accessible full text available November 20, 2025
  4. Join operations are crucial in data analysis, but can be inefficient with large datasets and complex non-equality-based conditions. Optimized join algorithms have gained traction in database research to address these challenges. Distributed data processing frameworks, e.g., Hadoop and Spark, are a popular choice for implementing join algorithms, but each implementation is highly tailored to specific query types. As a result, they do not address join queries that involve diverse and complex conditions, since they are not integrated into a holistic query optimization engine as in DBMSs. On the other hand, implementing new join algorithms on a DBMS from scratch requires substantial effort and expertise. This paper introduces FUDJ, Flexible User-defined Distributed Joins, a framework for complex distributed join algorithms. The key idea of FUDJ is to allow developers to integrate new distributed join algorithms into the database without delving into the database internals. As shown, an algorithm implemented in FUDJ is up to an order of magnitude faster than existing user-defined implementations with an order of magnitude fewer lines of code. 
  5. The increasing prevalence of large graph data has produced a variety of research and applications tailored toward graph data management. Users aiming to perform graph analytics will typically start by importing existing data into a separate graph-purposed storage engine. The cost of maintaining a separate system (e.g., the data copy, the associated queries, etc.) just for graph analytics may be prohibitive for users with Big Data. In this paper, we introduce Graphix and show how it enables property graph views of existing document data in AsterixDB, a Big Data management system boasting a partitioned-parallel query execution engine. We explain a) the graph view user model of Graphix, b) gSQL++, a novel query language extension for synergistic document-based navigational pattern matching, and c) how edge hops are evaluated in a parallel fashion. We then compare queries authored in gSQL++ against versions in other leading query languages. Finally, we evaluate our approach against a leading native graph database, Neo4j, and show that Graphix is appropriate for operational and analytical workloads, especially at scale. 
  6. SQL is five decades old and has outlasted many programming and query languages that have come and gone during its lifetime. It was born shortly after the introduction of the relational model, and was designed for querying a flat and typed tabular world. Support for modern, flexible data in the SQL standard and in relational database systems has largely been approached via the addition of new column types (e.g., XML or JSON) together with functions to operate on them. It is time for a cleaner solution that retains the benefits that have allowed SQL to be so successful for so long. We describe SQL++, a SQL extension that relaxes SQL's strictness in terms of both object structure (flat → nested) and schema (mandatory → optional), along with a multi-party effort to agree on a core definition and syntax supportable by multiple vendors. SQL++ sees relational data as a subset of a more flexible object model and it sees collections of document data (e.g., JSON) as a natural and supportable relaxation as opposed to a “bolt on” addition via a SQL column type. We describe the core features of SQL++ and explain how its definition can accommodate flexible data, while staying true to SQL in situations where the target data is tabular and strongly typed. Index Terms: semistructured data, query, JSON, SQL, NoSQL 
  7. In database management systems (DBMSs) that handle multiple concurrent queries, adapting to fluctuating workloads is crucial. This flexibility allows the DBMS to revise decisions based on the current workload and available resources. As memory availability changes with the arrival or completion of queries, it is vital that memory-intensive operators such as the Hybrid Hash Join adapt dynamically. This paper introduces a new memory-adaptive hash-based join algorithm design implemented in Apache AsterixDB and evaluates its responsiveness to memory variability. 
  8. Effective query optimization remains an open problem for Big Data Management Systems. In this work, we revisit an old idea, runtime dynamic optimization, and adapt it to a big data management system, AsterixDB. The approach runs in stages separated by re-optimization points, beginning by executing all predicates local to a single dataset. The intermediate result created by a stage is then used to re-optimize the remaining query. This re-optimization approach avoids inaccurate intermediate result cardinality estimates, thus leading to much better execution plans. While it introduces overhead for materializing intermediate results, experiments show that this overhead is relatively small and is an acceptable price to pay given the optimization benefits. 
  9. In Proc. of the IEEE Int’l. Workshop on Benchmarking, Performance Tuning, and Optimization for Big Data Applications (BPOD 2022) 
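
The staged approach described in item 8 (execute single-dataset predicates first, materialize the results, then re-optimize the rest of the query using observed sizes) can be illustrated with a minimal in-memory sketch. This is not the AsterixDB implementation: the function names (apply_local_predicates, hash_join, run_staged), the list-of-dicts data model, and the greedy smallest-input-first join ordering are all assumptions chosen for illustration, standing in for a real cost-based optimizer.

```python
from typing import Callable, Dict, List, Tuple

Row = Dict[str, object]
Predicate = Callable[[Row], bool]


def apply_local_predicates(
    datasets: Dict[str, List[Row]], predicates: Dict[str, Predicate]
) -> Dict[str, List[Row]]:
    """Stage 1 (hypothetical helper): filter each dataset with its own
    single-dataset predicate and materialize the filtered rows."""
    materialized: Dict[str, List[Row]] = {}
    for name, rows in datasets.items():
        keep = predicates.get(name)
        materialized[name] = [r for r in rows if keep is None or keep(r)]
    return materialized


def hash_join(left: List[Row], right: List[Row], join_key: str) -> List[Row]:
    """In-memory equi-join that builds the hash table on whichever input
    is actually smaller, using observed rather than estimated sizes."""
    build, probe = (left, right) if len(left) <= len(right) else (right, left)
    table: Dict[object, List[Row]] = {}
    for row in build:
        table.setdefault(row[join_key], []).append(row)
    return [{**match, **row} for row in probe for match in table.get(row[join_key], [])]


def run_staged(
    datasets: Dict[str, List[Row]], predicates: Dict[str, Predicate], join_key: str
) -> Tuple[List[Row], List[str]]:
    """Join all datasets on join_key, choosing the next input at each
    re-optimization point from the exact sizes of materialized results.
    The greedy smallest-first rule is an assumption of this sketch."""
    remaining = apply_local_predicates(datasets, predicates)
    first = min(remaining, key=lambda n: len(remaining[n]))
    current = remaining.pop(first)
    order = [first]
    while remaining:
        # Re-optimization point: sizes here are measured, not estimated.
        nxt = min(remaining, key=lambda n: len(remaining[n]))
        current = hash_join(current, remaining.pop(nxt), join_key)
        order.append(nxt)
    return current, order
```

Because each decision is made after a stage has been materialized, the sketch never depends on a cardinality estimate; the price is the materialization itself, which is the overhead the abstract refers to.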