NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

HoneyComb: A Parallel Worst-Case Optimal Join on Multicores

https://doi.org/10.1145/3725307

Wu, Jiacheng; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

To achieve true scalability on massive datasets, a modern query engine needs to be able to take advantage of large, shared-memory, multicore systems.Binary joinsare conceptually easy to parallelize on a multicore system; however, several applications require a different approach to query evaluation, using a Worst-Case Optimal Join (WCOJ) algorithm. WCOJ is known to outperform traditional query plans for cyclic queries. However, there is no obvious adaptation of WCOJ to parallel architectures. The few existing systems that parallelizeWCOJ do this by partitioning only the top variable of theWCOJ algorithm. This leads to work skew (since some relations end up being read entirely by every thread), possible contention between threads (when the hierarchical trie index is built lazily, which is the case on most recent WCOJ systems), and exacerbates the redundant computations already existing in WCOJ.
more » « less
Free, publicly-accessible full text available June 17, 2026
Fast Matrix Multiplication meets the Submodular Width

https://doi.org/10.1145/3725235

Abo_Khamis, Mahmoud; Hu, Xiao; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

One fundamental question in database theory is the following: Given a Boolean conjunctive queryQ, what is the best complexity for computing the answer to Q in terms of the input database sizeN? When restricted to the class of combinatorial algorithms, it is known that the best known complexity for any queryQis captured by thesubmodular widthofQ. However, beyond combinatorial algorithms, certain queries are known to admit faster algorithms that often involve a clever combination of fast matrix multiplication and data partitioning. Nevertheless, there is no systematic way to derive and analyze the complexity of such algorithms for arbitrary queriesQ. In this work, we introduce a general framework that captures the best complexity for answering any Boolean conjunctive queryQusing matrix multiplication. Our framework unifies both combinatorial and non-combinatorial techniques under the umbrella of information theory. It generalizes the notion of submodular width to a new stronger notion called the ω-submodular widththat naturally incorporates the power of fast matrix multiplication. We describe a matching algorithm that computes the answer to any queryQin time corresponding to the ω-submodularwidth ofQ. We show that our framework recovers the best known complexities for Boolean queries that have been studied in the literature, to the best of our knowledge, and also discovers new algorithms for some classes of queries that improve upon the best known complexities.
more » « less
Free, publicly-accessible full text available June 9, 2026
Galley: Modern Query Optimization for Sparse Tensor Programs

https://doi.org/10.1145/3725301

Deeds, Kyle; Ahrens, Willow; Balazinska, Magdalena; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

The tensor programming abstraction is a foundational paradigm which allows users to write high performance programs via a high-level imperative interface. Recent work onsparse tensor compilershas extended this paradigm to sparse tensors (i.e., tensors where most entries are not explicitly represented). With these systems, users define the semantics of the program and the algorithmic decisions in a concise language that can be compiled to efficient low-level code. However, these systems still require users to make complex decisions about program structure and memory layouts to write efficient programs. This work presents.Galley, a system for declarative tensor programming that allows users to write efficient tensor programs without making complex algorithmic decisions. Galley is the first system to perform cost based lowering of sparse tensor algebra to the imperative language of sparse tensor compilers, and the first to optimize arbitrary operators beyond Σ and *. First, it decomposes the input program into a sequence of aggregation steps through a novel extension of the FAQ framework. Second, Galley optimizes and converts each aggregation step to a concrete program, which is compiled and executed with a sparse tensor compiler. We show that Galley produces programs that are 1-300x faster than competing methods for machine learning over joins and 5-20x faster than a state-of-the-art relational database for subgraph counting workloads with a minimal optimization overhead.
more » « less
Free, publicly-accessible full text available June 17, 2026
Output-Sensitive Evaluation of Regular Path Queries

https://doi.org/10.1145/3725242

Abo_Khamis, Mahmoud; Kara, Ahmet; Olteanu, Dan; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

We study the classical evaluation problem for regular path queries: Given an edge-labeled graph and a regular path query, compute the set of pairs of vertices that are connected by paths that match the query. The Product Graph (PG) is the established evaluation approach for regular path queries. PG first constructs the product automaton of the data graph and the query and then uses breadth-first search to find the accepting states reachable from each initial state in the product automaton. Its data complexity is O(|V|⋅|E|), where V and E are the sets of vertices and respectively edges in the data graph. This complexity cannot be improved by combinatorial algorithms. In this paper, we introduce OSPG, an output-sensitive refinement of PG, whose data complexity is O(|E|^3/2+ min(OUT⋅√|E|, |V|⋅|E|)), where OUT is the number of distinct vertex pairs in the query output. OSPG's complexity is at most that of PG and can be asymptotically smaller for small output and sparse input. The improvement of OSPG over PG is due to the unnecessary time wasted by PG in the breadth-first search phase, in case a few output pairs are eventually discovered. For queries without Kleene star, the complexity of OSPG can be further improved to O(|E| + |E|⋅√OUT).
more » « less
Free, publicly-accessible full text available June 9, 2026
L p Bound in Action: Cardinality Estimation with One-Sided Guarantees

https://doi.org/10.1145/3722212.3725114

Mayer, Christoph; Zhang, Haozhe; Abo_Khamis, Mahmoud; Olteanu, Dan; Suciu, Dan (June 2025, ACM)

Free, publicly-accessible full text available June 22, 2026
LpBound : Pessimistic Cardinality Estimation Using ℓ _p -Norms of Degree Sequences

https://doi.org/10.1145/3725321

Zhang, Haozhe; Mayer, Christoph; Abo_Khamis, Mahmoud; Olteanu, Dan; Suciu, Dan (June 2025, Proceedings of the ACM on Management of Data)

Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of a query optimizer, and is often the main culprit when the optimizer chooses a poor plan. This paper introduces LpBound, a pessimistic cardinality estimator for multi-join queries (acyclic or cyclic) with selection predicates and group-by clauses.LpBoundcomputes a guaranteed upper bound on the size of the query output using simple statistics on the input relations, consisting of ℓ_p-norms of degree sequences. The bound is the optimal solution of a linear program whose constraints encode data statistics and Shannon inequalities. We introduce two optimizations that exploit the structure of the query in order to speed up the estimation time and makeLpBoundpractical. We experimentally evaluateLpBoundagainst a range of traditional, pessimistic, and machine learning-based estimators on the JOB, STATS, and subgraph matching benchmarks. Our main finding is thatLpBoundcan be orders of magnitude more accurate than traditional estimators used in mainstream open-source and commercial database systems. Yet it has comparable low estimation time and space requirements. When injected the estimates ofLpBound, Postgres derives query plans at least as good as those derived using the true cardinalities.
more » « less
Free, publicly-accessible full text available June 17, 2026
PANDA: Query Evaluation in Submodular Width

https://doi.org/10.46298/THEORETICS.25.12

Khamis, Mahmoud Abo; Ngo, Hung Q; Suciu, Dan (April 2025, TheoretiCS)

In recent years, several information-theoretic upper bounds have been introduced on the output size and evaluation cost of database join queries. These bounds vary in their power depending on both the type of statistics on input relations and the query plans that they support. This motivated the search for algorithms that can compute the output of a join query in times that are bounded by the corresponding information-theoretic bounds. In this paper, we describe PANDA, an algorithm that takes a Shannon-inequality that underlies the bound, and translates each proof step into an algorithmic step corresponding to some database operation. PANDA computes answers to a conjunctive query in time given by the the submodular width plus the output size of the query. The version in this paper represents a significant simplification of the original version [ANS, PODS'17]. Comment: 42 pages. This is the TheoretiCS journal version
more » « less
Free, publicly-accessible full text available April 30, 2026
Information Theory Strikes Back: New Development in the Theory of Cardinality Estimation

https://doi.org/10.1145/3733620.3733623

Abo_Khamis, Mahmoud; Nakos, Vasileios; Olteanu, Dan; Suciu, Dan (April 2025, ACM SIGMOD Record)

Estimating the cardinality of the output of a query is a fundamental problem in database query processing. In this article, we overview a recently published contribution that casts the cardinality estimation problem as linear optimization and computes guaranteed upper bounds on the cardinality of the output for any full conjunctive query. The objective of the linear program is to maximize the joint entropy of the query variables and its constraints are the Shannon information inequalities and new information inequalities involving ℓp-norms of the degree sequences of the join attributes. The bounds based on arbitrary norms can be asymptotically lower than those based on the ℓ1 and ℓ∞ norms, which capture the cardinalities and respectively the max-degrees of the input relations. They come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when each degree sequence is on one join attribute.
more » « less
Free, publicly-accessible full text available April 28, 2026
Pessimistic Cardinality Estimation

https://doi.org/10.1145/3712311.3712313

Abo_Khamis, Mahmoud; Deeds, Kyle; Olteanu, Dan; Suciu, Dan (January 2025, ACM SIGMOD Record)

Cardinality Estimation is to estimate the size of the output of a query without computing it, by using only statistics on the input relations. Existing estimators try to return an unbiased estimate of the cardinality: this is notoriously difficult. A new class of estimators have been proposed recently, called pessimistic estimators, which compute a guaranteed upper bound on the query output. Two recent advances have made pessimistic estimators practical. The first is the recent observation that degree sequences of the input relations can be used to compute query upper bounds. The second is a long line of theoretical results that have developed the use of information theoretic inequalities for query upper bounds. This paper is a short overview of pessimistic cardinality estimators, contrasting them with traditional estimators.
more » « less
Free, publicly-accessible full text available January 13, 2026
Semantic Foundations of Equality Saturation

https://doi.org/10.4230/LIPICS.ICDT.2025.11

Suciu, Dan; Wang, Yisu Remy; Zhang, Yihong (January 2025, Schloss Dagstuhl – Leibniz-Zentrum für Informatik)
Roy, Sudeepa; Kara, Ahmet (Ed.)
Equality saturation is an emerging technique for program and query optimization developed in the programming language community. It performs term rewriting over an E-graph, a data structure that compactly represents a program space. Despite its popularity, the theory of equality saturation lags behind the practice. In this paper, we define a fixpoint semantics of equality saturation based on tree automata and uncover deep connections between equality saturation and the chase. We characterize the class of chase sequences that correspond to equality saturation. We study the complexities of terminations of equality saturation in three cases: single-instance, all-term-instance, and all-E-graph-instance. Finally, we define a syntactic criterion based on acyclicity that implies equality saturation termination.
more » « less
Free, publicly-accessible full text available January 1, 2026

« Prev Next »

Search for: All records