Search for: All records

Creators/Authors contains: "Olteanu, Dan"


  1. We study the classical evaluation problem for regular path queries: Given an edge-labeled graph and a regular path query, compute the set of pairs of vertices that are connected by paths matching the query. The Product Graph (PG) is the established evaluation approach for regular path queries. PG first constructs the product automaton of the data graph and the query, and then uses breadth-first search to find the accepting states reachable from each initial state in the product automaton. Its data complexity is O(|V|⋅|E|), where V and E are the sets of vertices and edges, respectively, in the data graph. This complexity cannot be improved by combinatorial algorithms. In this paper, we introduce OSPG, an output-sensitive refinement of PG, whose data complexity is O(|E|^(3/2) + min(OUT⋅√|E|, |V|⋅|E|)), where OUT is the number of distinct vertex pairs in the query output. OSPG's complexity is at most that of PG and can be asymptotically smaller for small output and sparse input. OSPG improves over PG by avoiding the time PG wastes in the breadth-first search phase when only a few output pairs are eventually discovered. For queries without Kleene star, the complexity of OSPG can be further improved to O(|E| + |E|⋅√OUT).
    Free, publicly-accessible full text available June 9, 2026
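    For reference, a minimal sketch of the baseline PG approach described above (product automaton plus breadth-first search), not of the OSPG refinement; the input encoding (adjacency lists for the graph, a transition map for the query NFA) is an assumption made for illustration:

      from collections import deque

      def evaluate_rpq(vertices, edges, delta, q0, accepting):
          """vertices: iterable of graph vertices.
          edges: dict mapping a vertex to a list of (label, target) pairs.
          delta: dict mapping (NFA state, label) to a set of NFA states.
          Returns all vertex pairs (v, u) connected by a matching path."""
          output = set()
          for v in vertices:                  # BFS per initial product state
              seen = {(v, q0)}
              queue = deque(seen)
              while queue:
                  u, q = queue.popleft()
                  if q in accepting:          # accepting product state reached
                      output.add((v, u))
                  for label, w in edges.get(u, ()):
                      for q2 in delta.get((q, label), ()):
                          if (w, q2) not in seen:
                              seen.add((w, q2))
                              queue.append((w, q2))
          return output

      # Example for the query a+ over a 3-vertex path:
      # evaluate_rpq([1, 2, 3], {1: [('a', 2)], 2: [('a', 3)]},
      #              {(0, 'a'): {1}, (1, 'a'): {1}}, 0, {1})
      # -> {(1, 2), (1, 3), (2, 3)}

    Each product state and product edge is explored at most once per start vertex, which is exactly the O(|V|⋅|E|) data complexity quoted in the abstract.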
  2. Free, publicly-accessible full text available June 22, 2026
  3. Cardinality estimation is the problem of estimating the size of the output of a query, without actually evaluating the query. The cardinality estimator is a critical piece of a query optimizer, and is often the main culprit when the optimizer chooses a poor plan. This paper introduces LpBound, a pessimistic cardinality estimator for multi-join queries (acyclic or cyclic) with selection predicates and group-by clauses. LpBound computes a guaranteed upper bound on the size of the query output using simple statistics on the input relations, consisting of ℓp-norms of degree sequences. The bound is the optimal solution of a linear program whose constraints encode data statistics and Shannon inequalities. We introduce two optimizations that exploit the structure of the query in order to speed up the estimation time and make LpBound practical. We experimentally evaluate LpBound against a range of traditional, pessimistic, and machine-learning-based estimators on the JOB, STATS, and subgraph matching benchmarks. Our main finding is that LpBound can be orders of magnitude more accurate than traditional estimators used in mainstream open-source and commercial database systems, while having comparably low estimation time and space requirements. When injected with the estimates of LpBound, Postgres derives query plans at least as good as those derived using the true cardinalities.
    Free, publicly-accessible full text available June 17, 2026
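    LpBound itself solves a linear program over many statistics; the toy sketch below only illustrates the kind of bound its ℓp-norm statistics yield, for a single two-way join, where the Cauchy-Schwarz inequality gives |R ⋈ S| = Σ_y deg_R(y)⋅deg_S(y) ≤ ‖deg_R‖₂⋅‖deg_S‖₂ (all names here are illustrative, not LpBound's API):

      from collections import Counter

      def l2_norm(values):
          return sum(d * d for d in values) ** 0.5

      def join_upper_bound(R, S):
          """Guaranteed upper bound on |R(x,y) join S(y,z)|."""
          deg_R = Counter(y for _, y in R)   # degree sequence of y in R
          deg_S = Counter(y for y, _ in S)   # degree sequence of y in S
          return l2_norm(deg_R.values()) * l2_norm(deg_S.values())

      R = [(1, 'a'), (2, 'a'), (3, 'b')]
      S = [('a', 10), ('a', 20), ('b', 30)]
      true_size = sum(1 for _, y in R for y2, _ in S if y == y2)  # 5
      print(join_upper_bound(R, S), '>=', true_size)              # 5.0 >= 5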
  4. Estimating the cardinality of the output of a query is a fundamental problem in database query processing. In this article, we overview a recently published contribution that casts the cardinality estimation problem as linear optimization and computes guaranteed upper bounds on the cardinality of the output for any full conjunctive query. The objective of the linear program is to maximize the joint entropy of the query variables, and its constraints are the Shannon information inequalities and new information inequalities involving ℓp-norms of the degree sequences of the join attributes. The bounds based on arbitrary norms can be asymptotically lower than those based on the ℓ1 and ℓ∞ norms, which capture the cardinalities and the max-degrees, respectively, of the input relations. They come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when each degree sequence is on one join attribute.
    Free, publicly-accessible full text available April 28, 2026
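    To make the linear program concrete, here is a sketch of its instantiation (with only cardinality, i.e. ℓ1-norm, constraints) for the triangle query Q(x,y,z) ← R(x,y), S(y,z), T(z,x), written in LaTeX; h ranges over set functions on the query variables:

      \max\; h(XYZ) \quad \text{subject to} \quad
        h(XY) \le \log |R|, \;\;
        h(YZ) \le \log |S|, \;\;
        h(XZ) \le \log |T|,
      \text{and } h \text{ satisfies the Shannon inequalities (monotonicity and submodularity).}

    The optimal value here is (log|R| + log|S| + log|T|)/2, recovering the classical bound √(|R|⋅|S|⋅|T|) on the output cardinality; the ℓp-norm constraints described in the overview tighten this program further.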
  5. Cardinality estimation is the problem of estimating the size of the output of a query without computing it, using only statistics on the input relations. Existing estimators try to return an unbiased estimate of the cardinality, which is notoriously difficult. A new class of estimators, called pessimistic estimators, has been proposed recently; they compute a guaranteed upper bound on the size of the query output. Two recent advances have made pessimistic estimators practical. The first is the observation that degree sequences of the input relations can be used to compute query upper bounds. The second is a long line of theoretical results that have developed the use of information-theoretic inequalities for query upper bounds. This paper is a short overview of pessimistic cardinality estimators, contrasting them with traditional estimators.
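    The contrast drawn above can be made concrete on a single join R(x,y) ⋈ S(y,z); a minimal sketch with illustrative statistics (relation sizes, numbers of distinct join values, max-degrees) as inputs:

      def traditional_estimate(size_R, size_S, ndv_R_y, ndv_S_y):
          # Textbook estimate: assumes uniformity and independence;
          # it can err in either direction.
          return size_R * size_S / max(ndv_R_y, ndv_S_y)

      def pessimistic_bound(size_R, size_S, maxdeg_R_y, maxdeg_S_y):
          # Each R-tuple joins with at most maxdeg_S_y S-tuples (and
          # vice versa), so both products are guaranteed upper bounds.
          return min(size_R * maxdeg_S_y, size_S * maxdeg_R_y)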
  6. We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(N^(w(Q))), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the amortized time per single-tuple insert is constant for acyclic full conjunctive queries. In contrast, we show that a sequence of N inserts and deletes can be executed in total time Õ(N^(w(Q'))), where Q' is obtained from Q by extending every relational atom with extra variables that represent the lifespans of tuples in the database. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time for the output of Q. Our approach achieves amortized optimal update times for the hierarchical and Loomis-Whitney join queries.
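    A minimal sketch of the insert-only case for the simplest acyclic query, Q(a,b,c) = R(a,b) ⋈ S(b,c): each insert takes constant time, and enumeration spends constant time between consecutive output tuples. This is far simpler than the general algorithm of the paper, and all names are illustrative:

      from collections import defaultdict

      class JoinView:
          def __init__(self):
              self.R = defaultdict(list)   # b -> list of a values
              self.S = defaultdict(list)   # b -> list of c values
              self.live = set()            # b values occurring in both

          def insert_R(self, a, b):
              self.R[b].append(a)
              if self.S[b]:
                  self.live.add(b)

          def insert_S(self, b, c):
              self.S[b].append(c)
              if self.R[b]:
                  self.live.add(b)

          def enumerate(self):
              # Every b in `live` yields at least one output tuple, so
              # the delay between consecutive outputs is constant.
              for b in self.live:
                  for a in self.R[b]:
                      for c in self.S[b]:
                          yield (a, b, c)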
  7. In this paper, we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that is closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect Shapley value computation to probabilistic query evaluation. We show two applications of our result. First, Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in the case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the #P-hard problem of model counting for lineage in positive bipartite disjunctive normal form.
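    For reference, the Shapley value the abstract builds on, computed by brute force over a generic coalition game; the characteristic function v below is a placeholder argument, not the specific satisfying-assignment game defined in the paper:

      from itertools import combinations
      from math import factorial

      def shapley(players, v):
          """phi_i = sum over S subseteq N minus {i} of
          |S|! * (n-|S|-1)! / n! * (v(S union {i}) - v(S))."""
          n = len(players)
          phi = {}
          for i in players:
              rest = [p for p in players if p != i]
              total = 0.0
              for k in range(n):
                  for S in combinations(rest, k):
                      weight = factorial(k) * factorial(n - k - 1) / factorial(n)
                      total += weight * (v(set(S) | {i}) - v(set(S)))
              phi[i] = total
          return phi

      # Usage: shapley(['x1', 'x2'], v) where v maps a set of variables
      # to a number, e.g. a model count (paper-specific and omitted here).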
  8. Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to a significant performance penalty. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees of input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of these upper bounds that incorporates ℓp-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. They are also based on information theory, come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple".
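    The statistic this extension is built on takes two lines to compute; note that the ℓ1-norm of a degree sequence is exactly the relation's cardinality and the ℓ∞-norm is its max-degree, the two statistics used by the earlier bounds the abstract refers to (the degree sequence below is made up for illustration):

      def lp_norm(degrees, p):
          if p == float('inf'):
              return max(degrees)
          return sum(d ** p for d in degrees) ** (1.0 / p)

      degrees = [4, 2, 1, 1]                     # degrees of a join attribute
      print(lp_norm(degrees, 1))                 # 8.0: the cardinality (l1)
      print(lp_norm(degrees, 2))                 # ~4.69 (l2)
      print(lp_norm(degrees, float('inf')))      # 4: the max-degree (l_inf)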
  9. We apply foundation models to data discovery and exploration tasks. Foundation models are large language models (LLMs) that show promising performance on a range of diverse tasks unrelated to their training. We show that these models are highly applicable to the data discovery and data exploration domain. When carefully used, they have superior capability on three representative tasks: table-class detection, column-type annotation, and join-column prediction. On all three tasks, we show that a foundation-model-based approach outperforms task-specific models and thus the state of the art. Further, our approach often surpasses human-expert task performance. We investigate the fundamental characteristics of this approach, including generalizability to several foundation models and the impact of non-determinism on the outputs. All in all, this suggests a future direction in which disparate data-management tasks can be unified under foundation models.
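    An illustrative sketch of the prompt-based approach for one of the three tasks, column-type annotation; `complete` stands in for an arbitrary LLM completion API and is an assumption, not the interface used in the paper:

      def column_type_prompt(column_name, sample_values, candidate_types):
          """Build a prompt asking an LLM to annotate a column's semantic type."""
          values = ", ".join(map(str, sample_values))
          types = ", ".join(candidate_types)
          return (f"Column '{column_name}' contains the values: {values}.\n"
                  f"Pick the single best semantic type from: {types}.\n"
                  f"Answer with the type only.")

      # prompt = column_type_prompt("col3", ["Paris", "Tokyo", "Lagos"],
      #                             ["city", "country", "person", "date"])
      # predicted_type = complete(prompt)   # hypothetical LLM call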