NSF PAR Search | NSF Public Access Repository

Ranked Enumeration for Database Queries

https://doi.org/10.1145/3703922.3703924

Tziavelis, Nikolaos; Gatterbauer, Wolfgang; Riedewald, Mirek (November 2024, ACM SIGMOD Record)

Ranked enumeration is a query-answering paradigm where the query answers are returned incrementally in order of importance (instead of returning all answers at once). Importance is defined by a ranking function that can be specific to the application, but typically involves either a lexicographic order (e.g., ORDER BY R.A, S.B in SQL) or a weighted sum of attributes (e.g., ORDER BY 3*R.A + 2*S.B). Recent work has introduced any-k algorithms for (multi-way) join queries, which push ranking into joins and avoid materializing intermediate results until necessary. The top-ranked answers are returned asymptotically faster than the common join-then-rank approach of database systems, resulting in orders-of-magnitude speedup in practice.
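The idea of returning join answers incrementally in ranked order can be illustrated with a small sketch. It is not the any-k algorithm from the article; it is a minimal illustration that assumes a single two-way equi-join, a sum-of-weights ranking function, and comparable join keys. The function name `ranked_join` and the `(join_key, weight)` tuple layout are hypothetical choices made for this example.

```python
import heapq
from collections import defaultdict
from itertools import islice


def ranked_join(R, S):
    """Lazily yield (total, r_tuple, s_tuple) in ascending order of total weight.

    R and S are lists of (join_key, weight) pairs; the join is an equi-join on
    join_key and the ranking function is the sum of the two weights.
    """
    r_by_key, s_by_key = defaultdict(list), defaultdict(list)
    for key, w in R:
        r_by_key[key].append(w)
    for key, w in S:
        s_by_key[key].append(w)

    # Seed the heap with the cheapest answer of every joining key group.
    heap = []
    for key, rs in r_by_key.items():
        if key in s_by_key:
            rs.sort()
            s_by_key[key].sort()
            heap.append((rs[0] + s_by_key[key][0], key, 0, 0))
    heapq.heapify(heap)

    while heap:
        total, key, i, j = heapq.heappop(heap)
        rs, ss = r_by_key[key], s_by_key[key]
        yield total, (key, rs[i]), (key, ss[j])
        # Each cell (i, j) has exactly one parent, so no answer is pushed twice:
        # (i, j) comes from (i, j - 1) when j > 0, and from (i - 1, 0) otherwise.
        if j + 1 < len(ss):
            heapq.heappush(heap, (rs[i] + ss[j + 1], key, i, j + 1))
        if j == 0 and i + 1 < len(rs):
            heapq.heappush(heap, (rs[i + 1] + ss[0], key, i + 1, 0))


R = [("a", 1), ("a", 5), ("b", 2)]
S = [("a", 10), ("a", 20), ("b", 1)]
top3 = [total for total, _, _ in islice(ranked_join(R, S), 3)]
```

Because the generator is lazy, asking for only the first few answers never materializes the full join result; a join-then-rank plan would compute and sort every answer first.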
Full Text Available
DomainNet: Homograph Detection and Understanding in Data Lake Disambiguation

https://doi.org/10.1145/3612919

Leventidis, Aristotelis; Di Rocco, Laura; Gatterbauer, Wolfgang; Miller, Renée J; Riedewald, Mirek (September 2023, ACM Transactions on Database Systems)

Modern data lakes are heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: How can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management, and data science, we show that data lakes provide a new opportunity for disambiguation of data values, because tables implicitly define a massive network of interconnected values. We introduce DomainNet, which efficiently represents this network, and investigate to what extent it can be used to disambiguate values without requiring any supervision. DomainNet leverages network-centrality measures on a bipartite graph whose nodes represent data values and attributes to determine if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs achieves an F1-score of 0.38 versus 0.69 for DomainNet, which separates homographs well from data values that have a unique meaning. On a real data lake, our top-100 precision is 93%. Given a homograph, we also present a novel method for determining the number of meanings of the homograph and for assigning its data lake attributes to a meaning. We show the influence of homographs on two downstream tasks: entity-matching and domain discovery.
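The bipartite value-attribute graph described above can be sketched in a few lines. The abstract only says "network-centrality measures"; using betweenness centrality (computed here with Brandes' algorithm) is an assumption of this sketch, as are the function names `build_bipartite` and `betweenness` and the toy attribute names. It is not the DomainNet implementation.

```python
from collections import deque


def build_bipartite(columns):
    """columns maps an attribute name to the values appearing in that column.

    Returns an adjacency dict over ("attr", name) and ("val", value) nodes.
    """
    adj = {}
    for attr, values in columns.items():
        for value in values:
            a, v = ("attr", attr), ("val", value)
            adj.setdefault(a, set()).add(v)
            adj.setdefault(v, set()).add(a)
    return adj


def betweenness(adj):
    """Unnormalized betweenness centrality of an undirected, unweighted graph."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        order, preds = [], {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:  # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        for w in reversed(order):  # accumulate dependencies back toward s
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Every unordered pair was counted from both endpoints.
    return {v: c / 2 for v, c in bc.items()}


columns = {
    "animal": ["jaguar", "puma", "lion"],
    "car_brand": ["jaguar", "toyota", "ford"],
}
scores = betweenness(build_bipartite(columns))
value_scores = {n[1]: s for n, s in scores.items() if n[0] == "val"}
```

In this toy lake, "jaguar" is the only value that bridges the two otherwise unrelated attributes, so it receives a high score while values with a single meaning score zero.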
Full Text Available
Efficient Computation of Quantiles over Joins

https://doi.org/10.1145/3584372.3588670

Tziavelis, Nikolaos; Carmeli, Nofar; Gatterbauer, Wolfgang; Kimelfeld, Benny; Riedewald, Mirek (June 2023, PODS)

Full Text Available
Tractable Orders for Direct Access to Ranked Answers of Conjunctive Queries

https://doi.org/10.1145/3578517

Carmeli, Nofar; Tziavelis, Nikolaos; Gatterbauer, Wolfgang; Kimelfeld, Benny; Riedewald, Mirek (March 2023, ACM Transactions on Database Systems)

We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.
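To make the notion of ranked direct access concrete, here is a minimal sketch for one easy case: the join R(A, B) with S(B, C), with answers ordered lexicographically by (A, B, C). The query, the order, and the class name `DirectAccess` are assumptions chosen for illustration; the article's characterization covers far more general queries and orders, and this sketch does not implement it.

```python
from bisect import bisect_right
from collections import defaultdict
from itertools import accumulate


class DirectAccess:
    """k-th answer of R(A, B) JOIN S(B, C) in lexicographic (A, B, C) order."""

    def __init__(self, R, S):
        # Preprocessing: sorting dominates, so this step is quasilinear.
        self.s_by_b = defaultdict(list)
        for b, c in S:
            self.s_by_b[b].append(c)
        for cs in self.s_by_b.values():
            cs.sort()
        # Keep only R tuples that have a join partner, in (A, B) order.
        self.r = sorted((a, b) for a, b in R if b in self.s_by_b)
        # prefix[i] = number of answers produced by r[0..i].
        self.prefix = list(accumulate(len(self.s_by_b[b]) for _, b in self.r))

    def __len__(self):
        return self.prefix[-1] if self.prefix else 0

    def __getitem__(self, k):
        if not 0 <= k < len(self):
            raise IndexError(k)
        # Binary search for the R tuple whose block of answers contains k.
        i = bisect_right(self.prefix, k)
        a, b = self.r[i]
        offset = k - (self.prefix[i - 1] if i else 0)
        return (a, b, self.s_by_b[b][offset])


R = [(2, "x"), (1, "y"), (1, "x"), (3, "z")]
S = [("x", 7), ("x", 5), ("y", 9)]
da = DirectAccess(R, S)
third = da[2]
```

Each lookup costs one binary search over the prefix sums, so the k-th answer is found in logarithmic time without ever listing the answers that precede it.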
Full Text Available
SANTOS: Relationship-based Semantic Table Union Search

https://doi.org/10.1145/3588689

Khatiwada, Aamod; Fan, Grace; Shraga, Roee; Chen, Zixuan; Gatterbauer, Wolfgang; Miller, Renée J.; Riedewald, Mirek (May 2023, Proceedings of the ACM on Management of Data)

Existing techniques for unionable table search define unionability using metadata (tables must have the same or similar schemas) or column-based metrics (for example, the values in a table should be drawn from the same domain). In this work, we introduce the use of semantic relationships between pairs of columns in a table to improve the accuracy of the union search. Consequently, we introduce a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two new methods to discover the semantic relationships between pairs of columns. The first uses an existing knowledge base (KB), and the second (which we call a "synthesized KB") uses knowledge from the data lake itself. We adopt an existing Table Union Search benchmark and present new (open) benchmarks that represent small and large real data lakes. We show that our new unionability search algorithm, called SANTOS, outperforms a state-of-the-art union search that uses a wide variety of column-based semantics, including word embeddings and regular expressions. We show empirically that our synthesized KB improves the accuracy of union search by representing relationship semantics that may not be contained in an available KB. This result hints at a promising future of creating synthesized KBs from data lakes with limited KB coverage and using them for union search.
Full Text Available
Principles of Query Visualization

Gatterbauer, Wolfgang; Dunne, Cody; Jagadish, H V; Riedewald, Mirek (September 2022, IEEE Data Engineering Bulletin)

Full Text Available
Principles of Query Visualization

Gatterbauer, Wolfgang; Dunne, Cody; Jagadish, H V; Riedewald, Mirek (September 2022, A Quarterly bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering)
Roy, Sudeepa; Yang, Jun (Ed.)
Query Visualization (QV) is the problem of transforming a given query into a graphical representation that helps humans understand its meaning. This task is notably different from designing a Visual Query Language (VQL) that helps a user compose a query. This article discusses the principles of relational query visualization and its potential for simplifying user interactions with relational data.
Full Text Available
Toward Responsive DBMS: Optimal Join Algorithms, Enumeration, Factorization, Ranking, and Dynamic Programming

https://doi.org/10.1109/ICDE53745.2022.00299

Tziavelis, Nikolaos; Gatterbauer, Wolfgang; Riedewald, Mirek (May 2022, ICDE tutorials)

Full Text Available
STRATISFIMAL LAYOUT: A modular optimization model for laying out layered node-link network visualizations

https://doi.org/10.1109/TVCG.2021.3114756

di Bartolomeo, Sara; Riedewald, Mirek; Gatterbauer, Wolfgang; Dunne, Cody (January 2022, IEEE Transactions on Visualization and Computer Graphics)

Full Text Available
Beyond equi-joins: ranking, enumeration and factorization

https://doi.org/10.14778/3476249.3476306

Tziavelis, Nikolaos; Gatterbauer, Wolfgang; Riedewald, Mirek (July 2021, Proceedings of the VLDB Endowment)

We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with n denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of k, the k top-ranked answers are returned in O(n polylog n + k log k) time. This is within a polylogarithmic factor of O(n + k log k), i.e., the best known complexity for equi-joins, and even of O(n + k), i.e., the time it takes to look at the input and return k answers in any order. Our guarantees extend to join queries with selections and many types of projections (namely those called "free-connex" queries and those that use bag semantics). Remarkably, they hold even when the number of join results is n^ℓ for a join of ℓ relations. The key ingredient is a novel O(n polylog n)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first nontrivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.
Full Text Available
