

Title: Output-Sensitive Evaluation of Regular Path Queries
We study the classical evaluation problem for regular path queries: Given an edge-labeled graph and a regular path query, compute the set of pairs of vertices that are connected by paths that match the query. The Product Graph (PG) is the established evaluation approach for regular path queries. PG first constructs the product automaton of the data graph and the query and then uses breadth-first search to find the accepting states reachable from each initial state in the product automaton. Its data complexity is O(|V|⋅|E|), where V and E are the sets of vertices and edges, respectively, in the data graph. This complexity cannot be improved by combinatorial algorithms. In this paper, we introduce OSPG, an output-sensitive refinement of PG, whose data complexity is O(|E|^{3/2} + min(OUT⋅√|E|, |V|⋅|E|)), where OUT is the number of distinct vertex pairs in the query output. OSPG's complexity is at most that of PG and can be asymptotically smaller for small output and sparse input. OSPG improves over PG by avoiding the time that PG wastes in the breadth-first search phase when only a few output pairs are eventually discovered. For queries without Kleene star, the complexity of OSPG can be further improved to O(|E| + |E|⋅√OUT).
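The PG baseline described in the abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: it shows only the baseline PG, not OSPG. The function name `evaluate_rpq`, the triple encoding of edges and automaton transitions, and the example data are all assumptions made here. The query is assumed to be already compiled into an NFA without epsilon transitions, and the product graph is explored lazily rather than materialized up front.

```python
from collections import defaultdict, deque


def evaluate_rpq(edges, nfa_transitions, start_state, accepting_states):
    """Return all pairs (u, v) joined by a path whose labels the NFA accepts.

    edges: iterable of (source, label, target) triples (hypothetical encoding).
    nfa_transitions: iterable of (state, label, next_state) triples.
    """
    # Data graph adjacency: vertex -> list of (label, target vertex).
    graph = defaultdict(list)
    vertices = set()
    for u, label, v in edges:
        graph[u].append((label, v))
        vertices.update((u, v))

    # NFA transition relation: (state, label) -> set of successor states.
    delta = defaultdict(set)
    for q, label, r in nfa_transitions:
        delta[(q, label)].add(r)

    result = set()
    for source in vertices:
        # One breadth-first search per source vertex over product states
        # (data vertex, automaton state).
        start = (source, start_state)
        seen = {start}
        queue = deque([start])
        while queue:
            v, q = queue.popleft()
            if q in accepting_states:
                result.add((source, v))
            for label, w in graph[v]:
                for r in delta.get((q, label), ()):
                    if (w, r) not in seen:
                        seen.add((w, r))
                        queue.append((w, r))
    return result


# Hypothetical example: the query a b* as a two-state NFA.
example_edges = [(1, "a", 2), (2, "b", 3), (3, "b", 4), (5, "a", 3)]
example_nfa = [(0, "a", 1), (1, "b", 1)]
pairs = evaluate_rpq(example_edges, example_nfa, 0, {1})
```

For a fixed query, each search visits O(|E|) product edges, and there is one search per vertex, which gives the O(|V|⋅|E|) data complexity stated above. The sketch pays this cost even when the searches discover very few output pairs, which is the inefficiency that the abstract says OSPG targets.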
Award ID(s):
2314527 2312195
PAR ID:
10627044
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM SIGMOD/PODS
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data
Volume:
3
Issue:
2
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 20
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We study the dynamic query evaluation problem: Given a full conjunctive query Q and a sequence of updates to the input database, we construct a data structure that supports constant-delay enumeration of the tuples in the query output after each update. We show that a sequence of N insert-only updates to an initially empty database can be executed in total time O(N^{w(Q)}), where w(Q) is the fractional hypertree width of Q. This matches the complexity of the static query evaluation problem for Q and a database of size N. One corollary is that the amortized time per single-tuple insert is constant for acyclic full conjunctive queries. In contrast, we show that a sequence of N inserts and deletes can be executed in total time Õ(N^{w(Q')}), where Q' is obtained from Q by extending every relational atom with extra variables that represent the lifespans of tuples in the database. We show that this reduction is optimal in the sense that the static evaluation runtime of Q' provides a lower bound on the total update time for the output of Q. Our approach achieves amortized optimal update times for the hierarchical and Loomis-Whitney join queries.
  2. Consider the following stochastic matching problem. We are given a known graph G = (V, E). An unknown subgraph G_p = (V, E_p) is realized where E_p includes every edge of E independently with some probability p ∈ (0, 1]. The goal is to query a sparse subgraph H of G, such that the realized edges in H include an approximate maximum matching of G_p. This problem has been studied extensively over the last decade due to its applications in kidney exchange, online dating, and online labor markets. For any fixed ε > 0, [BDH STOC’20] showed that any graph G has a subgraph H with (1/p) = (1/p)(log(1/p)) maximum degree, achieving a (1−ε)-approximation. A major open question is the best approximation achievable with (1/p)-degree subgraphs. A long line of work has progressively improved the approximation in the (1/p)-degree regime from .5 [BDH+ EC’15] to .501 [AKL EC’17], .656 [BHFR SODA’19], .666 [AB SOSA’19], .731 [BBD SODA’22] (bipartite graphs), and most recently to .68 [DS ’24]. In this work, we show that a (1/p)-degree subgraph can obtain a (1−ε)-approximation for any desirably small fixed ε > 0, achieving the best of both worlds. Beyond its quantitative improvement, a key conceptual contribution of our work is to connect local computation algorithms (LCAs) to the stochastic matching problem for the first time. While prior work on LCAs mainly focuses on their out-queries (the number of vertices probed to produce the output of a given vertex), our analysis also bounds the in-queries (the number of vertices that probe a given vertex). We prove that the outputs of LCAs with bounded in- and out-queries (in-n-out LCAs for short) have limited correlation, a property that our analysis crucially relies on and might find applications beyond stochastic matchings.
  3. Graph processing frameworks are typically designed to optimize the evaluation of a single graph query. However, in practice, we often need to respond to multiple graph queries, either from different users or from a single user performing a complex analytics task. Therefore, in this paper we develop SimGQ, a system that optimizes simultaneous evaluation of a group of vertex queries that originate at different source vertices (e.g., multiple shortest path queries originating at different source vertices) and delivers substantial speedups over a conventional framework that evaluates and responds to queries one by one. The performance benefits are achieved via batching and sharing. Batching fully utilizes system resources to evaluate a batch of queries and amortizes runtime overheads incurred due to fetching vertices and edge lists, synchronizing threads, and maintaining computation frontiers. Sharing dynamically identifies shared queries that substantially represent subcomputations in the evaluation of different queries in a batch, evaluates the shared queries, and then uses their results to accelerate the evaluation of all queries in the batch. With four input power-law graphs and four graph algorithms, SimGQ achieves speedups of up to 45.67× with batch sizes of up to 512 queries over the baseline implementation that evaluates the queries one by one using the state-of-the-art Ligra system. Moreover, both batching and sharing contribute substantially to the speedups.
  4. Data summarization is a powerful approach to deal with large-scale data analytics, which has wide applications in web search, recommendation systems, approximate query processing, etc. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem of conjunctive query results, i.e., computing a k-size subset of a conjunctive query output, for any given k > 0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a tuple in the query result tuples and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms on top of these materialized tuples suffers from high computational complexity because the query output can be large, e.g., for a relational database of N tuples, the number of result tuples can be N^{O(1)}. We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N⋅k^{O(1)}), or even O(N + k^{O(1)}) in some cases, without computing all result tuples. We also propose the first efficient (2+ε)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that enable us to access specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute desired summaries with few invocations of these oracles.
  5. Community detection is considered for a stochastic block model graph of n vertices, with K vertices in the planted community, edge probability p for pairs of vertices both in the community, and edge probability q for other pairs of vertices. The main focus of the paper is on weak recovery of the community based on the graph G, with o(K) misclassified vertices on average, in the sublinear regime n^{1−o(1)} ≤ K ≤ o(n). A critical parameter is the effective signal-to-noise ratio λ = K^2 (p − q)^2 / ((n − K) q), with λ = 1 corresponding to the Kesten–Stigum threshold. We show that a belief propagation (BP) algorithm achieves weak recovery if λ > 1/e, beyond the Kesten–Stigum threshold by a factor of 1/e. The BP algorithm only needs to run for log* n + O(1) iterations, with the total time complexity O(|E| log* n), where log* n is the iterated logarithm of n. Conversely, if λ ≤ 1/e, no local algorithm can asymptotically outperform trivial random guessing. Furthermore, a linear message-passing algorithm that corresponds to applying a power iteration to the nonbacktracking matrix of the graph is shown to attain weak recovery if and only if λ > 1. In addition, the BP algorithm can be combined with a linear-time voting procedure to achieve the information limit of exact recovery (correctly classify all vertices with high probability) for all K ≥ (n / log n)(ρ_BP + o(1)), where ρ_BP is a function of p/q.