skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Query Refinement for Diversity Constraint Satisfaction
Diversity, group representation, and similar needs often apply to query results, which in turn require constraints on the sizes of various subgroups in the result set. Traditional relational queries only specify conditions as part of the query predicate(s), and do not support such restrictions on the output. In this paper, we study the problem of modifying queries to have the result satisfy constraints on the sizes of multiple subgroups in it. This problem, in the worst case, cannot be solved in polynomial time. Yet, with the help of provenance annotation, we are able to develop a query refinement method that works quite efficiently, as we demonstrate through extensive experiments.  more » « less
Award ID(s):
2106176
PAR ID:
10482056
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
VLDB
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
17
Issue:
2
ISSN:
2150-8097
Page Range / eLocation ID:
106 to 118
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalty. Recently, upper bounds have been proposed that are based on information inequalities and incorporate sizes and max-degrees from input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of the upper bounds, by incorporating lp-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory, they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are ''simple''. 
    more » « less
  2. Answering complex logical queries on large-scale incomplete knowledge graphs (KGs) is a fundamental yet challenging task. Recently, a promising approach to this problem has been to embed KG entities as well as the query into a vector space such that entities that answer the query are embedded close to the query. However, prior work models queries as single points in the vector space, which is problematic because a complex query represents a potentially large set of its answer entities, but it is unclear how such a set can be represented as a single point. Furthermore, prior work can only handle queries that use conjunctions (^) and existential quantifiers (9). Handling queries with logical disjunctions (_) remains an open problem. Here we propose QUERY2BOX, an embedding-based framework for reasoning over arbitrary queries with ^, _, and 9 operators in massive and incomplete KGs. Our main insight is that queries can be embedded as boxes (i.e., hyper-rectangles), where a set of points inside the box corresponds to a set of answer entities of the query. We show that conjunctions can be naturally represented as intersections of boxes and also prove a negative result that handling disjunctions would require embedding with dimension proportional to the number of KG entities. However, we show that by transforming queries into a Disjunctive Normal Form, QUERY2BOX is capable of handling arbitrary logical queries with ^, _, 9 in a scalable manner. We demonstrate the effectiveness of QUERY2BOX on three large KGs and show that QUERY2BOX achieves up to 25% relative improvement over the state of the art. 
    more » « less
  3. In this paper we investigate the problem of quantifying the contribution of each variable to the satisfying assignments of a Boolean function based on the Shapley value. Our main result is a polynomial-time equivalence between computing Shapley values and model counting for any class of Boolean functions that are closed under substitutions of variables with disjunctions of fresh variables. This result settles an open problem raised in prior work, which sought to connect the Shapley value computation to probabilistic query evaluation. We show two applications of our result. First, the Shapley values can be computed in polynomial time over deterministic and decomposable circuits, since they are closed under OR-substitutions. Second, there is a polynomial-time equivalence between computing the Shapley value for the tuples contributing to the answer of a Boolean conjunctive query and counting the models in the lineage of the query. This equivalence allows us to immediately recover the dichotomy for Shapley value computation in case of self-join-free Boolean conjunctive queries; in particular, the hardness for non-hierarchical queries can now be shown using a simple reduction from the \#P-hard problem of model counting for lineage in positive bipartite disjunctive normal form. 
    more » « less
  4. Given a self-join-free conjunctive queryQand a set of tuplesS, asynthetic witness Dis a database instance such that the result ofQonDisS. In this work, we are interested in two problems. First, the existence problem ESW decides whether any synthetic witnessDexists. Second, given that a synthetic witness exists, the minimization problem SSW computes a synthetic witness of minimal size. The SSW problem is related to thesmallest witness problemrecently studied by Hu and Sintos [22]; however, the objective and the results are inherently different. More specifically, we show that SSW is poly-time solvable for a wider range of queries. Interestingly, in some cases, SSW is related to optimization problems in other domains, such as therole miningproblem in data mining and theedge concentrationproblem in graph drawing. Solutions to ESW and SSW are of practical interest, e.g., fortest database generationfor applications accessing a database and fordata compressionby encoding a datasetSas a pair of a queryQand databaseD. We prove that ESW is in P, presenting a simple algorithm that, given anyS, decides whether a synthetic witness exists in polynomial time in the size ofS. Next, we focus on the SSW problem. We show an algorithm that computes a minimal synthetic witness in polynomial time with respect to the size ofSfor any queryQthat has thehead-dominationproperty. IfQdoes not have such a property, then SSW is generally hard. More specifically, we show that for the class ofpath queries(of any constant length), SSW cannot be solved in polynomial time unless P = NP. We then extend this hardness result to the class ofBerge-acyclicqueries that do not have the head-domination property, obtaining a full dichotomy of SSW for Berge-acyclic queries. Finally, we investigate the hardness of SSW beyond Berge-acyclic queries by showing that SSW cannot be solved in polynomial time for some cyclic queries unless P = NP. 
    more » « less
  5. Developing techniques to infer the behavior of networked social systems has attracted a lot of attention in the literature. Using a discrete dynamical system to model a networked social system, the problem of inferring the behavior of the system can be formulated as the problem of learning the local functions of the dynamical system. We investigate the problem assuming an active form of interaction with the system through queries. We consider two classes of local functions (namely, symmetric and threshold functions) and two interaction modes, namely batch (where all the queries must be submitted together) and adaptive (where the set of queries submitted at a stage may rely on the answers to previous queries). We establish bounds on the number of queries under both batch and adaptive query modes using vertex coloring and probabilistic methods. Our results show that a small number of appropriately chosen queries are provably sufficient to correctly learn all the local functions. We develop complexity results which suggest that, in general, the problem of generating query sets of minimum size is computationally intractable. We present efficient heuristics that produce query sets under both batch and adaptive query modes. Also, we present a query compaction algorithm that identifies and removes redundant queries from a given query set. Our algorithms were evaluated through experiments on over 20 well-known networks. 
    more » « less