Title: Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and that of the consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes "in the limit". Then, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Finally, we show how some of the results in the paper can be derived using the I-measure theory, which relates information-theoretic measures to set theory. Our results recover, and sometimes extend, previously known results about the implication problem: the implication of MVDs and FDs can be checked by considering only 2-tuple relations.
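To make the abstract's central definition concrete: an FD X → Y holds exactly in a distribution if and only if the conditional entropy H(Y | X) equals 0, and an MVD or CI holds exactly if and only if the corresponding conditional mutual information equals 0, so these nonnegative quantities serve as degrees of satisfaction. A sketch of the relaxation's general shape, writing h(·) for the degree of satisfaction (the notation follows the abstract's description, not necessarily the paper's exact conventions):

\[
  h(X \to Y) = H(Y \mid X), \qquad h(Y \perp Z \mid X) = I(Y; Z \mid X),
\]
\[
  \sigma_1, \ldots, \sigma_k \Rightarrow \tau \ \text{relaxes with factor } \lambda
  \quad\text{iff}\quad
  h(\tau) \;\le\; \lambda \cdot \bigl( h(\sigma_1) + \cdots + h(\sigma_k) \bigr)
  \ \text{for all distributions.}
\]

The results above bound λ by a quadratic in the number of variables for MVDs+FDs, and by 1 when the consequent τ is an FD.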
Award ID(s):
2109922
PAR ID:
10340956
Author(s) / Creator(s):
Date Published:
Journal Name:
Logical Methods in Computer Science
Volume:
Volume 18, Issue 1
ISSN:
1860-5974
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Integrity constraints such as functional dependencies (FD) and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and that of the consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes “in the limit”. Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction.
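    The degrees of satisfaction in this abstract can be computed directly from a relation's empirical distribution. A minimal self-contained sketch (the relation, the attribute encoding by column position, and the helper names are illustrative, not from the paper):

    import math
    from collections import Counter

    def H(rows, attrs):
        """Empirical joint entropy (in bits) of the given column positions,
        treating each tuple of the relation as equally likely."""
        n = len(rows)
        counts = Counter(tuple(r[a] for a in attrs) for r in rows)
        return -sum(c / n * math.log2(c / n) for c in counts.values())

    def fd_degree(rows, X, Y):
        """Degree of satisfaction of the FD X -> Y, i.e. H(Y|X) = H(XY) - H(X);
        it is 0 exactly when the FD holds."""
        return H(rows, X + Y) - H(rows, X)

    def mvd_degree(rows, X, Y, Z):
        """Degree of satisfaction of the MVD X ->> Y (Z = remaining attributes),
        i.e. I(Y;Z|X) = H(XY) + H(XZ) - H(XYZ) - H(X); 0 exactly when it holds."""
        return H(rows, X + Y) + H(rows, X + Z) - H(rows, X + Y + Z) - H(rows, X)

    # Relation over attributes (A, B, C), encoded as column positions 0, 1, 2.
    R = [(0, 0, 0), (0, 0, 1), (1, 0, 0), (1, 1, 1)]
    print(fd_degree(R, [0], [1]))        # 0.5 > 0: FD A -> B holds only approximately
    print(mvd_degree(R, [0], [1], [2]))  # 0.5 > 0: MVD A ->> B holds only approximately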
  2. What is a minimal set of tuples to delete from a database in order to eliminate all query answers? This problem is called the resilience of a query and is one of the key algorithmic problems underlying various forms of reverse data management, such as view maintenance, deletion propagation and causal responsibility. A long-open question is determining the conjunctive queries (CQs) for which resilience can be solved in PTIME. We shed new light on this problem by proposing a unified Integer Linear Programming (ILP) formulation. It is unified in that it can solve both previously studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (all CQs under set or bag semantics). It is also unified in that all queries and all database instances are treated with the same approach, yet the algorithm is guaranteed to terminate in PTIME for all known PTIME cases. In particular, we prove that for all known easy cases, the optimal solution to our ILP is identical to a simpler Linear Programming (LP) relaxation, which implies that standard ILP solvers return the optimal solution to the original ILP in PTIME. Our approach allows us to explore new variants and obtain new complexity results. 1) It works under bag semantics, for which we give the first dichotomy results in the problem space. 2) We extend our approach to the related problem of causal responsibility and give a more fine-grained analysis of its complexity. 3) We recover easy instances for generally hard queries, including instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture about a unified hardness criterion from PODS 2020 and prove the hardness of several queries of previously unknown complexity. 5) Experiments confirm that our findings accurately predict the asymptotic running times, and that our universal ILP is at times even quicker than a previously proposed dedicated flow algorithm. 
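    The flavor of the natural ILP can be seen on a toy instance: one binary deletion variable per database tuple, and one covering constraint per witness (join combination) of the query. A sketch using the PuLP library (the query q :- R(x), S(x,y), T(y), the data, and this plain hitting-set formulation are illustrative choices; the paper's unified ILP is more refined):

    from pulp import LpProblem, LpMinimize, LpVariable, lpSum, value

    # Toy database for the query q :- R(x), S(x, y), T(y).
    R = {(1,), (2,)}
    S = {(1, 10), (2, 10)}
    T = {(10,)}

    # One 0/1 variable per tuple: value 1 means "delete this tuple".
    tuples = [("R", t) for t in R] + [("S", t) for t in S] + [("T", t) for t in T]
    x = {t: LpVariable(f"x{i}", cat="Binary") for i, t in enumerate(tuples)}

    prob = LpProblem("resilience", LpMinimize)
    prob += lpSum(x.values())  # minimize the number of deleted tuples

    # Every witness must lose at least one of its contributing tuples.
    for r in R:
        for s in S:
            for t in T:
                if r[0] == s[0] and s[1] == t[0]:
                    prob += x[("R", r)] + x[("S", s)] + x[("T", t)] >= 1

    prob.solve()
    print(int(value(prob.objective)))  # 1: deleting T(10) eliminates all answers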
  3. We study the question of when we can provide direct access to the k-th answer to a Conjunctive Query (CQ) according to a specified order over the answers in time logarithmic in the size of the database, following a preprocessing step that constructs a data structure in time quasilinear in database size. Specifically, we embark on the challenge of identifying the tractable answer orderings, that is, those orders that allow for such complexity guarantees. To better understand the computational challenge at hand, we also investigate the more modest task of providing access to only a single answer (i.e., finding the answer at a given position), a task that we refer to as the selection problem, and ask when it can be performed in quasilinear time. We also explore the question of when selection is indeed easier than ranked direct access. We begin with lexicographic orders. For each of the two problems, we give a decidable characterization (under conventional complexity assumptions) of the class of tractable lexicographic orders for every CQ without self-joins. We then continue to the more general orders by the sum of attribute weights and establish the corresponding decidable characterizations, for each of the two problems, of the tractable CQs without self-joins. Finally, we explore the question of when the satisfaction of Functional Dependencies (FDs) can be utilized for tractability and establish the corresponding generalizations of our characterizations for every set of unary FDs.
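    As a toy illustration of ranked direct access (not the paper's construction: the query q(x, y) :- R(x), S(x, y) and the prefix-sum scheme below handle just one join step): after sorting and building prefix counts in quasilinear time, the k-th answer in lexicographic (x, y) order is returned with a single binary search.

    from bisect import bisect_right
    from itertools import accumulate

    R = [1, 2, 3]
    S = {1: [10, 20], 2: [30], 3: [40, 50, 60]}

    # Quasilinear preprocessing: sort, then prefix counts of answers per x.
    xs = sorted(R)
    for v in xs:
        S[v].sort()
    prefix = list(accumulate((len(S[v]) for v in xs), initial=0))

    def kth_answer(k):
        """0-indexed ranked direct access in O(log |R|) per call."""
        i = bisect_right(prefix, k) - 1   # which x-block the k-th answer is in
        return xs[i], S[xs[i]][k - prefix[i]]

    print([kth_answer(k) for k in range(prefix[-1])])
    # [(1, 10), (1, 20), (2, 30), (3, 40), (3, 50), (3, 60)]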
  4. Megow, Nicole; Smith, Adam (Eds.)
    We study the mixing time of the single-site update Markov chain, known as the Glauber dynamics, for generating a random independent set of a tree. Our focus is obtaining optimal convergence results for arbitrary trees. We consider the more general problem of sampling from the Gibbs distribution in the hard-core model where independent sets are weighted by a parameter λ > 0; the special case λ = 1 corresponds to the uniform distribution over all independent sets. Previous work of Martinelli, Sinclair and Weitz (2004) obtained optimal mixing time bounds for the complete Δ-regular tree for all λ. However, Restrepo et al. (2014) showed that for sufficiently large λ there are bounded-degree trees where optimal mixing does not hold. Recent work of Eppstein and Frishberg (2022) proved a polynomial mixing time bound for the Glauber dynamics for arbitrary trees, and more generally for graphs of bounded tree-width. We establish an optimal bound on the relaxation time (i.e., inverse spectral gap) of O(n) for the Glauber dynamics for unweighted independent sets on arbitrary trees. Moreover, for λ ≤ .44 we prove an optimal mixing time bound of O(n log n). We stress that our results hold for arbitrary trees and there is no dependence on the maximum degree Δ. Interestingly, our results extend (far) beyond the uniqueness threshold which is on the order λ = O(1/Δ). Our proof approach is inspired by recent work on spectral independence. In fact, we prove that spectral independence holds with a constant independent of the maximum degree for any tree, but this does not imply mixing for general trees as the optimal mixing results of Chen, Liu, and Vigoda (2021) only apply for bounded degree graphs. We instead utilize the combinatorial nature of independent sets to directly prove approximate tensorization of variance/entropy via a non-trivial inductive proof. 
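    A minimal sketch of the chain being analyzed, the single-site Glauber dynamics for the hard-core model (the tree, λ, seed, and step count are arbitrary illustrative choices):

    import random

    def glauber_step(adj, config, lam, rng):
        """One update: pick a uniformly random vertex and resample it from the
        hard-core conditional distribution given its neighbors' current spins."""
        v = rng.randrange(len(adj))
        if any(config[u] for u in adj[v]):
            config[v] = 0  # some neighbor is occupied: v must stay unoccupied
        else:
            config[v] = 1 if rng.random() < lam / (1 + lam) else 0

    # A small tree on 7 vertices, given by adjacency lists.
    adj = {0: [1, 2], 1: [0, 3, 4], 2: [0, 5, 6], 3: [1], 4: [1], 5: [2], 6: [2]}
    rng = random.Random(0)
    config = [0] * len(adj)                  # start from the empty independent set
    for _ in range(10_000):
        glauber_step(adj, config, 1.0, rng)  # λ = 1: uniform over independent sets
    print(config)                            # an (approximately random) independent set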
  5. Braverman, Mark (Ed.)
    Grothendieck’s inequality [Grothendieck, 1953] states that there is an absolute constant K > 1 such that for any n × n matrix A, ‖A‖_{∞→1} := max_{s,t ∈ {±1}ⁿ} ∑_{i,j} A[i,j]⋅s(i)⋅t(j) ≥ 1/K ⋅ max_{u_i,v_j ∈ S^{n-1}} ∑_{i,j} A[i,j]⋅⟨u_i,v_j⟩. In addition to having a tremendous impact on Banach space theory, this inequality has found applications in several unrelated fields like quantum information, regularity partitioning, communication complexity, etc. Let K_G (known as Grothendieck’s constant) denote the smallest constant K above. Grothendieck’s inequality implies that a natural semidefinite programming relaxation obtains a constant-factor approximation to ‖A‖_{∞→1}. The exact value of K_G is still unknown, with the best lower bound (1.67…) due to Reeds and the best upper bound (1.78…) due to Braverman, Makarychev, Makarychev and Naor [Braverman et al., 2013]. In contrast, the little Grothendieck inequality states that under the assumption that A is PSD, the constant K above can be improved to π/2, and moreover this is tight. The inapproximability of ‖A‖_{∞→1} has been studied in several papers, culminating in a tight UGC-based hardness result due to Raghavendra and Steurer (remarkably, they achieve this without knowing the value of K_G). Briët, Regev and Saket [Briët et al., 2015] proved tight NP-hardness of approximating the little Grothendieck problem within π/2, based on a framework by Guruswami, Raghavendra, Saket and Wu [Guruswami et al., 2016] for bypassing UGC for geometric problems. This also remained the best known NP-hardness for the general Grothendieck problem due to the nature of the Guruswami et al. framework, which utilized a projection operator onto the degree-1 Fourier coefficients of long code encodings, which naturally yielded a PSD matrix A. We show how to extend the above framework to go beyond the degree-1 Fourier coefficients, using the global structure of optimal solutions to the Grothendieck problem. As a result, we obtain a separation between the NP-hardness results for the two problems, obtaining an inapproximability result for the Grothendieck problem of a factor π/2 + ε₀ for a fixed constant ε₀ > 0.
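    For small matrices, ‖A‖_{∞→1} can be evaluated by brute force over the sign vectors s, t (an illustrative sketch only; the enumeration is exponential in n, and the example matrix is arbitrary):

    from itertools import product

    def inf_to_one_norm(A):
        """Brute-force ||A||_{inf->1}: the maximum over s, t in {±1}^n of
        sum_{i,j} A[i][j] * s[i] * t[j]."""
        n = len(A)
        return max(
            sum(A[i][j] * s[i] * t[j] for i in range(n) for j in range(n))
            for s in product((-1, 1), repeat=n)
            for t in product((-1, 1), repeat=n)
        )

    A = [[1, -2], [-3, 4]]
    print(inf_to_one_norm(A))  # 10, attained at s = (-1, 1), t = (-1, 1)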