Integrity constraints such as functional dependencies (FD) and multi-valueddependencies (MVD) are fundamental in database schema design. Likewise,probabilistic conditional independences (CI) are crucial for reasoning aboutmultivariate probability distributions. The implication problem studies whethera set of constraints (antecedents) implies another constraint (consequent), andhas been investigated in both the database and the AI literature, under theassumption that all constraints hold exactly. However, many applications todayconsider constraints that hold only approximately. In this paper we define anapproximate implication as a linear inequality between the degree ofsatisfaction of the antecedents and consequent, and we study the relaxationproblem: when does an exact implication relax to an approximate implication? Weuse information theory to define the degree of satisfaction, and prove severalresults. First, we show that any implication from a set of data dependencies(MVDs+FDs) can be relaxed to a simple linear inequality with a factor at mostquadratic in the number of variables; when the consequent is an FD, the factorcan be reduced to 1. Second, we prove that there exists an implication betweenCIs that does not admit any relaxation; however, we prove that everyimplication between CIs relaxes "in the limit". Then, we show that theimplication problem for differential constraints in market basket analysis alsoadmits a relaxation with a factor equal to 1. Finally, we show how some of theresults in the paper can be derived using the I-measure theory, which relatesbetween information theoretic measures and set theory. Our results recover, andsometimes extend, previously known results about the implication problem: theimplication of MVDs and FDs can be checked by considering only 2-tuplerelations.
more »
« less
Integrity Constraints Revisited: From Exact to Approximate Implication
Integrity constraints such as functional dependencies (FD), and multi-valued dependencies (MVD) are fundamental in database schema design. Likewise, probabilistic conditional independences (CI) are crucial for reasoning about multivariate probability distributions. The implication problem studies whether a set of constraints (antecedents) implies another constraint (consequent), and has been investigated in both the database and the AI literature, under the assumption that all constraints hold exactly. However, many applications today consider constraints that hold only approximately. In this paper we define an approximate implication as a linear inequality between the degree of satisfaction of the antecedents and consequent, and we study the relaxation problem: when does an exact implication relax to an approximate implication? We use information theory to define the degree of satisfaction, and prove several results. First, we show that any implication from a set of data dependencies (MVDs+FDs) can be relaxed to a simple linear inequality with a factor at most quadratic in the number of variables; when the consequent is an FD, the factor can be reduced to 1. Second, we prove that there exists an implication between CIs that does not admit any relaxation; however, we prove that every implication between CIs relaxes “in the limit”. Finally, we show that the implication problem for differential constraints in market basket analysis also admits a relaxation with a factor equal to 1. Our results recover, and sometimes extend, several previously known results about the implication problem: implication of MVDs can be checked by considering only 2-tuple relations, and the implication of differential constraints for frequent item sets can be checked by considering only databases containing a single transaction.
more »
« less
- Award ID(s):
- 1907997
- PAR ID:
- 10164543
- Date Published:
- Journal Name:
- International Conference on Database Theory
- Page Range / eLocation ID:
- 18:1-18:20
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Megow, Nicole; Smith, Adam (Ed.)We study the mixing time of the single-site update Markov chain, known as the Glauber dynamics, for generating a random independent set of a tree. Our focus is obtaining optimal convergence results for arbitrary trees. We consider the more general problem of sampling from the Gibbs distribution in the hard-core model where independent sets are weighted by a parameter λ > 0; the special case λ = 1 corresponds to the uniform distribution over all independent sets. Previous work of Martinelli, Sinclair and Weitz (2004) obtained optimal mixing time bounds for the complete Δ-regular tree for all λ. However, Restrepo et al. (2014) showed that for sufficiently large λ there are bounded-degree trees where optimal mixing does not hold. Recent work of Eppstein and Frishberg (2022) proved a polynomial mixing time bound for the Glauber dynamics for arbitrary trees, and more generally for graphs of bounded tree-width. We establish an optimal bound on the relaxation time (i.e., inverse spectral gap) of O(n) for the Glauber dynamics for unweighted independent sets on arbitrary trees. Moreover, for λ ≤ .44 we prove an optimal mixing time bound of O(n log n). We stress that our results hold for arbitrary trees and there is no dependence on the maximum degree Δ. Interestingly, our results extend (far) beyond the uniqueness threshold which is on the order λ = O(1/Δ). Our proof approach is inspired by recent work on spectral independence. In fact, we prove that spectral independence holds with a constant independent of the maximum degree for any tree, but this does not imply mixing for general trees as the optimal mixing results of Chen, Liu, and Vigoda (2021) only apply for bounded degree graphs. We instead utilize the combinatorial nature of independent sets to directly prove approximate tensorization of variance/entropy via a non-trivial inductive proof.more » « less
-
We consider semidefinite programs (SDPs) of size n with equality constraints. In order to overcome scalability issues, Burer and Monteiro proposed a factorized approach based on optimizing over a matrix Y of size nk such that X = Y Y is the SDP variable. The advantages of such formulation are twofold: the dimension of the optimization variable is reduced, and positive semidefiniteness is naturally enforced. However, optimization in Y is non-convex. In prior work, it has been shown that, when the constraints on the factorized variable regularly define a smooth manifold, provided k is large enough, for almost all cost matrices, all second-order stationary points (SOSPs) are optimal. Importantly, in practice, one can only compute points which approximately satisfy necessary optimality conditions, leading to the question: are such points also approximately optimal? To answer it, under similar assumptions, we use smoothed analysis to show that approximate SOSPs for a randomly perturbed objective function are approximate global optima, with k scaling like the square root of the number of constraints (up to log factors). Moreover, we bound the optimality gap at the approximate solution of the perturbed problem with respect to the original problem. We particularize our results to an SDP relaxation of phase retrieval.more » « less
-
Nikhil, Bansal; Nagarajan, Viswanath (Ed.)We initiate a study of the streaming complexity of constraint satisfaction problems (CSPs) when the constraints arrive in a random order. We show that there exists a CSP, namely Max-DICUT, for which random ordering makes a provable difference. Whereas a 4/9 ≈ 0.445 approximation of DICUT requires space with adversarial ordering, we show that with random ordering of constraints there exists a 0.483-approximation algorithm that only needs O(log n) space. We also give new algorithms for Max-DICUT in variants of the adversarial ordering setting. Specifically, we give a two-pass O(log n) space 0.483-approximation algorithm for general graphs and a single-pass space 0.483-approximation algorithm for bounded-degree graphs. On the negative side, we prove that CSPs where the satisfying assignments of the constraints support a one-wise independent distribution require -space for any non-trivial approximation, even when the constraints are randomly ordered. This was previously known only for adversarially ordered constraints. Extending the results to randomly ordered constraints requires switching the hard instances from a union of random matchings to simple Erdős-Renyi random (hyper)graphs and extending tools that can perform Fourier analysis on such instances. The only CSP to have been considered previously with random ordering is Max-CUT where the ordering is not known to change the approximability. Specifically it is known to be as hard to approximate with random ordering as with adversarial ordering, for space algorithms. Our results show a richer variety of possibilities and motivate further study of CSPs with randomly ordered constraints.more » « less
-
Leonardi, Stefano; Gupta, Anupam (Ed.)We consider the approximability of constraint satisfaction problems in the streaming setting. For every constraint satisfaction problem (CSP) on n variables taking values in {0,…,q−1}, we prove that improving over the trivial approximability by a factor of q requires Ω(n) space even on instances with O(n) constraints. We also identify a broad subclass of problems for which any improvement over the trivial approximability requires Ω(n) space. The key technical core is an optimal, q−(k−1)-inapproximability for the Max k-LIN-mod q problem, which is the Max CSP problem where every constraint is given by a system of k−1 linear equations mod q over k variables. Our work builds on and extends the breakthrough work of Kapralov and Krachun (Proc. STOC 2019) who showed a linear lower bound on any non-trivial approximation of the MaxCut problem in graphs. MaxCut corresponds roughly to the case of Max k-LIN-mod q with k=q=2. For general CSPs in the streaming setting, prior results only yielded Ω(√n) space bounds. In particular no linear space lower bound was known for an approximation factor less than 1/2 for any CSP. Extending the work of Kapralov and Krachun to Max k-LIN-mod q to k>2 and q>2 (while getting optimal hardness results) is the main technical contribution of this work. Each one of these extensions provides non-trivial technical challenges that we overcome in this work.more » « less
An official website of the United States government

