Title: Scaling Package Queries to a Billion Tuples via Hierarchical Partitioning and Customized Optimization
A package query returns a package---a multiset of tuples---that maximizes or minimizes a linear objective function subject to linear constraints, thereby enabling in-database decision support. Prior work has established the equivalence of package queries to Integer Linear Programs (ILPs) and developed the SketchRefine algorithm for package query processing. While this algorithm was an important first step toward supporting prescriptive analytics scalably inside a relational database, it struggles when the data size grows beyond a few hundred million tuples or when the constraints become very tight. In this paper, we present Progressive Shading, a novel algorithm for processing package queries that can scale efficiently to billions of tuples and gracefully handle tight constraints. Progressive Shading solves a sequence of optimization problems over a hierarchy of relations, each resulting from an ever-finer partitioning of the original tuples into homogeneous groups until the original relation is obtained. This strategy avoids the premature discarding of high-quality tuples that can occur with SketchRefine. Our novel partitioning scheme, Dynamic Low Variance, can handle very large relations with multiple attributes and can dynamically adapt to both concentrated and spread-out sets of attribute values, provably outperforming traditional partitioning schemes such as the kd-tree. We further optimize our system by replacing our off-the-shelf optimization software with customized ILP and LP solvers, called Dual Reducer and Parallel Dual Simplex, respectively, that are highly accurate and orders of magnitude faster.
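As background for the ILP equivalence noted in the abstract, a package query over a relation of n tuples is commonly written as the following integer program, in which a decision variable x_i gives the multiplicity of tuple i in the package. This is a generic textbook formulation; the exact encoding used by Progressive Shading may differ in its details.

```latex
% Package query as an ILP: x_i = multiplicity of tuple i in the package.
\begin{align*}
  \text{maximize}   \quad & \sum_{i=1}^{n} c_i \, x_i
      && \text{(linear objective over tuple attributes)} \\
  \text{subject to} \quad & \sum_{i=1}^{n} a_{ji} \, x_i \le b_j
      && \text{for each linear constraint } j, \\
                          & x_i \in \mathbb{Z}_{\ge 0}
      && \text{for } i = 1, \dots, n.
\end{align*}
```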
Award ID(s):
2211918, 1943971
PAR ID:
10539930
Publisher / Repository:
PVLDB
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
17
Issue:
5
ISSN:
2150-8097
Page Range / eLocation ID:
1146 to 1158
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Estimating the cardinality of the output of a query is a fundamental problem in database query processing. In this article, we give an overview of a recently published contribution that casts the cardinality estimation problem as linear optimization and computes guaranteed upper bounds on the cardinality of the output for any full conjunctive query. The objective of the linear program is to maximize the joint entropy of the query variables, and its constraints are the Shannon information inequalities and new information inequalities involving ℓp-norms of the degree sequences of the join attributes. The bounds based on arbitrary norms can be asymptotically lower than those based on the ℓ1 and ℓ∞ norms, which capture the cardinalities and the max-degrees of the input relations, respectively. They come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when each degree sequence is on one join attribute.
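To give the flavor of such entropic arguments on a standard example (ours, not the article's): for the triangle query, the Shannon inequalities combined with the ℓ1-norm (cardinality) constraints already recover the well-known AGM bound.

```latex
% Triangle query Q(x,y,z) :- R(x,y), S(y,z), T(z,x).
% Submodularity (a Shannon inequality) gives
%   2\,h(XYZ) \le h(XY) + h(YZ) + h(ZX),
% and with h(XY) \le \log|R|,\ h(YZ) \le \log|S|,\ h(ZX) \le \log|T|,
% maximizing the joint entropy h(XYZ) yields
|Q| \;\le\; \sqrt{|R| \cdot |S| \cdot |T|}.
```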
  2. Decision making under uncertainty often requires choosing packages, or bags of tuples, that collectively optimize expected outcomes while limiting risks. Processing Stochastic Package Queries (SPQs) involves solving very large optimization problems on uncertain data. Monte Carlo methods create numerous scenarios, or sample realizations of the stochastic attributes of all the tuples, and generate packages with optimal objective values across these scenarios. The number of scenarios needed for accurate approximation---and hence the size of the optimization problem when using prior methods---increases with the variance in the data, and the search space of the optimization problem grows exponentially with the number of tuples in the relation. Existing solvers take hours to process SPQs on large relations containing stochastic attributes with high variance. Besides enriching the SPaQL language to capture a broader class of risk specifications, we make two fundamental contributions toward scalable SPQ processing. First, we propose risk-constraint linearization (RCL), which converts SPQs into Integer Linear Programs (ILPs) whose size is independent of the number of scenarios used. Solving these ILPs gives us feasible and near-optimal packages. Second, we propose Stochastic Sketch Refine, a divide-and-conquer framework that breaks down a large stochastic optimization problem into subproblems involving smaller subsets of tuples. Our experiments show that, together, RCL and Stochastic Sketch Refine produce high-quality packages with runtimes orders of magnitude lower than the state of the art.
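As a rough illustration of why problem size grows with the number of scenarios (our notation, not the paper's): a sample-average formulation over M Monte Carlo scenarios replicates the stochastic attribute once per scenario, so the ILP that RCL replaces looks roughly like the following.

```latex
% Sample-average SPQ approximation over M scenarios; c_i^{(m)} is the realized
% value of tuple i's stochastic attribute in scenario m, x_i the multiplicity.
\begin{align*}
  \text{maximize}   \quad & \frac{1}{M} \sum_{m=1}^{M} \sum_{i=1}^{n} c_i^{(m)}\, x_i \\
  \text{subject to} \quad & \text{risk constraints checked in each scenario } m,
  \qquad x_i \in \mathbb{Z}_{\ge 0},
\end{align*}
% whereas RCL produces an ILP whose size is independent of M.
```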
  3. Estimating the output size of a query is a fundamental yet longstanding problem in database query processing. Traditional cardinality estimators used by database systems can routinely underestimate the true output size by orders of magnitude, which leads to significant system performance penalties. Recently, upper bounds have been proposed that are based on information inequalities and incorporate the sizes and max-degrees of input relations, yet their main benefit is limited to cyclic queries, because they degenerate to rather trivial formulas on acyclic queries. We introduce a significant extension of these upper bounds by incorporating ℓp-norms of the degree sequences of join attributes. Our bounds are significantly lower than previously known bounds, even when applied to acyclic queries. These bounds are also based on information theory; they come with a matching query evaluation algorithm, are computable in exponential time in the query size, and are provably tight when all degrees are "simple".
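For concreteness, the ℓp-norm of a degree sequence used in these bounds is the quantity below; the p = 1 and p = ∞ cases recover the relation sizes and max-degrees that earlier bounds rely on (notation is ours).

```latex
% deg_R(a) = number of tuples in R whose join attribute A equals a.
\lVert \deg_R(A) \rVert_p = \Big( \sum_{a} \deg_R(a)^{\,p} \Big)^{1/p},
\qquad
\lVert \deg_R(A) \rVert_1 = |R|,
\qquad
\lVert \deg_R(A) \rVert_\infty = \max_a \deg_R(a).
```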
  4. Given a database of vectors, a cosine threshold query returns all vectors in the database having cosine similarity to a query vector above a given threshold θ. These queries arise naturally in many applications, such as document retrieval, image search, and mass spectrometry. The present paper considers the efficient evaluation of such queries, providing novel optimality guarantees and exhibiting good performance on real datasets. We take as a starting point Fagin's well-known Threshold Algorithm (TA), which can be used to answer cosine threshold queries as follows: an inverted index is first built from the database vectors during pre-processing; at query time, the algorithm traverses the index partially to gather a set of candidate vectors to be later verified for θ-similarity. However, directly applying TA in its raw form misses significant optimization opportunities. Indeed, we first show that one can take advantage of the fact that the vectors can be assumed to be normalized to obtain an improved, tight stopping condition for index traversal and to compute it efficiently and incrementally. Then we show that one can take advantage of data skewness to obtain better traversal strategies. In particular, we show a novel traversal strategy that exploits a common data skewness condition which holds in multiple domains, including mass spectrometry, documents, and image databases. We show that under the skewness assumption, the new traversal strategy has a strong, near-optimal performance guarantee. The techniques developed in the paper are quite general, since they can be applied to a large class of similarity functions beyond cosine.
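A minimal sketch of the normalization fact the paper exploits (not its optimized TA-style index traversal): once vectors are L2-normalized, cosine similarity reduces to a dot product, so a cosine threshold query becomes a dot-product threshold scan. Function names here are illustrative.

```python
# Sketch: cosine threshold query via normalization (brute force, for clarity;
# the paper instead traverses an inverted index with a tight stopping condition).
import numpy as np

def normalize(vectors: np.ndarray) -> np.ndarray:
    """L2-normalize each row so cosine similarity equals a dot product."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.clip(norms, 1e-12, None)

def cosine_threshold_query(db: np.ndarray, q: np.ndarray, theta: float) -> np.ndarray:
    """Return indices of database vectors with cosine similarity to q above theta."""
    db_n = normalize(db)
    q_n = q / max(np.linalg.norm(q), 1e-12)
    sims = db_n @ q_n                    # dot products == cosine similarities
    return np.flatnonzero(sims > theta)

# Usage: three 4-dimensional vectors, threshold 0.9; matches indices 0 and 1.
db = np.array([[1.0, 0.0, 0.0, 0.0],
               [0.9, 0.1, 0.0, 0.0],
               [0.0, 1.0, 0.0, 0.0]])
print(cosine_threshold_query(db, np.array([1.0, 0.0, 0.0, 0.0]), 0.9))
```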
  5. Data summarization is a powerful approach to dealing with large-scale data analytics, with wide applications in web search, recommendation systems, approximate query processing, etc. It computes a small, compact summary that preserves vital properties of the original data. In this paper, we study the data summarization problem for conjunctive query results, i.e., computing a size-k subset of a conjunctive query output, for any given k > 0, that optimizes a certain objective. More specifically, we are interested in two commonly studied objectives: cohesion, which measures the maximum distance between a tuple in the query result and its closest tuple in the summary (k-center clustering); and diversity, which measures the pairwise distances between the summary items. A simple approach that computes the entire query output and then applies existing algorithms on top of the materialized tuples suffers from high computational complexity, because the query output can be large: for a relational database of N tuples, the number of result tuples can be N^{O(1)}. We propose O(1)-approximation algorithms that compute well-representative summaries of size k in time O(N · k^{O(1)}), or even O(N + k^{O(1)}) in some cases, without computing all result tuples. We also propose the first efficient (2+ε)-approximation algorithm for the k-center clustering problem over relational data. Our main idea is to formulate a few oracles that enable us to access specific query result tuples with certain properties, to show how these oracles can be implemented efficiently, and to compute the desired summaries with few invocations of these oracles.
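For reference, the classic greedy farthest-point heuristic is the textbook 2-approximation for k-center when all points are materialized; the paper's contribution is achieving comparable guarantees without materializing the full query output. The sketch below shows the textbook algorithm, not the oracle-based one.

```python
# Gonzalez's farthest-point greedy: a textbook 2-approximation for k-center
# over materialized points (the paper avoids materializing all result tuples).
import numpy as np

def greedy_k_center(points: np.ndarray, k: int) -> list[int]:
    """Pick k center indices; each new center is the point farthest from the
    centers chosen so far, which bounds cohesion within 2x of optimal."""
    centers = [0]                                   # arbitrary first center
    dist = np.linalg.norm(points - points[0], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(dist))                  # farthest remaining point
        centers.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(points - points[nxt], axis=1))
    return centers

# Usage: 6 points forming two clusters in the plane, k = 2.
pts = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]], dtype=float)
print(greedy_k_center(pts, 2))   # [0, 4]: one center per cluster
```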