Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
Analyzing database access logs is a key part of performance tuning, intrusion detection, benchmark development, and many other database administration tasks. Unfortunately, it is common for production databases to deal with millions or more queries each day, so these logs must be summarized before they can be used. Designing an appropriate summary encoding requires trading off between conciseness and information content. For example: simple workload sampling may miss rare, but high impact queries. In this paper, we present LogR, a lossy log compression scheme suitable for use in many automated log analytics tools, as well as for human inspection. We formalize and analyze the space/fidelity trade-off in the context of a broader family of “pattern” and “pattern mixture” log encodings to which LogR belongs. We show through a series of experiments that LogR compressed encodings can be created efficiently, come with provable information-theoretic bounds on their accuracy, and outperform state-of-art log summarization strategies.more » « less
-
Database access logs are the starting point for many forms of database administration, from database performance tuning, to security auditing, to benchmark design, and many more. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the set of queries found therein. Clustering is a natural first step towards understanding the massive query logs. However, many clustering methods rely on the notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema is unavailable. We investigate the problem of computing similarity between queries, relying only on the query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature applied to query clustering on multiple query log datasets, representing different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy, using classical query rewrites to standardize query structure. The proposed strategy results in a significant improvement in the performance of all three similarity heuristics.more » « less
-
One of the primary challenges in graphical models is inference, or re-constructing a marginal probability from the graphical model’s factorized representation. While tractable for some graphs, the cost of inference grows exponentially with the graphical model’s complexity, necessitating approximation for more complex graphs. For interactive applications, latency is the dominant concern, making approximate inference the only feasible option. Unfortunately, approximate inference can be wasteful for interactive applications, as exact inference can still converge faster, even for moderately complex inference problems. In this paper, we propose a new family of convergent inference algorithms (CIAs) that bridge the gap between approximations and exact solutions, providing early, incrementally improving approximations that become exact after a finite period of time. We describe two specific CIAs based on a cryptographic technique called linear congruential generators, including a novel incremental join algorithm for dense relations called Leaky Joins. We conclude with experiments that demonstrate the utility of Leaky Joins for convergent inference: On both synthetic and real-world probabilistic graphical models, Leaky Joins converge to exact marginal probabilities almost as fast as state of the art exact inference algorithms, while simultaneously achieving approximations that are almost as good as state of the art approximation algorithms.more » « less
-
Tuple-independent probabilistic databases (TI-PDBs) han- dle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database de- rives the marginal probabilities of each output-tuple, assum- ing input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results . Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a general- ization of TI-PDBs designed to support both (i) belief updat- ing and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and pa- rameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of perform- ing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algo- rithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algo- rithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.more » « less
-
Insider threats to databases in the financial sector have become a very serious and pervasive security problem. This paper proposes a framework to analyze access patterns to databases by clustering SQL queries issued to the database. Our system Ettu works by grouping queries with other similarly structured queries. The small number of intent groups that result can then be efficiently labeled by human operators. We show how our system is designed and how the components of the system work. Our preliminary results show that our system accurately models user intent.more » « less
An official website of the United States government
