Database access logs are the starting point for many forms of database administration, from performance tuning, to security auditing, to benchmark design, and more. Unfortunately, query logs are also large and unwieldy, and it can be difficult for an analyst to extract broad patterns from the set of queries found therein. Clustering is a natural first step towards understanding these massive query logs. However, many clustering methods rely on a notion of pairwise similarity, which is challenging to compute for SQL queries, especially when the underlying data and database schema are unavailable. We investigate the problem of computing similarity between queries, relying only on query structure. We conduct a rigorous evaluation of three query similarity heuristics proposed in the literature, applied to query clustering on multiple query log datasets representing different types of query workloads. To improve the accuracy of the three heuristics, we propose a generic feature engineering strategy that uses classical query rewrites to standardize query structure. The proposed strategy yields a significant improvement in the performance of all three similarity heuristics.
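One way to picture structure-only query similarity is to reduce each query to a bag of structural tokens (keywords, table and column names) with literals masked out, then compare token sets. This is a minimal illustrative sketch, not the three heuristics evaluated in the abstract; the tokenizer and the choice of Jaccard similarity are assumptions for the example.

```python
# Hypothetical sketch: structure-only similarity between SQL queries.
# Literals are masked so that queries differing only in constants
# compare as structurally identical.
import re

def structural_tokens(sql: str) -> set[str]:
    """Lowercase the query, mask literals, keep word-like tokens."""
    sql = re.sub(r"'[^']*'", "?", sql)   # mask string literals
    sql = re.sub(r"\b\d+\b", "?", sql)   # mask numeric literals
    return set(re.findall(r"[a-z_][a-z0-9_]*", sql.lower()))

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (1.0 = identical structure)."""
    return len(a & b) / len(a | b) if a | b else 1.0

q1 = "SELECT name FROM users WHERE age > 30"
q2 = "SELECT name FROM users WHERE age > 18"
q3 = "DELETE FROM orders WHERE id = 7"

print(jaccard(structural_tokens(q1), structural_tokens(q2)))  # same structure
print(jaccard(structural_tokens(q1), structural_tokens(q3)))  # different structure
```

A pairwise matrix of such scores is exactly the input that distance-based clustering methods (e.g., hierarchical clustering) expect.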
This content will become publicly available on May 12, 2026
Evaluating Molecular Similarity Measures: Do Similarity Measures Reflect Electronic Structure Properties?
- Award ID(s): 2019574
- PAR ID: 10651831
- Publisher / Repository: American Chemical Society
- Date Published:
- Journal Name: Journal of Chemical Information and Modeling
- Volume: 65
- Issue: 9
- ISSN: 1549-9596
- Page Range / eLocation ID: 4311 to 4319
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Denison, S.; Mack, M.; Xu, Y.; Armstrong, B.C. (Ed.) Do people perceive shapes to be similar based purely on their physical features? Or is visual similarity influenced by top-down knowledge? In the present studies, we demonstrate that top-down information – in the form of verbal labels that people associate with visual stimuli – predicts visual similarity as measured using subjective (Experiment 1) and objective (Experiment 2) tasks. In Experiment 1, shapes that were previously calibrated to be (putatively) perceptually equidistant were more likely to be grouped together if they shared a name. In Experiment 2, more nameable shapes were easier for participants to discriminate from other images, again controlling for their perceptual distance. We discuss what these results mean for constructing visual stimuli spaces that are perceptually uniform and discuss theoretical implications of the fact that perceptual similarity is sensitive to top-down information such as the ease with which an object can be named.
-
The problem of comparing database instances with incompleteness is prevalent in applications such as analyzing how a dataset has evolved over time (e.g., data versioning), evaluating data cleaning solutions (e.g., compare an instance produced by a data repair algorithm against a gold standard), or comparing solutions generated by data exchange systems (e.g., universal vs core solutions). In this work, we propose a framework for computing similarity of instances with labeled nulls, even of those without primary keys. As a side-effect, the similarity score computation returns a mapping between the instances' tuples, which explains the score. We demonstrate that computing the similarity of two incomplete instances is NP-hard in the instance size in general. To be able to compare instances of realistic size, we present an approximate PTIME algorithm for instance comparison. Experimental results demonstrate that the approximate algorithm is up to three orders of magnitude faster than an exact algorithm for the computation of the similarity score, while the difference between approximate and exact scores is always smaller than 1%.
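The flavor of an approximate PTIME comparison can be sketched as a greedy one-to-one matching of tuples, where a labeled null is treated as a wildcard. This is an invented illustration of the general idea, not the paper's algorithm; the scoring function and greedy strategy are assumptions.

```python
# Illustrative sketch: greedy PTIME approximation of instance similarity.
# Tuples are dicts; None stands in for a labeled null and may match
# any value. The returned mapping explains the score.
def tuple_score(t1: dict, t2: dict) -> float:
    """Fraction of attributes that agree, treating None as a wildcard."""
    keys = set(t1) | set(t2)
    hits = sum(1 for k in keys
               if t1.get(k) is None or t2.get(k) is None
               or t1.get(k) == t2.get(k))
    return hits / len(keys) if keys else 1.0

def instance_similarity(inst1: list[dict], inst2: list[dict]):
    """Greedily match tuple pairs by descending score; return (score, mapping)."""
    pairs = sorted(((tuple_score(a, b), i, j)
                    for i, a in enumerate(inst1)
                    for j, b in enumerate(inst2)), reverse=True)
    used1, used2, mapping, total = set(), set(), {}, 0.0
    for s, i, j in pairs:
        if i not in used1 and j not in used2:
            used1.add(i); used2.add(j)
            mapping[i] = j
            total += s
    denom = max(len(inst1), len(inst2))
    return (total / denom if denom else 1.0), mapping

a = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": None}]
b = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Bergen"}]
score, mapping = instance_similarity(a, b)
```

An exact comparison would instead search over all tuple matchings (e.g., via optimal assignment), which is where the NP-hardness in the general setting comes from.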
-
Unary computing is a relatively new method for implementing non-linear functions using few hardware resources compared to binary computing. In its original form, unary computing provides no trade-off between accuracy and hardware cost. In this work, we propose a novel self-similarity-based method to optimize the previous hybrid binary-unary method and provide it with the trade-off between accuracy and hardware cost by introducing controlled levels of approximation. Given a target maximum error, our method breaks a function into sub-functions and tries to find the minimum set of unique sub-functions that can derive all the other ones through trivial bit-wise transformations. We compare our method to previous works such as HBU (hybrid binary-unary) and FloPoCo-PPA (piece-wise polynomial approximation) on a number of non-linear functions including Log, Exp, Sigmoid, GELU, Sin, and Sqr, which are used in neural networks and image processing applications. Without any loss of accuracy, our method can improve the area-delay-product hardware cost of HBU on average by 7% at 8-bit, 20% at 10-bit, and 35% at 12-bit resolutions. Given the approximation of the least significant bit, our method reduces the hardware cost of HBU on average by 21% at 8-bit, 49% at 10-bit, and 60% at 12-bit resolutions, and using the same error budget as given to FloPoCo-PPA, it reduces the hardware cost of FloPoCo-PPA on average by 79% at 8-bit, 58% at 10-bit, and 9% at 12-bit resolutions. We finally show the benefits of our method by implementing a 10-bit homomorphic filter, which is used in image processing applications. Our method can implement the filter with no quality loss at lower hardware cost than what the previous approximate and exact methods can achieve.
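The self-similarity idea can be illustrated in software: split a function's 8-bit lookup table into fixed-size segments and count how many unique segments remain when one segment may be derived from another by a trivial transform, within a maximum error. The transform set here (identity, 8-bit NOT, reversal) and the segment size are assumptions for the example, not the paper's actual hardware mapping.

```python
# Hedged sketch of self-similarity-based deduplication of sub-functions.
def segments(table, size):
    """Split a lookup table into consecutive fixed-size segments."""
    return [table[i:i + size] for i in range(0, len(table), size)]

def close(a, b, max_err):
    """True if the two segments agree pointwise within max_err."""
    return all(abs(x - y) <= max_err for x, y in zip(a, b))

def derivable(seg, base, max_err):
    """Can seg be produced from base by identity, 8-bit NOT, or reversal?"""
    candidates = [base, [255 - x for x in base], list(reversed(base))]
    return any(close(seg, c, max_err) for c in candidates)

def unique_segments(table, size, max_err):
    """Greedily keep only segments not derivable from earlier uniques."""
    uniques = []
    for seg in segments(table, size):
        if not any(derivable(seg, u, max_err) for u in uniques):
            uniques.append(seg)
    return uniques

# Toy 8-bit "function": a ramp whose second half mirrors the first,
# so every second-half segment is the reversal of a first-half one.
table = list(range(0, 128)) + list(range(127, -1, -1))
print(len(segments(table, 32)), "segments,",
      len(unique_segments(table, 32, max_err=0)), "unique")
```

In hardware, each unique segment corresponds to one stored sub-function, and the derivable segments are reconstructed from it with near-free bit-wise logic; raising `max_err` trades accuracy for fewer uniques.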
-
The development of example-based design support tools, such as those used for design-by-analogy, relies heavily on the computation of similarity between designs. Various vector- and graph-based similarity measures operationalize different principles to assess the similarity of designs. Despite the availability of various types of similarity measures and the widespread adoption of some, these measures have not been tested for cross-measure agreement, especially in a design context. In this paper, several vector- and graph-based similarity measures are tested across two datasets of functional models of products to explore the ways in which they find functionally similar designs. The results show that the network-based measures fundamentally operationalize functional similarity in a different way than vector-based measures. Based upon the findings, we recommend a graph-based similarity measure such as NetSimile in the early stages of design when divergence is desirable and a vector-based measure such as cosine similarity in a period of convergence, when the scope of the desired function implementation is clearer.
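The vector-based side of this comparison is easy to make concrete: represent each design as a feature-count vector and compare with cosine similarity. The function vocabulary and counts below are invented for illustration.

```python
# Cosine similarity between bag-of-functions vectors for two designs.
import math

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity: dot(u, v) / (|u| * |v|); 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Counts of functions ("convert", "transfer", "store") per design.
design_a = [2.0, 1.0, 0.0]
design_b = [1.0, 1.0, 1.0]
print(round(cosine(design_a, design_b), 3))
```

A graph-based measure such as NetSimile would instead compare structural features of the functional-model graphs (degree distributions, neighborhood statistics), which is why the two families can disagree on which designs are "similar".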
