


Title: Towards an Objective Metric for Data Value Through Relevance
The rate at which humanity is producing data has increased significantly over the last decade. As organizations generate unprecedented amounts of data, storing, cleaning, integrating, and analyzing this data consumes significant (human and computational) resources. At the same time, organizations extract significant value from their data. In this work, we present our vision for developing an objective metric for the value of data based on the recently introduced concept of data relevance, outline proposals for how to efficiently compute and maintain such metrics, and describe how to utilize data value to improve data management, including storage organization, query performance, intelligent allocation of data collection and curation efforts, improving data catalogs, and making pricing decisions in data markets. While we mostly focus on tabular data, the concepts we introduce can also be applied to other data models such as semi-structured data (e.g., JSON) or property graphs. Furthermore, we discuss strategies for dealing with data and workloads that evolve, and for handling data that is currently not relevant but has potential value (we refer to this as dark data). Finally, we sketch ideas for measuring the value that a query/workload has for an organization and reason about the interaction between query and data value.
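As a loose illustration of the kind of metric envisioned above, the sketch below scores each table by the weighted share of workload queries whose answers depend on it. The workload representation, the per-query weights, and the additive scoring rule are assumptions made purely for illustration; they are not the relevance-based metric proposed in the paper.

```python
from collections import defaultdict

def data_value_by_relevance(workload):
    """Toy relevance-style value score: a table's value is the weighted
    share of queries whose answers depend on it.

    `workload` is a list of (tables_accessed, weight) pairs; both the
    representation and the scoring rule are illustrative assumptions.
    """
    total_weight = sum(weight for _, weight in workload)
    value = defaultdict(float)
    for tables, weight in workload:
        for table in tables:
            value[table] += weight
    # Normalize so scores are comparable across workloads.
    return {table: v / total_weight for table, v in value.items()}

if __name__ == "__main__":
    # Hypothetical workload: (tables a query touches, importance weight).
    workload = [
        ({"orders", "customers"}, 3.0),  # frequent reporting query
        ({"orders"}, 1.0),               # ad-hoc analysis
        ({"clickstream"}, 0.5),          # rarely queried log data
    ]
    print(data_value_by_relevance(workload))
```

Under such a scheme, tables that no query ever touches score zero, which is one crude way of surfacing the dark data mentioned above.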
Award ID(s):
2420577 2420691 2107107 1956123
PAR ID:
10544899
Author(s) / Creator(s):
Publisher / Repository:
CIDR
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Xiao, Xiaokui (Ed.)
    Individuals and organizations are accumulating data at an unprecedented rate owing to the advent of inexpensive cloud computing. Data owners are increasingly turning to secure and privacy-preserving collaborative analytics to maximize the value of their records. In this paper, we will survey the state-of-the-art of this growing area. We will describe how researchers are bringing security and privacy-enhancing technologies, such as differential privacy, secure multiparty computation, and zero-knowledge proofs, into the query lifecycle. We also touch upon some of the challenges and opportunities associated with deploying these technologies in the field.
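As a minimal, generic example of one technology named in this abstract, the snippet below adds Laplace noise to a count query to satisfy differential privacy; the data, the epsilon value, and the query are invented for illustration and are not drawn from the survey.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Differentially private count via the Laplace mechanism.

    A counting query has sensitivity 1 (adding or removing one record
    changes the count by at most 1), so Laplace(1/epsilon) noise suffices.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

if __name__ == "__main__":
    # Hypothetical records held by a single data owner.
    ages = [23, 35, 41, 29, 52, 61, 38]
    print(dp_count(ages, lambda a: a >= 40, epsilon=0.5))
```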
  2. Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
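The key idea described here, treating each tuple's probability as a latent Beta-distributed parameter, leans on Beta-Bernoulli conjugacy: incorporating evidence about a tuple amounts to incrementing the Beta pseudo-counts. The toy sketch below shows that conjugate update for a single tuple directly observed to be present or absent; it is only a generic illustration, not the B-PDB algorithms, where evidence instead arrives indirectly through observed query answers.

```python
from dataclasses import dataclass

@dataclass
class BetaParam:
    """Beta(alpha, beta) belief over one tuple's probability parameter."""
    alpha: float = 1.0  # pseudo-count of "tuple present" observations
    beta: float = 1.0   # pseudo-count of "tuple absent" observations

    def update(self, present: bool) -> None:
        # Beta-Bernoulli conjugacy: an observation simply increments
        # the matching pseudo-count.
        if present:
            self.alpha += 1.0
        else:
            self.beta += 1.0

    def mean(self) -> float:
        # Posterior mean, a natural point estimate of the parameter.
        return self.alpha / (self.alpha + self.beta)

if __name__ == "__main__":
    p = BetaParam()  # uninformative Beta(1, 1) prior
    for observed_present in [True, True, False, True]:  # hypothetical evidence
        p.update(observed_present)
    print(f"posterior Beta({p.alpha}, {p.beta}), mean {p.mean():.3f}")
```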
  3. Explaining why an answer is (or is not) returned by a query is important for many applications including auditing, debugging data and queries, and answering hypothetical questions about data. In this work, we present the first practical approach for answering such questions for queries with negation (first-order queries). Specifically, we introduce a graph-based provenance model that, while syntactic in nature, supports reverse reasoning and is proven to encode a wide range of provenance models from the literature. The implementation of this model in our PUG (Provenance Unification through Graphs) system takes a provenance question and Datalog query as an input and generates a Datalog program that computes an explanation, i.e., the part of the provenance that is relevant to answer the question. Furthermore, we demonstrate how a desirable factorization of provenance can be achieved by rewriting an input query. We experimentally evaluate our approach demonstrating its efficiency. 
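To make the why/why-not question concrete, the toy sketch below evaluates a single rule with negation over a tiny database and reports which condition failed for a missing answer. The rule, the relations, and the explanation format are invented for illustration; this is not PUG's graph-based provenance model or its Datalog rewriting.

```python
# Toy rule with negation: Q(x) :- R(x), not S(x).
# Hypothetical relations, chosen only to illustrate why/why-not answers.
R = {"a", "b", "c"}
S = {"b"}

def answers():
    """Evaluate the rule: x is an answer if R(x) holds and S(x) does not."""
    return {x for x in R if x not in S}

def explain(x):
    """Explain why x is (or is not) returned by Q."""
    if x in R and x not in S:
        return f"{x} IS an answer: R({x}) holds and S({x}) does not."
    reasons = []
    if x not in R:
        reasons.append(f"R({x}) is missing")
    if x in S:
        reasons.append(f"S({x}) holds, so the negated goal fails")
    return f"{x} is NOT an answer: " + " and ".join(reasons) + "."

if __name__ == "__main__":
    print(sorted(answers()))  # ['a', 'c']
    print(explain("b"))       # not returned because S(b) holds
    print(explain("d"))       # not returned because R(d) is missing
```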
  4. Many content-based image search and instance retrieval systems implement bag-of-visual-words strategies for candidate selection. Visual processing of an image results in hundreds of visual words that make up a document, and these words are used to build an inverted index. Query processing then consists of an initial candidate selection phase that queries the inverted index, followed by more complex reranking of the candidates using various image features. The initial phase typically uses disjunctive top-k query processing algorithms originally proposed for searching text collections. Our objective in this paper is to optimize the performance of disjunctive top-k computation for candidate selection in content-based instance retrieval systems. While there has been extensive previous work on optimizing this phase for textual search engines, we are unaware of any published work that studies this problem for instance retrieval, where both index and query data are quite different from the distributions commonly found and exploited in the textual case. Using data from a commercial large-scale instance retrieval system, we address this challenge in three steps. First, we analyze the quantitative properties of index structures and queries in the system, and discuss how they differ from the case of text retrieval. Second, we describe an optimized term-at-a-time retrieval strategy that significantly outperforms baseline term-at-a-time and document-at-a-time strategies, achieving up to 66% speed-up over the most efficient baseline. Finally, we show that due to the different properties of the data, several common safe and unsafe early termination techniques from the literature fail to provide any significant performance benefits. 
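For readers unfamiliar with the baseline being optimized, the sketch below implements a plain term-at-a-time disjunctive top-k pass over an inverted index using score accumulators and a final heap selection. The index layout and the additive scoring are simplified assumptions; the sketch does not reflect the paper's optimized strategy or its early-termination analysis.

```python
import heapq
from collections import defaultdict

def term_at_a_time_topk(inverted_index, query_terms, k):
    """Baseline disjunctive top-k: process one posting list at a time,
    accumulate per-document scores, then keep the k best documents.

    `inverted_index` maps a term to a list of (doc_id, weight) postings;
    the additive scoring function is an illustrative simplification.
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in inverted_index.get(term, []):
            accumulators[doc_id] += weight
    # Keep the k documents with the highest accumulated scores.
    return heapq.nlargest(k, accumulators.items(), key=lambda item: item[1])

if __name__ == "__main__":
    # Hypothetical index over visual words (term -> postings).
    index = {
        "vw_17": [(1, 0.8), (4, 0.3)],
        "vw_42": [(1, 0.5), (2, 0.9), (4, 0.1)],
        "vw_93": [(2, 0.2), (3, 0.7)],
    }
    print(term_at_a_time_topk(index, ["vw_17", "vw_42", "vw_93"], k=2))
```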
  5. The increasing use of databases in the storage of critical and sensitive information in many organizations has led to an increase in the rate at which databases are exploited in computer crimes. While there are several techniques and tools available for database forensics, they mostly assume a priori database preparation, such as relying on tamper-detection software to be in place or use of detailed logging. Investigators, alternatively, need forensic tools and techniques that work on poorly-configured databases and make no assumptions about the extent of damage in a database. In this paper, we present DBCarver, a tool for reconstructing database content from a database image without using any log or system metadata. The tool uses page carving to reconstruct both queryable data and non-queryable data (deleted data). We describe how the two kinds of data can be combined to enable a variety of forensic analysis questions hitherto unavailable to forensic investigators. We show the generality and efficiency of our tool across several databases through a set of robust experiments.
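As a rough intuition for what page carving involves, the sketch below scans a raw byte image for a fixed page-header signature and slices out page-sized chunks for later parsing. The signature, page size, and layout are invented for illustration and bear no relation to DBCarver's actual knowledge of DBMS storage formats.

```python
# Hypothetical on-disk constants; real page layouts differ per DBMS and version.
PAGE_SIZE = 8192
PAGE_SIGNATURE = b"\xde\xad\xbe\xef"  # invented page-header marker

def carve_pages(image: bytes):
    """Yield (offset, page_bytes) for every chunk that looks like a page.

    Carving works on the raw image, so pages holding deleted rows or
    otherwise unreferenced storage can still be recovered.
    """
    offset = image.find(PAGE_SIGNATURE)
    while offset != -1:
        yield offset, image[offset:offset + PAGE_SIZE]
        offset = image.find(PAGE_SIGNATURE, offset + PAGE_SIZE)

if __name__ == "__main__":
    # Build a tiny fake "disk image": junk bytes, two pages, more junk.
    page1 = PAGE_SIGNATURE + b"row:alice;row:bob;".ljust(PAGE_SIZE - 4, b"\x00")
    page2 = PAGE_SIGNATURE + b"row:carol(deleted);".ljust(PAGE_SIZE - 4, b"\x00")
    image = b"\x11" * 100 + page1 + b"\x22" * 37 + page2
    for offset, page in carve_pages(image):
        print(offset, page[4:24])
```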