Title: Towards an Objective Metric for Data Value Through Relevance
The rate at which humanity is producing data has increased significantly over the last decade. As organizations generate unprecedented amounts of data, storing, cleaning, integrating, and analyzing this data consumes significant (human and computational) resources. At the same time, organizations extract significant value from their data. In this work, we present our vision for developing an objective metric for the value of data based on the recently introduced concept of data relevance, outline proposals for how to efficiently compute and maintain such metrics, and describe how to utilize data value to improve data management, including storage organization, query performance, intelligent allocation of data collection and curation efforts, improved data catalogs, and pricing decisions in data markets. While we mostly focus on tabular data, the concepts we introduce can also be applied to other data models such as semi-structured data (e.g., JSON) or property graphs. Furthermore, we discuss strategies for dealing with data and workloads that evolve, and discuss how to deal with data that is currently not relevant but has potential value (we refer to this as dark data). Finally, we sketch ideas for measuring the value that a query/workload has for an organization and reason about the interaction between query and data value.
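The abstract above leaves the metric itself open. Purely as an illustrative sketch (not the authors' definition), one could score each tuple by the total value of the workload queries that actually access it; all names below (`relevance_based_value`, `workload`, the tuple identifiers) are hypothetical.

```python
from collections import defaultdict

def relevance_based_value(workload):
    """Illustrative only: score each tuple by the total value of the
    queries that accessed it, normalized by the workload's total value.

    `workload` is a list of (query_value, tuples_accessed) pairs, where
    query_value is a number assigned to the query (e.g., its business
    importance) and tuples_accessed lists the tuple identifiers the
    query actually used (its "relevant" inputs)."""
    scores = defaultdict(float)
    total = sum(value for value, _ in workload) or 1.0
    for value, tuples_accessed in workload:
        for tid in set(tuples_accessed):
            scores[tid] += value
    return {tid: s / total for tid, s in scores.items()}

# Example: tuple "t1" is used by both queries, "t3" only by the low-value one.
workload = [(10.0, ["t1", "t2"]), (2.0, ["t1", "t3"])]
print(relevance_based_value(workload))  # {'t1': 1.0, 't2': ~0.83, 't3': ~0.17}
```

Tuples never touched by any query would receive no score under such a scheme, which is where the abstract's notion of "dark data" with potential but unrealized value comes in.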
Individuals and organizations are accumulating data at an unprecedented rate owing to the advent of inexpensive cloud computing. Data owners are increasingly turning to secure and privacy-preserving collaborative analytics to maximize the value of their records. In this paper, we will survey the state-of-the-art of this growing area. We will describe how researchers are bringing security and privacy-enhancing technologies, such as differential privacy, secure multiparty computation, and zero-knowledge proofs, into the query lifecycle. We also touch upon some of the challenges and opportunities associated with deploying these technologies in the field.
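As a minimal, concrete example of one of the technologies mentioned in the survey abstract above, the sketch below answers a count query with the Laplace mechanism for differential privacy; the table, column names, and epsilon value are made up for illustration.

```python
import sqlite3
import numpy as np

def dp_count(conn, count_sql, epsilon=1.0):
    """Return a count satisfying epsilon-differential privacy via the
    Laplace mechanism. A single COUNT(*) has sensitivity 1 (adding or
    removing one row changes it by at most 1), so noise is drawn from
    Laplace(0, 1/epsilon)."""
    (true_count,) = conn.execute(count_sql).fetchone()
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# Hypothetical setup: a toy patients table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (id INTEGER, diagnosis TEXT)")
conn.executemany("INSERT INTO patients VALUES (?, ?)",
                 [(i, "flu" if i % 3 else "asthma") for i in range(100)])

noisy = dp_count(conn, "SELECT COUNT(*) FROM patients WHERE diagnosis = 'flu'")
print(round(noisy, 1))  # close to the true count of 66, but perturbed
```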
Meneghetti, Niccolò; Kennedy, Oliver; Gatterbauer, Wolfgang
(Proceedings of the 2017 ACM International Conference on Management of Data)
Tuple-independent probabilistic databases (TI-PDBs) handle uncertainty by annotating each tuple with a probability parameter; when the user submits a query, the database derives the marginal probabilities of each output-tuple, assuming input-tuples are statistically independent. While query processing in TI-PDBs has been studied extensively, limited research has been dedicated to the problems of updating or deriving the parameters from observations of query results. Addressing this problem is the main focus of this paper. We introduce Beta Probabilistic Databases (B-PDBs), a generalization of TI-PDBs designed to support both (i) belief updating and (ii) parameter learning in a principled and scalable way. The key idea of B-PDBs is to treat each parameter as a latent, Beta-distributed random variable. We show how this simple expedient enables both belief updating and parameter learning in a principled way, without imposing any burden on regular query processing. We use this model to provide the following key contributions: (i) we show how to scalably compute the posterior densities of the parameters given new evidence; (ii) we study the complexity of performing Bayesian belief updates, devising efficient algorithms for tractable classes of queries; (iii) we propose a soft-EM algorithm for computing maximum-likelihood estimates of the parameters; (iv) we show how to embed the proposed algorithms into a standard relational engine; (v) we support our conclusions with extensive experimental results.
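The statistical core of this idea can be shown in a few lines: if a tuple's probability p has a Beta(alpha, beta) prior and we observe Bernoulli evidence about the tuple's presence, the posterior stays Beta with simple count updates. The sketch below illustrates only that conjugacy, not the B-PDB algorithms themselves, whose handling of evidence from query results is considerably more involved.

```python
from dataclasses import dataclass

@dataclass
class BetaParam:
    """Latent tuple probability p ~ Beta(alpha, beta), as in a B-PDB."""
    alpha: float = 1.0   # pseudo-count of "tuple present" observations
    beta: float = 1.0    # pseudo-count of "tuple absent" observations

    def mean(self) -> float:
        # Marginal probability used when answering queries.
        return self.alpha / (self.alpha + self.beta)

    def update(self, present: bool) -> None:
        # Conjugate Bayesian update from one Bernoulli observation.
        if present:
            self.alpha += 1.0
        else:
            self.beta += 1.0

# A tuple initially assumed 50/50; three observations say it exists, one says not.
p = BetaParam()
for obs in (True, True, False, True):
    p.update(obs)
print(p.mean())   # 4/6 ~= 0.667
```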
Wagner, James; Rasin, Alexander; Malik, Tanu; Heart, Karen; Jehle, Hugo; Grier, Jonathan
(CIDR 2017, 8th Biennial Conference on Innovative Data Systems Research)
The increasing use of databases for the storage of critical and sensitive information in many organizations has led to an increase in the rate at which databases are exploited in computer crimes. While there are several techniques and tools available for database forensics, they mostly assume a priori database preparation, such as relying on tamper-detection software to be in place or the use of detailed logging. Investigators, alternatively, need forensic tools and techniques that work on poorly configured databases and make no assumptions about the extent of damage in a database. In this paper, we present DBCarver, a tool for reconstructing database content from a database image without using any log or system metadata. The tool uses page carving to reconstruct both queryable data and non-queryable (deleted) data. We describe how the two kinds of data can be combined to enable a variety of forensic analysis questions hitherto unavailable to forensic investigators. We show the generality and efficiency of our tool across several databases through a set of robust experiments.
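To make the page-carving idea concrete, the sketch below scans a raw disk or memory image in fixed-size chunks and flags candidate database pages by a header signature. The page size, the signature bytes, and the `parse_records` helper are placeholders; real carving (including DBCarver's) relies on engine-specific page layouts and far richer checks.

```python
# Minimal sketch of page carving over a raw image; all constants are
# hypothetical placeholders, not the layout of any particular engine.
PAGE_SIZE = 8192
PAGE_SIGNATURE = b"\x0d\x00"   # made-up marker for a "table page" header

def carve_pages(image_path):
    """Yield (offset, raw_page) for every chunk that looks like a DB page,
    regardless of whether the page is still referenced by the catalog
    (which is how deleted, non-queryable data can be recovered)."""
    with open(image_path, "rb") as f:
        offset = 0
        while True:
            page = f.read(PAGE_SIZE)
            if len(page) < PAGE_SIZE:
                break
            if page.startswith(PAGE_SIGNATURE):
                yield offset, page
            offset += PAGE_SIZE

# Usage (hypothetical path and record parser):
# for off, page in carve_pages("disk.img"):
#     rows = parse_records(page)   # engine-specific record parsing, not shown
```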
Siedlaczek, Michal; Wang, Qi; Chen, Yen-Yu; Suel, Torsten
(2018 IEEE International Conference on Big Data)
Many content-based image search and instance retrieval systems implement bag-of-visual-words strategies for candidate selection. Visual processing of an image results in hundreds of visual words that make up a document, and these words are used to build an inverted index. Query processing then consists of an initial candidate selection phase that queries the inverted index, followed by more complex reranking of the candidates using various image features. The initial phase typically uses disjunctive top-k query processing algorithms originally proposed for searching text collections. Our objective in this paper is to optimize the performance of disjunctive top-k computation for candidate selection in content-based instance retrieval systems. While there has been extensive previous work on optimizing this phase for textual search engines, we are unaware of any published work that studies this problem for instance retrieval, where both index and query data are quite different from the distributions commonly found and exploited in the textual case. Using data from a commercial large-scale instance retrieval system, we address this challenge in three steps. First, we analyze the quantitative properties of index structures and queries in the system, and discuss how they differ from the case of text retrieval. Second, we describe an optimized term-at-a-time retrieval strategy that significantly outperforms baseline term-at-a-time and document-at-a-time strategies, achieving up to 66% speed-up over the most efficient baseline. Finally, we show that due to the different properties of the data, several common safe and unsafe early termination techniques from the literature fail to provide any significant performance benefits.
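For readers unfamiliar with the baseline being optimized, the sketch below is a plain term-at-a-time (TAAT) disjunctive top-k pass: each query term's posting list is scanned in turn, scores are summed in an accumulator table, and the k best candidates are kept for reranking. It omits the paper's optimizations and early-termination techniques, and the toy index layout is invented.

```python
import heapq
from collections import defaultdict

def taat_top_k(query_terms, index, k=10):
    """Plain term-at-a-time disjunctive top-k (no early termination).

    `index` maps a visual word to its posting list of (doc_id, weight)
    pairs; a document's score is the sum of weights over all query terms
    that hit it. This is the unoptimized baseline, not the optimized
    strategy from the paper."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in index.get(term, []):
            accumulators[doc_id] += weight
    # Keep only the k highest-scoring candidates for the reranking phase.
    return heapq.nlargest(k, accumulators.items(), key=lambda kv: kv[1])

# Toy inverted index over visual words (hypothetical data).
index = {
    "vw_17": [(1, 0.5), (2, 0.2)],
    "vw_42": [(2, 0.7), (3, 0.1)],
}
print(taat_top_k(["vw_17", "vw_42"], index, k=2))   # [(2, 0.9), (1, 0.5)]
```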
Lenard, Ben; Rasin, Alexander; Scope, Nick; Al-Johani, Thamer
(The 36th International Conference on Scientific and Statistical Database Management (SSDBM))
Most organizations rely on relational database(s) for their day-to-day business functions. Data management policies fall under the umbrella of IT Operations, dictated by a combination of internal organizational policies and government regulations. Many privacy laws (such as Europe's General Data Protection Regulation and California's Consumer Privacy Act) establish policy requirements for organizations, requiring the preservation or purging of certain customer data across their systems. Organizations' disaster recovery policies also mandate backups to prevent data loss. Thus, the data in these databases are subject to a range of policies, including data retention and data purging rules, which may come into conflict with the need for regular backups. In this paper, we discuss the trade-offs between different compliance mechanisms for maintaining IT Operational policies. We consider the practical availability of data in an active relational database and in a backup, including: 1) supporting data privacy rules with respect to preserving or purging customer data, and 2) the application performance impact caused by the database policy implementation. We first discuss the state of data privacy compliance in database systems. We then look at enforcement of common IT operational policies with regard to database backups. We consider different implementations used to enforce privacy rule compliance, together with a detailed discussion of how these approaches impact the performance of a database at different phases. We demonstrate that naive compliance implementations will incur a prohibitively high cost and impose onerous restrictions on the backup and restore process, but will not affect daily user query transaction cost. However, we also show that other solutions can achieve far lower backup and restore costs at the price of a small (<5%) overhead to non-SELECT queries.
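To ground the trade-off described above, the sketch below contrasts two hypothetical purge strategies: deleting a customer's rows only from the active database (leaving stale copies in existing backups that must be scrubbed), versus also recording the purge in a small log that is replayed whenever a backup is restored. Neither is claimed to be the paper's implementation; the tables and columns are invented.

```python
import sqlite3

def purge_now(conn, customer_id):
    """Naive purge: remove the customer's data from the active database.
    Existing backups still contain the rows, so compliance additionally
    requires rewriting or scrubbing every backup that holds them."""
    conn.execute("DELETE FROM orders WHERE customer_id = ?", (customer_id,))
    conn.commit()

def purge_with_log(conn, customer_id):
    """Backup-aware purge: delete from the active database and record the
    obligation, so restores can replay it instead of rewriting backups."""
    conn.execute("DELETE FROM orders WHERE customer_id = ?", (customer_id,))
    conn.execute("INSERT INTO purge_log(customer_id) VALUES (?)", (customer_id,))
    conn.commit()

def apply_purge_log_after_restore(conn):
    """Run after restoring any backup: re-delete everything purged since
    that backup was taken (idempotent)."""
    for (cid,) in conn.execute("SELECT customer_id FROM purge_log"):
        conn.execute("DELETE FROM orders WHERE customer_id = ?", (cid,))
    conn.commit()

# Hypothetical schema and usage.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, total REAL);
    CREATE TABLE purge_log (customer_id INTEGER);
    INSERT INTO orders VALUES (1, 7, 19.99), (2, 8, 5.00);
""")
purge_with_log(conn, 7)
apply_purge_log_after_restore(conn)   # safe to run after any restore
```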
Glavic, B., Li, P., Liu, Z., Gawlick, D., Krishnaswamy, V., Porobic, D., and Liu, Z. H. Towards an Objective Metric for Data Value Through Relevance. CIDR. Retrieved from https://par.nsf.gov/biblio/10544899.
@article{osti_10544899,
title = {Towards an Objective Metric for Data Value Through Relevance},
url = {https://par.nsf.gov/biblio/10544899},
publisher = {CIDR},
author = {Glavic, B and Li, P and Liu, Z and Gawlick, D and Krishnaswamy, V and Porobic, D and Liu, Z H},
}