-
The rate at which humanity is producing data has increased significantly over the last decade. As organizations generate unprecedented amounts of data, storing, cleaning, integrating, and analyzing this data consumes significant (human and computational) resources. At the same time, organizations extract significant value from their data. In this work, we present our vision for developing an objective metric for the value of data based on the recently introduced concept of data relevance, outline proposals for how to efficiently compute and maintain such metrics, and describe how to utilize data value to improve data management, including storage organization, query performance, intelligent allocation of data collection and curation efforts, improving data catalogs, and making pricing decisions in data markets. While we mostly focus on tabular data, the concepts we introduce can also be applied to other data models such as semi-structured data (e.g., JSON) or property graphs. Furthermore, we discuss strategies for dealing with data and workloads that evolve, and how to handle data that is currently not relevant but has potential value (we refer to this as dark data). Finally, we sketch ideas for measuring the value that a query or workload has for an organization and reason about the interaction between query and data value.
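To make the relevance-based notion of data value more concrete, the following is a minimal sketch that scores each tuple of a table by the fraction of a query workload that selects it. The predicate-based workload model, the scoring rule, and all identifiers are hypothetical simplifications for illustration only, not the metric proposed in the paper.

```python
# Hypothetical sketch: score tuples by how often a workload's queries access them.
# The predicate-based workload model and the coverage-style scoring rule are
# illustrative simplifications, not the metric defined in the paper.
from typing import Callable, Dict, List

Row = Dict[str, object]
Predicate = Callable[[Row], bool]

def relevance_scores(table: List[Row], workload: List[Predicate]) -> List[float]:
    """For each tuple, return the fraction of workload queries that select it."""
    scores = []
    for row in table:
        hits = sum(1 for q in workload if q(row))
        scores.append(hits / len(workload))
    return scores

orders = [
    {"id": 1, "region": "EU", "total": 120.0},
    {"id": 2, "region": "US", "total": 40.0},
    {"id": 3, "region": "EU", "total": 5.0},
]
# Two toy "queries", modeled as selection predicates.
workload = [
    lambda r: r["region"] == "EU",        # regional report
    lambda r: float(r["total"]) > 100.0,  # high-value orders
]

print(relevance_scores(orders, workload))  # [1.0, 0.0, 0.5]
```

Under such a scheme, tuples with a score of zero for the current workload would correspond to the dark data discussed above: irrelevant today, but potentially valuable under future workloads.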
-
We present an overview of GProM, a generic provenance middleware for relational databases. The system supports diverse provenance and annotation management tasks through query instrumentation, i.e., compiling a declarative frontend language with provenance-specific features into the query language of a backend database system. In addition to introducing GProM, we also discuss research contributions related to GProM, including the first provenance model and capture mechanism for transaction provenance, a unified framework for answering why- and why-not provenance questions, and provenance-aware query optimization. Furthermore, by means of the example of post-mortem debugging of transactions, we demonstrate how GProM makes novel applications of provenance possible.
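The instrumented queries GProM produces are SQL executed by the backend DBMS; the toy evaluator below only illustrates the underlying idea such instrumentation implements, namely why-provenance propagation through relational operators, where each tuple carries the set of input tuples it was derived from. All names here are hypothetical and nothing below reflects GProM's actual architecture.

```python
# Illustrative sketch of why-provenance propagation through relational operators.
# GProM itself compiles provenance requests into SQL run by the backend DBMS;
# this toy evaluator only demonstrates the annotation-propagation idea.
from typing import Callable, Dict, FrozenSet, List, Tuple

Row = Dict[str, object]
# Each tuple carries a set of input-tuple identifiers as its provenance.
Annotated = Tuple[Row, FrozenSet[str]]

def scan(table: List[Row], name: str) -> List[Annotated]:
    """Annotate each base tuple with its own identifier."""
    return [(row, frozenset({f"{name}:{i}"})) for i, row in enumerate(table)]

def select(rel: List[Annotated], pred: Callable[[Row], bool]) -> List[Annotated]:
    """Selection keeps the annotations of surviving tuples unchanged."""
    return [(row, prov) for row, prov in rel if pred(row)]

def join(left: List[Annotated], right: List[Annotated], attr: str) -> List[Annotated]:
    """A joined tuple's provenance is the union of both inputs' annotations."""
    out = []
    for lrow, lprov in left:
        for rrow, rprov in right:
            if lrow[attr] == rrow[attr]:
                out.append(({**lrow, **rrow}, lprov | rprov))
    return out

emp = scan([{"dept": "R&D", "name": "Ada"}], "emp")
dept = scan([{"dept": "R&D", "city": "Chicago"}], "dept")
for row, prov in select(join(emp, dept, "dept"), lambda r: r["city"] == "Chicago"):
    print(row, "derived from", sorted(prov))
# {'dept': 'R&D', 'name': 'Ada', 'city': 'Chicago'} derived from ['dept:0', 'emp:0']
```

Query instrumentation achieves the same effect declaratively: instead of evaluating operators itself, the middleware rewrites the query so that the backend computes these provenance annotations as extra columns.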
-
The rigid schemas of classical relational databases help users in specifying queries and inform the storage organization of data. However, the advantages of schemas come at a high upfront cost through schema and ETL process design. In this work, we propose a new paradigm where the database system takes a more active role in schema development and data integration. We refer to this approach as adaptive schema databases (ASDs). An ASD ingests semi-structured or unstructured data directly using a pluggable combination of extraction and data integration techniques. Over time, it discovers and adapts schemas for the ingested data using information provided by data integration and information extraction techniques, as well as from queries and user feedback. In contrast to relational databases, ASDs maintain multiple schema workspaces that represent individualized views over the data, which are fine-tuned to the needs of a particular user or group of users. A novel aspect of ASDs is that probabilistic database techniques are used to encode ambiguity in automatically generated data extraction workflows and in generated schemas. ASDs can provide users with context-dependent feedback on the quality of a schema, both in terms of its ability to satisfy a user's queries and in terms of the quality of the resulting answers. We outline our vision for ASDs, and present a proof-of-concept implementation as part of the Mimir probabilistic data curation system.
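As a rough illustration of encoding schema ambiguity probabilistically, the sketch below derives candidate schemas from the attribute sets observed in semi-structured records and assigns each a confidence based on how many records it covers. The coverage-based confidence score and all identifiers are hypothetical stand-ins; the probabilistic machinery in ASDs and Mimir is considerably richer.

```python
# Hypothetical sketch: derive candidate schemas from semi-structured records and
# track how well each one covers the data. The coverage-based confidence score
# is an illustrative stand-in for the probabilistic techniques used in ASDs.
from collections import Counter
from typing import Dict, List

def candidate_schemas(records: List[Dict[str, object]]) -> List[Dict]:
    """Treat each distinct attribute set in the data as a candidate schema,
    with confidence proportional to how many records it covers exactly."""
    counts = Counter(frozenset(r.keys()) for r in records)
    total = sum(counts.values())
    return [
        {"attributes": sorted(attrs), "confidence": n / total}
        for attrs, n in counts.most_common()
    ]

records = [
    {"name": "Ada", "email": "ada@example.org"},
    {"name": "Alan", "email": "alan@example.org"},
    {"name": "Grace", "email": "grace@example.org", "phone": "555-0100"},
]

for schema in candidate_schemas(records):
    print(schema)
# {'attributes': ['email', 'name'], 'confidence': 0.667} (approx.)
# {'attributes': ['email', 'name', 'phone'], 'confidence': 0.333} (approx.)
```

A schema workspace could then expose the highest-confidence candidate as a relational view while flagging answers that depend on low-confidence attributes, which mirrors the context-dependent quality feedback described in the abstract.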