<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Towards an Objective Metric for Data Value Through Relevance</title></titleStmt>
			<publicationStmt>
				<publisher>CIDR</publisher>
				<date>04/05/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10544899</idno>
					<idno type="doi"></idno>
					
					<author>B Glavic</author><author>P Li</author><author>Z Liu</author><author>D Gawlick</author><author>V Krishnaswamy</author><author>D Porobic</author><author>Z H Liu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The rate at which humanity is producing data has increased sig- nificantly over the last decade. As organizations generate unprece- dented amounts of data, storing, cleaning, integrating, and ana- lyzing this data consumes significant (human and computational) resources. At the same time organizations extract significant value from their data. In this work, we present our vision for develop- ing an objective metric for the value of data based on the recently introduced concept of data relevance, outline proposals for how to efficiently compute and maintain such metrics, and how to utilize data value to improve data management including storage organi- zation, query performance, intelligent allocation of data collection and curation efforts, improving data catalogs, and for making pric- ing decisions in data markets. While we mostly focus on tabular data, the concepts we introduce can also be applied to other data models such as semi-structure data (e.g., JSON) or property graphs. Furthermore, we discuss strategies for dealing with data and work- loads that evolve and discuss how to deal with data that is currently not relevant, but has potential value (we refer to this as dark data). Furthermore, we sketch ideas for measuring the value that a query / workload has for an organization and reason about the interaction between query and data value.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>As organizations produce unprecedented amounts of data, managing that data and utilizing it to improve the operations of the organization has become a critical challenge. While data catalogs and metadata management systems like Apache Atlas <ref type="bibr">[1]</ref> and Goods <ref type="bibr">[15]</ref> enable organizations to track their data and its usage, these organizations still have to decide how to spend limited resources (storage, compute, human time, . . . ) most eectively on what data. For instance, data collection and curation require signicant manual labor and, thus, should be focused on data that of high importance to the operations of a business. What is missing, is an objective metric for the value that data has to an organization or, more specically, for a particular purpose within the organization (e.g., deciding what products to promote). Such a metric would allow resources to be spent where they are most needed. For example, it is sensible to spend human eort on cleaning and collecting data that is important while data that is of little value can be subjected to lossy compression, stored on slow storage media, and may not need to be curated.</p><p>In this work, we present our vision for an objective metric of data value. We argue that the only value of a piece of data is its contribution to producing the results of queries (and other computations). Data that is only stored, but not used in computations in any meaningful way, is unobservable to the user: deleting this data would not aect the results of computations. I 1. Data is typically accessed through some type of interface, e.g., a declarative query language. Thus, data only has value by contributing to a result returned through such an interface.</p><p>We base our denition of data value on the notion of data relevance that we recently introduced in <ref type="bibr">[19]</ref>. Data relevance uses provenance <ref type="bibr">[9]</ref> to track which data is needed to produce the answer of a computation. Note that this is dierent from techniques that track how "hot" data is: just because data is accessed by a query that does not mean that the data is actually relevant for producing the query result. I 2. Data access is not the same as data relevance. Only data that contributes to the result of a computation is of value to this computation.</p><p>For instance, a database system may evaluate a selection query using a full table scan as this can be more ecient than using an index, but data items not fullling the selection conditions of the query do not contribute to the query's result. E 1. Consider the webshop database and workload shown in Figure <ref type="figure">1</ref>. The workload consists of two queries: &amp; C&gt;? (executed daily) determines products that achieve signicant revenue. &amp; ?A&gt;1;4&lt; (executed once per week) identies the product with the lowest average rating of reviews from Europe that mention the word "problem". &amp; C&gt;? returns two products (with pid 1 and 3). The relevant data for this query is highlighted in light blue. The product rows ? 1 and ? 3 and all three order items for these products are needed to produce the query's result. As the reader may verify, any other product as well as all review tuples can be deleted without changing the result. The relevant data for &amp; ?A&gt;1;4&lt; for this database is the product with pid</p><p>sum ( quantity * price ) AS rev FROM products NATURAL JOIN orders GROUP BY pid HAVING sum ( quantity * price ) &gt; 1000 Q problem SELECT pid , name , avg ( rating ) AS avg_rating FROM products NATURAL JOIN reviews WHERE region = Europe AND review LIKE % problem % GROUP BY pid , name ORDER BY avg_rating ASC LIMIT 1; products pid name category price ? 1 1 Asus S10 laptop 499 ? 2 2 Dell SP4400 desktop 1230 ? 3 3 Samsung T6 ssd 199 orders oid pid quantity &gt; 1 1 1 1 &gt; 2 2 1 3 &gt; 3 2 3 10 reviews pid region rating review A 1 1 Europe 3.4 machine problems . . . A 2 1 Asia 4.0 problem with . . . A 3 3 Europe 1.3 problems after . . . A 4 3 Europe 3.8 the problem is . . . 3 and the two ratings from Europe for this product that contain the word "problem". As shown in this example, only a fraction of the data in the database is needed to produce the results of the queries for the example application and, thus, only this subset of the data is of value to the organization.</p><p>By tracking what data is needed to answer a query, relevance provides an objective framework for data value.</p><p>In Section 2 we will briey discuss how the notion of relevance can be formalized and how it relates to provenance and attribution measures for data. By aggregating the data relevance weighted by query frequency, we obtain the expected / mean relevance of data which we use as a metric of data value. Data 3 that is not relevant for any query of the workload will have a value V (3) = 0 while data that is needed for answering every query in the workload will have the maximum achievable value of V (3) = 1. Note that this denition of data value assumes that all queries are of equal importance to the organization which may not be the case. Note that our relevance-based metric does help us to identify data that is irrelevant for the current workload, but that has potential value if the workload were to be updated to use this data. We will discuss query value and potential value in Section 1.4 and section 6, respectively. E 2 (D V). Recall that &amp; C&gt;? is executed once per day and &amp; ?A&gt;1;4&lt; once per week. Thus, their relative frequencies are &amp; C&gt;? = 7  8 and &amp; ?A&gt;1;4&lt; = 1 8 . Consider computing the value V (? 3 ) of tuple ? 3 as shown below. As ? 3 is relevant for both queries, it has value 1. As another example consider tuple &gt; 1 . This tuple is only relevant for &amp; C&gt;? and, thus, its value is equal to &amp; C&gt;? 's frequency ( 7  8 ).</p><p>An important property of this metric of value is that under certain assumptions it induces a monetary metric of data value. For instance, assume that the monetary value of getting the correct result for query &amp; is $(&amp;), the value of getting an incorrect result is $0, and that missing any relevant input to a query will lead to an incorrect result. Under these assumptions, if we additionally weight each query by $(&amp;), then the value V (C) of a row C has a precise monetary interpretation: if we delete C, then we loose</p><p>Let us assume the following monetary benets associated with receiving the correct answers to &amp; C&gt;? and &amp; ?A&gt;1;4&lt; :</p><p>By weighting the contribution of each query &amp; by $(&amp;) in the computation of the value for each row, we get the following value for rows ? 3 and &gt; 1 :</p><p>Over the course of a week (7 instances of &amp; C&gt;? and 1 instance of &amp; ?A&gt;1;4&lt; ), we would loose 8 &#8226; $175 = $1400 if &gt; 1 is deleted, e.g., we decide to save storage by deleting some rows or determine that getting a cleaned version of &gt; 1 is not worth the eort.</p><p>Of course, the above example relies on several idealized assumptions that are unlikely to hold in the real world: (i) we are able to determine the monetary value of a query result and (ii) query results have zero value unless they are 100% correct. Assumption (ii) can be relaxed if the value of partially incorrect answers is known and using attribution metrics for facts like Shapley value <ref type="bibr">[3]</ref> or responsibility <ref type="bibr">[6,</ref><ref type="bibr">8]</ref> to measure the relative contribution of a row towards a query result. Nonetheless, the example illustrates the benets of basing data value on an objective metric that models relevance of rows for query results based on the well-established concepts of data provenance <ref type="bibr">[9]</ref> and/or responsibility notions. I 4. A data value metric based on relevance extends naturally to monetary metrics when the value of query results is known.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1">Utilizing Data Value</head><p>An objective metric for the data value has many potential applications in data management. It can be used to improve query performance by restricting query evaluation to data that is relevant, for informed allocation of resources for data curation, integration, and collection based on how valuable data is, to enrich data discovery and data catalogs, to make value-based data life-cycle and self-tuning decisions, and to help buyers in data markets to decide how valuable a dataset is to them. E 4 (A D V). Observe that in our running example roughly half of the data has zero value (all rows without colored background). Such data could be heavily compressed to save storage space. Using the techniques from <ref type="bibr">[19]</ref>, that we will introduce in more detail in Section 2, we can lter irrelevant data early-on during query processing (while state-of-the-art DBMS can only lter data based on selection conditions). Furthermore, we can adapt the physical design of the data to improve access to valuable data, e.g., we could partition the data to cluster data based on their value and index only partitions with high data value.</p><p>In summary, in any use case where decisions are based on how frequently data is accessed, we can improve the decision-making process by replacing data access frequency with our data value metric. For instance, self-tuning physical design techniques, e.g., automatic data partitioning <ref type="bibr">[20]</ref>, typically rely on access patterns ignoring whether the accessed data is relevant or not.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E 5 (S D LC M).</head><p>Continuing with our running example, let us assume that we want to implement an in-memory cache for data that is optimized wrt. our workload (&amp; C&gt;? and &amp; ?A&gt;1;4&lt; ). Let us assume that the DBMS uses index nested-loop joins for both queries utilizing a primary key index for table products on column pid. Query &amp; ?A&gt;1;4&lt; has to scan the full reviews table even though only rows A 3 and A 4 are needed to produce the query answer. An approach that makes caching decisions based on data access would assume that caching any review row has the same benet. However, when using provenance-based data skipping to determine what irrelevant data can be skipped, we can decide to only cache rows A 3 and A 4 as the other two rows of the review table are not needed to produce the result of &amp; ?A&gt;1;4&lt; .</p><p>While we made some simplifying assumptions in the example above, it nonetheless demonstrates how data relevance can be used to optimize self-tuning decisions like caching, index and materializeview selection, automated partitioning and many others. I 5. Many aspects of data management can be improved using an objective metric of data value. Virtually every self-tuning approach that is based on data access can make more informed decisions by utilizing data value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2">Measuring and Managing Data Value</head><p>As our data value metric relies on tracking relevance instead of tracking data access only, it is more challenging to compute as it requires capturing data provenance. Extending the idea of provenance sketches introduced in <ref type="bibr">[19]</ref>, we discuss strategies for measuring data value at coarser granularity and outline our vision for ecient management of data value information. Furthermore, we discuss the impact of such over-approximations on data value metrics and applications of data value. Computing data value for each query in a workload is often infeasible. We will discuss potential strategies for reducing the overhead by measuring data value for a carefully selected sample of queries, e.g., grouping queries based on similar data relevance patterns.</p><p>Furthermore, we need to reason about whether the relevant data for a query &amp; 1 can be utilized to answer a dierent query &amp; 2 . In <ref type="bibr">[19]</ref>, we did discuss one such technique that determines statically whether the set of data items that are relevant for &amp; 1 is sucient for answering &amp; 2 . We can exploit such information to reuse data relevance information across queries in a workload, thus, reduce the cost of managing data value. I 6. It is important to understand how the queries in a workload relate to each other with respect to what data is relevant for them. For instance, if the relevant data for one query &amp; 1 is guaranteed to be a subset of the relevant data for another query &amp; 2 , then we can answer &amp; 2 over the relevant data for &amp; 1 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3">Maintaining Data Value</head><p>Our metric for data value is specic to a particular workload and state of the database. However, data and workloads are typically not static, but evolve over time. We will discuss in Section 4.3 how (approximate) data value metrics can be maintained when the data and/or workload changes. Using standard provenance techniques <ref type="bibr">[9,</ref><ref type="bibr">19]</ref>, data value can be computed using queries. Thus, standard incremental view maintenance techniques can be applied to maintain data value metrics under data changes by maintaining the results of the queries that compute these metrics. However, novel optimizations are possible if data value is computed at a coarser granularity or if we allow for approximation. To deal with evolving workloads, we may be able to exploit similarities of queries wrt. their relevance "proles". I 7. Data value metrics have to be maintained when the workload or the data evolve. Novel incremental maintenance techniques are needed for the maintenance of approximate and coarse-grained value information, and to handle changes in the workload.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4">Query Value and Dealing with Dark Data</head><p>Initially, we make the simplifying assumption that all queries in a workload are of equal importance. However, in practice this is often not the case, e.g., queries for order processing and advertisement may be more important for the operations of an online retailer than queries that analyze the eectiveness of user support. We discuss how to extend our data value metric to incorporate information about the importance of queries and outline techniques for determining data and query value if both are unknown upfront. Furthermore, data value can identify data that is very infrequently used or not used at all (we refer to this as dark data). We envision the integration of data value into data discovery and recommendation systems that systematically recommend data to users to identify whether data is dark because it is of no use what data has potential value, but has been underutilized.</p><p>We make the following contributions in this work:</p><p>&#8226; We introduce an objective metric of data value based on what data is relevant for which query of a given workload. &#8226; We discuss strategies for eciently computing data value and maintaining it when workloads and / or data evolve. &#8226; We discuss several applications where an objective data value metric brings signicant benets including query processing, resource allocation for data cleaning and preparation, self-tuning and data life-cycle management, enriching data catalogs, and assessment of data value in data markets &#8226; We discuss how to extend our framework to also model the value of queries. &#8226; We sketch ideas for utilizing data value to address the problem of dark data, i.e., data that is not utilized or underutilized, and discuss potential techniques for measuring the value of queries for an organization and how to integrate this with data value metrics.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">DATA RELEVANCE &amp; PROVENANCE</head><p>We now briey review the concept of data relevance from <ref type="bibr">[19]</ref>.</p><p>Intuitively, a data item 3 from a database &#8673; is relevant for a query &amp; if "we need 3 to compute &#8673;". The notion of relevance is closely related to data provenance and indeed we will use "belonging to the provenance of some output of query &amp;" as one way to determine the relevance of a data item. We use R(&amp;; &#8673;) to denote the subset of the database &#8673; that is relevant for answering &amp;. At the minimum, R(&amp;; &#8673;) needs to be sucient for answering query, i.e., evaluating &amp; over R(&amp;; &#8673;) yields the same result as evaluating &amp; over the full database:</p><p>Obviously, suciency is not a strict enough condition to ensure that R(&amp;; &#8673;) contains only data that is needed to produce the result, because suciency does not prevent us from including irrelevant data in R(&amp;; &#8673;) that does not aect the result of &amp;. While also requiring necessity would seem to be the right approach for excluding irrelevant data, this fails in the presence of operators that are disjunctive in nature, e.g., union. Indeed, this is a well-known fact in database provenance. As standard notions of provenance for databases, e.g., provenance polynomials or lineage <ref type="bibr">[13]</ref> handle this issue, we will assume here that R(&amp;; &#8673;) is computed using one of these models. We do not introduce formal denitions of provenance here, but refer the interested reader to a recent survey on the topic <ref type="bibr">[9]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E 6 (D R S).</head><p>Revisiting Example 1, as the reader can verify, the rows highlighted in blue (red) are sucient for producing the same result for &amp; C&gt;? (&amp; ?A&gt;1;4&lt; ). Indeed, these rows are the Lineage <ref type="bibr">[9]</ref> for these two queries.</p><p>Database systems already make use of relevance information in a limited way during query optimization. By analyzing the selection conditions of queries, query optimizers determine statically what data is relevant for answering a query. In the example above, any reasonable query optimizer will determine that only reviews from region "Europe" containing the word "problem" are relevant for answering &amp; ?A&gt;1;4&lt; . The optimizer may then decide whether to make use of the database's physical design artifacts, e.g., indexes, zone maps, or partitions, to lter out irrelevant data early-on during query processing. I 8. Database systems statically analyze what data is relevant for answering a query to determine how to lter out irrelevant data early-on. However, static analysis often yields a coarse over-approximation of what data is relevant, because data relevance is often data-dependent which necessitates dynamic capture of relevance information at query runtime.</p><p>As mentioned before, "being accessed by a database system to evaluate a query" in general does not imply data relevance. For instance, the database may choose a full table scan even though only a fraction of the data in a table may be actually relevant for producing a query result. To see why this is the case, note that for many provenance models, the provenance of a query is sucient for producing the query's result and the provenance of a query is typically only a fraction of the tables accessed by the query. I 9. Data access is not the same as data relevance as typically not all data accessed to answer a query aects the result returned by the query.</p><p>For many use cases of data relevance, it is benecial to understand the relationship of two queries wrt. their relevant data. We write &amp; 1 v A4; &amp; 2 to denote that query &amp; 1 is relevance-contained in &amp; 2 and formally dene relevance containment as shown below.</p><p>That is, relevance-containment implies that the relevant data for &amp; 2 is sucient for answering &amp; 1 . Note that for monotone queries, relevance containment is equivalent to containment of the relevant data, i.e., containment of provenance <ref type="bibr">[12,</ref><ref type="bibr">19]</ref>:</p><p>In <ref type="bibr">[19]</ref> we did present a static method that checks relevance containment for two instances of the same query template, i.e., queries share the same structure and only dier in constants used in selection conditions. We will explain further in Section 5.1 how this check is utilized when using relevance information for a query to answer subsequent queries. This method relies on constraint solvers and is similar in spirit to recent work for checking query equivalence <ref type="bibr">[24]</ref>. Checking relevance containment for queries with dierent structures remains an interesting open problem. I 10. Relevance containment enables relevance information for one query to be reused for another, similar query.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">OBJECTIVE DATA VALUE METRICS</head><p>We argue that an objective metric for the value of data should not assign some intrinsic value to the data, but rather should be based on the actual benet to the organization gained by possessing the data. For instance, a dataset that is stored by an organization but is not accessed by any query / computation has zero value to the organization as deleting the data would not change the outcome of any computations, i.e., whether the dataset exists or not has no bearing on the operations of the organization. However, as discussed in Section 2 being accessed by a computation is a necessary, but not sucient condition for data to be relevant and, thus, of value. Put dierently, data that is accessed by a query (or other type of computation) but is not relevant to that query (computation) can be deleted without aecting the outcome of the query.</p><p>So far we have argued that data that is not relevant to any computation run by an organization is of zero value. Let us start by dening a data value metric that assumes that data is immutable and that the distribution of queries evaluated over this data is known upfront. Assume a workload Q consisting of queries &amp; 8 :</p><p>where each query &amp; 8 is each associated with a frequency &amp; 8 2 [0, 1] (out of all queries executed by the workload what is the relative frequency of &amp; 8 ). This workload could constitute all queries run by the organization or just the queries used by a particular set of users or applications for which we want to understand the value of data towards Q. The value V (3) of a data item 3 (e.g., we could measure value at the granularity of rows or at a coarser granularity such as fragments of a table) is dened as the fraction of queries weighted by their frequency for which the data item is relevant:</p><p>As already discussed in Section 1, it would be possible to replace the binary indicator 1[R (&amp;; 3)] with a metric that measures the contribution of a data item to the result of the query. Several such attribution measures have been discussed in related work including Shapley values <ref type="bibr">[3]</ref> and causal responsibility <ref type="bibr">[8]</ref>. However, these techniques are in general of high computational complexity. It remains to be seen whether it is possible to develop ecient approximations of such metrics for complex computations that could be utilized for measuring data value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Updates</head><p>Let us now relax the assumption that data is immutable. If data is subject to change, then the value of data will change over time too. For simplicity, we will assume a linear history D of database versions:</p><p>Then the value of data is specic to a database version &#8673; 8 , i.e., V 8 (&amp;; 3) denotes the value of data item 3 wrt. query &amp; evaluated over database version &#8673; 8 . For transactional histories executed under weaker isolation levels, e.g., read-committed snapshot isolation <ref type="bibr">[22]</ref>, data value becomes specic to a statement within a particular transactions. However, it is probably sensible to restrict data value measurements to committed data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Changes to the Workload</head><p>Note that when we also consider an evolving workload, then the relevance of data items wrt. queries, i.e., R(&amp;; 3), does not change.</p><p>However, we can no longer assume that the frequency of queries is constant. One way to approach changing workloads is to weight instances of a query &amp; based on time, e.g., we could replace &amp; with a weighted sum over weights that decrease over time. Let T &amp; = g 1 , . . . , g &lt; denote the points in time when query &amp; was executed and let F () be is a function of the current time g =&gt;F and a time point g such that F (g =&gt;F , g 1 ) &lt; F (g =&gt;F , g 2 ) if g 1 &lt; g 2 (the weight is antimonotonic wrt. the temporal order, e.g., F (g =&gt;F , g) = 1 1+(g =&gt;F g ) ).</p><p>We can then redene &amp; as a function of the current time g =&gt;F as shown below:</p><p>Of course, one can imagine many other reasonable weight denitions. Putting this together with a temporal version of relevance based on updates to the data, i.e., assuming that &#8673; g is the database version at time g, we get a denition of data value that takes changes to both the data and workload into account:</p><p>That is, the value of data item 3 at time g is the contribution of 3 to each query at the time of the query's execution (if that execution happened before g) weighted using F (&#8226;, &#8226;) based on how recent that execution is. Note that here R(&amp;; 3; g) denotes R(&amp;; 3) computed for &#8673; g .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">MEASURING AND MANAGING DATA VALUE</head><p>In this section, we discuss strategies for managing data value information eciently including capturing (approximate) data value information at dierent granularities, how to approximate data value for workloads through sampling and clustering of queries, and how to deal with evolving workloads and data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Measuring (Approximate) Data Value</head><p>For now consider a static workload and data that is not subject to updates. A naive way to measure data value according to Equation (1) would be to capture provenance information for every query evaluated over the data and maintain a counter for each tuple that tracks the number of queries for which the tuple is relevant. However, this would result in signicant runtime overhead and require additional storage that is linear in the size of the database. Data Granularity. Our value metric for data can be applied at any level of granularity, e.g., rows, pages, or larger data chunks. Thus, an obvious optimization is to record data value at a coarser granularity at the cost of loosing some delity of the value information. In addition to reducing storage consumption, this also reduces the overhead of capturing provenance for computing data value. For instance, in <ref type="bibr">[19]</ref> we did capture relevance information at the granularity of fragments of a horizontal partitioning of a table resulting in provenance sketches which can be encoded compactly as bitvectors (one bit per fragment that indicates whether the fragment is relevant or not). Another alternative is to use approximate data structures representing sets such as bloom lters and store rowlevel provenance using such a data structure. To give a concrete example, in <ref type="bibr">[19]</ref> we did evaluate the storage requirements of provenance sketches against materialized views for a given workload. The total amount of storage required for all materialized views required to support the workload eciently ranged between &#8672;2GB and &#8672;44GB while the total size of provenance sketches was no more than &#8672;200KB. To capture data value instead of relevance we can simply replace the bit vector with an array of counters (integer values). Whenever a data item is found to be relevant for a query, its counter is increased and we also maintain a global counter for queries.</p><p>Example 4.1 (Provenance Sketches for Approximating Value). Consider the provenance (relevant rows) in table products for our running example query &amp; C&gt;? . Furthermore, consider the horizontal partition = {5 ;0?C&gt;? , 5 34B:C&gt;? , 5 BB3 , . . .} of this table based on the value of attribute category (a so-called range-partitioning):</p><p>Using this partition of table products, we can create a provenance sketch P &amp; C&gt;? , , for &amp; C&gt;? by recording which of the three fragments contain relevant rows (provenance). In our example, these are fragments 5 ;0?C&gt;? and 5 BB3 . Thus,</p><p>The data in the sketch is sucient for producing the result (it contains all relevant rows). However, in general such a sketch will contain some irrelevant data as typically not all data in a fragment will belong to the provenance of the query. For instance, for a larger database instance, there may be some laptops and ssds do not have a suciently high revenue (HAVING sum(quantity * price) &gt; 1000).</p><p>Provenance sketches are very storage-ecient: a provenance sketch requires one bit of storage per fragment (is the fragment in the sketch or not) in addition to storing information about the partitioning, e.g., the ranges of domain values for a range-based horizontal partitioning. However, note that partition information can be shared across sketches that are based on the same partitioning.</p><p>While most provenance models have been shown to produce provenance that is sucient, this often only holds for the granularity (most often rows) for which the model was designed for. Thus, one challenge with recording data relevance at coarser granularity is that we need to ensure that the over-approximation of row-level provenance encoded by a sketch is sucient which is guaranteed for monotone queries, but requires more care for nonmonotone queries (e.g., negation or aggregation). We envision an approach that statically analyzes queries to determine whether an over-approximation is safe (preserves suciency). In <ref type="bibr">[19]</ref>, we did introduce a sound safety check that can determine statically whether a provenance sketch based on range-partitioning a table on a set of attributes is sucient for the query it was produced for. We plan to extend this technique to other types of partitioning of input tables. One challenge with using provenance sketches to measure data value is that the accuracy of the over-approximation of relevant data for a query is quite sensitive to the choice of partition. That is, to get an accurate estimate of data value we may have to choose dierent partitions for dierent queries. It not clear immediately how to combine sketches that are based on dierent partitions without loosing accuracy. I 11. Tracking approximate data value at a coarser granularity can signicantly improve the cost of measuring data value.</p><p>Sampling &amp; Clustering Queries. We demonstrated in <ref type="bibr">[19]</ref> that capturing provenance sketches is signicantly faster than capturing full provenance, but still has non-trivial overhead compared to regular query execution. Piggy-backing provenance capture onto regular query processing <ref type="bibr">[9]</ref> can further reduce the overhead, but requires extensions to the query engine. To further reduce the cost of measuring data value, we can avoid capturing relevance information for every query of a workload. A straight-forward strategy would be to just compute relevance for a randomly sampled subset of queries to approximate data value for a workload. The problem with this approach is that it ignores that some queries may be quite similar in terms of their relevance proles and that the amount of relevant data per query may be skewed, e.g., it is common that heavy analytical queries are executed less frequently than OLTP queries, but access signicantly more data. Such infrequent queries can easily be missed by a naive sampling mechanism.</p><p>One possible strategy for addressing this problem is to cluster queries in a workload based on having similar data relevance proles. Prior work on clustering queries could be of use here, e.g., see <ref type="bibr">[17]</ref> for a comparison of similarity metrics for clustering queries. However, most prior work tries to cluster queries based on similar structure while for our use case we need to cluster queries based on what subset of the data is relevant for the query. Assuming such a similarity metric, we could cluster queries in the workload and then sample data relevance stratied by cluster to get a good approximation of data value for the whole workload. An alternative to using clustering is to train embeddings for queries and data <ref type="bibr">[23]</ref> and use similarity in the embedding space instead. A generalization of the notion of relevance containment discussed in Section 3 to a relevance overlap measure could be the basis of a distance metric for queries.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Non-relational Data</head><p>While we have only discussed data value for relational data so far, the concept of data value extends to non-relational data models such as semi-structured data models like JSON, graph data, and potentially vector databases. Specically, for semi-structured data it is common to ingest raw data produced by applications, e.g., logs of web API requests, directly into a database or key-value store. Relevant parts of the data are then extracted at query time. Such applications would benet signicantly from data relevance tracking to prune irrelevant data early-on during query processing. Furthermore, data relevance information could be used to enrich structure-based data guides <ref type="bibr">[11]</ref> to guide users in formulating queries based on summarized information about data content in addition to information about the data's structure.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Dealing with Evolving Workloads and Data</head><p>For many applications, data and workloads are not static, but evolve over time. Thus, data value information has to be maintained to reect these changes to the workload and the data. As relevance information can be computed using queries, existing incremental view maintenance techniques <ref type="bibr">[14]</ref> are applicable to maintain data value metrics when data changes. However, as we would typically want to approximate data value for performance reasons, the main dierence to classical incremental maintenance is that we may be able to trade accuracy of the data value metric for improved performance. For instance, we may optimistically assume that data that is inserted contributes to a query result without actually incrementally maintaining the result.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Example 4.2 (Approximate Incremental Maintenance of Data Value).</head><p>Continuing with Example 4.1, assume that a new product of category ssd with pid 4 got inserted into the table products. As the fragment 5 BB3 for this category value is already part of the provenance sketch we have the choice to compute the aggregation result for the group pid = 4 if the new row has any join partners in table orders. Alternatively, we can skip this step as 5 BB3 is already part of the sketch. Choosing the second option would result in an approximate sketch on which we can no longer process deletions accurately: if the group ?83 = 3 is deleted, then we do not know whether 5 BB3 will be still part of the sketch. I 12. Data value needs to be incrementally maintained when data or workloads evolve. If tracked through provenance sketches, novel optimizations are possible that trade maintenance performance for size of the maintained sketches.</p><p>To maintain data value under evolving workloads, we have to maintain the clustering of queries taking also the changes to data into account as cluster membership may change depending on the data distribution.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Legal Requirements</head><p>Organizations may be subject to legal requirements about how long they have to retain data. As we will discuss in Section 5, data value will be used to make data management decisions, e.g., we may delete or compress irrelevant data. To ensure that legal requirements about data retention are fullled we could assign innite value to data that should be retained to ensure it will not be deleted. However, such data may no longer be in use and, thus, could be of low value otherwise. An alternative is to track retention requirements in addition to tracking data value and always ensure that data management decisions do not violate retention requirements.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">UTILIZING DATA VALUE</head><p>We now discuss how data value information can be leveraged to improve many aspects of data management for improving query performance by ltering irrelevant data, over informed allocation of resources (both human and computational), enriching metadata management, and improving data markets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Filtering Irrelevant Data</head><p>An important application of data value is provenance-based data skipping (PBDS), i.e., ltering irrelevant data early-on during query processing to improve query performance. We have explored this use case of data relevance / value in <ref type="bibr">[19]</ref>. In this work, we capture provenance sketches, compact over-approximations of what data is relevant for a query &amp;. A provenance sketch P encodes a subset of the database &#8673; P that is guaranteed to be sucient, i.e., evaluating the query over &#8673; P yields the same result as evaluating the query over the full database. For example, consider the case of our running example query &amp; C&gt;? from Figure <ref type="figure">1</ref>. Without a provenance sketch, evaluating this query requires computing the revenue for each product in the database. However, only data for products with a revenue above $1,000 are relevant for producing &amp; C&gt;? 's result. A provenance sketch records data relevance at the granularity of fragments of a horizontal partitioning of the input tables. For instance, we may build a sketch on the product table's category attribute. As explained in Example 4.1, only the fragments for categories laptop and ssd do contain relevant data and, thus, belong to the sketch. Capturing a provenance sketch for a query requires instrumenting the query to generate the sketch: we spend time to create the sketch. Once a sketch for a query &amp; has been created it can be used to speedup future executions of &amp; and queries similar to &amp;. For example, the sketch for &amp; C&gt;? based on product category could be used to answer a variant of &amp; C&gt;? with a more restrictive HAVING clause condition, say HAVING sum(quantity * price) &gt; 2000. To use a sketch P to answer a query &amp;, we add additional WHERE clause conditions to the query to lter out data that does not belong to the sketch. For instance, to use the sketch for &amp; C&gt;? on product category we would add the following lter condition:</p><p>We present in <ref type="bibr">[19]</ref> techniques for generating sketches, using sketches, and discuss how to determine when a sketch can be reused. Furthermore, we demonstrated that query performance can be signicantly improved with PBDS, even for standard benchmark workloads for which database systems are heavily optimized for. By capturing relevance information, PBDS is also a main enabler for many of the advanced applications of data value discussed in the following.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Information Life-cycle Management &amp; Adaptive Physical Design</head><p>Information life-cycle management decides what where to store data (what device) and what format to use as well as what data to cache when and in which format. Typically, information lifecycle management decisions are based on tracking data access. For instance, infrequently accessed data can be heavily compressed to save storage space without much eect on query performance.</p><p>We argue that decisions on where to place data and whether to compress it should be based on data value instead. This has the potential to signicantly improve life-cycle management as only data that is actually needed to answer queries will be placed on faster storage or will be cached in-memory. For example, consider again the running example query &amp; C&gt;? . We explained above that without data value, &amp; C&gt;? has to compute the revenue for every product in the database. However, as only products with a revenue above 1,000 are relevant for producing the result of this query, it is more ecient to prune irrelevant data before computing aggregation results. Furthermore, data that is not relevant for the two queries of our example workload could be compressed (or even deleted) while data that is relevant for these two queries should be cached in memory. Additionally, data value information can be utilized for adaptive physical design and other self-tuning tasks. For example, instead of indexing full tables, we can only build indexes over data that is relevant for a signicant fraction of queries executed as part of a workload to reduce index storage size and improve look-up performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Intelligent Resource Allocation</head><p>Data collection and curation requires extensive human and computational resources, but is critical for maintaining data quality and, thus, trustworthy analysis results. As the resources of an organization are limited, it is important to allocate resources where they are most needed. While this idea has been explored in on-demand data curation / integration, we argue for the use of data value as a means for selecting how to spend resources most eectively. Specically, resources should be allocated to the most valuable data. However, additional challenges have to be overcome for this to be eective as the impact of spending a given amount of resources may not be uniform across dierent parts of the data and the improvement in data quality gained may not be linear in the amount of resources spent on curating a subset of the data. We envision that curation eorts can be monitored to learn over time how much eort is required for certain curation tasks. Furthermore, techniques like reinforcement learning could be applied to explore the impact of data curation eorts and then optimize for curation eorts that eectively improve the quality of the most valuable data.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.4">Enriching Metadata Management</head><p>Metadata management systems like Goods <ref type="bibr">[15]</ref> and Apache Atlas <ref type="bibr">[1]</ref> enable users to keep track of data in their data lakes, record provenance information, and facilitate dataset discovery. We argue that data value information should be also managed by such systems. For facilitating data discovery, data value should be recorded for several representative workloads to enable users to select data that is relevant for their tasks. For instance, an organization may dene a set of use cases (e.g., HR, advertizing, or logistics) each associated with a workload for which data value is tracked. When a user is searching for data that may be used for a particular task, we could then determine which workload is most similar to the user's task and then use the value of a dataset for this workload as an estimate for the relevance of the data for the user's task. Query similarity metrics and clustering of workload queries as explained in Section 4.1 could be utilized to identify which workload is most similar to the user's task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.5">Data Markets</head><p>Data markets <ref type="bibr">[2,</ref><ref type="bibr">7]</ref> enable data owners to generate revenue from their data by making it available for purchase to buyers who benet by gaining access to the data they need. The price of data in data markets is subject to economic forces and considerations (e.g., ensuring arbitrage-free pricing <ref type="bibr">[16]</ref>). Nonetheless, we argue that our data value metric could be critical for buyers and sellers to make informed decisions. Assuming that data value is tracked for representative workloads as outlined above, data buyers can take the value of data for workloads similar to the workload they want to purchase data for into account to decide how valuable the data would be to them and, thus, how much they are willing to spend on the data. Furthermore, organizations running data marketplaces can measure data value for representative workloads on the data uploaded to their platform to enable their buyers to make informed choices which increases the utility of the market place.</p><p>6 DARK DATA AND QUERY VALUE 6.1 Dealing with Dark Data</p><p>With growing amount of data, it also becomes harder for users in an organization to identify which data is relevant for their use. While our data value metrics can identify which data is most relevant for an application, this only works under the assumption that the datasets containing the data are already used by this application.</p><p>That is, some data may have currently no value as it is not relevant in the context of a workload, but has potential value, i.e., it has intrinsic value for an application, but this value has not been realized yet as the application did not make use of the data. We refer to data that currently has no / low value, but may have signicant value if utilized for the right application as dark data. We envision a dataset discovery system that helps users to distinguish between not valuable and potentially valuable dark data through controlled recommendations <ref type="bibr">[18]</ref>. That is, a dataset discovery system would recommend dark data to users and then measure data value of this data for the queries run by users over the recommended data. This would enable the system to discover over time which data is useful and which data is of no potential value.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">The Value of Computations / Queries</head><p>So far we assumed that the results of all queries / computations have equal value to an organization. Of course, this assumption may not be realistic. In this section we consider how data value metrics are aected by dropping this assumption. Furthermore, we outline strategies for measuring the value of a query either by assuming it as input given by the user (if there is a feasible way for the users to assign value to their queries which may or may not be realistic) or how to automatically compute it.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.1">Modelling ery</head><p>Value. To model the value of a query for an organization we associate a numeric value V (&amp;) 2 [0, 1] to each query &amp; of a workload Q. Intuitively, a value of zero indicates that the query's result is of no value to the organization while a value of one indicates that the result of the query is of maximal importance. These values are used as weights to adjust the value of a data item 3 2 &#8673; for a database &#8673;. Thus, the denition of the value of 3 is updated as follows:</p><p>Query Value as User Input. In the simplest possible case, the value for queries in a workload Q is provided as input by the user.</p><p>In this case we can directly apply the formula shown above to compute the value of a data item 3.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2.2">Automatically Determining ery</head><p>Value. Even if the value of a query is not known apriori, we can still make progress based on the following observations: (i) query value depends on valuable data being relevant. When a highly valuable data item is relevant to a query &amp;, then this increases the value of the query; and (ii) being relevant to valuable queries increases data value. If data is relevant to an important query, then this increases the value of the data as this data is used to compute query results that are important to the organization. To summarize, data and query value are intrinsically connected. However, in the absence of ground truth information about either query or data value, we are facing the following dilemma: to compute data value we need to know the value of queries and vice versa! To overcome this dilemma, we envision to adapt techniques, e.g., PageRank <ref type="bibr">[4]</ref>, which compute (approximate) solutions for xpoint equations similar to ours.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">RELATED WORK</head><p>Data value is also an important concept in data markets <ref type="bibr">[2,</ref><ref type="bibr">7,</ref><ref type="bibr">16]</ref>. However, that line of work has focused more on economic aspects of data pricing rather than on providing an objective metric for how valuable data is for an organization. The approximation techniques we propose for approximating data value in addition to their basis in provenance sketches <ref type="bibr">[19]</ref> are also closely related to data summarization techniques for explanations <ref type="bibr">[10]</ref>. The problem of determining query and data value that are mutually depend is similar to other problems for computing xpoints for recursive equation systems such as PageRank <ref type="bibr">[4]</ref> and related algorithms such as detecting spam reviews based on trust <ref type="bibr">[21]</ref> (both the trustworthiness of reviewers and quality of items is not known upfront). Automated techniques for physical design and data life-cycle management <ref type="bibr">[5,</ref><ref type="bibr">20]</ref> are often based on data access patterns. Any such technique can be improved by using data value to ignoring data that is not relevant for generating query results. To selectively measure data value for only some queries, we need techniques for clustering queries and computing similarity between queries. Existing techniques <ref type="bibr">[17]</ref> for measuring query similarity could be applied to this problem. However, for our use case, query similarity should be based on what data is relevant.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">CONCLUSIONS</head><p>In this work, we have introduced an objective metric for data value based on provenance <ref type="bibr">[9]</ref> and the recently introduced concept of data relevance <ref type="bibr">[19]</ref>. We argue that data is only of value by being used to generate results of queries (or more generally computations). These results, and not the raw data, are what is observed by applications and users of a database (or data lake). As data relevance records what data is required to answer a query, relevance provides us with the means to determine the value of data based on its role of generating query results. We discuss the signicant benets of objective data value metrics in improving various aspects of data management like data life-cycle management, pruning of irrelevant data during query execution (we explored this aspect in detail in <ref type="bibr">[19]</ref>), and for self-tuning. We also present possible strategies for measuring and managing data value and how to approximate value to reduce the overhead of measuring and maintaining these metrics. In future work, we plan to integrate data value management into database (and data lake) systems, investigate techniques for measuring query value, and explore additional use cases of data value to improve data management operations.</p></div></body>
		</text>
</TEI>
