NSF PAR Search | NSF Public Access Repository


Ver: View Discovery in the Wild

Yue Gong, Zhiru Zhu (January 2023, Proceedings International Conference on Data Engineering)

We present Ver, a data discovery system that identifies project-join views over large repositories of tables that do not contain join path information, even when input queries are inaccurate. Ver implements a reference architecture to solve both the technical (scale and search) and human (semantic ambiguity, navigating a large number of results) problems of view discovery. With a user study, we demonstrate that users find the view they want when using Ver, and we demonstrate its performance with large-scale end-to-end experiments on real-world datasets containing tens of millions of join paths.
Full Text Available
Data-Sharing Markets: Model, Protocol, and Algorithms to Incentivize the Formation of Data-Sharing Consortia

Raul Castro Fernandez (January 2023, Proceedings ACM SIGMOD International Conference on Management of Data)

Organizations that would mutually benefit from pooling their data are otherwise wary of sharing. This is because sharing data is costly (in time and effort) and, at the same time, the benefits of sharing are not clear. Without a clear cost-benefit analysis, participants default to not sharing. As a consequence, many opportunities to create valuable data-sharing consortia never materialize and the value of data remains locked. We introduce a new sharing model, market protocol, and algorithms to incentivize the creation of data-sharing markets. The combined contributions of this paper, which we call DSC, incentivize the creation of data-sharing markets that unleash the value of data for its participants. The sharing model introduces two incentives: one that guarantees that participating is better than not doing so, and another that compensates participants according to how valuable their data is. Because operating a consortium is costly, we are also concerned with ensuring its operation is sustainable: we design a protocol that ensures that valuable data-sharing consortia form when it is sustainable. We introduce algorithms to elicit the value of data from the participants, which is used to, first, cover the costs of operating the consortia and, second, compensate data contributions. For the latter, we challenge the use of the Shapley value to allocate revenue. We offer analytical and empirical evidence for this and introduce an alternative method that compensates participants better and leads to the formation of more data-sharing consortia.
Full Text Available
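
For readers unfamiliar with the Shapley value that the abstract above challenges as a revenue-allocation rule, the following is a minimal sketch of the standard definition only: each participant's share is their marginal contribution averaged over every order in which participants could join. It does not implement the paper's alternative method. The participants "A", "B", "C" and the coalition values in `COALITION_VALUE` are hypothetical numbers chosen for illustration.

```python
from itertools import permutations
from math import factorial

# Hypothetical toy game: the revenue each possible coalition could raise.
COALITION_VALUE = {
    frozenset(): 0,
    frozenset("A"): 10, frozenset("B"): 20, frozenset("C"): 30,
    frozenset("AB"): 40, frozenset("AC"): 50, frozenset("BC"): 60,
    frozenset("ABC"): 90,
}


def toy_value(coalition):
    """Look up the (assumed) revenue of a coalition of participants."""
    return COALITION_VALUE[frozenset(coalition)]


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all join orders.

    Enumerates every permutation, so it is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


shares = shapley_values(["A", "B", "C"], toy_value)
print(shares)
```

The shares always sum to the value of the full coalition, which is what makes the rule a candidate for splitting revenue. The exhaustive enumeration also shows why exact computation becomes expensive as the number of participants grows.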
Protecting Data Markets from Strategic Buyers

https://doi.org/10.1145/3514221.3517855

Raul Castro Fernandez (January 2023, Proceedings ACM SIGMOD International Conference on Management of Data)

The growing adoption of data analytics platforms and machine learning-based solutions for decision-makers creates a significant demand for datasets, which explains the appearance of data markets. In a well-functioning data market, sellers share data in exchange for money, and buyers pay for datasets that help them solve problems. The market raises sufficient money to compensate sellers and incentivize them to keep sharing datasets. This low-friction matching of sellers and buyers distributes the value of data among participants. But designing online data markets is challenging because they must account for the strategic behavior of participants. In this paper, we introduce techniques to protect data markets from strategic participants, even when the asset traded is data. We combine those techniques into a pricing algorithm specifically designed to trade data. The evaluation includes a user study and extensive simulations. Together, the evaluation demonstrates how participants strategize and the effectiveness of our techniques.
Full Text Available
Data station: delegated, trustworthy, and auditable computation to enable data-sharing consortia with a data escrow

https://doi.org/10.14778/3551793.3551861

Xia, Siyuan; Zhu, Zhiru; Zhu, Chris; Zhao, Jinjin; Chard, Kyle; Elmore, Aaron J.; Foster, Ian; Franklin, Michael; Krishnan, Sanjay; Fernandez, Raul Castro (July 2022, Proceedings of the VLDB Endowment)

Pooling and sharing data increases and distributes its value. But since data cannot be revoked once shared, scenarios that require controlled release of data for regulatory, privacy, and legal reasons default to not sharing. Because selectively controlling what data to release is difficult, the few data-sharing consortia that exist are often built around data-sharing agreements resulting from long and tedious one-off negotiations. We introduce Data Station, a data escrow designed to enable the formation of data-sharing consortia. Data owners share data with the escrow knowing it will not be released without their consent. Data users delegate their computation to the escrow. The data escrow relies on delegated computation to execute queries without releasing the data first. Data Station leverages hardware enclaves to generate trust among participants, and exploits the centralization of data and computation to generate an audit log. We evaluate Data Station on machine learning and data-sharing applications while running on an untrusted intermediary. In addition to important qualitative advantages, we show that Data Station: i) outperforms federated learning baselines in accuracy and runtime for the machine learning application; ii) is orders of magnitude faster than alternative secure data-sharing frameworks; and iii) introduces small overhead on the critical path.
Full Text Available
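
The escrow idea in the abstract above (owners register data, users delegate computation, every request is logged) can be illustrated with a toy in-memory sketch. This is not the Data Station API: the class `ToyEscrow`, its methods, and the "salaries" example are all hypothetical, and hardware enclaves, encryption, and real policy enforcement are not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class ToyEscrow:
    """Hypothetical escrow: holds data, runs approved functions, logs every request."""
    _data: Dict[str, List[float]] = field(default_factory=dict)
    _allowed: Dict[str, Set[Tuple[str, str]]] = field(default_factory=dict)
    audit_log: List[Tuple[str, str, str, bool]] = field(default_factory=list)

    def register(self, dataset: str, rows: List[float]) -> None:
        """Owner shares data with the escrow; nothing is released yet."""
        self._data[dataset] = list(rows)
        self._allowed[dataset] = set()

    def grant(self, dataset: str, user: str, function_name: str) -> None:
        """Owner consents to one user running one named function on the dataset."""
        self._allowed[dataset].add((user, function_name))

    def run(self, user: str, dataset: str, fn: Callable[[List[float]], float]) -> float:
        """User delegates a computation; only its result leaves the escrow."""
        permitted = (user, fn.__name__) in self._allowed.get(dataset, set())
        # Every request is recorded, whether or not it was permitted.
        self.audit_log.append((user, dataset, fn.__name__, permitted))
        if not permitted:
            raise PermissionError(f"{user} may not run {fn.__name__} on {dataset}")
        # Toy simplification: functions are identified by name and trusted not
        # to return the raw rows; a real escrow would need stronger controls.
        return fn(list(self._data[dataset]))


def mean(rows: List[float]) -> float:
    return sum(rows) / len(rows)


escrow = ToyEscrow()
escrow.register("salaries", [50.0, 70.0, 90.0])
escrow.grant("salaries", "alice", "mean")
result = escrow.run("alice", "salaries", mean)
print(result, escrow.audit_log)
```

Centralizing both the data and the computation in one place is what makes the audit log possible in this sketch: every call passes through `run`, so denied requests are recorded as well as permitted ones.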
