Title: Dealer: an end-to-end model marketplace with differential privacy
Data-driven machine learning has become ubiquitous. A marketplace for machine learning models connects data owners and model buyers, and can dramatically facilitate data-driven machine learning applications. In this paper, we take a formal data marketplace perspective and propose the first enD-to-end modEl mArketpLace with diffErential pRivacy (Dealer) towards answering the following questions: How to formulate data owners' compensation functions and model buyers' price functions? How can the broker determine prices for a set of models to maximize the revenue with an arbitrage-free guarantee, and train a set of models with maximum Shapley coverage given a manufacturing budget to remain competitive? For the former, we propose a compensation function for each data owner based on Shapley value and privacy sensitivity, and a price function for each model buyer based on Shapley coverage sensitivity and noise sensitivity. Both privacy sensitivity and noise sensitivity are measured by the level of differential privacy. For the latter, we formulate two optimization problems for model pricing and model training, and propose efficient dynamic programming algorithms. Experimental results on the real chess dataset and synthetic datasets justify the design of Dealer and verify the efficiency and effectiveness of the proposed algorithms.
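The abstract bases each data owner's compensation on a Shapley value. As a rough illustration of that quantity only, the sketch below computes exact Shapley values for a tiny hypothetical game by averaging marginal contributions over all orderings. The names `shapley_values` and `toy_utility` and the utility function itself are assumptions made for this example. They are not taken from the paper, and the paper's compensation function (which also involves privacy sensitivity) is not reproduced here.

```python
from itertools import permutations
from math import factorial


def shapley_values(players, utility):
    """Exact Shapley values: average marginal contribution over all orderings.

    `utility` maps a frozenset of players to a number. The cost grows as n!,
    so this is only practical for a handful of players.
    """
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        previous = utility(coalition)
        for p in order:
            coalition = coalition | {p}
            current = utility(coalition)
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    n_orders = factorial(len(players))
    return {p: total / n_orders for p, total in totals.items()}


def toy_utility(coalition):
    """Hypothetical utility: the model is only useful if both a and b contribute."""
    return 1.0 if {"a", "b"} <= coalition else 0.0


if __name__ == "__main__":
    # Owner c never changes the utility, so its Shapley value is zero.
    print(shapley_values(["a", "b", "c"], toy_utility))
    # prints {'a': 0.5, 'b': 0.5, 'c': 0.0}
```

In this toy game the values sum to the utility of the full coalition, and the owner who never changes the utility receives zero.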
Award ID(s):
2027783 1952192
NSF-PAR ID:
10225109
Journal Name:
Proceedings of the VLDB Endowment
Volume:
14
Issue:
6
ISSN:
2150-8097
Page Range / eLocation ID:
957 to 969
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The increasing demand for data-driven machine learning (ML) models has led to the emergence of model markets, where a broker collects personal data from data owners to produce high-usability ML models. To incentivize data owners to share their data, the broker needs to price data appropriately while protecting their privacy. For equitable data valuation, which is crucial in data pricing, Shapley value has become the most prevalent technique because it satisfies all four desirable properties in fairness: balance, symmetry, zero element, and additivity. For the right to be forgotten, which is stipulated by many data privacy protection laws to allow data owners to unlearn their data from trained models, the sharded structure in ML model training has become a de facto standard to reduce the cost of future unlearning by avoiding retraining the entire model from scratch. In this paper, we explore how the sharded structure for the right to be forgotten affects Shapley value for equitable data valuation in model markets. To adapt Shapley value for the sharded structure, we propose S-Shapley value, a sharded structure-based Shapley value, which satisfies four desirable properties for data valuation. Since we prove that computing S-Shapley value is #P-complete, two sampling-based methods are developed to approximate S-Shapley value. Furthermore, to efficiently update valuation results after data owners unlearn their data, we present two delta-based algorithms that estimate the change of data value instead of the data value itself. Experimental results demonstrate the efficiency and effectiveness of the proposed algorithms.
  2. Personal information and other types of private data are valuable for both data owners and institutions interested in providing targeted and customized services that require analyzing such data. In this context, privacy is sometimes seen as a commodity: institutions (data buyers) pay individuals (or data sellers) in exchange for private data. In this study, we examine the problem of designing such data contracts, through which a buyer aims to minimize his payment to the sellers for a desired level of data quality, while the latter aim to obtain adequate compensation for giving up a certain amount of privacy. Specifically, we use the concept of differential privacy and examine a model of linear and nonlinear queries on private data. We show that conventional algorithms that introduce differential privacy via zero-mean noise fall short for the purpose of such transactions as they do not provide sufficient degree of freedom for the contract designer to negotiate between the competing interests of the buyer and the sellers. Instead, we propose a biased differentially private algorithm which allows us to customize the privacy-accuracy tradeoff for each individual. We use a contract design approach to find the optimal contracts when using this biased algorithm to provide privacy, and show that under this combination the buyer can achieve the same level of accuracy with a lower payment as compared to using the unbiased algorithms, while incurring lower privacy loss for the sellers. 
  3. We develop a new nonparametric approach for discrete choice and use it to analyze the demand for health insurance in the California Affordable Care Act marketplace. The model allows for endogenous prices and instrumental variables, while avoiding parametric functional form assumptions about the unobserved components of utility. We use the approach to estimate bounds on the effects of changing premiums or subsidies on coverage choices, consumer surplus, and government spending on subsidies. We find that a $10 decrease in monthly premium subsidies would cause a decline of between 1.8% and 6.7% in the proportion of subsidized adults with coverage. The reduction in total annual consumer surplus would be between $62 and $74 million, while the savings in yearly subsidy outlays would be between $207 and $602 million. We estimate the demand impacts of linking subsidies to age, finding that shifting subsidies from older to younger buyers would increase average consumer surplus, with potentially large impacts on enrollment. We also estimate the consumer surplus impact of removing the highly‐subsidized plans in the Silver metal tier, where we find that a nonparametric model is consistent with a wide range of possibilities. We find that comparable mixed logit models tend to yield price sensitivity estimates toward the lower end of the nonparametric bounds, while producing consumer surplus impacts that can be both higher and lower than the nonparametric bounds depending on the specification of random coefficients. 
  4. Africa's continental crust hosts a variety of geologic terrains and is crucial for understanding the evolution of its longest‐lived cratons. However, few of its seismological models have yet incorporated the largest continent‐wide noise dispersion data sets. Here, we report on new insights into Africa's crustal architecture obtained using a new data set and model assessment product, ADAMA, which comprises a large ensemble of short‐period surface wave dispersion measurements: 5–40 s. We construct a continent‐wide model of Africa's Crust Evaluated with ADAMA's Rayleigh Phase maps (ACE‐ADAMA‐RP). Dispersion maps, and uncertainties, are obtained with a probabilistic approach. This model update, and a crustal taxonomy derived from unsupervised machine learning, reveals that the architecture of Africa's crust can be classified into two main types: primitive (C1: faster velocities with little gradients) and modified (C2–C4: slower velocities in the shallow crust with more pronounced gradients). The Archean shields are “primitive,” showing little variation or secular evolution. The basins, orogens, and continental margins are “modified” and retain imprints of surface deformation. The crustal taxonomy is obtained without a‐priori geological information and differs from previous classification schemes. While most of our reported features are robust, probabilistic modeling suggests caution in the quantitative interpretations where illumination is compromised by low‐quality measurements, sparse coverage or both. Future extension of our approach to other complementary seismological and geophysical data sets (for example, multimode earthquake dispersion, receiver functions, gravity, and mineral physics) will enable continent‐wide lithospheric modeling that extends resolution to the upper mantle.
  5. Label differential privacy is a relaxation of differential privacy for machine learning scenarios where the labels are the only sensitive information that needs to be protected in the training data. For example, imagine a survey from a participant in a university class about their vaccination status. Some attributes of the students are publicly available but their vaccination status is sensitive information and must remain private. Now if we want to train a model that predicts whether a student has received vaccination using only their public information, we can use label-DP. Recent works on label-DP use different ways of adding noise to the labels in order to obtain label-DP models. In this work, we present novel techniques for training models with label-DP guarantees by leveraging unsupervised learning and semi-supervised learning, enabling us to inject less noise while obtaining the same privacy, therefore achieving a better utility-privacy trade-off. We first introduce a framework that starts with an unsupervised classifier f0 and dataset D with noisy label set Y, reduces the noise in Y using f0, and then trains a new model f using the less noisy dataset. Our noise reduction strategy uses the model f0 to remove the noisy labels that are incorrect with high probability. Then we use semi-supervised learning to train a model using the remaining labels. We instantiate this framework with multiple ways of obtaining the noisy labels and also the base classifier. As an alternative way to reduce the noise, we explore the effect of using unsupervised learning: we only add noise to a majority voting step for associating the learned clusters with a cluster label (as opposed to adding noise to individual labels); the reduced sensitivity enables us to add less noise. Our experiments show that these techniques can significantly outperform the prior works on label-DP.
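The last abstract above refers to adding noise to labels to obtain label-DP. One standard way to do this, shown here only as a minimal sketch, is K-ary randomized response: each label is kept with probability e^ε / (e^ε + K − 1) and otherwise replaced by a uniformly chosen different label. This is a generic baseline and not necessarily the mechanism used in that work. The function names and parameters below are assumptions made for this example.

```python
import math
import random


def keep_probability(num_classes, epsilon):
    """Probability of reporting the true label under K-ary randomized response."""
    e = math.exp(epsilon)
    return e / (e + num_classes - 1)


def randomized_response(label, num_classes, epsilon, rng):
    """Return a noisy copy of `label` (an int in [0, num_classes)).

    The true label is kept with probability keep_probability(...); otherwise
    one of the other num_classes - 1 labels is reported uniformly at random.
    """
    if rng.random() < keep_probability(num_classes, epsilon):
        return label
    other = rng.randrange(num_classes - 1)
    # Skip over the true label so the reported label always differs from it.
    return other if other < label else other + 1


if __name__ == "__main__":
    rng = random.Random(0)
    true_labels = [0, 1, 1, 0, 1]
    noisy = [randomized_response(y, 2, 1.0, rng) for y in true_labels]
    print(noisy)
```

A smaller ε flips more labels (stronger privacy, noisier training data). A larger ε keeps almost all labels unchanged.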