skip to main content


Title: Differentially Private Simple Linear Regression
Abstract Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study regression algorithms that satisfy differential privacy , a constraint which guarantees that an algorithm’s output reveals little about any individual input data record, even to an attacker with side information about the dataset. Motivated by the Opportunity Atlas , a high-profile, small-area analysis tool in economics research, we perform a thorough experimental evaluation of differentially private algorithms for simple linear regression on small datasets with tens to hundreds of records—a particularly challenging regime for differential privacy. In contrast, prior work on differentially private linear regression focused on multivariate linear regression on large datasets or asymptotic analysis. Through a range of experiments, we identify key factors that affect the relative performance of the algorithms. We find that algorithms based on robust estimators—in particular, the median-based estimator of Theil and Sen—perform best on small datasets (e.g., hundreds of datapoints), while algorithms based on Ordinary Least Squares or Gradient Descent perform better for large datasets. However, we also discuss regimes in which this general finding does not hold. Notably, the differentially private analogues of Theil–Sen (one of which was suggested in a theoretical work of Dwork and Lei) have not been studied in any prior experimental work on differentially private linear regression.  more » « less
Award ID(s):
1763786 2120667
NSF-PAR ID:
10336471
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Proceedings on Privacy Enhancing Technologies
Volume:
2022
Issue:
2
ISSN:
2299-0984
Page Range / eLocation ID:
184 to 204
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A large amount of data is often needed to train machine learning algorithms with confidence. One way to achieve the necessary data volume is to share and combine data from multiple parties. On the other hand, how to protect sensitive personal information during data sharing is always a challenge. We focus on data sharing when parties have overlapping attributes but non-overlapping individuals. One approach to achieve privacy protection is through sharing differentially private synthetic data. Each party generates synthetic data at its own preferred privacy budget, which is then released and horizontally merged across the parties. The total privacy cost for this approach is capped at the maximum individual budget employed by a party. We derive the mean squared error bounds for the parameter estimation in common regression analysis based on the merged sanitized data across parties. We identify through theoretical analysis the conditions under which the utility of sharing and merging sanitized data outweighs the perturbation introduced for satisfying differential privacy and surpasses that based on individual party data. The experiments suggest that sanitized HOMM data obtained at a practically reasonable small privacy cost can lead to smaller prediction and estimation errors than individual parties, demonstrating the benefits of data sharing while protecting privacy. 
    more » « less
  2. Data sets and statistics about groups of individuals are increasingly collected and released, feeding many optimization and learning algorithms. In many cases, the released data contain sensitive information whose privacy is strictly regulated. For example, in the U.S., the census data is regulated under Title 13, which requires that no individual be identified from any data released by the Census Bureau. In Europe, data release is regulated according to the General Data Protection Regulation, which addresses the control and transfer of personal data. Differential privacy has emerged as the de-facto standard to protect data privacy. In a nutshell, differentially private algorithms protect an individual’s data by injecting random noise into the output of a computation that involves such data. While this process ensures privacy, it also impacts the quality of data analysis, and, when private data sets are used as inputs to complex machine learning or optimization tasks, they may produce results that are fundamentally different from those obtained on the original data and even rise unintended bias and fairness concerns. In this talk, I will first focus on the challenge of releasing privacy-preserving data sets for complex data analysis tasks. I will introduce the notion of Constrained-based Differential Privacy (C-DP), which allows casting the data release problem to an optimization problem whose goal is to preserve the salient features of the original data. I will review several applications of C-DP in the context of very large hierarchical census data, data streams, energy systems, and in the design of federated data-sharing protocols. Next, I will discuss how errors induced by differential privacy algorithms may propagate within a decision problem causing biases and fairness issues. This is particularly important as privacy-preserving data is often used for critical decision processes, including the allocation of funds and benefits to states and jurisdictions, which ideally should be fair and unbiased. Finally, I will conclude with a roadmap to future work and some open questions. 
    more » « less
  3. null (Ed.)
    Mobile devices have been an integral part of our everyday lives. Users' increasing interaction with mobile devices brings in significant concerns on various types of potential privacy leakage, among which location privacy draws the most attention. Specifically, mobile users' trajectories constructed by location data may be captured by adversaries to infer sensitive information. In previous studies, differential privacy has been utilized to protect published trajectory data with rigorous privacy guarantee. Strong protection provided by differential privacy distorts the original locations or trajectories using stochastic noise to avoid privacy leakage. In this paper, we propose a novel location inference attack framework, iTracker, which simultaneously recovers multiple trajectories from differentially private trajectory data using the structured sparsity model. Compared with the traditional recovery methods based on single trajectory prediction, iTracker, which takes advantage of the correlation among trajectories discovered by the structured sparsity model, is more effective in recovering multiple private trajectories simultaneously. iTracker successfully attacks the existing privacy protection mechanisms based on differential privacy. We theoretically demonstrate the near-linear runtime of iTracker, and the experimental results using two real-world datasets show that iTracker outperforms existing recovery algorithms in recovering multiple trajectories. 
    more » « less
  4. Bubeck, S ; Perchet, V ; Rigollet, P (Ed.)
    Ensuring differential privacy of models learned from sensitive user data is an important goal that has been studied extensively in recent years. It is now known that for some basic learning problems, especially those involving high-dimensional data, producing an accurate private model requires much more data than learning without privacy. At the same time, in many applications it is not necessary to expose the model itself. Instead users may be allowed to query the prediction model on their inputs only through an appropriate interface. Here we formulate the problem of ensuring privacy of individual predictions and investigate the overheads required to achieve it in several standard models of classification and regression. We first describe a simple baseline approach based on training several models on disjoint subsets of data and using standard private aggregation techniques to predict. We show that this approach has nearly optimal sample complexity for (realizable) PAC learning of any class of Boolean functions. At the same time, without strong assumptions on the data distribution, the aggregation step introduces a substantial overhead. We demonstrate that this overhead can be avoided for the well-studied class of thresholds on a line and for a number of standard settings of convex regression. The analysis of our algorithm for learning thresholds relies crucially on strong generalization guarantees that we establish for all differentially private prediction algorithms. 
    more » « less
  5. The emergence of mobile apps (e.g., location-based services, geo-social networks, ride-sharing) led to the collection of vast amounts of trajectory data that greatly benefit the understanding of individual mobility. One problem of particular interest is next-location prediction, which facilitates location-based advertising, point-of-interest recommendation, traffic optimization,etc. However, using individual trajectories to build prediction models introduces serious privacy concerns, since exact whereabouts of users can disclose sensitive information such as their health status or lifestyle choices. Several research efforts focused on privacy-preserving next-location prediction, but they have serious limitations: some use outdated privacy models (e.g., k-anonymity), while others employ learning models with limited expressivity (e.g., matrix factorization). More recent approaches(e.g., DP-SGD) integrate the powerful differential privacy model with neural networks, but they provide only generic and difficult-to-tune methods that do not perform well on location data, which is inherently skewed and sparse.We propose a technique that builds upon DP-SGD, but adapts it for the requirements of next-location prediction. We focus on user-level privacy, a strong privacy guarantee that protects users regardless of how much data they contribute. Central to our approach is the use of the skip-gram model, and its negative sampling technique. Our work is the first to propose differentially-private learning with skip-grams. In addition, we devise data grouping techniques within the skip-gram framework that pool together trajectories from multiple users in order to accelerate learning and improve model accuracy. Experiments conducted on real datasets demonstrate that our approach significantly boosts prediction accuracy compared to existing DP-SGD techniques. 
    more » « less