Title: Differentially Private Simple Linear Regression
Abstract: Economics and social science research often requires analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study regression algorithms that satisfy differential privacy, a constraint which guarantees that an algorithm's output reveals little about any individual input data record, even to an attacker with side information about the dataset. Motivated by the Opportunity Atlas, a high-profile, small-area analysis tool in economics research, we perform a thorough experimental evaluation of differentially private algorithms for simple linear regression on small datasets with tens to hundreds of records, a particularly challenging regime for differential privacy. In contrast, prior work on differentially private linear regression focused on multivariate linear regression on large datasets or asymptotic analysis. Through a range of experiments, we identify key factors that affect the relative performance of the algorithms. We find that algorithms based on robust estimators (in particular, the median-based estimator of Theil and Sen) perform best on small datasets (e.g., hundreds of datapoints), while algorithms based on Ordinary Least Squares or Gradient Descent perform better for large datasets. However, we also discuss regimes in which this general finding does not hold. Notably, the differentially private analogues of Theil–Sen (one of which was suggested in a theoretical work of Dwork and Lei) have not been studied in any prior experimental work on differentially private linear regression.
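
The robust estimator at the center of the abstract is the Theil–Sen estimator, which fits a line by taking the median of all pairwise slopes. Below is a minimal Python sketch of the non-private Theil–Sen fit, for intuition only: the differentially private variants studied in the paper additionally replace the medians with privacy-preserving median mechanisms, which are not shown here, and all names in the sketch are illustrative rather than taken from the paper's code.

    # Minimal sketch of the (non-private) Theil-Sen estimator that the paper's
    # differentially private variants build on. Names are illustrative only.
    import numpy as np

    def theil_sen_fit(x, y):
        """Slope = median of pairwise slopes; intercept = median of residual offsets."""
        x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
        slopes = []
        n = len(x)
        for i in range(n):
            for j in range(i + 1, n):
                if x[j] != x[i]:
                    slopes.append((y[j] - y[i]) / (x[j] - x[i]))
        slope = np.median(slopes)
        intercept = np.median(y - slope * x)  # a common robust intercept choice
        return slope, intercept

    # Example: a tiny synthetic dataset with one gross outlier.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, 50)
    y = 2.0 * x + 0.5 + rng.normal(0, 0.1, 50)
    y[0] += 10  # this outlier would distort an OLS fit far more than Theil-Sen
    print(theil_sen_fit(x, y))

Because the output depends on medians rather than means, a single outlying datapoint moves the fit far less than it would move an Ordinary Least Squares fit, which is also why robust statistics pair naturally with differential privacy.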
Award ID(s):
1763786, 2120667
PAR ID:
10336471
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings on Privacy Enhancing Technologies
Volume:
2022
Issue:
2
ISSN:
2299-0984
Page Range / eLocation ID:
184 to 204
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. We perform a rigorous study of private matrix analysis when only the last 𝑊 updates to matrices are considered useful for analysis. We show that the existing framework in the non-private setting is not robust to the noise required for privacy. We then propose a framework that is robust to noise and use it to give the first efficient 𝑜(𝑊)-space differentially private algorithms for spectral approximation, principal component analysis (PCA), multi-response linear regression, sparse PCA, and non-negative PCA. Prior to our work, no such result was known for sparse and non-negative differentially private PCA, even in the static data setting. We also give a lower bound to demonstrate the cost of privacy in the sliding window model.
  2. Differential privacy has emerged as a leading theoretical framework for privacy-preserving data gathering and analysis. It allows meaningful statistics to be collected for a population without revealing "too much" information about any individual member of the population. For software profiling, this machinery allows profiling data from many users of a deployed software system to be collected and analyzed in a privacy-preserving manner. Such a solution is appealing to many stakeholders, including software users, software developers, infrastructure providers, and government agencies. We propose an approach for differentially private collection of frequency vectors from software executions. Frequency information is reported with the addition of random noise drawn from the Laplace distribution. A key observation behind the design of our scheme is that event frequencies are closely correlated due to the static code structure. Differential privacy protections must account for such relationships; otherwise, a seemingly strong privacy guarantee is actually weaker than it appears. Motivated by this observation, we propose a novel and general differentially private profiling scheme for settings in which correlations between frequencies can be expressed through linear inequalities. Using a linear programming formulation, we show how to determine the magnitude of random noise that should be added to achieve meaningful privacy protections under such linear constraints. Next, we develop an efficient instance of this general machinery for an important subclass of constraints. Instead of linear programming, our solution uses a reachability analysis of a constraint graph. As an exemplar, we employ this approach to implement differentially private method frequency profiling for Android apps. Any differentially private scheme has to balance two competing aspects: privacy and accuracy. Through an experimental study to characterize these trade-offs, we (1) show that our proposed randomization achieves much higher accuracy compared to related prior work, (2) demonstrate that high accuracy and high privacy protection can be achieved simultaneously, and (3) highlight the importance of linear constraints in the design of the randomization. These promising results provide evidence that our approach is a good candidate for privacy-preserving frequency profiling of deployed software.
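
As a point of reference for item 2, the baseline mechanism it builds on is per-entry Laplace noise added to a frequency vector. The sketch below shows only that baseline, assuming the noise scale (sensitivity / epsilon) is handed in as a parameter; the paper's actual contribution, calibrating that scale to linear constraints among correlated frequencies via linear programming or a constraint-graph reachability analysis, is not reproduced here, and the function names are hypothetical.

    # Minimal sketch of Laplace-noised frequency reporting, the baseline mechanism
    # item 2 starts from. The constraint-aware noise calibration described in the
    # abstract is not reproduced; `sensitivity` is simply taken as a given parameter.
    import numpy as np

    def noisy_frequencies(freqs, epsilon, sensitivity):
        """Add Laplace(sensitivity / epsilon) noise to each event count."""
        freqs = np.asarray(freqs, dtype=float)
        scale = sensitivity / epsilon
        return freqs + np.random.laplace(loc=0.0, scale=scale, size=freqs.shape)

    # Example: one execution's method-call counts, reported with epsilon = 1.
    counts = [120, 120, 37, 0, 5]   # correlated: the first two events always co-occur
    print(noisy_frequencies(counts, epsilon=1.0, sensitivity=2.0))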
  3. We extend the framework of augmented distribution testing (Aliakbarpour, Indyk, Rubinfeld, and Silwal, NeurIPS 2024) to the differentially private setting. This captures scenarios where a data analyst must perform hypothesis testing tasks on sensitive data, but is able to leverage prior knowledge (public, but possibly erroneous or untrusted) about the data distribution. We design private algorithms in this augmented setting for three flagship distribution testing tasks (uniformity, identity, and closeness testing), whose sample complexity smoothly scales with the claimed quality of the auxiliary information. We complement our algorithms with information-theoretic lower bounds, showing that their sample complexity is optimal (up to logarithmic factors). Keywords: distribution testing, identity testing, closeness testing, differential privacy, learning-augmented algorithms
  4. A large amount of data is often needed to train machine learning algorithms with confidence. One way to achieve the necessary data volume is to share and combine data from multiple parties. On the other hand, protecting sensitive personal information during data sharing is always a challenge. We focus on data sharing when parties have overlapping attributes but non-overlapping individuals. One approach to achieving privacy protection is to share differentially private synthetic data. Each party generates synthetic data at its own preferred privacy budget, which is then released and horizontally merged across the parties. The total privacy cost for this approach is capped at the maximum individual budget employed by a party. We derive mean squared error bounds for parameter estimation in common regression analyses based on the merged sanitized data across parties. Through theoretical analysis, we identify the conditions under which the utility of sharing and merging sanitized data outweighs the perturbation introduced to satisfy differential privacy and surpasses the utility obtainable from an individual party's data alone. The experiments suggest that sanitized HOMM data obtained at a practically reasonable, small privacy cost can lead to smaller prediction and estimation errors than individual parties' data, demonstrating the benefits of data sharing while protecting privacy.
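
The merge step described in item 4 is simple to state in code: each party releases differentially private synthetic data generated at its own budget, the releases are concatenated row-wise, and because the parties hold disjoint individuals the overall privacy cost is the maximum of the per-party budgets (parallel composition). The sketch below illustrates only that bookkeeping under these assumptions; the per-party synthesizers, the error bounds, and the HOMM construction from the paper are not modeled, and all names are hypothetical.

    # Sketch of the horizontal merge step from item 4. Each party's DP synthesizer
    # is treated as a black box; only the concatenation and the privacy-cost
    # bookkeeping (max of per-party budgets, by parallel composition over
    # disjoint individuals) are shown.
    import pandas as pd

    def merge_party_releases(releases):
        """releases: list of (synthetic DataFrame, epsilon) pairs, one per party."""
        merged = pd.concat([df for df, _ in releases], ignore_index=True)
        total_epsilon = max(eps for _, eps in releases)
        return merged, total_epsilon

    # Example with two hypothetical party releases over the same columns.
    party_a = pd.DataFrame({"x": [1.0, 2.0], "y": [2.1, 3.9]})
    party_b = pd.DataFrame({"x": [3.0, 4.0], "y": [6.2, 8.1]})
    merged, eps = merge_party_releases([(party_a, 1.0), (party_b, 0.5)])
    print(merged, "total privacy cost:", eps)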
  5. Local differential privacy is a widely studied restriction on distributed algorithms that collect aggregates about sensitive user data, and is now deployed in several large systems. We initiate a systematic study of a fundamental limitation of locally differentially private protocols: they are highly vulnerable to adversarial manipulation. While any algorithm can be manipulated by adversaries who lie about their inputs, we show that any noninteractive locally differentially private protocol can be manipulated to a much greater extent: when the privacy level is high, or the domain size is large, a small fraction of users in the protocol can completely obscure the distribution of the honest users' input. We also construct protocols that are optimally robust to manipulation for a variety of common tasks in local differential privacy. Finally, we give simple experiments validating our theoretical results and demonstrating that protocols that are optimal without manipulation can have dramatically different levels of robustness to manipulation. Our results suggest caution when deploying local differential privacy and reinforce the importance of efficient cryptographic techniques for the distributed emulation of centrally differentially private mechanisms.
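
For readers less familiar with the setting of item 5, the sketch below shows k-ary randomized response, a standard noninteractive locally differentially private frequency-estimation protocol; it is not one of the optimally robust protocols constructed in the paper, and all names are illustrative. The manipulation vulnerability discussed in the item arises because each report is produced on the user's own device, so a dishonest user can submit any value and the aggregator cannot distinguish it from an honest randomized report.

    # Minimal sketch of k-ary randomized response, a standard noninteractive
    # local-DP frequency-estimation protocol (not the paper's robust protocols).
    # A manipulating user simply skips randomize() and reports whatever it wants.
    import math
    import random
    from collections import Counter

    def randomize(value, k, epsilon):
        """Report the true value with prob e^eps/(e^eps + k - 1), else a uniform other value."""
        p_true = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
        if random.random() < p_true:
            return value
        return random.choice([v for v in range(k) if v != value])

    def estimate_frequencies(reports, k, epsilon):
        """Unbiased frequency estimates from the randomized reports."""
        n = len(reports)
        p = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
        q = 1.0 / (math.exp(epsilon) + k - 1)
        counts = Counter(reports)
        return {v: (counts.get(v, 0) / n - q) / (p - q) for v in range(k)}

    # Example: 10,000 honest users, domain of size 4, epsilon = 1.
    k, eps = 4, 1.0
    true_values = [random.choice([0, 0, 1, 2, 3]) for _ in range(10_000)]
    reports = [randomize(v, k, eps) for v in true_values]
    print(estimate_frequencies(reports, k, eps))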