skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on December 1, 2025

Title: Public-data Assisted Private Stochastic Optimization: Power and Limitations
We study the limits and capability of public-data assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show lower bounds on the excess risk for any PA-DP algorithm in terms of the dimension d, the number of public samples, npub and the number of private samples, npriv. These lower bounds are established via our new lower bounds for PA-DP mean estimation. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data, is optimal. We also study PA-DP supervised learning with unlabeled public samples. In contrast to our previous result, we here show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLM) with unlabeled public data, we show an efficient algorithm which achieves a dimension independent rate. We develop new lower bounds for this setting which shows that this rate cannot be improved with more public samples, and any fewer public samples leads to a worse rate. Finally, we provide extensions of this result to general hypothesis classes with finite fat-shattering dimension with applications to neural networks and non-Euclidean geometries.  more » « less
Award ID(s):
1943251
PAR ID:
10572973
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
38th Conference on Neural Information Processing Systems (NeurIPS 2024)
Date Published:
ISSN:
1049-5258
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We consider learning problems where the training set consists of two types of examples: private and public. The goal is to design a learning algorithm that satisfies differential privacy only with respect to the private examples. This setting interpolates between private learning (where private) and classical learning (where all examples are public). We study the limits of learning in this setting in terms of private and public sample complexities. We show that any hypothesis class of VC-dimension d can be agnostically learned up to an excess error of α using only (roughly) d/α public examples and d/α2 private labeled examples. This result holds even when the public examples are unlabeled. This gives a quadratic improvement over the standard d/α2 upper bound on the public sample complexity (where private examples can be ignored altogether if the public examples are labeled). Furthermore, we give a nearly matching lower bound, which we prove via a generic reduction from this setting to the one of private learning without public data. 
    more » « less
  2. In supervised learning, we leverage a labeled dataset to design methods for function estimation. In many practical situations, we are able to obtain alternative feedback, possibly at a low cost. A broad goal is to understand the usefulness of, and to design algorithms to exploit, this alternative feedback. We focus on a semi-supervised setting where we obtain additional ordinal (or comparison) information for potentially unlabeled samples. We consider ordinal feedback of varying qualities where we have either a perfect ordering of the samples, a noisy ordering of the samples or noisy pairwise comparisons between the samples. We provide a precise quantification of the usefulness of these types of ordinal feedback in non-parametric regression, showing that in many cases it is possible to accurately estimate an underlying function with a very small labeled set, effectively escaping the curse of dimensionality. We develop an algorithm called Ranking-Regression (RR) and analyze its accuracy as a function of size of the labeled and unlabeled datasets and various noise parameters. We also present lower bounds, that establish fundamental limits for the task and show that RR is optimal in a variety of settings. Finally, we present experiments that show the efficacy of RR and investigate its robustness to various sources of noise and model-misspecification. 
    more » « less
  3. Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution D, unlabeled samples from test distribution D′, and the goal is to output a classifier with low error on D′ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on D′. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a 2(k/ϵ)O(1)poly(d)-time algorithm for TDS learning intersections of k homogeneous halfspaces to accuracy ϵ (prior work achieved d(k/ϵ)O(1)). We work under the mild assumption that the Gaussian training distribution contains at least an ϵ fraction of both positive and negative examples (ϵ-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the ϵ-balanced assumption is necessary for poly(d,1/ϵ)-time TDS learning for a single halfspace and (2) a dΩ~(log1/ϵ) lower bound for the intersection of two general halfspaces, even with the ϵ-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature. 
    more » « less
  4. Recent work of Klivans, Stavropoulos, and Vasilyan initiated the study of testable learning with distribution shift (TDS learning), where a learner is given labeled samples from training distribution , unlabeled samples from test distribution ′, and the goal is to output a classifier with low error on ′ whenever the training samples pass a corresponding test. Their model deviates from all prior work in that no assumptions are made on ′. Instead, the test must accept (with high probability) when the marginals of the training and test distributions are equal. Here we focus on the fundamental case of intersections of halfspaces with respect to Gaussian training distributions and prove a variety of new upper bounds including a 2(k/ϵ)O(1)𝗉𝗈𝗅𝗒(d)-time algorithm for TDS learning intersections of k homogeneous halfspaces to accuracy ϵ (prior work achieved d(k/ϵ)O(1)). We work under the mild assumption that the Gaussian training distribution contains at least an ϵ fraction of both positive and negative examples (ϵ-balanced). We also prove the first set of SQ lower-bounds for any TDS learning problem and show (1) the ϵ-balanced assumption is necessary for 𝗉𝗈𝗅𝗒(d,1/ϵ)-time TDS learning for a single halfspace and (2) a dΩ̃ (log1/ϵ) lower bound for the intersection of two general halfspaces, even with the ϵ-balanced assumption. Our techniques significantly expand the toolkit for TDS learning. We use dimension reduction and coverings to give efficient algorithms for computing a localized version of discrepancy distance, a key metric from the domain adaptation literature. 
    more » « less
  5. null (Ed.)
    Knowing whether a published research result can be replicated is important. Carrying out direct replication of published research incurs a high cost. There are efforts tried to use machine learning aided methods to predict scientific claims’ replicability. However, existing machine learning aided approaches use only hand-extracted statistics features such as p-value, sample size, etc. without utilizing research papers’ text information and train only on a very small size of annotated data without making the most use of a large number of unlabeled articles. Therefore, it is desirable to develop effective machine learning aided automatic methods which can automatically extract text information as features so that we can benefit from Natural Language Processing techniques. Besides, we aim for an approach that benefits from both labeled and the large number of unlabeled data. In this paper, we propose two weakly supervised learning approaches that use automatically extracted text information of research papers to improve the prediction accuracy of research replication using both labeled and unlabeled datasets. Our experiments over real-world datasets show that our approaches obtain much better prediction performance compared to the supervised models utilizing only statistic features and a small size of labeled dataset. Further, we are able to achieve an accuracy of 75.76% for predicting the replicability of research. 
    more » « less