- Text data has become extremely valuable due to the emergence of machine learning algorithms that learn from it. A lot of high-quality text data generated in the real world is private and therefore cannot be shared or used freely due to privacy concerns. Generating synthetic replicas of private text data with a formal privacy guarantee, i.e., differential privacy (DP), offers a promising and scalable solution. However, existing methods necessitate DP finetuning of large language models (LLMs) on private data to generate DP synthetic data. This approach is not viable for proprietary LLMs (e.g., GPT-3.5) and also demands considerable computational resources for open-source LLMs. Lin et al. (2024) recently introduced the Private Evolution (PE) algorithm to generate DP synthetic images with only API access to diffusion models. In this work, we propose an augmented PE algorithm, named AUGPE, that applies to the complex setting of text. We use API access to an LLM and generate DP synthetic text without any model training. We conduct comprehensive experiments on three benchmark datasets. Our results demonstrate that AUGPE produces DP synthetic text with utility competitive with state-of-the-art (SOTA) DP finetuning baselines. This underscores the feasibility of relying solely on API access to LLMs to produce high-quality DP synthetic texts, thereby facilitating more accessible routes to privacy-preserving LLM applications. Our code and data are available at https://github.com/AI-secure/aug-pe.
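The abstract above describes a training-free, API-only generation loop. As a loose, hypothetical sketch only (the helpers llm_random_api, llm_variation_api, and embed are stand-ins introduced here, not the AUG-PE API; the actual procedure and privacy accounting are in the linked repository), a Private-Evolution-style iteration might look like this:

```python
import numpy as np

def private_evolution_text(private_texts, embed, llm_random_api, llm_variation_api,
                           n_synthetic=100, iterations=5, noise_scale=1.0):
    """Sketch of a PE-style loop for text with hypothetical API helpers.

    embed: maps a list of texts to an array of embedding vectors.
    llm_random_api: returns n_synthetic unconditional samples from the LLM.
    llm_variation_api: returns rephrased variants of the given texts.
    """
    candidates = llm_random_api(n_synthetic)
    priv_emb = embed(private_texts)
    for _ in range(iterations):
        cand_emb = embed(candidates)
        # Each private text votes for its nearest synthetic candidate in embedding space.
        dists = np.linalg.norm(priv_emb[:, None, :] - cand_emb[None, :, :], axis=-1)
        votes = np.bincount(dists.argmin(axis=1), minlength=len(candidates)).astype(float)
        # Gaussian noise on the histogram is what provides the DP guarantee;
        # noise_scale must be calibrated to the target (eps, delta) budget.
        votes += np.random.normal(0.0, noise_scale, size=votes.shape)
        # Resample candidates in proportion to (clipped) noisy votes, then ask the
        # LLM for variations so the population drifts toward the private distribution.
        probs = np.clip(votes, 0.0, None)
        probs = probs / probs.sum() if probs.sum() > 0 else np.full(len(candidates), 1.0 / len(candidates))
        survivors = list(np.random.choice(candidates, size=n_synthetic, p=probs))
        candidates = llm_variation_api(survivors)
    return candidates
```

In a loop of this shape, the private data influences the synthetic population only through the noisy nearest-neighbor histogram, which is why no model training on the private text is needed.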
- In this paper, we prove a two-sided variant of the Kirszbraun theorem. Consider an arbitrary subset X of Euclidean space and its superset Y. Let f be a 1-Lipschitz map from X to ℝ^m. The Kirszbraun theorem states that the map f can be extended to a 1-Lipschitz map f̃ from Y to ℝ^m. While the extension f̃ does not increase distances between points, there is no guarantee that it does not decrease distances significantly. In fact, f̃ may even map distinct points to the same point (that is, it can infinitely decrease some distances). However, we prove that there exists a (1 + ε)-Lipschitz outer extension f̃: Y → ℝ^{m'} that does not decrease distances more than "necessary". Namely, ‖f̃(x) - f̃(y)‖ ≥ c√ε · min(‖x-y‖, inf_{a,b ∈ X} (‖x - a‖ + ‖f(a) - f(b)‖ + ‖b - y‖)) for some absolute constant c > 0. This bound is asymptotically optimal, since no L-Lipschitz extension g can have ‖g(x) - g(y)‖ > L · min(‖x-y‖, inf_{a,b ∈ X} (‖x - a‖ + ‖f(a) - f(b)‖ + ‖b - y‖)) even for a single pair of points x and y. In some applications, one is interested in the distances ‖f̃(x) - f̃(y)‖ between images of points x, y ∈ Y rather than in the map f̃ itself. The standard Kirszbraun theorem does not provide any method of computing these distances without computing the entire map f̃ first. In contrast, our theorem provides a simple approximate formula for the distances ‖f̃(x) - f̃(y)‖.
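Because the result is stated through an explicit distance formula, a small numerical sketch may help. The following minimal illustration assumes X is given as a finite point set with known images fX; it evaluates only the min(...) proxy that the two-sided bound sandwiches, not the constant c or the extension f̃ itself:

```python
import numpy as np

def distance_proxy(x, y, X, fX):
    """Approximate-distance formula from the two-sided Kirszbraun bound (finite-X sketch).

    X:  (k, d) array of points where f is known; fX: (k, m) array of their images f(a).
    Returns min(|x-y|, min_{a,b in X} (|x-a| + |f(a)-f(b)| + |b-y|)).
    No L-Lipschitz extension can exceed L times this value on the pair (x, y),
    while the theorem's (1+eps)-Lipschitz extension is at least c*sqrt(eps) times it.
    """
    direct = np.linalg.norm(x - y)
    xa = np.linalg.norm(X - x, axis=1)                               # |x - a| for each a
    by = np.linalg.norm(X - y, axis=1)                               # |b - y| for each b
    fab = np.linalg.norm(fX[:, None, :] - fX[None, :, :], axis=-1)   # |f(a) - f(b)| for each pair
    detour = (xa[:, None] + fab + by[None, :]).min()
    return min(direct, detour)
```

Up to the constant factors stated in the abstract, this proxy can therefore stand in for ‖f̃(x) - f̃(y)‖ without constructing the extension f̃.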
- We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix K ∈ ℝ^{n×n} corresponding to n points x_1, …, x_n ∈ ℝ^d. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to 1+ϵ relative error in time sublinear in n and linear in d for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to 1+ϵ relative error in time subquadratic in n and linear in d. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.
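To make the estimated quantity concrete, here is a naive Monte Carlo illustration of the kernel-sum target for the Gaussian kernel. This is emphatically not the paper's sublinear algorithm (which relies on kernel density estimation and the matrix's positive definiteness); it only shows what is being approximated:

```python
import numpy as np

def kernel_sum_estimate(X, n_samples=10_000, bandwidth=1.0, rng=None):
    """Naive Monte Carlo estimate of sum_{i,j} k(x_i, x_j) (illustration only).

    Uses the Gaussian kernel k(x, y) = exp(-|x - y|^2 / bandwidth^2) and samples
    entry pairs (i, j) uniformly; the paper's algorithm achieves (1 + eps)
    relative error in time sublinear in n, which this sketch does not.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    i = rng.integers(0, n, size=n_samples)
    j = rng.integers(0, n, size=n_samples)
    vals = np.exp(-np.sum((X[i] - X[j]) ** 2, axis=1) / bandwidth ** 2)
    return n * n * vals.mean()   # scale the sampled mean entry up to the full n x n sum
```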
- We study a clustering problem where the goal is to maximize the coverage of the input points by k chosen centers. Specifically, given a set of n points P ⊆ ℝ^d, the goal is to pick k centers C ⊆ ℝ^d that maximize the service ∑_{p∈P} φ(𝖽(p,C)) to the points P, where 𝖽(p,C) is the distance of p to its nearest center in C, and φ: ℝ+ → ℝ+ is a non-increasing service function. This includes the problem of placing k base stations so as to maximize the total bandwidth to the clients: the closer a client is to its nearest base station, the more data it can send and receive, and the target is to place k base stations so that the total bandwidth is maximized. We provide an n^{ε^{-O(d)}}-time algorithm for this problem that achieves a (1-ε)-approximation. Notably, the runtime does not depend on the parameter k, and the algorithm works for an arbitrary non-increasing service function φ: ℝ+ → ℝ+.
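The objective itself is easy to write down. Below is a minimal evaluation of the service ∑_{p∈P} φ(𝖽(p,C)) for a candidate set of centers; the specific φ used here (1/(1+d)) is only an illustrative non-increasing choice, not one taken from the paper:

```python
import numpy as np

def coverage_service(P, C, phi=lambda d: 1.0 / (1.0 + d)):
    """Service objective sum_{p in P} phi(d(p, C)) from the coverage clustering problem.

    P: (n, d) array of client points; C: (k, d) array of chosen centers.
    phi may be any non-increasing service function R+ -> R+; the default here
    mimics bandwidth decaying with distance.
    """
    # Distance from every point to its nearest chosen center.
    dists = np.linalg.norm(P[:, None, :] - C[None, :, :], axis=-1).min(axis=1)
    return phi(dists).sum()
```

Evaluating the objective for a fixed C is straightforward, as above; the paper's contribution is the (1-ε)-approximation algorithm for choosing the k centers.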
- Empirical risk minimization (ERM) is ubiquitous in machine learning and underlies most supervised learning methods. While there is a large body of work on algorithms for various ERM problems, the exact computational complexity of ERM is still not understood. We address this issue for multiple popular ERM problems including kernel SVMs, kernel ridge regression, and training the final layer of a neural network. In particular, we give conditional hardness results for these problems based on complexity-theoretic assumptions such as the Strong Exponential Time Hypothesis. Under these assumptions, we show that there are no algorithms that solve the aforementioned ERM problems to high accuracy in sub-quadratic time. We also give similar hardness results for computing the gradient of the empirical loss, which is the main computational burden in many non-convex learning tasks.
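To give a sense of the quadratic barrier these hardness results concern, here is a naive gradient computation for one of the listed problems, kernel ridge regression, under an assumed standard objective form (‖Kα − y‖² + λ αᵀKα with a Gaussian kernel; this formulation is chosen here for illustration and is not necessarily the exact variant analyzed in the paper). Forming K already costs Θ(n²d) time:

```python
import numpy as np

def kernel_ridge_gradient(X, y, alpha, lam=1e-3, bandwidth=1.0):
    """Naive Theta(n^2 d) gradient of an assumed kernel ridge regression objective.

    Objective: L(alpha) = |K alpha - y|^2 + lam * alpha^T K alpha,
    with Gaussian kernel K_ij = exp(-|x_i - x_j|^2 / bandwidth^2).
    The hardness results above say this kind of computation cannot be done to
    high accuracy in sub-quadratic time under SETH-type assumptions.
    """
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    K = np.exp(-sq / bandwidth ** 2)                            # n x n kernel matrix
    residual = K @ alpha - y
    return 2.0 * (K @ residual) + 2.0 * lam * (K @ alpha)       # gradient w.r.t. alpha
```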