Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to nonfederal websites. Their policies may differ from this site.

We consider the question of Gaussian mean testing, a fundamental task in highdimensional distribution testing and signal processing, subject to adversarial corruptions of the samples. We focus on the relative power of different adversaries, and show that, in contrast to the common wisdom in robust statistics, there exists a strict separation between adaptive adversaries (strong contamination) and oblivious ones (weak contamination) for this task. Specifically, we resolve both the informationtheoretic and computational landscapes for robust mean testing. In the exponentialtime setting, we establish the tight sample complexity of testing N(0,I) against N(αv,I), where ∥v∥2=1, with an εfraction of adversarial corruptions, to be Θ~(max(d√α2,dε3α4,min(d2/3ε2/3α8/3,dεα2))) while the complexity against adaptive adversaries is Θ~(max(d√α2,dε2α4)) which is strictly worse for a large range of vanishing ε,α. To the best of our knowledge, ours is the first separation in sample complexity between the strong and weak contamination models. In the polynomialtime setting, we close a gap in the literature by providing a polynomialtime algorithm against adaptive adversaries achieving the above sample complexity Θ~(max(d−−√/α2,dε2/α4)), and a lowdegree lower bound (which complements an existing reduction from planted clique) suggesting that all efficient algorithms require this many samples, even in the obliviousadversary setting.more » « lessFree, publiclyaccessible full text available November 1, 2024

We study the relationship between adversarial robustness and differential privacy in highdimensional algorithmic statistics. We give the first blackbox reduction from privacy to robustness which can produce private estimators with optimal tradeoffs among sample complexity, accuracy, and privacy for a wide range of fundamental highdimensional parameter estimation problems, including mean and covariance estimation. We show that this reduction can be implemented in polynomial time in some important special cases. In particular, using nearlyoptimal polynomialtime robust estimators for the mean and covariance of highdimensional Gaussians which are based on the SumofSquares method, we design the first polynomialtime private estimators for these problems with nearlyoptimal samplesaccuracyprivacy tradeoffs. Our algorithms are also robust to a nearly optimal fraction of adversariallycorrupted samples.more » « lessFree, publiclyaccessible full text available June 20, 2024

We provide improved differentially private algorithms for identity testing of highdimensional distributions. Specifically, for ddimensional Gaussian distributions with known covariance Σ, we can test whether the distribution comes from N(μ∗,Σ) for some fixed μ∗ or from some N(μ,Σ) with total variation distance at least α from N(μ∗,Σ) with (ε,0)differential privacy, using only O~(d1/2α2+d1/3α4/3⋅ε2/3+1α⋅ε) samples if the algorithm is allowed to be computationally inefficient, and only O~(d1/2α2+d1/4α⋅ε) samples for a computationally efficient algorithm. We also provide a matching lower bound showing that our computationally inefficient algorithm has optimal sample complexity. We also extend our algorithms to various related problems, including mean testing of Gaussians with bounded but unknown covariance, uniformity testing of product distributions over {−1,1}d, and tolerant testing. Our results improve over the previous best work of Canonne et al.~\cite{CanonneKMUZ20} for both computationally efficient and inefficient algorithms, and even our computationally efficient algorithm matches the optimal \emph{nonprivate} sample complexity of O(d√α2) in many standard parameter settings. In addition, our results show that, surprisingly, private identity testing of ddimensional Gaussians can be done with fewer samples than private identity testing of discrete distributions over a domain of size d \cite{AcharyaSZ18}, which refutes a conjectured lower bound of~\cite{CanonneKMUZ20}.more » « less

The \emph{pprocessor cup game} is a classic and widely studied scheduling problem that captures the setting in which a pprocessor machine must assign tasks to processors over time in order to ensure that no individual task ever falls too far behind. The problem is formalized as a multiround game in which two players, a filler (who assigns work to tasks) and an emptier (who schedules tasks) compete. The emptier's goal is to minimize backlog, which is the maximum amount of outstanding work for any task. Recently, Kuszmaul and Westover (ITCS, 2021) proposed the \emph{variableprocessor cup game}, which considers the same problem, except that the amount of resources available to the players (i.e., the number p of processors) fluctuates between rounds of the game. They showed that this seemingly small modification fundamentally changes the dynamics of the game: whereas the optimal backlog in the fixed pprocessor game is Θ(logn), independent of p, the optimal backlog in the variableprocessor game is Θ(n). The latter result was only known to apply to games with \emph{exponentially many} rounds, however, and it has remained an open question what the optimal tradeoff between time and backlog is for shorter games. This paper establishes a tight tradeoff curve between time and backlog in the variableprocessor cup game. Importantly, we prove that for a game consisting of t rounds, the optimal backlog is Θ(n) if and only if t≥Ω(n3). Our techniques also allow for us to resolve several other open questions concerning how the variableprocessor cup game behaves in beyondworstcaseanalysis settings.more » « less

Abstract Particle production from secondary protonproton collisions, commonly referred to as pileup, impair the sensitivity of both new physics searches and precision measurements at large hadron collider (LHC) experiments. We propose a novel algorithm,
Puma , for modeling pileup with the help of deep neural networks based on sparse transformers. These attention mechanisms were developed for natural language processing but have become popular in other applications. In a realistic detector simulation, our method outperforms classical benchmark algorithms for pileup mitigation in key observables. It provides a perspective for mitigating the effects of pileup in the high luminosity era of the LHC, where up to 200 protonproton collisions are expected to occur simultaneously. 
Meyer, J. P. (Ed.)Airwater evaporation systems are ubiquitous in industrial applications, including processes such as fuel combustion, inkjet printing, spray cooling, and desalination. In these evaporationdriven systems, a fundamental understanding of mass accommodation at the liquidvapour interface is critical to predicting and optimizing performance. Interfacial mass accommodation depends on many factors, such as temperature, vapour concentration, nonvolatile impurity content, and noncondensable gasses present. Elucidating how these factors interact is essential to designing devices to meet demanding applications. Hence, high precision measurements are needed to quantify accommodation at the liquidvapour interface accurately. Our previous study has shown surface averaged accommodation coefficients close to 0.001 for pure water droplets throughout evaporation. While it is well established that saline nonvolatile impurities reduce the evaporation rate of sessile droplets, the dynamic effect on mass accommodation during the droplet's lifespan is yet to be determined. In this work, we combine experimental and computational techniques to determine the accommodation coefficient over the lifespan of 103 to 1 molar potassium chloridewater droplets evaporating on a goldcoated surface into dry nitrogen. This study uses a quartz crystal microbalance as a highprecision contact area sensor. It also determines the nonvolatile impurities in the droplet with a precision on the order of nanograms. The computational model couples macroscopic measurements with the microscopic kinetic theory of gasses to quantify hardtomeasure physical quantities. We believe this study will provide a basis for predicting evaporative device performance in conditions where nonvolatile impurities are intrinsic to the application.more » « less

We propose datadriven onepass streaming algorithms for estimating the number of triangles and four cycles, two fundamental problems in graph analytics that are widely studied in the graph data stream literature. Recently, Hsu et al. (2019a) and Jiang et al. (2020) applied machine learning techniques in other data stream problems, using a trained oracle that can predict certain properties of the stream elements to improve on prior “classical” algorithms that did not use oracles. In this paper, we explore the power of a “heavy edge” oracle in multiple graph edge streaming models. In the adjacency list model, we present a onepass triangle counting algorithm improving upon the previous space upper bounds without such an oracle. In the arbitrary order model, we present algorithms for both triangle and four cycle estimation with fewer passes and the same space complexity as in previous algorithms, and we show several of these bounds are optimal. We analyze our algorithms under several noise models, showing that the algorithms perform well even when the oracle errs. Our methodology expands upon prior work on “classical” streaming algorithms, as previous multipass and random order streaming algorithms can be seen as special cases of our algorithms, where the first pass or random order was used to implement the heavy edge oracle. Lastly, our experiments demonstrate advantages of the proposed method compared to stateoftheart streaming algorithms.more » « less

Random dimensionality reduction is a versatile tool for speeding up algorithms for highdimensional problems. We study its application to two clustering problems: the facility location problem, and the singlelinkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset 𝑋 onto a random 𝑑=𝑂(𝑑𝑋)dimensional subspace (where 𝑑𝑋 is the doubling dimension of 𝑋), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension 𝑑 having an extra loglog𝑛 term and the approximation factor being arbitrarily close to 1. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of 𝑘means and 𝑘medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of 𝑋.more » « less