-
Metric embeddings traditionally study how to map n items into a target metric space so that distances are not heavily distorted. But what if we only care about preserving the relative order of the distances, rather than their lengths? In this paper, we explore the fundamental question: given triplet comparisons of the form “item i is closer to item j than to item k,” can we find low-dimensional Euclidean representations of the n items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications (recommendations, ranking, crowdsourcing, psychometrics, and nearest-neighbor search) and have been studied since the 1950s under the names ordinal and non-metric embedding. Our main results include:

Nearly-Tight Bounds on Triplet Dimension: We introduce the triplet dimension of a dataset and show, surprisingly, that for an ordinal embedding to preserve every triplet, its dimension must grow as Ω(n) in the worst case. This is nearly optimal, since n − 1 dimensions always suffice.

Tradeoffs for Dimension vs (Ordinal) Relaxation: We relax the requirement that every triplet be exactly preserved and prove almost tight lower bounds on the maximum ratio between distances whose relative order the embedding inverts. This ratio is known in the literature as (ordinal) relaxation and serves as the counterpart of (metric) distortion.

New Bounds on Terminal and Top-k-NNs Embeddings: Moving beyond triplets, we study two well-motivated scenarios in which we care about preserving the order of specific sets of distances (not necessarily triplets). The first is Terminal Ordinal Embeddings, where we aim to preserve relative distance orders to k given items (the “terminals”); here we give matching upper and lower bounds. The second is Top-k-NNs Ordinal Embeddings, where for each item we aim to preserve the relative order of its k nearest neighbors; here we give lower bounds.

To the best of our knowledge, these are among the first tradeoffs for triplet-preserving ordinal embeddings and the first study of Terminal and Top-k-NNs Ordinal Embeddings.
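As a concrete illustration, the sketch below checks which triplets a Euclidean embedding preserves and reports the largest inversion ratio among the violated triplets, a simple proxy for the ordinal relaxation discussed above. The triplet format and the function name are illustrative assumptions, not the paper's interface.

```python
import numpy as np

def triplet_violations(X, triplets):
    """Check which triplets a Euclidean embedding preserves.

    X: (n, d) array, one row of coordinates per item.
    triplets: iterable of (i, j, k), each meaning "item i is closer
        to item j than to item k".
    Returns the violated triplets and the worst ratio d(i,j)/d(i,k)
    among them (an illustrative proxy for ordinal relaxation).
    """
    violated, worst_ratio = [], 1.0
    for i, j, k in triplets:
        d_ij = np.linalg.norm(X[i] - X[j])
        d_ik = np.linalg.norm(X[i] - X[k])
        if d_ij >= d_ik:  # the embedding inverted (or tied) this comparison
            violated.append((i, j, k))
            if d_ik > 0:
                worst_ratio = max(worst_ratio, d_ij / d_ik)
    return violated, worst_ratio
```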
-
Estimating the frequencies of elements appearing in a data stream is a key task in large-scale data analysis. Popular sketching approaches to this problem (e.g., CountMin and CountSketch) come with worst-case guarantees that probabilistically bound the error of the estimated frequencies for any possible input. The work of Hsu et al. (2019) introduced the idea of using machine learning to tailor sketching algorithms to the specific data distribution they are being run on. In particular, their learning-augmented frequency estimation algorithm uses a learned heavy-hitter oracle that predicts which elements will appear many times in the stream. We give a novel algorithm which, in some parameter regimes, already theoretically outperforms the learning-based algorithm of Hsu et al. without using any predictions. Augmenting our algorithm with heavy-hitter predictions further reduces the error and improves upon the state of the art. Empirically, our algorithms achieve superior performance in all experiments compared to prior approaches.
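A minimal sketch of the learned-sketch idea described above (not the paper's new algorithm): predicted heavy hitters get exact dedicated counters, while the remaining items are hashed into a standard CountMin table. The is_heavy oracle interface is a hypothetical stand-in for the learned predictor.

```python
import random
from collections import defaultdict

class LearnedCountMin:
    """CountMin augmented with a heavy-hitter oracle, in the spirit of
    Hsu et al. (2019). The is_heavy oracle is a hypothetical stand-in
    for a learned predictor."""

    def __init__(self, width, depth, is_heavy):
        self.width, self.depth = width, depth
        self.is_heavy = is_heavy                  # item -> bool (learned oracle)
        self.exact = defaultdict(int)             # exact counters for predicted heavy hitters
        self.table = [[0] * width for _ in range(depth)]
        self.seeds = [random.randrange(2**31) for _ in range(depth)]

    def _bucket(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def update(self, item, count=1):
        if self.is_heavy(item):
            self.exact[item] += count             # heavy hitters bypass the sketch
        else:
            for r in range(self.depth):
                self.table[r][self._bucket(r, item)] += count

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact.get(item, 0)        # exact count, zero if never seen
        return min(self.table[r][self._bucket(r, item)] for r in range(self.depth))
```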
-
An ε-approximate quantile sketch over a stream of n inputs approximates the rank of any query point q (that is, the number of input points less than q) up to an additive error of εn, generally with probability at least 1 − 1/poly(n), while consuming o(n) space. While the celebrated KLL sketch of Karnin, Lang, and Liberty is a provably optimal quantile approximation algorithm over worst-case streams, the approximations it achieves in practice are often far from optimal. Indeed, the most commonly used technique in practice is Dunning's t-digest, which often achieves much better approximations than KLL on real-world data but is known to have arbitrarily large errors in the worst case. We apply interpolation techniques to the streaming quantiles problem to attempt to achieve better approximations on real-world data sets than KLL while maintaining similar guarantees in the worst case.
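The following sketch shows the kind of interpolation involved, under an assumed interface: given a compressed, sorted summary of (value, cumulative count) pairs, it estimates rank(q) by linear interpolation between neighboring stored points instead of the step function a plain sample would give. This illustrates the general technique, not the paper's actual data structure.

```python
import bisect

def interpolated_rank(summary, q):
    """Estimate rank(q) from a compressed stream summary.

    summary: list of (value, cumulative_count) pairs, sorted by value;
        an assumed interface standing in for a real sketch's output.
    Linearly interpolates between neighboring stored points rather
    than returning a step-function rank.
    """
    values = [v for v, _ in summary]
    i = bisect.bisect_right(values, q)
    if i == 0:                        # q is below every stored value
        return 0.0
    if i == len(summary):             # q is above every stored value
        return float(summary[-1][1])
    (v0, r0), (v1, r1) = summary[i - 1], summary[i]
    if v1 == v0:
        return float(r1)
    return r0 + (r1 - r0) * (q - v0) / (v1 - v0)  # linear interpolation
```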
-
We propose a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on modeling assumptions or on past data) the degrees of nodes in the graph. Within this model, we study the classic problem of online bipartite matching and a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes. For the bipartite version of a stochastic graph model due to Chung, Lu, and Vu, where the expected values of the offline degrees are known and used as predictions, we show that MinPredictedDegree stochastically dominates any other online algorithm, i.e., it is optimal for graphs drawn from this model. Since the “symmetric” version of the model, where all online nodes are identical, is a special case of the well-studied “known i.i.d. model,” it follows that the competitive ratio of MinPredictedDegree on such inputs is at least 0.7299. For the special case of graphs with power-law degree distributions, we show that MinPredictedDegree frequently produces matchings almost as large as the true maximum matching on such graphs. We complement these results with an extensive empirical evaluation showing that MinPredictedDegree compares favorably to state-of-the-art online algorithms for online matching.
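A minimal reading of MinPredictedDegree as a greedy rule, assuming arrivals are given as (online node, neighbor list) pairs: each arriving online node is matched to its currently unmatched offline neighbor with the smallest predicted degree. The tie-breaking rule and the exact interfaces here are assumptions for illustration.

```python
def min_predicted_degree(arrivals, predicted_degree):
    """Greedy matching guided by degree predictions.

    arrivals: iterable of (online_node, offline_neighbors) pairs in
        arrival order (an assumed input format).
    predicted_degree: dict mapping each offline node to its predicted degree.
    Each online node is matched to its unmatched offline neighbor with
    the smallest predicted degree, if one exists.
    """
    matched = set()      # offline nodes already used
    matching = []
    for u, neighbors in arrivals:
        free = [v for v in neighbors if v not in matched]
        if free:
            v = min(free, key=lambda w: predicted_degree[w])  # ties broken arbitrarily
            matched.add(v)
            matching.append((u, v))
    return matching
```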
-
Recent work shows that the expressive power of Graph Neural Networks (GNNs) in distinguishing non-isomorphic graphs is exactly the same as that of the Weisfeiler-Lehman (WL) graph test. In particular, the WL test can be simulated by GNNs. However, those simulations involve neural networks for the “combine” function whose size is polynomial or even exponential in the number of graph nodes n, as well as feature vectors of length linear in n. We present an improved simulation of the WL test on GNNs with exponentially lower complexity. In particular, the neural network implementing the combine function at each node has only polylog(n) parameters, and the feature vectors exchanged by the nodes of the GNN consist of only O(log n) bits. We also give logarithmic lower bounds on the feature vector length and the size of the neural networks, showing the (near-)optimality of our construction.
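For reference, here is the classic 1-WL color-refinement loop that the GNN simulations above emulate: each round replaces a node's color with a hash of its own color and the multiset of its neighbors' colors. Hashing colors into a small range loosely mirrors the short O(log n)-bit features, though the actual construction and its guarantees are more involved.

```python
def wl_refinement(adjacency, rounds):
    """1-dimensional Weisfeiler-Lehman color refinement.

    adjacency: dict mapping each node to a list of its neighbors.
    Two graphs that end up with different color multisets are
    non-isomorphic; this is the test whose distinguishing power
    matches that of message-passing GNNs.
    """
    colors = {v: 0 for v in adjacency}  # start from a uniform coloring
    for _ in range(rounds):
        colors = {
            v: hash((colors[v], tuple(sorted(colors[u] for u in adjacency[v]))))
            for v in adjacency
        }
    return colors
```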