Title: Space-Optimal Profile Estimation in Data Streams with Applications to Symmetric Functions
Abstract: Metric embeddings traditionally study how to map $$n$$ items to a target metric space such that distance lengths are not heavily distorted. However, what if we are only interested in preserving the relative order of the distances, rather than their exact lengths? In this paper, we explore the fundamental question: given triplet comparisons of the form "item i is closer to item j than to item k," can we find low-dimensional Euclidean representations for the $$n$$ items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications, such as recommendations, ranking, crowdsourcing, psychometrics, and nearest-neighbor search, and have been studied since the 1950s under the name of ordinal or non-metric embeddings. Our main results include:

Nearly-Tight Bounds on Triplet Dimension: We introduce the concept of the triplet dimension of a dataset and show, surprisingly, that for an ordinal embedding to be triplet-preserving, its dimension needs to grow as $$\Omega(n)$$ in the worst case. This is nearly optimal, as $$n-1$$ dimensions always suffice.

Tradeoffs for Dimension vs. (Ordinal) Relaxation: We relax the requirement that every triplet be exactly preserved and present almost-tight lower bounds on the maximum ratio between distances whose relative order was inverted by the embedding. This ratio is known as (ordinal) relaxation in the literature and serves as a counterpart to (metric) distortion.

New Bounds on Terminal and Top-k-NNs Embeddings: Moving beyond triplets, we study two well-motivated scenarios where we care about preserving specific sets of distances (not necessarily triplets). The first scenario is Terminal Ordinal Embeddings, where we aim to preserve relative distance orders to $$k$$ given items (the "terminals"); for this we present matching upper and lower bounds. The second scenario is Top-$$k$$-NNs Ordinal Embeddings, where for each item we aim to preserve the relative order of its $$k$$ nearest neighbors; for this we present lower bounds.

To the best of our knowledge, these are some of the first tradeoffs on triplet-preserving ordinal embeddings and the first study of Terminal and Top-$$k$$-NNs Ordinal Embeddings.
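To make the triplet-preservation requirement concrete, here is a minimal Python sketch (names and interface are illustrative, not from the paper): an embedding preserves the comparison "item i is closer to item j than to item k" exactly when the embedded distance from i to j is smaller than the distance from i to k.

```python
import numpy as np

def violated_triplets(X, triplets):
    """Count comparisons (i, j, k), read as 'i is closer to j than to k',
    that the embedding X (one row per item) fails to preserve.
    Illustrative sketch; not code from the paper."""
    bad = 0
    for i, j, k in triplets:
        if np.linalg.norm(X[i] - X[j]) >= np.linalg.norm(X[i] - X[k]):
            bad += 1
    return bad

# Toy example: three points on a line; the triplet (0, 1, 2) says
# "item 0 is closer to item 1 than to item 2", which this embedding preserves.
X = np.array([[0.0], [1.0], [3.0]])
print(violated_triplets(X, [(0, 1, 2)]))  # -> 0
```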
Award ID(s):
2022448
PAR ID:
10523014
Author(s) / Creator(s):
Publisher / Repository:
Innovations in Theoretical Computer Science
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. A practical and well-studied method for computing the novelty of a design is to construct an ordinal embedding via a collection of pairwise comparisons between items (called triplets), and to use distances within that embedding to compute which designs are farthest from the center. Unfortunately, ordinal embedding methods can require a large number of triplets before their primary error measure, the triplet violation error, converges. But if our goal is accurate novelty estimation, is it really necessary to fully minimize all triplet violations? Can we extract useful information regarding the novelty of all or some items using fewer triplets than classical convergence rates might imply? This paper addresses this question by studying the relationship between triplet violation error and novelty score error when using ordinal embeddings. Specifically, we compare how errors in embeddings produced by Generalized Non-Metric Multidimensional Scaling (GNMDS) converge under different sampling methods, for different numbers of embedded items, sizes of latent spaces, and for the top K most novel designs. We find that estimating the novelty of a set of items via ordinal embedding can require significantly fewer human-provided triplets than is needed for the triplet error to converge, and that this effect is modulated by the type of triplet sampling method (random versus uncertainty sampling). We also find that uncertainty sampling causes unique convergence behavior when estimating the most novel items compared to non-novel items. Our results imply that in certain situations one can use ordinal embedding techniques to estimate novelty with fewer samples than is typically expected. Moreover, the convergence behavior of the top K novel items motivates new potential triplet sampling methods that go beyond typical triplet reduction measures.
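To fix notation, the two error measures compared in this study can be sketched in a few lines of Python; the embedding below is a random stand-in for a GNMDS output, and all names are illustrative rather than taken from the paper.

```python
import numpy as np

def novelty_scores(X):
    """Score each item's novelty as its distance from the embedding centroid,
    matching the 'farthest from the center' notion above (illustrative)."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1)

def triplet_violation_error(X, triplets):
    """Fraction of comparisons 'i is closer to j than to k' that X breaks."""
    bad = sum(
        np.linalg.norm(X[i] - X[j]) >= np.linalg.norm(X[i] - X[k])
        for i, j, k in triplets
    )
    return bad / len(triplets)

X = np.random.default_rng(0).normal(size=(10, 2))  # stand-in for a GNMDS embedding
print(novelty_scores(X).argsort()[::-1][:3])       # indices of the 3 most novel items
```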
  2. Bouamor, Houda; Pino, Juan; Bali, Kalika (Ed.)
    Cross-encoder models, which jointly encode and score a query-item pair, are prohibitively expensive for direct k-nearest neighbor (k-NN) search. Consequently, k-NN search typically employs a fast approximate retrieval stage (e.g., using BM25 or dual-encoder vectors), followed by reranking with a cross-encoder; however, the retrieval approximation often incurs substantial recall regret. This problem is tackled by ANNCUR (Yadav et al., 2022), a recent method that employs the cross-encoder alone, making search efficient by using a relatively small number of anchor items together with a CUR matrix factorization. While ANNCUR's one-time selection of anchors tends to approximate the cross-encoder distances on average, it forfeits the capacity to accurately estimate distances to items near the query, leading to regret in the crucial end task: recall of the top-k items. In this paper, we propose ADACUR, a method that adaptively, iteratively, and efficiently minimizes the approximation error for the practically important top-k neighbors. It does so by iteratively performing k-NN search using the anchors available so far, then adding the retrieved nearest neighbors to the anchor set for the next round. Empirically, on multiple datasets, in comparison to traditional and state-of-the-art methods such as ANNCUR and dual-encoder-based retrieve-and-rerank, our proposed approach ADACUR consistently reduces recall error, by up to 70% in the important k = 1 setting, while using no more compute than its competitors.
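The adaptive loop this abstract describes can be sketched schematically; the code below is a reading of the abstract, not the authors' implementation. In particular, `cross_score(i)` stands in for one expensive cross-encoder call scoring the query against item i, and the score-propagation step uses a simple least-squares fit over cheap precomputed item vectors as a stand-in for the CUR-based estimator.

```python
import numpy as np

def adaptive_topk(cross_score, item_vecs, k=10, rounds=3, n_init=20, seed=0):
    """Schematic ADACUR-style loop (illustrative, not the authors' code):
    score the query against a small anchor set, extrapolate scores to all
    items, promote the estimated top-k to anchors, and repeat."""
    n_items = item_vecs.shape[0]
    rng = np.random.default_rng(seed)
    anchors = [int(a) for a in rng.choice(n_items, size=n_init, replace=False)]
    exact = {a: cross_score(a) for a in anchors}  # expensive calls, one per anchor
    for _ in range(rounds):
        A = item_vecs[anchors]                    # anchor embeddings
        y = np.array([exact[a] for a in anchors]) # exact scores at anchors
        w, *_ = np.linalg.lstsq(A, y, rcond=None) # fit scores from item vectors
        approx = item_vecs @ w                    # estimate every item's score
        topk = [int(i) for i in np.argsort(-approx)[:k]]
        for i in topk:                            # refine anchors near the query
            if i not in exact:
                exact[i] = cross_score(i)
                anchors.append(i)
    return sorted(topk, key=lambda i: -exact[i])  # exact rerank of the final top-k
```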
  3. Suppose that we have $$n$$ agents and $$n$$ items which lie in a shared metric space. We would like to match the agents to items such that the total distance from agents to their matched items is as small as possible. However, instead of having direct access to distances in the metric, we only have each agent's ranking of the items in order of distance. Given this limited information, what is the minimum possible worst-case approximation ratio (known as the "distortion") that a matching mechanism can guarantee? Previous work (CFRF+16) proved that the (deterministic) Serial Dictatorship mechanism has distortion at most $$2^n - 1$$. We improve this by providing a simple deterministic mechanism that has distortion $$O(n^2)$$. We also provide the first nontrivial lower bound for this problem, showing that any matching mechanism (deterministic or randomized) must have worst-case distortion $$\Omega(\log n)$$. In addition to these new bounds, we show that a large class of truthful mechanisms derived from Deferred Acceptance all have worst-case distortion at least $$2^n - 1$$, and we find an intriguing connection between "thin matchings" (analogous to the well-known thin trees conjecture) and the distortion gap between deterministic and randomized mechanisms.
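For reference, the Serial Dictatorship mechanism analyzed above is easy to state in code: agents, in a fixed order, each take their closest item still available. A minimal sketch, where `prefs[a]` lists the items in increasing order of distance from agent a (interface hypothetical):

```python
def serial_dictatorship(prefs):
    """Serial Dictatorship on ordinal preferences: agent 0 takes its closest
    item, agent 1 its closest remaining item, and so on. Illustrative sketch."""
    taken, match = set(), {}
    for agent, ranking in enumerate(prefs):
        for item in ranking:
            if item not in taken:
                match[agent] = item
                taken.add(item)
                break
    return match

# Toy usage: both agents prefer item 0; the agent first in the order gets it.
print(serial_dictatorship([[0, 1], [0, 1]]))  # -> {0: 0, 1: 1}
```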
  4. We study the problem of representing all distances between $$n$$ points in $$\mathbb{R}^d$$, with arbitrarily small distortion, using as few bits as possible. We give asymptotically tight bounds for this problem, for Euclidean metrics, for $$\ell_1$$ (a.k.a. Manhattan) metrics, and for general metrics. Our bounds for Euclidean metrics mark the first improvement over compression schemes based on discretizing the classical dimensionality reduction theorem of Johnson and Lindenstrauss (Contemp. Math., 1984). Since it is known that no better dimension reduction is possible, our results establish that Euclidean metric compression is possible beyond dimension reduction.
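The baseline being improved on here can be made concrete: discretize the Johnson-Lindenstrauss transform by projecting to a lower dimension with a random Gaussian map and rounding each coordinate to a grid, so every coordinate fits in a bounded number of bits. A rough Python sketch under arbitrary illustrative parameters (this is only the discretized-JL baseline, not the paper's scheme):

```python
import numpy as np

def jl_compress(X, target_dim, grid=1e-3, seed=0):
    """Baseline sketch: random Gaussian (JL-style) projection to `target_dim`
    dimensions, then rounding coordinates to a grid so each can be stored in
    a bounded number of bits. Parameters are arbitrary and illustrative."""
    rng = np.random.default_rng(seed)
    G = rng.normal(size=(X.shape[1], target_dim)) / np.sqrt(target_dim)
    Y = X @ G                          # pairwise distances preserved up to ~(1 +/- eps)
    return np.round(Y / grid) * grid   # discretize: ~log2(range/grid) bits per coordinate

X = np.random.default_rng(1).normal(size=(100, 50))
Y = jl_compress(X, target_dim=24)
```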
  5. We study the natural problem of Triplet Reconstruction (also known as Rooted Triplets Consistency or Triplet Clustering), originally motivated by applications in computational biology and relational databases (Aho, Sagiv, Szymanski, and Ullman, 1981) [2]: given n datapoints, we want to embed them onto the n leaves of a rooted binary tree (also known as a hierarchical clustering, or ultrametric embedding) such that a given set of m triplet constraints is satisfied. A triplet constraint ij|k for points i, j, k indicates that "i, j are more closely related to each other than to k" (in terms of distances, d(i, j) ≤ d(i, k) and d(i, j) ≤ d(j, k)), and we say that a tree satisfies the triplet ij|k if the distance in the tree between i, j is smaller than the distance between i, k (or j, k). Among all possible trees with n leaves, can we efficiently find one that satisfies a large fraction of the m given triplets? Aho et al. (1981) [2] studied the decision version and gave an elegant polynomial-time algorithm that determines whether or not there exists a tree that satisfies all of the m constraints. Moreover, it is straightforward to see that a random binary tree achieves a constant 1/3-approximation, since there are only 3 distinct triplets ij|k, ik|j, jk|i (each will be satisfied w.p. 1/3). Unfortunately, despite more than four decades of research by various communities, there is no better approximation algorithm for this basic Triplet Reconstruction problem.

Our main theorem, which captures Triplet Reconstruction as a special case, is a general hardness of approximation result about Constraint Satisfaction Problems (CSPs) over infinite domains (CSPs where, instead of boolean values {0,1} or a fixed-size domain, the variables can be mapped to any of the n leaves of a tree). Specifically, we prove that, assuming the Unique Games Conjecture [57], Triplet Reconstruction and, more generally, every Constraint Satisfaction Problem (CSP) over hierarchies is approximation resistant, i.e., there is no polynomial-time algorithm that does asymptotically better than a biased random assignment.

Our result settles the approximability not only of Triplet Reconstruction, but of many interesting problems that have been studied by various scientific communities, such as the popular Quartet Reconstruction and Subtree/Supertree Aggregation Problems. More broadly, our result significantly extends the list of approximation resistant predicates by pointing to a large new family of hard problems over hierarchies. Our main theorem is a generalization of Guruswami, Håstad, Manokaran, Raghavendra, and Charikar (2011) [36], who showed that ordering CSPs (CSPs over permutations of n elements, e.g., Max Acyclic Subgraph, Betweenness, Non-Betweenness) are approximation resistant. The main challenge in our analyses stems from the fact that trees have topology (in contrast to permutations and ordering CSPs), and it is the tree topology that determines whether a given constraint on the variables is satisfied or not. As a byproduct, we also present some of the first CSPs whose approximation resistance is proved against biased random assignments, instead of uniformly random assignments.
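The Aho et al. (1981) decision algorithm mentioned above (often called BUILD) is short enough to sketch: on the current leaf set, connect i and j for every constraint ij|k whose three points all lie in the set; if the resulting graph is connected on two or more leaves, no tree satisfies all constraints; otherwise each connected component becomes a subtree and the algorithm recurses. A minimal Python sketch, returning trees as nested tuples:

```python
def build(leaves, triplets):
    """Aho-Sagiv-Szymanski-Ullman BUILD: return a rooted tree (nested tuples)
    satisfying every constraint (i, j, k), read as 'ij|k', or None if no tree
    does. Minimal sketch of the classical decision algorithm."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Auxiliary graph: join i and j for every constraint ij|k inside this set.
    adj = {v: set() for v in leaves}
    for i, j, k in triplets:
        if i in leaves and j in leaves and k in leaves:
            adj[i].add(j)
            adj[j].add(i)
    # Connected components of the auxiliary graph (iterative DFS).
    components, seen = [], set()
    for v in leaves:
        if v in seen:
            continue
        comp, stack = set(), [v]
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        components.append(comp)
    if len(components) == 1:
        return None  # connected on >= 2 leaves: constraints are unsatisfiable
    subtrees = [build(c, triplets) for c in components]
    return None if None in subtrees else tuple(subtrees)

# 'ab|c' and 'cd|a' are consistent: a tree like (('a', 'b'), ('c', 'd')) works.
print(build({"a", "b", "c", "d"}, [("a", "b", "c"), ("c", "d", "a")]))
```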