Title: Fewer Triplets Than You Think: Novelty Error Converges Faster Than Triplet Violations in Ordinal Embeddings
A practical and well-studied method for computing the novelty of a design is to construct an ordinal embedding from a collection of relative distance comparisons between items (called triplets), and to use distances within that embedding to determine which designs are farthest from the center. Unfortunately, ordinal embedding methods can require a large number of triplets before their primary error measure, the triplet violation error, converges. But if our goal is accurate novelty estimation, is it really necessary to fully minimize all triplet violations? Can we extract useful information about the novelty of all or some items using fewer triplets than classical convergence rates might imply? This paper addresses these questions by studying the relationship between triplet violation error and novelty score error in ordinal embeddings. Specifically, we compare how errors in embeddings produced by Generalized Non-metric Multidimensional Scaling (GNMDS) converge under different sampling methods, for different numbers of embedded items, different latent-space sizes, and for the top K most novel designs. We find that estimating the novelty of a set of items via ordinal embedding can require significantly fewer human-provided triplets than are needed for the triplet error to converge, and that this effect is modulated by the triplet sampling method (random versus uncertainty sampling). We also find that, under uncertainty sampling, the novelty estimates of the most novel items converge differently from those of non-novel items. Our results imply that, in certain situations, ordinal embedding techniques can estimate novelty from fewer samples than is typically expected. Moreover, the convergence behavior of the top K novel items motivates new triplet sampling methods that go beyond typical triplet reduction measures.
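As a concrete illustration of the pipeline the abstract describes, here is a minimal sketch of novelty estimation from triplets. Note the hedges: GNMDS proper is typically posed as a convex program over a Gram matrix, so the hinge-loss gradient descent below is a simplified stand-in, and all function names are illustrative rather than the paper's code.

```python
import numpy as np

def fit_ordinal_embedding(n_items, triplets, dim=2, margin=1.0,
                          lr=0.05, epochs=500, seed=0):
    """GNMDS-style embedding from triplets (i, j, k), each meaning 'item i is
    closer to item j than to item k'. Minimizes a hinge loss on squared
    distances by full-batch gradient descent (a simplification of GNMDS)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=0.1, size=(n_items, dim))
    for _ in range(epochs):
        grad = np.zeros_like(X)
        for i, j, k in triplets:
            d_ij = np.sum((X[i] - X[j]) ** 2)
            d_ik = np.sum((X[i] - X[k]) ** 2)
            if d_ij + margin > d_ik:  # hinge active: violated or within margin
                grad[i] += 2 * ((X[i] - X[j]) - (X[i] - X[k]))
                grad[j] += -2 * (X[i] - X[j])
                grad[k] += 2 * (X[i] - X[k])
        X -= lr * grad / max(len(triplets), 1)
    return X

def novelty_scores(X):
    """Novelty of each item: Euclidean distance from the embedding centroid."""
    return np.linalg.norm(X - X.mean(axis=0), axis=1)

def triplet_violation_error(X, triplets):
    """Fraction of triplets whose distance order the embedding gets wrong."""
    bad = sum(np.sum((X[i] - X[j]) ** 2) >= np.sum((X[i] - X[k]) ** 2)
              for i, j, k in triplets)
    return bad / len(triplets)
```

Tracking `triplet_violation_error` and the error of `novelty_scores` against ground truth as triplets accumulate reproduces, in miniature, the comparison the paper studies.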
Award ID(s): 2155072
PAR ID: 10518117
Publisher / Repository: American Society of Mechanical Engineers
ISBN: 978-0-7918-8734-9
Format(s): Medium: X
Location: Boston, Massachusetts, USA
Sponsoring Org: National Science Foundation
More Like this
  1. Metric embeddings traditionally study how to map n items to a target metric space such that distance lengths are not heavily distorted. But what if we are only interested in preserving the relative order of the distances, rather than their exact lengths? In this paper, we explore the fundamental question: given triplet comparisons of the form "item i is closer to item j than to item k," can we find low-dimensional Euclidean representations for the n items that respect those distance comparisons? Such order-preserving embeddings naturally arise in important applications, such as recommendations, ranking, crowdsourcing, psychometrics, and nearest-neighbor search, and have been studied since the 1950s under the name of ordinal or non-metric embeddings. Our main results include:
     - Nearly-Tight Bounds on Triplet Dimension: We introduce the concept of the triplet dimension of a dataset and show, surprisingly, that for an ordinal embedding to be triplet-preserving, its dimension needs to grow linearly in n in the worst case. This is nearly optimal, as n−1 dimensions always suffice.
     - Tradeoffs for Dimension vs. (Ordinal) Relaxation: We relax the requirement that every triplet be exactly preserved and present almost tight lower bounds for the maximum ratio between distances whose relative order was inverted by the embedding. This ratio is known as (ordinal) relaxation in the literature and serves as a counterpart to (metric) distortion.
     - New Bounds on Terminal and Top-k-NNs Embeddings: Moving beyond triplets, we study two well-motivated scenarios where we care about preserving specific sets of distances (not necessarily triplets). The first is Terminal Ordinal Embeddings, where we aim to preserve relative distance orders to k given items (the "terminals"); for these we present matching upper and lower bounds. The second is Top-k-NNs Ordinal Embeddings, where for each item we aim to preserve the relative order of its k nearest neighbors; for these we present lower bounds.
     To the best of our knowledge, these are some of the first tradeoffs on triplet-preserving ordinal embeddings and the first study of Terminal and Top-k-NNs Ordinal Embeddings.
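As a companion to the definitions above, the sketch below scores a given embedding against a set of triplets, reporting the fraction of inverted triplets together with one concrete reading of the (ordinal) relaxation: the worst distance ratio among the inverted triplets. It is an illustrative check written for this summary, not code from the paper.

```python
import numpy as np

def ordinal_relaxation(X, triplets):
    """Score an embedding X (n x d) against triplets (i, j, k), each asserting
    that item i should be closer to j than to k. Returns the fraction of
    triplets whose distance order the embedding inverts, and the worst ratio
    d(i, j) / d(i, k) over those inverted triplets (1.0 if none are)."""
    inverted, worst = 0, 1.0
    for i, j, k in triplets:
        d_ij = np.linalg.norm(X[i] - X[j])
        d_ik = np.linalg.norm(X[i] - X[k])
        if d_ij >= d_ik:  # relative order inverted (or tied) by the embedding
            inverted += 1
            worst = max(worst, d_ij / max(d_ik, 1e-12))
    return inverted / len(triplets), worst
```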
  2. Well-studied techniques that enhance diversity in early design concept generation require effective metrics for evaluating human-perceived similarity between ideas. Recent work suggests collecting triplet comparisons between designs directly from human raters and using those triplets to form an embedding where similarity is expressed as a Euclidean distance. While effective at modeling human-perceived similarity judgments, these methods are expensive and require a large number of triplets to be hand-labeled. However, what if there were a way to use AI to replicate the human similarity judgments captured in triplet embedding methods? In this paper, we explore the potential for pretrained Large Language Models (LLMs) to be used in this context. Using a dataset of crowdsourced text descriptions written about engineering design sketches, we generate LLM embeddings and compare them to an embedding created from human-provided triplets of those same sketches. From these embeddings, we can use Euclidean distances to describe areas where human perception and LLM perception disagree regarding design similarity. We then implement this same procedure but with descriptions written from a template that attempts to isolate a particular modality of a design (i.e., functions, behaviors, structures). By comparing the templated description embeddings to both the triplet-generated and pre-template LLM embeddings, we identify ways of describing designs such that LLM and human similarity perception better agree. We use these results to better understand how humans and LLMs interpret similarity in engineering designs.
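A minimal sketch of the comparison described above, assuming `llm_emb` and `triplet_emb` are precomputed (n x d) arrays whose rows index the same sketches; producing the LLM embeddings themselves (e.g., via an embeddings API) is outside this sketch, and both function names are invented for illustration.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def distance_agreement(llm_emb, triplet_emb):
    """Rank-correlate the pairwise distance structure of two embeddings of the
    same items (rows aligned). High Spearman rho means the LLM orders design
    similarities much the way the human triplet embedding does."""
    rho, _ = spearmanr(pdist(llm_emb), pdist(triplet_emb))
    return rho

def most_disputed_pairs(llm_emb, triplet_emb, top=5):
    """Design pairs whose distance rank differs most between the two
    embeddings, i.e., where LLM and human similarity perception disagree."""
    n = len(llm_emb)
    d_llm, d_hum = pdist(llm_emb), pdist(triplet_emb)
    ranks = lambda d: np.argsort(np.argsort(d))
    gap = np.abs(ranks(d_llm) - ranks(d_hum))
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]  # pdist order
    return [pairs[t] for t in np.argsort(gap)[::-1][:top]]
```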
  3. Personalized recommender systems play a crucial role in modern society, especially in the e-commerce, news, and ads domains. Correctly evaluating and comparing candidate recommendation models is as essential as constructing them. The common offline evaluation strategy is to hold out some user-interacted items from the training data and evaluate recommendation models on how many of those items they can retrieve. Specifically, for any hold-out (or target) item of a user, a recommendation model predicts the probability that the user would interact with the item and ranks it among all items, which is called global evaluation. Intuitively, a good recommendation model assigns high probabilities to such hold-out/target items. From the resulting ranks, metrics like Recall@K and NDCG@K can be computed to quantify the quality of the recommender model. Instead of ranking target items among all items, Koren first proposed ranking them within a small sampled set of items and quantifying model performance on that basis, which is called sampling evaluation. Since then, a large body of work has adopted sampling evaluation for its efficiency and frugality. In recent work, Rendle and Krichene argued that sampling evaluation is "inconsistent" with global evaluation in terms of offline top-K metrics. In this work, we first investigate this "inconsistency" by examining the connections between sampling evaluation and global evaluation. We reveal an approximately linear relationship between sampling evaluation and its global counterpart in terms of the top-K Recall metric. Second, we propose a new statistical perspective on sampling evaluation: estimating the global rank distribution of the entire population. Once the estimated rank distribution is obtained, an approximation of the global metric can be derived. Third, we extend the work of Krichene and Rendle by directly optimizing the error with respect to the ground truth, providing not only a comprehensive empirical study but also a rigorous theoretical understanding of the proposed metric estimators. To address the "blind spot" issue, where accurately estimating metrics for small top-K values in sampling evaluation is challenging, we propose a novel adaptive sampling method that generalizes the expectation-maximization algorithm to this setting. Last but not least, we also study the effect of user sampling in evaluation. This series of works outlines a clear roadmap for sampling evaluation and establishes a foundational theoretical framework. Extensive empirical studies validate the reliability of the sampling methods presented.
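The toy sketch below contrasts the two protocols for a single user-target pair and shows the naive rank-scaling estimator suggested by the approximately linear relationship; it is illustrative only and much simpler than the authors' estimators or their adaptive, EM-style sampler.

```python
import numpy as np

def global_recall_at_k(scores, target, k):
    """Global evaluation: rank the held-out target among ALL items by score
    (higher = more likely interaction); count a hit if it lands in the top K."""
    rank = np.sum(scores > scores[target])  # 0-based global rank
    return float(rank < k)

def sampled_recall_at_k(scores, target, k, n_sample=100, rng=None):
    """Sampling evaluation: rank the target against n_sample random negatives
    only. Far cheaper, but a top-K hit here is not the same event as a global
    top-K hit, which is the source of the 'inconsistency'."""
    if rng is None:
        rng = np.random.default_rng(0)
    candidates = np.delete(np.arange(len(scores)), target)
    negatives = rng.choice(candidates, size=n_sample, replace=False)
    rank = np.sum(scores[negatives] > scores[target])
    return float(rank < k)

def estimate_global_rank(sampled_rank, n_sample, n_items):
    """Estimator behind the approximately linear relationship: rank r among
    n_sample uniform negatives scales to roughly r * N / n_sample among N."""
    return sampled_rank * n_items / n_sample
```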
  4. In recent years, large language models (LLMs) and vision language models (VLMs) have excelled at tasks requiring human-like reasoning, inspiring researchers in engineering design to use language models (LMs) as surrogate evaluators of design concepts. But do these models actually evaluate designs like humans? While recent work has shown that LM evaluations sometimes fall within human variance on Likert-scale grading tasks, those tasks often obscure the reasoning and biases behind the scores. To address this limitation, we compare LM word embeddings (trained to capture semantic similarity) with human-rated similarity embeddings derived from triplet comparisons (“is A closer to B than C?”) on a dataset of design sketches and descriptions. We assess alignment via local tripletwise similarity and embedding distances, allowing for deeper insights than raw Likert-scale scores provide. We also explore whether describing the designs to LMs through text or images improves alignment with human judgments. Our findings suggest that text alone may not fully capture the nuances humans key into, yet text-based embeddings outperform their multimodal counterparts on satisfying local triplets. On the basis of these insights, we offer recommendations for effectively integrating LMs into design evaluation tasks.
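A hedged sketch of the local tripletwise check: given human triplets (a, b, c), meaning raters judged design a closer to b than to c, we can measure how often an LM embedding agrees, overall and per anchor item. The per-item breakdown is an illustrative choice, not necessarily the paper's exact protocol.

```python
import numpy as np

def triplet_satisfaction(lm_emb, human_triplets):
    """Fraction of human triplets (a, b, c), meaning 'a is closer to b than to
    c', that the LM embedding also satisfies; 0.5 is chance-level alignment."""
    hits = sum(
        np.linalg.norm(lm_emb[a] - lm_emb[b]) < np.linalg.norm(lm_emb[a] - lm_emb[c])
        for a, b, c in human_triplets)
    return hits / len(human_triplets)

def per_item_alignment(lm_emb, human_triplets, n_items):
    """Satisfaction rate restricted to triplets anchored at each item, exposing
    WHICH designs the LM and humans disagree about, not just how much overall."""
    hit, tot = np.zeros(n_items), np.zeros(n_items)
    for a, b, c in human_triplets:
        tot[a] += 1
        if np.linalg.norm(lm_emb[a] - lm_emb[b]) < np.linalg.norm(lm_emb[a] - lm_emb[c]):
            hit[a] += 1
    return np.divide(hit, tot, out=np.full(n_items, np.nan), where=tot > 0)
```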
  5. OpenReview (Ed.)
    We consider the distributionally robust optimization (DRO) problem with a spectral risk-based uncertainty set and an f-divergence penalty. This formulation includes common risk-sensitive learning objectives such as regularized conditional value-at-risk (CVaR) and the average top-k loss. We present Prospect, a stochastic gradient-based algorithm that requires tuning only a single learning-rate hyperparameter, and we prove that it enjoys linear convergence for smooth regularized losses. This contrasts with previous algorithms that either require tuning multiple hyperparameters or potentially fail to converge due to biased gradient estimates or inadequate regularization. Empirically, we show that Prospect can converge 2-3× faster than baselines such as stochastic gradient and stochastic saddle-point methods on distribution shift and fairness benchmarks spanning tabular, vision, and language domains.
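For orientation, here is a small illustration of the spectral risk objective family and how both examples from the abstract arise as special cases. It sketches only the objective, not the Prospect algorithm or its gradient estimators, and the CVaR convention used (k/n as the tail fraction) is an assumption.

```python
import numpy as np

def spectral_risk(losses, sigma):
    """Spectral risk: weighted average of the SORTED losses under nondecreasing
    weights sigma summing to 1, so the worst losses get the largest weights."""
    return np.sort(np.asarray(losses)) @ np.asarray(sigma)

def tail_average_spectrum(n, k):
    """Uniform weight on the k largest of n losses, zero elsewhere. This one
    spectrum covers both examples from the abstract: the average top-k loss
    directly, and (on an empirical distribution) CVaR at tail fraction k/n."""
    w = np.zeros(n)
    w[-k:] = 1.0 / k
    return w

# Worked example: average of the worst 2 of 5 losses.
losses = [0.1, 0.9, 0.3, 1.2, 0.5]
print(spectral_risk(losses, tail_average_spectrum(5, 2)))  # (0.9 + 1.2) / 2 = 1.05
```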