Search for: All records

Award ID contains: 2155072

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Abstract: In recent years, large language models (LLMs) and vision language models (VLMs) have excelled at tasks requiring human-like reasoning, inspiring researchers in engineering design to use language models (LMs) as surrogate evaluators of design concepts. But do these models actually evaluate designs like humans? While recent work has shown that LM evaluations sometimes fall within human variance on Likert-scale grading tasks, those tasks often obscure the reasoning and biases behind the scores. To address this limitation, we compare LM word embeddings (trained to capture semantic similarity) with human-rated similarity embeddings derived from triplet comparisons (“is A closer to B than C?”) on a dataset of design sketches and descriptions. We assess alignment via local tripletwise similarity and embedding distances, allowing for deeper insights than raw Likert-scale scores provide. We also explore whether describing the designs to LMs through text or images improves alignment with human judgments. Our findings suggest that text alone may not fully capture the nuances humans key into, yet text-based embeddings outperform their multimodal counterparts on satisfying local triplets. On the basis of these insights, we offer recommendations for effectively integrating LMs into design evaluation tasks.
    Free, publicly-accessible full text available October 1, 2026
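    The alignment check described in this abstract can be made concrete with a small sketch. The snippet below is not the authors' code; the embedding matrix and triplet list are made-up placeholders. It scores how often a language model's embedding distances satisfy human triplet judgments of the form “is A closer to B than C?”. An agreement near 0.5 would mean the embedding does no better than chance at reproducing the human judgments.

    ```python
    # Minimal sketch (not the authors' code): score how often a language-model
    # embedding agrees with human triplet judgments of design similarity.
    # `lm_emb` and `human_triplets` below are hypothetical stand-ins.
    import numpy as np

    def triplet_agreement(emb, triplets):
        """Fraction of human triplets (a, b, c), read as "a is closer to b than
        to c", that the embedding's Euclidean distances also satisfy."""
        satisfied = 0
        for a, b, c in triplets:
            d_ab = np.linalg.norm(emb[a] - emb[b])
            d_ac = np.linalg.norm(emb[a] - emb[c])
            satisfied += d_ab < d_ac
        return satisfied / len(triplets)

    # Toy example: 5 design "embeddings" in 3-D and a few invented human judgments.
    rng = np.random.default_rng(0)
    lm_emb = rng.normal(size=(5, 3))
    human_triplets = [(0, 1, 2), (3, 4, 0), (2, 1, 4)]
    print(f"LM/human triplet agreement: {triplet_agreement(lm_emb, human_triplets):.2f}")
    ```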
  2. Abstract: Creativity is increasingly recognized as a core competency for the 21st century, making its development a priority in education, research, and industry. To effectively cultivate creativity, researchers and educators need reliable and accessible assessment tools. Recent software developments have significantly enhanced the administration and scoring of creativity measures; however, existing software often requires expertise in experiment design and computer programming, limiting its accessibility to many educators and researchers. In the current work, we introduce CAP—the Creativity Assessment Platform—a free web application for building creativity assessments, collecting data, and automatically scoring responses (cap.ist.psu.edu). CAP allows users to create custom creativity assessments in ten languages using a simple, point-and-click interface, selecting from tasks such as the Short Story Task, Drawing Task, and Scientific Creative Thinking Test. Users can automatically score task responses using machine learning models trained to match human creativity ratings—with multilingual capabilities, including the new Cross-Lingual Alternate Uses Scoring (CLAUS), a large language model achieving strong prediction of human creativity ratings in ten languages. CAP also provides a centralized dashboard to monitor data collection, score assessments, and automatically generate text for a Methods section based on the study’s tasks, metrics, and instructions—with a single click—promoting transparency and reproducibility in creativity assessment. Designed for ease of use, CAP aims to democratize creativity measurement for researchers, educators, and everyone in between.
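    As a rough illustration of "machine learning models trained to match human creativity ratings," the sketch below fits a simple TF-IDF plus ridge-regression baseline to toy responses and ratings. This is not CAP's actual scorer (CLAUS is a large language model); every response, rating, and parameter shown is invented for the example.

    ```python
    # Illustrative sketch only: trains a simple text-regression model to predict
    # human creativity ratings. CAP's scorers (e.g., CLAUS) are large language
    # models; this TF-IDF + ridge baseline just shows the general idea.
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.pipeline import make_pipeline

    # Made-up alternate-uses responses and made-up human originality ratings (1-5).
    responses = [
        "use a brick as a doorstop",
        "grind the brick into pigment for paint",
        "build a tiny oven for a dollhouse",
        "use a brick to hold down papers",
    ]
    human_ratings = np.array([1.5, 4.0, 3.5, 1.0])

    model = make_pipeline(TfidfVectorizer(), Ridge(alpha=1.0))
    model.fit(responses, human_ratings)

    new_response = ["carve the brick into a chess piece"]
    print("predicted creativity rating:", model.predict(new_response)[0])
    ```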
  3. Abstract: Well-studied techniques that enhance diversity in early design concept generation require effective metrics for evaluating human-perceived similarity between ideas. Recent work suggests collecting triplet comparisons between designs directly from human raters and using those triplets to form an embedding where similarity is expressed as a Euclidean distance. While effective at modeling human-perceived similarity judgments, these methods are expensive and require a large number of triplets to be hand-labeled. However, what if there were a way to use AI to replicate the human similarity judgments captured in triplet embedding methods? In this paper, we explore the potential for pretrained Large Language Models (LLMs) to be used in this context. Using a dataset of crowdsourced text descriptions written about engineering design sketches, we generate LLM embeddings and compare them to an embedding created from human-provided triplets of those same sketches. From these embeddings, we can use Euclidean distances to describe areas where human perception and LLM perception disagree regarding design similarity. We then implement this same procedure but with descriptions written from a template that attempts to isolate a particular modality of a design (i.e., functions, behaviors, structures). By comparing the templated description embeddings to both the triplet-generated and pre-template LLM embeddings, we identify ways of describing designs such that LLM and human similarity perception better agree. We use these results to better understand how humans and LLMs interpret similarity in engineering designs.
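    One way to surface the disagreement regions this abstract describes is to compare, design by design, how each item's distances to all other designs rank in the two spaces. The sketch below does this with a per-item Spearman rank correlation; both embeddings are random placeholders standing in for real LLM and human-triplet embeddings, so the dimensions and function names are illustrative assumptions, not the paper's procedure.

    ```python
    # Minimal sketch (not the paper's code): find designs where LLM-embedding
    # distances and human triplet-embedding distances disagree most, by comparing
    # each design's distance profile across the two spaces.
    import numpy as np
    from scipy.spatial.distance import cdist
    from scipy.stats import spearmanr

    def per_item_agreement(emb_a, emb_b):
        """Spearman correlation, per item, between that item's distances to every
        other item in embedding A versus embedding B."""
        d_a, d_b = cdist(emb_a, emb_a), cdist(emb_b, emb_b)
        n = len(emb_a)
        scores = np.empty(n)
        for i in range(n):
            mask = np.arange(n) != i  # drop the zero self-distance
            scores[i] = spearmanr(d_a[i, mask], d_b[i, mask]).correlation
        return scores

    rng = np.random.default_rng(1)
    llm_emb = rng.normal(size=(10, 8))      # placeholder for LLM description embeddings
    triplet_emb = rng.normal(size=(10, 2))  # placeholder for a human triplet embedding
    agreement = per_item_agreement(llm_emb, triplet_emb)
    print("designs where the two spaces disagree most:", np.argsort(agreement)[:3])
    ```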
  4. A practical and well-studied method for computing the novelty of a design is to construct an ordinal embedding via a collection of pairwise comparisons between items (called triplets), and use distances within that embedding to compute which designs are farthest from the center. Unfortunately, ordinal embedding methods can require a large number of triplets before their primary error measure — the triplet violation error — converges. But if our goal is accurate novelty estimation, is it really necessary to fully minimize all triplet violations? Can we extract useful information regarding the novelty of all or some items using fewer triplets than classical convergence rates might imply? This paper addresses this question by studying the relationship between triplet violation error and novelty score error when using ordinal embeddings. Specifically, we compare how errors in embeddings produced by Generalized Non-Metric Multidimensional Scaling (GNMDS) converge under different sampling methods, for different numbers of embedded items, sizes of latent spaces, and for the top K most novel designs. We find that estimating the novelty of a set of items via ordinal embedding can require significantly fewer human-provided triplets than is needed to converge the triplet error, and that this effect is modulated by the type of triplet sampling method (random versus uncertainty sampling). We also find that uncertainty sampling causes unique convergence behavior when estimating the most novel items compared to non-novel items. Our results imply that in certain situations one can use ordinal embedding techniques to estimate novelty error in fewer samples than is typically expected. Moreover, the convergence behavior of the top K novel items motivates new potential triplet sampling methods that go beyond typical triplet reduction measures.
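    A minimal sketch of the quantities discussed in this last abstract follows: the triplet violation error of an embedding, a centroid-distance novelty score ("farthest from the center"), and the resulting top-K ranking. The embedding and triplets below are random placeholders; a real pipeline would fit the embedding with GNMDS on human-provided triplets, so the specific choices here are assumptions for illustration only.

    ```python
    # Minimal sketch (not the paper's code): triplet violation error and
    # centroid-distance novelty scores for an ordinal embedding of designs.
    import numpy as np

    def triplet_violation_error(emb, triplets):
        """Fraction of triplets (a, b, c), meaning d(a,b) < d(a,c), violated by emb."""
        viol = sum(np.linalg.norm(emb[a] - emb[b]) >= np.linalg.norm(emb[a] - emb[c])
                   for a, b, c in triplets)
        return viol / len(triplets)

    def novelty_scores(emb):
        """Novelty of each item as its Euclidean distance from the embedding centroid."""
        return np.linalg.norm(emb - emb.mean(axis=0), axis=1)

    rng = np.random.default_rng(2)
    emb = rng.normal(size=(20, 2))  # placeholder 2-D embedding of 20 designs
    triplets = [tuple(rng.choice(20, 3, replace=False)) for _ in range(50)]

    K = 5
    scores = novelty_scores(emb)
    top_k = np.argsort(scores)[::-1][:K]  # indices of the K most novel designs
    print("triplet violation error:", triplet_violation_error(emb, triplets))
    print(f"top-{K} most novel items:", top_k)
    ```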