

Title: Semantic Spaces Are Not Created Equal – How Should We Weigh Them in the Sequel?: On Composites in Automated Creativity Scoring
Semantic distance scoring provides an attractive alternative to other approaches for scoring responses in creative thinking tasks, and evidence in support of semantic distance scoring has increased over the last few years. One recent approach proposes combining multiple semantic spaces to better balance the idiosyncratic influences of each space; the final semantic distance score for each response is then represented by a composite or factor score. However, semantic spaces are not necessarily equally weighted in mean scores, and the use of factor scores requires high levels of factor determinacy (i.e., the correlation between estimated and true factor scores). Hence, in this work, we examined the weightings underlying mean scores, means of standardized variables, factor loadings, reliability-maximizing weights, and equally effective weights on common verbal creative thinking tasks. Both empirical and simulated factor determinacy, as well as Gilmer-Feldt's composite reliability, were mostly good to excellent (i.e., > .80) across two task types (Alternate Uses and Creative Word Association), eight samples of data, and all weighting approaches. Person-level validity findings were also highly comparable across weighting approaches. We discuss in detail the observed nuances and challenges of the different weightings and the question of using composites vs. factor scores.
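As an illustrative aside (not code from the article), the Python sketch below contrasts three ways of combining per-response semantic distance scores from several semantic spaces: an unweighted mean of raw scores, a mean of standardized scores, and a generic weighted composite. All data and weights are simulated placeholders.

```python
# Illustrative sketch only: combining semantic distance scores from several
# semantic spaces into one composite per response. Data and weights are
# simulated placeholders, not values from the study.
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(loc=0.8, scale=0.1, size=(100, 5))  # rows = responses, cols = spaces

# 1) Unweighted mean of raw scores: spaces with larger variance implicitly
#    contribute more to the composite.
mean_composite = scores.mean(axis=1)

# 2) Mean of standardized variables: each space contributes in equal SD units.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)
standardized_composite = z.mean(axis=1)

# 3) Generic weighted composite: the weights could come from factor loadings
#    or a reliability-maximizing procedure (placeholder values here).
weights = np.array([0.25, 0.20, 0.20, 0.15, 0.20])
weighted_composite = (z @ weights) / weights.sum()
```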
Award ID(s):
1920653
PAR ID:
10352010
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
European Journal of Psychological Assessment
ISSN:
1015-5759
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Scoring divergent thinking tasks opens up multiple avenues and possibilities, and hence decisions researchers have to make. While some scholars postulate that scoring should focus on the best ideas provided, the measurement of the best responses (e.g., "top scoring") comes with challenges. More specifically, compared to the average quality across all responses, top scoring uses less information (the "bad" ideas are thrown away), which decreases reliability. To resolve this issue, this article introduces a multidimensional top-scoring approach, analogous to linear growth modeling, that retains the information provided by all responses (best ideas and "bad" ideas). Across two studies, using both subjective human ratings and semantic distance originality scoring of responses to over a dozen divergent thinking tasks, we demonstrated that Maximum (the best idea) and Top2 (the two best ideas) scoring could surpass the typically applied average scoring in measurement precision when the "bad" ideas' originality is used as auxiliary information (i.e., additional information in the analysis). We thus recommend retaining all ideas when scoring divergent thinking tasks, and we discuss the potential this new approach holds for creativity research and practice.
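As a minimal illustration of the scoring variants contrasted above (not the authors' code), the Python sketch below computes average, Maximum, and Top2 scores from one person's response-level originality values; the modeling step that uses the remaining responses as auxiliary information is not reproduced.

```python
# Illustrative sketch: descriptive person-level summaries of response-level
# originality. The "bad" ideas are kept and would enter the authors' model
# as auxiliary information; only the basic scoring variants are shown here.
import numpy as np

def score_person(originality):
    """originality: per-response originality scores for one person."""
    o = np.sort(np.asarray(originality, dtype=float))[::-1]  # best ideas first
    return {
        "average": o.mean(),   # classic average scoring across all ideas
        "maximum": o[0],       # single best idea
        "top2": o[:2].mean(),  # mean of the two best ideas
    }

print(score_person([0.42, 0.85, 0.30, 0.77]))
```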
  2. Creativity research often relies on human raters to judge the novelty of participants’ responses on open-ended tasks, such as the Alternate Uses Task (AUT). Albeit useful, manual ratings are subjective and labor intensive. To address these limitations, researchers increasingly use automatic scoring methods based on a natural language processing technique for quantifying the semantic distance between words. However, many methodological choices remain open on how to obtain semantic distance scores for ideas, which can significantly impact reliability and validity. In this project, we propose a new semantic distance-based method, maximum associative distance (MAD), for assessing response novelty in AUT. Within a response, MAD uses the semantic distance of the word that is maximally remote from the prompt word to reflect response novelty. We compare the results from MAD with other competing semantic distance-based methods, including element-wise-multiplication—a commonly used compositional model—across three published datasets including a total of 447 participants. We found MAD to be more strongly correlated with human creativity ratings than the competing methods. In addition, MAD scores reliably predict external measures such as openness to experience. We further explored how idea elaboration affects the performance of various scoring methods and found that MAD is closely aligned with human raters in processing multi-word responses. The MAD method thus improves the psychometrics of semantic distance for automatic creativity assessment, and it provides clues about what human raters find creative about ideas. 
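A minimal sketch of the MAD idea follows, assuming word vectors are available; the toy embeddings, words, and helper names are made up for illustration and are not from the study.

```python
# Sketch of maximum associative distance (MAD): within one response, keep the
# largest prompt-to-word semantic distance. Toy random embeddings stand in
# for real vectors (e.g., GloVe) purely so the example runs.
import numpy as np

def cosine_distance(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def mad_score(prompt, response_words, embeddings):
    """Largest semantic distance between the prompt and any response word."""
    p = embeddings[prompt]
    return max(cosine_distance(p, embeddings[w]) for w in response_words)

rng = np.random.default_rng(42)
toy_embeddings = {w: rng.normal(size=50) for w in ["brick", "build", "tiny", "doorstop"]}
print(mad_score("brick", ["build", "tiny", "doorstop"], toy_embeddings))
```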
  3. Creativity research requires assessing the quality of ideas and products. In practice, conducting creativity research often involves asking several human raters to judge participants' responses to creativity tasks, such as judging the novelty of ideas from the alternate uses task (AUT). Although such subjective scoring methods have proved useful, they have two inherent limitations: labor cost (raters typically code thousands of responses) and subjectivity (raters vary in their perceptions and preferences), both of which raise classic psychometric threats to reliability and validity. We sought to address the limitations of subjective scoring by capitalizing on recent developments in automated scoring of verbal creativity via semantic distance, a computational method that uses natural language processing to quantify the semantic relatedness of texts. In five studies, we compare the top-performing semantic models (e.g., GloVe, continuous bag of words) previously shown to have the highest correspondence to human relatedness judgements. We assessed these semantic models in relation to human creativity ratings from a canonical verbal creativity task (AUT; Studies 1–3) and novelty/creativity ratings from two word association tasks (Studies 4–5). We find that a latent semantic distance factor, composed of the common variance from five semantic models, reliably and strongly predicts human creativity and novelty ratings across a range of creativity tasks. We also replicate an established experimental effect in the creativity literature (i.e., the serial order effect) and show that semantic distance correlates with other creativity measures, demonstrating convergent validity. We provide an open platform to efficiently compute semantic distance, including tutorials and documentation (https://osf.io/gz4fc/).
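As a simplified stand-in for the latent semantic distance factor described above, the sketch below extracts a single factor from simulated scores of five semantic models with scikit-learn; the original work uses latent variable modeling on real model outputs, so this only illustrates the idea.

```python
# Simplified illustration: recover one common factor from five simulated
# "semantic model" scores that share variance plus model-specific noise.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
true_factor = rng.normal(size=(200, 1))
loadings = rng.uniform(0.6, 0.9, size=(1, 5))
model_scores = true_factor @ loadings + rng.normal(scale=0.4, size=(200, 5))

fa = FactorAnalysis(n_components=1, random_state=0)
latent_distance = fa.fit_transform(model_scores).ravel()

# How well the estimated factor tracks the simulated common factor
print(np.corrcoef(latent_distance, true_factor.ravel())[0, 1])
```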
  4. Metaphor is crucial in human cognition and creativity, facilitating abstract thinking, analogical reasoning, and idea generation. Typically, human raters manually score the originality of responses to creative thinking tasks – a laborious and error-prone process. Previous research sought to remedy these risks by scoring creativity tasks automatically using semantic distance and large language models (LLMs). Here, we extend research on automatic creativity scoring to metaphor generation – the ability to creatively describe episodes and concepts using nonliteral language. Metaphor is arguably more abstract and naturalistic than prior targets of automated creativity assessment. We collected 4,589 responses from 1,546 participants to various metaphor prompts and corresponding human creativity ratings. We fine-tuned two open-source LLMs (RoBERTa and GPT-2) – effectively “teaching” them to score metaphors like humans – before testing their ability to accurately assess the creativity of new metaphors. Results showed both models reliably predicted new human creativity ratings (RoBERTa r = .72, GPT-2 r = .70), significantly more strongly than semantic distance (r = .42). Importantly, the fine-tuned models generalized accurately to metaphor prompts they had not been trained on (RoBERTa r = .68, GPT-2 r = .63). We provide open access to the fine-tuned models, allowing researchers to assess metaphor creativity in a reproducible and timely manner. 
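A hedged sketch of this kind of fine-tuning setup is shown below using the Hugging Face transformers API with roberta-base; the example metaphors, ratings, and single backward pass are placeholders and do not reproduce the authors' training pipeline or hyperparameters.

```python
# Illustrative regression-style fine-tuning step: predict a continuous
# creativity rating from a metaphor. Texts and ratings are made up.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=1, problem_type="regression"
)

metaphors = ["Her mind was a locked library.", "The exam was a marathon."]
ratings = torch.tensor([[3.8], [2.1]])  # placeholder human creativity ratings

batch = tokenizer(metaphors, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=ratings)  # MSE loss is used for regression
outputs.loss.backward()  # computes gradients; a full loop would add an optimizer step
```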
  5. Automated scoring is a current hot topic in creativity research. However, most research has focused on the English language and popular verbal creative thinking tasks, such as the alternate uses task. Therefore, in this study, we present a large language model approach for automated scoring of a scientific creative thinking task that assesses divergent ideation in experimental tasks in the German language. Participants are required to generate alternative explanations for an empirical observation. This work analyzed a total of 13,423 unique responses. To predict human ratings of originality, we used XLM-RoBERTa (Cross-lingual Language Model-RoBERTa), a large, multilingual model. The prediction model was trained on 9,400 responses. Results showed a strong correlation between model predictions and human ratings in a held-out test set (n = 2,682; r = 0.80; 95% CI [0.79, 0.81]). These promising findings underscore the potential of large language models for automated scoring of scientific creative thinking in the German language. We encourage researchers to further investigate automated scoring of other domain-specific creative thinking tasks.
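The held-out evaluation reported above can be illustrated as follows: a Pearson correlation between model predictions and human ratings, with an approximate 95% confidence interval via the Fisher z transformation. The data in this sketch are simulated, not the study's.

```python
# Illustrative evaluation step: correlation between predicted and human
# originality ratings on a held-out set, with a Fisher-z 95% CI.
# All numbers are simulated for demonstration.
import numpy as np

rng = np.random.default_rng(2)
human = rng.normal(size=2682)
predicted = 0.8 * human + rng.normal(scale=0.6, size=human.size)

r = np.corrcoef(human, predicted)[0, 1]
z = np.arctanh(r)
se = 1.0 / np.sqrt(human.size - 3)
lo, hi = np.tanh([z - 1.96 * se, z + 1.96 * se])
print(f"r = {r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```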