Title: Using ChatGPT to Generate Human-Value User Stories as Inspirational Triggers
Recent work has recognized the importance of developing and deploying software systems that reflect human values and has explored different approaches for eliciting these values from stakeholders. However, prior studies have also shown that it can be challenging for stakeholders to specify a diverse set of product-related human values. In this paper, we therefore explore the use of ChatGPT to generate user stories that describe candidate human values. These generated stories provide inspiration for stakeholder discussions and enrich the human-created user stories. We engineer a series of ChatGPT prompts to retrieve a list of common stakeholders and candidate features for a targeted product, and then, for each pairwise combination of role and feature, and for each individual Schwartz value, we issue an additional prompt to generate a candidate user story reflecting that value. We present the candidate user stories to stakeholders and, as part of a creative requirements engineering session, ask them to assess and prioritize the generated stories and then use them as inspiration for discussing and specifying their own product-related human values. Through a series of focus groups, we compare the human values created by stakeholders with and without the benefit of the ChatGPT examples. Results are evaluated with respect to coverage of values, clarity of expression, and internal completeness, as well as through feedback from our participants. Our analysis shows that the ChatGPT-generated user stories provide creativity triggers that help stakeholders specify human values for a product.
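The pipeline described in the abstract can be illustrated with a minimal sketch: first ask ChatGPT for stakeholder roles and candidate features, then issue one prompt per (role, feature, Schwartz value) combination. The prompt wording, the gpt-3.5-turbo model choice, the example product, and the use of the OpenAI Python client are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the prompt pipeline described above (not the authors' exact
# prompts): enumerate roles x features x Schwartz values and ask ChatGPT for one
# user story per combination. Model name and prompt wording are assumptions.
from itertools import product
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SCHWARTZ_VALUES = [
    "self-direction", "stimulation", "hedonism", "achievement", "power",
    "security", "conformity", "tradition", "benevolence", "universalism",
]

def ask(prompt: str) -> str:
    """Send a single prompt to the chat model and return the text reply."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative choice
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

product_name = "a ride-sharing app"  # hypothetical target product

# Step 1: candidate stakeholders and features (parsed naively, one per line).
roles = ask(f"List common stakeholder roles for {product_name}, one per line.").splitlines()
features = ask(f"List candidate features for {product_name}, one per line.").splitlines()

# Step 2: one candidate user story per (role, feature, value) combination.
stories = []
for role, feature, value in product(roles, features, SCHWARTZ_VALUES):
    prompt = (
        f"Write a single user story for {product_name} from the perspective of "
        f"{role}, about the feature '{feature}', reflecting the Schwartz value "
        f"of {value}. Use the form: As a <role>, I want <goal> so that <benefit>."
    )
    stories.append({"role": role, "feature": feature, "value": value,
                    "story": ask(prompt)})
```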
Award ID(s):
2131515
PAR ID:
10471800
Author(s) / Creator(s):
Publisher / Repository:
IEEE
Date Published:
Journal Name:
31st IEEE International Requirements Engineering Conference, RE 2023 - Workshops, Hannover, Germany, September 4-5, 2023
ISBN:
979-8-3503-2691-8
Page Range / eLocation ID:
52 to 61
Subject(s) / Keyword(s):
Human values; Creative requirements elicitation; User stories
Format(s):
Medium: X
Location:
Hannover, Germany
Sponsoring Org:
National Science Foundation
More Like this
  1. Harmful textual content is pervasive on social media, poisoning online communities and negatively impacting participation. A common approach to this issue is developing detection models that rely on human annotations. However, the tasks required to build such models expose annotators to harmful and offensive content and may require significant time and cost to complete. Generative AI models have the potential to understand and detect harmful textual content. We used ChatGPT to investigate this potential and compared its performance with MTurker annotations for three frequently discussed concepts related to harmful textual content on social media: Hateful, Offensive, and Toxic (HOT). We designed five prompts to interact with ChatGPT and conducted four experiments eliciting HOT classifications. Our results show that ChatGPT can achieve an accuracy of approximately 80% when compared to MTurker annotations. Specifically, the model's classifications agree with human annotations more consistently for non-HOT comments than for HOT comments. Our findings also suggest that ChatGPT's classifications align with the provided HOT definitions. However, ChatGPT classifies "hateful" and "offensive" as subsets of "toxic." Moreover, the choice of prompts used to interact with ChatGPT impacts its performance. Based on these insights, our study provides several meaningful implications for employing ChatGPT to detect HOT content, particularly regarding the reliability and consistency of its performance, its understanding of and reasoning about the HOT concepts, and the impact of prompts on its performance. Overall, our study provides guidance on the potential of using generative AI models to moderate large volumes of user-generated textual content on social media.
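A hedged sketch of the kind of zero-shot HOT classification described above; the prompt wording and model choice are assumptions rather than the study's five actual prompts.

```python
# Minimal sketch of zero-shot HOT classification with ChatGPT (illustrative
# prompt wording and model choice, not the study's exact prompts).
from openai import OpenAI

client = OpenAI()

HOT_PROMPT = (
    "Classify the following social media comment for each of these concepts: "
    "Hateful, Offensive, Toxic. Answer in the form "
    "'hateful=<yes|no>, offensive=<yes|no>, toxic=<yes|no>'.\n\n"
    "Comment: {comment}"
)

def classify_hot(comment: str) -> str:
    """Return ChatGPT's HOT labels for a single comment."""
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative
        messages=[{"role": "user", "content": HOT_PROMPT.format(comment=comment)}],
        temperature=0,  # keep labeling output as deterministic as possible
    )
    return resp.choices[0].message.content

print(classify_hot("example comment to be labeled"))
```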
  2. Recently, there have been significant advances and wide-scale use of generative AI in natural language generation. Models such as OpenAI's GPT3 and Meta's LLaMA are widely used in chatbots, to summarize documents, and to generate creative content. These advances raise concerns about abuses of these models, especially in social media settings, such as large-scale generation of disinformation, manipulation campaigns that use AI-generated content, and personalized scams. We used stylometry (the analysis of style in natural language text) to analyze the style of AI-generated text. Specifically, we applied an existing authorship verification (AV) model, which predicts whether two documents were written by the same author, to texts generated by GPT2, GPT3, ChatGPT, and LLaMA. Our AV model was trained only on human-written text and has been used effectively in social media settings to analyze cases of abuse. We generated texts by providing the language models with fanfiction snippets and prompting them to complete the rest in the same writing style as the original snippet. We then applied the AV model across the texts generated by the language models and the human-written texts to analyze the similarity of the writing styles between these texts. We found that texts generated with GPT2 had the highest similarity to the human texts. Texts generated by GPT3 and ChatGPT were very different from the human snippet, and were similar to each other. LLaMA-generated texts had some similarity to the original snippet but also had similarities with other LLaMA-generated texts and texts from other models. We then conducted a feature analysis to identify the features that drive these similarity scores. This analysis helped us answer questions such as which features distinguish the language style of language models and humans, which features differ across models, and how these linguistic features change across language model versions. The dataset and the source code used in this analysis have been made public to allow for further analysis of new language models.
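The comparison workflow (style-preserving completion followed by pairwise authorship-verification scoring) might look roughly like the sketch below; `complete_in_style` and `av_similarity` are hypothetical placeholders standing in for the generation calls and the trained AV model, not real library APIs.

```python
# Sketch of the comparison workflow: prompt several models to continue a
# fanfiction snippet in its original style, then score every pair of texts
# with an authorship-verification model. `av_similarity` is a hypothetical
# placeholder for the trained AV model described above.
from itertools import combinations

def complete_in_style(model_generate, snippet: str) -> str:
    """model_generate is any callable that maps a prompt to generated text."""
    prompt = (f"Continue the following story in the same writing style as the "
              f"original snippet:\n\n{snippet}")
    return model_generate(prompt)

def av_similarity(text_a: str, text_b: str) -> float:
    """Placeholder: the study's AV model returns a same-author similarity score."""
    raise NotImplementedError

def pairwise_style_scores(texts: dict[str, str]) -> dict[tuple[str, str], float]:
    """Score every pair of human-written and model-generated texts."""
    return {(a, b): av_similarity(texts[a], texts[b])
            for a, b in combinations(texts, 2)}
```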
  3. Zervas, E. (Ed.)
    Recently, there has been a surge in general-purpose language models, with ChatGPT being the most advanced model to date. These models are primarily used for generating text in response to user prompts on various topics. How accurate and relevant ChatGPT's generated text is on specific topics still needs to be validated, as the model is designed for general conversation rather than context-specific purposes. This study explores how ChatGPT, as a general-purpose model, performs in the context of a real-world challenge such as climate change compared to ClimateBert, a state-of-the-art language model specifically trained on climate-related data from various sources, including texts, news, and papers. ClimateBert is fine-tuned on five different NLP classification tasks, making it a valuable benchmark for comparison with ChatGPT on various NLP tasks. The main results show that for climate-specific NLP tasks, ClimateBert outperforms ChatGPT.
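A rough sketch of how such a comparison could be run for a single climate-related classification input; the Hugging Face model id and the ChatGPT prompt are assumptions for illustration, not the study's exact tasks or checkpoints.

```python
# Sketch: run one climate-related classification input through a ClimateBert
# checkpoint and through ChatGPT. The Hugging Face model id is an assumption
# (ClimateBert variants are published under the "climatebert" namespace) and
# the ChatGPT prompt is illustrative.
from transformers import pipeline
from openai import OpenAI

TEXT = "The company commits to net-zero emissions across its supply chain by 2040."

# ClimateBert-style classifier (model id assumed; swap in the task-specific checkpoint).
climate_clf = pipeline("text-classification",
                       model="climatebert/distilroberta-base-climate-detector")
print("ClimateBert:", climate_clf(TEXT))

# ChatGPT zero-shot on the same input.
client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative
    messages=[{"role": "user", "content":
               f"Is the following paragraph about climate change? Answer yes or no.\n\n{TEXT}"}],
)
print("ChatGPT:", resp.choices[0].message.content)
```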
  4. Large language model (LLM) applications, such as ChatGPT, are a powerful tool for online information-seeking (IS) and problem-solving tasks. However, users still face challenges initializing and refining prompts, and their cognitive barriers and biased perceptions further impede task completion. These issues reflect broader challenges identified within the fields of IS and interactive information retrieval (IIR). To address them, our approach integrates task context and user perceptions into human-ChatGPT interactions through prompt engineering. We developed a ChatGPT-like platform with supportive functions, including perception articulation, prompt suggestion, and conversation explanation. Findings from our user study demonstrate that the supportive functions help users manage expectations, reduce cognitive load, refine prompts more effectively, and increase engagement. This research enhances our understanding of how to design proactive and user-centric systems with LLMs. It offers insights into evaluating human-LLM interactions and highlights potential challenges for underserved users.
  5. Instruction-tuned large language models (LLMs), such as ChatGPT, have led to promising zero-shot performance in discriminative natural language understanding (NLU) tasks. This involves querying the LLM using a prompt containing the question and the candidate labels to choose from. ChatGPT's question-answering capabilities arise from its pre-training on large amounts of human-written text, as well as its subsequent fine-tuning on human preferences, which motivates us to ask: does ChatGPT also inherit humans' cognitive biases? In this paper, we study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer. We have two main findings: i) ChatGPT's decisions are sensitive to the order of labels in the prompt; ii) ChatGPT is clearly more likely to select labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions. We release the source code at https://github.com/wangywUST/PrimacyEffectGPT.
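A small sketch of how the primacy effect can be probed: hold the question fixed, permute the label order in the prompt, and check whether the selected label changes. The prompt template and model name are illustrative, not the paper's exact setup.

```python
# Sketch of probing the primacy effect: present the same question with the
# candidate labels in different orders and record which label ChatGPT picks.
# Prompt wording and model choice are illustrative assumptions.
from itertools import permutations
from openai import OpenAI

client = OpenAI()

def pick_label(question: str, labels: list[str]) -> str:
    """Ask the model to choose exactly one label from the given ordered list."""
    options = ", ".join(labels)
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # illustrative
        messages=[{"role": "user", "content":
                   f"{question}\nChoose exactly one of: {options}. Reply with the label only."}],
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

question = "What is the sentiment of: 'The battery lasts forever but the screen scratches easily.'"
labels = ["positive", "negative", "neutral"]

# If the answer varies with label order, position (not content) is influencing the choice.
for order in permutations(labels):
    print(list(order), "->", pick_label(question, list(order)))
```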