Large language models offer new ways of empowering people to program robot applications, namely code generation via prompting. However, the code generated by LLMs is susceptible to errors. This work reports a preliminary exploration that empirically characterizes common errors produced by LLMs in robot programming. We categorize these errors into two phases: interpretation and execution. In this work, we focus on errors in execution and observe that they are caused by LLMs being "forgetful" of key information provided in user prompts. Based on this observation, we propose prompt engineering tactics designed to reduce errors in execution. We then demonstrate the effectiveness of these tactics with three language models: ChatGPT, Bard, and LLaMA-2. Finally, we discuss lessons learned from using LLMs in robot programming and call for benchmarking of LLM-powered end-user development of robot applications.
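The abstract does not list the tactics themselves; as one hedged illustration, a common way to counter this kind of "forgetfulness" is to restate the key constraints from the user's request immediately before asking for code. A minimal sketch using the OpenAI Python client (the model name, task, and constraints are placeholders, not taken from the paper):

```python
# Illustrative sketch only: restate key task constraints right before the
# code-generation request so the model is less likely to drop them.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

task = "Move the robot arm to pick up the red cube and place it in the bin."
key_constraints = [
    "Never exceed a joint speed of 0.5 rad/s.",          # placeholder constraint
    "Always open the gripper before approaching an object.",
]

prompt = (
    f"Task: {task}\n"
    "Before writing code, restate these constraints and obey them:\n"
    + "\n".join(f"- {c}" for c in key_constraints)
    + "\nNow write the Python control code."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # any chat model; not specified by the paper
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```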
                            Task Supportive and Personalized Human-Large Language Model Interaction: A User Study
                        
                    
    
Large language model (LLM) applications, such as ChatGPT, are a powerful tool for online information-seeking (IS) and problem-solving tasks. However, users still face challenges in initializing and refining prompts, and their cognitive barriers and biased perceptions further impede task completion. These issues reflect broader challenges identified within the fields of IS and interactive information retrieval (IIR). To address these challenges, our approach integrates task context and user perceptions into human-ChatGPT interactions through prompt engineering. We developed a ChatGPT-like platform integrated with supportive functions, including perception articulation, prompt suggestion, and conversation explanation. Findings from our user study demonstrate that the supportive functions help users manage expectations, reduce cognitive load, refine prompts more effectively, and increase engagement. This research deepens our understanding of how to design proactive and user-centric systems with LLMs. It offers insights into evaluating human-LLM interactions and highlights potential challenges for underserved users.
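The platform's implementation is not described in this record; as a sketch of how one supportive function, prompt suggestion, might sit on top of a ChatGPT-like backend (the function name, prompt wording, and model are assumptions, not the study's code):

```python
# Hedged sketch of a "prompt suggestion" supportive function: given the user's
# draft prompt and task context, ask the model to propose clearer rewordings.
from openai import OpenAI

client = OpenAI()

def suggest_prompts(draft: str, task_context: str, n: int = 3) -> list[str]:
    instruction = (
        f"The user is working on this task: {task_context}\n"
        f"Their draft prompt is: {draft}\n"
        f"Suggest {n} clearer, more specific rewrites, one per line."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[{"role": "user", "content": instruction}],
    )
    return [line.strip("- ").strip()
            for line in reply.choices[0].message.content.splitlines()
            if line.strip()]

print(suggest_prompts("help me with my essay",
                      "writing a persuasive essay on recycling"))
```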
        
    
- Award ID(s): 2106152
- PAR ID: 10543485
- Publisher / Repository: ACM
- Date Published:
- ISBN: 9798400704345
- Page Range / eLocation ID: 370 to 375
- Format(s): Medium: X
- Location: Sheffield, United Kingdom
- Sponsoring Org: National Science Foundation
More Like this
- Instruction-tuned large language models (LLMs), such as ChatGPT, have shown promising zero-shot performance on discriminative natural language understanding (NLU) tasks. This involves querying the LLM with a prompt containing the question and the candidate labels to choose from. The question-answering capabilities of ChatGPT arise from its pre-training on large amounts of human-written text, as well as its subsequent fine-tuning on human preferences, which motivates us to ask: does ChatGPT also inherit humans' cognitive biases? In this paper, we study the primacy effect of ChatGPT: the tendency to select labels at earlier positions as the answer. We have two main findings: i) ChatGPT's decisions are sensitive to the order of labels in the prompt; ii) ChatGPT has a markedly higher chance of selecting labels at earlier positions as the answer. We hope that our experiments and analyses provide additional insights into building more reliable ChatGPT-based solutions. We release the source code at https://github.com/wangywUST/PrimacyEffectGPT. (A minimal sketch of this label-order experiment appears after this list.)
- The rise of e-commerce and social networking platforms has led to an increase in the disclosure of personal health information within user-generated content. This study investigates the application of large language models (LLMs) to detect and sanitize sensitive health data shared by users across platforms such as Amazon, patient.info, and Facebook. We propose a methodology that leverages LLMs to evaluate both the sensitivity of disclosed information and the platform-specific semantics of the content. Through prompt engineering, our method identifies sensitive information and rephrases it to minimize disclosure while preserving content similarity. ChatGPT serves as the LLM in this study due to its versatility. Empirical results suggest that ChatGPT can reliably assign sensitivity scores to user-generated text and generate sanitized versions that effectively preserve the original meaning. (A hedged sketch of this score-and-rephrase prompting appears after this list.)
- Recent work has recognized the importance of developing and deploying software systems that reflect human values and has explored different approaches for eliciting these values from stakeholders. However, prior studies have also shown that it can be challenging for stakeholders to specify a diverse set of product-related human values. In this paper we therefore explore the use of ChatGPT for generating user stories that describe candidate human values. These generated stories provide inspiration for stakeholder discussions and enrich the human-created user stories. We engineer a series of ChatGPT prompts to retrieve a list of common stakeholders and candidate features for a targeted product; then, for each pairwise combination of role and feature, and for each individual Schwartz value, we issue an additional prompt to generate a candidate user story reflecting that value (a simplified sketch of this prompting loop appears after this list). We present the candidate user stories to stakeholders and, as part of a creative requirements engineering session, ask them to assess and prioritize the generated user stories, then use them as inspiration for discussing and specifying their own product-related human values. Through a series of focus groups, we compare the human values created by stakeholders with and without the benefit of the ChatGPT examples. Results are evaluated with respect to coverage of values, clarity of expression, and internal completeness, and through feedback from our participants. Our analysis shows that the ChatGPT-generated user stories provide creativity triggers that help stakeholders specify human values for a product.
- This study evaluated ChatGPT's ability to understand causal language in science papers and news by testing its accuracy on a task of labeling the strength of a claim as causal, conditional causal, correlational, or no relationship. The results show that ChatGPT still trails existing fine-tuned BERT models by a large margin. ChatGPT also had difficulty understanding conditional causal claims mitigated by hedges. However, this weakness may be leveraged to improve the clarity of human annotation guidelines. Chain-of-thought prompting was faithful and helpful for improving prompt performance, but finding the optimal prompt is difficult given inconsistent results and the lack of an effective method for establishing cause and effect between prompts and outcomes, suggesting caution when generalizing prompt engineering results across tasks or models. (An illustrative labeling prompt appears after this list.)
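The label-order study above releases its own code; as a quick mental model, the core experiment amounts to asking the same question with the candidate labels permuted and checking whether the chosen label shifts. A minimal Python sketch (the question, labels, and model name are placeholder assumptions, not the paper's setup):

```python
# Illustrative primacy-effect check: query the model with every ordering of
# the candidate labels and compare the answers it picks.
import itertools
from openai import OpenAI

client = OpenAI()

question = "The hotel was fine but the staff were unhelpful. What is the sentiment?"
labels = ["positive", "neutral", "negative"]  # made-up labels for the example

for ordering in itertools.permutations(labels):
    prompt = (
        f"{question}\n"
        "Answer with exactly one of the following labels: " + ", ".join(ordering)
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    print(ordering, "->", reply.choices[0].message.content.strip())
```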
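For the health-disclosure paper, a score-and-rephrase prompt might look like the following. This is a rough sketch in the spirit of the abstract; the 1-5 scale, prompt wording, example post, and model choice are assumptions, not the authors' method:

```python
# Hedged sketch of prompt-based sensitivity scoring plus sanitization of
# user-generated text from a named platform.
from openai import OpenAI

client = OpenAI()

def sanitize(post: str, platform: str) -> str:
    prompt = (
        f"The following text was posted on {platform}:\n{post}\n\n"
        "1. Rate how sensitive its health disclosures are on a 1-5 scale.\n"
        "2. Rewrite it so the sensitive details are removed but the overall "
        "meaning and tone are preserved.\n"
        "Return the rating, then the rewrite on a new line."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    return reply.choices[0].message.content

print(sanitize("Just picked up my insulin refill at the Elm St pharmacy.", "Facebook"))
```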
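The human-values study describes a prompting loop over stakeholder roles, product features, and Schwartz values. A simplified sketch of such a loop, with made-up roles and features for a hypothetical transit app (the prompt wording is not the authors'):

```python
# Sketch of generating one candidate user story per (role, feature, value)
# triple, as the abstract describes at a high level.
import itertools
from openai import OpenAI

client = OpenAI()

roles = ["commuter", "city planner"]             # assumed stakeholder roles
features = ["route planning", "fare payment"]    # assumed product features
schwartz_values = ["security", "benevolence", "self-direction"]

for role, feature, value in itertools.product(roles, features, schwartz_values):
    prompt = (
        f"Write one user story for a public-transit app. Role: {role}. "
        f"Feature: {feature}. The story should reflect the Schwartz value "
        f"'{value}'. Use the form: As a ..., I want ..., so that ..."
    )
    reply = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[{"role": "user", "content": prompt}],
    )
    print(reply.choices[0].message.content.strip())
```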
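Finally, the causal-language study frames claim labeling as a four-way classification with chain-of-thought prompting. An illustrative prompt is sketched below; the claim sentence and wording are invented for the example and do not reproduce the study's annotation guideline:

```python
# Illustrative chain-of-thought prompt for labeling claim strength as
# causal, conditional causal, correlational, or no relationship.
from openai import OpenAI

client = OpenAI()

claim = ("Participants who slept more than eight hours may have had "
         "lower stress scores.")

prompt = (
    f"Claim: {claim}\n"
    "First, reason step by step about hedging words and the strength of the "
    "stated relationship. Then answer with exactly one label: "
    "causal, conditional causal, correlational, or no relationship."
)

reply = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model
    messages=[{"role": "user", "content": prompt}],
)
print(reply.choices[0].message.content)
```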