<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Forgetful Large Language Models: Lessons Learned from Using LLMs in Robot Programming</title></titleStmt>
			<publicationStmt>
				<publisher>AAAI</publisher>
				<date>01/22/2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10525517</idno>
					<idno type="doi">10.1609/aaaiss.v2i1.27721</idno>
					<title level='j'>Proceedings of the AAAI Symposium Series</title>
<idno type="ISSN">2994-4317</idno>
<biblScope unit="volume">2</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Juo-Tung Chen</author><author>Chien-Ming Huang</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Large language models offer new ways of empowering people to program robot applications-namely, code generation via prompting. However, the code generated by LLMs is susceptible to errors. This work reports a preliminary exploration that empirically characterizes common errors produced by LLMs in robot programming. We categorize these errors into two phases: interpretation and execution. In this work, we focus on errors in execution and observe that they are caused by LLMs being “forgetful” of key information provided in user prompts. Based on this observation, we propose prompt engineering tactics designed to reduce errors in execution. We then demonstrate the effectiveness of these tactics with three language models: ChatGPT, Bard, and LLaMA-2. Finally, we discuss lessons learned from using LLMs in robot programming and call for the benchmarking of LLM-powered end-user development of robot applications.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Introduction</head><p>Programmable robots have enabled a wide range of applications, from flexible automation to people-facing services. However, programming robot applications effectively requires years of training and experience. The paradigm of end-user programming lowers the barriers to robot programming <ref type="bibr">(Ajaykumar, Steele, and Huang 2021)</ref> and empowers end users to develop custom robot applications without substantial engineering training. The rise of large language models introduces new opportunities in this paradigm by offering a natural interface through which end users may program robots <ref type="bibr">(Vemprala et al. 2023)</ref>.</p><p>However, LLM-powered code generation is not error-free due to its nondeterministic nature <ref type="bibr">(Ouyang et al. 2023)</ref>. Despite extensive research efforts aimed at assessing the effectiveness and accuracy of LLM-based code generation tools, certain limitations persist. For instance, these tools may produce inconsistent and occasionally incorrect code outputs. Existing studies have employed approaches such as benchmark evaluations <ref type="bibr">(Liu et al. 2023a;</ref><ref type="bibr">Chen et al. 2021;</ref><ref type="bibr">Hammond Pearce et al. 2021</ref>) and systematic empirical assessments <ref type="bibr">(Liu et al. 2023b</ref>) to explore the capabilities of and challenges in LLM-powered code generation. 
While these investigations have illuminated various errors and obstacles that may arise during the code generation process, they often fell short of providing comprehensive solutions to enhance code generation stability and minimize the occurrence of errors.</p><p>It is worth noting that existing research often focuses primarily on general benchmarking errors, aiming to identify common pitfalls and shortcomings in LLM-generated code; therefore, these studies may not fully capture the specific nuances and intricacies of code specific to a specialized domain such as robotics. As a result, while such benchmark evaluations provide valuable insights into the overall performance of LLMs, they may not comprehensively address the unique challenges posed by code generation for robotic applications.</p><p>As a step toward developing the empirical science of incorporating LLMs into robot programming processes, in this work, we sought to explore two research questions: 1) What are the common errors produced by LLMs in end-user robot programming? and 2) What practical strategies can be employed to mitigate and reduce these errors? To ground our exploration, we designed a sequential manipulation task (Figure <ref type="figure">1</ref>) and tested three language models-ChatGPT, Bard, and LLaMA-2-to assess their capabilities in generating code to complete the task.</p><p>Our key findings are 1) LLMs are "forgetful" and do not consider information provided in the system prompt as hard fact; 2) the forgetfulness of LLMs leads to errors in code execution; 3) in addition to execution errors, LLMs make various errors (e.g., syntax errors, missing necessary libraries) that cause failures in code interpretation; and 4) simple strategies-such as reinforcing task constraints in the objective prompt and extracting numerical task contexts from the system prompt and storing them in data structures-seem to notably reduce execution errors caused by LLM forgetfulness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment 1: Identifying Common Errors</head><p>Programming Task In order to assess the code generation ability and performance of LLMs in robot programming, we set up a sequential manipulation task. Our experimental setup includes a UR5 manipulator paired with a webcam for basic perception via AR markers, allowing for the registration of task objects into a virtual workspace for precise and accurate motion planning via MoveIt. The sequential manipulation task involves the robot picking up a graduated cylinder and pouring its contents into a beaker; this task is a common step in biochemical lab tests<ref type="foot">foot_0</ref>. The high-level procedure of the manipulation task involves:</p><p>1. Moving the robot to a home (neutral) position; 2. Reaching out to the graduated cylinder; 3. Grasping the graduated cylinder at its midpoint; 4. Performing the pouring action (including moving to the target location and rotating the robot's end effector); and 5. Placing the cylinder back in its original position.</p><p>AAAI Fall Symposium Series (FSS-23)</p><p>Figure <ref type="figure">1</ref>: Sequential task execution by the robotic system. The five stages encompass homing, reaching the cylinder, grasping it, pouring its contents into a beaker, and returning the cylinder to its initial position.</p></div>
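The five-stage procedure above can be sketched as a single routine against a hypothetical high-level manipulation API. All class, method, and object names below are illustrative stand-ins, not the actual function library used in the experiments:

```python
# Sketch of the five-stage sequential manipulation task, assuming a
# hypothetical high-level robot API. Every name here is illustrative.

class FunctionLib:
    """Minimal stand-in for a high-level manipulation API; it only
    records the action sequence instead of moving a real robot."""
    def __init__(self):
        self.log = []

    def go_home(self):
        self.log.append("home")

    def reach(self, marker_id):
        self.log.append(f"reach:{marker_id}")

    def grasp_at_midpoint(self, obj):
        self.log.append(f"grasp:{obj}")

    def pour_into(self, target_marker):
        self.log.append(f"pour:{target_marker}")

    def place_back(self, obj):
        self.log.append(f"place:{obj}")


def pour_task(lib, cylinder_marker=15, beaker_marker=7):
    """Execute the pick-pour-return sequence described in the paper."""
    lib.go_home()                            # 1. home (neutral) position
    lib.reach(cylinder_marker)               # 2. reach the graduated cylinder
    lib.grasp_at_midpoint("cylinder_25mL")   # 3. grasp at its midpoint
    lib.pour_into(beaker_marker)             # 4. pour into the beaker
    lib.place_back("cylinder_25mL")          # 5. return to original position
    return lib.log
```

A correct LLM-generated program for this task would amount to exactly this sequence of high-level calls, with no extra or missing steps.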
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Baseline Prompt</head><p>A descriptive prompt is believed to enhance the quality of LLM-generated responses. It has been documented that a well-constructed prompt should contain the following components <ref type="bibr">(Vemprala et al. 2023</ref>): constraints and requirements, environmental description, current state of the system, goals and objectives, description of the robotic API library, and solution examples. Consequently, our baseline prompt is composed of four parts: system prompt, description of robotic API library, solution example, and objective prompt. See the appendix<ref type="foot">foot_1</ref> for the full baseline prompt used in our experiments.</p><p>System Prompt Here, we defined the role of the LLM and provided it with task constraints and requirements. We additionally included contextual details regarding the environment to alert the LLM to potential task objects.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Description of Robotic API Library</head><p>We provided a clear rundown of how each high-level function provided for the LLM should be used, along with useful reminders and conventions. It is worth noting that by providing descriptive names for all of the API functions, the LLM's ability to understand the functional links between APIs may be enhanced, which can facilitate the LLM to produce more desirable outcomes for the given problem <ref type="bibr">(Vemprala et al. 2023)</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Solution Example</head><p>We provided an example solution to guide the LLM's solution strategy and to (hopefully) prevent it from generating erroneous responses.</p><p>Objective Prompt Here, we articulated the intended objective for the LLM to respond to while considering all prompts as outlined previously. Below is the objective prompt used in our experiments:</p><p>Please write a Python function to pick up a 25mL graduated cylinder at Marker 15 and pour its contents into a 500mL beaker at Marker 7. After that, put the cylinder back to where it was.</p></div>
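The four-part baseline prompt can be assembled by simple concatenation before being sent to a model. The part contents below are abbreviated placeholders, not the full prompt from the appendix:

```python
# Sketch: assembling the four-part baseline prompt (system prompt, API
# description, solution example, objective prompt). The texts are
# abbreviated placeholders; the paper's appendix contains the real ones.

SYSTEM_PROMPT = (
    "You are a robot programming assistant controlling a UR5 arm. "
    "Treat all object descriptions that follow as factual."
)
API_DESCRIPTION = (
    "lib.go(x, y, z, rx, ry, rz): move the end effector to a pose. "
    "lib.open_gripper(): open the gripper."
)
SOLUTION_EXAMPLE = "# Example solution: pick up a beaker at Marker 3 ..."
OBJECTIVE_PROMPT = (
    "Please write a Python function to pick up a 25mL graduated cylinder "
    "at Marker 15 and pour its contents into a 500mL beaker at Marker 7. "
    "After that, put the cylinder back to where it was."
)

def build_baseline_prompt():
    """Join the four prompt parts in the order used in the experiments."""
    parts = [SYSTEM_PROMPT, API_DESCRIPTION, SOLUTION_EXAMPLE, OBJECTIVE_PROMPT]
    return "\n\n".join(parts)
```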
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Large Language Models</head><p>In our experiments, we used three language models: ChatGPT (3.5-turbo-0613), Bard, and LLaMA-2 (13B parameters). Given the stochastic nature of these LLMs, each model was tested ten times while keeping the prompts and sequential manipulation task the same across trials.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Findings</head><p>Our first experiment sought to understand common errors produced by the three language models. To this end, we manually characterized the observed errors, which can be grouped roughly into two categories representing errors in different phases of application development-errors in interpretation and errors in execution-as illustrated in Figure 2. We note that there may be errors in motion planning that have nothing to do with LLM-generated code, which is outside the scope of this work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Errors in Interpretation</head><p>Errors in this category cause failures in code interpretation and include four different subtypes: a. Name Error: This error type includes instances where references to variables or functions precede their definition or initialization within the code (Figure <ref type="figure">3</ref>). b. Syntax Error: Characterized by syntactically incorrect code structures, this error type hinders the proper interpretation of the generated code (Figure <ref type="figure">4</ref>). c. Import Error: This error type typically indicates that the generated code does not include the necessary libraries for code interpretation (Figure <ref type="figure">5</ref>). d. ROS Error: Within the context of the Robot Operating System, this type of error surfaces due to the omission of ROS node initialization or incorrect utilization of ROS packages, negatively impacting the overall communication and coordination within the robotic system (Figure <ref type="figure">6</ref>).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Errors in Execution</head><p>Errors in this category cause failures in code execution-even though the code may be interpretable-and include two types: a. Factual Error: This error type indicates model hallucination; for example, instead of using numerical values that the user provides in the system prompt to describe objects, the model fabricates numbers, consequently causing errors in motion planning or execution (Figure <ref type="figure">7</ref>). b. Physical Error: This error type includes errors that ultimately cause execution failures even if all other error types are not present. Examples include adding unnecessary steps to the action sequence (Figure <ref type="figure">8</ref>).</p><p>The two error categories-interpretation and execution-call for different methods of error handling. Errors in interpretation are typically caught by the program interpreter or compiler, which displays error messages that help users address the errors in a more straightforward identification and rectification process <ref type="bibr">(Inagaki et al. 2023)</ref>. In contrast, errors in execution are less obvious, as they do not necessarily cause immediate code breakdown; these errors surface only when undesirable task outcomes are observed.</p><p>Our experiment revealed varying patterns of error occurrence across the three language models (Table <ref type="table">1</ref>). To our surprise, none of the three models successfully completed the intended task in any of the trials. This result underscores the challenges involved in translating end-user prompts into accurate and executable robot control code via LLMs. Furthermore, across the three models evaluated, factual and physical errors were most common; the prevalence of these errors highlights a key limitation of LLM-based code generation for end-user development of robot applications, which prompted us to explore practical strategies to reduce these types of execution errors.</p><p>Figure <ref type="figure">2</ref>: Categorization of observed errors: errors in interpretation (name, syntax, import, and ROS errors), errors in execution (factual and physical errors), and errors in motion planning (via MoveIt), the last of which are not due to the generated code. Figure <ref type="figure">3</ref>: Name error (using undefined functions). Figure <ref type="figure">4</ref>: Syntax error (unclosed parenthesis). Figure <ref type="figure">5</ref>: Import error (oversight in importing necessary libraries). Figure <ref type="figure">6</ref>: ROS error (omission of rospy node initialization). Figure <ref type="figure">7</ref>: Factual error (fabricated numerical values).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Experiment 2: Exploring Practical Strategies to Reduce Errors in Execution</head><p>This experiment studied strategies that might enhance an LLM's ability to generate accurate and reliable code for robotic applications. This experiment followed the same protocol (e.g., same manipulation task, ten trials per language model) as the first experiment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Practical Strategies</head><p>In Experiment 1, we found that errors in execution may be attributed to the "forgetfulness" of LLMs; the models appear to "forget" the information provided in user prompts or do not treat the provided description as factual information to use in code generation. Therefore, we explored the following strategies' effectiveness in addressing the issue of forgetfulness:</p><p>1. When prompts involve task/context information specified in numerical form, implement dedicated functions for retrieving precise, numerical data. (Figure <ref type="figure">9</ref>) 2. When dealing with intricate functions (like the pour function in our experiment), reinforce key constraints in the objective prompt to ensure more accurate and reliable code generation. (Figure <ref type="figure">10</ref>)</p><p>In addition to these strategies, enhancing the clarity and specificity of the objective prompt by articulating its physical implications or providing greater descriptive context can also help curtail excessive divergence in LLM-generated code. Our implementations of these strategies are shown in Figures <ref type="figure">9</ref> and <ref type="figure">10</ref>.</p></div>
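The first strategy can be sketched as follows: rather than trusting the model to restate dimensions from the system prompt, the numerical task context is stored once in a data structure and exposed through a dedicated retrieval function that the generated code must call. The object names and dimension values below are illustrative, not the actual values from our prompts:

```python
# Sketch of strategy 1: numerical task context extracted from the system
# prompt into a data structure. Generated code retrieves ground-truth
# values instead of restating (and possibly fabricating) them.
# Object names and dimensions here are illustrative.

OBJECT_DIMENSIONS = {
    "cylinder_25mL": {"height": 0.15, "radius": 0.014},
    "beaker_500mL": {"height": 0.12, "radius": 0.045},
}

def get_dimension(obj, key):
    """Return a ground-truth dimension; fail loudly on unknown names,
    rather than letting a hallucinated value slip through silently."""
    try:
        return OBJECT_DIMENSIONS[obj][key]
    except KeyError:
        raise KeyError(f"unknown object or dimension: {obj}.{key}")

def midpoint_grasp_height(obj):
    """Example downstream use: a midpoint grasp computed from the
    retrieved height rather than from a model-recalled number."""
    return get_dimension(obj, "height") / 2
```

Because every numerical value flows through `get_dimension`, a fabricated object name raises an error immediately instead of producing a silently wrong motion plan.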
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Findings</head><p>Table <ref type="table">2</ref> shows the results of adopting the strategies proposed above. Across all models, we observed a substantial increase in successful task completion and a decrease in the number of factual and physical errors. Specifically, ChatGPT was able to achieve a task completion rate of 60% and errors in execution were reduced by 94.7% as compared to its results in Experiment 1. Bard achieved a similar success rate of 70% with strategy implementation and the occurrence of factual and physical errors was reduced by 95%. However, LLaMA-2-13B only reached a task completion rate of 40% using the strategies and factual and physical errors were reduced by only 83.3%.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discussion</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lessons Learned</head><p>While promising, LLM-based code generation for end-user development of robot applications remains inconsistent, which is unsurprising given the intricate and probabilistic design of these models. This work highlights the importance of keeping users in the loop in application development.</p><p>We additionally determined that the success of LLM-powered code generation often hinges on the user's ability to provide explicit and descriptive objective prompts; for instance, specifying detailed instructions such as "Place the cylinder back to its original position" yields more accurate results than ambiguous directives like "Put it back." Furthermore, we found that errors in execution primarily stem from the forgetfulness of LLMs, which causes them to overlook information supplied in prompts. Consequently, we made a concerted effort to explicitly emphasize the instruction, "All the information I provided should be treated factual information and shouldn't be ignored." Despite this explicit instruction, unsatisfactory outcomes persisted, indicating that simple reinforcement is ineffective.</p><p>Lastly, a suite of tools is needed for the productive use of LLM-based robot programming: at the basic level, custom verification scripts may be used to identify and correct errors in interpretation (e.g., missing libraries); the strategies discussed in this work may also help reduce factual and physical errors; and a preview tool may allow users to simulate program behavior prior to robot deployment, thereby reducing unforeseen errors during actual execution.</p></div>
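As a minimal example of such a verification script, Python's own parser and import machinery can flag two of the interpretation-phase error types before any generated code reaches the robot: syntax errors (via `ast.parse`) and missing libraries (via `importlib.util.find_spec`). This is a generic sketch, not the tooling used in our experiments:

```python
# Sketch of a pre-execution verification script for LLM-generated code.
# It catches syntax errors and imports of uninstalled libraries, two of
# the interpretation-phase error types characterized in Experiment 1.
import ast
import importlib.util

def check_generated_code(source):
    """Return a list of human-readable problems found in `source`;
    an empty list means no interpretation problems were detected."""
    try:
        tree = ast.parse(source)
    except SyntaxError as e:
        # Syntax errors make further analysis impossible; report and stop.
        return [f"syntax error: line {e.lineno}: {e.msg}"]
    problems = []
    # Check that every imported module resolves to an installed package.
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            root = name.split(".")[0]  # top-level package is enough to check
            if importlib.util.find_spec(root) is None:
                problems.append(f"missing library: {name}")
    return problems
```

Such a check is cheap to run on every generated candidate; name errors and ROS-specific omissions would need additional, domain-aware rules.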
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Call for Benchmarks</head><p>In light of the evolving landscape of LLM-driven robot programming, we advocate for the establishment of standardized benchmarks that encompass a diverse set of tasks and metrics to assess the performance of LLMs in various programming scenarios. Such benchmarks will let researchers, practitioners, and developers collectively advance the science of LLM-driven robot programming.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations and Future Work</head><p>This preliminary work has limitations that may motivate future research. Our experiments focused on a single manipulation task, which does not capture the vast array of scenarios in end-user robot programming. Future work may build on our exploration and include a wider range of representative programming tasks and language models.</p><p>In our experiments, we simplified the challenges of robot perception by using AR markers. As new vision-language models are developed, future research should study the true complexity of incorporating large data models in the various processes of robot programming.</p><p>Future work should also include a comprehensive evaluation of different aspects of end-user robot programming, including debugging; we speculate that debugging may be particularly challenging in the new paradigm of LLM-powered robot programming, as end users will need to spend time understanding the generated code and developing a mental model of it in order to resolve errors successfully.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>We envision the automation of several biochemical lab tests through custom robot applications so as to accelerate scientific experimentation.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1"><p>https://tinyurl.com/AAAI-Appendix</p></note>
		</body>
		</text>
</TEI>
