<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>04/27/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10366304</idno>
					<idno type="doi">10.1145/3491101.3519665</idno>
					<title level='j'>CHI Conference on Human Factors in Computing Systems Extended Abstracts (CHI ’22 Extended Abstracts)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Priyan Vaithilingam</author><author>Tianyi Zhang</author><author>Elena L. Glassman</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Recent advances in Large Language Models (LLMs) have made automatic code generation possible for real-world programming tasks in general-purpose programming languages such as Python. However, there are few human studies on the usability of these tools and how they fit the programming workflow. In this work, we conducted a within-subjects user study with 24 participants to understand how programmers use and perceive Copilot, an LLM-based code generation tool. We found that, while Copilot did not necessarily improve the task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since Copilot often provided a useful starting point and saved the effort of searching online. However, participants did face difficulties in understanding, editing, and debugging code snippets generated by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlighted several promising directions for improving the design of Copilot based on our observations and participants’ feedback.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>on two different kinds of approaches: (1) program synthesis algorithms that search over a large program space defined by a domain-specific language (DSL) <ref type="bibr">[2,</ref><ref type="bibr">7,</ref><ref type="bibr">10,</ref><ref type="bibr">12,</ref><ref type="bibr">14,</ref><ref type="bibr">19,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">34,</ref><ref type="bibr">43]</ref>, and (2) deep learning models that are trained on a large amount of existing code and can generate new code given some form of specification, such as natural language descriptions or incomplete code <ref type="bibr">[5,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">22,</ref><ref type="bibr">38,</ref><ref type="bibr">39,</ref><ref type="bibr">48,</ref><ref type="bibr">49]</ref>. Both kinds of approaches have clear drawbacks. On the one hand, existing program synthesis techniques are constrained to pre-defined DSLs and cannot scale to general-purpose programming languages <ref type="bibr">[15]</ref>. On the other hand, existing generative models have a hard time learning sophisticated programming patterns from code corpora and often generate code with syntactic or semantic errors <ref type="bibr">[9,</ref><ref type="bibr">29,</ref><ref type="bibr">40]</ref>. The recent development of Large Language Models (LLMs) such as GPT-3 <ref type="bibr">[32]</ref> has opened up new opportunities for addressing the limitations of existing code generation techniques. 
For example, Codex <ref type="bibr">[50]</ref>, which contains 12 billion model parameters and is trained on 54 million software repositories on GitHub, has demonstrated stunning code generation capability, solving over 70% of 164 Python programming tasks with 100 samples <ref type="bibr">[8]</ref>.</p><p>The performance of LLM-based code generation tools has been extensively studied using benchmarks <ref type="bibr">[8,</ref><ref type="bibr">33]</ref>. However, little is known about the usability and programmers' perception of such a tool in a real-world programming workflow. To bridge the gap, we conducted a within-subjects comparative study with 24 participants, in which participants were asked to complete Python programming tasks. In the experimental condition, participants wrote programs with the assistance of Copilot, a Visual Studio Code (VSCode) plugin powered by Codex <ref type="bibr">[13]</ref>. In the control condition, participants wrote programs with the assistance of Intellisense, the default code completion plugin in VSCode. We investigated the following research questions:</p><p>&#8226; RQ1: How does using Copilot affect the programming experience?</p><p>&#8226; RQ2: How do users recognize errors in code generated by Copilot?</p><p>&#8226; RQ3: What coping mechanisms do users employ when they find errors in code generated by Copilot?</p><p>&#8226; RQ4: What are the obstacles and limitations that can prevent adoption of Copilot?</p><p>Our key findings are: (1) the majority of the participants (19 out of 24) preferred using Copilot over Intellisense (the control condition); (2) Copilot provides a useful starting point for participants to kick-start the task and saves them the effort of searching online; (3) there is a need to identify better ways for participants to understand long blocks of generated code to help them edit, debug, and repair the code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK 2.1 AI-based Code Generation</head><p>There is a long history of research on automated code generation. Some of the earliest work dates back to the 1960s, when Waldinger and Lee presented a program synthesizer called PROW that automatically generated LISP programs based on user-provided specifications in the form of a predicate calculus <ref type="bibr">[41]</ref>.</p><p>There are two main trends in modern automatic code generation: program synthesis and machine learning. Program synthesis primarily uses search-based techniques to generate code that fulfills a given specification. These techniques work on a subset of the language components relevant to the domain, known as a Domain-Specific Language (DSL). More recently, program synthesis has been applied to a variety of domains, e.g., low-level bit-vector implementations <ref type="bibr">[36]</ref>, data manipulation in Excel <ref type="bibr">[14]</ref>, and regular expression synthesis <ref type="bibr">[51]</ref>. The main limitation is that these techniques are restricted to a pre-defined DSL, which makes them hard to scale to programs written in general-purpose programming languages such as Java or Python: such languages include many more language features and syntax rules than DSLs and therefore define a much larger program space to search <ref type="bibr">[15]</ref>.</p><p>The second trend is using machine learning, especially deep learning. Advances in deep learning have shown promising results on automatically generating code for real-world programming tasks <ref type="bibr">[5,</ref><ref type="bibr">16,</ref><ref type="bibr">17,</ref><ref type="bibr">22,</ref><ref type="bibr">38,</ref><ref type="bibr">39,</ref><ref type="bibr">48,</ref><ref type="bibr">49]</ref>. For instance, Kim et al. 
<ref type="bibr">[21]</ref> developed a transformer architecture that is aware of code structures using abstract syntax trees. Alon et al. <ref type="bibr">[1]</ref> introduced structural language models that remove any restriction on the vocabulary or structure, the main limitation of program synthesis techniques. Karampatsis and Sutton <ref type="bibr">[20]</ref> similarly introduced open-vocabulary models that can generate code with an arbitrary number of tokens. Though these methods have shown promising results, they still suffer from low accuracy and are less reliable <ref type="bibr">[9,</ref><ref type="bibr">29,</ref><ref type="bibr">40]</ref>. For instance, Ciniselli et al. <ref type="bibr">[9]</ref> show that their RoBERTa-based model can only produce correct solutions for 7% of the tasks from the CodeSearchNet benchmark <ref type="bibr">[18]</ref>.</p><p>The recent advances in large language models (LLMs) such as GPT-3 <ref type="bibr">[32]</ref> have led to a breakthrough in automated code generation compared to prior state-of-the-art deep learning methods <ref type="bibr">[4,</ref><ref type="bibr">6,</ref><ref type="bibr">42]</ref>. For example, Codex <ref type="bibr">[50]</ref>, a fine-tuned version of GPT-3, can generate fully correct code for 29% of unseen programming tasks with only one sample of generated programs and 72% of them with 100 samples, while a widely used code generation tool, TabNine <ref type="bibr">[39]</ref>, can only solve 3% and 8%, respectively <ref type="bibr">[8]</ref>.</p><p>While there has been recent work evaluating the accuracy of LLM-based code generation tools <ref type="bibr">[8,</ref><ref type="bibr">33]</ref>, little is known about their usability. With such increases in accuracy, how will programmers interact with a tool that generates almost accurate yet not perfect code? How easy or difficult is it for programmers to recognize errors in a code snippet that is almost but not quite correct? 
Will they simply modify the incorrect part or completely rewrite the entire code themselves? This motivates us to study programmers' expectations, coping strategies, and needs for such powerful code generation tools.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Coping with Imperfect AI</head><p>Prior studies have examined how users interact with imperfect AI <ref type="bibr">[11, 23, 26–28, 35, 37, 45]</ref>. Dzindolet et al. <ref type="bibr">[11]</ref> showed that once people observed an automated system make errors, their distrust in the system increased unless an explanation was provided. However, these explanations may also lead to over-reliance on the system even when unwarranted, signaling the importance and difficulty of providing explanations that help people to calibrate trust appropriately. Kocielnik et al. <ref type="bibr">[23]</ref> examined the effect of giving people control over the types of errors made by a scheduling assistant, either by avoiding false positives or false negatives. They found that even when the system was only 50% accurate, users who expected a reduction in the false positive rate had a lower perception of accuracy and lower acceptance of the system than the users who expected a reduction in the false negative rate. Other studies <ref type="bibr">[3,</ref><ref type="bibr">52]</ref> showed that confidence scores helped users calibrate their trust, form a good mental model of the AI, and better understand its error boundaries.</p><p>Similar to other AI techniques, AI-based code generation tools also suffer from inherent uncertainty and imperfection. They may inevitably generate code with errors or even code that wildly differs from users' expectations. However, unlike other domains, code generation demands a much higher level of correctness: code either compiles or does not, and it is either correct or contains bugs such as logic errors and security vulnerabilities. Therefore, existing findings on other types of AI techniques may not generalize to the domain of code generation.</p><p>Currently, there are only a few studies on how programmers use such imperfect code generation tools <ref type="bibr">[44,</ref><ref type="bibr">47]</ref>. Xu et al. 
<ref type="bibr">[47]</ref> did a user study with 31 participants to evaluate the usefulness of an NL-to-code plugin <ref type="bibr">[46]</ref>. They found that there was no statistically significant difference in task completion time or task correctness scores when using or not using the NL-to-code plugin. Furthermore, most participants stayed neutral or somewhat positive toward the NL-to-code plugin. The main reason for these negative results was the correctness and quality of generated code, as pointed out by many participants in the post-study survey. However, these findings may not hold as more recent large language models have significantly boosted the correctness and quality of generated code. This further motivates us to conduct the user study with Copilot.</p><p>Weisz et al. <ref type="bibr">[44]</ref> interviewed 11 software engineers at IBM and solicited their feedback on a neural machine translation (NMT) model for an adjacent domain: translating code from one programming language to another. They found that the user's acceptance of the NMT model was contingent on the number of errors in the translated code. They also identified several common themes in participants' feedback such as acceptance via verification and the desire to provide guidance to the NMT model. Our study was designed to complement this knowledge but for daily programming tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">STUDY DESIGN</head><p>To understand how programmers use an LLM-based code generation tool, we designed and carried out a within-subjects comparative study with 24 participants. For the control condition, each participant was asked to complete a Python programming task in the Visual Studio Code (VSCode) IDE with the default code completion tool, called Intellisense. Intellisense suggests a drop-down list of valid tokens in the current code context, ordered alphabetically or by relevance. Users can select the token they want and press the Tab key to accept the suggested token or the Esc key to reject it.</p><p>For the experimental condition, each participant finished another Python programming task in VSCode with Copilot. Similar to Intellisense, Copilot can automatically suggest code based on the current code context as a programmer is typing. While Intellisense only predicts one token at a time, Copilot is capable of generating multiple lines of code. Participants can press Tab to accept the code suggestion or Esc to reject it. Though not required, participants can give prompts to Copilot by writing comments. Henceforth, when we mention prompts in the text, we refer to comments written by the participants in the code specifically to guide Copilot.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Tasks</head><p>We selected three real-world Python programming tasks with different levels of difficulty from <ref type="bibr">[47]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Participants</head><p>We recruited 24 participants (4 female, 19 male, 1 non-binary) through mailing lists of two research universities. Ten participants were undergraduate students, 5 were master's students, 8 were Ph.D. students, and 1 was a software engineer. Regarding their familiarity with programming, only 1 participant had less than 2 years of programming experience, 14 participants had 2-5 years of experience, and 9 participants had over 5 years of experience. Participants received a $20 Amazon gift card as compensation for their time.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Protocol</head><p>To enable easy access to the code generation tools, we set up two virtual machines (VMs) in Microsoft Azure, one with Copilot installed and the other with Intellisense installed. We also pre-installed VSCode and several popular Python packages in both VMs. Participants could log into each VM from their laptop to start the user study. We recorded the audio and the screen-cast with the consent of each participant. In each study session, a participant completed one of the three tasks using Copilot (i.e., the experimental condition) and another task with Intellisense (i.e., the control condition). To emulate real-world programming experience, the participants were allowed to use Internet search or refer to any online resources at any time during the task. To mitigate the learning effect, both the order of task assignment and the order of tool assignment were counterbalanced across participants through random assignment. Therefore, for each unique combination of the 3 tasks and 2 conditions, we have 8 participant data points. Before each task, the participants were given a quick tutorial on the assigned tool. We set a time limit of 20 minutes for each programming task. A task was considered failed if participants did not complete it within 20 minutes. After each task, participants answered a survey to reflect on their experience using the tool. After finishing both tasks, participants answered a final survey to directly compare the two conditions. The first author performed open coding on participants' responses to identify themes and then discussed them with the co-authors to refine the themes over multiple sessions. These themes were then used to explain the results in the following sections.</p></div>
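<div xmlns="http://www.tei-c.org/ns/1.0"><p>The counterbalancing described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual assignment procedure: the task labels, the function name build_schedule, and the choice to use every ordered task pair and tool order equally often before shuffling are all hypothetical. The sketch shows why 24 participants completing two tasks each yield 8 data points for every combination of task and tool.</p>
<p><![CDATA[
```python
import random
from collections import Counter
from itertools import permutations

# Placeholder labels; the paper only says the tasks differ in difficulty.
TASKS = ["easy", "medium", "hard"]
TOOLS = ["Copilot", "Intellisense"]


def build_schedule(n_participants=24, seed=0):
    """Return one (task order, tool order) plan per participant.

    Assumption: each of the 6 ordered task pairs is crossed with each of the
    2 tool orders, giving 12 distinct plans that are reused equally often
    and then handed out in random order.
    """
    plans = [
        (tasks, tools)
        for tasks in permutations(TASKS, 2)
        for tools in permutations(TOOLS, 2)
    ]
    if n_participants % len(plans) != 0:
        raise ValueError("participant count must be a multiple of 12")
    pool = plans * (n_participants // len(plans))
    random.Random(seed).shuffle(pool)
    return pool


schedule = build_schedule()

# Count how many sessions fall into each (task, tool) cell.
cells = Counter()
for tasks, tools in schedule:
    for task, tool in zip(tasks, tools):
        cells[(task, tool)] += 1

for cell in sorted(cells):
    print(cell, cells[cell])
```
]]></p>
<p>With 24 participants, each of the 12 plans is used twice, and every task and tool combination appears in exactly 8 sessions, which matches the cell size reported in the protocol.</p></div>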
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">RESULTS</head><p>This section describes both the quantitative and qualitative results of our study. Quantitative results include the task completion time, task failure rates, and metrics from survey responses. In the qualitative results subsection, we describe the common themes that emerged through open coding of participant comments and experimenter observations made during the study.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Quantitative</head><p>Participants using Copilot failed to complete tasks more often than participants using Intellisense. Table <ref type="table">1</ref> shows individual and average task completion times. Table cells with an orange background indicate sessions in which participants did not solve the task within 20 minutes. When using Copilot, all 8 participants working on the easiest task completed it, 6 out of 8 participants working on the medium-difficulty task completed it, and 5 out of 8 participants working on the hardest task completed it within the allotted time. In contrast, when using Intellisense, all 8 participants in both the easiest and medium-difficulty task conditions completed their tasks, and only 2 participants failed to complete the hardest task. Across all task difficulties, Intellisense users failed 2 times while Copilot users failed 5 times. This difference is not statistically significant.</p><p>We analyzed the session recordings to identify the root cause of these task failures. Out of the 5 task failures when using Copilot, 3 were caused by incorrect code generated by Copilot, which led participants into a time-consuming debugging rabbit hole (discussed in Section 4.2.4). The other two were caused by the participants' inexperience with the relevant Python libraries (graph plotting and HTML parsing libraries) and the debugging features of the IDE. In contrast, participants using Intellisense failed to finish the 2 tasks due to their inexperience with a graph plotting library.</p><p>While Copilot users completed fewer tasks than Intellisense users, the tasks completed with Copilot were done more quickly on average (see the last row of Table <ref type="table">1</ref>). The overall mean difference of task completion time using Copilot vs. Intellisense is about 1 min. 
Yet the mean difference is not statistically significant (Student's t-test, p = 0.53).</p><p>In the post-study survey, 19 of 24 participants answered that they preferred Copilot over Intellisense. Furthermore, 23 of 24 participants answered that Copilot was more helpful than Intellisense.</p><p>We also asked participants to rate the helpfulness of code generated by both tools on a scale of 1 (not at all helpful) to 7 (very helpful). Participants found code generated by Copilot more helpful than code generated by Intellisense (6.16 vs. 4.45 on average). This difference is statistically significant (Student's t-test, p &lt; 0.001). However, only 10 participants self-reported that they felt more confident about the code generated by Copilot than the code suggested by Intellisense.</p></div>
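<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough check on the failure counts reported above (5 failed sessions out of 24 with Copilot vs. 2 out of 24 with Intellisense), the sketch below runs a two-sided Fisher's exact test on the corresponding 2x2 table. The paper does not name the test used for this comparison, so the choice of Fisher's exact test is an assumption, as is treating the 48 sessions as independent even though each participant contributed one session per tool. The function name is hypothetical.</p>
<p><![CDATA[
```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Rows: Copilot, Intellisense. Columns: failed, completed.
p_value = fisher_exact_two_sided(5, 19, 2, 22)
print(f"two-sided p = {p_value:.3f}")
```
]]></p>
<p>Under these assumptions the p-value is far above 0.05, which is consistent with the statement that the difference in failure counts is not statistically significant.</p></div>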
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Qualitative</head><p>4.2.1 User Perception. Participants found Copilot helpful as it provided a starting point for the task instead of the blank canvas they usually have. Even if the code generated by Copilot is incorrect, it points them towards a direction they can get started from. P1 said, “Copilot's function/line generation is a helpful reference; even if the generated code is not correct, it can point me in the right direction for completing the task.” This is primarily useful for the kind of tasks in which the user has no experience. P7 said, “the generation of fully formed functions that completed a task that I wasn't sure how to approach/start was very cool.” For four of the participants, Copilot auto-completed the code for almost the whole task, and participants made few or no fixes to the generated code. Though we did not see any significant difference in task completion time, seven participants explicitly mentioned that Copilot can save time in completing the task compared to Intellisense. P4 said, “[Copilot] will likely save me much more time during the coding process.” Participants also considered writing comments to guide Copilot as a way of communicating with the AI. P24 said, “Copilot behaves just like a TA and can tell me exactly what I want by reading the comments.” However, participants pointed out several concerns about adopting Copilot in practice. First, twelve participants said they found it hard to understand and change the code generated by Copilot. P1 said, “Copilot generated a complete function to fulfill the full task, but part of the function did not work as desired. Because I did not understand several parts of the function generated by Copilot, I did not know how to debug the function. This caused me to get rid of the whole function generated by copilot and start over.” 
Due to a lack of understanding, five participants perceived a loss of control over their code. P13 said, “I would go with Intellisense for now since it gives me more control over the code I am writing.” Second, seven participants expressed concerns over code reliability. P7 said, “At this time I probably prefer Intellisense just because I trust my own googling and understanding code examples online rather than opaque suggestions from copilot.” P18 felt very frustrated after observing Copilot continuously generate code with errors. They said, “Yes, I got rid of the whole snippet as I didn't want to conform to the code generated by AI as it may have unwanted bugs.” Third, eight participants said they only trusted Copilot for simple tasks. This is due to multiple reasons, e.g., the difficulty of understanding generated code, fear of unknown bugs, and failure to match the coding style.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">User Interaction Patterns.</head><p>While prior code completion tools such as Intellisense only suggest one token at a time, Copilot is capable of generating multiple lines of code at a time. While such a code generation capability is often interpreted as a powerful feature, it causes significant cognitive overload in practice, especially when the generated code has errors. A long piece of generated code forces the user to switch back and forth between program reading and writing. When the generated code has errors, the user additionally needs to enter a debugging mode. This constant context switching puts significant mental demand on the users.</p><p>Another common interaction pattern is to use Copilot as a substitute for Internet search. P3 said, “for certain tasks that follow very routine structures, and which I always have to look up on Stack Overflow, a tool like Copilot eliminates a lot of the tedious searching on Google.” However, we have to note that unlike code examples from Stack Overflow, which are vetted by human programmers, the code generated by Copilot may contain errors. P10 wrote, “I'm not fully confident that Copilot will suggest the best solution. By reading Stack Overflow, the helpful thing is that there will always be someone who would just post a better solution, and people will discuss and compare. I feel like that is missing from Copilot.” Since Copilot only generates one solution at a time and does not provide any explanations, programmers cannot compare multiple alternative solutions and assess their quality as they often do in an online search. Furthermore, we observed eight instances of over-reliance on Copilot. For example, P8 simply accepted the generated code and said, “I guess I will take its word.” This over-reliance also makes participants defer code validation. P20 said, “Not exactly sure what this does. 
I'll figure it out later.” Some participants later spotted errors in the accepted code. They had to go back and spend a lot of time debugging the previous code.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.3">Coping Strategies.</head><p>There are two main ways participants cope with incorrect code generated by Copilot. The first way is to accept the incorrect suggestion and attempt to repair it. Twelve participants attempted to repair the code when there was an error. However, the participants always found it difficult to repair the code since the code was not written by themselves. Of these twelve participants open to repairing the code, five participants were only willing to repair the code if the code generated by Copilot was easy to read and understand. P7 said, “it made debugging the code more difficult as I hadn't written the code directly and didn't have an initial intuition about where the bugs might be. Especially with a final bug in my program I really had no idea why it was happening and had to refactor the code.”</p><p>In cases where the participant is unable or unwilling to repair the code, they will simply get rid of the entire generated code and search for solutions online. Seven participants mentioned they will rewrite the whole code by themselves without any attempt to repair if there is an error in the code generated by Copilot. P13 said, “I think getting rid of the whole code is easier than reading the code and making the changes.” P1 also said, “because I did not understand several parts of the function generated by Copilot, I did not know how to debug the function. This caused me to get rid of the whole function generated by Copilot and start over.”</p><p>4.2.4 Obstacles and Limitations. During the user study, we observed three major obstacles to using Copilot in practice. First, participants often failed to understand and assess the correctness of the generated code. Since Copilot often generates a big chunk of code at a time, participants found it hard to understand and debug the code. This is already discussed in Section 4.2.1. 
The second obstacle is the underestimation of the effort required to fix a bug in the code generated by Copilot. Among the five task failures by participants using Copilot, three were due to incorrect suggestions by Copilot. While participants recognized these errors, they underestimated how much effort it would take to fix the bug and got stuck in a debugging rabbit hole they could not get out of. For instance, for P20, Copilot generated regular-expression-based code for extracting URLs from HTML. It is extremely hard to get the regular expression right for this task, and a better solution is to parse the HTML and extract attributes instead. Since Copilot suggested the regular expression, P20 decided to stick with it and overlooked the better solution. Yet P20 failed to fix the regular expression after 20 minutes, leading to a task failure. The third obstacle is the brittleness and ambiguity of using comments (or prompts) as a specification for Copilot. As discussed in the previous sections, participants used comments to describe the desired code that should be generated by Copilot. However, Copilot is very sensitive to these comments. A small tweak in a comment can cause Copilot to generate a significantly different code snippet. P24 said, “it is ambiguous to use comments to hint at Copilot what I want.”</p></div>
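<div xmlns="http://www.tei-c.org/ns/1.0"><p>The alternative mentioned above for P20's task, parsing the HTML and reading attributes rather than matching URLs with a regular expression, can be sketched with Python's standard html.parser module. The paper does not show the task input, the code Copilot generated, or which parsing library participants used, so the class name, the helper function, and the sample markup below are all hypothetical.</p>
<p><![CDATA[
```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href value of every anchor tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so mixed-case
        # markup and either quoting style are handled without extra work.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)


def extract_urls(html):
    """Return all anchor href values found in an HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.urls


# Hypothetical input mixing attribute order, quoting style, and letter case.
sample = (
    '<p>See <a class="x" href="https://example.com/a">A</a> '
    "and <A HREF='/b?q=1'>B</A>.</p>"
)
urls = extract_urls(sample)
print(urls)
```
]]></p>
<p>Variations in attribute order, quoting, and letter case are the kind of detail that makes a hand-written regular expression for this task fragile, whereas a parser absorbs them.</p></div>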
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">DISCUSSION</head><p>The majority of participants (19 out of 24) expressed a strong preference to use Copilot for their day-to-day programming tasks for several reasons. In many cases, Copilot accurately generated the code from the prompts provided by the participants. In four instances, it even generated the correct code for almost the whole task in one shot. Generating a whole block of code improves developer productivity significantly. However, we did not see a big difference in the time saved by Copilot during the study. Our observations point to a plausible explanation for this non-significance: although it is faster to generate code through Copilot than to acquire code from the internet, the code generated by Copilot can be buggy, leading to more time spent in debugging. In contrast, code from the internet is generally bug-free, comes with explanations and discussion, and can be adapted to the current task with minor edits such as changing the variable names. Moreover, Copilot also provides a useful starting point for the users to get started, even if the generated code is incorrect. This is especially useful for users who are stuck on a problem or who do not know how to approach the task. Several participants requested to see multiple code suggestions so they can compare and compose code from different snippets to suit their needs. Furthermore, we found participants used Copilot as a replacement for internet search. However, they missed out on comparing multiple sources and community discussions. Hence, it is worthwhile integrating online search with code generation to help users compare AI-generated code with online code examples and identify the best possible solution for a task. 
This can also prevent users from getting trapped in a debugging rabbit hole whenever Copilot suggests an incorrect or inefficient solution.</p><p>Another observation that is worth investigating is that participants had a hard time understanding the code generated by Copilot. One way to help users understand the generated code is to provide explanations using inline comments. We can highlight different parts of the code based on model confidence, similar to the approach suggested by <ref type="bibr">[44]</ref>. We can also help users debug code by automatically generating test cases and test data for users to validate generated code and identify corner cases. We would like to study this in depth and come up with ways to make the code more understandable and help users debug and repair generated code. Moreover, we observed that Copilot led to more task failures in medium and hard tasks since it was hard for Copilot to generate correct code in one shot. Three participants who finished the hard task approached the problem by decomposing the complex task into simpler sub-tasks and wrote prompts for each sub-task for Copilot to solve. Such a task decomposition strategy led to higher task-solving efficiency and a better user experience. Therefore, it is worth working on interaction mechanisms that facilitate task decomposition in the future.</p></div>
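<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the task decomposition strategy described above concrete, the sketch below writes one comment prompt per sub-task instead of a single prompt for the whole problem. Everything in it is hypothetical: the task (averaging values per category from CSV text) is not one of the study tasks, and the function bodies are hand-written stand-ins for what a code generation tool might suggest, not actual Copilot output.</p>
<p><![CDATA[
```python
import csv
import io
from collections import defaultdict


# Prompt 1: parse CSV text into a list of (category, value) pairs
def parse_rows(text):
    reader = csv.DictReader(io.StringIO(text))
    return [(row["category"], float(row["value"])) for row in reader]


# Prompt 2: group the values by category
def group_by_category(rows):
    groups = defaultdict(list)
    for category, value in rows:
        groups[category].append(value)
    return dict(groups)


# Prompt 3: compute the mean of each group
def mean_per_category(groups):
    return {name: sum(values) / len(values) for name, values in groups.items()}


# Each step is small enough to read and check on its own before composing.
data = "category,value\na,1\na,3\nb,10\n"
rows = parse_rows(data)
groups = group_by_category(rows)
result = mean_per_category(groups)
print(result)
```
]]></p>
<p>Because each prompt covers a single step, a wrong suggestion is confined to one short function that can be inspected or regenerated in isolation, which is one plausible reason the strategy worked well for the participants who used it.</p></div>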
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION</head><p>This paper presents a user study with 24 participants on the usability of GitHub Copilot, a groundbreaking code generation tool empowered by an ultra-large language model. In particular, we investigated users' perception of Copilot, their interaction patterns, and their coping strategies when the generated code is not correct. We found that, despite all the promising results on benchmarks <ref type="bibr">[8]</ref>, Copilot did not necessarily reduce the task completion time or increase the success rate of solving programming tasks in a real-world setting. On the other hand, participants overwhelmingly preferred using Copilot in their programming workflow since Copilot often provided a good starting point to approach the programming task. Furthermore, our study shed light on several promising future directions for improving the design of Copilot. For example, instead of simply using Copilot as a one-shot code generation tool, there should be more support for understanding and validating the generated code, exploring multiple solutions, and task decomposition.</p></div>		</body>
		</text>
</TEI>
