Title: Expectation vs. Experience: Evaluating the Usability of Code Generation Tools Powered by Large Language Models
Recent advances in Large Language Models (LLMs) have made automatic code generation possible for real-world programming tasks in general-purpose programming languages such as Python. However, there are few human studies on the usability of these tools and how they fit into the programming workflow. In this work, we conducted a within-subjects user study with 24 participants to understand how programmers use and perceive Copilot, an LLM-based code generation tool. We found that, while Copilot did not necessarily improve task completion time or success rate, most participants preferred to use Copilot in daily programming tasks, since it often provided a useful starting point and saved the effort of searching online. However, participants did face difficulties in understanding, editing, and debugging code snippets generated by Copilot, which significantly hindered their task-solving effectiveness. Finally, we highlight several promising directions for improving the design of Copilot based on our observations and participants’ feedback.
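The kind of plausible-looking but subtly wrong output that participants struggled to validate is easiest to see in a concrete case. The Python sketch below is a hypothetical illustration invented for this summary (it is not an artifact from the study): a generated-looking snippet with a subtle bug, followed by the correction a participant would have to make.

```python
# Hypothetical illustration of a plausible-looking generated snippet with a
# subtle bug (invented for illustration; not taken from the study).

def median(values):
    """Buggy version: looks right and passes a quick glance."""
    ordered = sorted(values)
    # Bug: for even-length input this returns the upper-middle element
    # instead of averaging the two middle elements.
    return ordered[len(ordered) // 2]

def median_fixed(values):
    """Corrected version a reviewer would need to write."""
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 2, 3, 4]))        # 3   -- wrong for even-length input
print(median_fixed([1, 2, 3, 4]))  # 2.5 -- correct
```

Spotting this class of defect requires reading the snippet as carefully as hand-written code, which matches the debugging difficulty the abstract reports.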
Award ID(s):
2107391 2123965
NSF-PAR ID:
10366304
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
CHI Conference on Human Factors in Computing Systems Extended Abstracts (CHI ’22 Extended Abstracts)
Page Range / eLocation ID:
1 to 7
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Powered by recent advances in code-generating models, AI assistants like GitHub Copilot promise to change the face of programming forever. But what is this new face of programming? We present the first grounded theory analysis of how programmers interact with Copilot, based on observing 20 participants (with a range of prior experience using the assistant) as they solve diverse programming tasks across four languages. Our main finding is that interactions with programming assistants are bimodal: in acceleration mode, the programmer knows what to do next and uses Copilot to get there faster; in exploration mode, the programmer is unsure how to proceed and uses Copilot to explore their options. Based on our theory, we provide recommendations for improving the usability of future AI programming assistants.
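A hypothetical sketch of how the two modes might look in an editor (the prompts and suggestions here are invented for illustration, not examples from the paper):

```python
# Acceleration mode: the programmer knows the next step and writes a precise
# signature so the assistant can fill in routine code that is easy to verify.
def celsius_to_fahrenheit(celsius: float) -> float:
    return celsius * 9 / 5 + 32  # a one-glance check confirms the formula

# Exploration mode: the programmer is unsure how to proceed and writes a
# vague prompt, then browses several suggestions for ideas, e.g.:
#   "# parse the log file and summarize errors by hour"
# Suggestion 1: read line by line, matching each entry with a regex
# Suggestion 2: load the whole file with pandas and group by timestamp
# Here suggestions are compared and adapted rather than accepted verbatim.
```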

     
  2.
    Neuroimaging and transcranial direct current stimulation (tDCS) research has revealed that generating novel ideas is associated with both reductions and increases in prefrontal cortex (PFC) activity, and engagement of posterior occipital cortex, among other regions. However, there is substantial variability in the robustness of these tDCS‐induced effects due to heterogeneous sample sizes, different creativity measures, and methodological diversity in the application of tDCS across laboratories. To address these shortcomings, we used twelve different montages within a standardized tDCS protocol to investigate how altering activity in frontotemporal and occipital cortex impacts creative thinking. Across four experiments, 246 participants generated either the common or an uncommon use for 60 object pictures while undergoing tDCS. Participants also completed a control short-term memory task. We applied active tDCS for 20 min at 1.5 mA through two 5 cm × 5 cm electrodes over left or right ventrolateral prefrontal (areas F7, F8) or occipital (areas O1, O2) cortex, concurrent bilateral stimulation of these regions across polarities, or sham stimulation. Cathodal stimulation of the left, but not right, ventrolateral PFC improved fluency in creative idea generation but had no effect on originality, as approximated by measures of semantic distance. No effects were obtained for the control tasks. Concurrent bilateral stimulation of the ventrolateral PFC, regardless of polarity direction, and excitatory stimulation of occipital cortex did not alter task performance. Highlighting the importance of cross-experimental methodological consistency, these results extend our past findings and contribute to our understanding of the role of left PFC in creative thinking.
  3. A large portion of the cost of any software lies in the time spent by developers in understanding a program’s source code before any changes can be undertaken. Measuring program comprehension is not a trivial task; in fact, different studies use self-reported and various psycho-physiological measures as proxies. In this research, we propose a methodology using functional Near Infrared Spectroscopy (fNIRS) and eye tracking devices as an objective measure of program comprehension that allows researchers to conduct studies in environments close to real-world settings, at the identifier level of granularity. We validate our methodology and apply it to study the impact of lexical, structural, and readability issues on developers’ cognitive load during bug localization tasks. Our study involves 25 undergraduate and graduate students and 21 metrics. Results show that the existence of lexical inconsistencies in the source code significantly increases the cognitive load experienced by participants, not only on identifiers involved in the inconsistencies but also throughout the entire code snippet. We did not find statistical evidence that structural inconsistencies increase the average cognitive load that participants experience; however, both types of inconsistencies result in lower performance in terms of time and success rate. Finally, we observe that self-reported task difficulty, cognitive load, and fixation duration do not correlate and appear to be measuring different aspects of task difficulty.
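What the paper calls a lexical inconsistency is easiest to see in code. The snippet below is a minimal hypothetical example (not drawn from the study’s materials) of an identifier whose name contradicts the behavior of its code:

```python
# Hypothetical illustration of a lexical inconsistency: the identifier
# promises one behavior while the code does another, the kind of mismatch
# the study links to higher cognitive load during bug localization.

def get_active_users(users):
    # Inconsistent: despite the "get" name, this also mutates the input
    # list by removing inactive entries instead of just returning a view.
    for u in list(users):
        if not u.get("active"):
            users.remove(u)
    return users

# A consistent version: the name matches the side-effect-free behavior.
def filter_active_users(users):
    return [u for u in users if u.get("active")]
```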
  4. Though conditionals are an integral component of programming, providing an easy means of creating conditionals remains a challenge for programming-by-demonstration (PBD) systems for task automation. We hypothesize that a promising method for implementing conditionals in such systems is to incorporate the use of verbal instructions. Verbal instructions supplied concurrently with demonstrations have been shown to improve the generalizability of PBD. However, the challenge of supporting conditional creation using this multimodal approach has not been addressed. In this extended abstract, we present our study on understanding how end users describe conditionals in natural language for mobile app tasks. We conducted a formative study with 56 participants, asking them to verbally describe conditionals in different settings for 9 sample tasks and to invent conditional tasks. Participant responses were analyzed using open coding and revealed that, in the context of mobile apps, end users often omit desired else statements when explaining conditionals, sometimes use ambiguous concepts in expressing conditionals, and often desire to implement complex conditionals. Based on these findings, we discuss the implications for designing a multimodal PBD interface to support the creation of conditionals.
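The omitted-else finding is concrete enough to illustrate with code. The sketch below is a hypothetical translation (invented here, not from the study) of a verbal instruction into the conditional a PBD system would have to construct:

```python
# Hypothetical example: a user says "If I get a text from my boss, reply
# that I'm driving." The else branch is never stated.

def handle_text(sender, reply):
    if sender == "boss":
        reply("I'm driving right now.")
    else:
        # The user omitted this branch; the PBD system must decide whether
        # the implicit else means "do nothing" or should trigger a
        # clarifying question back to the user.
        pass

handle_text("boss", print)    # prints the auto-reply
handle_text("friend", print)  # silently does nothing -- assumed default
```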
  5. Background and context. “Explain in Plain English” (EiPE) questions ask students to explain the high-level purpose of code, requiring them to understand the macrostructure of the program’s intent. A lot is known about techniques that experts use to comprehend code, but less is known about how we should teach novices to develop this capability. Objective. Identify techniques that can be taught to students to assist them in developing their ability to comprehend code, and contribute to the body of knowledge of how novices develop their code comprehension skills. Method. Motivated by previous research on how experts comprehend code, we developed interventions that could be taught to novices: identifying beacons, identifying the roles of variables, tracing, and abstract tracing. We conducted think-aloud interviews of introductory programming students solving EiPE questions, varying which interventions each student was taught. Some participants were interviewed multiple times throughout the semester to observe any changes in behavior over time. Findings. Identifying beacons and the names of variable roles were rarely helpful, as they did not encourage students to integrate their understanding of that piece in relation to other lines of code. However, prompting students to explain each variable’s purpose helped them focus on useful subsets of the code, which helped manage cognitive load. Tracing was helpful when students incorrectly recognized common programming patterns or made mistakes comprehending syntax (text-surface). Prompting students to pick inputs that potentially contradicted their current understanding of the code proved a simple way to help them select inputs to trace effectively. Abstract tracing helped students see high-level, functional relationships between variables. In addition, we observed students spontaneously sketching algorithmic visualizations that similarly helped them see relationships between variables. Implications. Because students can get stuck at many points in the process of code comprehension, there seems to be no silver-bullet technique that helps in every circumstance. Instead, effective instruction for code comprehension will likely involve teaching a collection of techniques. In addition to these techniques, meta-knowledge about when to apply each technique will need to be learned, but that is left for future research. At present, we recommend teaching a bottom-up, concrete-to-abstract approach.
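To make the interventions concrete, here is a hypothetical EiPE-style snippet (invented for this summary, not from the study) annotated with variable roles, a concrete trace, and an abstract-tracing question:

```python
# Hypothetical EiPE-style snippet. Plain-English purpose:
# "count how many values are above the average".

def mystery(xs):
    total = 0              # variable role: gatherer (accumulates a sum)
    for x in xs:
        total += x
    avg = total / len(xs)  # variable role: one-time calculation
    count = 0              # variable role: counter
    for x in xs:
        if x > avg:
            count += 1
    return count

# Concrete tracing: run one input and follow each step.
print(mystery([1, 2, 3, 10]))  # avg = 4.0; only 10 > 4.0, so prints 1

# Abstract tracing asks a higher-level question without line-by-line steps,
# e.g. "can the result ever exceed len(xs)?" (no: count increments at most
# once per element), which exposes the relationship between the variables.
```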