<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Validating AI-Generated Code with Live Programming</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date when="2024-05-11">11 May 2024</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10538053</idno>
					<idno type="doi">10.1145/3613904.3642495</idno>
					
					<author>Kasra Ferdowsi</author><author>Ruanqianqian Lisa Huang</author><author>Michael B. James</author><author>Nadia Polikarpova</author><author>Sorin Lerner</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[AI-powered programming assistants are increasingly gaining popularity, with GitHub Copilot alone used by over a million developers worldwide. These tools are far from perfect, however, producing code suggestions that may be incorrect in subtle ways. As a result, developers face a new challenge: validating AI's suggestions. This paper explores whether Live Programming (LP), a continuous display of a program's runtime values, can help address this challenge. To answer this question, we built a Python editor that combines an AI-powered programming assistant with an existing LP environment. Using this environment in a between-subjects study (N = 17), we found that by lowering the cost of validation by execution, LP can mitigate over- and under-reliance on AI-generated programs and reduce the cognitive load of validation for certain types of tasks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>CodeWhisperer <ref type="bibr">[1]</ref>, and ChatGPT <ref type="bibr">[23]</ref>. These AI programming assistants are changing the face of software development, automating many traditional programming tasks, but at the same time introducing new tasks into the developer's workflow, such as prompting the assistant and reviewing its suggestions <ref type="bibr">[2,</ref><ref type="bibr">22]</ref>. Development environments have some catching up to do in order to provide adequate tool support for these new tasks.</p><p>In this paper, we focus on the task of validating AI-generated code, i.e., deciding whether it matches the programmer's intent. Recent studies show that validation is a bottleneck for AI-assisted programming: according to Mozannar et al. <ref type="bibr">[22]</ref>, it is the single most prevalent activity when using AI code assistants, and other studies <ref type="bibr">[3,</ref><ref type="bibr">21,</ref><ref type="bibr">32,</ref><ref type="bibr">36]</ref> report programmers having trouble evaluating the correctness of AI-generated code. Faced with difficulties in validation, programmers tend to either under-rely on the assistant, i.e., lose trust in it, or to over-rely, i.e., blindly accept its suggestions <ref type="bibr">[27,</ref><ref type="bibr">30,</ref><ref type="bibr">34,</ref><ref type="bibr">37]</ref>; the former can cause them to abandon the assistant altogether <ref type="bibr">[2]</ref>, while the latter can introduce bugs and security vulnerabilities <ref type="bibr">[26]</ref>. These findings motivate the need for better validation support in AI-assisted programming environments.</p><p>This paper investigates the use of Live Programming (LP) <ref type="bibr">[13,</ref><ref type="bibr">31,</ref><ref type="bibr">35]</ref> as a way to support the validation of AI-generated code. 
LP environments, such as Projection Boxes <ref type="bibr">[20]</ref>, visualize runtime values of a program in real-time without any extra effort on the part of the programmer. We hypothesize that these environments are a good fit for validation, since LP has been shown to encourage more frequent testing <ref type="bibr">[4]</ref> and facilitate bug finding <ref type="bibr">[41]</ref> and program comprehension <ref type="bibr">[5,</ref><ref type="bibr">7,</ref><ref type="bibr">8]</ref>. On the other hand, validation of AI-generated code is a new and unexplored domain in program comprehension that comes with its unique challenges, such as multiple AI suggestions for the programmer to choose from, and frequent context switches between prompting, validation, and code authoring <ref type="bibr">[22]</ref>, which cause additional cognitive load <ref type="bibr">[36]</ref>. Hence, the application of LP to the validation setting warrants a separate investigation.</p><p>To this end, we constructed a Python environment that combines an existing LP environment <ref type="bibr">[20]</ref> with an AI assistant similar to Copilot's multi-suggestion pane. Using this environment, we conducted a between-subjects experiment (N = 17) to evaluate how the availability of LP affects users' effectiveness and cognitive load in validating AI suggestions. Our study shows that Live Programming facilitates validation by lowering the cost of inspecting runtime values; as a result, participants were more successful in evaluating the correctness of AI suggestions and experienced lower cognitive load in certain types of tasks. Users can inspect the runtime behavior of the suggestion in Projection Boxes <ref type="bibr">[20]</ref>, which are updated continuously as the user edits the code.</p></div>
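The per-line value display that LP environments like Projection Boxes provide can be illustrated with a minimal sketch. The following is not Leap's or Projection Boxes' actual implementation; it is a generic demonstration, using Python's `sys.settrace` hook, of the kind of runtime-value capture such environments perform (the `trace_values` helper name is ours):

```python
import sys
from collections import defaultdict

def trace_values(func, *args):
    """Run func(*args), recording a snapshot of its local variables at
    every line it executes, keyed by line number (projection-box style)."""
    log = defaultdict(list)

    def tracer(frame, event, arg):
        # Only log lines belonging to the traced function itself.
        if event == "line" and frame.f_code is func.__code__:
            log[frame.f_lineno].append(dict(frame.f_locals))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always restore normal execution
    return result, dict(log)
```

A real LP environment would additionally re-run this capture on every edit and render the per-line snapshots next to the code, which is what makes the feedback "live".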
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">RELATED WORK</head><p>Validation of AI-Generated Code. A rapidly growing body of work analyzes how users interact with AI programming assistants. Studies show that programmers spend a significant proportion of their time validating AI suggestions <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">22]</ref>. Moreover, a large-scale survey <ref type="bibr">[21]</ref> indicates that 23% of its respondents have trouble evaluating the correctness of generated code, which echoes the findings of lab studies <ref type="bibr">[2,</ref><ref type="bibr">32]</ref> and a need-finding study <ref type="bibr">[36]</ref>, where participants report difficulties understanding AI suggestions and express a desire for better validation support. Barke et al. <ref type="bibr">[2]</ref> and Liang et al. <ref type="bibr">[21]</ref> find that programmers use an array of validation strategies, and the prevalence of each strategy is closely related to its time cost. Specifically, despite the help of execution techniques built into the IDE for validating AI suggestions <ref type="bibr">[30]</ref>, execution is used less often than quick manual inspection or type checking because it is more time-consuming <ref type="bibr">[2,</ref><ref type="bibr">21]</ref> and interrupts programmers' workflows <ref type="bibr">[36]</ref>. The lack of validation support designed for AI-assisted programming, as Wang et al. <ref type="bibr">[36]</ref> identify, leads to a higher cognitive load in reviewing suggestions. The high cost of validating AI suggestions, according to some studies <ref type="bibr">[27,</ref><ref type="bibr">34,</ref><ref type="bibr">37]</ref>, can lead to both under-reliance (lack of trust) and over-reliance (uncritically accepting wrong code) on the part of the programmer.</p><p>Comparatively fewer existing papers explore interface designs to support validation of AI-generated code: Ross et al. 
<ref type="bibr">[27]</ref> investigate a conversational assistant that allows programmers to ask questions about the code, while Vasconcelos et al. <ref type="bibr">[33]</ref> target over-reliance by highlighting parts of generated code that might need human intervention; our work is complementary to these efforts in that it focuses on facilitating validation by execution. Validation in Program Synthesis. Another line of related work concerns the validation of code generated by search-based (non-AI-powered) program synthesizers. Several synthesizers help users validate generated code by proactively displaying its outputs <ref type="bibr">[9,</ref><ref type="bibr">16,</ref><ref type="bibr">40]</ref> and intermediate trace values <ref type="bibr">[25]</ref>, although none of them use an LP environment. The only system we are aware of that combines LP and program synthesis is SnipPy <ref type="bibr">[11]</ref>, but it uses LP to help the user specify their intent rather than validate synthesized code. Live Programming. Live Programming (LP) provides immediate feedback on code edits, often in the form of visualizations of the runtime state <ref type="bibr">[13,</ref><ref type="bibr">31,</ref><ref type="bibr">35]</ref>. Some quantitative studies find that programmers with LP find more bugs <ref type="bibr">[41]</ref>, fix bugs faster <ref type="bibr">[18]</ref>, and test a program more often <ref type="bibr">[4]</ref>. Others find no effect on knowledge gain <ref type="bibr">[15]</ref> or efficiency in code understanding <ref type="bibr">[5]</ref>. Still, qualitative evidence points to the helpfulness of LP for program comprehension <ref type="bibr">[5,</ref><ref type="bibr">7,</ref><ref type="bibr">8]</ref> and debugging <ref type="bibr">[15,</ref><ref type="bibr">17]</ref>. 
In contrast to these studies, which evaluate the effectiveness of LP for comprehending and debugging human-written code, our work investigates its effectiveness for validating AI-generated code, a setting that comes with a number of previously unexplored challenges <ref type="bibr">[22,</ref><ref type="bibr">36]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">LEAP: THE TOOL USED IN THE STUDY</head><p>To study how Live Programming affects the validation of AI-generated code, we implemented Leap (Live Exploration of AI-Generated Programs), a Python environment that combines an AI assistant with LP. This section demonstrates Leap via a usage example and discusses its implementation. Example Usage. Naomi, a biologist, is analyzing some genome sequencing data using Python. As part of her analysis, she needs to find the most common bigram (i.e., two-letter sequence) in a DNA strand. <ref type="foot">1</ref> To this end, she creates a function dominant_bigram (line 3 in Fig. <ref type="figure">1</ref>); she has a general idea of what this function might look like, but she decides to use Leap to help translate her idea into code.</p><p>Naomi adds a docstring (line 5), which conveys her intent in natural language, and a test case (line 24), which will help her validate the code. With the cursor positioned at line 7, she presses a keyboard shortcut to ask for suggestions. Within seconds, a panel opens on the right containing five AI-generated code suggestions; Naomi quickly skims through all of them. The overall shape of Suggestion 3 looks most similar to what she has in mind: it first collects the counts of all bigrams into a dictionary, and then iterates through the dictionary to pick a bigram with the maximum count. Naomi tries this suggestion, pressing its Preview button; Leap inserts the code into the editor and highlights it (lines 8-18). As soon as the suggestion is inserted, Projection Boxes <ref type="bibr">[20]</ref> appear, showing runtime information at each line in the code. Inspecting intermediate values helps Naomi understand what the code is doing step by step. When she gets to line 18, she realizes that the dictionary actually has two dominant bigrams with the same count, and the code returns the last one. 
She realizes this is not what she wants: instead, she wants to select the dominant bigram that comes first alphabetically (ag in this case). One option Naomi has is to try other suggestions. She clicks on the Preview button for Suggestion 2; Leap then inserts Suggestion 2 into the editor, in place of the prior suggestion, and the Projection Boxes update instantly to show its behavior. Naomi immediately notices that Suggestion 2 throws an exception inside the second loop, so she abandons it and goes back to Suggestion 3, which got her closer to her goal.</p><p>To fix Suggestion 3, Naomi realizes that she can accumulate all dominant bigrams in a list, sort the list, and return the first element. She does not remember the exact Python syntax for sorting a list, so she tries different variations, including l = l.sort, l = l.sort(), l = sort(l), l = l.sorted(), and so on. Fortunately, Leap's support for LP allows her to get instant feedback on the behavior of each edit, so she iterates quickly to find the correct option: l = sorted(l). Note that Naomi's workflow for using Suggestion 3 (validation, finding bugs, and fixing bugs) relies on full LP support, and would not work in traditional environments like computational notebooks, which provide easy access to the final output of a snippet but not the intermediate values or immediate feedback on edits. Implementation. To generate code suggestions, Leap uses the text-davinci-003 model <ref type="bibr">[24]</ref>, the largest publicly available code-generating model at the time of our study. To support live display of runtime values (Fig. <ref type="figure">1</ref>), we built Leap on top of Projection Boxes, a state-of-the-art LP environment for Python <ref type="bibr">[20]</ref> capable of running in the browser. The code for Leap can be found at <ref type="url">https://bit.ly/leap-code</ref>. 
As the control condition for our study, we also created a version of Leap in which Projection Boxes are disabled; instead, the user runs the code explicitly by clicking a Run button and sees the output in a terminal-like Output Panel.</p></div>
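The paper shows Naomi's repaired code only in the Fig. 1 screenshot. As a concrete companion to the walkthrough above, here is a plausible Python sketch of the finished dominant_bigram; it is our reconstruction of Suggestion 3 plus Naomi's fix (collect tied bigrams into a list, sort, return the first), not the model's verbatim output:

```python
def dominant_bigram(dna: str) -> str:
    """Return the most frequent bigram in dna, resolving ties alphabetically."""
    # Suggestion 3's shape: count every two-letter window in a dictionary.
    counts = {}
    for i in range(len(dna) - 1):
        bigram = dna[i:i + 2]
        counts[bigram] = counts.get(bigram, 0) + 1
    best = max(counts.values())
    # Naomi's fix: gather all bigrams tied for the maximum count,
    # sort them, and return the alphabetically first one.
    tied = sorted(b for b, c in counts.items() if c == best)
    return tied[0]
```

For example, in a strand such as "catag" every bigram occurs once, so all are tied and the function returns "ag", the alphabetically first; the unfixed suggestion would instead return whichever tied bigram it happened to visit last.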
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">USER STUDY</head><p>We conducted a between-subjects study to answer the following research questions: RQ1) How does Live Programming affect over- and under-reliance in validating AI-generated code? RQ2) How does Live Programming affect validation strategies? RQ3) How does Live Programming affect the cognitive load of validating AI-generated code?</p><p>Tasks. Our study incorporates two categories of programming tasks, Fixed-Prompt and Open-Prompt tasks.</p><p>In Fixed-Prompt tasks, we provide participants with a fixed set of five AI suggestions that are intended to solve the entire problem. We curated the suggestions by querying Copilot <ref type="bibr">[12]</ref> and Leap with slight variations of the prompt. Fixed-Prompt tasks isolate the effects of Live Programming on validation behavior by controlling for the quality of suggestions. We created two Fixed-Prompt tasks, each with five suggestions: (T1) Bigram: Find the most frequent bigram in a given string, resolving ties alphabetically (the same task as in Sec. 3); (T2) Pandas: Given a pandas data frame with data on dogs of three size categories (small, medium, and large), compute various statistics, imputing missing values with the mean of the appropriate category. These tasks represent two distinct styles: Bigram is a purely algorithmic task, while Pandas focuses on using a complex API. Pandas has two correct AI suggestions (out of five) while Bigram has none, a realistic scenario that programmers encounter with imperfect models.</p><p>In Open-Prompt tasks, participants are free to invoke the AI assistant however they want. This task design is less controlled than Fixed-Prompt, but more realistic, thus increasing ecological validity. 
We used two Open-Prompt tasks: (T3) String Rewriting: parse a set of string transformation rules and apply them five times to a string; (T4) Box Plot: given a pandas data frame containing 10 experiment data records, create a matplotlib box plot of time values for each group, combined with a color-coded scatter plot. Both tasks are more complex than the Fixed-Prompt tasks, and could not be solved with a single interaction with the AI assistant. Participants and Groups. We recruited 17 participants; 5 self-identified as women, 10 as men, and 2 chose not to disclose. 6 were undergraduate students, 9 graduate students, and 2 professional engineers. Participants self-reported experience levels with Python and AI assistants: 2 participants used Python 'occasionally', 8 'regularly', and 7 'almost every day'; 7 participants declared they had 'never' used AI assistants, and 8 used such tools 'occasionally'.</p><p>There were two experimental groups: "LP" participants used Leap with Projection Boxes, as described in Fig. <ref type="figure">1</ref>; "No-LP" participants used Leap without Projection Boxes, instead executing programs in a terminal-like Output Panel. Participants completed both Fixed-Prompt tasks and one Open-Prompt task. We used block randomization <ref type="bibr">[10]</ref> to assign participants to groups while distributing evenly across task order and selection and balancing experience with Python and AI assistants across groups. The LP group had 8 participants, and No-LP had 9.</p><p>Procedure and Data. We conducted the study over Zoom as each participant used Leap in their web browser. Each session was recorded and included two Fixed-Prompt tasks (10 minutes each), two post-task surveys, one Open-Prompt task (untimed), one post-study survey, and a semi-structured interview. 
A replication package<ref type="foot">2</ref> provides the details of our procedure, tasks, and data collection.</p><p>For quantitative analysis, we performed closed-coding on video recordings of study sessions to determine each participant's subjective assessment of their success on the task; we matched this data against the objective correctness of their final code to establish whether they succeeded in accurately validating AI suggestions. We also measured task duration, the proportion of time the Suggestion Panel (Fig. <ref type="figure">1</ref>) was in focus, and participants' cognitive load (via five NASA Task Load Index (TLX) questions <ref type="bibr">[14]</ref>). We used Mann-Whitney U tests to assess all differences except for validation success, which we analyzed via Fisher's exact tests.</p><p>In addition, we collected qualitative data from both Fixed-Prompt and Open-Prompt tasks. We noted validation-related behavior and quotes, which we discussed in memoing meetings <ref type="bibr">[6]</ref> after the study. Through reflexive interpretation, we used category analysis <ref type="bibr">[39]</ref> to assemble the qualitative data into groups. We then revisited notes and recordings to iteratively construct high-level categories.</p></div>
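For readers unfamiliar with the Mann-Whitney U tests used throughout the results, the U statistic can be computed directly from two samples. This is a generic textbook sketch, not the study's analysis code:

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for sample xs versus ys: the number of
    pairs (x, y) in which x exceeds y, counting ties as one half."""
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x in xs for y in ys
    )
```

With samples [3, 4] and [1, 2], every pair favors the first sample, so U = 4. A p-value is then obtained from the distribution of U under the null hypothesis, e.g., via a library routine such as scipy.stats.mannwhitneyu.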
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">RQ1: Over-And Under-Reliance on AI</head><p>To investigate if Live Programming affects over- and under-reliance, we measured whether participants successfully validated the AI suggestions in the Fixed-Prompt tasks, as described below. We also compared task completion times and participants' confidence in their solutions (collected through post-task surveys). However, neither result was significantly different between the two groups, so we do not discuss them below.<ref type="foot">3</ref> We found six instances of unsuccessful validation, all from the No-LP group. As described in Sec. 4, we compared subjective and objective assessments of code correctness on the two Fixed-Prompt tasks, which resulted in four outcomes: (1) Complete and Accurate, where the participant submitted a correct solution within the task time limit; (2) Complete and Inaccurate, where the participant submitted an incorrect solution without recognizing the error; (3) Timeout after Validation, where the participant formed an accurate understanding of the correctness of the suggestions but reached the time limit before fixing the error in their chosen suggestion; and (4) Timeout during Validation, where the participant reached the time limit before they had finished validating the suggestions. We consider (1) and (3) to be instances of successful validation, (2) to be an instance of over-reliance on the AI suggestions, and (4) to be an instance of under-reliance, as the participant did not successfully validate the suggestions in the given time. As Fig. <ref type="figure">2</ref> shows, we found three instances of over-reliance in the Bigram task and three instances of under-reliance in the Pandas task, all from the No-LP group, though the overall between-group difference was not significant (p = .206 for both tasks).</p><p><ref type="foot">2</ref> <ref type="url">https://bit.ly/leap-study-materials</ref></p><p><ref type="foot">3</ref> In median times, the LP group completed the Pandas task faster by 35 seconds (p = .664, U = 31). For Bigram, LP participants were slower by 3 minutes and 51 seconds (p = .583, U = 42), though this difference changes to faster by 10 seconds if we exclude those who solved the task incorrectly. For Pandas, both groups had median confidence-in-correctness ratings of "Agree" on seen inputs (p = .784, U = 30) and "Neutral" on unseen inputs (p = .795, U = 33). For Bigram, the LP group's median confidence rating on seen inputs was "Agree", while the No-LP group's was "Strongly Agree" (p = .097, U = 19.5); on unseen inputs, the LP group's median was "Neutral", and the No-LP group's was "Agree" (p = .201, U = 22.5).</p><p>Figure <ref type="figure">2</ref>: Success in validating AI suggestions across groups for Fixed-Prompt tasks. "Completed" means the participant submitted a solution they were satisfied with by the time limit, and "Timeout" means they did not. We deem the validation successful if a participant submitted a correct solution (dark blue) or timed out when attempting to fix the correctly identified bugs in their chosen suggestion (light blue).</p><p>Participants with over-reliance did not inspect enough runtime behavior. The three No-LP participants with over-reliance in Bigram (P5, P12, P15) made a similar mistake: they accepted one of the mostly-correct suggestions (similar to Suggestion 3 in Sec. 3) and failed to notice that ties were not resolved alphabetically. Among the three participants, P5 did not run their code at all. P12 and P15 both tested only one suggestion on the given input and failed to notice the presence of two bigrams of the same count (and the fact that other suggestions returned different results). In addition, P15 cited "reading the comments on what it was doing" as a key factor for choosing the suggestion they did. 
That suggestion began with a comment stating that it resolved ties alphabetically, but the following code did not do so.</p><p>Participants with under-reliance lacked affordances for inspecting runtime behavior. The three No-LP participants who under-relied on AI suggestions (P7, P9, P15) tried to use runtime values for validation but struggled with doing so. P9 previewed and ran multiple suggestions but did not add any print statements to the code, and so they could only see the output of one of the suggestions, which ended in a print statement. P15 ran all suggestions and did add a print statement to each to inspect the final return value, but the need to change the print statement and re-run each time made this process difficult, and they lost track of which suggestions they considered the most promising, saying "I forgot which ones looked decent." Finally, P7's strategy was to print the output of subexpressions from various suggestions in order to understand their behavior and combine them into a single solution, but this was time-consuming, so they did not finish.</p></div>
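The tie-handling mistake that the over-relying participants overlooked is easy to reproduce. The following is our reconstruction of the shape of the mostly-correct suggestions, not the model's verbatim output; it returns *a* most frequent bigram while silently ignoring the alphabetical tie-breaking requirement:

```python
def dominant_bigram_buggy(dna: str) -> str:
    """Mostly-correct suggestion: picks a max-count bigram, but does
    not resolve ties alphabetically as the task requires."""
    counts = {}
    for i in range(len(dna) - 1):
        bigram = dna[i:i + 2]
        counts[bigram] = counts.get(bigram, 0) + 1
    best, best_count = None, 0
    for bigram, count in counts.items():
        if count >= best_count:  # ">=" silently keeps the *last* tied bigram
            best, best_count = bigram, count
    return best
```

On an input like "agat", all three bigrams are tied; dictionary insertion order makes this return "at", whereas the task requires the alphabetically first tied bigram, "ag". An input without ties hides the bug entirely, which is why testing a single suggestion on a single input, as P12 and P15 did, was not enough to catch it.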
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">RQ2: Validation Strategies</head><p>Our participants had access to two validation strategies: examination (reading the code) and execution (inspecting runtime values). The general pattern we observed was that participants first did some amount of examination inside the Suggestion Panel, ranging from a quick glance to thorough reading, and then proceeded to preview zero or more suggestions, performing further validation by execution inside the editor. To this end, No-LP participants in most tasks ran the code and added print statements for both final and intermediate values; LP participants in all tasks inspected both final and intermediate runtime values in Projection Boxes (by moving the cursor from line to line to bring different boxes into focus), and occasionally added print statements to see variables not shown by default. Below we discuss notable examples of validation behavior, as well as differences between the two groups and across tasks.</p><p>Figure 3: Percentage of time spent in the Suggestion Panel across the two groups for Fixed-Prompt tasks.</p><p>LP participants spent less time reading the code. We use the time the Suggestion Panel was in focus as a proxy for examination time; Fig. 3 shows this time as a percentage of the total task duration. The No-LP group spent more time in the Suggestion Panel compared to LP for both Fixed-Prompt tasks. The difference is significant in the Pandas task (p = .02, U = 11, median LP = 14.05%, median No-LP = 30.47%) but not in Bigram (p = .14, U = 20, median LP = 24.70%, median No-LP = 36.57%). We also collected this data for the Open-Prompt tasks, although it should be interpreted with caution due to the unstructured nature of the tasks (e.g., participant engagement with the assistant and suggestion quality varied). 
The results are consistent with the Fixed-Prompt tasks (i.e., No-LP participants spent more time in the Suggestion Panel), but the difference is not significant, and the effect in Box Plot is very small (p = .14, U = 3.5, median LP = 6.25%, median No-LP = 15.49% for String Rewriting; p = .67, U = 6, median LP = 8.10%, median No-LP = 8.70% for Box Plot).</p><p>Participants relied on runtime values more in API-heavy, one-off tasks. According to Fig. 3, both groups spent more time examining the code in Bigram, while in Pandas they jumped to execution more immediately (median Pandas = 16.96%, median Bigram = 31.67%, p = .04, U = 206). This difference in validation strategies between the two tasks was also reflected in the interviews. For example, P1 described their strategy for Pandas as follows: "I didn't look too closely in the actual code, I was just looking at the runtime values on the side." Instead, in Bigram, participants cared more about the code itself, preferring suggestions based on their expected algorithm, data structure, or style (e.g., P15 "was really looking for the dictionary aspect"), with the most popular attribute being "short"/"readable", cited by 10 out of 17 participants. One explanation participants gave for the difference in behavior is that Pandas is an API-heavy task, and when dealing with unfamiliar APIs, reading the code is just not very helpful: "When it's using more jargony stuff that doesn't translate directly into words in your brain, then seeing the preview makes it clearer" (P3). Another explanation they gave is that Pandas was perceived by the participants as a one-off task, i.e., it only needed to work on the one specified input, whereas Bigram was perceived as general, i.e., it needed to work on "any sort of string [. . . ] not only [. . . ] the specific string that was tested" (P3); this was not explicit in the instructions, but in retrospect it is a reasonable assumption, given the problem domains and structure of the starter code. 
On the other hand, some LP participants conjectured that with more familiarity with Live Programming, they would rely on runtime values more, even in tasks like Bigram: "If I were to use this tool again I would preview more immediately, just because I think I was very focused on whether it produced how I would solve the problem vs. whether it solved the problem correctly" (P4).</p><p>LP participants benefited from visualizing intermediate values. We looked into the validation strategies used in Bigram to identify the tie-resolution issue in AI suggestions (excluding P17, because they wrote the code from scratch). In the input we provided, it was hard to identify the most common bigram at a glance, which made it difficult to validate suggestions just by looking at the final result. LP participants used liveness features for validation and debugging. For validation, LP participants made use of full liveness, i.e., the ability to see the immediate effects of their edits. Five participants in Pandas added auxiliary calculations to double-check the correctness of the final output, e.g., computing the mean of specific cells in the input table and comparing it to the output table. When it comes to debugging, LP participants made multiple rounds of trial and error guided by liveness. In fact, the example in Sec. 3 was inspired by P4's debugging process in the Bigram task. Also, in Box Plot, P1 made many repeated edits in an AI suggestion to tune the placement of a label, guided by error messages and incorrect outputs, to figure out the precise usage of an unfamiliar API call. In the interview, they noted: "I was definitely using the projections [...] as I was editing the suggestions to see if my intended changes actually were followed through. "</p></div>
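The Box Plot task (T4) that P1 worked on can be approximated in a few lines of matplotlib. The data frame below is invented for illustration; the study's actual task files may differ:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical experiment records: time measurements per group.
df = pd.DataFrame({
    "group": ["A", "A", "A", "B", "B", "B"],
    "time": [1.2, 1.5, 1.1, 2.3, 2.1, 2.6],
})

groups = sorted(df["group"].unique())
data = [df.loc[df["group"] == g, "time"] for g in groups]

fig, ax = plt.subplots()
ax.boxplot(data)  # one box per group, at x = 1, 2, ...
ax.set_xticks(range(1, len(groups) + 1), groups)
# Color-coded scatter of the individual measurements over each box.
for i, g in enumerate(groups, start=1):
    ys = df.loc[df["group"] == g, "time"]
    ax.scatter([i] * len(ys), ys, label=g)
ax.set_ylabel("time")
ax.legend()
```

Each statement here changes the final plot in a fairly direct way, which is consistent with the observation in Sec. 6 that Box Plot was amenable to validation from the final output alone.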
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">RQ3: Cognitive Load in Validation</head><p>LP participants experienced significantly lower cognitive load in the Pandas task but not the Bigram task. In Pandas, LP participants experienced significantly lower cognitive load in four out of five aspects of NASA-TLX <ref type="bibr">[14]</ref>: mental demand (p = .039, U = 14.5), performance (p = .048, U = 15.5), effort (p = .015, U = 11), and frustration (p = .0004, U = 0). We find no significant differences in responses to Bigram, but LP participants reported slightly higher performance measures (median LP = 3, median No-LP = 2); on this TLX scale, higher ratings indicate a stronger sense of failure.</p><p>Participants found LP helpful in distinguishing between multiple suggestions. Participants from both the No-LP (P9, P14, P17) and LP (P3, P16) groups commented on the utility of seeing multiple suggestions at once: "[Seeing multiple suggestions] gave me different ways to look at the code and gave me different ideas" (P9) and "multiple suggestions gave points of comparison that were useful" (P14). However, some No-LP participants (P6, P7, P15, P17) said they found the suggestions hard to distinguish. They noted the difficulty of differentiating just by reading the code because "the suggestions [were] all almost the same thing" (P7), and observed that "the tool did not really help with choosing between suggestions" (P15). In comparison, some in the LP group (P1, P16) specifically commented that Live Programming was helpful in distinguishing and choosing between multiple code suggestions; P1 said: "Being able to preview, edit, and look at the projection boxes before accepting a snippet was very helpful when choosing between multiple suggestions." 
As far as we are aware, this is a new application of Live Programming, specific to AI programming assistants and not previously explored in the Live Programming literature.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">DISCUSSION</head><p>Live Programming lowers the cost of validation by execution.</p><p>Although both LP and No-LP participants had access to runtime values as a validation mechanism, those without LP needed to examine the code to decide which values to print, add the print statements, run the code, and match each line in the output to the corresponding line in the code. If they wanted to inspect a different suggestion, they had to repeat this process from the start. Meanwhile, LP participants could simply click on a suggestion to preview it and get immediate access to all the relevant runtime information, easily switching between suggestions as necessary. In other words, LP lowers the cost, in terms of both time and mental effort, of access to runtime values. As a result, we saw that LP participants relied on runtime values more for validation: they spent less time examining the code in general (significantly so for the Pandas task) and more often used intermediate values to find bugs in Bigram (Sec. 5.2). Our findings are consistent with prior work <ref type="bibr">[2,</ref><ref type="bibr">21]</ref>, which demonstrated that programmers more often use validation strategies with lower time costs. Hence, by lowering the cost of access to runtime values, Live Programming promotes validation by execution.</p><p>The lower cost of validation by execution prevents over- and under-reliance. As discussed in Sec. 5.1, we found six instances of unsuccessful validation in our study, all from the No-LP group: over-relying on AI suggestions in the Bigram task, and under-relying in Pandas. We attribute these failures to the high cost of validation by execution: those who over-relied did not inspect the runtime behavior of the suggestions in enough detail, while those with under-reliance lacked the affordances to do so effectively, and so ran out of time before they could validate the suggestions. 
Our results echo prior findings <ref type="bibr">[34]</ref> that relate the cost of a validation strategy to its effectiveness in reducing over-reliance on AI. Prior work has also shown <ref type="bibr">[11,</ref><ref type="bibr">36]</ref> that programmers often struggle to form an appropriate level of trust in code synthesizers, whether AI-based or not; our results suggest an important new role for Live Programming in addressing this challenge. We conclude that the lower cost of validation by execution in Live Programming leads to more accurate judgments of the correctness of AI-generated code.</p><p>Validation strategies depend on the task. Sec. 5.2 shows that participants overall spent significantly more time examining the code in Bigram than in Pandas and also paid more attention to code attributes in the former. Participants explained the difference in their validation strategies by two factors: (1) Pandas contained unfamiliar API calls, the meaning of which they could not infer from the code alone; and (2) they perceived Pandas as a one-off task, which only had to work on the given input. We conjecture that (1) is partly due to our participants being LP novices: as they get more used to the environment, they are likely to rely on previews more, even if they are not forced into it by an unfamiliar API (as P4 mentioned in Sec. 5.2). (2), though, is more fundamental: when dealing with a general task, correctness is not all that matters; code quality becomes important as well, and LP does not help with that.</p><p>In Open-Prompt tasks, code examination made up a smaller share of the overall task duration, because in these tasks participants spent a significant amount of time on activities besides validation (e.g., decomposing the problem and crafting prompts). It might seem surprising, however, that we did not see any difference in examination time between the two groups in Box Plot, which is an API-heavy, one-off task similar to Pandas. 
This might be because, in Box Plot, the cost of validation by execution was already low for No-LP participants: this task did not require inspecting intermediate values, because the effects of each line of code were reflected in the final plot in a compositional manner (i.e., it was easy to tell what each line of code was doing just by looking at the final plot).</p><p>In conclusion, Live Programming does not completely eliminate the need for code examination, but it reduces it in tasks amenable to validation by execution.</p><p>Live Programming lowers the cognitive load of validation by execution. In Pandas, LP participants experienced lower cognitive load in four out of five TLX categories (Sec. 5.3). This confirms our hypotheses that LP lowers the cost of validation by execution, and that Pandas is a task amenable to such validation. More specifically, we conjecture that, by automating away the process of writing print statements, LP reduces workflow interruptions, which were identified as one of the sources of increased cognitive load in reviewing AI-generated code <ref type="bibr">[36]</ref>.</p><p>In Bigram, however, we did not observe a similar reduction in cognitive load; in fact, LP participants reported higher cognitive load in the "performance" category (i.e., they perceived themselves as less successful). Our interpretation is that the cognitive load in this task was dominated by debugging rather than validation, and whereas all participants in the LP group engaged in debugging, only two-thirds of the No-LP group did so. Moreover, the higher "performance" ratings from the LP group came from those who ran out of time trying to fix the code, and hence were aware that they had failed. These findings show that Live Programming by itself does not necessarily help with debugging a faulty suggestion. As we saw in Sec. 5.2, it can be helpful when the user has a set of potential fixes in mind, which they can quickly try out and get immediate feedback on. 
But when the user does not have potential fixes in mind, they need to rely on other tools, such as searching the web or using chat-based AI assistants.</p><p>From these findings, we conclude that Live Programming lowers the cognitive load of validating AI suggestions when the task is amenable to validation by execution.</p></div>
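For contrast, the following sketch illustrates the manual validation-by-execution workflow available without LP: inserting print statements to expose the intermediate values of an AI-suggested function, then matching the output back to the code. The code is hypothetical, in the spirit of the Bigram task, and is not the study's actual task code:

```python
# Hypothetical sketch of validation by execution without LP: the
# programmer must decide which values to print, edit the code, run it,
# and read the output back against the source.

def bigrams(words):
    pairs = list(zip(words, words[1:]))  # AI-suggested pairing logic
    print("pairs:", pairs)               # added manually to validate
    counts = {}
    for p in pairs:
        counts[p] = counts.get(p, 0) + 1
    print("counts:", counts)             # added manually to validate
    return counts

result = bigrams(["a", "b", "a", "b"])
```

Inspecting a different suggestion means repeating this edit-run-read loop from the start, which is precisely the per-suggestion cost that clickable live previews eliminate.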
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">CONCLUSION AND FUTURE WORK</head><p>We investigated an application of Live Programming in the domain of AI-assisted programming, finding that LP can reduce over- and under-reliance on AI-generated code by lowering the cost of validation by execution. Our work highlights new benefits of LP specific to AI-assisted programming, such as building appropriate trust in the assistant and helping to choose between multiple suggestions. Our study is necessarily limited in scope: we focused on self-contained tasks due to LP's limited support for complex programs <ref type="bibr">[20,</ref><ref type="bibr">31]</ref> and its need for small demonstrative inputs <ref type="bibr">[28]</ref>. We hope that our findings inform future studies on code validation and motivate further research into AI-LP integration. To that end, we highlight key opportunities below.</p><p>To offer liveness, LP places several burdens on the user. The user must provide a complete executable program and a set of test cases, and then look through potentially large runtime traces for the relevant information. AI may alleviate these burdens by filling in missing runtime values <ref type="bibr">[29]</ref> for incomplete programs, generating test cases <ref type="bibr">[19,</ref><ref type="bibr">38]</ref>, and predicting the most relevant information to display at each program point. Looking beyond the validation of newly generated code, there are also opportunities for AI-LP integration in debugging and code repair <ref type="bibr">[38]</ref>. 
In combination, AI-LP would tighten the feedback loop of querying and repairing AI-generated code: users could validate code via LP, request repair using the runtime information from LP <ref type="bibr">[11]</ref>, and further validate the repair in LP.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>This is one of the programming tasks from our user study, and each of Naomi's interactions with Leap has been observed in some of our participants.</p></note>
		</body>
		</text>
</TEI>
