<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>ChatDBG: Augmenting Debugging with Large Language Models</title></titleStmt>
			<publicationStmt>
				<publisher>ACM</publisher>
				<date>06/19/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10632177</idno>
					<idno type="doi">10.1145/3729355</idno>
					<title level='j'>Proceedings of the ACM on Software Engineering</title>
<idno>2994-970X</idno>
<biblScope unit="volume">2</biblScope>
<biblScope unit="issue">FSE</biblScope>					

					<author>Kyla H Levin</author><author>Nicolas van Kempen</author><author>Emery D Berger</author><author>Stephen N Freund</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<p>Debugging is a critical but challenging task for programmers. This paper proposes ChatDBG, an AI-powered debugging assistant. ChatDBG integrates large language models (LLMs) to significantly enhance the capabilities and user-friendliness of conventional debuggers. ChatDBG lets programmers engage in a collaborative dialogue with the debugger, allowing them to pose complex questions about program state, perform root cause analysis for crashes or assertion failures, and explore open-ended queries like why is x null?. To handle these queries, ChatDBG grants the LLM autonomy to take the wheel: it can act as an independent agent capable of querying and controlling the debugger to navigate through stacks and inspect program state. It then reports its findings and yields back control to the programmer. By leveraging the real-world knowledge embedded in LLMs, ChatDBG can diagnose issues identifiable only through the use of domain-specific reasoning. Our ChatDBG prototype integrates with standard debuggers including LLDB and GDB for native code and Pdb for Python. Our evaluation across a diverse set of code, including C/C++ code with known bugs and a suite of Python code including standalone scripts and Jupyter notebooks, demonstrates that ChatDBG can successfully analyze root causes, explain bugs, and generate accurate fixes for a wide range of real-world errors. For the Python programs, a single query led to an actionable bug fix 67% of the time; one additional follow-up query increased the success rate to 85%. ChatDBG has seen rapid uptake; it has already been downloaded more than 75,000 times.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Debuggers help programmers identify and fix bugs by letting them investigate program state and navigate program execution. Debuggers for mainstream languages, including GDB <ref type="bibr">[39]</ref> and LLDB <ref type="bibr">[27]</ref> (for C, C++, and Rust), JDB (for Java), Pdb (for Python), and the Chrome or Firefox debuggers (for JavaScript), generally provide the same functionality. In particular, most debuggers support observing program execution via tracing and reporting when a program reaches a given line or function of source code; interrupting execution and returning control to the debugger when the program reaches a given line or function via breakpoints, when a particular condition is true via conditional breakpoints, or when a variable changes via watchpoints (a.k.a. data breakpoints); inspecting local variables, globals, heap objects, and backtraces of the call stack; and resuming program execution line-by-line (single-step) or at the granularity of function calls.</p><p>Debuggers can be helpful, but finding and fixing software defects remains a deeply challenging and time-consuming task <ref type="bibr">[7,</ref><ref type="bibr">20,</ref><ref type="bibr">47]</ref>. Programmers must still reason about program behavior to ascertain what went wrong. They must formulate and test hypotheses about program execution, they must read and understand code they may not have written, and they must pore over potentially voluminous information. Such information includes lengthy executions, large amounts of program data, and many stack frames that potentially span multiple threads.</p><p>This paper introduces the ChatDBG AI-powered debugger assistant. ChatDBG integrates into and significantly extends the functionality of standard debuggers. 
ChatDBG builds on the insight that large language models (LLMs), such as OpenAI's GPT-4 <ref type="bibr">[34]</ref>, enable a debugger to leverage insights and intuition from the vast real-world knowledge embedded in LLMs. This knowledge enables ChatDBG to fix classes of issues that depend on logical thinking and domain-specific reasoning beyond the ability to write and debug programs. For example, Figure <ref type="figure">2</ref> illustrates the use of ChatDBG to debug a program by leveraging a knowledge of statistics that cannot be gleaned from the program itself.</p><p>A debugger integrated with ChatDBG continues to provide its full range of functionality but also lets programmers engage in debugging dialogs where they can ask high-level questions like 'why is x null here?' or 'why isn't this value what I expected?'. The question can be as simple as 'why?' if a program has crashed or failed an assertion. To answer such queries, ChatDBG orchestrates a conversation with an LLM. A key feature of ChatDBG is that it grants autonomy to the LLM to "take the wheel" and act as an independent agent <ref type="bibr">[10,</ref><ref type="bibr">42]</ref> while answering the programmer's queries. Specifically, the LLM issues "function calls" <ref type="bibr">[33]</ref> to run commands in the underlying debugger to investigate program state, execute code, or obtain source code. The results of those calls are sent back to the LLM to use in constructing its response. After answering a query, control is returned to the programmer, who may then enter additional commands or chat messages.</p><p>Our prototype of ChatDBG integrates into three widely used debuggers: GDB, LLDB, and Pdb. Our evaluation presents a range of case studies demonstrating that ChatDBG improves significantly on existing debuggers. 
On a suite of unpublished Python scripts and Jupyter notebooks written by undergraduate students, one or two queries are sufficient for ChatDBG to properly diagnose and fix defects 85% of the time, typically at a cost well under $0.20 USD. ChatDBG is also effective at identifying causes and providing fixes for a range of real-world bugs in C/C++ code.</p><p>This paper makes the following contributions:</p><p>&#8226; It introduces ChatDBG, an AI-powered debugger assistant that enables large language models to "take the wheel" and control the debugger via agentic reasoning.</p><p>&#8226; It describes the implementation of our ChatDBG prototype.</p><p>&#8226; It presents an evaluation of ChatDBG that demonstrates its significant advantages over existing debugger functionality.</p><p>Our evaluation shows that ChatDBG is broadly applicable to many domains and programming languages, and we expect it to be particularly useful for novice programmers, who often lack the experience to effectively use debuggers. ChatDBG is also useful for experienced programmers, who can augment debugging sessions with ChatDBG's reasoning capabilities in a conversational and interactive way.</p><p>Source code for bootstrap.py 1 from datascience import * 2 from ds101 import * 3 4 def make_marble_sample(): 5 table = Table().read_table("marble-sample.csv") 6 return table.column("color")</p><p>Fig. <ref type="figure">1</ref>. An example program containing several bugs (&#167;2). It is supposed to create an array of marble colors, compute the proportions of blue marbles in resamples of that array, and assert that their mean is about 0.7, the proportion for the array.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Overview</head><p>This section illustrates ChatDBG's features and ability to assist in debugging the program in Figure <ref type="figure">1</ref>. That program is a distillation of real errors encountered by students in an introductory data science lab. It creates an array observed_marbles representing the colors of marbles (red or blue) in a sample stored in a file. It then calls bootstrap_statistic to create same-sized resamples of that array. That function computes a statistic for each resample and returns an array of those statistics. In this case, the statistic is proportion_blue, the proportion of blue marbles. Given a sufficiently large number of trials, the mean of the resamples' statistics should be close to 0.7, the proportion of blue marbles in the original sample <ref type="bibr">[6]</ref>.</p><p>The program fails the assertion in resampled_stats, and Figure <ref type="figure">2</ref> illustrates a debugging session. To try to figure out what went wrong, the user issues the Pdb command p num_trials to view the value of that variable. Continuing debugging with existing tools would likely involve issuing additional commands and examining data files, source code, and library documentation. With ChatDBG, the user instead starts a dialog with the debugger, asking 'why doesn't stats have 5 elements?' While constructing the answer (in blue), the LLM takes the wheel and directly issues debugger commands (yellow). These include standard Pdb commands and a ChatDBG-specific info command for accessing the source code and docstrings for any user-written code, as well as the docstrings for library code (which we assume is correct and not the root cause of errors).</p><p>ChatDBG identifies and corrects the root cause: proportion_blue incorrectly computes the desired statistic. 
When ChatDBG cannot identify the root cause, it suggests further debugging steps and control is returned to the user, who may continue the chat, issue further debugger commands, or both. Figure <ref type="figure">3</ref> illustrates this scenario, where a version of bootstrap.py with the corrected proportion_blue function fails the assertion on line 21.</p></div>
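<div xmlns="http://www.tei-c.org/ns/1.0"><p>The statistical point behind the second failure can be checked with a small simulation. The following is a minimal sketch, not the code from Figure 1: bootstrap_statistic here is a hypothetical stand-in for the course library function of the same name, the 30-marble sample (21 blue, 9 red) is an assumption inferred from the values printed in the debugging sessions, and spread_of_mean is an illustrative helper. It compares how much the mean of the resampled proportions varies when 5 trials are used versus 1000.</p>
<p>
```python
import random
import statistics


def proportion_blue(sample):
    # Fraction of marbles labelled "B" (blue): the corrected statistic.
    return sum(1 for color in sample if color == "B") / len(sample)


def bootstrap_statistic(observed_sample, compute_statistic, num_trials, rng):
    # Hypothetical stand-in for the library function: draw num_trials
    # same-sized resamples (with replacement), one statistic per resample.
    size = len(observed_sample)
    return [
        compute_statistic(rng.choices(observed_sample, k=size))
        for _ in range(num_trials)
    ]


def spread_of_mean(observed_sample, num_trials, repeats, seed):
    # Repeat the whole experiment several times and report how much the
    # mean of the bootstrap statistics varies from run to run.
    rng = random.Random(seed)
    means = [
        statistics.fmean(
            bootstrap_statistic(observed_sample, proportion_blue, num_trials, rng)
        )
        for _ in range(repeats)
    ]
    return statistics.pstdev(means)


# Assumed sample: 30 marbles, 70% of them blue.
observed_marbles = ["B"] * 21 + ["R"] * 9

few = spread_of_mean(observed_marbles, num_trials=5, repeats=100, seed=0)
many = spread_of_mean(observed_marbles, num_trials=1000, repeats=100, seed=0)
print(f"spread with 5 trials:    {few:.4f}")
print(f"spread with 1000 trials: {many:.4f}")
```
</p>
<p>With only five trials the mean routinely lands several percentage points away from 0.7, so an exact-style check such as np.isclose is likely to fail; with many trials the spread shrinks roughly with the square root of the number of trials.</p></div>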
<div xmlns="http://www.tei-c.org/ns/1.0"><head>$ chatdbg bootstrap.py</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Traceback (most recent call last):</head><p>File "/python3.11/site-packages/ipdb/__main__.py", ... pdb._run(stdlib_pdb._ScriptTarget(mainpyfile)) File "/python3.11/pdb.py", line 1659, in _run self.run(target.code) File "/python3.11/bdb.py", line 600, in run exec(cmd, globals, locals) File "&lt;string&gt;", line 1, in &lt;module&gt; File "./bootstrap.py", line 19, in &lt;module&gt; stats = resampled_stats(observed_marbles, 5) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "./bootstrap.py", line 15, in resampled_stats assert len(stats) == num_trials ^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError Uncaught exception. Entering post mortem debugging Running cont or step will restart the program</p><p>(ChatDBG) p num_trials 5</p><p>(ChatDBG) Why doesn't stats have 5 elements?</p><p>(ChatDBG) p len(stats) 150</p><p>(ChatDBG) info bootstrap_statistic bootstrap_statistic(observed_sample, compute_statistic, num_trials) Creates num_trials resamples of the initial sample. Returns an array of the provided statistic for those samples. * observed_sample: the initial sample, as an array. * compute_statistic: a function that takes a sample as an array and returns the statistic for that sample. * num_trials: the number of bootstrap samples to create. You MUST assume that bootstrap_statistic is specified and implemented correctly.</p><p>(ChatDBG) p proportion_blue &lt;function proportion_blue at 0x1050b00e0&gt;</p><p>(ChatDBG) info proportion_blue 8 def proportion_blue(sample): 9 return sample</p><p>From the information we have, the stats variable in resampled_stats is supposed to have 5 elements, but it actually has 150 elements... The Python debugger output shows a reference to a variable or function called proportion_blue, which is used as the compute_statistic parameter for bootstrap_statistic. Upon looking at the proportion_blue function's definition, it is found that the function simply returns the input sample without computing a statistical value. 
This means that for each of the num_trials (which is 5), bootstrap_statistic is adding the entire sample to the stats array without processing the sample to obtain a single statistic value... Recommendation To fix this issue, the proportion_blue function must be corrected to perform the intended computation of returning the proportion of blue marbles in the sample: def proportion_blue(sample): return np.mean(sample == "B")</p><p>Fig. 2. A debugging session with ChatDBG (&#167;2). For brevity, we elide repetitive or unimportant parts. ChatDBG catches the assertion failure on line 15 and prompts the user to enter commands (bold and highlighted in gray). Commands issued by ChatDBG when it takes the wheel are highlighted in yellow, and its response is highlighted in blue.</p><p>$ chatdbg bootstrap.py Traceback (most recent call last): File "./bootstrap.py", line 21, in &lt;module&gt; assert np.isclose(np.mean(stats), 0.7) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError ...</p><p>(ChatDBG) Why is the mean of stats not close to 0.7?</p><p>(ChatDBG) p np.mean(stats) 0.7466666666666667 ...</p><p>(ChatDBG) info resampled_stats ...</p><p>(ChatDBG) info make_marble_sample ....</p><p>(ChatDBG) info bootstrap_statistic ... ...</p><p>Recommendation 1. Check the CSV file marble-sample.csv to verify the proportion of blue to red marbles and ensure it is close to 70% blue marbles. 2. Review the implementation of the proportion_blue function... 3. Verify that the proportion_blue function ... is available in the current scope or imported correctly.</p><p>(ChatDBG) p proportion_blue(observed_marbles) 0.7</p><p>(ChatDBG) That value is correct. ...</p><p>(ChatDBG) p stats array([0.66666667, 0.8 , 0.63333333, 0.76666667, 0.86666667])</p><p>At this point, the most straightforward explanation for the error is that the number of trials is not enough to accurately reflect the expected mean due to random sampling effects. 
Increasing the number of trials would likely result in an average statistic that is closer to the expected value of 0.7, following the Law of Large Numbers.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recommendation</head><p>Increase the number of bootstrap trials significantly (e.g., from 5 to something like 1000 or more) to reduce the variability of the mean...</p><p>Fig. <ref type="figure">3</ref>. A debugging session demonstrating how ChatDBG incorporates real-world knowledge (&#167;2). After correcting proportion_blue in Figure <ref type="figure">1</ref>, the program fails on line 21 because the mean proportion of blue marbles in the resamples is not the expected value. ChatDBG identifies high variance resulting from the small number of trials as the root cause.</p><p>The user asks why the mean of stats is not close to 0.7, and ChatDBG's initial response suggests examining whether 0.7 is the appropriate expected value. To verify this, the user then computes the proportion of blue marbles with a debugger command and tells ChatDBG that 0.7 is indeed the correct value. In its follow-up, ChatDBG points to the low number of trials (five) as the issue. The LLM drew this correct conclusion without seeing any discussion of trial size or variance in any program state, code, or documentation encountered during the chat. A powerful aspect of ChatDBG is its ability to exploit real-world knowledge in its analyses (here, the fact that bootstrapping depends on large numbers of resamples) without specific instruction or user intervention.</p><p>Table <ref type="table">1</ref>. Debugger features and their dates of introduction (&#167;3). Most key features have been around for decades. By integrating into modern debuggers (GDB, LLDB, and Pdb), ChatDBG inherits all of their features while significantly extending them with functionality to explain bugs and their root causes, propose fixes, and answer arbitrary natural-language queries over program state. 
(An asterisk or year in italics means the feature is limited in functionality, performance, or depends on specific hardware support.)</p><p>Columns: System and Date; Single Step; Stack Navigation; Breakpoints (BPs); Conditional BPs; Source Level; Trace; Display State; Eval. Code; Watchpoints; Explain Bugs; Propose Fixes; Open Queries. First row: DDT [19], 1961.</p><p>Table <ref type="table">1</ref> presents an overview of previous interactive debuggers, together with their features. The first interactive debugger, DDT, introduced breakpoints, single-stepping, and stack navigation in 1961 <ref type="bibr">[19]</ref>. By 1979, the Mesa debugger had most key features of modern debuggers, including source-level debugging, conditional breakpoints, tracing, and the ability to display run-time state and evaluate code <ref type="bibr">[43]</ref>. Arbitrary conditional breakpoints date back at least to 1990 with Dbx <ref type="bibr">[26]</ref>.</p><p>Watchpoints were introduced by 1991 and have been in GDB since version 3.93.</p><p>In other work, Ko and Myers present Whyline, an interactive, trace-based debugger that lets programmers select from a range of queries and identifies (via static and dynamic analysis) a timeline that answers the query <ref type="bibr">[18]</ref>. Programmers can only select from those queries presented by Whyline as options. In contrast, ChatDBG permits programmers to pose arbitrary queries that it answers via a dialog with an LLM. Whyline's use of traces gives it the ability to answer questions that might not be straightforward to answer with the current program state but limits its applicability to relatively short-lived executions.</p><p>The goal of program slicing, introduced by Weiser in 1981 <ref type="bibr">[41]</ref>, is to produce a shorter version of a program limited to the source code that could have led to an error. 
Program slicing has been extensively studied; Weiser's paper has been cited over 5,000 times to date. As Section 4.7 describes, ChatDBG performs backwards slicing to collect code spread across code cells to facilitate debugging of Jupyter notebooks.</p><p>Fault localization seeks to identify the likely location of a defect's root cause. Several prior studies have investigated the use of machine learning and LLMs for fault localization. Some of the studied techniques apply machine learning to source code features, coverage data, or other static code features to predict faulty lines of code, but they do not utilize dynamic state and run-time information. DeepFL <ref type="bibr">[24]</ref>, Grace <ref type="bibr">[28]</ref> and DeepRL4FL <ref type="bibr">[25]</ref> are examples of such systems. Similarly, LLMAO <ref type="bibr">[44]</ref> employs LLMs to provide suspiciousness scores for each line of code in a given program, but only provides access to the source code. AutoFL <ref type="bibr">[16]</ref> also utilizes an LLM and enables it to statically retrieve source code and coverage information about the program via function calls. However, the system requires a failing test case as input and does not employ run-time state information.</p><p>ChatDBG improves upon these systems by providing an LLM with access to run-time program state and the ability to take control of the underlying debugger. Both features enhance the LLM's ability to provide more accurate and informative feedback to the user. 
We also note that other fault localization techniques can be used in tandem with ChatDBG to improve results, as suggested by Section 5.1's utilization of backwards slicing to identify code relevant to a bug in Python notebooks and Section 5.2's utilization of AddressSanitizer <ref type="bibr">[37]</ref> to provide a better starting point for diagnosing and fixing memory errors in native code.</p><p>Automated program repair is another active area of software engineering research <ref type="bibr">[9]</ref>. Systems for automatic program repair attempt to generate source-level program patches that prevent a program from failing. ChatDBG performs best-effort automated program repair by requesting that the LLM propose code fixes as part of its response, ultimately letting the programmer drive code changes using these suggestions. Previous research has shown that automated program repair hints can provide significant help in the debugging process and suggests that the benefits of correct advice outweigh the risk of deceptive advice <ref type="bibr">[8]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Concurrent Work</head><p>Several approaches developed concurrently with ChatDBG have also integrated LLMs into automatic program repair or fault localization techniques to enhance the debugging process. Robin <ref type="bibr">[2]</ref> is a chat-based debugging assistant designed to help users diagnose errors more quickly. Both Robin and ChatDBG provide a limited program context to the LLM at the beginning of a conversation. However, Robin has no direct access to any additional context about the program and execution state; the user must manually retrieve and provide these items. Robin's functionality is therefore roughly equivalent to the Enriched Stack configuration of ChatDBG, detailed in Section 5.1. As Figure <ref type="figure">6</ref> shows, ChatDBG achieves a nearly two-fold improvement in diagnosing errors versus the Enriched Stack configuration. ChatDBG's effectiveness generally increases further with targeted questions and follow-up discussions with the user.</p><p>AutoCodeRover <ref type="bibr">[48]</ref> and SWE-agent <ref type="bibr">[45]</ref> are complementary approaches that focus on fault localization and automatic repair, relying exclusively on issue descriptions and source code. ChatDBG additionally leverages run-time information to identify root causes and propose fixes. Section 5 demonstrates the strength of this approach over relying solely on static information. Both AutoCodeRover and SWE-agent perform an evaluation using SWE-bench <ref type="bibr">[15]</ref>, which was created to evaluate the efficacy of such static tools; unfortunately, this benchmark suite is not applicable to ChatDBG due to its extensive usage of run-time information.</p><p>RepairAgent <ref type="bibr">[4]</ref> and AutoSD <ref type="bibr">[17]</ref> are tools that employ LLMs in specific workflows that mimic standard debugging strategies in an attempt to repair pre-identified bugs. 
While successful in some settings, both tools rely on the user providing a failing test case and the precise location of the bug. By contrast, ChatDBG does not require this information. ChatDBG also enables a more flexible workflow that permits collaboration with the developer in addition to seamless integration into the standard debugging process.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Implementation</head><p>4.1 Using ChatDBG: Preliminaries</p><p>ChatDBG integrates with existing debuggers as either a plug-in or a direct extension. Our primary focus to date has been an extension to Pdb, which supports both non-interactive Python scripts and interactive sessions in IPython or Jupyter notebooks, and a plug-in for LLDB to support C/C++ code. A subset of features has been ported to GDB and WinDBG.</p><p>Configuration for Python is minimal and limited to the installation of the chatdbg package with the standard package installer, plus one optional shell script command to add it as an extension to IPython. ChatDBG extends either the standard pdb.Pdb debugger or IPython's implementation of Pdb, depending on how it is run. Configuration for LLDB and other C/C++ debuggers is similarly straightforward. LLDB can be installed through standard package managers if it is not already present, and the ChatDBG plug-in is installed via a single shell command. Since ChatDBG leverages OpenAI's LLMs, the user must also set an environment variable to a valid OpenAI API key within their system's configuration settings.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Debugging a Target Program</head><p>For Python, debugging with ChatDBG begins by running chatdbg on the target program. No special preparation of the target is needed; Python's managed run time ensures that debugging information and source code are always available. Debugging is supported in IPython interactive sessions or Jupyter notebooks via the standard command-line flag --pdb or the Jupyter magic command %pdb, respectively. Control drops into the debugger when an exception occurs.</p><p>For C and C++, debugging begins by running lldb on the target program. The target program must be an unstripped executable generated with the -g compiler flag, which ensures the availability of DWARF debug information that describes the memory layout and maps the program's machine code back to the original source code. That information is essential for the effective debugging of unmanaged code.</p><p>ChatDBG also handles native code generated for other languages but may require additional steps. For example, to debug a Rust target program, the Cargo.toml file must list ChatDBG as a dependency and the main function must be annotated with #[chatdbg::main] to ensure that error messages are visible to ChatDBG through a log file.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">ChatDBG Architecture Overview</head><p>ChatDBG orchestrates communication between the user, the debugger, and the LLM, as shown in the architecture diagram in Figure <ref type="figure">4</ref>. The operations in the command loop pseudocode map naturally onto debugger APIs and onto LLM APIs supporting completion and function calls <ref type="bibr">[33]</ref>. ChatDBG currently utilizes OpenAI's API <ref type="bibr">[32]</ref> and GPT-4 models. We provide a brief overview of the ChatDBG architecture and then elaborate on the most salient technical innovations below.</p><p>&#9312; ChatDBG dispatches standard commands, such as p num_trials in Figure <ref type="figure">2</ref>, directly to the underlying debugger (lines 3-7). It also preserves those commands and their output in the history variable for later communication to the LLM.</p><p>&#9313; Any other text entered by the user, such as 'why doesn't stats have 5 elements?', is directed to ChatDBG, which creates a prompt to send to the LLM. If this is the start of a chat, ChatDBG bundles basic instructions, information from the debugger about the current stack and error, program inputs, history of user commands, and the text together in an initial prompt (lines 9-12). Otherwise, ChatDBG bundles only the history since the last chat step and text (line 14). The MakePrompt function concatenates the prompt components into a string, respecting any length limits set by the LLM by selectively truncating parts as needed.</p><p>&#9314; ChatDBG then sends the prompt to the LLM and processes the response stream, which includes both &#9315; requests to run debugger commands (lines 19-22) and &#9316; prose for the user (line 23). In Figure <ref type="figure">2</ref>, ChatDBG runs four debugging commands, including one to print the length of the stats array, via this mechanism as the LLM constructs its response. 
ChatDBG echoes those commands and their outputs to the user. Once the full response has been processed, ChatDBG returns control to the user. As Section 4.5 discusses, ChatDBG augments the underlying debuggers with specialized commands for the LLM to use when taking the wheel. For example, the ChatDBG variant for native code installs debugger commands that utilize the clangd language server <ref type="bibr">[5,</ref><ref type="bibr">30]</ref> to retrieve source code corresponding to symbol definitions.</p><p>[Figure 4: architecture diagram. Components: user input (standard commands and free-form text), ChatDBG agent, existing debugger (Pdb, LLDB, &#8230;), LLM, and language server (native) or other source tools. Edge labels: command, output, enriched stack and error info, prompt, response.]</p><p>&#9312; Standard commands are handled by the existing debugger. &#9313; ChatDBG converts free-form text into a suitable prompt. &#9314; ChatDBG sends the prompt. &#9315; The LLM takes the wheel and directly issues commands to the underlying debugger. This step may involve consulting other tools, such as a language server for native code. &#9316; The LLM responds to the prompt.</p><p>1: history = ""<lb/>
2: loop<lb/>
3: line = Input()<lb/>
4: if IsDebuggerCommand(line) then<lb/>
5: output = DoCommand(line)<lb/>
6: Print(output)<lb/>
7: history = history + (line + "&#8594;" + output)<lb/>
8: else<lb/>
9: if not ChatInProgress() then<lb/>
10: prompt = MakePrompt(Instructions(),<lb/>
11: EnrichedStack(), Inputs(),<lb/>
12: Error(), history, line)<lb/>
13: else<lb/>
14: prompt = MakePrompt(history, line)<lb/>
15: Send(prompt)<lb/>
16: history = ""<lb/>
17: while ResponsePending() do<lb/>
18: match Receive()<lb/>
19: case Debug(cmd) &#8658;<lb/>
20: output = DoCommand(cmd)<lb/>
21: Print(cmd + "&#8594;" + output)<lb/>
22: Send(output)<lb/>
23: case Message(text) &#8658; Print(text)</p></div>
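<div xmlns="http://www.tei-c.org/ns/1.0"><p>The command loop described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not ChatDBG's implementation: CommandLoop, make_prompt, fake_llm, and the CANNED debugger outputs are hypothetical names introduced here, the LLM is modelled as a generator that yields either a debugger request or a message and receives command output back, and the equal-share truncation in make_prompt is a naive placeholder because the text does not specify the truncation policy.</p>
<p>
```python
from dataclasses import dataclass, field
from typing import Callable, Generator, Iterable, List, Optional, Tuple

Step = Tuple[str, str]  # ("debug", command) or ("message", text)
LLMStream = Generator[Step, Optional[str], None]


def make_prompt(parts: Iterable[str], limit: int = 8000) -> str:
    # Naive policy (an assumption): every non-empty part gets an equal
    # share of the length limit and is cut off beyond that share.
    kept = [part for part in parts if part]
    share = limit // max(len(kept), 1)
    return "\n\n".join(part[:share] for part in kept)


@dataclass
class CommandLoop:
    is_debugger_command: Callable[[str], bool]
    do_command: Callable[[str], str]      # runs a debugger command
    llm: Callable[[str], LLMStream]       # hypothetical LLM client
    initial_context: str = ""             # instructions, stack, inputs, error
    history: str = ""
    chat_in_progress: bool = False
    shown: List[str] = field(default_factory=list)  # what the user sees

    def handle(self, line: str) -> None:
        if self.is_debugger_command(line):
            # Standard commands go straight to the debugger and are
            # remembered for the next prompt.
            output = self.do_command(line)
            self.shown.append(output)
            self.history += f"{line} -> {output}\n"
            return
        if self.chat_in_progress:
            prompt = make_prompt([self.history, line])
        else:
            prompt = make_prompt([self.initial_context, self.history, line])
            self.chat_in_progress = True
        self.history = ""
        stream = self.llm(prompt)
        try:
            step = next(stream)
            while True:
                kind, payload = step
                if kind == "debug":
                    # The LLM takes the wheel: run its command, echo it,
                    # and send the output back to the model.
                    output = self.do_command(payload)
                    self.shown.append(f"{payload} -> {output}")
                    step = stream.send(output)
                else:
                    self.shown.append(payload)
                    step = next(stream)
        except StopIteration:
            pass  # response finished; control returns to the user


CANNED = {"p num_trials": "5", "p len(stats)": "150"}  # fake debugger output


def is_command(line: str) -> bool:
    words = line.split()
    return bool(words) and words[0] in {"p", "bt", "up", "down", "list"}


def fake_llm(prompt: str) -> LLMStream:
    # Stand-in for a real model: request one value, then answer.
    length = yield ("debug", "p len(stats)")
    yield ("message", f"stats has {length} elements, not 5.")


loop = CommandLoop(is_command, lambda cmd: CANNED.get(cmd, ""), fake_llm,
                   initial_context="Instructions, enriched stack, inputs, error.")
loop.handle("p num_trials")
loop.handle("Why doesn't stats have 5 elements?")
print("\n".join(loop.shown))
```
</p>
<p>Modelling the response stream as a generator keeps the two response cases of the pseudocode visible in one place; a real client would instead translate the LLM API's function-call messages into these steps.</p></div>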
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Initial Prompts and Enriched Stack Traces</head><p>In addition to including the user's text, the initial prompt conveys instructions to the LLM and the context surrounding the error. We illustrate the components of the prompt in this section, using the initial prompt in Figure <ref type="figure">5</ref> that was generated for the first query in Figure <ref type="figure">2</ref> as a running example.</p><p>Instructions. The instructions at the top of the prompt ask the LLM to answer questions about the root cause of the error, to focus on user code, to explain values stored in variables, and to end each response with either a fix or suggestions for further debugging steps. The last item ensures a relatively consistent structure for answers that facilitates reading them and evaluating their quality. Paragraphs 2-4 of the instructions are the "take the wheel" prompt described in Section 4.5.</p><p>Enriched stack trace. ChatDBG's success at identifying and fixing errors relies critically on providing the LLM with sufficient details to reveal the cause of the error. A key source of that information is the run-time stack. Debuggers provide a way for the user to view the stack trace but often only show function names, source file locations, and possibly a couple of lines of code for each stack frame. ChatDBG provides a more detailed enriched stack trace to the LLM. That stack trace includes the types and values of variables for each frame, as well as a larger window of at least 10 lines of code. Enriched stack traces also elide frames corresponding to library code to better focus the LLM on user-written code, which ChatDBG assumes to be the most likely cause of errors. In Python, ChatDBG leverages Pdb's internal data structures to build enriched stack traces. When converting values to suitable string representations, ChatDBG must balance utility with the size of the string produced. 
For objects, ChatDBG calls the object's __repr__ method if an</p><p>Instructions: You are a debugging assistant. You will be given a Python stack trace for an error and answer questions related to the root cause of the error.</p><p>Call the debug function to run Pdb debugger commands on the stopped program. You may call the debug function to run the following commands: bt, up, down, p expression, list. Call debug to print any variable value or expression that you believe may contribute to the error.</p><p>Call the info function to get the documentation and source code for any variable, function, package, class, method reference, field reference, or dotted reference visible in the current frame. Examples include: n, e.n where e is an expression, and t.n where t is a type. Unless it is from a common, widely-used library, you MUST call info exactly once on any symbol that is referenced in code leading up to the error.</p><p>Call the provided functions as many times as you would like.</p><p>The root cause of any error is likely due to a problem in the source code from the user. Explain why each variable contributing to the error has been set to the value that it has. Continue with your explanations until you reach the root cause of the error. Your answer may be as long as necessary.</p><p>End your answer with a section titled "Recommendation" that contains one of: - a fix if you have identified the root cause - a numbered list of 1-3 suggestions for how to continue debugging if you have not</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Enriched Stack Trace:</head><p>The program has this stack trace: Error:</p><p>The program encountered the following error:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>AssertionError</head><p>The code assert len(stats) == num_trials is correct and must not be changed.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>History:</head><p>This is the history of some pdb commands I ran and the results:</p><p>(ChatDBG) p num_trials 5</p><p>User Text: Why doesn't stats have 5 elements? Fig. <ref type="figure">5</ref>. The initial prompt for the debugging session in Figure <ref type="figure">2</ref> ( &#167;4.4). For brevity, the enriched stack includes only five lines of source in each frame, rather than the default of 10.</p><p>appropriate (non-default) version exists. Otherwise, it iterates over the object's !elds and recursively converts their values to strings. Similarly, C!"#DBG recursively converts the values stored in aggregate structures like lists, arrays, and dictionaries to strings, but limits the number of elements shown to a small, !xed number. The rest of the elements are abbreviated with an ellipsis (...). This recursive conversion of values to strings is limited to a depth of three, at which point any remaining values are again abbreviated with ellipses. This strategy balances the need to provide the LLM with su"cient information to diagnose the error with the need to avoid overwhelming it with too much information. In cases with the elided details are important, the LLM can request them via the take the wheel mechanism. C!"#DBG follows roughly the same approach in LLDB, utilizing the static types embedded in the DWARF debugging information to decode the stack. In addition, any pointers are dereferenced to show the values being referred to as well; null pointers and illegal dereferences are dropped.</p><p>Inputs. The initial prompt also includes the target's command line arguments and standard input when that information is available from the underlying debugger. These are empty and elided in Figure <ref type="figure">5</ref>.</p><p>Error. A description of the error causing execution to stop is extracted from the underlying debugger. 
When the error is due to an assertion failure, ChatDBG instructs the LLM to assume that the assertion is valid as written so that it will look beyond the assertion for the real problem.</p><p>History. The initial prompt also includes the history of commands already issued by the user, as well as their outputs. This builds a more complete context surrounding the user's query.</p></div>
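<div xmlns="http://www.tei-c.org/ns/1.0"><p>The value-to-string conversion described in this section can be sketched roughly as follows. This is a minimal illustration, not ChatDBG's actual code: the function name summarize and the element limit MAX_ITEMS are assumptions (the text says only "a small, fixed number"), while the depth limit of three comes from the text. Tuples are rendered with list brackets for brevity.</p>

```python
MAX_DEPTH = 3  # depth limit stated in the text
MAX_ITEMS = 3  # hypothetical value; the text says only "a small, fixed number"


def summarize(value, depth=0):
    """Convert a value to a bounded string, abbreviating with ellipses."""
    if depth >= MAX_DEPTH:
        return "..."
    if isinstance(value, (list, tuple)):
        shown = [summarize(item, depth + 1) for item in value[:MAX_ITEMS]]
        if len(value) > MAX_ITEMS:
            shown.append("...")
        return "[" + ", ".join(shown) + "]"
    if isinstance(value, dict):
        pairs = list(value.items())
        shown = [f"{summarize(k, depth + 1)}: {summarize(v, depth + 1)}"
                 for k, v in pairs[:MAX_ITEMS]]
        if len(pairs) > MAX_ITEMS:
            shown.append("...")
        return "{" + ", ".join(shown) + "}"
    # Use the object's own __repr__ when it overrides the default one.
    if type(value).__repr__ is not object.__repr__:
        return repr(value)
    # Otherwise walk the object's fields and convert them recursively.
    fields = list(getattr(value, "__dict__", {}).items())
    shown = [f"{name}={summarize(field, depth + 1)}"
             for name, field in fields[:MAX_ITEMS]]
    if len(fields) > MAX_ITEMS:
        shown.append("...")
    return f"{type(value).__name__}(" + ", ".join(shown) + ")"


class Marble:
    """Toy class with no custom __repr__, used only for illustration."""

    def __init__(self):
        self.color = "blue"
        self.weight = 2
```

<p>Under these assumed limits, summarize([1, 2, 3, 4, 5]) yields [1, 2, 3, ...], summarize([[[[1]]]]) yields [[[...]]], and summarize(Marble()) yields Marble(color='blue', weight=2).</p></div>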
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.5">Taking the Wheel</head><p>ChatDBG supports take the wheel debugging via the function call capabilities in OpenAI's API and most recent models <ref type="bibr">[33]</ref>. This agentic approach <ref type="bibr">[10,</ref><ref type="bibr">42]</ref> lets clients register callback functions with the LLM for obtaining additional information while constructing a response. The LLM calls these functions by sending special messages to the client as part of its response stream. The client receives those messages, computes the requested results, and sends them back to the LLM. The initial prompt describes how to use the available functions.</p><p>For example, ChatDBG registers a debug(command) function for running a command in the underlying debugger. The LLM calls debug("p len(stats)") through this mechanism in the session from Figure <ref type="figure">2</ref>. ChatDBG then runs Pdb's command processing routine, onecmd("p len(stats)"), and captures the output to send back. ChatDBG similarly uses the SBCommandInterpreter.HandleCommand routine in LLVM. In both cases, the command and output are printed so the user can see these steps.</p><p>The LLM has sufficient background knowledge on debuggers and requires no additional training to navigate up/down the stack, inspect variables and heap data, evaluate expressions, and perform other typical debugger operations.</p><p>Supporting agentic reasoning over run-time program state via function calls is a key technical innovation of ChatDBG. Without this capability, there would be no effective way to provide the LLM with a detailed view of relevant program state. A common alternative technique for handling large amounts of task-specific data in LLMs is to employ a retrieval-augmented generation (RAG) model <ref type="bibr">[23]</ref>, which collects and stores the data in a vectorized database that is then made available to the model for retrieval. 
However, that approach seems less useful in this context, as program state information will be distinct for each debugging session and not easily vectorized. </p></div>
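<div xmlns="http://www.tei-c.org/ns/1.0"><p>The round trip described in this section (the LLM requests a debug call, the client runs the command, captures the output, and returns it) might look roughly like the sketch below. It is a self-contained illustration under stated assumptions: ToyDebugger is a hypothetical stand-in built on Python's cmd.Cmd (the class Pdb itself derives from), the shape of the tool_call dictionary is illustrative rather than an exact API message, and handle_tool_call is not a ChatDBG function.</p>

```python
import cmd
import io
import json


class ToyDebugger(cmd.Cmd):
    """Hypothetical stand-in for a debugger front end built on cmd.Cmd."""

    def __init__(self, variables):
        self.captured = io.StringIO()
        super().__init__(stdout=self.captured)  # route command output to a buffer
        self.variables = variables

    def do_p(self, arg):
        """Evaluate an expression over the toy variables and print its repr."""
        try:
            value = eval(arg, {"__builtins__": {"len": len}}, self.variables)
        except Exception as exc:
            self.stdout.write(f"*** {type(exc).__name__}: {exc}\n")
        else:
            self.stdout.write(f"{value!r}\n")

    def run(self, command):
        """Run one command through onecmd and return what it printed."""
        self.captured.seek(0)
        self.captured.truncate(0)
        self.onecmd(command)
        return self.captured.getvalue()


def handle_tool_call(debugger, tool_call):
    """Dispatch one function-call request and return the text to send back."""
    arguments = json.loads(tool_call["arguments"])
    if tool_call["name"] == "debug":
        output = debugger.run(arguments["command"])
        # Echo the command and its output so the user can follow along.
        print(f"(ChatDBG) {arguments['command']}\n{output}", end="")
        return output
    return f"unknown function: {tool_call['name']}"


debugger = ToyDebugger({"stats": [0.2, 0.4, 0.6]})
request = {"name": "debug", "arguments": json.dumps({"command": "p len(stats)"})}
reply = handle_tool_call(debugger, request)  # reply == "3\n"
```

<p>The returned string is what a client would append to the conversation as the function result before asking the model to continue. The toy variables and the restricted eval are only for demonstration.</p></div>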
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.6">Navigating the Code</head><p>While the LLM can often leverage pre-existing background knowledge of common Python and C/C++ standard libraries, it will likely have limited-to-no knowledge of any user-defined code or third-party library functions. Trying to include all possibly-relevant source code in the initial prompt would be infeasible and would prevent ChatDBG from scaling to larger codebases. Instead, ChatDBG extends the underlying debuggers with several new commands designed to help the LLM navigate through and understand the target's code. These commands are available to the LLM via function calls and listed in Table <ref type="table">2</ref>. ChatDBG augments Pdb with the info command, which prints the docstring for any function, class, field, method, or package. It additionally prints the source code for any user-defined function. The info requests in Figure <ref type="figure">2</ref> demonstrate these two cases for proportion_blue and bootstrap_statistic, respectively. The command is implemented via the standard inspect and pydoc Python libraries.</p><p>The info command is not directly reproducible for unmanaged code in LLVM because there is no comparable existing debugger support for retrieving the source or documentation for a symbol. Instead, ChatDBG adds two other debugging commands to LLDB. The first, code, prints the code surrounding a source location described by a filename and line number, as in code polymorph.c:118. The second command, definition, prints the location and source code for the definition corresponding to the first occurrence of a symbol on a given line of code. For example, definition polymorph.c:118 target prints the location and source for the declaration of target corresponding to its use on that line. 
The definition implementation leverages the clangd language server, which supports source code queries via JSON-RPC and Microsoft's Language Server Protocol <ref type="bibr">[30]</ref>.</p></div>
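<div xmlns="http://www.tei-c.org/ns/1.0"><p>The behavior of a code-style command, printing the source surrounding a filename:line location, can be approximated with a short helper. The sketch below is an assumption-laden illustration rather than ChatDBG's implementation: the name code_window, the default context width, and the output format are all hypothetical.</p>

```python
from pathlib import Path


def code_window(location, context=5):
    """Return the source lines around a 'filename:line' location as text."""
    filename, _, line_text = location.rpartition(":")
    target = int(line_text)
    lines = Path(filename).read_text().splitlines()
    first = max(1, target - context)          # clamp to the start of the file
    last = min(len(lines), target + context)  # clamp to the end of the file
    rows = []
    for number in range(first, last + 1):
        marker = "->" if number == target else "  "
        rows.append(f"{marker} {number:4d} {lines[number - 1]}")
    return "\n".join(rows)
```

<p>For instance, code_window("polymorph.c:118") would return lines 113 through 123 of that file, with line 118 marked, provided the file exists and is long enough. Returning a bounded window keeps each reply to the LLM small regardless of file size.</p></div>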
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.7">Slices for Interactive Python</head><p>ChatDBG supports debugging interactive IPython sessions and Jupyter notebooks. Interactive sessions lead to many individual code cells that are each evaluated separately. Cells may be evaluated out-of-order, override definitions from earlier cells, and communicate values to other cells through top-level global variables. Others have noted the challenges of reasoning about program behavior in this context <ref type="bibr">[11,</ref><ref type="bibr">38]</ref>. ChatDBG provides an additional slice debugging command to facilitate that reasoning. The slice command computes the backwards slice for any variable used in the current cell that was defined in previously-executed cells. It returns the code for cells in that slice. Suppose the code from bootstrap.py in Figure <ref type="figure">1</ref> were written in four notebook cells as shown below:</p><p>In [2]: def make_marble_sample(): ...</p><p>In [3]: def proportion_blue(sample): ...</p><p>In [4]: def resampled_stats(observed_marbles, num_trials):
    stats = bootstrap_statistic(observed_marbles, proportion_blue, num_trials)
    assert len(stats) == num_trials
    return stats</p><p>In [5]: observed_marbles = make_marble_sample()
    stats = resampled_stats(observed_marbles, 5)</p><p>After evaluating these cells, slice(observed_marbles) returns the source for the cells labeled In [2] and In [5], and slice(stats) returns the source for all four cells. ChatDBG uses ipyflow to compute slices <ref type="bibr">[14,</ref><ref type="bibr">38]</ref>.</p></div>
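<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the notion of a backward slice over cells concrete, the sketch below computes one statically with Python's ast module. It is not how ChatDBG works (the text states that ChatDBG relies on ipyflow, which tracks dependencies dynamically); the names statement_info and backward_slice are hypothetical, and the analysis over-approximates by treating every name read inside a function body as a dependency.</p>

```python
import ast


def statement_info(stmt):
    """Return (defined, used) name sets for one top-level statement."""
    if isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
        defined = {stmt.name}  # names assigned inside the body stay local
    elif isinstance(stmt, (ast.Import, ast.ImportFrom)):
        defined = {(a.asname or a.name).split(".")[0] for a in stmt.names}
    else:
        defined = {n.id for n in ast.walk(stmt)
                   if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Store)}
    used = {n.id for n in ast.walk(stmt)
            if isinstance(n, ast.Name) and isinstance(n.ctx, ast.Load)}
    return defined, used


def backward_slice(cells, variable):
    """Return labels of the cells that the given variable depends on."""
    statements = []
    for label, source in cells:  # cells are listed in execution order
        for stmt in ast.parse(source).body:
            statements.append((label, *statement_info(stmt)))
    needed, seen, result = [variable], set(), set()
    while needed:
        name = needed.pop()
        if name in seen:
            continue
        seen.add(name)
        # Find the most recently executed statement that defines the name.
        for label, defined, used in reversed(statements):
            if name in defined:
                result.add(label)
                needed.extend(used)
                break
    return [label for label, _ in cells if label in result]


cells = [
    ("In [2]", "def make_marble_sample(): ..."),
    ("In [3]", "def proportion_blue(sample): ..."),
    ("In [4]", "def resampled_stats(observed_marbles, num_trials):\n"
               "    stats = bootstrap_statistic(observed_marbles, proportion_blue, num_trials)\n"
               "    assert len(stats) == num_trials\n"
               "    return stats"),
    ("In [5]", "observed_marbles = make_marble_sample()\n"
               "stats = resampled_stats(observed_marbles, 5)"),
]
```

<p>On the four cells above, backward_slice(cells, "observed_marbles") gives In [2] and In [5], and backward_slice(cells, "stats") gives all four labels, matching the example in the text. Names with no definition in any cell, such as bootstrap_statistic here, are simply skipped.</p></div>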
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.8">Security and Risks</head><p>It is possible for the LLM to issue debugging commands containing arbitrary code through the debug function call provided by ChatDBG. That code could, for example, delete files or execute other malicious actions on the client. ChatDBG mitigates this risk by sanitizing LLM-generated debugging commands before running them. For Python, the sanitizer ensures that any functions called in LLM-provided commands belong to a user-configurable whitelist. For native code, code provenance is harder to track and languages are more permissive, so the sanitizer rejects any commands calling functions. ChatDBG supports an --unsafe flag to disable sanitizing when the client system is running in an isolated environment that obviates the need for such protections.</p><p>It is also possible for the LLM to hallucinate and respond with incorrect or misleading diagnoses and fixes. ChatDBG mitigates this risk by not directly applying proposed code fixes or suggestions to the target code. Instead, ChatDBG presents them to the user, who may then vet and judge the quality of the LLM's responses and decide whether or not to follow suggested changes.</p></div>
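<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to picture the Python-side check described above (every function called in an LLM-provided expression must appear on an allowlist) is the following sketch. It is an illustration under assumptions, not ChatDBG's sanitizer: the names called_name and is_expression_allowed are hypothetical, and a production check would need to consider more than call nodes, for example attribute access that triggers properties.</p>

```python
import ast


def called_name(node):
    """Return a dotted name such as 'os.remove' for a call target, or None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None  # e.g. a call on the result of another call


def is_expression_allowed(expression, allowed_functions):
    """Accept an expression only if every call targets an allowed function."""
    try:
        tree = ast.parse(expression, mode="eval")
    except SyntaxError:
        return False
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if called_name(node.func) not in allowed_functions:
                return False
    return True
```

<p>With an allowlist containing only len, the expressions len(stats) and stats[0] + 1 are accepted, while os.remove('data.csv') and __import__('os').system('ls') are rejected. Rejecting calls whose target cannot be named statically errs on the side of caution.</p></div>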
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Evaluation</head><p>We demonstrate ChatDBG's capacity to identify the root cause of defects and provide fixes in two contexts: bugs in relatively small Python programs written by students and bugs in large C/C++ programs. The former have well-defined expected behavior that enables us to thoroughly and systematically assess ChatDBG. The latter demonstrates its effectiveness on unmanaged code when unusual corner cases trigger crashes. Our evaluation addresses the following research questions:</p><p>RQ1: Is ChatDBG effective at diagnosing and fixing bugs in Python?</p><p>RQ2: Which components of ChatDBG contribute to its effectiveness?</p><p>RQ3: Is ChatDBG effective at diagnosing and fixing bugs in unmanaged code (C/C++)?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Python</head><p>We applied ChatDBG to all of the bugs in a collection of student labs from two introductory computer science courses; see Table <ref type="table">3</ref>. Bugs c1-c8 are in non-interactive scripts from a programming class that perform various file reading and text processing tasks. Bugs s1-s14 are in Jupyter notebooks <ref type="bibr">[40]</ref> from a data science class that manipulate, visualize, and compute over arrays and tables. Some bugs were apparent to the programs' authors. Others were identified during autograding.</p><p>Table <ref type="table">3</ref>. Python programs exhibiting a variety of common errors (&#167;5.1). Programs c1-c8 are command line scripts, and programs s1-s14 are Jupyter notebooks, which utilize two non-standard libraries consisting of 3,000 lines of code. Semantic errors reflect failed tests expressed as assertions. Crashes reflect unexpected termination due to any other type of error.</p><p>c3 (Crash): Why am I not reading the CSV file correctly?</p><p>s11 (Crash): Why am I not able to sample 100 rows?</p><p>c1 (Semantic): Why am I not getting 3?</p><p>s1 (Semantic): bill_length_mean_by_species should be a table of the mean bill lengths of each species in our data set. Why isn't it?</p><p>The final +Dialog configuration is the same as +Targeted Question but extends the chat with a second query. All trials use the same follow-up: Continue to explain your reasoning and give me a fix to make it work as I describe. Context-specific follow-ups work better, but we opted for consistency. ChatDBG used the gpt-4-1106-preview model for these experiments. Under +Targeted Question, the first prompt and response led to, on average, a chat of about 10,000 tokens (7,500 words), a cost of about $0.12 USD under OpenAI's current pricing model <ref type="bibr">[35]</ref>, and a completion time of about 25 seconds. Subsequent steps in extended debugging dialogs incurred comparable costs. 
Time was highly variable and dominated by the performance of OpenAI's service. These characteristics will be different for other platforms and models and, given current trends, we expect significant reductions in both time and cost as models improve.</p><p>RQ1: Is ChatDBG effective at diagnosing and fixing bugs in Python?</p><p>Each response was manually examined and deemed a success if it included an accurate explanation of the error and an actionable fix. That fix could be either code or a prose description in which all necessary details were made explicit. To avoid bias in this assessment, explicit criteria for each program were determined prior to examining the responses.</p><p>Figure <ref type="figure">6</ref> shows the success rate under each configuration. The simplest configuration, Default Stack, provides functionality roughly equivalent to the user copying and pasting the program stack trace and basic error information into an LLM chat window and requesting a fix. We use this configuration as a baseline for evaluating the impact of ChatDBG's more advanced configurations. With all features enabled, ChatDBG was successful at identifying and fixing bugs in well over half of the trials. Any time or energy expended by the user manually debugging those cases would be all but eliminated by using ChatDBG.</p><p>RQ1 Summary: Even with just the simple question why?, ChatDBG was successful 57% of the time. With questions specialized to the target's particular error, that number jumps to 67%, and with an additional dialog step ChatDBG succeeded in identifying and fixing the defect in 85% of the trials.</p><p>RQ2: Which components of ChatDBG contribute to its effectiveness?</p><p>Figure <ref type="figure">7</ref> presents the success rates for each program under each configuration. 
The Enriched Stack plots demonstrate that enriched stacks provide some benefit, particularly for crashes in which the stack contains sufficient information to diagnose the problem, but they alone do not provide much improvement for many semantic errors in which the relevant computation steps complete before failure. However, enriched stacks coupled with letting the LLM take the wheel led to significant improvement in the success rate for both crashing and semantic bugs, as shown in the +Take the Wheel plots.</p><p>Using the +Take the Wheel feature, the LLM issues from 0 to 12 debugging commands per run, most commonly calling the info, slice (for notebooks), and p (print) debugging commands. While all of these commands provide useful information about execution state and code, the slice command was critically important for notebooks. Without it, success rates rarely improved when the LLM took the wheel.</p><p>The +Targeted Question configuration demonstrates the impact of providing even the most modest details about expected behavior in queries. When the LLM is asked to continue its reasoning in +Dialog, ChatDBG's success rate improves despite the follow-up prompt providing no feedback on the contents or quality of the first response. This phenomenon indicates that constraints on the underlying LLM's response lengths may prevent it from conducting the amount of reasoning necessary to develop a fix in a single step. The success rates for +Targeted Question and +Dialog demonstrate the importance of continued dialogs and user input. 
We expect those features to be even more important to ChatDBG's success when diagnosing bugs in more complex programs.</p><p>The LLM also demonstrated its background knowledge with the responses including, for example, details of Python idioms and libraries, the definition of h-index <ref type="bibr">[12]</ref>, and the implementation and limitations of various statistical techniques.</p><p>Failures were generally due to the LLM not always recognizing or discovering key aspects of a program's behavior. We observe that in some cases, enabling additional features in ChatDBG decreases its success rate. We attribute this result to the fact that longer and more complex prompts can occasionally degrade the effectiveness of LLMs <ref type="bibr">[22]</ref>. In general, the further the distance between the root cause of a bug and observable effect, the more challenging it was for ChatDBG (and people <ref type="bibr">[47, p. 243]</ref>) to find it. In some cases, it was on the right track but did not converge on an actionable fix. In others, it suggested changes that would introduce other bugs. It also occasionally made mistakes, such as conflating proportions and percentages or failing to handle unusual corner cases. All of these could be mitigated by feedback from the user in subsequent follow-ups.</p><p>RQ2 Summary: While all features of ChatDBG contribute to its success, the technical innovations enabling it to take the wheel are critical. The most sophisticated configurations show that user-provided contextual information about behavior and engaging in multi-step dialogs are particularly good ways to improve its effectiveness.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">C and C++</head><p>Programs in unmanaged languages such as C and C++ are vulnerable to memory safety errors. These memory errors can also hinder the debugging process: the crash may not occur immediately at the memory violation but instead much later on, and the crash may cause corruption of the stack and/or heap, making it challenging to recover any useful information.</p><p>Table <ref type="table">5</ref> summarizes the programs extracted from the BugBench <ref type="bibr">[29]</ref> and BugsC++ <ref type="bibr">[1]</ref> suites used to evaluate ChatDBG's effectiveness at debugging unmanaged code. Programs used in this evaluation are all real-world applications with concrete known bugs. The four BugBench programs were selected as the only ones we could retrieve, build, and reproduce on our system. The BugsC++ suite does not include the original crash-causing inputs. However, it provides links to the original bug report, CVE identifier, and/or exploit-fixing patch, from which we manually retrieve crash reproduction information. We randomly selected and reproduced four bugs from the "memory error" category. Some of the programs studied do not crash at run time. We employed AddressSanitizer <ref type="bibr">[37]</ref> to force a crash at the moment a memory violation occurs to trigger those defects. AddressSanitizer is already capable of reporting some information about the crash when it happens. However, this information is often very dense, and typically points at the symptom of the bug, not its root cause. We did not include that information in the initial prompt.</p><p>RQ3: Is ChatDBG effective at diagnosing and fixing bugs in unmanaged code (C/C++)?</p><p>We ran our C/C++ experiments on an x86 Ubuntu 22.04 server. We used Clang and LLDB 17 to compile and debug, using flags -g -Og -fno-omit-frame-pointer. ChatDBG used OpenAI's gpt-4-1106-preview model. 
Each program was run ten times using queries of the form "I am debugging cpp-peglib. Provide the root cause of this crash" (for PEG), followed by a</p><p>We manually examined each response to determine if ChatDBG successfully provided an actionable code fix for the proximate cause of the crash or for the underlying root cause. We used the criteria outlined in Table <ref type="table">5</ref>. While fixing root causes is the ultimate goal, fixing proximate causes can still be beneficial as fixing crashes enables further debugging steps.</p><p>Figure <ref type="figure">8</ref> presents ChatDBG's ability to suggest a fix for either the proximate or root cause of the bug. Generally, ChatDBG is excellent at diagnosing and explaining the reason for the crash, which in itself may be useful to programmers. For BC, GZIP, NCOM, and POLY, ChatDBG tends to suggest replacing the strcpy or sprintf call with their respective strncpy and snprintf counterparts to prevent buffer overflows. While correct, this change truncates the input silently. Validation or other measures should be added to obtain a robust fix. The root cause in BC is inside code generated from a YACC file. The clangd language server does not handle this case in a way that would let ChatDBG answer the LLM's definition requests properly.</p><p>In the case of PEG, ChatDBG correctly identifies which pointer is null but typically suggests ignoring it instead of failing immediately. This is similar to YAML2, where ChatDBG recommends replacing the assertion with a check inside a function rather than recommending that the client check that the function's preconditions are met prior to the call. ChatDBG has a relatively high root cause fix rate for YAML1 and TIFF. It often correctly suggests fixes to limit recursion depth (YAML1) and to validate input parameters (TIFF).</p><p>RQ3 Summary: ChatDBG was successful in virtually all of our trials in diagnosing and explaining the cause of the crash. 
It was also capable of providing relevant, actionable fixes: 36% of its suggestions addressed the root cause of the bug, while another 55% corrected the proximate cause.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3">Threats to Validity</head><p>This paper evaluates ChatDBG on two suites of code. The primary suite is a collection of unpublished student labs that may not be entirely representative of code written by, for example, experienced programmers. The second suite consists of real C/C++ applications and bugs drawn from the BugBench and BugsC++ suites. Unlike the Python suite, the C/C++ source code and the bug fixes for these programs are available on GitHub, which may lead to data leakage affecting the C/C++ study if those repositories were part of the training set for the LLMs we used. While the C/C++ suite consists of real-world applications, most of the errors are memory errors. Other types, such as assertion failures, concurrency errors, or other logical errors, may lead to different results.</p><p>ChatDBG depends on an LLM to analyze and drive exploration of state, and like all systems based on LLMs today, its performance is affected by prompt engineering. It is possible that ChatDBG's prompts are overfit to the specific GPT-4 models we employed; this threat is somewhat mitigated by the fact that ChatDBG was originally developed using a different model (GPT-3.5-turbo). LLMs are also inherently stochastic, and it is possible to obtain unusually good results by chance. To mitigate this threat, our evaluation runs ChatDBG on each test program at least ten times, which produced stable and repeatable results with only small variation in aggregate.</p><p>Our evaluation depends on a manual evaluation of whether ChatDBG's explanation of a bug and its proposed fix are satisfactory. We mitigated the risks of subjectivity by using precisely defined criteria decided upon in advance. Python fixes were deemed successful if the resulting code met the correctness requirements outlined in the assignment. C/C++ fixes were deemed successful at fixing the proximate or root cause using the criteria in Table <ref type="table">5</ref>. 
Fixes described in prose were permitted, provided that the details of all necessary changes to the code were made explicit.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Future Work</head><p>We see several promising avenues for future work. Incorporating existing fault localization approaches into ChatDBG, rather than relying solely on the LLM's ability to explore the program's code and state, could potentially increase its effectiveness and efficiency by allowing the LLM to focus its attention on suspicious files, functions, or lines of source code. Similarly, incorporating delta debugging <ref type="bibr">[46]</ref> could increase the effectiveness of ChatDBG by limiting the amount of input for an LLM and providing failure-inducing events as guidance. Finally, integrating ChatDBG with a time-travel debugger <ref type="bibr">[31,</ref><ref type="bibr">36]</ref> would expand its reach to exploring program state over time, letting it answer queries that cannot be answered given only the current program state. One challenge of integrating these more sophisticated techniques will be ensuring that the LLM can effectively utilize them, which may necessitate fine tuning or additional training on their usage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Conclusion</head><p>This paper presents ChatDBG, the first AI-based debugging assistant. Our evaluation shows that engaging in a debugging dialog with ChatDBG can significantly assist in identifying root causes of errors and developing correct fixes.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>Proc. ACM Softw. Eng., Vol. 2, No. FSE, Article FSE085. Publication date: July 2025.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Kyla H. Levin, Nicolas van Kempen, Emery D. Berger, and Stephen N. Freund</p></note>
		</body>
		</text>
</TEI>
