skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 10:00 PM to 12:00 PM ET on Tuesday, March 25 due to maintenance. We apologize for the inconvenience.


This content will become publicly available on April 12, 2025

Title: Automated Program Repair, What Is It Good For? Not Absolutely Nothing!
Industrial deployments of automated program repair (APR), e.g., at Facebook and Bloomberg, signal a new milestone for this exciting and potentially impactful technology. In these deployments, developers use APR-generated patch suggestions as part of a human-driven debugging process. Unfortunately, little is known about how using patch suggestions affects developers during debugging. This paper conducts a controlled user study with 40 developers with a median of 6 years of experience. The developers engage in debugging tasks on nine naturally-occurring defects in real-world, open-source, Java projects, using Recoder, SimFix, and TBar, three state-of-the-art APR tools. For each debugging task, the developers either have access to the project's tests, or, also, to code suggestions that make all the tests pass. These suggestions are either developer-written or APR-generated, which can be correct or deceptive. Deceptive suggestions, which are a common APR occurrence, make all the available tests pass but fail to generalize to the intended specification. Through a total of 160 debugging sessions, we find that access to a code suggestion significantly increases the odds of submitting a patch. Correct APR suggestions increase the odds of debugging success by 14,000%, but deceptive suggestions decrease the odds of success by 65%. Correct suggestions also speed up debugging. Surprisingly, we observe no significant difference in how novice and experienced developers are affected by APR, suggesting that APR may find uses across the experience spectrum. Overall, developers come away with a strong positive impression of APR, suggesting promise for APR-mediated, human-driven debugging, despite existing challenges in APR-generated repair quality.  more » « less
Award ID(s):
2210243 1750116
PAR ID:
10542914
Author(s) / Creator(s):
; ;
Publisher / Repository:
ACM
Date Published:
ISBN:
9798400702174
Page Range / eLocation ID:
1 to 13
Format(s):
Medium: X
Location:
Lisbon, Portugal
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Automated debugging techniques, including fault localization and program repair, have been studied for over a decade. However, the only existing connection between fault localization and program repair is that fault localization computes the potential buggy elements for program repair to patch. Recently, a pioneering work, ProFL, explored the idea of unified debugging to unify fault localization and program repair in the other direction for the first time to boost both areas. More specifically, ProFL utilizes the patch execution results from one state-of-the-art repair system, PraPR, to help improve state-of-the-art fault localization. In this way, ProFL not only improves fault localization for manual repair, but also extends the application scope of automated repair to all possible bugs (not only the small ratio of bugs that can be automatically fixed). However, ProFL only considers one APR system (i.e., PraPR), and it is not clear how other existing APR systems based on different designs contribute to unified debugging. In this work, we perform an extensive study of the unified-debugging approach on 16 state-of-the-art program repair systems for the first time. Our experimental results on the widely studied Defects4J benchmark suite reveal various practical guidelines for unified debugging, such as (1) nearly all the studied 16 repair systems can positively contribute to unified debugging despite their varying repairing capabilities, (2) repair systems targeting multi-edit patches can bring extraneous noise into unified debugging, (3) repair systems with more executed/plausible patches tend to perform better for unified debugging, and (4) unified debugging effectiveness does not rely on the availability of correct patches in automated repair. Based on our results, we further propose an advanced unified debugging technique, UniDebug++, which can localize over 20% more bugs within Top-1 positions than state-of-the-art unified debugging technique, ProFL. 
    more » « less
  2. Automated program repair holds the potential to significantly reduce software maintenance effort and cost. However, recent studies have shown that it often produces low-quality patches that repair some but break other functionality. We hypothesize that producing patches by replacing likely faulty regions of code with semantically-similar code fragments, and doing so at a higher level of granularity than prior approaches can better capture abstraction and the intended specification, and can improve repair quality. We create SOSRepair, an automated program repair technique that uses semantic code search to replace candidate buggy code regions with behaviorally-similar (but not identical) code written by humans. SOSRepair is the first such technique to scale to real-world defects in real-world systems. On a subset of the ManyBugs benchmark of such defects, SOSRepair produces patches for 23 (35%) of the 65 defects, including 3, 5, and 8 defects for which previous state-of-the-art techniques Angelix, Prophet, and GenProg do not, respectively. On these 23 defects, SOSRepair produces more patches (8, 35%) that pass all independent tests than the prior techniques. We demonstrate a relationship between patch granularity and the ability to produce patches that pass all independent tests. We then show that fault localization precision is a key factor in SOSRepair's success. Manually improving fault localization allows SOSRepair to patch 24 (37%) defects, of which 16 (67%) pass all independent tests. We conclude that (1) higher-granularity, semantic-based patches can improve patch quality, (2) semantic search is promising for producing high-quality real-world defect repairs, (3) research in fault localization can significantly improve the quality of program repair techniques, and (4) semi-automated approaches in which developers suggest fix locations may produce high-quality patches. 
    more » « less
  3. null (Ed.)
    Generate-and-validate (G&V) automated program repair (APR) techniques have been extensively studied during the past decade. Meanwhile, such techniques can be extremely time-consuming due to the manipulation of program code to fabricate a large number of patches and also the repeated test executions on patches to identify potential fixes. PraPR, a recent G&V APR technique, reduces such costs by modifying program code directly at the level of compiled JVM bytecode with on-the-fly patch validation, which directly allows multiple bytecode patches to be tested within the same JVM process. However, PraPR is limited due to its unique bytecode-repair design, and is basically unsound/imprecise as it assumes that patch executions do not change global JVM state and affect later patch executions on the same JVM process. In this paper, we propose a unified patch validation framework, named UniAPR, to perform the first empirical study of on-the-fly patch validation for state-of-the-art source-code-level APR techniques widely studied in the literature; furthermore, UniAPR addresses the imprecise patch validation issue by resetting the JVM global state via runtime bytecode transformation. We have implemented UniAPR as a publicly available fully automated Maven Plugin. Our study demonstrates for the first time that on-the-fly patch validation can often speed up state-of-the-art source-code-level APR by over an order of magnitude, enabling all existing APR techniques to explore a larger search space to fix more bugs in the near future. Furthermore, our study shows the first empirical evidence that vanilla on-the-fly patch validation can be imprecise/unsound, while UniAPR with JVM reset is able to mitigate such issues with negligible overhead. 
    more » « less
  4. Automated Program Repair (APR) is one of the most recent advances in automated debugging, and can directly fix buggy programs with minimal human intervention. Although various advanced APR techniques (including search-based or semantic-based ones) have been proposed, they mainly work at the source-code level and it is not clear how bytecode-level APR performs in practice. Also, empirical studies of the existing techniques on bugs beyond what has been reported in the original papers are rather limited. In this paper, we implement the first practical bytecode-level APR technique, PraPR, and present the first extensive study on fixing real-world bugs (e.g., Defects4J bugs) using JVM bytecode mutation. The experimental results show that surprisingly even PraPR with only the basic traditional mutators can produce genuine fixes for 17 bugs; with simple additional commonly used APR mutators, PraPR is able to produce genuine fixes for 43 bugs, significantly outperforming state-of-the-art APR, while being over 10X faster. Furthermore, we performed an extensive study of PraPR and other recent APR tools on a large number of additional real-world bugs, and demonstrated the overfitting problem of recent advanced APR tools for the first time. Lastly, PraPR has also successfully fixed bugs for other JVM languages (e.g., for the popular Kotlin language), indicating PraPR can greatly complement existing source-code-level APR. 
    more » « less
  5. Training deep neural networks can generate non-descriptive error messages or produce unusual output without any explicit errors at all. While experts rely on tacit knowledge to apply debugging strategies, non-experts lack the experience required to interpret model output and correct Deep Learning (DL) programs. In this work, we identify DL debugging heuristics and strategies used by experts, andIn this work, we categorize the types of errors novices run into when writing ML code, and map them onto opportunities where tools could help. We use them to guide the design of Umlaut. Umlaut checks DL program structure and model behavior against these heuristics; provides human-readable error messages to users; and annotates erroneous model output to facilitate error correction. Umlaut links code, model output, and tutorial-driven error messages in a single interface. We evaluated Umlaut in a study with 15 participants to determine its effectiveness in helping developers find and fix errors in their DL programs. Participants using Umlaut found and fixed significantly more bugs and were able to implement fixes for more bugs compared to a baseline condition. 
    more » « less