skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on July 8, 2026

Title: Automated Duplicate Bug Report Detection in Large Open Bug Repositories
Many users and contributors of large open-source projects report software defects or enhancement requests (known as bug reports) to the issue-tracking systems. However, they sometimes report issues that have already been reported. First, they may not have time to do sufficient research on existing bug reports. Second, they may not possess the right expertise in that specific area to realize that an existing bug report is essentially elaborating on the same matter, perhaps with a different wording. In this paper, we propose a novel approach based on machine learning methods that can automatically detect duplicate bug reports in an open bug repository based on the textual data in the reports. We present six alternative methods: Topic modeling, Gaussian Na¨ıve Bayes, deep learning, time-based organization, clustering, and summarization using a generative pre-trained transformer large language model. Additionally, we introduce a novel threshold-based approach for duplicate identification, in contrast to the conventional top-k selection method that has been widely used in the literature. Our approach demonstrates promising results across all the proposed methods, achieving accuracy rates ranging from the high 70%’s to the low 90%’s. We evaluated our methods on a public dataset of issues belonging to an Eclipse open-source project.  more » « less
Award ID(s):
2349452
PAR ID:
10657369
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
IEEE
Date Published:
Page Range / eLocation ID:
450 to 458
Subject(s) / Keyword(s):
duplicate bug report detection, bug triage, mining software repositories, natural language processing, machine learning, large language models
Format(s):
Medium: X
Location:
Toronto, Canada
Sponsoring Org:
National Science Foundation
More Like this
  1. Bug tracking systems, which help to track the reported software bugs, have been widely used in software development and maintenance. In these systems, recognizing relevant source files among a large number of source files for a given bug report is a time-consuming and labor-intensive task for software developers. To tackle this problem, information retrieval methods have been widely used to capture either the textual similarities or the semantic similarities between bug reports and source files. However, these two types of similarities are usually considered separately and the historical bug fixings are largely ignored by the existing methods. In this paper, we propose a supervised topic modeling method (STMLOCATOR) for automatically locating the relevant source files for a given bug report. In particular, the proposed model is built upon three key observations. First, supervised modeling can effectively make use of the existing fixing histories. Second, certain words in bug reports tend to appear multiple times in their relevant source files. Third, longer source files tend to have more bugs. By integrating the above three observations, the proposed STMLOCATOR utilizes historical fixings in a supervised way and learns both the textual similarities and semantic similarities between bug reports and source files. We further consider a special type of bug reports with stack-traces in bug reports, and propose a variant of STMLOCATOR to tailor for such bug reports. Experimental evaluations on three real data sets demonstrate that the proposed STMLOCATOR can achieve up to 23.6% improvement in terms of prediction accuracy over its best competitors, and scales linearly with the size of the data. Moreover, the proposed variant further improves STMLOCATOR by up to 76.2% on those bug reports with stack-traces. 
    more » « less
  2. null (Ed.)
    Bug localization plays an important role in software quality control. Many supervised machine learning models have been developed based on historical bug-fix information. Despite being successful, these methods often require sufficient historical data (i.e., labels), which is not always available especially for newly developed software projects. In response, cross-project bug localization techniques have recently emerged whose key idea is to transferring knowledge from label-rich source project to locate bugs in the target project. However, a major limitation of these existing techniques lies in that they fail to capture the specificity of each individual project, and are thus prone to negative transfer.To address this issue, we propose an adversarial transfer learning bug localization approach, focusing on only transferring the common characteristics (i.e., public information) across projects. Specifically, our approach (CooBa) learns the indicative public information from cross-project bug reports through a shared encoder, and extracts the private information from code files by an individual feature extractor for each project. CooBa further incorporates adversarial learning mechanism to ensure that public information shared between multiple projects could be effectively extracted. Extensive experiments on four large-scale real-world data sets demonstrate that the proposed CooBa significantly outperforms the state of the art techniques. 
    more » « less
  3. null (Ed.)
    The availability of quality information in bug reports that are created daily by software users is key to rapidly fixing software faults. Improving incomplete or deficient bug reports, which are numerous in many popular and actively developed open source software projects, can make software maintenance more effective and improve software quality. In this paper, we propose a system that addresses the problem of bug report incompleteness by automatically posing follow-up questions, intended to elicit answers that add value and provide missing information to a bug report. Our system is based on selecting follow-up questions from a large corpus of already posted follow-up questions on GitHub. To estimate the best follow-up question for a specific deficient bug report we combine two metrics based on: 1) the compatibility of a follow-up question to a specific bug report; and 2) the utility the expected answer to the follow-up question would provide to the deficient bug report. Evaluation of our system, based on a manually annotated held-out data set, indicates improved performance over a set of simple and ablation baselines. A survey of software developers confirms the held-out set evaluation result that about half of the selected follow-up questions are considered valid. The survey also indicates that the valid follow-up questions are useful and can provide new information to a bug report most of the time, and are specific to a bug report some of the time. 
    more » « less
  4. Bug report reproduction is a crucial but time-consuming task to be carried out during mobile app maintenance. To accelerate this process, researchers have developed automated techniques for reproducing mobile app bug reports. However, due to the lack of an effective mechanism to recognize different buggy behaviors described in the report, existing work is limited to reproducing crash bug reports, or requires developers to manually analyze execution traces to determine if a bug was successfully reproduced. To address this limitation, we introduce a novel technique to automatically identify and extract the buggy behavior from the bug report and detect it during the automated reproduction process. To accommodate various buggy behaviors of mobile app bugs, we conducted an empirical study and created a standardized representation for expressing the bug behavior identified from our study. Given a report, our approach first transforms the documented buggy behavior into this standardized representation, then matches it against real-time device and UI information during the reproduction to recognize the bug. Our empirical evaluation demonstrated that our approach achieved over 90% precision and recall in generating the standardized representation of buggy behaviors. It correctly identified bugs in 83% of the bug reports and enhanced existing reproduction techniques, allowing them to reproduce four times more bug reports. 
    more » « less
  5. One of the most important tasks related to managing bug reports is localizing the fault so that a fix can be applied. As such, prior work has aimed to automate this task of bug localization by formulating it as an information retrieval problem, where potentially buggy files are retrieved and ranked according to their textual similarity with a given bug report. However, there is often a notable semantic gap between the information contained in bug reports and identifiers or natural language contained within source code files. For user-facing software, there is currently a key source of information that could aid in bug localization, but has not been thoroughly investigated - information from the GUI. We investigate the hypothesis that, for end user-facing applications, connecting information in a bug report with information from the GUI, and using this to aid in retrieving potentially buggy files, can improve upon existing techniques for bug localization. To examine this phenomenon, we conduct a comprehensive empirical study that augments four baseline techniques for bug localization with GUI interaction information from a reproduction scenario to (i) filter out potentially irrelevant files, (ii) boost potentially relevant files, and (iii) reformulate text-retrieval queries. To carry out our study, we source the current largest dataset of fully-localized and reproducible real bugs for Android apps, with corresponding bug reports, consisting of 80 bug reports from 39 popular open-source apps. Our results illustrate that augmenting traditional techniques with GUI information leads to a marked increase in effectiveness across multiple metrics, including a relative increase in Hits@10 of 13-18%. Additionally, through further analysis, we find that our studied augmentations largely complement existing techniques. 
    more » « less