NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Leveraging Data Characteristics for Bug Localization in Deep Learning Programs

https://doi.org/10.1145/3708473

Manke, Ruchira; Wardat, Mohammad; Khomh, Foutse; Rajan, Hridesh (July 2025, ACM Transactions on Software Engineering and Methodology)

Deep Learning (DL) is a class of machine learning algorithms that are used in a wide variety of applications. Like any software system, DL programs can have bugs. To support bug localization in DL programs, several tools have been proposed in the past. As most of the bugs that occur due to improper model structure known as structural bugs lead to inadequate performance during training, it is challenging for developers to identify the root cause and address these bugs. To support bug detection and localization in DL programs, in this article, we propose Theia, which detects and localizes structural bugs in DL programs. Unlike the previous works, Theia considers the training dataset characteristics to automatically detect bugs in DL programs developed using two DL libraries,KerasandPyTorch. Since training the DL models is a time-consuming process, Theia detects these bugs at the beginning of the training process and alerts the developer with informative messages containing the bug’s location and actionable fixes which will help them to improve the structure of the model. We evaluated Theia on a benchmark of 40 real-world buggy DL programs obtained fromStack Overflow. Our results show that Theia successfully localizes 57/75 structural bugs in 40 buggy programs, whereas NeuraLint, a state-of-the-art approach capable of localizing structural bugs before training localizes 17/75 bugs.
more » « less
Free, publicly-accessible full text available July 31, 2026
IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models

https://doi.org/10.1145/3715775

Imtiaz, Sayem Mohammad; Singh, Astha; Batole, Fraol; Rajan, Hridesh (June 2025, Proceedings of the ACM on Software Engineering)

Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
more » « less
Free, publicly-accessible full text available June 19, 2026
Mock Deep Testing: Toward Separate Development of Data and Models for Deep Learning

https://doi.org/10.1109/ICSE55347.2025.00220

Manke, Ruchira; Wardat, Mohammad; Khomh, Foutse; Rajan, Hridesh (April 2025, IEEE)

While deep learning (DL) has permeated, and become an integral component of many critical software systems, today software engineering research hasn't explored how to separately test data and models that are integral for DL approaches to work effectively. The main challenge in independently testing these components arises from the tight dependency between data and models. This research explores this gap, introducing our methodology of mock deep testing for unit testing of DL applications. To enable unit testing, we introduce a design paradigm that decomposes the workflow into distinct, manageable components, minimizes sequential dependencies, and modularizes key stages of the DL, including data preparation and model design. For unit testing these components, we propose modeling their dependencies using mocks. In the context of DL, mocks refer to mock data and mock model that mimic the behavior of the original data and model, respectively. This modular approach facilitates independent development and testing of the components, ensuring comprehensive quality assurance throughout the development process. We have developed KUnit, a framework for enabling mock deep testing for the Keras library, a popular library for developing DL applications. We empirically evaluated KUnit to determine the effectiveness of mocks in independently testing data and models. Our assessment of 50 DL programs obtained from Stack Overflow and GitHub shows that mocks effectively identified 10 issues in the data preparation stage and 53 issues in the model design stage. We also conducted a user study with 36 participants using KUnit to perceive the effectiveness of our approach. Participants using KUnit successfully resolved 25 issues in the data preparation stage and 38 issues in the model design stage. Our findings highlight that mock objects provide a lightweight emulation of the dependencies for unit testing, facilitating early bug detection. Lastly, to evaluate the usability of KUnit, we conducted a post-study survey. The results reveal that KUnit is helpful to DL application developers, enabling them to independently test each component (data and model) and resolve issues effectively in different stages.
more » « less
Free, publicly-accessible full text available April 26, 2026
An LLM-Based Agent-Oriented Approach for Automated Code Design Issue Localization

https://doi.org/10.1109/ICSE55347.2025.00100

Batole, Fraol; OBrien, David; Nguyen, Tien N; Dyer, Robert; Rajan, Hridesh (April 2025, IEEE)

Free, publicly-accessible full text available April 26, 2026
$$\mu \text{PRL}$$: A Mutation Testing Pipeline for Deep Reinforcement Learning Based on Real Faults

https://doi.org/10.1109/ICSE55347.2025.00036

Thomas, Deepak-George; Biagiola, Matteo; Humbatova, Nargiz; Wardat, Mohammad; Jahangirova, Gunel; Rajan, Hridesh; Tonella, Paolo (April 2025, IEEE)

Free, publicly-accessible full text available April 26, 2026
Inferring Data Preconditions from Deep Learning Models for Trustworthy Prediction in Deployment

Ahmed, Shibbir; Gao, Hongyang; Rajan, Hridesh (April 2024, Association for Computing Machinery)

Deep learning models are trained with certain assumptions about the data during the development stage and then used for prediction in the deployment stage. It is important to reason about the trustworthiness of the model's predictions with unseen data during deployment. Existing methods for specifying and verifying traditional software are insufficient for this task, as they cannot handle the complexity of DNN model architecture and expected outcomes. In this work, we propose a novel technique that uses rules derived from neural network computations to infer data preconditions for a DNN model to determine the trustworthiness of its predictions. Our approach, DeepInfer involves introducing a novel abstraction for a trained DNN model that enables weakest precondition reasoning using Dijkstra's Predicate Transformer Semantics. By deriving rules over the inductive type of neural network abstract representation, we can overcome the matrix dimensionality issues that arise from the backward non-linear computation from the output layer to the input layer. We utilize the weakest precondition computation using rules of each kind of activation function to compute layer-wise precondition from the given postcondition on the final output of a deep neural network. We extensively evaluated DeepInfer on 29 real-world DNN models using four different datasets collected from five different sources and demonstrated the utility, effectiveness, and performance improvement over closely related work. DeepInfer efficiently detects correct and incorrect predictions of high-accuracy models with high recall (0.98) and high F-1 score (0.84) and has significantly improved over the prior technique, SelfChecker. The average runtime overhead of DeepInfer is low, 0.22 sec for all the unseen datasets. We also compared runtime overhead using the same hardware settings and found that DeepInfer is 3.27 times faster than SelfChecker, the state-of- the-art in this area.
more » « less
Full Text Available
Data-Driven Evidence-Based Syntactic Sugar Design

OBrien, David; Dyer, Robert; Nguyen, Tien; Rajan, Hridesh (April 2024, Association for Computing Machinery)

Programming languages are essential tools for developers, and their evolution plays a crucial role in supporting the activities of developers. One instance of programming language evolution is the introduction of syntactic sugars, which are additional syntax elements that provide alternative, more readable code constructs. However, the process of designing and evolving a programming language has traditionally been guided by anecdotal experiences and intuition. Recent advances in tools and methodologies for mining open-source repositories have enabled developers to make data-driven software engineering decisions. In light of this, this paper proposes an approach for motivating data-driven programming evolution by applying frequent subgraph mining techniques to a large dataset of 166,827,154 open-source Java methods. The dataset is mined by generalizing Java control-flow graphs to capture broad programming language usages and instances of duplication. Frequent subgraphs are then extracted to identify potentially impactful opportunities for new syntactic sugars. Our diverse results demonstrate the benefits of the proposed technique by identifying new syntactic sugars involving a variety of programming constructs that could be implemented in Java, thus simplifying frequent code idioms. This approach can potentially provide valuable insights for Java language designers, and serve as a proof-of-concept for data-driven programming language design and evolution.
more » « less
Full Text Available
Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

OBrien, David; Biswas, Sumon; Imtiaz, Sayem; Abdalkareem, Rabe; Shihab, Emad; Rajan, Hridesh (April 2024, Association for Computing Machinery)

Code intelligence tools such as GitHub Copilot have begun to bridge the gap between natural language and programming language. A frequent software development task is the management of technical debts, which are suboptimal solutions or unaddressed issues which hinder future software development. Developers have been found to ``self-admit'' technical debts (SATD) in software artifacts such as source code comments. Thus, is it possible that the information present in these comments can enhance code generative prompts to repay the described SATD? Or, does the inclusion of such comments instead cause code generative tools to reproduce the harmful symptoms of described technical debt? Does the modification of SATD impact this reaction? Despite the heavy maintenance costs caused by technical debt and the recent improvements of code intelligence tools, no prior works have sought to incorporate SATD towards prompt engineering. Inspired by this, this paper contributes and analyzes a dataset consisting of 36,381 TODO comments in the latest available revisions of their respective 102,424 repositories, from which we sample and manually generate 1,140 code bodies using GitHub Copilot. Our experiments show that GitHub Copilot can generate code with the symptoms of SATD, both prompted and unprompted. Moreover, we demonstrate the tool's ability to automatically repay SATD under different circumstances and qualitatively investigate the characteristics of successful and unsuccessful comments. Finally, we discuss gaps in which GitHub Copilot's successors and future researchers can improve upon code intelligence tasks to facilitate AI-assisted software maintenance.
more » « less
Full Text Available
Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

https://doi.org/10.1145/3597503.3639176

OBrien, David; Biswas, Sumon; Imtiaz, Sayem Mohammad; Abdalkareem, Rabe; Shihab, Emad; Rajan, Hridesh (April 2024, ACM)
Fix Fairness, Don’t Ruin Accuracy: Performance Aware Fairness Repair using AutoML

Nguyen, Giang; Biswas, Sumon; Rajan, Hridesh (December 2023, ESEC/FSE'2023: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Full Text Available

« Prev Next »

Search for: All records