NSF PAR Search | NSF Public Access Repository

IRepair: An Intent-Aware Approach to Repair Data-Driven Errors in Large Language Models

https://doi.org/10.1145/3715775

Imtiaz, Sayem Mohammad; Singh, Astha; Batole, Fraol; Rajan, Hridesh (June 2025, Proceedings of the ACM on Software Engineering)

Not a day goes by without hearing about the impressive feats of large language models (LLMs), and equally, not a day passes without hearing about their challenges. LLMs are notoriously vulnerable to biases in their dataset, leading to issues such as toxicity, harmful responses, and factual inaccuracies. While domain-adaptive training has been employed to mitigate these issues, these techniques often address all model parameters indiscriminately during the repair process, resulting in poor repair quality and reduced model versatility. In this paper, drawing inspiration from fault localization via program slicing, we introduce a novel dynamic slicing-based intent-aware LLM repair strategy, IRepair. This approach selectively targets the most error-prone sections of the model for repair. Specifically, we propose dynamically slicing the model’s most sensitive layers that require immediate attention, concentrating repair efforts on those areas. This method enables more effective repairs with potentially less impact on the model’s overall versatility by altering a smaller portion of the model. Furthermore, dynamic selection allows for a more nuanced and precise model repair compared to a fixed selection strategy. We evaluated our technique on three models from the GPT2 and GPT-Neo families, with parameters ranging from 800M to 1.6B, in a toxicity mitigation setup. Our results show that IRepair repairs errors 43.6% more effectively while causing 46% less disruption to general performance compared to the closest baseline, direct preference optimization. Our empirical analysis also reveals that errors are more concentrated in a smaller section of the model, with the top 20% of layers exhibiting 773% more error density than the remaining 80%. This highlights the need for selective repair. Additionally, we demonstrate that a dynamic selection approach is essential for addressing errors dispersed throughout the model, ensuring a robust and efficient repair.
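To illustrate the idea of dynamic layer selection described in the abstract above, here is a minimal sketch. It is not the authors' implementation. It assumes each layer has a numeric sensitivity score at every repair step (how that score is computed is not stated in the abstract, so the values below are hypothetical), and it simply re-picks the top fraction of layers at each step. The function name `select_sensitive_layers` and the `fraction` parameter are illustrative only.

```python
import math


def select_sensitive_layers(sensitivity, fraction=0.2):
    """Return indices of the highest-scoring layers, most sensitive first.

    `sensitivity` is a list of per-layer scores (hypothetical values here);
    `fraction` is the share of layers to select for repair.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    # Always select at least one layer.
    k = max(1, math.ceil(fraction * len(sensitivity)))
    ranked = sorted(
        range(len(sensitivity)),
        key=lambda i: sensitivity[i],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical per-layer scores at two successive repair steps.
# The selection is recomputed each step, so the chosen layers can change.
step_scores = [
    [0.10, 0.90, 0.20, 0.05, 0.30],
    [0.40, 0.10, 0.20, 0.80, 0.30],
]
selected_per_step = [select_sensitive_layers(scores) for scores in step_scores]
```

A fixed selection strategy would compute the selection once and reuse it; recomputing it per step is what makes the selection dynamic in this sketch.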
Free, publicly-accessible full text available June 19, 2026
Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

OBrien, David; Biswas, Sumon; Imtiaz, Sayem; Abdalkareem, Rabe; Shihab, Emad; Rajan, Hridesh (April 2024, Association for Computing Machinery)

Code intelligence tools such as GitHub Copilot have begun to bridge the gap between natural language and programming language. A frequent software development task is the management of technical debts, which are suboptimal solutions or unaddressed issues that hinder future software development. Developers have been found to "self-admit" technical debts (SATD) in software artifacts such as source code comments. Thus, is it possible that the information present in these comments can enhance code generative prompts to repay the described SATD? Or does the inclusion of such comments instead cause code generative tools to reproduce the harmful symptoms of the described technical debt? Does the modification of SATD impact this reaction? Despite the heavy maintenance costs caused by technical debt and the recent improvements in code intelligence tools, no prior work has sought to incorporate SATD into prompt engineering. Inspired by this, this paper contributes and analyzes a dataset consisting of 36,381 TODO comments in the latest available revisions of their respective 102,424 repositories, from which we sample and manually generate 1,140 code bodies using GitHub Copilot. Our experiments show that GitHub Copilot can generate code with the symptoms of SATD, both prompted and unprompted. Moreover, we demonstrate the tool's ability to automatically repay SATD under different circumstances and qualitatively investigate the characteristics of successful and unsuccessful comments. Finally, we discuss gaps in which GitHub Copilot's successors and future researchers can improve upon code intelligence tasks to facilitate AI-assisted software maintenance.
Full Text Available
Are Prompt Engineering and TODO Comments Friends or Foes? An Evaluation on GitHub Copilot

https://doi.org/10.1145/3597503.3639176

OBrien, David; Biswas, Sumon; Imtiaz, Sayem Mohammad; Abdalkareem, Rabe; Shihab, Emad; Rajan, Hridesh (April 2024, ACM)
Design by Contract for Deep Learning APIs

Ahmed, Shibbir; Imtiaz, Sayem Mohammad; Khairunnesa, Samantha Syeda; Cruz, Breno Dantas; Rajan, Hridesh (November 2023, Association for Computing Machinery)

Deep Learning (DL) techniques are increasingly being incorporated in critical software systems today. DL software is buggy too. Recent work in SE has characterized these bugs, studied fix patterns, and proposed detection and localization strategies. In this work, we introduce a preventative measure. We propose design by contract for DL libraries, DL Contract for short, to document the properties of DL libraries and provide developers with a mechanism to identify bugs during development. While DL Contract builds on traditional design by contract techniques, we need to address unique challenges. In particular, we need to document properties of the training process that are not visible at the functional interface of the DL libraries. To solve these problems, we have introduced mechanisms that allow developers to specify properties of the model architecture, data, and training process. We have designed and implemented DL Contract for Python-based DL libraries and used it to document the properties of Keras, a well-known DL library. We evaluate DL Contract in terms of effectiveness, runtime overhead, and usability. To evaluate the utility of DL Contract, we have developed 15 sample contracts specifically for training problems and structural bugs. We have adopted four well-vetted benchmarks from prior works on DL bug detection and repair. For effectiveness, DL Contract correctly detects 259 bugs in 272 real-world buggy programs from these benchmarks. We found that the DL Contract overhead is fairly minimal for the benchmarks used. Lastly, to evaluate usability, we conducted a survey of twenty participants who have used DL Contract to find and fix bugs. The results reveal that DL Contract can be very helpful to DL application developers when debugging their code.
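As a rough illustration of design by contract applied to a training entry point, the sketch below checks preconditions on data and a training hyperparameter before the function body runs. It is not the DL Contract API and does not use Keras. The decorator `requires`, the exception `ContractViolation`, and the function `fit` are all hypothetical names made up for this example.

```python
import functools


class ContractViolation(Exception):
    """Raised when a precondition on a call does not hold."""


def requires(predicate, message):
    """Attach a precondition to a function (hypothetical helper).

    `predicate` receives the same arguments as the decorated function
    and must return True for the call to proceed.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if not predicate(*args, **kwargs):
                raise ContractViolation(f"{func.__name__}: {message}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# A stand-in for a training call; a real DL library is not used here.
@requires(
    lambda features, labels, learning_rate=0.01: len(features) == len(labels),
    "features and labels must have the same length",
)
@requires(
    lambda features, labels, learning_rate=0.01: learning_rate > 0,
    "learning_rate must be positive",
)
def fit(features, labels, learning_rate=0.01):
    return {"samples": len(features), "learning_rate": learning_rate}
```

Calling `fit([1, 2], [0, 1])` proceeds normally, while `fit([1], [0, 1])` raises `ContractViolation` before any training work would start, which is the point of checking contracts during development.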
Full Text Available
Design by Contract for Deep Learning APIs

Ahmed, Shibbir; Imtiaz, Sayem Mohammad; Khairunnesa, Samantha Syeda; Cruz, Breno Dantas; Rajan, Hridesh (December 2023, ESEC/FSE'2023: The 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Full Text Available
What kinds of contracts do ML APIs need?

Khairunnesa, Samantha Syeda; Ahmed, Shibbir; Imtiaz, Sayem Mohammad; Rajan, Hridesh; Leavens, Gary T (October 2023, Empirical Software Engineering)
Feldt, Robert; Zimmermann, Thomas; Basili, Victor R; Briand, Lionel C (Ed.)
Recent work has shown that Machine Learning (ML) programs are error-prone and called for contracts for ML code. Contracts, as in the design by contract methodology, help document APIs and aid API users in writing correct code. The question is: what kinds of contracts would provide the most help to API users? We are especially interested in what kinds of contracts help API users catch errors at earlier stages in the ML pipeline. We describe an empirical study of posts on Stack Overflow of the four most often-discussed ML libraries: TensorFlow, Scikit-learn, Keras, and PyTorch. For these libraries, our study extracted 413 informal (English) API specifications. We used these specifications to understand the following questions. What are the root causes and effects behind ML contract violations? Are there common patterns of ML contract violations? When does understanding ML contracts require an advanced level of ML software expertise? Could checking contracts at the API level help detect the violations in early ML pipeline stages? Our key findings are that the most commonly needed contracts for ML APIs are either checking constraints on single arguments of an API or on the order of API calls. The software engineering community could employ existing contract mining approaches to mine these contracts to promote an increased understanding of ML APIs. We also noted a need to combine behavioral and temporal contract mining approaches. We report on categories of required ML contracts, which may help designers of contract languages.
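The two contract kinds named in the key findings, a constraint on a single argument and a constraint on the order of API calls, can be sketched as follows. This is an illustrative toy, not code from the study or from any of the four libraries. The class `MeanPredictor`, its `weight` parameter, and `ContractError` are hypothetical.

```python
class ContractError(Exception):
    """Raised when an API contract is violated."""


class MeanPredictor:
    """Toy estimator that predicts a weighted mean of the training labels."""

    def __init__(self, weight=1.0):
        # Single-argument constraint: weight must lie in [0, 1].
        if not 0.0 <= weight <= 1.0:
            raise ContractError("weight must be between 0 and 1")
        self.weight = weight
        self._fitted = False
        self._mean = None

    def fit(self, features, labels):
        if len(labels) == 0:
            raise ContractError("labels must not be empty")
        self._mean = sum(labels) / len(labels)
        self._fitted = True
        return self

    def predict(self, features):
        # Call-order constraint: fit() must be called before predict().
        if not self._fitted:
            raise ContractError("predict() called before fit()")
        return [self.weight * self._mean for _ in features]
```

Both checks fire at the API boundary, so a misuse is reported at the call that caused it instead of surfacing later in the pipeline.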
Full Text Available
Decomposing a Recurrent Neural Network into Modules for Enabling Reusability and Replacement

https://doi.org/10.1109/ICSE48619.2023.00093

Imtiaz, Sayem Mohammad; Batole, Fraol; Singh, Astha; Pan, Rangeet; Cruz, Breno Dantas; Rajan, Hridesh (May 2023, ICSE'23: The 45th International Conference on Software Engineering)

Full Text Available
What Kinds of Contracts Do ML APIs Need?

Khairunnesa, Samantha Syeda; Ahmed, Shibbir; Imtiaz, Sayem Mohammad; Rajan, Hridesh; Leavens, Gary T. (March 2023, Empirical Software Engineering)

Full Text Available
