NSF PAR Search | NSF Public Access Repository

Automatic Semantic Augmentation of Language Model Prompts (for Code Summarization)

https://doi.org/10.1145/3597503.3639183

Ahmed, Toufique; Pai, Kunal Suresh; Devanbu, Premkumar; Barr, Earl (April 2024, ACM)

Full Text Available
Better Patching Using LLM Prompting, via Self-Consistency

https://doi.org/10.1109/ASE56229.2023.00065

Ahmed, Toufique; Devanbu, Premkumar (September 2023, IEEE)

Full Text Available
SynShine: Improved Fixing of Syntax Errors

https://doi.org/10.1109/TSE.2022.3212635

Ahmed, Toufique; Ledesma, Noah Rose; Devanbu, Premkumar (April 2023, IEEE Transactions on Software Engineering)

Full Text Available
Large Language Models and Simple, Stupid Bugs

https://doi.org/10.1109/MSR59073.2023.00082

Jesse, Kevin; Ahmed, Toufique; Devanbu, Premkumar T.; Morgan, Emily (May 2023, Int'l Conference on Mining Software Repositories)

Full Text Available
Few-shot training LLMs for project-specific code-summarization

https://doi.org/10.1145/3551349.3559555

Ahmed, Toufique; Devanbu, Premkumar (October 2022, ACM)

Full Text Available
Extending Source Code Pre-Trained Language Models to Summarise Decompiled Binaries

https://doi.org/10.1109/SANER56733.2023.00033

Al-Kaswan, Ali; Ahmed, Toufique; Izadi, Maliheh; Sawant, Anand Ashok; Devanbu, Premkumar; van Deursen, Arie (March 2023, IEEE)

Full Text Available
NatGen: generative pre-training by “naturalizing” source code

https://doi.org/10.1145/3540250.3549162

Chakraborty, Saikat; Ahmed, Toufique; Ding, Yangruibo; Devanbu, Premkumar T.; Ray, Baishakhi (November 2022, ESEC/FSE 2022: Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering)

Full Text Available
Multilingual training for software engineering

https://doi.org/10.1145/3510003.3510049

Ahmed, Toufique; Devanbu, Premkumar (May 2022, 2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE))

Well-trained machine-learning models, which leverage large amounts of open-source software data, have now become an interesting approach to automating many software engineering tasks. Several SE tasks have been subject to this approach, with performance gradually improving over the past several years with better models and training methods. More, and more diverse, clean, labeled data is better for training; but constructing good-quality datasets is time-consuming and challenging. Ways of augmenting the volume and diversity of clean, labeled data generally have wide applicability. For some languages (e.g., Ruby) labeled data is less abundant; in others (e.g., JavaScript) the available data may be more focused on some application domains, and thus less diverse. As a way around such data bottlenecks, we present evidence suggesting that human-written code in different languages (which performs the same function) is rather similar, and in particular preserves identifier naming patterns; we further present evidence suggesting that identifiers are a very important element of training data for software engineering tasks. We leverage this rather fortuitous phenomenon to find evidence that available multilingual training data (across different languages) can be used to amplify performance. We study this for 3 different tasks: code summarization, code retrieval, and function naming. We note that this data-augmenting approach is broadly compatible with different tasks, languages, and machine-learning models.
Full Text Available
Learning to Find Usage of Library Functions in Optimized Binaries

https://doi.org/10.1109/TSE.2021.3106572

Ahmed, Toufique; Devanbu, Premkumar; Sawant, Anand Ashok (August 2021, IEEE Transactions on Software Engineering)

Much software, whether beneficent or malevolent, is distributed only as binaries, sans source code. Absent source code, understanding binaries' behavior can be quite challenging, especially when compiled under higher levels of compiler optimization. These optimizations can transform comprehensible, “natural” source constructions into something entirely unrecognizable. Reverse engineering binaries, especially those suspected of being malevolent or guilty of intellectual property theft, is an important and time-consuming task. There is a great deal of interest in tools to “decompile” binaries back into more natural source code to aid reverse engineering. Decompilation involves several desirable steps, including recreating source-language constructions, variable names, and perhaps even comments. One central step in creating binaries is optimizing function calls, using steps such as inlining. Recovering these (possibly inlined) function calls from optimized binaries is an essential task that most state-of-the-art decompiler tools try to do but do not perform very well. In this paper, we evaluate a supervised learning approach to the problem of recovering optimized function calls. We leverage open-source software and develop an automated labeling scheme to generate a reasonably large dataset of binaries labeled with actual function usages. We augment this large but limited labeled dataset with a pre-training step, which learns the decompiled code statistics from a much larger unlabeled dataset. Thus augmented, our learned labeling model can be combined with an existing decompilation tool, Ghidra, to achieve substantially improved performance in function call recovery, especially at higher levels of optimization.
Full Text Available
Learning type annotation: is big data enough?

https://doi.org/10.1145/3468264.3473135

Jesse, Kevin; Devanbu, Premkumar T.; Ahmed, Toufique (August 2021, Proceedings of ESEC/FSE Conference)

TypeScript is a widely used optionally-typed language where developers can adopt “pay as you go” typing: they can add types as desired, and benefit from static typing. The “type annotation tax” or manual effort required to annotate new or existing TypeScript can be reduced by a variety of automatic methods. Probabilistic machine-learning (ML) approaches work quite well. ML approaches use different inductive biases, ranging from simple token sequences to complex graph neural network (GNN) models capturing syntax and semantic relations. More sophisticated inductive biases are hand-engineered to exploit the formal nature of software. Rather than deploying fancy inductive biases for code, can we just use “big data” to learn natural patterns relevant to typing? We find evidence suggesting that this is the case. We present TypeBert, demonstrating that even with simple token-sequence inductive bias used in BERT-style models and enough data, type-annotation performance of the most sophisticated models can be surpassed.
Full Text Available
