NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

CONCORD: Clone-Aware Contrastive Learning for Source Code

https://doi.org/10.1145/3597926.3598035

Ding, Yangruibo; Chakraborty, Saikat; Buratti, Luca; Pujar, Saurabh; Morari, Alessandro; Kaiser, Gail; Ray, Baishakhi (July 2023, 32nd ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA))

Deep Learning (DL) models to analyze source code have shown immense promise during the past few years. More recently, self-supervised pre-training has gained traction for learning generic code representations valuable for many downstream SE tasks, such as clone and bug detection. While previous work successfully learned from different code abstractions (e.g., token, AST, graph), we argue that it is also essential to factor in how developers code day-to-day for general-purpose representation learning. On the one hand, human developers tend to write repetitive programs referencing existing code snippets from the current codebase or online resources (e.g., Stack Overflow website) rather than implementing functions from scratch; such behaviors result in a vast number of code clones. In contrast, a deviant clone by mistake might trigger malicious program behaviors. Thus, as a proxy to incorporate developers' coding behavior into the pre-training scheme, we propose to include code clones and their deviants. In particular, we propose CONCORD, a self-supervised, contrastive learning strategy to place benign clones closer in the representation space while moving deviants further apart. We show that CONCORD's clone-aware contrastive learning drastically reduces the need for expensive pre-training resources while improving the performance of downstream SE tasks. We also empirically demonstrate that CONCORD can improve existing pre-trained models to learn better representations that consequently become more efficient in both identifying semantically equivalent programs and differentiating buggy from non-buggy code.
more » « less
Full Text Available
Towards Learning (Dis)-Similarity of Source Code from Program Contrasts

https://doi.org/10.18653/v1/2022.acl-long.436

Ding, Yangruibo; Buratti, Luca; Pujar, Saurabh; Morari, Alessandro; Ray, Baishakhi; Chakraborty, Saikat (April 2022, Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers))

Full Text Available
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements

https://doi.org/10.1109/SANER53432.2022.00114

Ding, Yangruibo; Suneja, Sahil; Zheng, Yunhui; Laredo, Jim; Morari, Alessandro; Kaiser, Gail; Ray, Baishakhi (March 2022, IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER))

Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers' debugging efforts. This becomes even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a higher granularity – at the method or file level. Thus, developers still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed. This paper presents Velvet, a novel ensemble learning approach to locate vulnerable statements. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph and effectively understand code semantics and vulnerable patterns. To study Velvet's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, Velvet achieves 4.5× better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume the vulnerability of a function is known while the specific vulnerable statement is unknown, we compare Velvet with several neural networks that also attend to local and global context of code. Velvet achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively, outperforming the baseline deep learning models by 5.3-29.0%.
more » « less
Full Text Available

Search for: All records