Detecting software vulnerabilities has been a challenge for decades. Many techniques have been developed to detect vulnerabilities by reporting whether a vulnerability exists in the code of software. But few of them have the capability to categorize the types of detected vulnerabilities, which is crucial for human developers or other tools to analyze and address vulnerabilities. In this paper, we present our work on identifying the types of vulnerabilities using deep learning. Our data consists of code slices parsed in a manner that captures the syntax and semantics of a vulnerability, sourced from prior work. We train deep neural networks on these features to perform multiclass classification of software vulnerabilities in the dataset. Our experiments show that our models can effectively identify the vulnerability classes of the vulnerable functions in our dataset.
more »
« less
VELVET: a noVel Ensemble Learning approach to automatically locate VulnErable sTatements
Automatically locating vulnerable statements in source code is crucial to assure software security and alleviate developers' debugging efforts. This becomes even more important in today's software ecosystem, where vulnerable code can flow easily and unwittingly within and across software repositories like GitHub. Across such millions of lines of code, traditional static and dynamic approaches struggle to scale. Although existing machine-learning-based approaches look promising in such a setting, most work detects vulnerable code at a higher granularity – at the method or file level. Thus, developers still need to inspect a significant amount of code to locate the vulnerable statement(s) that need to be fixed. This paper presents Velvet, a novel ensemble learning approach to locate vulnerable statements. Our model combines graph-based and sequence-based neural networks to successfully capture the local and global context of a program graph and effectively understand code semantics and vulnerable patterns. To study Velvet's effectiveness, we use an off-the-shelf synthetic dataset and a recently published real-world dataset. In the static analysis setting, where vulnerable functions are not detected in advance, Velvet achieves 4.5× better performance than the baseline static analyzers on the real-world data. For the isolated vulnerability localization task, where we assume the vulnerability of a function is known while the specific vulnerable statement is unknown, we compare Velvet with several neural networks that also attend to local and global context of code. Velvet achieves 99.6% and 43.6% top-1 accuracy over synthetic data and real-world data, respectively, outperforming the baseline deep learning models by 5.3-29.0%.
more »
« less
- Award ID(s):
- 1815494
- PAR ID:
- 10346729
- Date Published:
- Journal Name:
- IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER)
- Page Range / eLocation ID:
- 959 to 970
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
As graph data grows increasingly complicated, training graph neural networks (GNNs) on large-scale datasets presents significant challenges, including computational resource constraints, data redundancy, and transmission inefficiencies. While existing graph condensation techniques have shown promise in addressing these issues, they are predominantly designed for single-label datasets, where each node is associated with a single class label. However, many real-world applications, such as social network analysis and bioinformatics, involve multi-label graph datasets, where one node can have various related labels. To deal with this problem, we extend traditional graph condensation approaches to accommodate multi-label datasets by introducing modifications to synthetic dataset initialization and condensing optimization. Through experiments on eight real-world multi-label graph datasets, we prove the effectiveness of our method. In the experiment, the GCond framework, combined with K-Center initialization and binary cross-entropy loss (BCELoss), generally achieves the best performance. This benchmark for multi-label graph condensation not only enhances the scalability and efficiency of GNNs for multi-label graph data but also offers substantial benefits for diverse real-world applications. Code is available at https://github.com/liangliang6v6/Multi-GC.more » « less
-
Software vulnerabilities have become a serious problem with the emergence of new applications that contain potentially vulnerable or malicious code that can compromise the system. The growing volume and complexity of software source codes have opened a need for vulnerability detection methods to successfully predict malicious codes before being the prey of cyberattacks. As leveraging humans to check sources codes requires extensive time and resources and preexisting static code analyzers are unable to properly detect vulnerable codes. Thus, artificial intelligence techniques, mainly deep learning models, have gained traction to detect source code vulnerability. A systematic review is carried out to explore and understand the various deep learning methods employed for the task and their efficacy as a prediction model. Additionally, a summary of each process and its characteristics are examined and its implementation on specific data sets and their evaluation will be discussed.more » « less
-
Software vulnerabilities have become a serious problem with the emergence of new applications that contain potentially vulnerable or malicious code that can compromise the system. The growing volume and complexity of software source codes have opened a need for vulnerability detection methods to successfully predict malicious codes before being the prey of cyberattacks. As leveraging humans to check sources codes requires extensive time and resources and preexisting static code analyzers are unable to properly detect vulnerable codes. Thus, artificial intelligence techniques, mainly deep learning models, have gained traction to detect source code vulnerability. A systematic review is carried out to explore and understand the various deep learning methods employed for the task and their efficacy as a prediction model. Additionally, a summary of each process and its characteristics are examined and its implementation on specific data sets and their evaluation will be discussed.more » « less
-
null (Ed.)Deep learning methods for graphs achieve remarkable performance on many node-level and graph-level prediction tasks. However, despite the proliferation of the methods and their success, prevailing Graph Neural Networks (GNNs) neglect subgraphs, rendering subgraph prediction tasks challenging to tackle in many impactful applications. Further, subgraph prediction tasks present several unique challenges: subgraphs can have non-trivial internal topology, but also carry a notion of position and external connectivity information relative to the underlying graph in which they exist. Here, we introduce SubGNN, a subgraph neural network to learn disentangled subgraph representations. We propose a novel subgraph routing mechanism that propagates neural messages between the subgraph's components and randomly sampled anchor patches from the underlying graph, yielding highly accurate subgraph representations. SubGNN specifies three channels, each designed to capture a distinct aspect of subgraph topology, and we provide empirical evidence that the channels encode their intended properties. We design a series of new synthetic and real-world subgraph datasets. Empirical results for subgraph classification on eight datasets show that SubGNN achieves considerable performance gains, outperforming strong baseline methods, including node-level and graph-level GNNs, by 19.8% over the strongest baseline. SubGNN performs exceptionally well on challenging biomedical datasets where subgraphs have complex topology and even comprise multiple disconnected components.more » « less
An official website of the United States government

