One of the most significant challenges in the field of software code auditing is the presence of vulnerabilities in software source code. Every year, more and more software flaws are discovered, either internally in proprietary code or publicly disclosed. These flaws are highly likely to be exploited and can lead to system compromise, data leakage, or denial of service. To create a large-scale machine learning system for function-level
vulnerability identification, we utilized a sizable dataset of C and C++ open-source code containing millions of functions with potential buffer overflow exploits. We have developed an efficient and scalable vulnerability detection method based on neural network models that learn features extracted from the source codes. The source code is first converted into an intermediate representation to remove unnecessary components and shorten dependencies. We maintain the semantic and syntactic information using state-ofthe-
art word embedding algorithms such as GloVe and fastText. The embedded vectors are subsequently fed into neural networks such as LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 to classify the possible vulnerabilities. Furthermore,
we have proposed a neural network model that can overcome issues associated with traditional neural networks. We have used evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time to measure the performance. We have
conducted a comparative analysis between results derived from features containing a minimal text representation and semantic and syntactic information. We have found that all neural network models provide higher accuracy when we use semantic and syntactic
information as features. However, this approach requires more execution time due to the added complexity of the word embedding algorithm. Moreover, our proposed model provides higher accuracy than LSTM, BiLSTM, LSTM-Autoencoder, word2vec
and BERT models, and the same accuracy as the GPT-2 model with greater efficiency.
more »
« less
Automated Vulnerability Detection in Source Code Using Quantum Natural Language Processing
One of the most important challenges in the field of software code audit is the presence of vulnerabilities in software source code. Every year, more and more software flaws are found, either internally in proprietary code or revealed publicly. These flaws are highly likely exploited and lead to system compromise, data leakage, or denial of service. C and C++ open-source codes are now available in order to create a large-scale, classical machine-learning and quantum machine-learning system for function-level vulnerability identification. We assembled a sizable dataset of millions of open-source functions that point to potential exploits. We created an efficient and scalable vulnerability detection method based on a deep neural network model– Long Short-Term Memory (LSTM), and quantum machine learning model– Long Short-Term Memory (QLSTM), that can learn features extracted from the source codes. The source code is first converted into a minimal intermediate representation to remove the pointless components and shorten the dependency. Previous studies lack analyzing features of the source code that causes models to recognize flaws in real-life examples. Therefore, We keep the semantic and syntactic information using state-of-the-art word embedding algorithms such as Glove and fastText. The embedded vectors are subsequently fed into the classical and quantum convolutional neural networks to classify the possible vulnerabilities. To measure the
performance, we used evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time. We made a comparison between the results derived from the classical LSTM and quantum LSTM using basic feature representation as well as semantic and syntactic representation. We found that the QLSTM with semantic and syntactic features detects significantly accurate vulnerability and runs faster than its classical counterpart.
more »
« less
- Award ID(s):
- 2100115
- PAR ID:
- 10415563
- Date Published:
- Journal Name:
- The Second International Conference on Ubiquitous Security (UbiSec 2022)
- Page Range / eLocation ID:
- 83-102
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Software vulnerabilities have become a serious problem with the emergence of new applications that contain potentially vulnerable or malicious code that can compromise the system. The growing volume and complexity of software source codes have opened a need for vulnerability detection methods to successfully predict malicious codes before being the prey of cyberattacks. As leveraging humans to check sources codes requires extensive time and resources and preexisting static code analyzers are unable to properly detect vulnerable codes. Thus, artificial intelligence techniques, mainly deep learning models, have gained traction to detect source code vulnerability. A systematic review is carried out to explore and understand the various deep learning methods employed for the task and their efficacy as a prediction model. Additionally, a summary of each process and its characteristics are examined and its implementation on specific data sets and their evaluation will be discussed.more » « less
-
Software vulnerabilities have become a serious problem with the emergence of new applications that contain potentially vulnerable or malicious code that can compromise the system. The growing volume and complexity of software source codes have opened a need for vulnerability detection methods to successfully predict malicious codes before being the prey of cyberattacks. As leveraging humans to check sources codes requires extensive time and resources and preexisting static code analyzers are unable to properly detect vulnerable codes. Thus, artificial intelligence techniques, mainly deep learning models, have gained traction to detect source code vulnerability. A systematic review is carried out to explore and understand the various deep learning methods employed for the task and their efficacy as a prediction model. Additionally, a summary of each process and its characteristics are examined and its implementation on specific data sets and their evaluation will be discussed.more » « less
-
Detecting software vulnerabilities has been a challenge for decades. Many techniques have been developed to detect vulnerabilities by reporting whether a vulnerability exists in the code of software. But few of them have the capability to categorize the types of detected vulnerabilities, which is crucial for human developers or other tools to analyze and address vulnerabilities. In this paper, we present our work on identifying the types of vulnerabilities using deep learning. Our data consists of code slices parsed in a manner that captures the syntax and semantics of a vulnerability, sourced from prior work. We train deep neural networks on these features to perform multiclass classification of software vulnerabilities in the dataset. Our experiments show that our models can effectively identify the vulnerability classes of the vulnerable functions in our dataset.more » « less
-
Quantum Computing (QC) has gained immense popularity as a potential solution to deal with the ever-increasing size of data and associated challenges leveraging the concept of quantum random access memory (QRAM). QC promises quadratic or exponential increases in computational time with quantum parallelism and thus offer a huge leap forward in the computation of Machine Learning algorithms. This paper analyzes speed up performance of QC when applied to machine learning algorithms, known as Quantum Machine Learning (QML). We applied QML methods such as Quantum Support Vector Machine (QSVM), and Quantum Neural Network (QNN) to detect Software Supply Chain (SSC) attacks. Due to the access limitations of real quantum computers, the QML methods were implemented on open-source quantum simulators such as IBM Qiskit and TensorFlow Quantum. We evaluated the performance of QML in terms of processing speed and accuracy and finally, compared with its classical counterparts. Interestingly, the experimental results differ to the speed up promises of QC by demonstrating higher computational time and lower accuracy in comparison to the classical approaches for SSC attacks.more » « less