Feature Engineering-Based Detection of Buffer Overflow Vulnerability in Source Code Using Neural Networks

Akter, M.; Shahriar. H.; Cardenas, J.; Ahamed, S.; Cuzzocrea, A.

One of the most significant challenges in the field of software code auditing is the presence of vulnerabilities in software source code. Every year, more and more software flaws are discovered, either internally in proprietary code or publicly disclosed. These flaws are highly likely to be exploited and can lead to system compromise, data leakage, or denial of service. To create a large-scale machine learning system for function-level vulnerability identification, we utilized a sizable dataset of C and C++ open-source code containing millions of functions with potential buffer overflow exploits. We have developed an efficient and scalable vulnerability detection method based on neural network models that learn features extracted from the source codes. The source code is first converted into an intermediate representation to remove unnecessary components and shorten dependencies. We maintain the semantic and syntactic information using state-ofthe- art word embedding algorithms such as GloVe and fastText. The embedded vectors are subsequently fed into neural networks such as LSTM, BiLSTM, LSTM-Autoencoder, word2vec, BERT, and GPT-2 to classify the possible vulnerabilities. Furthermore, we have proposed a neural network model that can overcome issues associated with traditional neural networks. We have used evaluation metrics such as F1 score, precision, recall, accuracy, and total execution time to measure the performance. We have conducted a comparative analysis between results derived from features containing a minimal text representation and semantic and syntactic information. We have found that all neural network models provide higher accuracy when we use semantic and syntactic information as features. However, this approach requires more execution time due to the added complexity of the word embedding algorithm. Moreover, our proposed model provides higher accuracy than LSTM, BiLSTM, LSTM-Autoencoder, word2vec and BERT models, and the same accuracy as the GPT-2 model with greater efficiency.

More Like this