Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious Javascript code. We also compare the effectiveness of static analysis as a pre-screener with SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Prescreening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.
more »
« less
Rethinking the Evaluation of Secure Code Generation
Large language models (LLMs) are widely used in software development. However, the code generated by LLMs often contains vulnerabilities. Several secure code generation methods have been proposed to address this issue, but their current evaluation schemes leave several concerns unaddressed. Specifically, most existing studies evaluate security and functional correctness separately, using different datasets. That is, they assess vulnerabilities using securityrelated code datasets while validating functionality with general code datasets. In addition, prior research primarily relies on a single static analyzer, CodeQL, to detect vulnerabilities in generated code, which limits the scope of security evaluation. In this work, we conduct a comprehensive study to systematically assess the improvements introduced by four state-of-the-art secure code generation techniques. Specifically, we apply both security inspection and functionality validation to the same generated code and evaluate these two aspects together. We also employ three popular static analyzers and two LLMs to identify potential vulnerabilities in the generated code. Our study reveals that existing techniques often compromise the functionality of generated code to enhance security. Their overall performance remains limited when evaluating security and functionality together. In fact, many techniques even degrade the performance of the base LLM by more than 50%. Our further inspection reveals that these techniques often either remove vulnerable lines of code entirely or generate “garbage code” that is unrelated to the intended task. Moreover, the commonly used static analyzer CodeQL fails to detect several vulnerabilities, further obscuring the actual security improvements achieved by existing techniques. Our study serves as a guideline for a more rigorous and comprehensive evaluation of secure code generation performance in future work.
more »
« less
- PAR ID:
- 10668402
- Publisher / Repository:
- 2026 IEEE/ACM 48th International Conference on Software Engineering (ICSE ’26),
- Date Published:
- Format(s):
- Medium: X
- Location:
- Rio de Janeiro, Brazil
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract The widespread adoption of conversational LLMs for software development has raised new security concerns regarding the safety of LLM-generated content. Our motivational study outlines ChatGPT’s potential in volunteering context-specific information to the developers, promoting safe coding practices. Motivated by this finding, we conduct a study to evaluate the degree of security awareness exhibited by three prominent LLMs: Claude 3, GPT-4, and Llama 3. We prompt these LLMs with Stack Overflow questions that contain vulnerable code to evaluate whether they merely provide answers to the questions or if they also warn users about the insecure code, thereby demonstrating a degree of security awareness. Further, we assess whether LLM responses provide information about the causes, exploits, and the potential fixes of the vulnerability, to help raise users’ awareness. Our findings show that all three models struggle to accurately detect and warn users about vulnerabilities, achieving a detection rate of only 12.6% to 40% across our datasets. We also observe that the LLMs tend to identify certain types of vulnerabilities related to sensitive information exposure and improper input neutralization much more frequently than other types, such as those involving external control of file names or paths. Furthermore, when LLMs do issue security warnings, they often provide more information on the causes, exploits, and fixes of vulnerabilities compared to Stack Overflow responses. Finally, we provide an in-depth discussion on the implications of our findings, and demonstrated a CLI-based prompting tool that can be used to produce more secure LLM responses.more » « less
-
The Rust Programming Language, developed by Mozilla, has gained popularity for combining performance and memory safety. Its ownership and borrowing mechanisms ensure memory safety and reduce common bugs, making Rust safe. In contrast, Unsafe Rust introduces potential vulnerabilities that can undermine these guarantees and cause difficult security flaws. Unsafe Rust allows low-level operations, interaction with other languages, and optimizations, but bypasses most security checks, so its use should be limited, reviewed, and validated. Static and dynamic code analysis tools help validate that unsafe regions do not contain vulnerabilities. Due to Rust’s rapid growth, documentation of such tools is sparse and difficult to navigate. This research reviews, analyzes, and tests multiple static code analysis tools, aiming to provide developers with a comprehensive resource tailored to Unsafe Rust. The results include insights into each tool’s functionality, strengths, weaknesses, and use cases, equipping developers to write safer, more secure, and reliable Rust code.more » « less
-
Industry is increasingly adopting private 5G networks to securely manage their wireless devices in retail, manufacturing, natural resources, and healthcare. As with most technology sectors, open- source software is well poised to form the foundation of deployments, whether it is deployed directly or as part of well-maintained proprietary offerings. This paper seeks to examine the use of cryptography and secure randomness in open-source cellular cores. We design a set of 13 CodeQL static program analysis rules for cores written in both C/C++ and Go and apply them to 7 open-source cellular cores implementing 4G and 5G functionality. We identify two significant security vulnerabilities, including predictable generation of TMSIs and improper verification of TLS certificates, with each vulnerability affecting multiple cores. In identifying these flaws, we hope to correct implementations to fix downstream deployments and derivative proprietary projects.more » « less
-
In the context of the rising interest in code language models (code LMs) and vulnerability detection, we study the effectiveness of code LMs for detecting vulnerabilities. Our analysis reveals significant shortcomings in existing vulnerability datasets, including poor data quality, low label accuracy, and high duplication rates, leading to unreliable model performance in realistic vulnerability detection scenarios. Additionally, the evaluation methods used with these datasets are not representative of real-world vulnerability detection. To address these challenges, we introduce PRIMEVUL, a new dataset for training and evaluating code LMs for vulnerability detection. PRIMEVUL incorporates a novel set of data labeling techniques that achieve comparable label accuracy to humanverified benchmarks while significantly expanding the dataset. It also implements a rigorous data de-duplication and chronological data splitting strategy to mitigate data leakage issues, alongside introducing more realistic evaluation metrics and settings. This comprehensive approach aims to provide a more accurate assessment of code LMs’ performance in real-world conditions. Evaluating code LMs on PRIMEVUL reveals that existing benchmarks significantly overestimate the performance of these models. For instance, a state-of-the-art 7B model scored 68.26% F1 on BigVul but only 3.09% F1 on PRIMEVUL. Attempts to improve performance through advanced training techniques and larger models like GPT-3.5 and GPT-4 were unsuccessful, with results akin to random guessing in the most stringent settings. These findings underscore the considerable gap between current capabilities and the practical requirements for deploying code LMs in security roles, highlighting the need for more innovative research in this domain.more » « less
An official website of the United States government

