Several studies have shown that misuses of cryptographic APIs are common in real-world code (e.g., Apache projects and Android apps). Several open-source and commercial security tools automatically screen Java programs to detect such misuses. To compare their accuracy and security guarantees, we develop two comprehensive benchmarks named CryptoAPI-Bench and ApacheCryptoAPI-Bench. CryptoAPI-Bench consists of 181 unit test cases that cover basic cases as well as complex cases, including interprocedural, field-sensitive, multiple-class, and path-sensitive data flows of misuses. The benchmark also includes correct cases for measuring false-positive rates. ApacheCryptoAPI-Bench consists of 121 cryptographic cases from 10 Apache projects. We evaluate four tools, namely SpotBugs, CryptoGuard, CrySL, and another tool (anonymous), on both benchmarks and present their performance and a comparative analysis. ApacheCryptoAPI-Bench also examines the scalability of the tools. Our benchmarks are useful for advancing state-of-the-art solutions in the space of misuse detection.
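These benchmarks target Java and the JCA; as a rough, hypothetical illustration of the kind of interprocedural misuse case such benchmarks encode, here is a minimal Python analogue (function and value names are invented for the sketch, not taken from the benchmark):

```python
# Hypothetical analogue of an interprocedural crypto-misuse test case.
# The real CryptoAPI-Bench cases are written in Java against the JCA.
import hashlib


def choose_digest() -> str:
    # The insecure choice is made one call away from the crypto sink.
    return "md5"


def hash_password(password: str) -> str:
    algorithm = choose_digest()              # interprocedural data flow
    digest = hashlib.new(algorithm)          # misuse: broken hash reaches the sink
    digest.update(password.encode("utf-8"))
    return digest.hexdigest()


def hash_password_correct(password: str) -> str:
    # The matching "correct" case, used to measure false positives.
    return hashlib.sha256(password.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(hash_password("example"))
    print(hash_password_correct("example"))
```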
Methods and Benchmark for Detecting Cryptographic API Misuses in Python
Extensive research has explored cryptographic API misuse in Java. However, despite the tremendous popularity of Python, similar issues in Python code have not been fully explored. Current static code analysis tools for Python cannot handle the increasing complexity of real-world source code, which limits analysis depth and leaves more cryptographic misuses undetected. In this research, we propose Cryptolation, a Static Code Analysis (SCA) tool that provides security guarantees for complex Python cryptographic code. Most existing analysis tools for Python focus solely on specific frameworks such as Django or Flask; using an SCA approach, Cryptolation instead targets the language rather than any framework. Cryptolation performs an interprocedural data-flow analysis that handles many Python language features through variable inference (statically predicting variable values). Cryptolation covers 59 Python cryptographic modules and can identify 18 potential cryptographic misuses that involve complex language features. In this paper, we also provide a comprehensive analysis and a state-of-the-art benchmark for understanding Python cryptographic Application Programming Interface (API) misuses and their detection. Our benchmark, PyCryptoBench, includes 1,836 Python cryptographic test cases that cover 18 cryptographic rules and five language features, and it provides a framework for evaluating and comparing cryptographic scanners for Python. To evaluate our proposed scanner, we compared Cryptolation against three other state-of-the-art tools: Bandit, Semgrep, and Dlint. We evaluated all four tools on PyCryptoBench and through manual evaluation of real-world projects (four top-ranked and 939 unranked). Our results reveal that, overall, Cryptolation achieved the highest precision throughout our testing and the highest accuracy on our benchmark: 100% precision on PyCryptoBench and the highest precision on the real-world projects.
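The abstract does not list Cryptolation's concrete rules, so the following is only an illustrative sketch of the kind of Python code a crypto scanner needs variable inference to flag, built on standard-library modules (hashlib, ssl); the exact rules and the 59 covered modules are defined in the paper:

```python
# Illustrative only: misuses that are invisible without resolving variable values.
import hashlib
import ssl


def weak_hash(data: bytes) -> str:
    name = "sha" + "1"        # the algorithm name is only known after constant folding
    h = hashlib.new(name)     # flagged only if the analysis infers name == "sha1"
    h.update(data)
    return h.hexdigest()


def insecure_context() -> ssl.SSLContext:
    ctx = ssl.create_default_context()
    verify = False            # flows into a security-sensitive attribute
    ctx.check_hostname = verify   # disables hostname verification
    return ctx


if __name__ == "__main__":
    print(weak_hash(b"example"))
    insecure_context()
```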
- Award ID(s): 1929701
- PAR ID: 10547777
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: IEEE Transactions on Software Engineering
- Volume: 50
- Issue: 5
- ISSN: 0098-5589
- Page Range / eLocation ID: 1118 to 1129
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Cryptographic (crypto) API misuses often cause security vulnerabilities, so static and dynamic analyzers were recently proposed to detect such misuses. These analyzers differ in strengths and weaknesses, and they can miss bugs. Motivated by the inherent limitations of existing analyzers, we study runtime verification (RV) as an alternative for crypto API misuse detection. RV monitors program runs against formal specifications and was shown to be effective and efficient for amplifying the bug-finding ability of software tests. We focus on the popular JCA crypto API and write 22 RV specifications based on expert-validated rules in a static analyzer. We monitor these specifications while running tests in five benchmarks. Lastly, we compare the accuracy of our RV-based approach, RVSec, with those of three state-of-the-art crypto API misuse detectors: CogniCrypt, CryptoGuard, and CryLogger. RVSec has higher accuracy in four benchmarks and is on par with CryptoGuard in the fifth. Overall, RVSec achieves an average F1 measure of 95%, compared with 83%, 78%, and 86% for CogniCrypt, CryptoGuard, and CryLogger, respectively. We show that RV is effective for detecting crypto API misuses and highlight the strengths and limitations of these tools. We also discuss how static and dynamic analysis can complement each other for detecting crypto API misuses.
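RVSec itself writes formal specifications for Java's JCA and monitors them with RV machinery; the toy Python sketch below (invented rule set, standard-library hashlib only) merely illustrates the underlying idea of checking API calls against a rule while the program or its tests run:

```python
# Toy runtime monitor: record a violation whenever a broken hash is requested.
import hashlib

BROKEN_DIGESTS = {"md5", "sha1"}
violations = []

_original_new = hashlib.new


def _monitored_new(name, *args, **kwargs):
    # Check the call against the "no broken digests" rule, then delegate.
    if str(name).lower() in BROKEN_DIGESTS:
        violations.append(f"broken hash algorithm requested: {name}")
    return _original_new(name, *args, **kwargs)


hashlib.new = _monitored_new

# Running the program (or its test suite) now records rule violations.
hashlib.new("md5", b"password")
print(violations)  # ['broken hash algorithm requested: md5']
```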
-
Software APIs exhibit rich diversity and complexity, which not only renders them a common source of programming errors but also hinders program analysis tools for checking them. Such tools either expect a precise API specification, which requires program analysis expertise, or presume that correct API usages follow simple idioms that can be automatically mined from code, which suffers from poor accuracy. We propose a new approach that allows regular programmers to find API misuses. Our approach interacts with the user to classify valid and invalid usages of each target API method. It minimizes user burden by employing an active learning algorithm that ranks API usages by their likelihood of being invalid. We implemented our approach in a tool called ARBITRAR for C/C++ programs, and applied it to check the uses of 18 API methods in 21 large real-world programs, including OpenSSL and Linux Kernel. Within just 3 rounds of user interaction on average per API method, ARBITRAR found 40 new bugs, with patches accepted for 18 of them. Moreover, ARBITRAR finds all known bugs reported by the state-of-the-art tool APISAN in a benchmark suite comprising 92 bugs, with a false-positive rate of only 51.5% compared to APISAN's 87.9%.
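ARBITRAR targets C/C++ and uses a far richer feature model and ranking algorithm; the following toy Python sketch (all names, feature vectors, and the user oracle are invented) only illustrates the active-learning loop described above: rank unlabeled API usages by estimated likelihood of being invalid, query the user on the top-ranked one, and fold the answer back into the ranking.

```python
# Toy active-learning loop over API usages represented as feature vectors.
from statistics import mean

usages = {
    "call@foo.c:10": [1.0, 0.0, 1.0],
    "call@bar.c:42": [0.9, 0.1, 1.0],
    "call@baz.c:77": [0.0, 1.0, 0.0],
}
labels = {}  # usage id -> True (invalid) / False (valid), filled in by the user


def invalid_likelihood(vec, labeled):
    # Score a usage by closeness to the usages already labeled invalid.
    invalid = [usages[u] for u, inv in labeled.items() if inv]
    if not invalid:
        return sum(vec)  # cold start: any fixed heuristic works for the sketch
    centroid = [mean(col) for col in zip(*invalid)]
    return -sum((a - b) ** 2 for a, b in zip(vec, centroid))  # closer = riskier


def ask_user(usage_id):
    # Stand-in for the interactive step; a real session shows the usage's trace.
    return usage_id != "call@baz.c:77"


for _ in range(len(usages)):
    unlabeled = [u for u in usages if u not in labels]
    if not unlabeled:
        break
    candidate = max(unlabeled, key=lambda u: invalid_likelihood(usages[u], labels))
    labels[candidate] = ask_user(candidate)

print({u: ("invalid" if inv else "valid") for u, inv in labels.items()})
```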
-
API misuses are prevalent and extremely harmful. Although various techniques have been proposed for API-misuse detection, it is not even clear how different types of API misuses are distributed or whether existing techniques cover all major types of API misuses. Therefore, in this paper, we conduct the first large-scale empirical study on API misuses based on 528,546 historical bug-fixing commits from GitHub (from 2011 to 2018). By leveraging a state-of-the-art fine-grained AST differencing tool, GumTree, we extract more than one million bug-fixing edit operations, 51.7% of which are API misuses. We further systematically classify API misuses into nine categories according to the edit operations and context. We also extract various frequent API-misuse patterns based on the categories and corresponding operations, which can complement existing API-misuse detection tools. Our study reveals various practical guidelines regarding the importance of different types of API misuses. Furthermore, based on our dataset, we perform a user study to manually analyze the usage constraints of 10 patterns to explore whether the mined patterns can guide the design of future API-misuse detection tools. Specifically, we find that 7,541 potential misuses still exist in the latest Apache projects, and 149 of them have been reported to developers. To date, 57 have already been confirmed and fixed (with 15 rejected). The results indicate the importance of studying historical API misuses and the promise of employing our mined patterns for detecting unknown API misuses.
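GumTree performs fine-grained AST differencing (the study applies it to Java); as a rough, hypothetical illustration of mining API changes from bug-fixing edits, the Python sketch below compares the API calls in a buggy and a fixed snippet using the standard ast module:

```python
# Minimal sketch: report which API calls a bug fix removed or added.
import ast
from collections import Counter


def api_calls(source: str) -> Counter:
    # Count every called expression (e.g., "hashlib.md5") in the snippet.
    calls = Counter()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            calls[ast.unparse(node.func)] += 1
    return calls


buggy = "h = hashlib.md5(data)"
fixed = "h = hashlib.sha256(data)"

removed = api_calls(buggy) - api_calls(fixed)
added = api_calls(fixed) - api_calls(buggy)
print("removed calls:", dict(removed))   # {'hashlib.md5': 1}
print("added calls:", dict(added))       # {'hashlib.sha256': 1}
```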
-
Large language models have demonstrated the ability to generate both natural language and programming language text. Although contemporary code generation models are trained on corpora with several programming languages, they are tested using benchmarks that are typically monolingual. The most widely used code generation benchmarks only target Python, so there is little quantitative evidence of how code generation models perform on other programming languages. We propose MultiPL-E, a system for translating unit test-driven code generation benchmarks to new languages. We create the first massively multilingual code generation benchmark by using MultiPL-E to translate two popular Python code generation benchmarks to 18 additional programming languages. We use MultiPL-E to extend the HumanEval and MBPP benchmarks to 18 languages that encompass a range of programming paradigms and popularity. Using these new parallel benchmarks, we evaluate the multi-language performance of three state-of-the-art code generation models: Codex, CodeGen, and InCoder. We find that Codex matches or even exceeds its Python performance on several other languages. The range of programming languages represented in MultiPL-E allows us to explore the impact of language frequency and language features on model performance. Finally, the MultiPL-E approach of compiling code generation benchmarks to new programming languages is both scalable and extensible, making it straightforward to evaluate new models, benchmarks, and languages.
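MultiPL-E's real benchmark compilers handle types, docstrings, and value representations across many languages; the toy Python sketch below (invented test strings, TypeScript-style output, scalar expected values only) merely illustrates the core idea of mechanically translating simple assert-based unit tests to another language:

```python
# Toy translator: "assert <call> == <literal>" -> a TypeScript-style assertion.
import ast
import json


def to_ts_assert(py_assert: str) -> str:
    # Expected shape: "assert <call> == <scalar literal>"; anything else is out of scope.
    node = ast.parse(py_assert).body[0]
    call = ast.unparse(node.test.left)
    expected = ast.literal_eval(node.test.comparators[0])
    return f"console.assert({call} === {json.dumps(expected)});"


tests = ["assert add(2, 3) == 5", "assert greet('Ada') == 'Hello, Ada'"]
for t in tests:
    print(to_ts_assert(t))
# console.assert(add(2, 3) === 5);
# console.assert(greet('Ada') === "Hello, Ada");
```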