skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Static Analysis Using Intermediate Representations: A Literature Review
Static Analysis (SA) in Cybersecurity is a practice aimed at detecting vulnerabilities within the source code of a program. Modern SA applications, though highly sophisticated, lack programming language agnostic generalization, instead requiring codebase specific implementations for each programming language. The manner in which SA is implemented today, though functional, requires significant man hours to develop and maintain, higher costs due to custom applications for each language, and creates inconsistencies in implementation from SA-tool to SA-tool. One promising source of programming language generalization occurs within the compilers used to compile code for programming languages like C, C++, and Java. During the compilation process, source code of varying languages moves through several validation passes before being converted into a grammatically consistent Intermediate Representation (IR). The grammatical consistencies provided by IRs allow the same program derived from different programming languages to be represented uniformly and thus analyzed for vulnerabilities. By using IRs of compiled programming languages as the codebase of SA practices, multiple programming languages can be encompassed by a single SA tool. To begin understanding the possibilities the combination of SA and IRs may reveal, this research presents the following outcomes: 1) a systematic literature search, 2) a literature review, and 3) the classification of existing work pertaining to SA practices using IRs. The results of the study indicate that generalized Static Analysis using IRs is already a common practice in all compilers, but that the extended use of IRs in Cybersecurity SA practices aimed at finding vulnerabilities in source code remains underdeveloped.  more » « less
Award ID(s):
1750038
PAR ID:
10519999
Author(s) / Creator(s):
;
Publisher / Repository:
Academic Conferences and publishing limited
Date Published:
ISBN:
1914587707
Format(s):
Medium: X
Location:
Athens, Greece
Sponsoring Org:
National Science Foundation
More Like this
  1. Modern programming languages offer abstractions that simplify software development and allow hardware to reach its full potential. These abstractions range from the well-established OpenMP language extensions to newer C++ features like smart pointers. To properly use these abstractions in an existing codebase, programmers must determine how a given source code region interacts with Program State Elements (PSEs) (i.e., the program's variables and memory locations). We call this process Program State Element Characterization (PSEC). Without tool support for PSEC, a programmer's only option is to manually study the entire codebase. We propose a profile-based approach that automates PSEC and provides abstraction recommendations to programmers. Because a profile-based approach incurs an impractical overhead, we introduce the Compiler and Runtime Memory Observation Tool (CARMOT), a PSEC-specific compiler co-designed with a parallel runtime. CARMOT reduces the overhead of PSEC by two orders of magnitude, making PSEC practical. We show that CARMOT's recommendations achieve the same speedup as hand-tuned OpenMP directives and avoid memory leaks with C++ smart pointers. From this, we argue that PSEC tools, such as CARMOT, can provide support for the rich ecosystem of modern programming language abstractions. 
    more » « less
  2. Developers nowadays regularly use numerous programming languages with different characteristics and trade-offs. Unfortunately, implementing a software verifier for a new language from scratch is a large and tedious undertaking, requiring expert knowledge in multiple domains, such as compilers, verification, and constraint solving. Hence, only a tiny fraction of the used languages has readily available software verifiers to aid in the development of correct programs. In the past decade, there has been a trend of leveraging popular compiler intermediate representations (IRs), such as LLVM IR, when implementing software verifiers. Processing IR promises out-of-the-box multi- and cross-language verification since, at least in theory, a verifier ought to be able to handle a program in any programming language (and their combination) that can be compiled into the IR. In practice though, to the best of our knowledge, nobody has explored the feasibility and ease of such integration of new languages. In this paper, we provide a procedure for adding support for a new language into an IR-based verification toolflow. Using our procedure, we extend the SMACK verifier with prototypical support for 6 additional languages. We assess the quality of our extensions through several case studies, and we describe our experience in detail to guide future efforts in this area. 
    more » « less
  3. To keep up with changes in requirements, frameworks, and coding practices, software organizations might need to migrate code from one language to another. Source-to-source migration, or transpilation, is often a complex, manual process. Transpilation requires expertise both in the source and target language, making it highly laborious and costly. Languages models for code generation and transpilation are becoming increasingly popular. However, despite capturing code-structure well, code generated by language models is often spurious and contains subtle problems. We proposeBatFix, a novel approach that augments language models for transpilation by leveraging program repair and synthesis to fix the code generated by these models.BatFixtakes as input both the original program, the target program generated by the machine translation model, and a set of test cases and outputs a repaired program that passes all test cases. Experimental results show that our approach is agnostic to language models and programming languages.BatFixcan locate bugs spawning multiple lines and synthesize patches for syntax and semantic bugs for programs migrated fromJavatoC++andPythontoC++from multiple language models, including, OpenAI’sCodex. 
    more » « less
  4. Successful cross-language clone detection could enable researchers and developers to create robust language migration tools, facilitate learning additional programming languages once one is mastered, and promote reuse of code snippets over a broader codebase. How- ever, identifying cross-language clones presents special challenges to the clone detection problem. A lack of common underlying rep- resentation between arbitrary languages means detecting clones requires one of the following solutions: 1) a static analysis frame- work replicated across each targeted language with annotations matching language features across all languages, or 2) a dynamic analysis framework that detects clones based on runtime behavior. In this work, we demonstrate the feasibility of the latter solution, a dynamic analysis approach called SLACC for cross-language clone detection. Like prior clone detection techniques, we use input/out- put behavior to match clones, though we overcome limitations of prior work by amplifying the number of inputs and covering more data types; and as a result, achieve better clusters than prior at- tempts. Since clusters are generated based on input/output behav- ior, SLACC supports cross-language clone detection. As an added challenge, we target a static typed language, Java, and a dynamic typed language, Python. Compared to HitoshiIO, a recent clone de- tection tool for Java, SLACC retrieves 6 times as many clusters and has higher precision (86.7% vs. 30.7%). This is the first work to perform clone detection for dynamic typed languages (precision = 87.3%) and the first to perform clone detection across languages that lack a common underlying repre- sentation (precision = 94.1%). It provides a first step towards the larger goal of scalable language migration tools. 
    more » « less
  5. Nowadays, cyberattack incidents are happening on a daily basis. As a result, the demand for a larger and more challenging workforce is increasing. To handle this demand, academic institutions offer cybersecurity courses and degree programs into their curricula; however, more efforts are needed to address the high demand of the cybersecurity workforce. This work aims to bridge the gap between workforce shortage and the number of qualified graduates to fill the positions. We approach this by introducing cybersecurity concepts at the early stage of undergraduate curricula of computer science and engineering programs. Secure programming is critical as many cybersecurity incidents happen due to software vulnerabilities. However, most UG-level programming courses pay little attention to secure programming practices. As a result, many students graduate with limited knowledge of security vulnerabilities that might plague the developed software. Our goal in this work is to introduce secure programming at introductory level programming courses so that students should be aware of cybersecurity issues and use this security mindset in advanced level courses and projects in their degree programs. To accomplish this goal, we developed intuitive and interactive modules emphasizing secure programming in C++ and Java courses to help students become secure software developers. These modules will be used alongside the coursework to emphasize certain vulnerabilities within the programming environment of a specific language and allow students to learn cybersecurity topics, enforcing a solid foundation and understanding. We developed cybersecurity educational modules for C++ and Java as they are amongst the popular languages and used in introductory programming courses. While designing these modules, we kept in mind that the topics must be relevant to real-world issues in the software industry. We used a variety of resources and benchmarks to ensure the authenticity of our chosen topics, including Common Weakness Enumeration (CWE) and Common Vulnerability and Exposures (CVE). While choosing module topics to develop, we had some restrictions. For example, the topics must be introductory and easy to understand. These modules are geared towards freshman or sophomore-level UG students who have just started programming. The developed security modules have four components: power-point slides, lab description, code template for the lab, and complete solution. The complete solution for each module will be provided to the instructors to check students’ work if they adopt the modules in their courses. The modules developed for a C++ programming course include labs on input validation, integer overflow, random number generation, function call with incorrect argument type, and dangling pointers. In Java, we developed lab modules for input validation, integer overflow, null object reference, random number generator, and data encapsulation. 
    more » « less