Quantitative program analysis is an emerging area with applications to software reliability, quantitative information flow, side-channel detection, and attack synthesis. Most quantitative program analysis techniques rely on model counting constraint solvers, which are typically the bottleneck for scalability. Prior work has demonstrated that formula caching can expedite expensive model-counting queries; our key insight is that many subformulas are shared across non-identical constraints generated during program analyses, a form of sharing that prior formula caching approaches do not exploit. In this paper we present a subformula caching framework and integrate it into a model counting constraint solver. We experimentally evaluate its effectiveness in three quantitative program analysis scenarios: (1) model counting constraints generated by symbolic execution, (2) reliability analysis using probabilistic symbolic execution, and (3) adaptive attack synthesis for side-channels. Our experimental results demonstrate that subformula caching significantly improves the performance of quantitative program analysis.
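To make the cached-subformula idea concrete, here is a minimal sketch, not the paper's framework or its solver integration: it assumes constraints are conjunctions of named predicates, splits each constraint into components over disjoint variable sets (so that component counts multiply soundly), and caches the per-component counts so that non-identical top-level constraints can still reuse shared subformulas. The predicate names and the brute-force counter are invented for the illustration.

```python
from itertools import product

# Toy predicates over boolean variables: name -> (variables, evaluation function).
# These stand in for SMT subformulas; they are invented for this illustration.
PREDICATES = {
    "x_or_y": (("x", "y"), lambda a: a["x"] or a["y"]),
    "z":      (("z",),     lambda a: a["z"]),
    "not_z":  (("z",),     lambda a: not a["z"]),
}

subformula_cache = {}  # frozenset of predicate names -> model count

def count_component(preds):
    """Brute-force model count of one variable-disjoint component, memoized."""
    key = frozenset(preds)
    if key in subformula_cache:
        return subformula_cache[key]
    variables = sorted({v for p in preds for v in PREDICATES[p][0]})
    count = sum(
        all(PREDICATES[p][1](dict(zip(variables, bits))) for p in preds)
        for bits in product((False, True), repeat=len(variables))
    )
    subformula_cache[key] = count
    return count

def model_count(preds):
    """Split a conjunction into variable-disjoint components, count each component
    (hitting the cache when a shared subformula reappears in a different top-level
    constraint), and multiply the component counts."""
    components = []
    for p in preds:
        pvars = set(PREDICATES[p][0])
        merged, kept = [p], []
        for comp in components:
            cvars = {v for q in comp for v in PREDICATES[q][0]}
            if cvars & pvars:
                merged.extend(comp)
                pvars |= cvars
            else:
                kept.append(comp)
        components = kept + [merged]
    result = 1
    for comp in components:
        result *= count_component(comp)
    return result

print(model_count(["x_or_y", "z"]))      # 3: counts {x_or_y} and {z} separately
print(model_count(["x_or_y", "not_z"]))  # 3: reuses the cached count of {x_or_y}
```

In the second query the conjunct `x_or_y` appears inside a non-identical top-level constraint, yet its count is served from the cache; that is the kind of cross-constraint sharing the abstract describes.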
Obtaining Information Leakage Bounds via Approximate Model Counting
Information leaks are a significant problem in modern software systems. In recent years, information-theoretic concepts such as Shannon entropy have been applied to quantifying information leaks in programs. One recent approach is to use symbolic execution together with model counting constraint solvers in order to quantify information leakage. There are at least two sources of unsoundness in quantifying information leakage with this approach: (1) symbolic execution may not be able to explore all execution paths, and (2) the model counting constraint solver may not be able to provide an exact count. We present a sound symbolic quantitative information flow analysis that bounds the information leakage both when the program behavior is not fully explored and when the model counting constraint solver cannot provide a precise model count but does provide an upper and a lower bound. We implemented our approach as an extension to KLEE for computing sound bounds on information leakage in C programs.
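For context, the exact-count formulation that this line of work builds on can be written down directly (a reference sketch of the standard definitions, not a quotation of the paper's development): for a deterministic program whose secret input is drawn uniformly from a domain D, and whose explored paths partition the secrets into observable classes with constraints φ_1, ..., φ_n and model counts #(φ_1), ..., #(φ_n), the Shannon-entropy leakage is

$$ p_i = \frac{\#(\phi_i)}{|D|}, \qquad \mathcal{H}(O) = -\sum_{i=1}^{n} p_i \log_2 p_i $$

When the counter returns only a lower and an upper bound on each #(φ_i), each p_i is known only up to an interval, and when some paths are unexplored the partition itself is incomplete; the analysis described above derives sound lower and upper bounds on the leakage in both situations, rather than reporting the exact-count value.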
- PAR ID: 10603754
- Publisher / Repository: Association for Computing Machinery (ACM)
- Date Published:
- Journal Name: Proceedings of the ACM on Programming Languages
- Volume: 7
- Issue: PLDI
- ISSN: 2475-1421
- Format(s): Medium: X
- Size(s): p. 1488-1509
- Sponsoring Org: National Science Foundation
More Like this
-
Information leaks in software can unintentionally reveal private data, yet they are hard to detect and fix. Although several methods have been proposed to detect leakage, such as static verification-based approaches, they require specialist knowledge and are time-consuming. Recently, we introduced HyperGI, a dynamic, hypertest-based approach that can detect and produce potential fixes for hyperproperty violations. In particular, we focused on violations of the noninterference property, since such violations result in information flow leakage. Our instantiation of HyperGI was able to detect and reduce leakage in three small programs. Its fitness function tried to balance information leakage and program correctness but, as we pointed out, there may be tradeoffs between keeping program semantics and reducing information leakage that require developer decisions. In this work we ask whether it is possible to automatically detect and repair information leakage in more realistic programs without requiring specialist knowledge. We instantiate a multi-objective version of HyperGI in a tool, called LeakReducer, which explicitly encodes the tradeoff between program correctness and information leakage. We apply LeakReducer to six leaky programs, including the well-known Heartbleed bug. LeakReducer is able to detect leakage in all of them, in contrast to state-of-the-art fuzzers, which detect leakage in only two. Moreover, LeakReducer is able to reduce leakage in all subjects, with results comparable to previous work, while scaling to much larger software. (A small illustrative sketch of such a two-objective encoding appears after this list.)
-
Worst-case input generation aims to automatically generate inputs that exhibit the worst-case performance of programs. It has several applications, and can, for example, detect vulnerabilities to denial-of-service (DoS) attacks. However, it is non-trivial to generate worst-case inputs for concurrent programs, particularly for resources like memory where the peak cost depends on how processes are scheduled. This article presents the first sound worst-case input generation algorithm for concurrent programs under non-monotone resource metrics like memory. The key insight is to leverage resource-annotated session types and symbolic execution. Session types describe communication protocols on channels in process calculi. Equipped with resource annotations, resource-annotated session types not only encode cost bounds but also indicate how many resources can be reused and transferred between processes. This information is critical for identifying a worst-case execution path during symbolic execution. The algorithm is sound: if it returns any input, it is guaranteed to be a valid worst-case input. The algorithm is also relatively complete: as long as resource-annotated session types are sufficiently expressive and the background theory for SMT solving is decidable, a worst-case input is guaranteed to be returned. A simple case study of a web server's memory usage demonstrates the utility of the worst-case input generation algorithm.
-
Symbolic execution is an automated test input generation technique that models individual program paths as logical constraints. However, the realism of concrete test inputs generated by SMT solvers often comes into question. Existing symbolic execution tools only seek arbitrary solutions for given path constraints. These constraints do not incorporate the naturalness of inputs that follow statistical distributions, range constraints, or preferred string constants. This results in unnatural-looking inputs that fail to emulate real-world data. In this paper, we extend symbolic execution to incorporate naturalness. Our key insight is that users typically understand the semantics of program inputs, such as the distribution of height or the possible values of a zipcode, and that this knowledge can be leveraged to help symbolic execution produce natural test inputs. We instantiate this idea in NaturalSym, a symbolic execution-based test generation tool for data-intensive scalable computing (DISC) applications. NaturalSym generates natural-looking data that mimics real-world distributions by utilizing user-provided input semantics, drastically enhancing the naturalness of inputs while preserving strong bug-finding potential. On DISC applications and commercial big data test benchmarks, NaturalSym achieves a higher degree of realism, as evidenced by a median perplexity score 35.1 points lower, and detects 1.29× more injected faults than BigTest, the state-of-the-art symbolic executor for DISC. This is because BigTest draws inputs purely based on the satisfiability of path constraints constructed from branch predicates, while NaturalSym is able to draw natural concrete values based on user-specified semantics and to prioritize these values during input generation. Our empirical results demonstrate that NaturalSym finds 47.8× more injected faults than NaturalFuzz (a coverage-guided fuzzer) and 19.1× more than ChatGPT, while TestMiner (a mining-based approach) fails to detect any injected faults. NaturalSym is the first symbolic executor to incorporate the notion of input naturalness into symbolic path constraints during SMT-based input generation. We make our code available at https://github.com/UCLA-SEAL/NaturalSym. (A small illustrative sketch of constraining natural samples with a path constraint appears after this list.)
-
We present a novel approach to automatically recover information about the address space layout of remote processes in the presence of Address Space Layout Randomization (ASLR). Our system, dubbed Sleak, performs static analysis and symbolic execution of binary executable programs, and identifies program paths and input parameters leading to partial (i.e., only a few bits) or complete (i.e., the whole address) information disclosure vulnerabilities, revealing addresses of known objects of the target service or application. Sleak takes, as input, the binary executable program, and generates a symbolic expression for each program output that leaks information about the addresses of objects, such as stack variables, heap structures, or function pointers. By comparing these expressions with the concrete output of a remote process executing the same binary program image, our system is able to recover from a few bits to whole addresses of objects of the target application or service. Discovering the address of a single object in the target application is often enough to guess the layout of entire sections of the address space, which can be leveraged by attackers to bypass ASLR. (A small illustrative sketch of the recovery step appears after this list.)
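Returning to LeakReducer above: a minimal sketch of what explicitly encoding the correctness/leakage tradeoff as two objectives can look like. The fitness fields, the dominance-based selection, and the example values are illustrative assumptions, not LeakReducer's actual representation or API.

```python
from typing import NamedTuple

# Each candidate patch is scored on two objectives, both to be minimized:
# an estimate of information leakage and the number of failing functional tests.
class Fitness(NamedTuple):
    leakage_bits: float   # estimated leakage of the patched program (illustrative)
    failing_tests: int    # functional regressions introduced by the patch

def dominates(a: Fitness, b: Fitness) -> bool:
    """a dominates b if it is no worse on both objectives and strictly better on one."""
    return (a.leakage_bits <= b.leakage_bits and a.failing_tests <= b.failing_tests
            and (a.leakage_bits < b.leakage_bits or a.failing_tests < b.failing_tests))

def pareto_front(population):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in population
            if not any(dominates(o, c) for o in population if o is not c)]

candidates = [Fitness(4.0, 0), Fitness(0.0, 3), Fitness(1.0, 1), Fitness(2.0, 2)]
print(pareto_front(candidates))  # keeps the three non-dominated tradeoff points
```

Keeping the whole Pareto front, rather than collapsing the two scores into one weighted sum, is what leaves the semantics-versus-leakage decision to the developer.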
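Returning to NaturalSym above: a minimal sketch of biasing concrete value selection toward user-specified input semantics while still satisfying a path constraint. The height distribution, the constraint, and the fallback value are assumptions made for the illustration, not NaturalSym's interface.

```python
import random

random.seed(0)

# Path constraint for one symbolic path, e.g. derived from a branch predicate.
def path_constraint(height_cm: float) -> bool:
    return height_cm > 180.0

# User-provided input semantics: adult height roughly Normal(170, 10) cm (assumed).
def natural_sample() -> float:
    return random.gauss(170.0, 10.0)

def natural_input(max_tries: int = 10_000, fallback: float = 181.0) -> float:
    """Rejection-sample from the natural distribution subject to the path constraint;
    fall back to an arbitrary satisfying value (what a plain solver would return)
    if the natural samples keep missing the constrained region."""
    for _ in range(max_tries):
        candidate = natural_sample()
        if path_constraint(candidate):
            return candidate
    return fallback

print(natural_input())  # a realistic-looking height that still drives the >180cm branch
```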
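Returning to Sleak above: a minimal sketch of the final recovery step, in which the symbolic expression for a leaky output is inverted against the concrete output of the remote process to learn address bits. The leak expression and all constants are invented for the illustration.

```python
# Suppose analysis of the binary showed that one output field equals
# (stack_addr >> 4) & 0xFFFF, i.e. 16 bits of a stack address shifted by 4.
# (This particular expression is a made-up example of a partial disclosure.)

def leak_expression(stack_addr: int) -> int:
    return (stack_addr >> 4) & 0xFFFF

def recover_bits(observed: int) -> tuple[int, int]:
    """Invert the symbolic leak expression on the concrete observation:
    bits 4..19 of the address become known; the rest stay unknown."""
    known = (observed & 0xFFFF) << 4
    mask = 0xFFFF << 4
    return known, mask

# The "remote" process leaks a value computed from its real (randomized) address.
secret_stack_addr = 0x7FFC_1234_5ABC
observed_output = leak_expression(secret_stack_addr)

known, mask = recover_bits(observed_output)
assert secret_stack_addr & mask == known
print(hex(known), hex(mask))  # 16 recovered middle bits narrow down the ASLR guess
```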