Information leaks in software can unintentionally reveal private data, yet they are hard to detect and fix. Although several methods have been proposed to detect leakage, such as static verification-based approaches, they require specialist knowledge and are time-consuming. Recently, we introduced HyperGI, a dynamic, hypertest-based approach that can detect and produce potential fixes for hyperproperty violations. In particular, we focused on violations of the noninterference property, as such violations result in information flow leakage. Our instantiation of HyperGI was able to detect and reduce leakage in three small programs. Its fitness function tried to balance information leakage and program correctness but, as we pointed out, there may be tradeoffs between preserving program semantics and reducing information leakage that require developer decisions. In this work we ask whether it is possible to automatically detect and repair information leakage in more realistic programs without requiring specialist knowledge. We instantiate a multi-objective version of HyperGI in a tool, called LeakReducer, which explicitly encodes the tradeoff between program correctness and information leakage. We apply LeakReducer to six leaky programs, including the well-known Heartbleed bug. LeakReducer detects leakage in all six, whereas state-of-the-art fuzzers detect leakage in only two. Moreover, LeakReducer is able to reduce leakage in all subjects, with results comparable to previous work, while scaling to much larger software.
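To make the idea of a hypertest for noninterference concrete, the following is a minimal sketch in Python. It is not HyperGI or LeakReducer code: the functions `leaky_echo`, `bounded_echo`, and `violates_noninterference` are hypothetical names invented for illustration, and the toy "echo" program only loosely imitates a Heartbleed-style over-read. The check runs a program twice with identical public inputs and two different secrets; if the public outputs differ, the secret has influenced what an observer can see.

```python
from typing import Callable

# Hypothetical toy program: echoes back `claimed_len` bytes of a buffer in
# which the secret happens to sit directly after the public payload.
def leaky_echo(payload: bytes, claimed_len: int, secret: bytes) -> bytes:
    memory = payload + secret
    return memory[:claimed_len]  # no bounds check: may read into the secret


# Hypothetical repaired variant: never returns more than the payload itself.
def bounded_echo(payload: bytes, claimed_len: int, secret: bytes) -> bytes:
    memory = payload + secret
    return memory[:min(claimed_len, len(payload))]


def violates_noninterference(
    program: Callable[[bytes, int, bytes], bytes],
    payload: bytes,
    claimed_len: int,
    secret_a: bytes,
    secret_b: bytes,
) -> bool:
    """Run one hypertest: same public inputs, two secrets, compare outputs."""
    out_a = program(payload, claimed_len, secret_a)
    out_b = program(payload, claimed_len, secret_b)
    return out_a != out_b


if __name__ == "__main__":
    # An over-long claimed length exposes the leak in the unbounded version.
    print(violates_noninterference(leaky_echo, b"hi", 6, b"AAAA", b"BBBB"))
    print(violates_noninterference(bounded_echo, b"hi", 6, b"AAAA", b"BBBB"))
```

Because the check is dynamic, it only reports a violation when the chosen public inputs actually trigger the leak: with `claimed_len` equal to the payload length, the leaky version passes the same hypertest.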
Data Leakage in Notebooks: Static Detection and Better Processes
Data science pipelines to train and evaluate models with machine learning may contain bugs just like any other code. Leakage between training and test data can lead to overestimating the model’s accuracy during offline evaluations, possibly leading to deployment of low-quality models in production. Such leakage can happen easily by mistake or by following poor practices, but may be tedious and challenging to detect manually. We develop a static analysis approach to detect common forms of data leakage in data science code. Our evaluation shows that our analysis accurately detects data leakage and that such leakage is pervasive among over 100,000 analyzed public notebooks. We discuss how our static analysis approach can help both practitioners and educators, and how leakage prevention can be designed into the development process.
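As an illustration of how such leakage can slip in by mistake, the sketch below shows one commonly discussed pattern, assumed here for illustration: computing normalization statistics on the full dataset before splitting it. The abstract does not list the specific leakage forms the analysis targets, and this is not the paper's static analysis; the helper names `fit_stats` and `standardize` and the toy data are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Tuple


def fit_stats(values: List[float]) -> Tuple[float, float]:
    """Return the (mean, population standard deviation) of the given rows."""
    return mean(values), pstdev(values)


def standardize(values: List[float], mu: float, sigma: float) -> List[float]:
    """Scale values using previously fitted statistics."""
    return [(v - mu) / sigma for v in values]


# Toy data: the last row is held out as test data.
data = [1.0, 2.0, 3.0, 4.0, 100.0]
train, test = data[:4], data[4:]

# Leaky: the statistics are fitted on ALL rows, so the held-out test row
# influences how the training data is transformed.
leaky_mu, leaky_sigma = fit_stats(data)
leaky_train = standardize(train, leaky_mu, leaky_sigma)

# Leak-free: fit on the training rows only, then reuse those statistics
# when transforming the test rows.
clean_mu, clean_sigma = fit_stats(train)
clean_train = standardize(train, clean_mu, clean_sigma)
clean_test = standardize(test, clean_mu, clean_sigma)

if __name__ == "__main__":
    print("mean fitted on all rows:  ", leaky_mu)
    print("mean fitted on train only:", clean_mu)
```

The two fitted means differ only because the test row took part in the first fit, which is the kind of dependence between test data and training-time computation that is tedious to spot by reading a long notebook.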
- Award ID(s): 2131477
- PAR ID: 10444835
- Date Published:
- Journal Name: ASE '22: Proceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering
- Page Range / eLocation ID: 1 to 12
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Existing malicious code detection techniques demand the integration of multiple tools to detect different malware patterns, often suffering from high misclassification rates. Therefore, malicious code detection techniques could be enhanced by adopting advanced, more automated approaches to achieve high accuracy and a low misclassification rate. The goal of this study is to aid security analysts in detecting malicious packages by empirically studying the effectiveness of Large Language Models (LLMs) in detecting malicious code. We present SocketAI, a malicious code review workflow to detect malicious code. To evaluate the effectiveness of SocketAI, we leverage a benchmark dataset of 5,115 npm packages, of which 2,180 packages have malicious code. We conducted a baseline comparison of GPT-3 and GPT-4 models with the state-of-the-art CodeQL static analysis tool, using 39 custom CodeQL rules developed in prior research to detect malicious JavaScript code. We also compare the effectiveness of static analysis as a pre-screener with the SocketAI workflow, measuring the number of files that need to be analyzed and the associated costs. Additionally, we performed a qualitative study to understand the types of malicious packages detected or missed by our workflow. Our baseline comparison demonstrates a 16% and 9% improvement over static analysis in precision and F1 scores, respectively. GPT-4 achieves higher accuracy with 99% precision and 97% F1 scores, while GPT-3 offers a more cost-effective balance at 91% precision and 94% F1 scores. Prescreening files with a static analyzer reduces the number of files requiring LLM analysis by 77.9% and decreases costs by 60.9% for GPT-3 and 76.1% for GPT-4. Our qualitative analysis identified data theft, execution of arbitrary code, and suspicious domain categories as the top detected malicious packages.
Ko, Hanseok (Ed.) Malware represents a significant security concern in today’s digital landscape, as it can destroy or disable operating systems, steal sensitive user information, and occupy valuable disk space. However, current malware detection methods, such as static-based and dynamic-based approaches, struggle to identify newly developed ("zero-day") malware and are limited by customized virtual machine (VM) environments. To overcome these limitations, we propose a novel malware detection approach that leverages deep learning, mathematical techniques, and network science. Our approach focuses on static and dynamic analysis and utilizes the Low-Level Virtual Machine (LLVM) to profile applications within a complex network. The generated network topologies are input into the GraphSAGE architecture to efficiently distinguish between benign and malicious software applications, with the operation names denoted as node features. Importantly, the GraphSAGE models analyze the network’s topological geometry to make predictions, enabling them to detect state-of-the-art malware and prevent potential damage during execution in a VM. To evaluate our approach, we conduct a study on a dataset comprising source code from 24,376 applications, specifically written in C/C++, sourced directly from widely-recognized malware and various types of benign software. The results show a high detection performance with an Area Under the Receiver Operating Characteristic Curve (AUROC) of 99.85%. Our approach marks a substantial improvement in malware detection, providing a notably more accurate and efficient solution when compared to current state-of-the-art malware detection methods. The code is released at https://github.com/HantangZhang/MGN.
Shallow groundwater resources overlying deep saline formations used in carbon storage applications are subject to a potential contamination threat from CO2/brine leakage via natural or anthropogenically induced conductive pathways in the confining caprock. Identifying the leakage source location and rate is critical for developing remediation plans and designing corrective actions. Owing to limited information about the flow and transport characteristics of deep regimes and the high cost of obtaining data on their response to CO2 injection operations, estimating accurate source settings (i.e., location and rate) can be extremely challenging. Under such conditions, Bayesian inverse frameworks become useful tools to help identify potential leakage patterns. This study tests and validates an ensemble-based data-assimilation approach that reduces the uncertainty in the prior knowledge about source settings by conditioning forward transport models on relatively inexpensive, easy-to-acquire shallow zone data. The approach couples the newly developed ensemble smoother tool in the inversion code “PEST++” with the transport code “FEFLOW” to perform history matching and uncertainty analysis. A novel parameterization method that allows the potential source to be repositioned was used to search for the leakage location during the calibration process. In the absence of field data, the approach was validated using experimental data generated in an ~8 m long soil tank simulating leakage from the storage zone migrating to the shallow aquifer. The results show that source location uncertainty can be reasonably reduced using shallow zone data collected from near-surface aquifers. However, more prior information about the system and deeper data are essential to estimate a practical probability range for the leakage rate.
Safety violations in programmable logic controllers (PLCs), caused either by faults or attacks, have recently garnered significant attention. However, prior efforts at PLC code vetting suffer from many drawbacks. Static analyses and verification cause significant false positives and cannot reveal specific runtime contexts. Dynamic analyses and symbolic execution, on the other hand, fail due to their inability to handle real-world PLC programs that are event-driven and timing sensitive. In this paper, we propose VetPLC, a temporal context-aware, program analysis-based approach to produce timed event sequences that can be used for automatic safety vetting. To this end, we (a) perform static program analysis to create timed event causality graphs in order to understand causal relations among events in PLC code and (b) mine temporal invariants from data traces collected in Industrial Control System (ICS) testbeds to quantitatively gauge temporal dependencies that are constrained by machine operations. Our VetPLC prototype has been implemented in 15K lines of code. We evaluate it on 10 real-world scenarios from two different ICS settings. Our experiments show that VetPLC outperforms state-of-the-art techniques and can generate event sequences that can be used to automatically detect hidden safety violations.