skip to main content


Title: Holistic Control-Flow Protection on Real-Time Embedded Systems with Kage
This paper presents Kage: a system that protects the control data of both application and kernel code on microcontroller-based embedded systems. Kage consists of a Kage-compliant embedded OS that stores all control data in separate memory regions from untrusted data, a compiler that transforms code to protect these memory regions efficiently and to add forward-edge control-flow integrity checks, and a secure API that allows safe updates to the protected data. We implemented Kage as an extension to FreeRTOS, an embedded real-time operating system. We evaluated Kage’s performance using the CoreMark benchmark. Kage incurred a 5.2% average run-time overhead and 49.8% code size overhead. Furthermore, the code size overhead was only 14.2% when compared to baseline FreeRTOS with the MPU enabled. We also evaluated Kage’s security guarantees by measuring and analyzing reachable code-reuse gadgets. Compared to FreeRTOS, Kage reduces the number of reachable gadgets from 2,276 to 27, and the remaining 27 gadgets cannot be stitched together to launch a practical attack.  more » « less
Award ID(s):
1955498
PAR ID:
10350950
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Usenix Security Symposium
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Rowhammer is an increasingly threatening vulnerability that grants an attacker the ability to flip bits in memory without directly accessing them. Despite efforts to mitigate Rowhammer via software and defenses built directly into DRAM modules, more recent generations of DRAM are actually more susceptible to malicious bit-flips than their predecessors. This phenomenon has spawned numerous exploits, showing how Rowhammer acts as the basis for various vulnerabilities that target sensitive structures, such as Page Table Entries (PTEs) or opcodes, to grant control over a victim machine. However, in this paper, we consider Rowhammer as a more general vulnerability, presenting a novel exploit vector for Rowhammer that targets particular code patterns. We show that if victim code is designed to return benign data to an unprivileged user, and uses nested pointer dereferences, Rowhammer can flip these pointers to gain arbitrary read access in the victim's address space. Furthermore, we identify gadgets present in the Linux kernel, and demonstrate an end-to-end attack that precisely flips a targeted pointer. To do so we developed a number of improved Rowhammer primitives, including kernel memory massaging, Rowhammer synchronization, and testing for kernel flips, which may be of broader interest to the Rowhammer community. Compared to prior works' leakage rate of .3 bits/s, we show that such gadgets can be used to read out kernel data at a rate of 82.6 bits/s. By targeting code gadgets, this work expands the scope and attack surface exposed by Rowhammer. It is no longer sufficient for software defenses to selectively pad previously exploited memory structures in flip-safe memory, as any victim code that follows the pattern in question must be protected. 
    more » « less
  2. Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. With hardware based transactions, the protection of shared data accesses and updates can be evaluated at runtime so that only true collisions to shared data force serialization. This paper explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, we explore the application of Intel’s hardware-based transactional memory (TSX) to manage shared access to the pending event set by the simulation threads. Comparison between conventional locking mechanisms and transactional memory access is performed to evaluate each within the warped Time Warp synchronized parallel simulation kernel. In this testing, evaluation of both forms of transactional memory found in the Intel Haswell processor, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), are evaluated. The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, in some cases as much as 27%. 
    more » « less
  3. Deeply embedded systems powered by microcontrollers are becoming popular with the emergence of Internet-of-Things (IoT) technology. However, these devices primarily run C/C\({+}{+}\)code and are susceptible to memory bugs, which can potentially lead to both control data attacks and non-control data attacks. Existing defense mechanisms (such as control-flow integrity (CFI), dataflow integrity (DFI) and write integrity testing (WIT), etc.) consume a massive amount of resources, making them less practical in real products. To make it lightweight, we design a bitmap-based allowlist mechanism to unify the storage of the runtime data for protecting both control data and non-control data. The memory requirements are constant and small, regardless of the number of deployed defense mechanisms. We store the allowlist in the TrustZone to ensure its integrity and confidentiality. Meanwhile, we perform an offline analysis to detect potential collisions and make corresponding adjustments when it happens. We have implemented our idea on an ARM Cortex-M-based development board. Our evaluation results show a substantial reduction in memory consumption when deploying the proposed CFI and DFI mechanisms, without compromising runtime performance. Specifically, our prototype enforces CFI and DFI at a cost of just 2.09% performance overhead and 32.56% memory overhead on average.

     
    more » « less
  4. Garbage collection (GC) support for unmanaged languages can reduce programming burden in reasoning about liveness of dynamic objects. It also avoids temporal memory safety violations and memory leaks.SoundGC for weakly-typed languages such as C/C++, however, remains an unsolved problem. Current value-based GC solutions examine values of memory locations to discover the pointers, and the objects they point to. The approach is inherently unsound in the presence of arbitrary type casts and pointer manipulations, which are legal in C/C++. Such language features are regularly used, especially in low-level systems code.

    In this paper, we propose Dynamic Pointer Provenance Tracking to realize sound GC. We observe that pointers cannot be created out-of-thin-air, and they must have provenance to at least one valid allocation. Therefore, by tracking pointer provenance from the source (e.g., malloc) through both explicit data-flow and implicit control-flow, our GC has sound and precise information to compute the set of all reachable objects at any program state. We discuss several static analysis optimizations, that can be employed during compilation aided with profiling, to significantly reduce the overhead of dynamic provenance tracking from nearly 8× to 16% for well-behaved programs that adhere to the C standards. Pointer provenance based sound GC invocation is also 13% faster and reclaims 6% more memory on average, compared to an unsound value-based GC.

     
    more » « less
  5. Abstract—Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This causes a high overhead both for multi-threading due to fine-grain synchronization, and for vectorization due to the nonrectangular iteration domains. CPUs, DSPs, and GPUs perform order-of-magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first extend the traditional dataflow model with first class primitives for inductive dependences and memory access patterns (streams). Second, we develop a hybrid spatial architecture combining systolic and dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally-provisioned DSPs by 4.6×—37×. Compared to state-of-the-art spatial architectures, REVEL is mean 3.4× faster. Compared to a set of ASICs, REVEL is only 2× the power and half the area. 
    more » « less