Title: CARAT CAKE: replacing paging via compiler/kernel cooperation
Virtual memory, specifically paging, is undergoing significant innovation due to being challenged by new demands from modern workloads. Recent work has demonstrated an alternative software-only design that can result in simplified hardware requirements, even supporting purely physical addressing. While we have made the case for this Compiler- And Runtime-based Address Translation (CARAT) concept, its evaluation was based on a user-level prototype. We now report on incorporating CARAT into a kernel, forming Compiler- And Runtime-based Address Translation for CollAborative Kernel Environments (CARAT CAKE). In our implementation, a Linux-compatible x64 process abstraction can be based either on CARAT CAKE, or on a sophisticated paging implementation. Implementing CARAT CAKE involves kernel changes and compiler optimizations/transformations that must work on all code in the system, including kernel code. We evaluate CARAT CAKE in comparison with paging and find that CARAT CAKE is able to achieve the functionality of paging (protection, mapping, and movement properties) with minimal overhead. In turn, CARAT CAKE allows significant new benefits for systems including energy savings, larger L1 caches, and arbitrary granularity memory management.
Award ID(s):
1908488
PAR ID:
10344722
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Page Range / eLocation ID:
98 to 114
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. As specialized hardware accelerators such as GPUs become increasingly popular, developers are looking for ways to target these platforms with high-level APIs. One promising approach is kernel libraries such as PyTorch or cuML, which provide interfaces that mirror CPU-only counterparts such as NumPy or Scikit-Learn. Unfortunately, these libraries are hard to develop and to adopt incrementally: they only support a subset of their CPU equivalents, only work with datasets that fit in device memory, and require developers to reason about data placement and transfers manually. To address these shortcomings, we present a new approach called offload annotations (OAs) that enables heterogeneous GPU computing in existing workloads with few or no code modifications. An annotator annotates the types and functions in a CPU library with equivalent kernel library functions and provides an offloading API to specify how the inputs and outputs of the function can be partitioned into chunks that fit in device memory and transferred between devices. A runtime then maps existing CPU functions to equivalent GPU kernels and schedules execution, data transfers and paging. In data science workloads using CPU libraries such as NumPy and Pandas, OAs enable speedups of up to 1200× and a median speedup of 6.3× by transparently offloading functions to a GPU using existing kernel libraries. In many cases, OAs match the performance of handwritten heterogeneous implementations. Finally, OAs can automatically page data in these workloads to scale to datasets larger than GPU memory, which would need to be done manually with most current GPU libraries.
  2. The classical paging problem, introduced by Sleator and Tarjan in 1985, formalizes the problem of caching pages in RAM in order to minimize IOs. Their online formulation ignores the cost of address translation: programs refer to data via virtual addresses, and these must be translated into physical locations in RAM. Although the cost of an individual address translation is much smaller than that of an IO, every memory access involves an address translation, whereas IOs can be infrequent. In practice, one can spend money to avoid paging by over-provisioning RAM; in contrast, address translation is effectively unavoidable. Thus address-translation costs can sometimes dominate paging costs, and systems must simultaneously optimize both. To mitigate the cost of address translation, all modern CPUs have translation lookaside buffers (TLBs), which are hardware caches of common address translations. What makes TLBs interesting is that a single TLB entry can potentially encode the address translation for many addresses. This is typically achieved via the use of huge pages, which translate runs of contiguous virtual addresses to runs of contiguous physical addresses. Huge pages reduce TLB misses at the cost of increasing the IOs needed to maintain contiguity in RAM. This tradeoff between TLB misses and IOs suggests that the classical paging problem does not tell the full story. This paper introduces the Address-Translation Problem, which formalizes the problem of maintaining a TLB, a page table, and RAM in order to minimize the total cost of both TLB misses and IOs. We present an algorithm that achieves the benefits of huge pages for TLB misses without the downsides of huge pages for IOs.
  3. In current operating system kernels and run-time systems, timing is based on hardware timer interrupts, introducing inherent overheads that limit granularity. For example, the scheduling quantum of preemptive threads is limited, resulting in this abstraction being restricted to coarse-grain parallelism. Compiler-based timing replaces interrupts from the hardware timer with callbacks from compiler-injected code. We describe a system that achieves low-overhead timing using whole-program compiler transformations and optimizations combined with kernel and run-time support. A key novelty is new static analyses that achieve predictable, periodic run-time behavior from the transformed code, regardless of control-flow path. We transform the code of a kernel and run-time system to use compiler-based timing and leverage the resulting fine-grain timing to extend an implementation of fibers (cooperatively scheduled threads), attaining what is effectively preemptive scheduling. The result combines the fine granularity of the cooperative fiber model with the ease of programming of the preemptive thread model.
  4. Wasm is gaining popularity outside the Web as a well-specified low-level binary format with ISA portability, low memory footprint and polyglot targetability, enabling efficient in-process sandboxing of untrusted code. Despite these advantages, Wasm adoption for new domains is often hindered by the lack of many standard system interfaces, which precludes reusability of existing software and slows ecosystem growth. This paper proposes thin kernel interfaces for Wasm, which directly expose OS userspace syscalls without breaking intra-process sandboxing, enabling a new class of virtualization with Wasm as a universal binary format. By virtualizing the bottom layer of userspace, kernel interfaces enable effortless application ISA portability and compiler backend reusability, and armor programs with Wasm’s built-in control flow integrity and arbitrary code execution protection. Furthermore, existing capability-based APIs for Wasm, such as WASI, can be implemented as a Wasm module over kernel interfaces, improving reuse, robustness, and portability through better layering. We present an implementation of this concept for two kernels – Linux and Zephyr – by extending a modern Wasm engine and evaluate our system’s performance on a number of sophisticated applications which can run for the first time on Wasm.
  5. In modern runtime systems, memory layout calculations are hand-coded in systems languages. Primitives in these languages are not powerful enough to describe a rich set of layouts, leading to reliance on ad-hoc macros, numerous interrelated static constants, and other boilerplate code. Memory management policies must also carefully orchestrate their application of address calculations in order to modify memory cooperatively, a task ill-suited to the low-level systems languages at hand, which lack proper safety mechanisms. In this paper we introduce Floorplan, a declarative language for specifying high-level memory layouts. Constraints formerly implemented by describing how to compute locations are, in Floorplan, defined declaratively using explicit layout constructs. The challenge here was to discover constructs capable of sufficiently enabling the automatic generation of address calculations. Floorplan is implemented as a compiler for generating a Rust library. In a case study of an existing implementation of the immix garbage collection algorithm, Floorplan eliminates 55 out of the 63 unsafe lines of code: 100% of unsafe lines pertaining to memory safety.