skip to main content


Title: CBMM: Financial Advice for Kernel Memory Managers
First-party datacenter workloads present new challenges to kernel memory management (MM), which allocates and maps memory and must balance competing performance concerns in an increasingly complex environment. In a datacenter, performance must be both good and consistent to satisfy service-level objectives. Unfortunately, current MM designs often exhibit inconsistent, opaque behavior that is difficult to reproduce, decipher, or fix, stemming from (1) a lack of high-quality information for policymaking, (2) the cost-unawareness of many current MM policies, and (3) opaque and distributed policy implementations that are hard to debug. For example, the Linux huge page implementation is distributed across 8 files and can lead to page fault latencies in the 100s of ms. In search of a MM design that has consistent behavior, we designed Cost-Benefit MM (CBMM), which uses empirically based cost-benefit models and pre-aggregated profiling information to make MM policy decisions. In CBMM, policy decisions follow the guiding principle that userspace benefits must outweigh userspace costs. This approach balances the performance benefits obtained by a kernel policy against the cost of applying it. CBMM has competitive performance with Linux and HawkEye, a recent research system, for all the workloads we ran, and in the presence of fragmentation, CBMM is 35% faster than Linux on average. Meanwhile, CBMM nearly always has better tail latency than Linux or HawkEye, particularly on fragmented systems. It reduces the cost of the most expensive soft page faults by 2-3 orders of magnitude for most of our workloads, and reduces the frequency of 10-1000 us-long faults by around 2 orders of magnitude for multiple workloads.  more » « less
Award ID(s):
1815656
NSF-PAR ID:
10380347
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the USENIX Annual Technical Conference
Page Range / eLocation ID:
593-608
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. First-party datacenter workloads present new challenges to kernel memory management (MM), which allocates and maps memory and must balance competing performance concerns in an increasingly complex environment. In a datacenter, performance must be both good \textit{and} consistent to satisfy service-level objectives. Unfortunately, current MM designs often exhibit inconsistent, opaque behavior that is difficult to reproduce, decipher, or fix, stemming from (1) a lack of high-quality information for policymaking, (2) the cost-unawareness of many current MM policies, and (3) opaque and distributed policy implementations that are hard to debug. For example, the Linux huge page implementation is distributed across 8 files and can lead to page fault latencies in the 100s of ms. In search of a MM design that has consistent behavior, we designed Cost-Benefit MM (CBMM), which uses empirically based cost-benefit models and pre-aggregated profiling information to make MM policy decisions. In CBMM, policy decisions follow the guiding principle that \textit{userspace benefits must outweigh userspace costs}. This approach balances the performance benefits obtained by a kernel policy against the cost of applying it. CBMM has competitive performance with Linux and HawkEye, a recent research system, for all the workloads we ran, and in the presence of fragmentation, CBMM is 35% faster than Linux on average. Meanwhile, CBMM nearly always has better tail latency than Linux or HawkEye, particularly on fragmented systems. It reduces the cost of the most expensive soft page faults by 2-3 orders of magnitude for most of our workloads, and reduces the frequency of 10-1000 mu s-long faults by around 2 orders of magnitude for multiple workloads. 
    more » « less
  2. Modern memory hierarchies are increasingly complex, with more memory types and richer topologies. Unfortunately kernel memory managers lack the extensibility that many other parts of the kernel use to support diversity. This makes it difficult to add and deploy support for new memory configurations, such as tiered memory: engineers must navigate and modify the monolithic memory management code to add support, and custom kernels are needed to deploy such support until it is upstreamed. We take inspiration from filesystems and note that VFS, the extensible interface for filesystems, supports a huge variety of filesystems for different media and different use cases, and importantly, has interfaces for memory management operations such as controlling virtual-to-physical mapping and handling page faults. We propose writing memory management systems as filesystems using VFS, bringing extensibility to kernel memory management. We call this idea File-Based Memory Management (FBMM). Using this approach, many recent memory management extensions, e.g., tiering support, can be written without modifying existing memory management code. We prototype FBMM in Linux to show that the overhead of extensibility is low (within 1.6%) and that it enables useful extensions. 
    more » « less
  3. null (Ed.)
    Extended Berkeley Packet Filter (BPF) has emerged as a powerful method to extend packet-processing functionality in the Linux operating system. BPF allows users to write code in high-level languages (like C or Rust) and execute them at specific hooks in the kernel, such as the network device driver. To ensure safe execution of a user-developed BPF program in kernel context, Linux uses an in-kernel static checker. The checker allows a program to execute only if it can prove that the program is crash-free, always accesses memory within safe bounds, and avoids leaking kernel data. BPF programming is not easy. One, even modest-sized BPF programs are deemed too large to analyze and rejected by the kernel checker. Two, the kernel checker may incorrectly determine that a BPF program exhibits unsafe behaviors. Three, even small performance optimizations to BPF code (e.g., 5% gains) must be meticulously hand-crafted by expert developers. Traditional optimizing compilers for BPF are often inadequate since the kernel checker's safety constraints are incompatible with rule-based optimizations. We present K2, a program-synthesis-based compiler that automatically optimizes BPF bytecode with formal correctness and safety guarantees. K2 produces code with 6--26% reduced size, 1.36%--55.03% lower average packet-processing latency, and 0--4.75% higher throughput (packets per second per core) relative to the best clang-compiled program, across benchmarks drawn from Cilium, Facebook, and the Linux kernel. K2 incorporates several domain-specific techniques to make synthesis practical by accelerating equivalence-checking of BPF programs by 6 orders of magnitude. 
    more » « less
  4. null (Ed.)
    Traditional end-host network stacks are struggling to keep up with rapidly increasing datacenter access link bandwidths due to their unsustainable CPU overheads. Motivated by this, our community is exploring a multitude of solutions for future network stacks: from Linux kernel optimizations to partial hardware o!oad to clean-slate userspace stacks to specialized host network hardware. The design space explored by these solutions would bene"t from a detailed understanding of CPU ine#ciencies in existing network stacks. This paper presents measurement and insights for Linux kernel network stack performance for 100Gbps access link bandwidths. Our study reveals that such high bandwidth links, coupled with relatively stagnant technology trends for other host resources (e.g., core speeds and count, cache sizes, NIC bu$er sizes, etc.), mark a fundamental shift in host network stack bottlenecks. For instance, we "nd that a single core is no longer able to process packets at line rate, with data copy from kernel to application bu$ers at the receiver becoming the core performance bottleneck. In addition, increase in bandwidth-delay products have outpaced the increase in cache sizes, resulting in ine#cient DMA pipeline between the NIC and the CPU. Finally, we "nd that traditional loosely-coupled design of network stack and CPU schedulers in existing operating systems becomes a limiting factor in scaling network stack performance across cores. Based on insights from our study, we discuss implications to design of future operating systems, network protocols, and host hardware. 
    more » « less
  5. null (Ed.)
    Memory approximation techniques are commonly limited in scope, targeting individual levels of the memory hierarchy. Existing approximation techniques for a full memory hierarchy determine optimal configurations at design-time provided a goal and application. Such policies are rigid: they cannot adapt to unknown workloads and must be redesigned for different memory configurations and technologies. We propose SEAMS: the first self-optimizing runtime manager for coordinating configurable approximation knobs across all levels of the memory hierarchy. SEAMS continuously updates and optimizes its approximation management policy throughout runtime for diverse workloads. SEAMS optimizes the approximate memory configuration to minimize energy consumption without compromising the quality threshold specified by application developers. SEAMS can (1) learn a policy at runtime to manage variable application quality of service ( QoS ) constraints, (2) automatically optimize for a target metric within those constraints, and (3) coordinate runtime decisions for interdependent knobs and subsystems. We demonstrate SEAMS’ ability to efficiently provide functions (1)–(3) on a RISC-V Linux platform with approximate memory segments in the on-chip cache and main memory. We demonstrate SEAMS’ ability to save up to 37% energy in the memory subsystem without any design-time overhead. We show SEAMS’ ability to reduce QoS violations by 75% with < 5% additional energy. 
    more » « less