

Title: Adaptive Placement for In-memory Storage Functions
Fast networks and the desire for high resource utilization in data centers and the cloud have driven disaggregation. Application compute is separated from storage, but this leads to high overheads when data must move over the network even for simple operations. Alternatively, systems could let applications run their logic within storage via user-defined functions; unfortunately, this ties the provisioning and utilization of storage and compute resources together again. We present a new approach to executing storage-level functions in an in-memory key-value store that avoids this problem by dynamically deciding where to execute functions over data. Users write storage functions that are logically decoupled from storage, but storage servers choose where to run invocations of these functions physically. Using a server-internal cost model and observations of function execution, servers run inexpensive functions directly while preferring to execute functions with high CPU cost at client machines. We show that with this approach storage servers can reduce network request processing costs, avoid server compute bottlenecks, and improve aggregate storage system throughput. We realize our approach in an in-memory key-value store that executes 3.2 million strict serializable user-defined storage functions per second with 100 µs response times. When running a mix of logic from different applications, it provides throughput better than running that logic purely at storage servers (85% more) or purely at clients (10% more). For our workloads, it also reduces latency (up to 2x) and transactional aborts (up to 33%) compared to pure client-side execution.
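The placement decision can be illustrated with a small sketch. The moving-average estimator, the thresholds, and all names below are hypothetical stand-ins rather than the paper's actual cost model: the server tracks each function's observed CPU time and keeps an invocation local only when that cost stays below a rough estimate of shipping the data to the client.

```rust
// Minimal sketch of a cost-model-driven placement decision. Hypothetical
// names and constants; not the paper's implementation.
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Placement {
    Server, // run the function at the storage server
    Client, // return the records and let the client run the function
}

struct FunctionStats {
    avg_cpu_ns: f64, // moving average of observed per-invocation CPU time
}

struct Placer {
    stats: HashMap<String, FunctionStats>,
    // Rough server-side cost of shipping the touched records to the client
    // (serialization plus extra network processing), in ns. Hypothetical.
    ship_cost_ns: f64,
    alpha: f64, // weight given to each new observation
}

impl Placer {
    fn new(ship_cost_ns: f64) -> Self {
        Placer { stats: HashMap::new(), ship_cost_ns, alpha: 0.2 }
    }

    /// Record the CPU time of a completed server-side invocation.
    fn observe(&mut self, func: &str, cpu_ns: f64) {
        let alpha = self.alpha;
        let entry = self
            .stats
            .entry(func.to_string())
            .or_insert(FunctionStats { avg_cpu_ns: cpu_ns });
        entry.avg_cpu_ns = alpha * cpu_ns + (1.0 - alpha) * entry.avg_cpu_ns;
    }

    /// Decide where the next invocation of `func` should run.
    fn place(&self, func: &str) -> Placement {
        match self.stats.get(func) {
            // Unknown functions run at the server first so they can be profiled.
            None => Placement::Server,
            Some(s) if s.avg_cpu_ns <= self.ship_cost_ns => Placement::Server,
            Some(_) => Placement::Client,
        }
    }
}

fn main() {
    let mut placer = Placer::new(2_000.0); // pretend shipping data costs ~2 µs
    placer.observe("get_field", 300.0);        // cheap: keep it at the server
    placer.observe("scan_aggregate", 50_000.0); // expensive: push to the client
    assert_eq!(placer.place("get_field"), Placement::Server);
    assert_eq!(placer.place("scan_aggregate"), Placement::Client);
    println!("placement chosen from observed per-function CPU cost");
}
```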
Award ID(s):
1750558
NSF-PAR ID:
10224443
Author(s) / Creator(s):
Date Published:
Journal Name:
2020 USENIX Annual Technical Conference
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Memcached is a widely used key-value store. It is structured as a multithreaded user-level server, accessed over socket connections by a potentially distributed collection of clients. Because socket communication is so much more expensive than a single operation on a K-V store, much of the client library is devoted to batching of requests. Batching is not always feasible, however, and the cost of communication seems particularly unfortunate when, as is often the case, clients are co-located on a single machine with the server and have access to the same physical memory. Fortunately, recent work on _protected libraries_ has shown that it is possible, on current Intel processors, to amplify access rights quickly when calling into a specially configured user-level library. Library instances in separate processes can then share data safely, even in the face of independent process failures. We have used protected libraries to implement a new version of memcached in which client threads execute the code of the server themselves, without the need to send messages. Compared to the original, our new version is both significantly simpler, containing 24% less code, and dramatically faster, with an 11–56× reduction in latency and a roughly 2× increase in throughput.
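As a rough illustration of the call-path change described above, the sketch below replaces the socket round trip with a direct call into shared store code. It is conceptual only: the real system relies on hardware-assisted rights amplification and crash-safe sharing across processes, which ordinary Rust cannot express, and all names here are hypothetical.

```rust
// Conceptual sketch: co-located "clients" call store code directly instead of
// marshalling requests over a socket. Hypothetical names; not the real system.
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

struct Store {
    map: Mutex<HashMap<String, String>>,
}

impl Store {
    fn new() -> Self {
        Store { map: Mutex::new(HashMap::new()) }
    }

    // In the protected-library design, client threads invoke routines like
    // these directly; no request is serialized, sent over a socket, and parsed.
    fn set(&self, key: &str, value: &str) {
        self.map.lock().unwrap().insert(key.to_string(), value.to_string());
    }

    fn get(&self, key: &str) -> Option<String> {
        self.map.lock().unwrap().get(key).cloned()
    }
}

fn main() {
    let store = Arc::new(Store::new());

    // "Client" threads share the store and execute its code themselves, the
    // way co-located memcached clients call into the protected library.
    let writer = {
        let store = Arc::clone(&store);
        std::thread::spawn(move || store.set("user:42", "alice"))
    };
    writer.join().unwrap();

    assert_eq!(store.get("user:42").as_deref(), Some("alice"));
    println!("request served by a direct library call, no socket round trip");
}
```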
  2. In-memory key-value stores that use kernel-bypass networking serve millions of operations per second per machine with microseconds of latency. They are fast in part because they are simple, but their simple interfaces force applications to move data across the network. This is inefficient for operations that aggregate over large amounts of data, and it causes delays when traversing complex data structures. Ideally, applications could push small functions to storage to avoid round trips and data movement; however, pushing code to these fast systems is challenging. Any extra complexity for interpreting or isolating code cuts into their latency and throughput benefits. We present Splinter, a low-latency key-value store that clients extend by pushing code to it. Splinter is designed for modern multi-tenant data centers; it allows mutually distrusting tenants to write their own fine-grained extensions and push them to the store at runtime. The core of Splinter’s design relies on type- and memory-safe extension code to avoid conventional hardware isolation costs. This still allows for bare-metal execution, avoids data copying across trust boundaries, and makes it practical to run granular storage functions that perform less than a microsecond of compute. Our measurements show that Splinter can process 3.5 million remote extension invocations per second with a median round-trip latency of less than 9 μs at densities of more than 1,000 tenants per server. We provide an implementation of Facebook’s TAO as an 800-line extension that, when pushed to a Splinter server, improves performance by 400 Kop/s, reaching 3.2 Mop/s over online graph data with 30 μs remote access times.
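The sketch below shows the flavor of extension described above: tenant logic that runs against a narrow get interface at the store so that only a small result crosses the network. The trait and function names are hypothetical and are not Splinter's actual extension API.

```rust
// Sketch of a pushed-down storage extension behind a narrow, safe interface.
// Hypothetical trait and names; not Splinter's API.
use std::collections::HashMap;

/// Narrow interface the store exposes to extension code.
trait KvInterface {
    fn get(&self, key: &[u8]) -> Option<&[u8]>;
}

/// A tenant-supplied extension: sums 8-byte counters stored under many keys
/// so only the final sum, not every value, is returned to the client.
fn sum_counters(db: &dyn KvInterface, keys: &[Vec<u8>]) -> u64 {
    keys.iter()
        .filter_map(|k| db.get(k))
        .filter_map(|v| {
            if v.len() == 8 {
                let mut buf = [0u8; 8];
                buf.copy_from_slice(v);
                Some(u64::from_le_bytes(buf))
            } else {
                None // skip malformed values
            }
        })
        .sum()
}

struct InMemory {
    map: HashMap<Vec<u8>, Vec<u8>>,
}

impl KvInterface for InMemory {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.map.get(key).map(|v| v.as_slice())
    }
}

fn main() {
    let mut map = HashMap::new();
    map.insert(b"hits:a".to_vec(), 7u64.to_le_bytes().to_vec());
    map.insert(b"hits:b".to_vec(), 5u64.to_le_bytes().to_vec());
    let db = InMemory { map };

    let keys = vec![b"hits:a".to_vec(), b"hits:b".to_vec()];
    // The store would run the tenant's function in place of many GET RPCs.
    assert_eq!(sum_counters(&db, &keys), 12);
    println!("extension aggregated at the store; one small result returned");
}
```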
  3. Multi-user oblivious storage allows users to access their shared data on the cloud while retaining access pattern obliviousness and data confidentiality simultaneously. Most secure and efficient oblivious storage systems focus on utilizing the maximum network bandwidth to serve concurrent accesses via a trusted proxy. However, since the proxy executes a standard ORAM protocol over the network, performance is capped by network bandwidth and latency. Moreover, some important features such as access control and security against active adversaries have not been thoroughly explored in such proxy settings. In this paper, we propose MOSE, a multi-user oblivious storage system that is efficient and provides several desirable security properties. Our main idea is to harness a secure enclave, namely Intel SGX, residing on the untrusted storage server to execute the proxy logic, thereby minimizing the network bottleneck of proxy-based designs. In this regard, we address various technical design challenges such as memory constraints, side-channel attacks, and scalability issues when enabling proxy logic in the secure enclave. We present a formal security model and analysis for secure-enclave multi-user ORAM with access control. We optimize MOSE to boost its throughput in serving concurrent requests. We implemented MOSE and evaluated its performance on commodity hardware. Our evaluation confirms the efficiency of MOSE: it achieves approximately two orders of magnitude higher throughput than the state-of-the-art proxy-based design, and its performance scales with the available system resources.
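A very rough sketch of the placement idea follows: the proxy that drives the oblivious accesses sits on the storage server (inside an enclave in MOSE), so each client request is served by local block reads rather than ORAM rounds over the network. The ORAM logic itself is elided, and every name below is a hypothetical stand-in rather than MOSE's design.

```rust
// Conceptual sketch of an enclave-resident proxy serving requests against
// co-located block storage. Hypothetical names; ORAM reshuffling elided.
use std::collections::HashMap;

struct BlockStore {
    blocks: HashMap<u64, Vec<u8>>, // untrusted storage, co-located with the proxy
}

struct EnclaveProxy {
    position_map: HashMap<String, u64>, // key -> current block id, kept in the enclave
}

impl EnclaveProxy {
    /// Serve one client request. A real ORAM would read a whole path of blocks
    /// and reshuffle; here we only show that the accesses stay server-local.
    fn access(&self, store: &BlockStore, key: &str) -> Option<Vec<u8>> {
        let block_id = self.position_map.get(key)?;
        store.blocks.get(block_id).cloned()
    }
}

fn main() {
    let store = BlockStore {
        blocks: HashMap::from([(3u64, b"secret".to_vec())]),
    };
    let proxy = EnclaveProxy {
        position_map: HashMap::from([("doc".to_string(), 3u64)]),
    };
    assert_eq!(proxy.access(&store, "doc"), Some(b"secret".to_vec()));
    println!("request served by server-local accesses, not per-block network rounds");
}
```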
  4. We present POSH, a framework that accelerates shell applications with I/O-heavy components, such as data analytics with command-line utilities. Remote storage such as networked filesystems can severely limit the performance of these applications: data makes a round trip over the network for relatively little computation at the client. Reducing the data movement by moving the code to the data can improve performance. POSH automatically optimizes unmodified I/O-intensive shell applications running over remote storage by offloading the I/O-intensive portions to proxy servers closer to the data. A proxy can run directly on a storage server, or on a machine closer to the storage layer than the client. POSH intercepts shell pipelines and uses metadata called annotations to decide where to run each command within the pipeline. We address three principal challenges that arise: an annotation language that allows POSH to understand which files a command will access, a scheduling algorithm that places commands to minimize data movement, and a system runtime to execute a distributed schedule but retain local semantics. We benchmark POSH on real shell pipelines such as image processing, network security analysis, log analysis, distributed system debugging, and git. We find that POSH provides speedups ranging from 1.6× to 15× compared to NFS, without requiring any modifications to the applications.
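The sketch below illustrates the kind of placement rule such a scheduler might apply: given annotations for how much data a command reads from remote storage and how much it emits, run it at a proxy near the data when that moves fewer bytes over the network. The annotation fields and the rule are hypothetical, not POSH's actual annotation language or scheduler.

```rust
// Toy placement rule over hypothetical command annotations; not POSH's scheduler.

#[derive(Debug, PartialEq)]
enum Site {
    Client,
    Proxy, // machine next to (or on) the storage server
}

struct CommandAnnotation {
    name: &'static str,
    remote_bytes_read: u64, // bytes the command reads from remote files
    bytes_emitted: u64,     // bytes it writes to its output
}

/// Place a command where it moves fewer bytes across the network.
fn place(cmd: &CommandAnnotation) -> Site {
    if cmd.remote_bytes_read > cmd.bytes_emitted {
        Site::Proxy // ship the small output instead of the large input
    } else {
        Site::Client
    }
}

fn main() {
    // e.g. `grep ERROR big.log` over NFS: reads gigabytes, emits a few lines.
    let grep = CommandAnnotation {
        name: "grep",
        remote_bytes_read: 4_000_000_000,
        bytes_emitted: 20_000,
    };
    // e.g. a command that generates data locally: nothing to gain from the proxy.
    let local = CommandAnnotation {
        name: "local-gen",
        remote_bytes_read: 0,
        bytes_emitted: 50_000,
    };

    assert_eq!(place(&grep), Site::Proxy);
    assert_eq!(place(&local), Site::Client);
    println!("{} -> proxy, {} -> client", grep.name, local.name);
}
```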
  5. On large-scale high performance computing (HPC) systems, applications are provisioned with aggregated resources to meet their peak demands for brief periods. This results in resource underutilization because application requirements vary widely during execution. The problem is particularly pronounced for deep learning applications running on leadership HPC systems with a large pool of burst buffers in the form of flash or non-volatile memory (NVM) devices. In this paper, we examine the I/O patterns of deep neural networks and reveal their critical need to load many small samples randomly for successful training. We have designed a specialized Deep Learning File System (DLFS) that provides a thin set of APIs. In particular, we design the metadata management of DLFS around an in-memory, tree-based sample directory, and its file services around the user-level SPDK framework, which can disaggregate the capabilities of NVM Express (NVMe) devices across parallel training tasks. Our experimental results show that DLFS can dramatically improve training throughput for deep neural networks on NVMe over Fabrics, compared with the kernel-based Ext4 file system. Furthermore, DLFS achieves efficient user-level storage disaggregation with very little CPU utilization.
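The metadata idea can be sketched as an in-memory, tree-based directory that maps sample ids to their locations in packed storage, so training can fetch many small samples in random order with cheap lookups. The structures and names below are hypothetical, not DLFS's actual implementation.

```rust
// Toy in-memory sample directory for random small-sample lookups.
// Hypothetical layout and names; not DLFS's data structures.
use std::collections::BTreeMap;

#[derive(Clone, Copy, Debug)]
struct SampleLocation {
    offset: u64, // byte offset inside the packed data blob or device
    len: u32,    // sample size in bytes
}

struct SampleDirectory {
    index: BTreeMap<u64, SampleLocation>, // sample id -> location, kept in memory
}

impl SampleDirectory {
    fn insert(&mut self, id: u64, loc: SampleLocation) {
        self.index.insert(id, loc);
    }

    /// Random access by sample id; in DLFS the resulting (offset, len) would
    /// feed a user-level NVMe read rather than a kernel file-system call.
    fn lookup(&self, id: u64) -> Option<SampleLocation> {
        self.index.get(&id).copied()
    }
}

fn main() {
    let mut dir = SampleDirectory { index: BTreeMap::new() };
    for id in 0..1_000u64 {
        dir.insert(id, SampleLocation { offset: id * 4096, len: 4096 });
    }
    // A training step samples ids in random order; here a fixed "shuffle".
    for id in [517u64, 3, 998] {
        let loc = dir.lookup(id).expect("sample present");
        assert_eq!(loc.offset, id * 4096);
    }
    println!("resolved random sample ids to storage locations in memory");
}
```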