<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>XRP: In-Kernel Storage Functions with eBPF</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>07/11/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10409936</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Yuhong Zhong</author><author>Haoyu Li</author><author>Yu Jian Wu</author><author>Ioannis Zarkadas</author><author>Jeffrey Tao</author><author>Evan Mesterhazy</author><author>Michael Makris</author><author>Junfeng Yang</author><author>Amy Tai</author><author>Ryan Stutsman</author><author>Asaf Cidon</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[With the emergence of microsecond-scale NVMe storage devices, the Linux kernel storage stack overhead has become significant, almost doubling access times. We present XRP, a framework that allows applications to execute user-defined storage functions, such as index lookups or aggregations, from an eBPF hook in the NVMe driver, safely bypassing most of the kernel’s storage stack. To preserve file system semantics, XRP propagates a small amount of kernel state to its NVMe driver hook where the user-registered eBPF functions are called. We show how two key-value stores, BPF-KV, a simple B+-tree key-value store, and WiredTiger, a popular log-structured merge tree storage engine, can leverage XRP to significantly improve throughput and latency.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>With the rise of new high performance memory technologies, such as 3D XPoint and low latency NAND, new NVMe storage devices can now achieve up to 7 GB/s bandwidth and latencies as low as 3 &#181;s [11, <ref type="bibr">19,</ref><ref type="bibr">24,</ref><ref type="bibr">26]</ref>. At such high performance, the kernel storage stack becomes a major source of overhead impeding both application-observed latency and IOPS. For the latest 3D XPoint devices, the kernel's storage stack doubles the I/O latency, and it incurs an even greater overhead for throughput ( &#167;2.1). As storage devices become even faster, the kernel's relative overhead is poised to worsen.</p><p>Existing approaches to tackle this problem tend to be radical, requiring intrusive application-level changes or new hardware. Complete kernel bypass through libraries such as SPDK <ref type="bibr">[82]</ref> allows applications to directly access underlying devices, but such libraries also force applications to implement their own file systems, to forgo isolation and safety, and to poll for I/O completion which wastes CPU cycles when I/O utilization is low. Others have shown that applications using SPDK suffer from high average and tail latencies and severely reduced throughput when the schedulable thread count exceeds the number of available cores <ref type="bibr">[54]</ref>; we confirm this in &#167;6, showing that in such cases applications indeed suffer a 3&#215; throughput loss with SPDK.</p><p>In contrast to these approaches, we seek a readilydeployable mechanism that can provide fast access to emerging fast storage devices that requires no specialized hardware and no significant changes to the application while working with existing kernels and file systems. To this end, we rely on BPF (Berkeley Packet Filter <ref type="bibr">[67,</ref><ref type="bibr">68]</ref>) which lets applications offload simple functions to the Linux kernel <ref type="bibr">[8]</ref>. Similar to kernel bypass, by embedding application-logic deep in the kernel stack, BPF can eliminate overheads associated with kernel-user crossings and the associated context switches. Unlike kernel bypass, BPF is an OS-supported mechanism that ensures isolation, does not lead to low utilization due to busywaiting, and allows a large number of threads or processes to share the same core, leading to better overall utilization.</p><p>The support of BPF in the Linux kernel makes it an attractive interface for allowing applications to speed up storage I/O. However, using BPF to speed up storage introduces several unique challenges. Unlike existing packet filtering and tracing use cases, where each BPF function can operate in a self-contained manner on a particular packet or system trace -for example, network packet headers specify which flow they below to -a storage BPF function may need to synchronize with other concurrent application-level operations or require multiple function calls to traverse a large on-disk data structure, a workload pattern we call "resubmission" of I/Os ( &#167;2.3). Unfortunately the state required for resubmission such as access-control information or metadata on how individual storage blocks fit in the larger data structure they belong to is not available at lower layers.</p><p>To tackle these challenges, we design and implement XRP (eXpress Resubmission Path), a high-performance storage data path using Linux eBPF. XRP is inspired by XDP, the recent efficient Linux eBPF networking hook <ref type="bibr">[28]</ref>. In order to maximize its performance benefit, XRP uses a hook in the NVMe driver's interrupt handler, thereby bypassing the kernel's block, file system and system call layers. This allows XRP to trigger BPF functions directly from the NVMe driver as each I/O completes, enabling quick resubmission of I/Os that traverse other blocks on the storage device.</p><p>The key challenge in XRP is that the low-level NVMe driver lacks the context that the higher levels provide. Those layers contain information such as who owns a block (file system layer), how to interpret the block's data, and how to traverse the on-disk data structure (application layer).</p><p>Our insight is that many storage-optimized data structures that power real-world databases <ref type="bibr">[10,</ref><ref type="bibr">12,</ref><ref type="bibr">20,</ref><ref type="bibr">27,</ref><ref type="bibr">44,</ref><ref type="bibr">66,</ref><ref type="bibr">70,</ref><ref type="bibr">80]</ref> -such as on-disk B-trees, log-structured merge trees, and log segments -are typically implemented on a small set of large files, and they are updated orders of magnitude less frequently than they are read; we validate this in &#167;3. Hence, we exclusively focus XRP on operations contained within one file and on data structures that have a fixed layout on disk. Consequently, the NVMe driver only requires a minimal amount of the file system mapping state, which we term the metadata digest; this information is small enough that it can be passed from the file system to the NVMe driver so it can safely perform I/O resubmissions. This allows XRP to safely support some of the most popular on-disk data structures.</p><p>We present a design and implementation of XRP on Linux, with support for ext4, which can easily be extended to other file systems. XRP enables the NVMe interrupt handler to resubmit storage I/Os based on user-defined BPF functions.</p><p>We augment two key-value stores with XRP: BPF-KV, a B + -tree based key-value store that is custom-designed for supporting BPF functions, and WiredTiger's log-structured merge tree, which is used as one of MongoDB's storage engines <ref type="bibr">[27]</ref>. With random 512 B object reads on BPF-KV with multiple threads using a B + -tree that has three index levels on disk, XRP has 47%-94% higher throughput and 20%-34% lower p99 latency than read(). XRP also enables more efficient sharing of cores among applications than kernel bypass: it is able to provide 56% better p99 latency than SPDK with two threads sharing the same core. In addition, XRP is able to consistently improve WiredTiger's performance by up to 24% under YCSB <ref type="bibr">[41]</ref>. We open source XRP and our changes to BPF-KV and WiredTiger at <ref type="url">https://github.com/xrp-project/XRP</ref>.</p><p>We make the following contributions.</p><p>1. New Datapath. XRP is the first datapath that enables the use of BPF to offload storage functions to the kernel. 2. Performance. XRP improves the throughput of a B-tree lookup by up to 2.5&#215; compared to normal system calls. 3. Utilization. XRP provides latencies that approach kernel bypass, but unlike kernel bypass, it allows cores to be efficiently shared by the same threads and processes. 4. Extensibility. XRP supports different storage use cases, including different data structures and storage operations (e.g., index traversals, range queries, aggregations).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and Motivation</head><p>In this section we show why the Linux kernel is becoming a primary bottleneck with fast NVMe devices and provide a  primer on BPF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1">Software is Now the Storage Bottleneck</head><p>New media like 3D Xpoint <ref type="bibr">[1]</ref> and low-latency NAND <ref type="bibr">[26]</ref>, have led to new NVMe storage devices that exhibit singledigit &#181;s latencies and millions of IOPS <ref type="bibr">[11,</ref><ref type="bibr">19,</ref><ref type="bibr">24,</ref><ref type="bibr">26]</ref>. The kernel storage stack is becoming a major performance bottleneck when accessing these devices. Figure <ref type="figure">1</ref> shows the percentage of time spent in the Linux stack when issuing a 512 B random read I/O on different storage devices. While the software overhead for the first generation of fast NVMe devices (first generation Intel Optane or Z-NAND) was nonnegligible (~15%), with the latest generation of devices (Intel Optane SSD P5800X) the software overhead accounts for about half of the latency of each read request. The kernel's relative overhead will only get worse as storage devices become even faster.</p><p>Where is the time going? Table <ref type="table">1</ref> shows the time spent in the different storage layers when issuing a random 512 B read with O_DIRECT on Optane P5800X. The experimental setup, which is used throughout this section, is a server with 6-core i5-8500 3 GHz with 16 GB of memory, using Ubuntu 20.04, and Linux 5.8.0. We also disable processor C-states and turbo boost, use the maximum performance governor, and disable KPTI <ref type="bibr">[30]</ref>. The experiment shows that the most expensive layer is the file system (ext4), followed by the block layer (bio) and the kernel crossing, and that the total software overhead accounts for 48.6% of the average latency.</p><p>Why not just bypass the kernel? One approach to eliminate kernel overhead is to bypass it altogether <ref type="bibr">[7,</ref><ref type="bibr">65,</ref><ref type="bibr">82,</ref><ref type="bibr">83]</ref>, leaving just the cost to post a request to the NVMe driver and the device's latency. However, kernel bypass is no panacea: each user is entrusted with full access to the device; they must also construct their own user space file systems <ref type="bibr">[73,</ref><ref type="bibr">74]</ref>. This means that there is no mechanism to enforce fine-grained  isolation or to share capacity among different applications accessing the same device. In addition, there is no efficient way for user space applications to receive interrupts on I/O completions, so applications must directly poll on device completion queues to obtain high performance. Consequently, when I/O is not the bottleneck, cores cannot be shared among processes, which results in significant under-utilization and wasted CPU. Furthermore, when more than one polling thread shares the same processor, the CPU contention between them coupled with the lack of synchronization lead all polling threads to experience degraded tail latency and significantly lower overall throughput. Recent work has highlighted this issue <ref type="bibr">[54]</ref> and we reproduce it in &#167;6.2.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">BPF Primer</head><p>BPF (Berkeley Packet Filter) is an interface that allows users to offload a simple function to be executed by the kernel. Linux's framework for BPF is called eBPF (extended BPF) <ref type="bibr">[23]</ref>. Linux eBPF is commonly used for filtering packets (e.g., TCPdump) <ref type="bibr">[5,</ref><ref type="bibr">6,</ref><ref type="bibr">28,</ref><ref type="bibr">52]</ref>, load balancing and packet forwarding <ref type="bibr">[5,</ref><ref type="bibr">18,</ref><ref type="bibr">25,</ref><ref type="bibr">60]</ref>, tracing [2,4,50], packet steering <ref type="bibr">[46]</ref>, network scheduling <ref type="bibr">[53,</ref><ref type="bibr">58]</ref> and network security checks <ref type="bibr">[15]</ref>. Functions are verified by the kernel at install-time to ensure they are safe; for example, they are checked to make sure they do not contain too many instructions, unbounded loops, or accesses to out-of-bounds memory addresses <ref type="bibr">[29]</ref>. After verification, which typically takes a few seconds or less, the eBPF functions can be called normally.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3">The Potential Benefit of BPF</head><p>BPF can be a mechanism for avoiding data movement between the kernel and user space in cases when a logical lookup requires a sequence of "  as in pointer-chasing workloads. For example, to traverse a B-tree index, a lookup at each level traverses the kernel's entire storage stack only to be thrown away by the application once it obtains the pointer to the next child node in the tree. Instead of a sequence of system calls from user space, each of the intermediate pointer lookups could be executed by a BPF function, which would parse the B-tree node to find the pointer to the relevant child node. The kernel would then submit the I/O to fetch the next node. Chaining a sequence of such BPF functions could avoid the cost of traversing kernel layers and moving data to user space.</p><p>Other popular on-disk data structures, such as logstructured merge trees (LSM trees) <ref type="bibr">[70]</ref>, also have such auxiliary pointer lookups which can be accelerated using BPF functions. Other types of operations that would benefit from such an approach include range queries, iterators, and other types of aggregations (e.g., obtain the maximum or average value in a range of key-value pairs). In all of these cases, only a single result or a small subset of the objects that might be accessed by the storage system ultimately need to be returned to the application.</p><p>The BPF function that resubmits (dispatches) I/O in auxiliary I/O workloads could be placed at any layer of the kernel. Figure <ref type="figure">2</ref> shows the I/O paths for both normal user space dispatch and for two possible locations of BPF resubmission hooks: in the syscall layer and in the NVMe driver. Zhong et al. <ref type="bibr">[85]</ref> compared the performance improvement from a resubmission hook in both locations on workloads with auxiliary I/O by measuring the speedup of lookup queries on an on-disk B-tree of depth 10. The baseline for comparison is reading I/O through the read system call. Table <ref type="table">2</ref> summarizes the results.</p><p>Best Case Acceleration. Dispatching the I/O requests from the NVMe driver provides a significant latency reduction (up to 49%) and corresponding speedup (up to 2.5&#215;), since it bypasses almost the entire kernel software stack. On the other hand, as expected, issuing the BPF functions from the syscall dispatch layer only provides a maximum speedup of 1.25&#215;, since the requests only benefit from eliminating kernel boundary crossings, which only account for 5-6% of the kernel overhead (Table <ref type="table">1</ref>). After reaching CPU saturation, the computation savings of reissuing the submissions from the NVMe driver translate into throughput improvements of 1.8-2.5&#215;, depending on the number of threads in the workload <ref type="bibr">[85]</ref>.  Placing an eBPF hook anywhere in the kernel may improve throughput between 1.2-2.5&#215;. However, pushing the I/O dispatching as close as possible to the storage device dramatically improves the performance of a traversal. Hence to obtain the highest possible speedup, XRP's resubmission hook should reside in the NVMe driver.</p><p>What about io_uring? io_uring is a new Linux system call framework <ref type="bibr">[9]</ref> that allows processes to submit batches of asynchronous I/O requests, which amortizes user-kernel crossings. However, each I/O submitted with io_uring still passes through all the layers shown in Table <ref type="table">1</ref>, so each individual I/O still incurs the full storage stack overhead. In fact, BPF I/O resubmissions are largely complementary to io_uring: io_uring can efficiently submit batches of I/Os that trigger different I/O chains managed by BPF in the kernel.</p><p>Figure <ref type="figure">3</ref> shows throughput improvements when using io_uring with a BPF hook in the NVMe driver. I/O Chain Length denotes the total number of I/Os, including the initial I/O and the resubmitted I/Os. Figure <ref type="figure">3</ref> shows that BPF can increase throughput with respect to io_uring by up to 1.5&#215; for small batch sizes and up to 2.5&#215; as batch sizes increase.</p><p>In summary, BPF can benefit both legacy read and io_uring system calls. By placing the hook in the kernel NVMe driver, BPF may increase throughput of both legacy I/O and single-threaded io_uring by up to 2.5&#215;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Design Challenges and Principles</head><p>As shown in the previous section, I/O resubmission must occur as close to the device as possible in order to reap the greatest benefits. In the NVMe software stack, this is the NVMe interrupt handler. However, executing the resubmissions from within the NVMe interrupt handler, which lacks the context of the file system layer, introduces two major challenges.</p><p>Challenge 1: address translation and security. The NVMe driver has no access to file system metadata. In the example of an index traversal, XRP issues a read I/O to a particular block and executes a BPF function that would extract the offset of the next block it would like to query. However, this offset is meaningless to the NVMe layer, since it cannot tell which physical block the offset corresponds to without having access to the file's metadata and extents. Even if the application developer made the effort to embed physical block addresses to avoid the translation of the file system offset, which would be burdensome, the BPF function could access any block on the device, including blocks that belong to a file that the user does not have permissions to access.</p><p>Challenge 2: concurrency and caching. It is challenging to enable concurrent reads and writes issued from the file system with XRP. A write issued from the file system will only be reflected in the page cache, which is not visible to XRP. In addition, any writes that modify the layout of the data structure (e.g., modify the pointers to the next block) that are issued concurrently to read requests could lead XRP to accidentally fetch the wrong data. Both of these could be addressed by locking, but accessing locks from within the NVMe interrupt handler may be expensive.</p><p>Observation: most on-disk data structures are stable. Both of these challenges would make it difficult to implement arbitrary concurrent BPF storage functions. However, we make the observation that the files of many storage engines (e.g., LSM trees and B-trees) remain relatively stable. Some data structures simply do not modify on-disk storage structures in-place. For example, once an LSM tree writes its index files (called SSTables) to disk, they are immutable until they are deleted <ref type="bibr">[12,</ref><ref type="bibr">27,</ref><ref type="bibr">44]</ref>. Accessing these immutable on-disk storage structures requires less synchronization effort. Similarly, even though some on-disk B-tree index implementations support in-place updates, their file extents remain stable for long periods of time. We verify this in a 24-hour YCSB <ref type="bibr">[41]</ref> (40% reads, 40% updates, 20% inserts, Zipfian 0.7) experiment on MariaDB running TokuDB <ref type="bibr">[20]</ref>, which uses a fractal tree (an on-disk B-tree variant) as its lookup index. We found the index file's extents only changed every 159 seconds on average, with only 5 extent changes in 24 hours unmapping any blocks, making it possible to cache file system metadata in the NVMe driver without the overhead of frequent updates. We also make the observation that in all of these storage engines, the indices are stored on a small number of large files, and each index does not span multiple files.</p><p>Design principles. These observations and experiments inform the following design principles.</p><p>&#8226; One file at a time. We initially restrict XRP to only issue chained resubmissions on a single file. This greatly simplifies address translation and access control, and it minimizes the metadata that we need to push down to the NVMe driver (the metadata digest, &#167;4.1.3). &#8226; Stable data structures. XRP targets data structures, whose layout (i.e. pointers) remain immutable for a long period of time (i.e. seconds or more). Such data structures include the indices used in many popular commercial storage engines, such as RocksDB <ref type="bibr">[44]</ref>, LevelDB <ref type="bibr">[12]</ref>, TokuDB <ref type="bibr">[12]</ref> and WiredTiger <ref type="bibr">[27]</ref>. Since the cost of im-plementing locks in the NVMe layer is high, we also initially do not plan to support operations that require locks during the traversal or iteration of data structures. &#8226; User-managed caches. XRP does not interface with the page cache, so XRP functions cannot safely be run concurrently if blocks are buffered in the kernel page cache. This constraint is acceptable since popular storage engines often implement their own user space caches <ref type="bibr">[20,</ref><ref type="bibr">27,</ref><ref type="bibr">39,</ref><ref type="bibr">44]</ref>; Commonly they do this to fine-tune their caching and prefetching policies and to cache data in an applicationmeaningful way (e.g., cache key-value pairs or database rows instead of physical blocks). &#8226; Slow path fallback. XRP is best-effort; if a traversal fails for some reason (e.g., the extent mappings become stale), the application must retry or fall back to dispatching the I/O requests using user space system calls.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">XRP Design and Implementation</head><p>This section presents XRP's design and implementation with Linux eBPF and ext4. We describe the kernel modifications that enable XRP's resubmission logic in the interrupt handler, and how applications are modified to use XRP. We also discuss XRP's synchronization and scheduling limitations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Resubmission Logic</head><p>The core of XRP augments the NVMe interrupt handler with resubmission logic that consists of a BPF hook, a file system translation step, and the construction and resubmission of the next NVMe request at the new physical offsets (Figure <ref type="figure">4</ref>). Our modifications to the Linux kernel consist of ~900 lines of code: ~500 lines for the BPF hook and the changes to the NVMe driver, ~400 lines for the file system translation step.</p><p>When an NVMe request completes, the device generates an interrupt that causes the kernel to context switch into the interrupt handler. For each NVMe request that is completed in the interrupt context, XRP calls its associated BPF function (bpf _ func _ 0 in Figure <ref type="figure">4</ref>), the pointer of which is stored in a field in the kernel I/O request struct (i.e. struct bio). After calling the BPF function, XRP invokes the metadata digest, which is usually a digest of file system state that enables XRP to translate the logical address of the next resubmission. Finally, XRP prepares the next NVMe command resubmission by setting the corresponding fields in the NVMe request, and it appends the request to the NVMe submission queue (SQ) for that core.</p><p>For a particular NVMe request, the resubmission logic is called as many times as necessary for subsequent completions as determined by the specific BPF function registered with the NVMe request. For example, for traversing a tree-like data structure, the BPF function would resubmit I/O requests for branch nodes and end resubmission whenever a leaf node is found. In our current prototype there is no hard limit on the number of resubmissions before the completion returns control to the application; such a limit would be necessary to prevent unbounded execution. A hard limit can be enforced by maintaining a resubmission counter in each I/O request descriptor. Since I/O request descriptors cannot be accessed from user space or from XRP's BPF programs, their hard resubmission limits cannot be overridden by users even if XRP has multiple BPF functions that execute request resubmissions. BPF function contexts are per-request, while the metadata digest is shared across all invocations of the interrupt handler across all cores. Safe concurrent access to the metadata digest relies on read-copy-update (RCU) ( &#167;4.1.3).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.1">BPF Hook</head><p>XRP introduces a new BPF type (BPF _ PROG _ TYPE _ XRP) with the signature shown in Listing 1 -any BPF function that matches the signature can be called from the hook. &#167;5 presents one concrete BPF function matching this signature that is used in our application. For example, for on-disk data structure traversal, the BPF function typically contains logic to extract the next offset to fetch from the block.</p><p>BPF _ PROG _ TYPE _ XRP programs require a context with five fields, categorized into fields that are inspected or modified by the BPF caller (resubmission logic in the interrupt handler), and fields that should be private to the BPF function. Fields that are accessed externally include data, which buffers data read from the disk (e.g., a B-tree page waiting to be parsed by the BPF function). done is a boolean that notifies the resubmission logic whether to return to the user or continue resubmitting I/O requests. next _ addr and size are arrays of logical addresses and their corresponding sizes that indicate the next logical addresses for resubmission.</p><p>In order to support data structures with fanout, multiple next _ addr values can be supplied. By default we limit fanout to 16; on-disk data structures align their components to small multiples of device pages, so we have not encountered a need for higher fanout per completion. For example, chained hash table buckets are likely implemented as a chain of individual physical pages and the elements of an on-disk linked list are likely implemented at the granularity of physical pages. Setting a corresponding size field to zero issues no I/O.</p><p>scratch is a scratch space that is private to the user and the BPF function. It can be used to pass the parameters from the user to the BPF function. Also, the BPF function can use it to store intermediate data in between I/O resubmissions and to return data to the user. For example, in the first BPF invocation, the application can store a search key in the scratch buffer so that the BPF function can compare it with the keys in the disk block in order to find the next offset. When the I/O chain reaches the leaf node of the B-tree, the BPF function then places the key-value pair in the scratch buffer to return it back to the application. For simplicity, we assume that the size of the scratch buffer is always 4 KB. We find that a 4 KB scratch buffer is sufficient to support a BPF function for a production key-value store ( &#167;5). BPF functions can also use BPF maps to store more data if their intermediate data cannot fit into the scratch buffer. Each BPF context is private to one NVMe request, so no locking is needed when working with BPF context state. Letting the user supply a scratch buffer (instead of using BPF map) avoids the overhead of processes and functions having to call bpf _ map _ lookup _ elem to access the scratch buffer.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.2">BPF Verifier</head><p>The BPF verifier ensures memory safety by tracking the semantics of the value stored in each register <ref type="bibr">[14]</ref>. A valid value can either be a scalar or a pointer. SCALAR _ TYPE represents a value that cannot be dereferenced. The verifier defines various pointer types; most of them include extra constraints beyond the no out-of-bound access requirement. For example, PTR _ TO _ CTX is the type for the pointer to a BPF context. It can only be dereferenced using a constant offset so the verifier can identify which context field a memory operation void update _ mapping(struct inode * inode); void lookup _ mapping(struct inode * inode, off _ t offset, size _ t len, struct mapping * result);</p><p>Listing 2: Metadata digest: XRP exposes an interface to share logical-to-physical-block mappings between the file system and the IRQ handler.</p><p>accesses. Each BPF function type also defines a callback function is _ valid _ access() to perform additional checks on context accesses and to return the value type of the context field. PTR _ TO _ MEM describes a pointer referring to a fixed-size memory region. It supports dereferencing using a variable offset as long as the access is always within bounds. The data and scratch fields of the BPF _ PROG _ TYPE _ XRP context are PTR _ TO _ MEM and the rest are SCALAR _ TYPE. We augment the verifier to allow the BPF _ PROG _ TYPE _ XRP's is _ valid _ access() callback to pass the size of the data buffer or scratch buffer to the verifier so that it can perform the boundary check. We discussed our proposed modification to the verifier with the Linux eBPF maintainers, and they think it is sensible.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.3">The Metadata Digest</head><p>In the conventional storage stack, the logical block offsets in on-disk data structures are translated by the file system in order to identify the next physical block to read. This translation step also enforces access control and security, preventing reading in regions that are not mapped to the open file. In XRP, the next logical address for a lookup is given by the next _ addr field after the BPF call. However, translating this logical address to a physical address is challenging since the interrupt handler has no notion of a file and does not perform physical address translation.</p><p>To solve this, we implement the metadata digest, a thin interface between the file system and the interrupt handler that lets the file system share its logical-to-physical-block mappings with the interrupt handler, enabling safe eBPF-based on-disk resubmissions. The metadata digest consists of two functions (Listing 2). The update function is called within the file system when the logical-to-physical mapping is updated. The lookup function is called within the interrupt handler; it returns the mapping for a given offset and length. The lookup function also enforces access control by preventing BPF functions from requesting resubmissions for blocks outside of the open file. The inode address of the open file is passed to the interrupt handler in order to query the metadata digest. If an invalid logical address is detected, XRP returns to user space immediately with an error code. The application can then fall back to normal system calls to attempt its request again.</p><p>These two functions are specific to each file system, and even for a particular file system, there may be multiple ways to implement the metadata digest, presenting a tradeoff between ease of implementation and performance. For example, in our implementation for ext4, the metadata digest consists of a cached version of the extent status tree, which stores the physical-to-logical block mappings. This cached tree is accessed by the update and lookup function of the interface, and it uses read-copy-update (RCU) for concurrency control. RCU enables the lookup function to be lockless and fast (96 ns on average).</p><p>To keep the cached tree up-to-date with the extents in ext4, the update function is called in two places in ext4: whenever extents are inserted or removed from the main extent tree. To prevent a race condition where an extent is modified while there is an inflight read on it, we maintain a version number for each extent to track its changes. After data is read, but before it is passed to the BPF function, a second metadata digest lookup is performed. If the corresponding extent no longer exists or its version number has changed, XRP will abort the operation. Since application-level synchronization usually prevents concurrent modifications and lookups on the same region of a file at the same time, version mismatches should only occur if the application is buggy or malicious.</p><p>An alternative, simpler implementation of the metadata digest for ext4 could simply pass through to existing update and access functions of the extent tree in ext4. In this case, the update function would be a no-op, because ext4 already keeps its extent tree up-to-date. However, such an implementation would be much slower on the lookup path, because the extent lookup function in ext4 acquires a spinlock, which would be prohibitively expensive in the interrupt handler.</p><p>For now, XRP only supports the ext4 file system, but the metadata digest can be easily implemented for other file systems. For example, in f2fs <ref type="bibr">[64]</ref>, logical-to-physical-block mappings are stored in the node address table (NAT). Similar to the ext4 implementation, an implementation of its metadata digest could cache a local copy of the NAT, which would be consulted in lookup _ mapping. Then update _ mapping would need to be called anywhere in f2fs where the NAT is updated.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.4">Resubmitting NVMe Requests</head><p>After looking up the physical block offsets, XRP prepares the next NVMe request. Because this logic occurs in the interrupt handler, to avoid the (slow) kmalloc calls needed to prepare NVMe requests, XRP reuses the existing NVMe request struct (i.e. struct nvme _ command) of the just-completed request. XRP simply updates the physical sector and block addresses of the existing NVMe request to the new offsets derived from the mapping lookup. Reusing NVMe request structs for immediate resubmission is safe because neither user space nor XRP BPF programs can access the raw NVMe request structs.</p><p>While struct bpf _ xrp supports a maximum fanout of 16, in the current implementation a resubmitted I/O request can only fetch as many physical segments as the initial NVMe request. For example, if an initial NVMe request only fetches a single block, then all subsequent resubmissions for that request can only fetch a single physical segment. During a resubmission chain, if the BPF call returns multiple valid addresses in next _ addr, XRP will abort the request. This limitation can be worked around by allocating and setting up 16 dummy NVMe commands in the first I/O request so that subsequent resubmissions can express fanout if necessary.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Synchronization Limitations</head><p>BPF currently only supports a limited spinlock for synchronization. The verifier only allows BPF programs to acquire one lock at a time, and they must release the lock before returning. Also, user space applications do not have direct access to these BPF spinlocks. Instead, they must invoke the bpf() syscall; the syscall can read or write the lock-protected structure while holding the lock for the duration of that operation. Hence, complex modifications that require synchronizing across multiple reads and writes cannot be accomplished in user space.</p><p>Users can implement custom spinlocks using BPF atomic operations. This allows both BPF functions and user space programs to acquire any spinlock directly. However, the termination constraint prohibits BPF functions from spinning to wait for a spinlock infinitely. Another option for synchronization is RCU. Since XRP BPF programs are run in the NVMe interrupt handler, which cannot be preempted, de-facto they are already in an RCU read-side critical section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Interaction with Linux Schedulers</head><p>Process scheduler. Interestingly, we observed that a microsecond-scale storage device like Optane SSD interferes with Linux's CFS when multiple processes share the same core, even when all I/O is issued from user space. For example, in the case where an I/O-heavy and compute-heavy process share the same core, the I/O interrupts generated by the I/O-heavy process will be handled in the timeslice of the compute-heavy process. This may cause the computeheavy process to be starved of CPU; in the worst case in our experiments, the compute-heavy process only received about 34% of what would be a "fair" allocation of CPU time. We experimentally verified this does not occur when using a slower storage device, which generates interrupts much less frequently. While XRP exarcerbates this problem by generating chains of interrupts, this issue is not specific to eBPF, and can also be caused by network-driven interrupts <ref type="bibr">[59]</ref>. We leave this problem for future work. I/O scheduler. XRP bypasses Linux's I/O scheduler, which sits at the block layer. However, the noop scheduler is already the default I/O scheduler for NVMe devices, and the NVMe standard supports arbitration at hardware queues if fairness is a requirement <ref type="bibr">[17]</ref>. We present two case studies on how applications should be modified to use XRP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Case Studies</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">BPF-KV</head><p>We built a simple key-value store, called BPF-KV, with which we can evaluate XRP against other baselines: Linux's synchronous and asynchronous system calls and kernel bypass (SPDK <ref type="bibr">[82]</ref>). BPF-KV is designed to store a large number of small objects and to provide good read performance even under uniform access patterns. BPF-KV uses a B + -tree index to find the location of objects, and the objects themselves are stored in an unsorted log. For simplicity, BPF-KV uses fixed-sized keys (8 B) and values (64 B). The index and the log are both stored in one large file. The index nodes use a simple page format with a header followed by keys followed by values. Leaf nodes contain a file offset pointing to the next leaf node, enabling efficient index traversal for range queries and aggregation. Object sizes are fixed, so updates occur inplace in the unsorted log. Newly inserted items are appended to the log; their index is initially stored in an in-memory hash table . Once the hash table fills, BPF-KV merges it with the on-disk B + -tree file.</p><p>Caching. BPF-KV implements a user space DRAM cache for index blocks and objects. To reduce the number of I/Os it needs to issue for lookups, BPF-KV caches the top k levels of the B + -tree index. With a sufficiently large number of objects, it is not possible to fit the entire index in the cache. Consider the case where BPF-KV is used to store 10 billion 64 B objects. In BPF-KV's index, each node is 512 B (matching the access granularity of the Optane SSD); hence, the tree has a fanout of 31 (i.e. each internal node can store pointers to 31 children). Therefore, 10 billion objects would require an index with 8 levels. Fitting 6 index levels in DRAM is expensive and would require 14 GB, while fitting 7 levels or more becomes prohibitively expensive (437 GB of DRAM or more). So, to support a large number of keys, BPF-KV would require at the minimum 3-4 I/Os from storage for each lookup, including a final I/O to fetch the actual key-value pair from disk. Also note that having a hard memory budget for caching the index is common in many real-world key-value stores (e.g., RocksDB <ref type="bibr">[45]</ref>, DocumentDB <ref type="bibr">[78]</ref>, SplinterDB <ref type="bibr">[40]</ref>, TokuDB <ref type="bibr">[20]</ref>), since the index cache often competes with other parts of the system that need memory, such as filters and the object cache. BPF-KV also maintains a least recently used (LRU) object cache of the most popular key-value pairs. Before looking up an object on disk, BPF-KV first checks whether it is stored in the object cache. If not, it checks whether it is indexed in the in-memory hash table. If the item is not found in the in-memory hash table, it looks up the object by accessing the first k cached levels of the index. Once it encounters an index node that is not cached, it completes the index and the final lookup on disk.</p><p>To find an object without XRP, BPF-KV traverses the Btree until the desired value is found using an I/O request per level. For example, if the index contains 7 levels and the first 3 are cached and read from DRAM, then the traversal will issue 4 I/Os to navigate the rest of the tree, followed by a final I/O to fetch the object from the log.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BPF function.</head><p>Listing 4 shows the BPF function used in BPF-KV to lookup a key-value pair. We omit the code to handle the final lookup in the log for simplicity. struct node defines the layout of B + -tree index nodes whose size is 512 B. The BPF function bpfkv _ bpf first extracts the target key stored in the scratch buffer, and then it linearly searches the slots in the current node to find the next node to read.</p><p>Interface modifications. We replace read calls with read _ xrp. Before calling into read _ xrp, BPF-KV first allocates a buffer for the scratch space and calculates the offset at which to start the lookup.</p><p>Range queries. BPF-KV supports range queries returning a variable number of objects. We implement a BPF function that runs as a state machine, allowing the operation to be suspended and resumed when objects are returned to the application for processing. The BPF function state, including the beginning and end of the range, and the retrieved objects, are stored in the scratch space (up to 32 72-byte key-value pairs). On the initial invocation, the function traverses to the leaf node that contains the starting key. Once the first key in the range is found, the function stores the leaf node in the scratch space and requests the block containing the corresponding value. On the next BPF invocation, the function stores the value in the scratch space and it continues the index scan struct node { uint64 _ t num; uint64 _ t type; uint64 _ t key <ref type="bibr">[31]</ref>; uint64 _ t ptr <ref type="bibr">[31]</ref>; }; on the cached leaf node. When the leaf node has been read completely, the function submits a request for the next leaf node using the node's next-leaf file offset. The function returns to the application in three cases: 1) the function reaches a key past the end of the range; 2) the function reaches the end of the index; 3) the function fills the scratch space with values read from the log. In the last case, the application can process the values and re-invoke the BPF function with the range query state, allowing the range query to resume from where it left off.</p><p>Aggregations. BPF-KV also supports aggregation operations, such as SUM, MAX and MIN. We implement these operations on top of the BPF range query function by setting a bit that causes the function to perform the corresponding aggregation instead of returning the individual values. Since aggregation queries return a single answer, storing values in the scratch space does not limit the number of I/O resubmits the BPF function can request.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">WiredTiger</head><p>WiredTiger is a popular key-value store that is the default backend for MongoDB <ref type="bibr">[27]</ref>. We use it as a case study since it is a relatively simple and open key-value store that is used in production. WiredTiger provides an option to use an LSM tree where data is split into different levels; each level contains a single file. Each file uses a B-tree index with the key-value pairs embedded in the tree's leaf nodes. The files are readonly; updates and inserts are written into a buffer in memory. When the buffer is full, the data is written out in a new file. We configure the B-tree page size to be the same as our Optane SSD's block size (512 B). Our modification to WiredTiger is around 500 lines of code, which mainly consist of buffer allocation, extending function signatures and wrapping the XRP syscall. XRP helps accelerate reads that are serviced from disk, and it does not affect updates or inserts, which are always absorbed by WiredTiger's in-memory buffer.</p><p>BPF function. To use XRP, WiredTiger installs a BPF function similar to the one shown in Listing 4. The difference is in order to find the next lookup address from the current page, the BPF function contains a port of WiredTiger's B-tree page parsing code. This parsing logic replaces the for loop in Listing 4.</p><p>The WiredTiger BPF function also makes several modifications to make the BPF program compile correctly and pass the BPF verifier. The modifications mainly consist of adding bounds on loops to avoid infinite loops, masking pointers to eliminate out-of-bound access, and initializing local variables to prevent access to uninitialized registers. We also use the BPF function-by-function verification feature <ref type="bibr">[3]</ref> to break a complex function into several simple sub-functions. This allows BPF functions to be verified independently, so the functions that have been verified do not need another round of verification when being called by other functions. The function-by-function verification feature also supports more complex BPF programs without exceeding the verifier's restrictions on function length.</p><p>Caching. WiredTiger maintains a least recently used (LRU) cache for its B-tree internal pages and leaf pages. When looking up a new key-value pair, WiredTiger caches the entire lookup path including the leaf page in the cache. In order to comply with WiredTiger caching semantics, the BPF function described in the previous section also returns all traversed pages so that WiredTiger can cache them. The BPF function stores traversed pages in the scratch buffer of its context. When the scratch buffer is exhausted, the BPF function will stop resubmitting requests and return to user space immediately. After WiredTiger adds those pages into its cache, it will call read _ xrp again to continue the lookup starting at the previous page. Since we set the size of the scratch buffer to 4 KB, a BPF function can store up to 6 traversed 512 B pages in the scratch buffer, which leaves room for necessary metadata such as the search key.</p><p>Interface modifications. To integrate WiredTiger with XRP, we replace normal read calls with read _ xrp. read _xrp is called when the next page is not in the cache and needs to be read from disk. The eviction policy of WiredTiger enforces that only the pages without any cached children pages can be evicted, so any uncached page will not have cached descendants. Therefore, it is safe to call read _ xrp to read all of the remaining path from disk without checking the application-level cache again. If read _ xrp fails for any reason, WiredTiger falls back to the normal lookup path. We allocate a data and scratch buffer for each WiredTiger session to avoid the overhead of allocating and freeing buffers for every request. WiredTiger sessions synchronously process Table <ref type="table">3</ref>: Average latency of a random key lookup with BPF-KV as a function of the depth of the B + -tree stored on-disk. # ops is the number of index I/Os per lookup.</p><p>one request at a time, which avoids concurrency issues.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Evaluation</head><p>In this section we seek to answer the following questions:</p><p>1. What are the overheads of using BPF for storage ( &#167;6.1)? 2. How does XRP scale to multiple threads ( &#167;6.2)? 3. What types of operations can XRP support ( &#167;6.3)? 4. Can XRP accelerate a real-world key-value store ( &#167;6.4)?</p><p>Experimental setup. All experiments are conducted on a 6-core i5-8500 3 GHz server with 16 GB of memory, using Ubuntu 20.04, and Linux 5.12.0 with an Intel Optane 5800X prototype. All experiments use O_DIRECT, turn off hyperthreading, disable processor C-states and turbo boost, use the maximum performance governor, and enable KPTI <ref type="bibr">[30]</ref>. We use WiredTiger 4.4.0 in the experiments.</p><p>Baselines. We compare the following configurations: (a) XRP, (b) SPDK (a popular kernel-bypass library), (c) standard read() system calls, and (d) standard io_uring system calls.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.1">BPF-KV</head><p>Latency. To answer the first evaluation question, we measure the performance of BPF-KV on a benchmark that performs a million read operations with keys drawn randomly with uniform probability. The experiment varies the number of levels of the tree that are stored on-disk. In this subsection, we disable caching of data objects and index nodes to focus on the overhead of looking up on-disk items. The measured average latency is shown in Table <ref type="table">3</ref>. The leftmost column represents the number of chained I/Os that are required to lookup the key in the index (not including the final data lookup). For example, if the number of operations is 4, then BPF-KV is configured with an on-disk tree of depth 4, and it also needs to issue one more I/O to fetch the key-value pair from the log. There are a few takeaways from this experiment. First, XRP improves latency over read(), because XRP saves one or more storage layer traversals when it traverses the index or moves from the index to the log. Indeed, one can see that XRP's latency increases by about 3.5-3.9 &#181;s for each additional I/O operation, which is close to the device's latency (Table <ref type="table">1</ref>). This means that XRP achieves close to optimal latency for resubmitted requests. The same is true for io_uring: in the case of submitting I/O requests synchronously without batching, read() and io_uring are almost equivalent. Second, SPDK exhibits better latency than XRP since XRP must pass through the kernel's storage stack once to initiate the index traversal, while SPDK completely bypasses the kernel. Nonetheless, XRP's marginal added latency when the depth of the B + -tree is increased is close to SPDK's (2.6 &#181;s-3.4 &#181;s). For this reason, in the case of a 6-level index, XRP is only 45% slower than SPDK while read() is 142% slower than SPDK. Importantly, XRP achieves this without resorting to polling. This means that, unlike with SPDK, processes can continue to use CPU cores efficiently for other work; XRP's use of CPU time is limited to what is specifically needed to resubmit I/Os in the background and to keep I/O device utilization high.</p><p>Figure <ref type="figure">5a</ref> and Figure <ref type="figure">5b</ref> present the 99 th -percentile latency and 99.9 th -percentile latency of XRP, respectively. When running with a single thread, similar to the average latency results, XRP reduces both 99 th -percentile latency and 99.9 thpercentile latency by up to 30% compared to read() and io_uring. Note that our experiment runs as a closed loop, so XRP is running at a higher throughput than read() and io_uring. At identical throughput XRP would show additional improvement over these baselines. Interestingly, when the number of threads exceeds the number of cores ( <ref type="formula">6</ref>) by more than 3, SPDK's 99.9 th -percentile latency increases significantly. This is due to the fact that with SPDK all threads are busy-polling, and cannot effectively share the same core with other threads. To this end, we measure the percentage of requests whose latencies are greater than or equal to 1 ms and present the data in Figure <ref type="figure">5c</ref>. The results show that SPDK has 0.03% of such requests with 7 threads, and this percentage increases to 0.28% when the number of threads reaches 24. In contrast, io_uring, read(), and XRP always have fewer than 0.01% of such requests.</p><p>Throughput. Figure <ref type="figure">6a</ref> shows the throughput of XRP. As expected, as the index depth increases, XRP's speedup is higher compared to standard system calls. Figures <ref type="figure">6b</ref> and<ref type="figure">6c</ref> show the throughput speedups with a varying number of threads with an index of depth 3 and 6, respectively. Both figures show the speedup of XRP relative to issuing standard system calls does not decrease even as I/O and XRP BPF functions are scaled across several cores. Once again, XRP provides equal to or higher throughput compared to SPDK once the number of threads is 9 or higher.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.2">Thread Scaling</head><p>Since storage applications often use a large number of concurrent threads that access I/O devices, for example in order to process concurrent requests and to perform background garbage collection <ref type="bibr">[12,</ref><ref type="bibr">20,</ref><ref type="bibr">27,</ref><ref type="bibr">44]</ref>, XRP needs to be able to provide good tail latency and throughput under a large number threads. We analyze how XRP scales as a function of the num-            ber of threads and compare it to SPDK. We run an open loop experiment, where the amount of load matches the maximum bandwidth of the Intel device (5M IOPS for 512 B random reads). Figure <ref type="figure">7a</ref> compares the throughput of XRP (integrated with io_uring) to SPDK with BPF-KV using 6 on-disk index levels, where each thread represents a different tenant. Two major observations are: 1) when using 6 working threads (the number of CPU cores on the machine) both SPDK and XRP can achieve a throughput close to the hardware limit (the grey dashed line); 2) once the thread count exceeds the CPU cores, SPDK's throughput steadily decreases while XRP still provides stable throughput. SPDK's throughput collapse stems from its polling-based approach; SPDK threads never yield, leaving scheduling up to Linux's CFS which works in coarse 6 ms timeslices. However, idle XRP threads will voluntarily yield the CPU to busy threads, so more CPU cycles are spent on actual work. Figure <ref type="figure">7b</ref> presents the throughput-latency relationship under 12 working threads as a function of the load. With more threads than CPU cores, both average and tail latencies also increase more significantly in SPDK, as each thread waits longer to be scheduled than in XRP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.3">Range Query</head><p>Figure <ref type="figure">8</ref> compares the average latency and the throughput of running a range query with XRP against performing the query with read() system calls. In both cases the range query performs a single index traversal to find the first object, and traverses the leaf nodes of the index to find the address of subsequent objects. The index depth is 6 in this experiment. Even though the XRP range query can only retrieve 32 objects per syscall, the results show this adds negligible overhead. XRP's performance speedup remains relatively constant as a function of the length of the aggregation, since XRP performs only one storage stack traversal for every 32 values retrieved. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.4">WiredTiger</head><p>To understand whether XRP can benefit a real-world database, we evaluate the performance of WiredTiger with and without XRP on YCSB <ref type="bibr">[41]</ref>. We run the different YCSB workloads so that their runtime takes more or less the same time: YCSB A, B, C and E use 10M operations, D uses 50M operations and E uses 3M. The baseline WiredTiger uses pread() to read Btree pages, while the WiredTiger with XRP uses read _ xrp().</p><p>We populate the database with 1 billion key-value pairs and set the size of both key and value to 16 B. The total size of the database is 46 GB. WiredTiger runs eviction threads to evict pages when its cache usage is close to full, and we set the number of eviction threads to 2.</p><p>Throughput. Figure <ref type="figure">9</ref> shows the total throughput of WiredTiger with different cache sizes and different numbers of client threads. We configure WiredTiger with 512 MB, 1 GB, 2 GB, and 4 GB cache sizes to ensure that WiredTiger can cache at least 1% of its database while not exhausting all the available memory on the machine. We run up to To study the effect of access distribution on XRP, we run YCSB C with a varying Zipfian constant and with a uniform distribution. Figure <ref type="figure">10a</ref> shows that XRP's benefit decreases when the Zipfian constant becomes larger (i.e. , the distribution is more skewed) because of the increased cache hit ratio. Note that skews greater than 0.99 represent very high skew levels. We also see that the throughput gain on WiredTiger is lower than that on BPF-KV with the uniform YCSB C. This is again because WiredTiger spends 37% of its total time on non-I/O operations.</p><p>Tail latency. We measure the tail read latency of WiredTiger with and without XRP under a fixed load: 20 kop/s per client thread for YCSB A, B, C, D, F, and 5 kops/s per client thread for YCSB E. Since YCSB E has scans instead of reads, we set a lower load for it and measure the tail scan latency instead of the tail read latency. Figure <ref type="figure">10b</ref> shows that XRP can reduce the 99 th -percentile latency by up to 40%. Similar to the throughput, the 99 th -percentile latency improve-ment mostly decreases with a larger cache size, and XRP does not have significant effect on YCSB D and E.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7">Related Work</head><p>There are four areas of related work: (a) using BPF to accelerate I/O (typically networking), (b) kernel-bypass systems, (c) near-storage compute, and (d) extensible operating systems and library file systems.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>BPF for I/O.</head><p>There is a large number of systems and frameworks that use BPF to accelerate I/O processing, primarily focused on networking and tracing use cases <ref type="bibr">[2,</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref><ref type="bibr">15,</ref><ref type="bibr">18,</ref><ref type="bibr">25,</ref><ref type="bibr">28,</ref><ref type="bibr">37,</ref><ref type="bibr">46,</ref><ref type="bibr">49,</ref><ref type="bibr">50,</ref><ref type="bibr">52]</ref>. Most closely related to XRP, XDP <ref type="bibr">[28]</ref> accelerates networking I/O by adding a hook in the NIC driver's RX path. It then provides an interface for eBPF programs that either filter, redirect, or bounce the packet.</p><p>There are no existing systems that use BPF to resubmit storage requests from within the kernel. Kourtis et al. <ref type="bibr">[62]</ref> propose a system that uses eBPF functions as an interface to submit disaggregated storage requests in order to avoid crossing the network. In their system, resubmissions occur from a user space service sitting at the host and are not serviced by the kernel itself, since the network is the primary bottleneck (not the kernel software stack). ExtFUSE <ref type="bibr">[36]</ref> allows user space file systems on Linux to load BPF functions into the kernel to serve low-level file system requests and thus eliminates unnecessary context switches. While ExtFUSE accelerates user space file systems, it provides no performance benefits for an application that already uses a standard kernel file system (e.g., ext4), since it does not allow applications to bypass the kernel's storage stack. BMC <ref type="bibr">[49]</ref> uses BPF to accelerate memcached by intercepting packets on the network path at the host. The BPF functions can then access a separate small kernel-based cache, which serves as a first-level cache and is not synchronized with the user space memcached application. Zhong et al. <ref type="bibr">[85]</ref> provide motivation for using BPF for accelerating storage from within the kernel, but do not provide a concrete design, implementation or evaluation.</p><p>Kernel bypass. In order to reduce the kernel's overhead when processing I/O, several libraries and operating systems have been designed to let users directly access I/O devices <ref type="bibr">[7,</ref><ref type="bibr">33,</ref><ref type="bibr">34,</ref><ref type="bibr">42,</ref><ref type="bibr">47,</ref><ref type="bibr">57,</ref><ref type="bibr">65,</ref><ref type="bibr">69,</ref><ref type="bibr">71,</ref><ref type="bibr">72,</ref><ref type="bibr">[82]</ref><ref type="bibr">[83]</ref><ref type="bibr">[84]</ref>. Most relevant to our work, Intel's SPDK <ref type="bibr">[82]</ref> is a popular kernel-bypass library for storage. In general, the downside of allowing users to access I/O directly is that applications must directly poll for I/O to obtain high performance. This means that cores cannot be shared among processes, which leads to significant underutilization when I/O is not the bottleneck.</p><p>Near-storage compute. There are several systems that allow applications to offload their storage functions to the processor embedded within or attached to a storage device <ref type="bibr">[16,</ref><ref type="bibr">22,</ref><ref type="bibr">31,</ref><ref type="bibr">38,</ref><ref type="bibr">43,</ref><ref type="bibr">51,</ref><ref type="bibr">55,</ref><ref type="bibr">61,</ref><ref type="bibr">63,</ref><ref type="bibr">74,</ref><ref type="bibr">75,</ref><ref type="bibr">77,</ref><ref type="bibr">81]</ref>. The downside of this approach is that it requires specialized storage devices, dedicated hardware, or both.</p><p>Extensible operating systems and library file systems. Our approach is reminiscent of extensible operating systems and library file systems from the 1990s. Extensible operating systems (e.g., SPIN <ref type="bibr">[35]</ref> and VINO <ref type="bibr">[76,</ref><ref type="bibr">79]</ref>) allow extension of kernel functionality via user-defined functions. For example, a client can write kernel extensions that read and decompress video frames from disk. Another related approach is library file systems, such as XN <ref type="bibr">[48,</ref><ref type="bibr">56]</ref>. Similar to XRP, XN allows userspace library file systems to load untrusted metadata translation functions into the kernel, while guaranteeing disk block protection without understanding file systems' data structures. These approaches required using dedicated operating and file systems, while XRP is compatible with Linux and its standard file systems. ExtOS <ref type="bibr">[32]</ref>, a more recent extensible OS, minimizes data movements in read() and splice() by using BPF functions to filter data before copying them to user space or another file, but it still incurs the full storage stack overhead and does not allow I/O request resubmissions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="8">Conclusions and Future Work</head><p>BPF has the potential to accelerate applications using fast NVMe devices by moving computation closer to the device. XRP lets applications write functions that can resubmit dependent storage requests to achieve speedups close to kernelbypass while retaining the advantages of being OS-integrated. Beyond fast lookups, we envision XRP can be used for many types of functions such as compaction, compression and deduplication. In addition, XRP in the future can be developed as a common interface for other use cases where computation needs to be moved closer to storage, such as programmable storage devices and networked storage systems. For example, XRP could be used as an interface that can dynamically support both in-kernel offloading, as well as offloading functions to a smart storage device or an FPGA. Another direction we plan to explore is networked storage. XRP storage functions could be chained with XDP networking functions to create a datapath that bypasses both the kernel's networking and storage paths. throughput and low latency. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14), pages 49-65, Broomfield, CO, October 2014. USENIX Association.</p></div></body>
		</text>
</TEI>
