We present ContextPrefetcher, a host-guided, high-performance prefetching framework for near-storage accelerators that prefetches data blocks from storage (e.g., NAND) to device-level RAM. Efficiently prefetching data blocks to device-level RAM reduces storage access costs and improves I/O performance. We introduce a novel abstraction, the Cross-layered Context (CLC), a virtual entity that spans the host and the device and is used to identify, manage, and track active and inactive data such as files, objects (within object stores), or ranges of blocks. To support efficient prefetching of actively used CLCs to device memory without incurring near-device resource (memory and compute) bottlenecks, ContextPrefetcher delegates prefetching management to the host, which guides near-device compute to prefetch the blocks of active CLCs. Finally, ContextPrefetcher facilitates the swift reclamation of blocks associated with inactive CLCs. Preliminary evaluation against state-of-the-art near-storage accelerator designs demonstrates performance gains of up to 1.34×.
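The host-guided control flow is easy to picture as a small loop. Below is a minimal Python sketch of the idea, with all names (the CLC fields, the toy device-RAM capacity, the command strings) assumed for illustration rather than taken from ContextPrefetcher's actual interface: a host-side manager tracks CLC activity, directs near-device compute to prefetch blocks of active CLCs, and reclaims blocks of inactive ones.

```python
# Illustrative sketch only (not the paper's code): the host maintains a
# table of Cross-layered Contexts (CLCs) and decides which blocks the
# near-storage accelerator should prefetch into device-level RAM and
# which cached blocks of inactive CLCs can be reclaimed.
from dataclasses import dataclass

DEVICE_RAM_BLOCKS = 8  # assumed tiny device-RAM capacity for the example

@dataclass
class CLC:
    """A Cross-layered Context: a file, object, or range of blocks."""
    name: str
    blocks: list          # storage block numbers belonging to this context
    active: bool = False  # host-observed activity state

class HostPrefetchManager:
    def __init__(self):
        self.clcs = {}
        self.device_ram = set()  # blocks currently cached in device RAM

    def register(self, clc):
        self.clcs[clc.name] = clc

    def mark_active(self, name, active=True):
        self.clcs[name].active = active

    def tick(self):
        """One host-side management step: reclaim, then prefetch."""
        # Reclaim blocks of inactive CLCs first, freeing device RAM.
        for clc in self.clcs.values():
            if not clc.active:
                for b in clc.blocks:
                    if b in self.device_ram:
                        self.device_ram.discard(b)
                        print(f"reclaim block {b} ({clc.name})")
        # Guide the device to prefetch blocks of active CLCs while
        # device RAM has room; the device only executes these commands.
        for clc in self.clcs.values():
            if clc.active:
                for b in clc.blocks:
                    if b not in self.device_ram and \
                            len(self.device_ram) < DEVICE_RAM_BLOCKS:
                        self.device_ram.add(b)
                        print(f"prefetch block {b} ({clc.name})")

mgr = HostPrefetchManager()
mgr.register(CLC("log.db", blocks=[10, 11, 12]))
mgr.register(CLC("index", blocks=[40, 41]))
mgr.mark_active("log.db")
mgr.tick()                        # prefetches log.db blocks
mgr.mark_active("log.db", False)
mgr.mark_active("index")
mgr.tick()                        # reclaims log.db, prefetches index
```

The point of the split is that the (resource-rich) host makes all placement decisions, while the (resource-constrained) device merely moves blocks it is told to move.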
CrossPrefetch: Accelerating I/O Prefetching for Modern Storage
We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms suffer from rigid interfaces that give applications no feedback on prefetch effectiveness, from high concurrency bottlenecks, and from inefficient use of available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and the runtime, minimizing overhead and achieving fewer cache misses, lower lock contention, and higher I/O performance.

CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It addresses scalability and concurrency bottlenecks by separating the regular I/O and prefetch operation paths and by introducing fine-grained prefetch indexing for shared files. Finally, CrossPrefetch combines low-interference access-pattern prediction with adaptive and aggressive prefetching techniques to exploit memory capacity and storage bandwidth.

Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, shows I/O throughput gains of 1.22×–3.7×. We also evaluate CrossPrefetch across different file systems and across local and remote storage configurations.
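As a rough illustration of the cross-layered split, the sketch below models an OS that exports prefetch-effectiveness statistics and a user-level runtime that widens or narrows its prefetch window based on that feedback. The class names, fields, and thresholds are assumptions for the example, not CrossPrefetch's real interface.

```python
# Minimal sketch (assumed interface) of the cross-layered idea: the OS
# exports cache state and prefetch effectiveness, and the user-level
# runtime adapts its prefetch aggressiveness using that feedback.

class OsPrefetchStats:
    """Stand-in for OS-exported state; CrossPrefetch exports such
    information to user-level runtimes through its own interface."""
    def __init__(self):
        self.prefetched = 0  # blocks prefetched so far
        self.useful = 0      # prefetched blocks that were later read

    def effectiveness(self):
        return self.useful / self.prefetched if self.prefetched else 1.0

class RuntimePrefetcher:
    MIN_WINDOW, MAX_WINDOW = 4, 256

    def __init__(self, stats):
        self.stats = stats
        self.window = 16  # blocks to prefetch ahead of the read cursor

    def adapt(self):
        # Aggressive when prefetches are paying off, conservative when
        # they mostly waste memory and bandwidth.
        eff = self.stats.effectiveness()
        if eff > 0.8:
            self.window = min(self.window * 2, self.MAX_WINDOW)
        elif eff < 0.4:
            self.window = max(self.window // 2, self.MIN_WINDOW)

stats = OsPrefetchStats()
rt = RuntimePrefetcher(stats)
stats.prefetched, stats.useful = 100, 90  # mostly useful prefetches
rt.adapt(); print("window after good feedback:", rt.window)  # grows
stats.prefetched, stats.useful = 200, 50  # effectiveness collapses
rt.adapt(); print("window after poor feedback:", rt.window)  # shrinks
```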
- Award ID(s): 2231724
- PAR ID: 10545237
- Publisher / Repository: ACM
- Date Published:
- ISBN: 9798400703720
- Page Range / eLocation ID: 102 to 116
- Format(s): Medium: X
- Location: La Jolla CA USA
- Sponsoring Org: National Science Foundation
More Like this
-
Modern data center applications have rapidly expanding instruction footprints that lead to frequent instruction cache misses, increasing cost and degrading data center performance and energy efficiency. Mitigating instruction cache misses is challenging since existing techniques (1) require significant hardware modifications, (2) expect impractical on-chip storage, or (3) prefetch instructions based on an inaccurate understanding of program miss behavior. To overcome these limitations, we first investigate the challenges of effective instruction prefetching. We then use insights derived from our investigation to develop I-SPY, a novel profile-driven prefetching technique. I-SPY uses dynamic miss profiles to drive an offline analysis of I-cache miss behavior, which it uses to inform prefetching decisions. Two key techniques underlie I-SPY's design: (1) conditional prefetching, which only prefetches instructions if the program context is known to lead to misses, and (2) prefetch coalescing, which merges multiple prefetches of non-contiguous cache lines into a single prefetch instruction. I-SPY exposes these techniques via a family of lightweight hardware code-prefetch instructions. We study I-SPY in the context of nine data center applications and show that it provides an average of 15.5% (up to 45.9%) speedup and a 95.9% (up to 98.4%) reduction in instruction cache misses, outperforming the state-of-the-art prefetching technique by 22.5%. We show that I-SPY achieves performance improvements that are on average 90.5% of the performance of an ideal cache with no misses.
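A minimal sketch of I-SPY's two key techniques follows, using an invented profile format (the real system encodes these decisions in hardware code-prefetch instructions; the context tuple and line numbers here are illustrative assumptions):

```python
# Sketch of (1) conditional prefetching: only prefetch a cache line
# when the current program context is one the offline miss profile
# associates with misses; and (2) prefetch coalescing: merge nearby,
# non-contiguous lines into one "instruction" (base line + bitmask).

# Offline miss profile: program context -> cache lines that miss next.
MISS_PROFILE = {
    ("main", "parse"): [0x100, 0x102, 0x103],
    ("main", "render"): [0x400],
}

def conditional_prefetch(context, profile):
    """Return lines to prefetch, only if this context is known to miss."""
    return profile.get(tuple(context), [])

def coalesce(lines, window=8):
    """Merge lines within `window` of a base line into (base, bitmask)."""
    out = []
    for line in sorted(lines):
        if out and 0 < line - out[-1][0] < window:
            base, mask = out[-1]
            out[-1] = (base, mask | (1 << (line - base)))
        else:
            out.append((line, 1))  # bit 0 marks the base line itself
    return out

ctx = ["main", "parse"]
lines = conditional_prefetch(ctx, MISS_PROFILE)
for base, mask in coalesce(lines):
    print(f"prefetch base=0x{base:x} mask=0b{mask:b}")
# -> one coalesced prefetch covering lines 0x100, 0x102, and 0x103
```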
-
Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program's regular "demand" accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well. To try to meet these challenges, we have designed the RPG2 system for online prefetch injection and tuning. RPG2 is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG2 can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code - something that static compilers cannot. RPG2 improves prefetching robustness by preserving its performance benefits while avoiding slowdowns.
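The online inject-measure-rollback loop can be sketched as follows; the variant names and the measurement function are stand-ins for illustration, not RPG2's actual machinery:

```python
# Sketch (hypothetical API) of online prefetch tuning: try injected
# prefetch variants on the running program, keep a variant only if
# measured performance improves, otherwise roll back to the original.
import random

def measure(variant):
    """Stand-in for timing a code variant on the live program (ms)."""
    base = 100.0
    effect = {"original": 0.0, "distance=8": -12.0, "distance=64": +9.0}
    return base + effect[variant] + random.uniform(-1, 1)

def tune(variants, trials=5):
    best = "original"
    best_time = sum(measure(best) for _ in range(trials)) / trials
    for v in variants:
        t = sum(measure(v) for _ in range(trials)) / trials
        if t < best_time:  # keep the injected prefetches
            best, best_time = v, t
        # else: roll back - the injected prefetches hurt performance
    return best, best_time

random.seed(0)
choice, t = tune(["distance=8", "distance=64"])
print(f"selected variant: {choice} ({t:.1f} ms)")  # picks distance=8
```

The rollback path is what distinguishes this online approach from static profile-guided compilation: a bad prefetching decision is observable and reversible at run time.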
-
Modern server workloads have large code footprints which are prone to front-end bottlenecks due to instruction cache capacity misses. Even with the aggressive fetch directed instruction prefetching (FDIP) implemented in modern processors, there are still significant front-end stalls due to I-cache misses. A major portion of misses that occur on a BPU-predicted path are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, has not been designed to work with FDIP processors. Its singular goal is reducing I-cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (those not fully hidden by FDIP), and targeting only them. In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches only for targets where FDIP struggles - along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
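A toy sketch of the priority-directed idea, with invented structures: record only the miss targets on resteer paths of stall-causing events (the misses FDIP cannot hide) and tie them to an earlier trigger PC that fires the prefetch the next time it is fetched.

```python
# Sketch with assumed structures: a learned table mapping an earlier
# trigger PC to the resteer-path I-cache lines that caused front-end
# stalls after an event such as a branch mispredict.

pdip_table = {}  # trigger PC -> set of resteer-path cache lines

def train(trigger_pc, resteer_lines):
    """Record stall-causing resteer targets against an earlier trigger."""
    pdip_table.setdefault(trigger_pc, set()).update(resteer_lines)

def on_fetch(pc):
    """At fetch time, fire prefetches when a trigger PC is seen."""
    for line in sorted(pdip_table.get(pc, ())):
        print(f"prefetch I-cache line 0x{line:x} (trigger pc=0x{pc:x})")

# A mispredicted branch at 0x4a0 resteered fetch to lines that missed
# and stalled the front end; associate them with an earlier trigger.
train(trigger_pc=0x480, resteer_lines=[0x9000, 0x9040])
# The next occurrence of the trigger prefetches the resteer path early.
on_fetch(0x480)
```

Misses that FDIP already tolerates never enter the table, which is how the prefetcher stays within a small storage budget while targeting only stall-causing misses.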