NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Can a Client-Server Cache Tango Accelerate Disaggregated Storage?

https://doi.org/10.1145/3736548.3737838

Ma, Linjie; Zhang, Jian; Nguyen, Marie; Kannan, Sudarsun (July 2025, ACM)

Free, publicly-accessible full text available July 10, 2026
Kamino: Efficient VM Allocation at Scale with Latency-Driven Cache-Aware Scheduling

Domingo, David; Barbalho, Hugo; Molinaro, Marco; Liu, Kuan; Pan, Abhisek; Dion, David; Moscibroda, Thomas; Kannan, Sudarsun; Menache, Ishai (July 2025, ACM)

In virtual machine (VM) allocation systems, caching repetitive and similar VM allocation requests and associated resolution rules is crucial for reducing computational costs and meeting strict latency requirements. While modern allocation systems distribute requests among multiple allocator agents and use caching to improve performance, current schedulers often neglect the cache state and latency considerations when assigning each new request to an agent. Due to the high variance in costs of cache hits and misses and the associated processing overheads of updating the caches, simple load-balancing and cache-aware mechanisms result in high latencies. We introduce Kamino, a high-performance, latencydriven and cache-aware request scheduling system aimed at minimizing end-to-end latencies. Kamino employs a novel scheduling algorithm grounded in theory which uses partial indicators from the cache state to assign each new request to the agent with the lowest estimated latency. Evaluation of Kamino using a high-fidelity simulator on large-scale production workloads shows a 42% reduction in average request latencies. Our deployment of Kamino in the control plane of a large public cloud confirms these improvements, with a 33% decrease in cache miss rates and a 17% reduction in memory usage
more » « less
Free, publicly-accessible full text available July 7, 2026
Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems

https://doi.org/10.1145/3655038

Zhang, Jian; Nguyen, Marie; Kashyap, Sanidhya; Kannan, Sudarsun (July 2024, ACM)

We present ContextPrefetcher, a host-guided high-performant prefetching framework for near-storage accelerators that prefetches data blocks from storage (e.g., NAND) to devicelevel RAM. Efficiently prefetching data blocks to device-level RAM reduces storage access costs and improves I/O performance. We introduce a novel abstraction, Cross-layered Context (CLC), a virtual entity that spans across the host and the device and is used for identifying, managing, and tracking active and inactive data such as files, objects (within object stores), or a range of blocks. To support efficient prefetching of actively used CLCs to device memory without incurring near-device resource (memory and compute) bottlenecks, ContextPrefetcher delegates prefetching management to the host, guiding near-device compute to prefetch blocks of active CLC. Finally, ContextPrefetcher facilitates the swift reclamation of blocks associated with inactive CLC. Preliminary evaluation against state-of-the-art near-storage accelerator designs demonstrates performance gains of up to 1.34×.
more » « less
Full Text Available
CrossPrefetch: Accelerating I/O Prefetching for Modern Storage

https://doi.org/10.1145/3617232.3624872

Garg, Shaleen; Zhang, Jian; Pitchumani, Rekha; Parashar, Manish; Xie, Bing; Kannan, Sudarsun (April 2024, ACM)

We introduce CrossPrefetch, a novel cross-layered I/O prefetching mechanism that operates across the OS and a user-level runtime to achieve optimal performance. Existing OS prefetching mechanisms suffer from rigid interfaces that do not provide information to applications on the prefetch effectiveness, suffer from high concurrency bottlenecks, and are inefficient in utilizing available system memory. CrossPrefetch addresses these limitations by dividing responsibilities between the OS and runtime, minimizing overhead, and achieving low cache misses, lock contentions, and higher I/O performance. CrossPrefetch tackles the limitations of rigid OS prefetching interfaces by maintaining and exporting cache state and prefetch effectiveness to user-level runtimes. It also addresses scalability and concurrency bottlenecks by distinguishing between regular I/O and prefetch operations paths and introduces fine-grained prefetch indexing for shared files. Finally, CrossPrefetch designs low-interference access pattern prediction combined with support for adaptive and aggressive techniques to exploit memory capacity and storage bandwidth. Our evaluation of CrossPrefetch, encompassing microbenchmarks, macrobenchmarks, and real-world workloads, illustrates performance gains of up to 1.22x-3.7x in I/O throughput.We also evaluate CrossPrefetch across different file systems and local and remote storage configurations.
more » « less
Full Text Available
OmniCache: Collaborative Caching for Near-storage Accelerators

Zhang, Jian; Ren, Yujie; Nguyen, Marie; Min, Changwoo; Kannan, Sudarsun (February 2024, USENIX Association)

We propose OmniCache, a novel caching design for nearstorage accelerators that combines near-storage and host memory capabilities to accelerate I/O and data processing. First, OmniCache introduces a “near-cache” approach, maximizing data access to the nearest cache for I/O and processing operations. Second, OmniCache presents collaborative caching for concurrent I/O and data processing by using host and device caches. Third, OmniCache incorporates a dynamic modeldriven offloading support, which actively monitors hardware and software metrics for efficient processing across host and device processors. Finally, OmniCache explores the extensibility for newly-introduced CXL, a memory expansion technology. OmniCache demonstrates significant performance gains of up to 3.24X for I/O workloads and 3.06X for data processing workloads.
more » « less
Full Text Available

Search for: All records