Title: OmniCache: Collaborative Caching for Near-storage Accelerators
We propose OmniCache, a novel caching design for near-storage accelerators that combines near-storage and host memory capabilities to accelerate I/O and data processing. First, OmniCache introduces a “near-cache” approach, maximizing data access to the nearest cache for I/O and processing operations. Second, OmniCache presents collaborative caching for concurrent I/O and data processing using both host and device caches. Third, OmniCache incorporates dynamic model-driven offloading support, which actively monitors hardware and software metrics to process requests efficiently across host and device processors. Finally, OmniCache explores extensibility to the newly introduced CXL, a memory expansion technology. OmniCache demonstrates significant performance gains of up to 3.24X for I/O workloads and 3.06X for data processing workloads.
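
To make the offloading idea concrete, below is a minimal, hypothetical Python sketch of a model-driven placement decision that weighs cache residency and processor load on each side. The metric names, cost model, and constants are illustrative assumptions, not OmniCache's actual implementation.

from dataclasses import dataclass

@dataclass
class Metrics:
    host_cache_hit_ratio: float   # fraction of the requested data already in the host cache
    dev_cache_hit_ratio: float    # fraction already in the device (near-storage) cache
    host_cpu_util: float          # host CPU utilization, 0.0-1.0
    dev_cpu_util: float           # device processor utilization, 0.0-1.0
    transfer_cost_per_mb: float   # estimated interconnect transfer cost (ms/MB)

def choose_processor(m: Metrics, request_mb: float) -> str:
    """Pick where to run a processing request: the side holding more of the
    data in its cache wins, unless its processor is heavily loaded."""
    # Estimated cost = data that must be moved to that side + a queueing penalty.
    host_cost = (1.0 - m.host_cache_hit_ratio) * request_mb * m.transfer_cost_per_mb \
                + 10.0 * m.host_cpu_util
    dev_cost = (1.0 - m.dev_cache_hit_ratio) * request_mb * m.transfer_cost_per_mb \
               + 10.0 * m.dev_cpu_util
    return "host" if host_cost <= dev_cost else "device"

# Example: data mostly resident in the device cache and an idle device processor
# keep the work near storage.
m = Metrics(host_cache_hit_ratio=0.1, dev_cache_hit_ratio=0.8,
            host_cpu_util=0.7, dev_cpu_util=0.2, transfer_cost_per_mb=0.05)
print(choose_processor(m, request_mb=64))   # -> "device"
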
Award ID(s):
1910593
PAR ID:
10568783
Author(s) / Creator(s):
Publisher / Repository:
USENIX
Date Published:
ISBN:
978-1-939133-38-0
Format(s):
Medium: X
Location:
Santa Clara CA USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Recommendation systems have been widely embedded into many Internet services. For example, Meta's deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate when processing large-scale embedding tables. The SparseLengthSum (SLS) kernel dominates the DLRM's inference time due to intensive, irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth and accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit data locality. Although software-managed cache policies have been proposed to improve the cache hit rate, the incurred cache-miss penalty is unacceptable given the high overheads of executing the corresponding programs and of communication between the host and the accelerator. To address the aforementioned issues, we propose EMS-i, an efficient memory system design that integrates a Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve performance. In addition, we carefully design the inference kernel and develop a customized mapping scheme for the SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to state-of-the-art NDP solutions, EMS-i achieves up to 10.9× speedup over RecSSD and performance comparable to RecNMP with 72% energy savings. EMS-i also saves up to 8.7× and 6.6× memory cost relative to RecSSD and RecNMP, respectively. A brief sketch of the SLS access pattern follows below.
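
As an illustration of why SLS stresses the memory system, the following hypothetical Python sketch shows the gather-and-sum access pattern over a large embedding table with a simple software-managed cache of hot rows. The table size and caching policy are assumptions for illustration, not EMS-i's design.

import numpy as np

# A deliberately small embedding table; in EMS-i the full table would live on
# SSD/CXL-attached memory rather than in DRAM.
emb_table = np.random.rand(100_000, 64).astype(np.float32)
hot_rows = {}   # software-managed cache of frequently used rows (DRAM-resident)

def sparse_length_sum(indices):
    """Gather the embedding rows named by `indices` and pool them by summation."""
    out = np.zeros(emb_table.shape[1], dtype=np.float32)
    for idx in indices:
        row = hot_rows.get(idx)
        if row is None:              # miss: an irregular access into the big table
            row = emb_table[idx]
            hot_rows[idx] = row
        out += row
    return out

print(sparse_length_sum([3, 17, 81_004]).shape)   # (64,)
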
  2. Federated learning (FL) has emerged as a new paradigm of machine learning (ML) with the goal of collaborative learning on the vast pool of private data available across distributed edge devices. Most existing work on FL systems has focused on addressing the computation and communication heterogeneity inherent in training with edge devices. However, the crucial impact of I/O and the role of limited on-device storage have not been fully explored in the FL context. Without policies to exploit on-device storage for placing client data samples and to schedule clients based on I/O benefits, FL training can suffer inefficiencies such as increased training time and impaired accuracy convergence. In this paper, we propose FedCaSe, a framework for efficiently caching client samples in situ on limited on-device storage and for scheduling client participation. FedCaSe boosts I/O performance by exploiting a unique characteristic: the experience, i.e., the relative impact on overall performance, of data samples and clients. FedCaSe uses this information in adaptive caching policies for sample placement inside the limited memory of edge clients. The framework also exploits the experience information to orchestrate the future selection of clients. Our experiments with representative workloads and policies show that, compared to the state of the art, FedCaSe improves training time by 2.06× for accuracy convergence at the scale of thousands of clients. A small sketch of experience-driven caching follows below.
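
The following minimal Python sketch (hypothetical, not FedCaSe's actual policy) illustrates the idea of ranking samples by their "experience" and caching only the highest-impact ones within a limited on-device capacity.

def cache_by_experience(samples, experience, capacity):
    """Keep the `capacity` samples whose experience (relative impact on overall
    training performance) is highest; everything else stays on slower storage."""
    ranked = sorted(samples, key=lambda s: experience[s], reverse=True)
    return set(ranked[:capacity])

# Hypothetical experience scores for four client samples.
experience = {"s1": 0.9, "s2": 0.2, "s3": 0.7, "s4": 0.4}
cached = cache_by_experience(list(experience), experience, capacity=2)
print(cached)   # {'s1', 's3'} (set order may vary)
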
  3. We present ContextPrefetcher, a host-guided, high-performance prefetching framework for near-storage accelerators that prefetches data blocks from storage (e.g., NAND) to device-level RAM. Efficiently prefetching data blocks to device-level RAM reduces storage access costs and improves I/O performance. We introduce a novel abstraction, Cross-layered Context (CLC), a virtual entity that spans the host and the device and is used to identify, manage, and track active and inactive data such as files, objects (within object stores), or ranges of blocks. To support efficient prefetching of actively used CLCs into device memory without incurring near-device resource (memory and compute) bottlenecks, ContextPrefetcher delegates prefetching management to the host, guiding near-device compute to prefetch blocks of active CLCs. Finally, ContextPrefetcher facilitates swift reclamation of blocks associated with inactive CLCs. Preliminary evaluation against state-of-the-art near-storage accelerator designs demonstrates performance gains of up to 1.34×. A brief sketch of the CLC idea follows below.
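
Below is a hypothetical Python sketch of the Cross-layered Context idea: the host tracks which contexts are active and directs near-device compute to prefetch or reclaim their blocks. All function names and data structures here are illustrative assumptions, not ContextPrefetcher's API.

active_clcs = {}   # clc_id -> block numbers belonging to that context

def issue_device_prefetch(blocks):
    # Stand-in for the host instructing near-device compute to pull NAND blocks
    # into device-level RAM.
    print(f"device: prefetching blocks {blocks}")

def issue_device_reclaim(blocks):
    # Stand-in for reclaiming device-RAM blocks of a retired context.
    print(f"device: reclaiming blocks {blocks}")

def mark_active(clc_id, blocks):
    """Host side: register a context (file, object, or block range) as active
    and have its blocks prefetched into device RAM."""
    active_clcs[clc_id] = blocks
    issue_device_prefetch(blocks)

def mark_inactive(clc_id):
    """Host side: retire a context so its device-RAM blocks can be reclaimed."""
    issue_device_reclaim(active_clcs.pop(clc_id, []))

mark_active("file:/data/log.0", [128, 129, 130])
mark_inactive("file:/data/log.0")
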