skip to main content

Attention:

The NSF Public Access Repository (PAR) system and access will be unavailable from 8:00 PM ET on Friday, March 21 until 8:00 AM ET on Saturday, March 22 due to maintenance. We apologize for the inconvenience.


Title: SDC: a software defined cache for efficient data indexing
CPU cache has been used to bridge the processor-memory performance gap to enable high-performance computing. As the cache is of limited capacity, for its maximum efficacy it should (1) avoid caching data that are less likely to be accessed and (2) identify and cache data that would otherwise cost a program multiple memory accesses to reach. Unfortunately, existing cache architectures are inadequate on these two efforts. First, to cost-effectively exploit the spatial locality, they adopt a relatively large and fixed-size cache line as the caching unit. Thus, much of the space in a cache line can be wasted when the data locality is weak. Second, for easy use, the cache is designed to be transparent to programs, which hinders programs from fully exploiting its performance potentials. To address these problems, we propose a high-performance Software Defined Cache (SDC) architecture providing a simple and generic key-value abstraction that allows (1) caching data at a granularity smaller than a cache line, and (2) enabling programs to explicitly insert, retrieve, and invalidate data in the cache with new instructions. By providing a program with the ability of explicitly using the cache as a lookaside key-value buffer, SDC enables a much more efficient cache without disruptively changing the existing cache organization and without substantially increasing hardware cost. We have prototyped SDC on the gem5 simulator and evaluated it with various data index structures and workloads. Experiment results show that SDC can improve the cache performance for the workloads by up to 5.3× over current cache design.  more » « less
Award ID(s):
1815303
PAR ID:
10119150
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
ICS '19 Proceedings of the ACM International Conference on Supercomputing
Page Range / eLocation ID:
82 to 93
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In-memory key-value cache systems, such as Memcached and Redis, are essential in today's data centers. A key mission of such cache systems is to identify the most valuable data for caching. To achieve this, the current system design keeps track of each key-value item's access and attempts to make accurate estimation on its temporal locality. All it aims is to achieve the highest cache hit ratio. However, as cache capacity quickly increases, the overhead of managing metadata for a massive amount of small key-value items rises to an unbearable level. Put it simply, the current fine-grained, heavy-cost approach cannot continue to scale.

    In this paper, we have performed an experimental study on the scalability challenge of the current key-value cache system design and quantitatively analyzed the inherent issues related to the metadata operations for cache management. We further propose a key-value cache management scheme, calledCatalyst, based on a highly efficient metadata structure, which allows us to make effective caching decisions in a scalable way. By offloading non-essential metadata operations to GPU, we can further dedicate the limited CPU and memory resources to the main service operations for improved throughput and latency. We have developed a prototype based on Memcached. Our experimental results show that our scheme can significantly enhance the scalability and improve the cache system performance by a factor of up to 4.3.

     
    more » « less
  2. Recommendation systems have been widely embedded into many Internet services. For example, Meta’s deep learning recommendation model (DLRM) shows high predictive accuracy of click-through rate in processing large-scale embedding tables. The SparseLengthSum (SLS) kernel of the DLRM dominates the inference time of the DLRM due to intensive irregular memory accesses to the embedding vectors. Some prior works directly adopt near-data processing (NDP) solutions to obtain higher memory bandwidth to accelerate SLS. However, their inferior memory hierarchy induces a low performance-cost ratio and fails to fully exploit the data locality. Although some software-managed cache policies were proposed to improve the cache hit rate, the incurred cache miss penalty is unacceptable considering the high overheads of executing the corresponding programs and the communication between the host and the accelerator. To address the issues aforementioned, we proposeEMS-i, an efficient memory system design that integrates Solid State Drive (SSD) into the memory hierarchy using Compute Express Link (CXL) for recommendation system inference. We specialize the caching mechanism according to the characteristics of various DLRM workloads and propose a novel prefetching mechanism to further improve the performance. In addition, we delicately design the inference kernel and develop a customized mapping scheme for SLS operation, considering the multi-level parallelism in SLS and the data locality within a batch of queries. Compared to the state-of-the-art NDP solutions,EMS-iachieves up to 10.9× speedup over RecSSD and the performance comparable to RecNMP with 72% energy savings.EMS-ialso saves up to 8.7× and 6.6 × memory cost w.r.t. RecSSD and RecNMP, respectively. 
    more » « less
  3. Caching techniques are widely used in today’s computing infrastructure from virtual memory management to server cache and memory cache. This paper builds on two observa- tions. First, the space utilization in cache can be improved by varying the cache size based on dynamic application demand. Second, it is easier to predict application behavior statistically than precisely. This paper presents a new variable-size cache that uses statistical knowledge of program behavior to maximize the cache performance. We measure performance using data access traces from real-world workloads, including Memcached traces from Facebook and storage traces from Microsoft Research. In an offline setting, the new cache is demonstrated to outperform even OPT, the optimal fixed- size cache which makes use of precise knowledge of program behavior. 
    more » « less
  4. For hosting data-serving and caching workloads based on key-value stores in clouds, the cost of memory represents a significant portion of the hosting expenses. The emergence of cheaper, but slower, types of memories, such as NVDIMMs, opens opportunities to reduce the hosting costs for such workloads. The question explored in this paper is how to determine adequate allocations of different memory types in future systems with heterogeneous memory components, so as to retain desired performance SLOs and maximize the cost efficiency of the memory resource. We develop Mnemo, a memory sizing and data tiering consultant, that permits quick exploration of the cost-benefit tradeoffs associated with different configurations of the hybrid memory components used by key-value store workloads. Using experimental evaluation with different workload patterns, Mnemo is able to afford applications such as Redis, Memcached and DynamoDB, with substantial reduction in their hosting costs, at negligible impact on application performance, thus improving the overall system memory cost efficiency. 
    more » « less
  5. In-memory key-value caches are widely used as a performance-critical layer in web applications, disk-based storage, and distributed systems. The Least Recently Used (LRU) replacement policy has become the de facto standard in those systems since it exploits workload locality well. However, the LRU implementation can be costly due to the rigid data structure in maintaining object priority, as well as the locks for object order updating. Redis as one of the most effective and prevalent deployed commercial systems adopts an approximated LRU policy, where the least recently used item from a small, randomly sampled set of items is chosen to evict. This random sampling-based policy is lightweight and shows its flexibility. We observe that there can exist a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling size $K$ s. Therefore existing LRU miss ratio curve (MRC) construction techniques cannot be directly applied without loss of accuracy. In this paper, we introduce a new probabilistic stack algorithm named KRR to accurately model random sampling based-LRU, and extend it to handle both fixed and variable objects in key-value caches. We present an efficient stack update algorithm that reduces the expected running time of KRR significantly. To improve the performance of the in-memory multi-tenant key-value cache that utilizes random sampling-based replacement, we propose kRedis, a reference locality- and latency-aware memory partitioning scheme. kRedis guides the memory allocation among the tenants and dynamically customizes $K$ to better exploit the locality of each individual tenant. Evaluation results over diverse workloads show that our model generates accurate miss ratio curves for both fixed and variable object size workloads, and enables practical, low-overhead online MRC prediction. Equipped with KRR, kRedis delivers up to a 50.2% average access latency reduction, and up to a 262.8% throughput improvement compared to Redis. Furthermore, by comparing with pRedis, a state-of-the-art design of memory allocation in Redis, kRedis shows up to 24.8% and 61.8% improvements in average access latency and throughput, respectively. 
    more » « less