Effective Mimicry of Belady’s MIN Policy
The past decade has seen the rise of highly successful cache replacement policies that are based on binary prediction. For example, the Hawkeye policy learns whether lines loaded by a given PC are Cache Friendly (likely to remain in the cache under Belady's MIN policy) or Cache Averse (likely to be evicted by Belady's MIN policy). In this paper, we instead present a cache replacement policy that is based on multiclass prediction, which allows it to directly mimic Belady's MIN policy in a surprisingly simple and effective way. Our policy uses a PC-based predictor to learn each cache line's reuse distance; it then evicts lines based on their predicted time of reuse. We show that our use of multiclass prediction is more effective than binary prediction because it allows for a finer-grained ordering of cache lines during eviction and because it is more robust to prediction errors. Our empirical results show that our new policy, which we refer to as Mockingjay, outperforms the previous state-of-the-art on both single-core and multi-core platforms, both with and without a prefetcher. For example, with no prefetcher, on a mix of 100 multi-core workloads from the SPEC 2006, SPEC 2017, and GAP benchmark suites, Mockingjay sees an average improvement over LRU of 15.2%, compared to 7.6% for SHiP and 12.9% for Hawkeye. On a single-core platform, Mockingjay's improvement over LRU is 5.7%, which approaches the 6.0% improvement of the unrealizable Belady's MIN policy. On a single-core platform (with a prefetcher) running the high-MPKI CVP workloads, Mockingjay's improvement over LRU is 20.1%, compared to 13.4% for Hawkeye.
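The mechanism described above, predicting each line's reuse distance from the PC that loaded it and evicting the line whose predicted reuse lies furthest in the future, can be illustrated with a small software sketch. This is not the Mockingjay hardware design (which samples sets, quantizes distances, and bounds predictor state); the class name, table organization, and running-average update below are illustrative assumptions.

```python
# Minimal sketch of PC-based reuse-distance prediction for cache replacement.
# Illustrative only: real designs quantize distances, sample sets, and bound
# predictor state; the running-average update and sizes here are assumptions.
from collections import defaultdict

class ReuseDistancePolicy:
    def __init__(self, num_ways=16):
        self.num_ways = num_ways
        self.pred = defaultdict(lambda: 0.0)   # predicted reuse distance per load PC
        self.seen = defaultdict(int)           # samples observed per PC
        self.lines = {}                        # addr -> (pc, estimated time of next reuse)
        self.last_access = {}                  # addr -> timestamp of last access
        self.now = 0

    def _train(self, pc, observed_distance):
        # Blend the observed reuse distance into the PC's prediction.
        n = self.seen[pc] = self.seen[pc] + 1
        self.pred[pc] += (observed_distance - self.pred[pc]) / n

    def access(self, addr, pc):
        self.now += 1
        if addr in self.lines:                      # hit: learn from the observed reuse
            self._train(pc, self.now - self.last_access[addr])
        elif len(self.lines) >= self.num_ways:      # miss in a full set: evict the line
            victim = max(self.lines, key=lambda a: self.lines[a][1])
            del self.lines[victim]                  # ... predicted to be reused latest
        # (Re)insert with an estimated time of next reuse.
        self.lines[addr] = (pc, self.now + self.pred[pc])
        self.last_access[addr] = self.now
```

Compared with a binary friendly/averse label, the predicted reuse times give a total order over the lines in a set, which is what enables the finer-grained eviction choice and error tolerance the abstract describes.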
- Award ID(s): 1823546
- PAR ID: 10334308
- Date Published:
- Journal Name: International Symposium on High-Performance Computer Architecture
- Page Range / eLocation ID: 558-572
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            Despite its success in many areas, deep learning is a poor fit for use in hardware predictors because these models are impractically large and slow, but this paper shows how we can use deep learning to help design a new cache replacement policy. We first show that for cache replacement, a powerful LSTM learning model can in an offline setting provide better accuracy than current hardware predictors. We then perform analysis to interpret this LSTM model, deriving a key insight that allows us to design a simple online model that matches the offline model's accuracy with orders of magnitude lower cost. The result is the Glider cache replacement policy, which we evaluate on a set of 33 memory-intensive programs from the SPEC 2006, SPEC 2017, and GAP (graph-processing) benchmark suites. In a single-core setting, Glider outperforms top finishers from the 2nd Cache Replacement Championship, reducing the miss rate over LRU by 8.9%, compared to reductions of 7.1% for Hawkeye, 6.5% for MPPPB, and 7.5% for SHiP++. On a four-core system, Glider improves IPC over LRU by 14.7%, compared with improvements of 13.6% (Hawkeye), 13.2% (MPPPB), and 11.4% (SHiP++).
- 
            In-memory key-value caches are widely used as a performance-critical layer in web applications, disk-based storage, and distributed systems. The Least Recently Used (LRU) replacement policy has become the de facto standard in those systems since it exploits workload locality well. However, the LRU implementation can be costly due to the rigid data structure it maintains for object priority, as well as the locks needed to update object order. Redis, one of the most effective and widely deployed commercial systems, adopts an approximated LRU policy, in which the least recently used item from a small, randomly sampled set of items is chosen for eviction. This random sampling-based policy is lightweight and flexible. We observe that there can be a significant miss ratio gap between exact LRU and random sampling-based LRU under different sampling sizes K. Therefore, existing LRU miss ratio curve (MRC) construction techniques cannot be directly applied without loss of accuracy. In this paper, we introduce a new probabilistic stack algorithm named KRR to accurately model random sampling-based LRU, and extend it to handle both fixed-size and variable-size objects in key-value caches. We present an efficient stack update algorithm that significantly reduces the expected running time of KRR. To improve the performance of in-memory multi-tenant key-value caches that use random sampling-based replacement, we propose kRedis, a reference locality- and latency-aware memory partitioning scheme. kRedis guides memory allocation among the tenants and dynamically customizes K to better exploit the locality of each individual tenant. Evaluation results over diverse workloads show that our model generates accurate miss ratio curves for both fixed and variable object size workloads, and enables practical, low-overhead online MRC prediction. Equipped with KRR, kRedis delivers up to a 50.2% reduction in average access latency and up to a 262.8% throughput improvement compared to Redis. Furthermore, compared with pRedis, a state-of-the-art memory allocation design for Redis, kRedis improves average access latency and throughput by up to 24.8% and 61.8%, respectively.
- 
            Temporal prefetchers have the potential to prefetch arbitrary memory access patterns, but they require large amounts of metadata that must typically be stored in DRAM. In 2013, the Irregular Stream Buffer (ISB) showed how this metadata could be cached on chip and managed implicitly by synchronizing its contents with those of the TLB. This paper reveals the inefficiency of that approach and presents a new metadata management scheme that uses a simple metadata prefetcher to feed the metadata cache. The result is the Managed ISB (MISB), a temporal prefetcher that significantly advances the state-of-the-art in terms of both traffic overhead and IPC. Using a highly accurate proprietary simulator for single-core workloads, and the ChampSim simulator for multi-core workloads, we evaluate MISB on programs from the SPEC CPU 2006 and CloudSuite benchmark suites. Our results show that for single-core workloads, MISB improves performance by 22.7%, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. MISB also significantly reduces off-chip traffic; for SPEC, MISB's traffic overhead of 70% is roughly one fifth of STMS's (342%) and one sixth of ISB's (411%). On 4-core multi-programmed workloads, MISB improves performance by 27.5%, compared to 13.6% for idealized STMS. For CloudSuite, MISB improves performance by 12.8% (vs. 6.0% for idealized STMS), while achieving a 7× traffic reduction (83.5% for MISB vs. 572.3% for STMS).
- 
            Distributed caches are widely deployed to serve social networks and web applications at billion-user scales. This paper presents Cache-on-Track (CoT), a decentralized, elastic, and predictive caching framework for cloud environments. CoT proposes a new cache replacement policy specifically tailored for small front-end caches that serve skewed workloads with a small update percentage. Small front-end caches are mainly used to mitigate the load imbalance across servers in the distributed caching layer. Front-end servers use a heavy-hitter tracking algorithm to continuously track the top-k hot keys. CoT dynamically caches the top-C hot keys out of the tracked keys. CoT's main advantage over other replacement policies is its ability to dynamically adapt its tracker and cache sizes in response to workload distribution changes. Our experiments show that CoT's replacement policy consistently outperforms the hit rates of LRU, LFU, and ARC for the same cache size on different skewed workloads. Also, CoT slightly outperforms the hit rate of LRU-2 when both policies are configured with the same tracking (history) size. CoT achieves server-side load balance with 50% to 93.75% less front-end cache than other replacement policies. Finally, experiments show that CoT's resizing algorithm successfully auto-configures the tracker and cache sizes to achieve back-end load balance in the presence of workload distribution changes.
- 
            Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, treating prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to a large vocabulary, which makes this modeling impractical. The number and order of outputs for prefetching multiple cache lines are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce the vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between inputs and outputs. Prediction experiments demonstrate that address segmentation achieves 26%-36% higher F1-score than delta inputs and 15%-24% higher F1-score than page-and-offset inputs for the SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves a 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44% and the ML-based prefetcher Voyager by 6.64%.
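The sketches below illustrate the core mechanisms described in the related entries above; each is a hedged, self-contained example rather than the authors' implementations.

The deep-learning entry at the top of the list contrasts an offline LSTM with a much simpler online model of comparable accuracy. The abstract does not describe that online model; the published Glider design is generally reported to use a perceptron-like integer SVM over an unordered set of recent load PCs. The sketch below only illustrates that general idea, with hypothetical history length, weight bounds, and threshold; it is not the actual Glider predictor.

```python
# Sketch of a perceptron-style (integer-SVM-like) cache-friendliness predictor
# keyed on a short history of recent load PCs. Hypothetical parameters;
# not the actual Glider implementation.
from collections import deque, defaultdict

class PCHistoryPredictor:
    def __init__(self, history_len=5, threshold=0, w_min=-32, w_max=31):
        self.history = deque(maxlen=history_len)   # recent load PCs
        self.weights = defaultdict(int)             # per-PC integer weights
        self.threshold = threshold
        self.w_min, self.w_max = w_min, w_max

    def _features(self):
        return set(self.history)                    # unordered set of recent PCs

    def predict_cache_friendly(self):
        return sum(self.weights[pc] for pc in self._features()) >= self.threshold

    def train(self, was_cache_friendly):
        # Perceptron-style update toward the observed (MIN-derived) label.
        delta = 1 if was_cache_friendly else -1
        for pc in self._features():
            w = self.weights[pc] + delta
            self.weights[pc] = max(self.w_min, min(self.w_max, w))

    def observe(self, pc):
        self.history.append(pc)
```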
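The KRR/kRedis entry models Redis-style approximated LRU, in which the victim is the least recently used key within a small random sample of K resident keys rather than the global LRU key. Below is a minimal sketch of that sampling policy (the baseline being modeled, not the KRR stack algorithm itself); the capacity, sample size, and evict-on-insert trigger are assumptions.

```python
# Sketch of Redis-style approximated LRU: evict the least recently used key
# among a random sample of K resident keys. Illustrative only.
import random

class SampledLRUCache:
    def __init__(self, capacity=1000, sample_k=5):
        self.capacity = capacity
        self.sample_k = sample_k
        self.store = {}       # key -> value
        self.last_used = {}   # key -> timestamp of last access
        self.now = 0

    def get(self, key):
        self.now += 1
        if key in self.store:
            self.last_used[key] = self.now
            return self.store[key]
        return None

    def put(self, key, value):
        self.now += 1
        if key not in self.store and len(self.store) >= self.capacity:
            # Sample a few resident keys and evict the stalest one among them.
            sample = random.sample(list(self.store), min(self.sample_k, len(self.store)))
            victim = min(sample, key=lambda k: self.last_used[k])
            del self.store[victim]
            del self.last_used[victim]
        self.store[key] = value
        self.last_used[key] = self.now
```

A larger K makes the choice closer to exact LRU; the miss ratio gap between the two is exactly what KRR is built to model.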
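The MISB entry stages DRAM-resident temporal-prefetcher metadata (address-correlation pairs) through a small on-chip metadata cache that is fed by a simple metadata prefetcher. The sketch below illustrates only that staging idea; the region granularity, on-chip capacity, and next-region prefetch heuristic are assumptions, not MISB's actual scheme.

```python
# Sketch of staging temporal-prefetcher metadata (addr -> likely next addr)
# through a small on-chip cache backed by a large off-chip table.
# Region size, capacity, and next-region prefetching are illustrative choices.
from collections import OrderedDict

REGION = 64  # metadata entries grouped per off-chip transfer (assumed)

class MetadataStore:
    def __init__(self, onchip_regions=256):
        self.offchip = {}            # region id -> {addr: next_addr}; the "DRAM" table
        self.onchip = OrderedDict()  # small on-chip cache of metadata regions
        self.onchip_regions = onchip_regions

    def _region(self, addr):
        return addr // REGION

    def _fill(self, region):
        # Stage one region of metadata on chip, evicting the oldest region if full.
        if region in self.onchip:
            self.onchip.move_to_end(region)
            return
        if len(self.onchip) >= self.onchip_regions:
            self.onchip.popitem(last=False)
        self.onchip[region] = dict(self.offchip.get(region, {}))

    def train(self, prev_addr, next_addr):
        # Learn the temporal correlation prev_addr -> next_addr in the big table.
        # (In this sketch the on-chip copy may go stale; a real design writes back.)
        self.offchip.setdefault(self._region(prev_addr), {})[prev_addr] = next_addr

    def lookup(self, addr):
        region = self._region(addr)
        self._fill(region)
        self._fill(region + 1)   # simple "metadata prefetch": also stage the next region
        return self.onchip[region].get(addr)
```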
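The CoT entry tracks the top-k hot keys with a heavy-hitter algorithm and caches only the hottest C of them at the front end. The sketch below uses a Space-Saving-style counter table as the tracker; the class name and parameters are hypothetical, and CoT's actual tracking algorithm, dynamic resizing of the tracker and cache, and consistency handling are not reproduced here.

```python
# Sketch of a front-end cache that tracks hot keys with a bounded counter
# table (Space-Saving style) and caches only the hottest C of them.
# Tracker/cache sizes and the tracking algorithm are illustrative assumptions.

class HotKeyFrontEnd:
    def __init__(self, track_k=1000, cache_c=100):
        self.track_k = track_k
        self.cache_c = cache_c
        self.counts = {}    # key -> approximate access count (the tracker)
        self.cache = {}     # key -> value, only for the hottest keys

    def _track(self, key):
        if key in self.counts:
            self.counts[key] += 1
        elif len(self.counts) < self.track_k:
            self.counts[key] = 1
        else:
            # Space-Saving: replace the minimum-count key, inheriting its count.
            victim = min(self.counts, key=self.counts.get)
            self.counts[key] = self.counts.pop(victim) + 1
            self.cache.pop(victim, None)

    def _hot_enough(self, key):
        if len(self.cache) < self.cache_c:
            return True
        coldest = min(self.cache, key=lambda k: self.counts.get(k, 0))
        return self.counts[key] > self.counts.get(coldest, 0)

    def get(self, key, backend):
        self._track(key)
        if key in self.cache:
            return self.cache[key]
        value = backend(key)                # miss: fetch from the distributed caching layer
        if self._hot_enough(key):
            if len(self.cache) >= self.cache_c:
                coldest = min(self.cache, key=lambda k: self.counts.get(k, 0))
                del self.cache[coldest]
            self.cache[key] = value
        return value
```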
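The TransFetch entry replaces a whole-address vocabulary with two representations: segmented address bits as the model input and a delta bitmap over nearby cache lines as the multi-label output. The helpers below show how such representations could be constructed; the segment width, number of segments, delta range, and block size are assumed values, not the paper's configuration.

```python
# Sketch of TransFetch-style input/output encoding:
#  - input: an address split into fixed-width segments (small per-segment vocab)
#  - output: a delta bitmap marking which nearby cache lines are accessed next.
# Segment width, delta range, and block size are illustrative assumptions.
BLOCK_BITS = 6          # 64-byte cache lines
SEG_BITS = 4            # width of each address segment
NUM_SEGS = 10           # number of segments fed to the model
DELTA_RANGE = 64        # predict line deltas in [-64, +64)

def segment_address(addr):
    """Split a byte address into NUM_SEGS small integers (model input tokens)."""
    line = addr >> BLOCK_BITS
    return [(line >> (SEG_BITS * i)) & ((1 << SEG_BITS) - 1) for i in range(NUM_SEGS)]

def delta_bitmap(trigger_addr, future_addrs):
    """Multi-hot label: bit i is set if line delta (i - DELTA_RANGE) is accessed soon."""
    bitmap = [0] * (2 * DELTA_RANGE)
    base = trigger_addr >> BLOCK_BITS
    for a in future_addrs:
        d = (a >> BLOCK_BITS) - base
        if -DELTA_RANGE <= d < DELTA_RANGE and d != 0:
            bitmap[d + DELTA_RANGE] = 1
    return bitmap

# Example: encode one trigger access and the lines touched shortly afterwards.
tokens = segment_address(0x7f3a1c40)
label = delta_bitmap(0x7f3a1c40, [0x7f3a1c80, 0x7f3a2000])
```

Because the label is an unordered bitmap rather than a token sequence, one forward pass can propose several prefetch candidates at once.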