skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Efficient metadata management for irregular data prefetching
Temporal prefetchers have the potential to prefetch arbitrary memory access patterns, but they require large amounts of metadata that must typically be stored in DRAM. In 2013, the Irregular Stream Buffer (ISB), showed how this metadata could be cached on chip and managed implicitly by synchronizing its contents with that of the TLB. This paper reveals the inefficiency of that approach and presents a new metadata management scheme that uses a simple metadata prefetcher to feed the metadata cache. The result is the Managed ISB (MISB), a temporal prefetcher that significantly advances the state-of-the-art in terms of both traffic overhead and IPC. Using a highly accurate proprietary simulator for single-core workloads, and using the ChampSim simulator for multi-core workloads, we evaluate MISB on programs from the SPEC CPU 2006 and CloudSuite benchmarks suites. Our results show that for single-core workloads, MISB improves performance by 22.7%, compared to 10.6% for an idealized STMS and 4.5% for a realistic ISB. MISB also significantly reduces off-chip traffic; for SPEC, MISB's traffic overhead of 70% is roughly one fifth of STMS's (342%) and one sixth of ISB's (411%). On 4-core multi-programmed workloads, MISB improves performance by 27.5%, compared to 13.6% for idealized STMS. For CloudSuite, MISB improves performance by 12.8% (vs. 6.0% for idealized STMS), while achieving a traffic reduction of 7 × (83.5% for MISB vs. 572.3% for STMS).  more » « less
Award ID(s):
1823546
PAR ID:
10113800
Author(s) / Creator(s):
; ; ; ;
Date Published:
Journal Name:
Efficient metadata management for irregular data prefetching
Page Range / eLocation ID:
449 to 461
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Temporal prefetching offers great potential, but this potential is difficult to achieve because of the need to store large amounts of prefetcher metadata off chip. To reduce the latency and traffic of off-chip metadata accesses, recent advances in temporal prefetching have proposed increasingly complex mechanisms that cache and prefetch this off-chip metadata. This paper suggests a return to simplicity: We present a temporal prefetcher whose metadata resides entirely on chip. The key insights are (1) only a small portion of prefetcher metadata is important, and (2) for most workloads with irregular accesses, the benefits of an effective prefetcher outweigh the marginal benefits of a larger data cache. Thus, our solution, the Triage prefetcher, identifies important metadata and uses a portion of the LLC to store this metadata, and it dynamically partitions the LLC between data and metadata. Our empirical results show that when compared against spatial prefetchers that use only on-chip metadata, Triage performs well, achieving speedups on irregular subset of SPEC2006 of 23.5% compared to 5.8% for the previous state-of-the-art. When compared against state-of-the-art temporal prefetchers that use off-chip metadata, Triage sacrifices performance on single-core systems (23.5% speedup vs. 34.7% speedup), but its 62% lower traffic overhead translates to better performance in bandwidth-constrained 16-core systems (6.2% speedup vs. 4.3% speedup). 
    more » « less
  2. This paper presents Voyager, a novel neural network for data prefetching. Unlike previous neural models for prefetching, which are limited to learning delta correlations, our model can also learn address correlations, which are important for prefetching irregular sequences of memory accesses. The key to our solution is its hierarchical structure that separates addresses into pages and offsets and that introduces a mechanism for learning important relations among pages and offsets. Voyager provides significant prediction benefits over current data prefetchers. For a set of irregular programs from the SPEC 2006 and GAP benchmark suites, Voyager sees an average IPC improvement of 41.6% over a system with no prefetcher, compared with 21.7% and 28.2%, respectively, for idealized Domino and ISB prefetchers. We also find that for two commercial workloads for which current data prefetchers see very little benefit, Voyager dramatically improves both accuracy and coverage. At present, slow training and prediction preclude neural models from being practically used in hardware, but Voyager’s overheads are significantly lower—in every dimension—than those of previous neural models. For example, computation cost is reduced by 15- 20×, and storage overhead is reduced by 110-200×. Thus, Voyager represents a significant step towards a practical neural prefetcher. 
    more » « less
  3. Temporal prefetchers are powerful because they can prefetch irregular sequences of memory accesses, but temporal prefetchers are commercially infeasible because they store large amounts of metadata in DRAM. This paper presents Triage, the first temporal data prefetcher that does not require off-chip metadata. Triage builds on two insights: (1) Metadata are not equally useful, so the less useful metadata need not be saved, and (2) for irregular workloads, it is more profitable to use portions of the LLC to store metadata than data. We also introduce novel schemes to identify useful metadata, to compress metadata, and to determine the fraction of the LLC to dedicate for metadata. 
    more » « less
  4. The past decade has seen the rise of highly successful cache replacement policies that are based on binary prediction. For example, the Hawkeye policy learns whether lines loaded by a given PC are Cache Friendly (likely to remain in the cache if Belady’s MIN policy had been used) or Cache Averse (likely to be evicted by Belady’s MIN policy). In this paper, we instead present a cache replacement policy that is based on multiclass prediction, which allows it to directly mimic Belady’s MIN policy in a surprisingly simple and effective way. Our policy uses a PC-based predictor to learn each cache line’s reuse distance; it then evicts lines based on their predicted time of reuse. We show that our use of multiclass prediction is more effective than binary prediction because it allows for a finer-grained ordering of cache lines during eviction and because it is more robust to prediction errors.Our empirical results show that our new policy, which we refer to as Mockingjay, outperforms the previous state-of-the-art on both single-core and multi-core platforms and both with and without a prefetcher. For example, with no prefetcher, on a mix of 100 multi-core workloads from the SPEC 2006, SPEC 2017, and GAP benchmark suites, Mockingjay sees an average improvement over LRU of 15.2%, compared to 7.6% for SHiP and 12.9% for Hawkeye. On a single-core platform, Mockingjay’s improvement over LRU is 5.7%, which approaches the 6.0% improvement of Belady MIN’s unrealizable policy. On a single-core platform (with a prefetcher) running the high-MPKI CVP workloads, Mockingjay’s improvement over LRU is 20.1%, compared to 13.4% for Hawkeye. 
    more » « less
  5. Non-volatile Memory (NVM) offers the opportunity to build large, durable B+ trees with markedly higher performance and faster post-crash recovery than is possible with traditional disk- or flash-based persistence. Unfortunately, cache flush and fence instructions, required for crash consistency and failure atomicity on many machines, introduce substantial overhead not present in non-persistent trees, and force additional NVM reads and writes. The overhead is particularly pronounced in workloads that benefit from cache reuse due to good temporal locality or small working sets---traits commonly observed in real-world applications. In this paper, we propose a buffered durable B+ tree (BD+Tree) that improves performance and reduces NVM traffic viarelaxedpersistence. Execution of a BD+Tree is divided intoepochsof a few milliseconds each; if a crash occurs in epoche,the tree recovers to its state as of the end of epoche-2. (The persistence boundary can always be made current with an explicit sync operation, which quickly advances the epoch by 2.) NVM writes within an epoch are aggregated for delayed persistence, thereby increasing cache reuse and reducing traffic to NVM. In comparison to state-of-the-art persistent B+ trees, our micro-benchmark experiments show that BD+Tree can improve throughput by up to 2.4x and reduce NVM writes by up to 90% when working sets are small or workloads exhibit strong temporal locality. On real-world workloads that benefit from cache reuse, BD+Tree realizes throughput improvements of 1.1--2.4x and up to a 99% decrease in NVM writes. Even on uniform workloads, with working sets that significantly exceed cache capacity, BD+Tree still improves throughput by 1--1.3x. The performance advantage of BD+Tree increases with larger caches, suggesting ongoing benefits as CPUs evolve toward gigabyte cache capacities. 
    more » « less