- Award ID(s):
- 1823546
- NSF-PAR ID:
- 10334423
- Date Published:
- Journal Name:
- nternational Conference on Architectural Support for Programming Languages and Operating Systems
- Page Range / eLocation ID:
- 861 to 873
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Temporal prefetching offers great potential, but this potential is difficult to achieve because of the need to store large amounts of prefetcher metadata off chip. To reduce the latency and traffic of off-chip metadata accesses, recent advances in temporal prefetching have proposed increasingly complex mechanisms that cache and prefetch this off-chip metadata. This paper suggests a return to simplicity: We present a temporal prefetcher whose metadata resides entirely on chip. The key insights are (1) only a small portion of prefetcher metadata is important, and (2) for most workloads with irregular accesses, the benefits of an effective prefetcher outweigh the marginal benefits of a larger data cache. Thus, our solution, the Triage prefetcher, identifies important metadata and uses a portion of the LLC to store this metadata, and it dynamically partitions the LLC between data and metadata. Our empirical results show that when compared against spatial prefetchers that use only on-chip metadata, Triage performs well, achieving speedups on irregular subset of SPEC2006 of 23.5% compared to 5.8% for the previous state-of-the-art. When compared against state-of-the-art temporal prefetchers that use off-chip metadata, Triage sacrifices performance on single-core systems (23.5% speedup vs. 34.7% speedup), but its 62% lower traffic overhead translates to better performance in bandwidth-constrained 16-core systems (6.2% speedup vs. 4.3% speedup).more » « less
-
Machine learning algorithms have shown potential to improve prefetching performance by accurately predicting future memory accesses. Existing approaches are based on the modeling of text prediction, considering prefetching as a classification problem for sequence prediction. However, the vast and sparse memory address space leads to large vocabulary, which makes this modeling impractical. The number and order of outputs for multiple cache line prefetching are also fundamentally different from text prediction. We propose TransFetch, a novel way to model prefetching. To reduce vocabulary size, we use fine-grained address segmentation as input. To predict unordered sets of future addresses, we use delta bitmaps for multiple outputs. We apply an attention-based network to learn the mapping between input and output. Prediction experiments demonstrate that address segmentation achieves 26% - 36% higher F1-score than delta inputs and 15% - 24% higher F1-score than page & offset inputs for SPEC 2006, SPEC 2017, and GAP benchmarks. Simulation results show that TransFetch achieves 38.75% IPC improvement compared with no prefetching, outperforming the best-performing rule-based prefetcher BOP by 10.44% and ML-based prefetcher Voyager by 6.64%.more » « less
-
Abstract Graph analytics shows promise for solving challenging problems on relational data. However, memory constraints arise from the large size of graphs and the high complexity of algorithms. Data prefetching is a crucial technique to hide memory access latency by predicting and fetching data into the memory cache beforehand. Traditional prefetchers struggle with fixed rules in adapting to complex memory access patterns in graph analytics. Machine learning (ML) algorithms, particularly long short-term memory (LSTM) models, excel in memory access prediction. However, they encounter challenges such as difficulty in learning interleaved access patterns and high storage costs when predicting in large memory address space. In addition, there remains a gap between designing a high-performance ML-based memory access predictor and developing an effective ML-based prefetcher for an existing memory system. In this work, we propose a novel Attention-based prefetching framework to accelerate graph analytics applications. To achieve high-performance memory access prediction, we propose A2P, a novel Attention-based memory Access Predictor for graph analytics. We use the multi-head self-attention mechanism to extract features from memory traces. We design a novel
bitmap labeling method to collect future deltas within a spatial range, making interleaved patterns easier to learn. We introduce a novelsuper page concept, allowing the model to surpass physical page constraints. To integrate A2P into a memory system, we design a three-module prefetching framework composed of an existing memory hierarchy, a prefetch controller, and the predictor A2P. In addition, we propose a hybrid design to combine A2P and existing hardware prefetchers for higher prefetching performance. We evaluate A2P and the prefetching framework using the widely used GAP benchmark. Prediction experiments show that for the top three predictions, A2P outperforms the widely used state-of-the-art LSTM-based model by 23.1% w.r.t. Precision, 21.2% w.r.t. Recall, and 10.4% w.r.t. Coverage. Prefetching experiments show that A2P provides 18.4% IPC Improvement on average, outperforming state-of-the-art prefetchers BO by 17.2%, ISB by 15.0%, and Delta-LSTM by 10.9%. The hybrid prefetcher combining A2P and ISB achieves 21.7% IPC Improvement, outperforming the hybrid of BO and ISB by 16.3%. -
Prefetching is a well-studied technique for addressing the memory access stall time of contemporary microprocessors. However, despite a large body of related work, the memory access behavior of applications is not well understood, and it remains difficult to predict whether a particular application will benefit from a given prefetcher technique. In this work we propose a novel methodology to classify the memory access patterns of applications, enabling well-informed reasoning about the applicability of a certain prefetcher. Our approach leverages instruction dataflow information to uncover a wide range of access patterns, including arbitrary combinations of offsets and indirection. These combinations or prefetch kernels represent reuse, strides, reference locality, and complex address generation. By determining the complexity and frequency of these access patterns, we enable reasoning about prefetcher timeliness and criticality, exposing the limitations of existing prefetchers today. Moreover, using these kernels, we are able to compute the next address for the majority of top-missing instructions, and we propose a software prefetch injection methodology that is able to outperform state-of-the-art hardware prefetchers.more » « less
-
The Von Neumann bottleneck is a persistent problem in computer architecture, causing stalls and wasted CPU cycles. The Van Neumann bottleneck is particularly relevant for memory-intensive workloads whose working set does not fit into the microprocessor’s cache and hence memory accesses suffer the high access latency of DRAM. One technique to address this bottleneck is to prefetch data from memory into on-chip caches. While prefetching has proven successful, for simple access patterns such as strides, existing prefetchers are incapable of providing benefit for applications with complex, irregular access patterns. A neural network-based prefetcher shows promise for these challenging workloads. We provide a better understanding of what type of memory access patterns an LSTM neural network can learn by training individual models on microbenchmarks with well-characterized memory access patterns. We explore a range of model parameters and provide a better understanding of what model is ideal to use. We achieve over 95% accuracy on the microbenchmarks and find a strong relationship between lookback (history window) size and the ability of the model to learn the pattern. We find also an upper limit on the number of concurrent distinct memory access streams that can be learned by a model of a given size.more » « less