

Title: Reducing Load Latency with Cache Level Prediction
High load latency that results from deep cache hierarchies and relatively slow main memory is an important limiter of single-thread performance. Data prefetching helps reduce this latency by fetching data up the hierarchy before it is requested by load instructions. However, prefetching has been shown to be imperfect in many situations. We propose cache-level prediction to complement prefetchers. Our method predicts which memory hierarchy level a load will access, allowing the load to be sent to that level earlier and thereby saving many cycles. The predictor provides high prediction accuracy at the cost of just one cycle of added latency on L1 misses. Level prediction reduces memory access latency by 20% on average, and provides a speedup of 10.3% over a conventional baseline and 6.1% over a boosted baseline on generic, graph, and HPC applications.
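For intuition only, the sketch below shows one way a per-load-PC table could predict the hierarchy level of the next access; the table organization, confidence threshold, and level names are illustrative assumptions, not the predictor design evaluated in the paper.

```python
# Minimal sketch of the cache-level-prediction idea: a per-PC table that
# remembers the level each load last hit, guarded by a small saturating
# confidence counter. All parameters here are illustrative assumptions.

class LevelPredictor:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = {}                   # index -> (predicted_level, confidence)

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        level, conf = self.table.get(self._index(pc), ("L1", 0))
        # Act on the prediction only when confidence is high; otherwise fall
        # back to the normal sequential L1 -> L2 -> LLC -> DRAM lookup.
        return level if conf >= 2 else None

    def update(self, pc, actual_level):
        idx = self._index(pc)
        level, conf = self.table.get(idx, (actual_level, 0))
        if level == actual_level:
            conf = min(conf + 1, 3)        # reinforce a correct prediction
        else:
            level, conf = actual_level, 0  # mispredict: retrain the entry
        self.table[idx] = (level, conf)

predictor = LevelPredictor()
predictor.update(0x400A10, "LLC")
predictor.update(0x400A10, "LLC")
print(predictor.predict(0x400A10))   # "LLC": the access can start at the LLC early
```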
Award ID(s):
1719061
NSF-PAR ID:
10340193
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 2022 IEEE International Symposium on High Performance Computer Architecture (HPCA)
Page Range / eLocation ID:
648 to 661
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    As multicore systems continue to grow in scale and on-chip memory capacity, on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol, and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache-line-granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory accesses could be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA and subsequent exposure to caches. We evaluate using a cycle-level, execution-driven gem5-based simulator with 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedups over an in-order and an OOO core with state-of-the-art prefetcher designs, respectively, with 64% and 49% energy-efficiency advantages.
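    For intuition only, the toy sketch below shows the stream-floating pattern described above: the core hands a stream descriptor to an engine sitting at a cache bank, and the engine generates the addresses itself; the descriptor fields and the engine interface are illustrative assumptions rather than the paper's ISA extensions.

```python
# Toy sketch of stream floating: address generation for a (base, stride,
# length) stream happens at a cache-bank stream engine instead of the core.
from dataclasses import dataclass

@dataclass
class StreamDescriptor:
    base: int      # starting address
    stride: int    # bytes between elements
    length: int    # number of elements

class CacheBankStreamEngine:
    def __init__(self):
        self.generated = []

    def float_stream(self, desc: StreamDescriptor):
        # The cache bank walks the stream on its own, so data can be pushed
        # toward the core before the core explicitly requests it.
        for i in range(desc.length):
            self.generated.append(desc.base + i * desc.stride)

engine = CacheBankStreamEngine()
engine.float_stream(StreamDescriptor(base=0x1000, stride=8, length=4))
print([hex(a) for a in engine.generated])   # ['0x1000', '0x1008', '0x1010', '0x1018']
```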
  2.

    Graph analytics shows promise for solving challenging problems on relational data. However, memory constraints arise from the large size of graphs and the high complexity of algorithms. Data prefetching is a crucial technique for hiding memory access latency by predicting and fetching data into the cache beforehand. Traditional prefetchers, with their fixed rules, struggle to adapt to the complex memory access patterns of graph analytics. Machine learning (ML) algorithms, particularly long short-term memory (LSTM) models, excel at memory access prediction. However, they encounter challenges such as difficulty in learning interleaved access patterns and high storage costs when predicting over a large memory address space. In addition, there remains a gap between designing a high-performance ML-based memory access predictor and developing an effective ML-based prefetcher for an existing memory system. In this work, we propose a novel Attention-based prefetching framework to accelerate graph analytics applications. To achieve high-performance memory access prediction, we propose A2P, a novel Attention-based memory Access Predictor for graph analytics. We use the multi-head self-attention mechanism to extract features from memory traces. We design a novel bitmap labeling method to collect future deltas within a spatial range, making interleaved patterns easier to learn. We introduce a novel super page concept, allowing the model to surpass physical page constraints. To integrate A2P into a memory system, we design a three-module prefetching framework composed of an existing memory hierarchy, a prefetch controller, and the predictor A2P. In addition, we propose a hybrid design that combines A2P with existing hardware prefetchers for higher prefetching performance. We evaluate A2P and the prefetching framework using the widely used GAP benchmark. Prediction experiments show that, for the top three predictions, A2P outperforms the widely used state-of-the-art LSTM-based model by 23.1% w.r.t. precision, 21.2% w.r.t. recall, and 10.4% w.r.t. coverage. Prefetching experiments show that A2P provides 18.4% IPC improvement on average, outperforming the state-of-the-art prefetchers BO by 17.2%, ISB by 15.0%, and Delta-LSTM by 10.9%. A hybrid prefetcher combining A2P and ISB achieves 21.7% IPC improvement, outperforming the hybrid of BO and ISB by 16.3%.
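    As a minimal illustration of the bitmap labeling idea mentioned above, the sketch below marks, for each access, every future delta that falls inside a small spatial window as a multi-hot label; the window size and lookahead depth are assumed values, and this is not the A2P implementation.

```python
# Sketch of bitmap labeling: instead of a single next-delta target, each access
# gets a bitmap of all future deltas within a spatial window, which makes
# interleaved patterns easier for a model to learn. Parameters are assumptions.

def bitmap_labels(block_addrs, window=8, lookahead=16):
    labels = []
    for i, cur in enumerate(block_addrs):
        bits = 0
        for nxt in block_addrs[i + 1 : i + 1 + lookahead]:
            delta = nxt - cur
            if 1 <= delta <= window:        # keep only deltas inside the window
                bits |= 1 << (delta - 1)    # one bit per candidate delta
        labels.append(bits)
    return labels

trace = [100, 101, 103, 104, 110, 111]       # cache-block addresses
print([bin(b) for b in bitmap_labels(trace)])
```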

  3. Computational storage adds computing to storage devices, providing potential benefits in offload, data reduction, and lower energy. Successful computational SSD architectures should match growing flash bandwidth, which in turn requires high SSD DRAM memory bandwidth. This creates a memory-wall scaling problem resulting from SSDs' stringent power and cost constraints. A survey of recent computational SSD research shows that many computational storage offloads are suited to stream computing. To exploit this opportunity, we propose a novel general-purpose computational SSD and core architecture, called ASSASIN (Architecture Support for Stream computing to Accelerate computatIoNal Storage). ASSASIN provides a unified set of compute engines between SSD DRAM and the flash array. This eliminates the SSD DRAM bottleneck by enabling direct computing on flash data streams. ASSASIN further employs a crossbar to maintain performance even when the flash data layout is uneven and to preserve the independence of page-layout decisions in the flash translation layer. With stream buffers and scratchpad memories, the ASSASIN core's memory hierarchy and instruction set extensions provide low-latency access at low power and effectively keep streaming flash data out of the in-SSD cache-DRAM memory hierarchy, thereby addressing the memory wall. Evaluation shows that ASSASIN delivers 1.5x - 2.4x speedup for offloaded functions compared to state-of-the-art computational SSD architectures. Further, ASSASIN's streaming approach yields 2.0x power-efficiency and 3.2x area-efficiency improvements. These performance benefits at the level of computational SSDs translate to 1.1x - 1.5x end-to-end speedups on data analytics workloads.
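    For illustration only, the sketch below shows the kind of stream-computing pattern such an architecture targets: an offloaded function consumes flash pages as a stream instead of staging them in SSD DRAM; the page source and the offloaded filter are assumptions, not ASSASIN's actual engines or ISA extensions.

```python
# Sketch of streaming an offloaded function over flash pages: only filtered
# results, not raw pages, need to travel further up the hierarchy.

def flash_page_stream(pages):
    # Yield pages one at a time, as if read directly from the flash channels.
    for page in pages:
        yield page

def offloaded_filter(page_stream, predicate):
    # The compute engine applies the offloaded function on the fly.
    for page in page_stream:
        yield [record for record in page if predicate(record)]

pages = [[1, 5, 9], [2, 12, 7], [30, 3, 8]]
result = list(offloaded_filter(flash_page_stream(pages), lambda x: x > 6))
print(result)   # [[9], [12, 7], [30, 8]]
```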
  4. Applications in the cloud are vulnerable to several attack scenarios. In one possibility, an untrusted cloud operator can examine addresses on the memory bus and use this information leak to violate privacy guarantees, even if data is encrypted. The Oblivious RAM (ORAM) construct was introduced to eliminate such information leak and these frameworks have seen many innovations in recent years. In spite of these innovations, the overhead associated with ORAM is very significant. This paper takes a step forward in reducing ORAM memory bandwidth overheads. We make the case that, similar to a cache hierarchy, a lightweight ORAM that fronts the full-fledged ORAM provides a boost in efficiency. The lightweight ORAM has a smaller capacity and smaller depth, and it can relax some of the many constraints imposed on the full-fledged ORAM. This yields a 2-level hierarchy with a relaxed ORAM and a full ORAM. The relaxed ORAM adopts design parameters that are optimized for efficiency and not capacity. We introduce a novel metadata management technique to further reduce the bandwidth for relaxed ORAM access. Relaxed ORAM accesses preserve the indistinguishability property and are equipped with an integrity verification system. Finally, to eliminate information leakage through LLC and relaxed ORAM hit rates, we introduce a deterministic memory scheduling policy. On a suite of memory-intensive applications, we show that the best Relaxed Hierarchical ORAM (ρ) model yields a performance improvement of 50%, relative to a Freecursive ORAM baseline. 
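    A highly simplified sketch of the two-level lookup flow follows; real ORAM mechanics (path reads and writes, the stash, the position map, integrity verification) are abstracted away, and the class names and per-access costs are illustrative assumptions.

```python
# Sketch of a relaxed ORAM fronting a full ORAM: hits in the small front level
# avoid the expensive full-ORAM access. Costs are illustrative block counts.

class TwoLevelOram:
    def __init__(self, data, front_capacity=64):
        self.front = {}                # relaxed ORAM: small, shallow, cheaper
        self.front_capacity = front_capacity
        self.back = dict(data)         # full-fledged ORAM
        self.traffic = 0               # crude stand-in for bandwidth (blocks moved)

    def access(self, block_id):
        if block_id in self.front:
            self.traffic += 4          # shorter path in the shallow tree
            return self.front[block_id]
        self.traffic += 16             # full-ORAM path
        value = self.back.get(block_id)
        if len(self.front) < self.front_capacity:
            self.front[block_id] = value   # promote into the relaxed level
        return value

oram = TwoLevelOram({i: i * i for i in range(1024)})
for block in [3, 7, 3, 3, 7]:
    oram.access(block)
print(oram.traffic)   # 44: repeated blocks are served by the relaxed level
```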
  5. Persistent memory presents a great opportunity for crash-consistent computing in large-scale computing systems. The ability to recover data upon a power outage or crash can significantly improve the availability of large-scale systems, while improving the performance of persistent-data applications (e.g., database applications). However, persistent memory suffers from high write latency and requires a specific programming model (e.g., Intel's PMDK) to guarantee crash consistency, which results in long latency to persist data. To mitigate these problems, recent standards advocate for sufficient back-up power to flush the whole cache hierarchy to persistent memory upon detection of an outage, i.e., extending the persistence domain to include the cache hierarchy. In a secure NVM with an extended persistence domain (EPD), in addition to flushing the cache hierarchy, extra actions must be taken to protect the flushed cache data. These extra secure operations can place a significant burden on energy costs and battery size. We demonstrate that naive implementations can significantly expand the required power hold-up budget (e.g., 10.3x more operations than an EPD system without secure-memory support). This overhead is caused by memory accesses for secure metadata. In this paper, we present Horus, a novel EPD-aware secure memory implementation. Horus reduces the overhead of the EPD draining period by reducing memory accesses for secure metadata. Experimental results show that Horus reduces draining time by 5x compared with the naive baseline design.
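    As a back-of-the-envelope illustration of why secure-metadata accesses dominate the drain, the sketch below counts operations for a naive per-line scheme versus one that coalesces metadata updates per counter block; the per-line costs and the coalescing variant are illustrative assumptions, not Horus's mechanism.

```python
# Rough drain-cost accounting: flushing each dirty line to secure NVM also
# touches its encryption-counter block and integrity-tree nodes. All costs
# below are assumed values for illustration.
import math

def drain_ops(dirty_lines, tree_levels=4, lines_per_counter_block=64, coalesce=False):
    data_writes = dirty_lines
    if coalesce:
        # One counter/tree update per counter block shared by many lines.
        blocks = math.ceil(dirty_lines / lines_per_counter_block)
        metadata_ops = blocks * (1 + tree_levels)
    else:
        metadata_ops = dirty_lines * (1 + tree_levels)   # naive: per line
    return data_writes + metadata_ops

print(drain_ops(4096))                  # 24576 ops with per-line metadata
print(drain_ops(4096, coalesce=True))   # 4416 ops with coalesced metadata
```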