Title: Towards Application-Specific Address Mapping for Emerging Memory Devices
Recent advancements in 3D-stacked DRAM such as hybrid memory cube (HMC) and high-bandwidth memory (HBM) promise higher bandwidth and lower power consumption compared to traditional DDR-based DRAM. However, taking advantage of this additional bandwidth to improve the performance of real-world applications requires carefully laying out the data in memory, which incurs significant programmer effort. To alleviate this programmer burden, we investigate application-specific address mapping to improve performance while minimizing manual effort. Our approach is guided by the following insights: (i) the toggling activity of address bits can help determine strategies to improve parallelism within memory, but this metric underestimates conflicts, and (ii) modern memory controllers reorder address requests, and therefore any toggling activity measured from an address trace is non-deterministic. Furthermore, our position is that analyzing individual address bits results in poor estimates of actual conflicts and exploited parallelism, and that entropy needs to be calculated for groups of address bits. Therefore, we calculate window-based probabilistic entropy for groups of address bits to determine a near-optimal address mapping. We present simulation results for ten applications that show that our proposed approach improves performance by up to 25% over fixed address mapping and by up to 8% over previous application-specific address mapping.
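To make the idea of entropy over groups of address bits concrete, the following is a minimal sketch and not the authors' implementation. It assumes a trace given as a list of integer addresses, uses plain Shannon entropy of the joint value of the selected bits, and averages it over non-overlapping fixed-size windows. The function names (`extract_bits`, `group_entropy`, `windowed_entropy`), the window scheme, and the averaging step are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def extract_bits(addr, bit_positions):
    """Pack the selected address bits into one integer (first position = LSB)."""
    value = 0
    for i, pos in enumerate(bit_positions):
        value |= ((addr >> pos) & 1) << i
    return value


def group_entropy(addresses, bit_positions):
    """Shannon entropy (in bits) of the joint value of a group of address bits."""
    if not addresses:
        return 0.0
    counts = Counter(extract_bits(a, bit_positions) for a in addresses)
    total = len(addresses)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def windowed_entropy(trace, bit_positions, window):
    """Average group entropy over consecutive, non-overlapping full windows.

    Assumption for this sketch: a trailing partial window is ignored.
    """
    chunks = [trace[i:i + window] for i in range(0, len(trace), window)]
    chunks = [c for c in chunks if len(c) == window]
    if not chunks:
        return 0.0
    return sum(group_entropy(c, bit_positions) for c in chunks) / len(chunks)


if __name__ == "__main__":
    # Hypothetical trace: sequential 64-byte accesses.
    trace = [i * 64 for i in range(16)]
    print(windowed_entropy(trace, [6, 7], window=4))  # bits that vary in every window
    print(windowed_entropy(trace, [0, 1], window=4))  # bits that never change
```

In this sketch, the entropy of a window depends only on which values occur in it and not on their order, so reordering requests inside a window leaves the result unchanged. A group of bits with high windowed entropy would be a plausible candidate for selecting channels or banks, but how the paper turns entropy into a mapping is not specified in the abstract.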
Award ID(s):
1828105
PAR ID:
10347486
Journal Name:
International Symposium on Memory Systems (MEMSYS)
Page Range / eLocation ID:
105 to 113
Sponsoring Org:
National Science Foundation
More Like this
  1. As FPGA-based accelerators become ubiquitous and more powerful, the demand for integration with High-Performance Memory (HPM) grows. Although HPMs offer a much greater bandwidth than standard DDR4 DRAM, they introduce new design challenges such as increased latency and higher bandwidth mismatch between memory and FPGA cores. This paper presents a scalable architecture for convolutional neural network accelerators conceived specifically to address these challenges and make full use of the memory's high bandwidth. The accelerator, which was designed using high-level synthesis, is highly configurable. The intrinsic parallelism of its architecture allows near-perfect scaling up to saturating the available memory bandwidth.
  2. Row hammer attacks exploit electrical interactions between neighboring memory cells in high-density dynamic random-access memory (DRAM) to induce memory errors. By rapidly and repeatedly accessing DRAMs with specific patterns, an adversary with limited privilege on the target machine may trigger bit flips in memory regions that he has no permission to access directly. In this paper, we explore row hammer attacks in cross-VM settings, in which a malicious VM exploits bit flips induced by row hammer attacks to crack memory isolation enforced by virtualization. To do so with high fidelity, we develop novel techniques to determine the physical address mapping in DRAM modules at runtime (to improve the effectiveness of double-sided row hammer attacks), methods to exhaustively hammer a large fraction of physical memory from a guest VM (to collect exploitable vulnerable bits), and innovative approaches to break Xen paravirtualized memory isolation (to access arbitrary physical memory of the shared machine). Our study also suggests that the demonstrated row hammer attacks are applicable in modern public clouds where Xen paravirtualization technology is adopted. This shows that the presented cross-VM row hammer attacks are of practical importance. 
  3. Decoder-only Transformer models such as Generative Pre-trained Transformers (GPT) have demonstrated exceptional performance in text generation by autoregressively predicting the next token. However, the efficiency of running GPT on current hardware systems is bounded by a low compute-to-memory ratio and high memory access. In this work, we propose a Process-in-memory (PIM) GPT accelerator, PIM-GPT, which achieves end-to-end acceleration of GPT inference with high performance and high energy efficiency. PIM-GPT leverages DRAM-based PIM designs for executing multiply-accumulate (MAC) operations directly in the DRAM chips, eliminating the need to move matrix data off-chip. Non-linear functions and data communication are supported by an application-specific integrated chip (ASIC). At the software level, mapping schemes are designed to maximize data locality and computation parallelism. Overall, PIM-GPT achieves 41–137× and 631–1074× speedup and 123–383× and 320–602× energy efficiency over GPU and CPU baselines, respectively, on 8 GPT models with up to 1.4 billion parameters.
  4. We describe a fast, abstract method for reverse engineering (RE) field programmable gate array (FPGA) look-up-tables (LUTs). Our method has direct applications to hardware (HW) metering and FPGA fingerprinting, and our approach allows easy portability and application to most LUT-based FPGAs. Unlike conventional RE methodologies that rely on vendor-specific code (like Xilinx XDL), tools, configuration files, components, etc., our methodology is not dependent on any specific FPGA or FPGA computer aided design (CAD) tool. We use generic hardware description language (HDL) code based on specially connected CASE statements to program the LUTs on a target FPGA. Our specially connected CASE statements allow us to guide placement of LUT functions on successive synthesis runs. This enables us to quickly determine which bits in the FPGA's configuration file correspond to FPGA LUT bits. After we know which bits are LUT bits, we can go further and match specific LUT bits to specific bits in the configuration file, thereby creating a one-to-one mapping between every LUT memory cell and its matching bit in the configuration file. In this paper we present our CASE statement functions for performing one-to-one mapping of all FPGA LUT memory cell bits to specific configuration file bits. We have successfully applied our methods to several 7000 series Xilinx and Intel (Altera) FPGAs.
  5. Non-volatile memory (NVRAM) based on phase-change memory (such as Optane DC Persistent Memory Module) is making its way into Intel servers to address the needs of emerging applications that have a huge memory footprint. These systems have both DRAM and NVRAM on the same memory channel, with the smaller-capacity DRAM serving as a cache to the larger-capacity NVRAM in the so-called 2LM mode. In this work we analyze the performance of such DRAM caches on real hardware using a broad range of synthetic and real-world benchmarks. We identify three key limitations of DRAM caches in these emerging systems which prevent large-scale, bandwidth-bound applications from taking full advantage of NVRAM read and write bandwidth. We show that software-based techniques are necessary for orchestrating the data movement between DRAM and PMM for such workloads to take full advantage of these new heterogeneous memory systems.