skip to main content

Title: A Survey of Resource Management for Processing-In-Memory and Near-Memory Processing Architectures
Due to the amount of data involved in emerging deep learning and big data applications, operations related to data movement have quickly become a bottleneck. Data-centric computing (DCC), as enabled by processing-in-memory (PIM) and near-memory processing (NMP) paradigms, aims to accelerate these types of applications by moving the computation closer to the data. Over the past few years, researchers have proposed various memory architectures that enable DCC systems, such as logic layers in 3D-stacked memories or charge-sharing-based bitwise operations in dynamic random-access memory (DRAM). However, application-specific memory access patterns, power and thermal concerns, memory technology limitations, and inconsistent performance gains complicate the offloading of computation in DCC systems. Therefore, designing intelligent resource management techniques for computation offloading is vital for leveraging the potential offered by this new paradigm. In this article, we survey the major trends in managing PIM and NMP-based DCC systems and provide a review of the landscape of resource management techniques employed by system designers for such systems. Additionally, we discuss the future challenges and opportunities in DCC management.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Journal of Low Power Electronics and Applications
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Secure architectures are becoming an increasingly common demand. This is due in large part to the rise of cloud computing, as users are trusting their data with hardware that they do not own. Unfortunately, many secure computation and isolation techniques are still susceptible to side-channel attacks. While various defenses to side-channel attacks exist, each tends to be targeted to a specific vulnerability and comes with a high runtime overhead, making it difficult to combine these defenses together in a performant manner.This work proposes an efficient design for preventing a large range of cache side-channel attacks by leveraging a near-memory processing (NMP) architecture. Specifically, the proposed design stores all sensitive data in isolated NMP vaults and performs all computation involving that sensitive data on NMP cores. Our approach eliminates possible cache side-channels while also minimizing runtime overhead when the parallelizability of NMP architecture is leveraged. Simulation results from a cycle-accurate architecture model shows that offloading secure computation to NMP cores can have as little as 0.26% overhead. 
    more » « less
  2. The slowdown of Moore’s Law, combined with advances in 3D stacking of logic and memory, have pushed architects to revisit the concept of processing-in-memory (PIM) to overcome the memory wall bottleneck. This PIM renaissance finds itself in a very different computing landscape from the one twenty years ago, as more and more computation shifts to the cloud. Most PIM architecture papers still focus on best-effort applications, while PIM’s impact on latency-critical cloud applications is not well understood. This paper explores how datacenters can exploit PIM architectures in the context of latency-critical applications. We adopt a general-purpose cloud server with HBM-based, 3D-stacked logic+memory modules, and study the impact of PIM on six diverse interactive cloud applications. We reveal the previously neglected opportunity that PIM presents to these services, and show the importance of properly managing PIM-related resources to meet the QoS targets of interactive services and maximize resource efficiency. Then, we present PIMCloud, a QoS-aware resource manager designed for cloud systems with PIM allowing colocation of multiple latency-critical and best-effort applications. We show that PIMCloud efficiently manages PIM resources: it (1) improves effective machine utilization by up to 70% and 85% (average 24% and 33%) under 2-app and 3-app mixes, compared to the best state-of-the-art manager; (2) helps latency-critical applications meet QoS; and (3) adapts to varying load patterns. 
    more » « less
  3. Memory latency and bandwidth are significant bottlenecks in designing in-memory indexes. Processing-in-memory (PIM), an emerging hardware design approach, alleviates this problem by embedding processors in memory modules, enabling low-latency memory access whose aggregated bandwidth scales linearly with the number of PIM modules. Despite recent work in balanced comparison-based indexes on PIM systems, building efficient tries for PIMs remains an open challenge due to tries' inherently unbalanced shape. This paper presents the PIM-trie, the first batch-parallel radix-based index for PIM systems that provides load balance and low communication under adversary-controlled workloads. We introduce trie matching-matching a query trie of a batch against the compressed data trie-as a key building block for PIM-friendly index operations. Our algorithm combines (i) hash-based comparisons for coarse-grained work distribution/elimination and (ii) bit-by-bit comparisons for fine-grained matching. Combined with other techniques (meta-block decomposition, selective recursive replication, differentiated verification), PIM-trie supports LongestCommonPrefix, Insert, and Delete in O(logP) communication rounds per batch and O(l/w) communication volume per string, where P is the number of PIM modules, l is the string length in bits, and w is the machine word size. Moreover, work and communication are load-balanced among modules whp, even under worst-case skew. 
    more » « less
  4. Generative Adversarial Network (GAN) has emerged as one of the most promising semi-supervised learning methods where two neural nets train themselves in a competitive environment. In this paper, as far as we know, we are the first to present a statistically trained Ternarized Generative Adversarial Network (TGAN) with fully ternarized weights (i.e. -1,0,+1) to massively reduce the need for computation and storage resources in the conventional GAN structures. In the proposed TGAN, the computationally expensive convolution operations (i.e. Multiplication and Accumulation) in both generator and discriminator's forward path are converted into hardware-friendly Addition/Subtraction operations. Accordingly, we propose a Processing-in-Memory accelerator for TGAN called (PIM-TGAN) based on Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) computational sub-arrays to efficiently accelerate the training process of GAN within non-volatile memory. In addition, we propose a parallelism technique to further enhance the training efficiency of TGAN. Our device-to-architecture co-simulation results show that, with almost the same inception score to the baseline GAN with floating point number weights on different data-sets, the proposed PIM-TGAN can obtain ~25.6× better energy-efficiency and 22× speedup compared to GPU platform averagely, and, 9.2× better energy-efficiency and 5.4× speedup over the best processing-in-ReRAM accelerators. 
    more » « less
  5. Artificial Intelligence (AI) is moving towards the edge. Training an AI model for edge computing on a centralized server increases latency, and the privacy of edge users is jeopardized due to private data transfer through a less secure communication channels. Additionally, existing high-power computing systems are battling with memory and data transfer bottlenecks between the processor and memory. Federated Learning (FL) is a collaborative AI learning paradigm for distributed local devices that operates without transferring local data. Local participant devices share the updated network parameters with the central server instead of sending the original data. The central server updates the global AI model and deploys the model to the local clients. As the local data resides only on the edge, these devices need to be protected from cyberattacks. The Federated Intrusion Detection System (FIDS) could be a viable system to protect edge devices as opposed to a centralized protection system. However, on-device training of the model in resource constrained devices may suffer from excessive power drain, in addition to memory and area overhead. In this work we present a memristor based system for AI training on edge devices. Memristor devices are ideal candidates for processing in memory, as their dynamic resistance properties allow them to perform multiply-add operations in parallel in the analog domain with extreme efficiency. Alternatively, existing CMOS-based PIM systems are typically developed for edge inference based on pretrained weights, and are not equipped for on-chip training. We show the effectiveness of the system, where successful learning and recognition is achieved completely within edge devices. The classification accuracy of the memristor system shows negligible loss when compared a software implementation. To the best of our knowledge, this first demonstration of a memristor based federated learning system. We demonstrate the effectiveness of this system as an intrusion detection platform for edge devices, although given the flexibility of the learning algorithm, it could be used to enhance many types of on board leaning and classification applications. 
    more » « less