This content will become publicly available on March 1, 2026

Title: PIMnet: A Domain-Specific Network for Efficient Collective Communication in Scalable PIM
Processing-in-memory (PIM), where compute is moved closer to memory or data, has been explored to accelerate emerging workloads. Different PIM-based systems have been announced, each offering a unique microarchitectural organization of its compute units, ranging from fixed functional units to programmable general-purpose compute cores near memory. However, one fundamental limitation of PIM is that each compute unit can only access its local memory; access to "remote" memory must occur through the host CPU, potentially limiting application performance scalability. In this work, we first characterize the scalability of real PIM architectures using the UPMEM PIM system. We analyze how the overhead of communicating through the host (instead of providing direct communication between the PIM compute units) can become a bottleneck for the collective communications that are common in many workloads. To overcome this inter-PIM-bank communication bottleneck, we propose PIMnet, a PIM interconnection network for PIM banks that provides direct connectivity between compute units and removes the overhead of communicating through the host. PIMnet exploits bandwidth parallelism, allowing communication across the different PIM banks/chips to occur in parallel to maximize communication performance. PIMnet also matches the DRAM packaging hierarchy with a multi-tier network architecture. Unlike traditional interconnection networks, PIMnet is a PIM-controlled network where communication is managed by the PIM logic, optimizing collective communications and minimizing the hardware overhead of PIMnet. Our evaluation of PIMnet shows that it provides up to 85× speedup on collective communications and achieves an 11.8× improvement on real applications compared to the baseline PIM.
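To make the bandwidth argument concrete, the sketch below contrasts a host-relayed AllReduce, where all traffic serializes on the shared host link, with a direct inter-bank ring AllReduce that uses every per-bank link in parallel. This is a back-of-envelope cost model with assumed bandwidth figures, not the paper's methodology; the function names and constants are illustrative.

```python
# Hypothetical cost model: host-relayed AllReduce vs. a direct inter-bank
# ring AllReduce. Bandwidth figures and function names are illustrative
# assumptions, not measurements from the paper.

def host_relayed_allreduce_time(n_banks, bytes_per_bank, host_bw):
    # All traffic funnels through the host CPU, so it serializes on the
    # single shared link: gather + broadcast each move n_banks * size bytes.
    return 2 * n_banks * bytes_per_bank / host_bw

def direct_ring_allreduce_time(n_banks, bytes_per_bank, link_bw):
    # With per-bank links, a ring AllReduce moves 2*(n-1)/n * size bytes
    # per bank, and all links transfer in parallel.
    return 2 * (n_banks - 1) / n_banks * bytes_per_bank / link_bw

n, size = 2048, 1 << 20      # 2048 banks, 1 MiB of partial results each
host_bw = 25e9               # assumed ~25 GB/s shared host link
link_bw = 1e9                # assumed ~1 GB/s per-bank link
t_host = host_relayed_allreduce_time(n, size, host_bw)
t_ring = direct_ring_allreduce_time(n, size, link_bw)
print(f"host-relayed: {t_host*1e3:.0f} ms  direct ring: {t_ring*1e3:.1f} ms  "
      f"ratio: {t_host/t_ring:.0f}x")
```

With these assumed numbers the direct network comes out roughly two orders of magnitude faster, which is consistent in spirit with the collective-communication speedups the abstract reports.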
Award ID(s):
2312275 2312276
PAR ID:
10612701
Publisher / Repository:
IEEE
Date Published:
Journal Name:
Proceedings
ISSN:
2378-203X
ISBN:
979-8-3315-0647-6
Page Range / eLocation ID:
1557 to 1572
Format(s):
Medium: X
Location:
Las Vegas, NV, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Processing-in-memory (PIM), where compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many "PIM nodes," the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while PIM has benefits, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative collective communication patterns: AllReduce and All-to-All. Specifically, we evaluate 1) embedding tables, commonly used in recommendation systems, which require AllReduce, and 2) the Number Theoretic Transform (NTT) kernel, a critical component of Fully Homomorphic Encryption (FHE), which requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, any communication between PIM nodes (or access to remote data) must be done through the host CPU, severely hampering application performance. To increase the scalability (or applicability) of PIM for future workloads, we make the case that future PIM architectures need efficient communication or interconnection networks between the PIM nodes, requiring both hardware and software support.
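The round trip described above is easy to picture in code. The following toy simulation (plain NumPy, not the UPMEM SDK) shows why host-relayed AllReduce scales poorly: every partial result crosses the shared host bus twice, and the host does all of the reduction work while the PIM nodes sit idle.

```python
import numpy as np

# Toy simulation of AllReduce on a PIM system where compute units can only
# touch their local banks: the host gathers, reduces, and broadcasts.
# Buffer names are hypothetical; this is not the UPMEM SDK API.

def host_relayed_allreduce(local_buffers):
    gathered = np.stack(local_buffers)            # phase 1: host gathers all
    total = gathered.sum(axis=0)                  # phase 2: host reduces alone
    return [total.copy() for _ in local_buffers]  # phase 3: host broadcasts

nodes = [np.random.rand(4) for _ in range(8)]     # 8 PIM nodes, tiny vectors
reduced = host_relayed_allreduce(nodes)
assert all(np.allclose(r, np.sum(nodes, axis=0)) for r in reduced)
```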
  2. Today’s Deep Neural Network (DNN) inference systems contain hundreds of billions of parameters, resulting in significant latency and energy overheads during inference due to frequent data transfers between compute and memory units. Processing-in-Memory (PiM) has emerged as a viable solution to tackle this problem by avoiding the expensive data movement. PiM approaches based on electrical devices suffer from throughput and energy efficiency issues. In contrast, Optically-addressed Phase Change Memory (OPCM) operates with light and achieves much higher throughput and energy efficiency compared to its electrical counterparts. This paper introduces a system-level design that takes the OPCM programming overhead into consideration, and identifies that the programming cost dominates the DNN inference on OPCM-based PiM architectures. We explore the design space of this system and identify the most energy-efficient OPCM array size and batch size. We propose a novel thresholding and reordering technique on the weight blocks to further reduce the programming overhead. Combining these optimizations, our approach achieves up to 65.2x higher throughput than existing photonic accelerators for practical DNN workloads. 
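As a rough illustration of the thresholding-and-reordering idea, the sketch below greedily orders weight blocks so that consecutive blocks differ little, and skips reprogramming cells whose change falls below a threshold. The greedy heuristic, threshold rule, and synthetic blocks are our assumptions for illustration; the paper's actual technique operates on OPCM-specific block layouts.

```python
import random
import numpy as np

# Sketch of thresholding + reordering to cut OPCM programming cost:
# order blocks so neighbors are similar, rewrite only cells that change
# by more than `threshold`. Illustrative, not the paper's exact algorithm.

def reorder_blocks(blocks):
    # Greedy nearest-neighbor ordering by L1 distance between blocks.
    remaining = list(range(len(blocks)))
    order = [remaining.pop(0)]
    while remaining:
        cur = blocks[order[-1]]
        nxt = min(remaining, key=lambda i: np.abs(blocks[i] - cur).sum())
        remaining.remove(nxt)
        order.append(nxt)
    return order

def rewrites_needed(blocks, order, threshold=0.05):
    # Count cells that actually have to be reprogrammed.
    state, total = np.zeros_like(blocks[0]), 0
    for i in order:
        changed = np.abs(blocks[i] - state) > threshold
        total += int(changed.sum())
        state = np.where(changed, blocks[i], state)  # skipped cells keep old values
    return total

rng = np.random.default_rng(0)
bases = [rng.random((16, 16)) for _ in range(4)]
blocks = [b + 0.02 * rng.standard_normal((16, 16)) for b in bases for _ in range(8)]
random.seed(0)
random.shuffle(blocks)
print("naive :", rewrites_needed(blocks, list(range(len(blocks)))))
print("greedy:", rewrites_needed(blocks, reorder_blocks(blocks)))
```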
  3. The performance of today's in-memory indexes is bottlenecked by the memory latency/bandwidth wall. Processing-in-memory (PIM) is an emerging approach that potentially mitigates this bottleneck by enabling low-latency memory access whose aggregate memory bandwidth scales with the number of PIM nodes. There is an inherent tension, however, between minimizing inter-node communication and achieving load balance in PIM systems in the presence of workload skew. This paper presents PIM-tree, an ordered index for PIM systems that achieves both low communication and high load balance, regardless of the degree of skew in data and queries. Our skew-resistant index is based on a novel division of labor between the host CPU and PIM nodes, which leverages the strengths of each. We introduce push-pull search, which dynamically decides whether to push queries to a PIM-tree node or pull the node's keys back to the CPU based on workload skew. Combined with other PIM-friendly optimizations (shadow subtrees and chunked skip lists), our PIM-tree provides high throughput, (guaranteed) low communication, and (guaranteed) high load balance for batches of point queries, updates, and range scans. We implement PIM-tree, in addition to prior proposed PIM indexes, on the latest PIM system from UPMEM, with 32 CPU cores and 2048 PIM nodes. On workloads with 500 million keys and batches of 1 million queries, the throughput using PIM-tree is up to 69.7x and 59.1x higher than the two best prior PIM-based methods. As far as we know, these are the first implementations of an ordered index on a real PIM system.
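The push-pull decision itself can be captured in a few lines. The sketch below is a simplified version of the idea, assuming communication cost is proportional to the number of items shipped; the paper's actual policy, guarantees, and data structures are more involved.

```python
# Simplified push-pull routing: for each PIM-tree node in a batch, compare
# the cost of pushing the node's queries down against pulling its keys up.
# Costs proportional to item counts are an assumption of this sketch.

def route_batch(node_key_count: int, num_queries: int,
                pull_factor: float = 1.0) -> str:
    """Return 'pull' when shipping the node's keys to the CPU is cheaper
    than shipping a (possibly skewed) pile of queries to the node."""
    push_cost = num_queries                    # queries sent to the PIM node
    pull_cost = pull_factor * node_key_count   # keys shipped back to the CPU
    return "pull" if pull_cost < push_cost else "push"

# A hot node holding 64 keys but hit by 10,000 queries is cheaper to pull;
# a cold node seeing only 3 queries is cheaper to push to.
print(route_batch(64, 10_000))  # -> pull
print(route_batch(64, 3))       # -> push
```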
  4. Memory latency and bandwidth are significant bottlenecks in designing in-memory indexes. Processing-in-memory (PIM), an emerging hardware design approach, alleviates this problem by embedding processors in memory modules, enabling low-latency memory access whose aggregated bandwidth scales linearly with the number of PIM modules. Despite recent work on balanced comparison-based indexes for PIM systems, building efficient tries for PIMs remains an open challenge due to tries' inherently unbalanced shape. This paper presents PIM-trie, the first batch-parallel radix-based index for PIM systems that provides load balance and low communication under adversary-controlled workloads. We introduce trie matching, which matches a query trie built from a batch against the compressed data trie, as a key building block for PIM-friendly index operations. Our algorithm combines (i) hash-based comparisons for coarse-grained work distribution/elimination and (ii) bit-by-bit comparisons for fine-grained matching. Combined with other techniques (meta-block decomposition, selective recursive replication, differentiated verification), PIM-trie supports LongestCommonPrefix, Insert, and Delete in O(log P) communication rounds per batch and O(l/w) communication volume per string, where P is the number of PIM modules, l is the string length in bits, and w is the machine word size. Moreover, work and communication are load-balanced among modules with high probability, even under worst-case skew.
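A minimal sketch of the two-level matching idea, coarse chunk-level filtering followed by bit-by-bit comparison inside the first differing chunk, is shown below. Encoding bit strings as Python ints and the chunking scheme are assumptions made for the illustration, not PIM-trie's actual layout.

```python
# Two-level matching in the spirit of PIM-trie: a coarse hash comparison on
# word-sized chunks (to skip equal prefixes cheaply), then bit-by-bit work
# only inside the first differing chunk.

W = 64  # machine word size in bits (the paper's `w`)

def lcp_bits(a: int, b: int, length: int) -> int:
    """Longest common prefix, in bits, of two `length`-bit strings."""
    matched = 0
    for off in range(0, length, W):
        hi = min(off + W, length)
        mask = (1 << (hi - off)) - 1
        ca = (a >> (length - hi)) & mask   # chunk [off, hi) of a, MSB-first
        cb = (b >> (length - hi)) & mask
        if hash(ca) == hash(cb) and ca == cb:   # cheap filter, then confirm
            matched += hi - off                 # whole chunk agrees, move on
            continue
        diff = ca ^ cb                              # first differing chunk:
        matched += (hi - off) - diff.bit_length()   # count leading equal bits
        break
    return matched

print(lcp_bits(0b1011_0110_1011_0001, 0b1011_0110_1111_0001, 16))  # -> 9
```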
  5. With the rapid advancement of DNNs, numerous Processing-in-Memory (PIM) architectures based on various memory technologies (non-volatile (NVM) and volatile memory) have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies, such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM), have inherent limitations. STT-MRAM faces high write current requirements, while SOT-MRAM introduces significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor across multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference, with an application in an on-device session-based recommender system. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems on the popular YooChoose dataset, it demonstrates 86×, 1.8×, and 1.12× higher energy efficiency than GPU, SRAM-PIM, and ReRAM-PIM implementations, respectively.
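A quick sanity check ties the two headline numbers together: at the reported throughput and energy efficiency, the implied power draw of the macro is on the order of tens of milliwatts. The derivation below uses only figures quoted in the abstract; the implied power is our arithmetic, not a number the paper states.

```python
# Figures quoted in the abstract; the implied power is our derivation.
throughput_ops_per_s = 269e9      # 269 GOPS of FP8 operations at 700 MHz
efficiency_ops_per_j = 15.31e12   # 15.31 TOPS/W, i.e., operations per joule

implied_power_w = throughput_ops_per_s / efficiency_ops_per_j
print(f"implied macro power: {implied_power_w * 1e3:.1f} mW")  # ~17.6 mW
```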