We study the relationship between memory accesses, bank conflicts, thread multiplicity (also known as over-subscription) and instruction-level parallelism in comparison-based sorting algorithms for Graphics Processing Units (GPUs). We experimentally validate a proposed formula that relates these parameters with asymptotic analysis of the number of memory accesses by an algorithm. Using this formula we analyze and compare several GPU sorting algorithms, identifying key performance bottlenecks in each one of them. Based on this analysis we propose a GPU-efficient multiway mergesort algorithm, GPU-MMS, which minimizes or eliminates these bottlenecks and balances various limiting factors for specific hardware. We realize an implementation of GPU-MMS and compare it to sorting algorithm implementations in state-of-the-art GPU libraries on three GPU architectures. Despite these library implementations being highly optimized, we find that GPU-MMS outperforms them by an average of 21% for random integer inputs and 14% for random key-value pairs.
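As a rough illustration of the reasoning above, the sketch below performs a sequential k-way merge of sorted runs with a binary min-heap: merging k runs per pass means roughly log_k(n/M) passes over the data instead of log_2(n/M) for pairwise merging, which is the memory-access saving a multiway mergesort targets. This is a minimal host-side sketch under assumed inputs, not the GPU-MMS kernel; the function name kWayMerge and the run layout are illustrative.

```cpp
// Minimal sequential k-way merge sketch (illustrative only; not the GPU-MMS kernel).
// Merging k sorted runs per pass cuts the number of passes over the data from
// about log2(n/M) to log_k(n/M), which is the effect a multiway mergesort exploits.
#include <cstdio>
#include <functional>
#include <queue>
#include <tuple>
#include <vector>

std::vector<int> kWayMerge(const std::vector<std::vector<int>>& runs) {
    // Heap entry: (current value, run index, position within that run).
    using Entry = std::tuple<int, size_t, size_t>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> heap;

    for (size_t r = 0; r < runs.size(); ++r)
        if (!runs[r].empty()) heap.emplace(runs[r][0], r, 0);

    std::vector<int> out;
    while (!heap.empty()) {
        auto [val, r, i] = heap.top();   // smallest head element among all runs
        heap.pop();
        out.push_back(val);
        if (i + 1 < runs[r].size()) heap.emplace(runs[r][i + 1], r, i + 1);
    }
    return out;
}

int main() {
    std::vector<std::vector<int>> runs = {{1, 4, 9}, {2, 3, 8}, {0, 5, 7}};
    for (int v : kWayMerge(runs)) std::printf("%d ", v);  // 0 1 2 3 4 5 7 8 9
    std::printf("\n");
    return 0;
}
```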
This content will become publicly available on July 31, 2026
Sorting it out in Hardware: A State-of-the-Art Survey
Sorting is a fundamental operation in various applications and a traditional research topic in computer science. Improving the performance of sorting operations can have a significant impact on many application domains. Much attention has been paid to hardware-based solutions for high-performance sorting. These are often realized with application-specific integrated circuits (ASICs) or field-programmable gate arrays (FPGAs). Recently, in-memory sorting solutions have also been proposed to address the cost of data movement between memory and processing units, also known as the von Neumann bottleneck. Due to the complexity of sorting algorithms, achieving an efficient hardware implementation for sorting data is challenging. A large body of prior solutions is built on compare-and-swap (CAS) units; these are categorized as comparison-based sorting. Some recent solutions offer comparison-free sorting. In this survey, we review the latest works in the area of hardware-based sorting. We also discuss recent hardware solutions for partial and stream sorting. Finally, we discuss some important concerns that need to be considered in future designs of sorting systems.
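To make the CAS building block concrete, the sketch below models a compare-and-swap unit in software and composes five of them into the standard 4-input sorting network; in a hardware realization each cas() call would correspond to a comparator wired in place rather than a sequential function call. The function names and host-side framing are illustrative assumptions, not taken from any particular surveyed design.

```cpp
// Software model of a compare-and-swap (CAS) unit and a 4-input sorting
// network built from five CAS units. In hardware, each cas() call maps to
// one comparator in a fixed network.
#include <cstdio>
#include <utility>

void cas(int& a, int& b) {
    if (a > b) std::swap(a, b);  // low value ends up on a, high value on b
}

void sort4(int v[4]) {
    cas(v[0], v[1]);  // stage 1: sort the two pairs
    cas(v[2], v[3]);
    cas(v[0], v[2]);  // stage 2: merge the pairs
    cas(v[1], v[3]);
    cas(v[1], v[2]);  // stage 3: final clean-up comparator
}

int main() {
    int v[4] = {7, 1, 9, 3};
    sort4(v);
    std::printf("%d %d %d %d\n", v[0], v[1], v[2], v[3]);  // 1 3 7 9
    return 0;
}
```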
- PAR ID: 10648808
- Publisher / Repository: ACM Transactions on Design Automation of Electronic Systems
- Date Published:
- Journal Name: ACM Transactions on Design Automation of Electronic Systems
- Volume: 30
- Issue: 4
- ISSN: 1084-4309
- Page Range / eLocation ID: 1 to 31
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, which shifts complex memory management from programmers to the GPU driver/hardware and enables kernel execution even when memory is oversubscribed. Meanwhile, UVM may also incur considerable performance overhead due to tracking and data migration, along with special handling of page faults and page table walks. As UVM is attracting significant attention from the research community to develop innovative solutions to these problems, in this paper we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on this important topic. The proposed UVMBench consists of 32 representative benchmarks from a wide range of application domains. The suite also features a unified programming implementation and diverse memory access patterns across benchmarks, thus allowing thorough evaluation and comparison with the current state of the art. A set of experiments has been conducted on real GPUs to verify and analyze the benchmark suite's behavior under various scenarios. (A minimal managed-memory sketch appears after this list.)
- Sorting data is needed in many application domains. Traditionally, the data is read from memory and sent to a general-purpose processor or application-specific hardware for sorting. The sorted data is then written back to the memory. Reading/writing data from/to memory and transferring data between memory and processing unit incur significant latency and energy overhead. In this work, we develop, to the best of our knowledge, the first architectures for in-memory sorting of data. We propose two architectures. The first architecture is applicable to the conventional format of representing data, i.e., weighted binary radix. The second architecture is proposed for developing unary processing systems, where data is encoded as uniform unary bit-streams. As we present, each of the two architectures has different advantages and disadvantages, making one or the other more suitable for a specific application. However, the common property of both is a significant reduction in processing time compared to prior sorting designs. Our evaluations show on average 37× and 138× energy reduction for the binary and unary designs, respectively, compared to conventional CMOS off-memory sorting systems in a 45 nm technology. We designed a 3×3 and a 5×5 median filter using the proposed sorting solutions, which we used for processing 64×64-pixel images. Our results show a reduction of 14× and 634× in energy and latency, respectively, with the proposed binary approach, and of 5.6× and 152×10³ in energy and latency with the proposed unary approach, compared to those of the off-memory binary and unary designs for the 3×3 median filtering system. (A thermometer-coded compare-and-swap sketch appears after this list.)
- As computing devices become more commonplace in everyday life, we have seen an increase in possible attacks on commercial devices and critical infrastructure. As a result, both academia and industry have proposed solutions to mitigate or outright eliminate the ever-expanding set of viable targets. Initially, this resulted in an influx of software-based defenses against these emerging threats. Unfortunately, it was found that software solutions could be bypassed with more advanced attacks and often resulted in high performance overhead. As such, hardware-assisted security defenses have been developed to provide improved security while keeping performance overhead to manageable levels, especially for IoT devices. In this paper, we will provide a survey of prominent hardware-assisted security defenses. We will enumerate the attacks these defenses aim to protect against, as well as their effectiveness. We will also discuss the implications for both performance and system design. A comparison between approaches that target the same set of issues, and possible directions for future research, will be presented.
- Processing-in-memory (PIM), where compute is moved closer to memory or data, has been explored to accelerate emerging workloads. Different PIM-based systems have been announced, each offering a unique microarchitectural organization of their compute units, ranging from fixed functional units to programmable general-purpose compute cores near memory. However, one fundamental limitation of PIM is that each compute unit can only access its local memory; access to "remote" memory must occur through the host CPU, potentially limiting application performance scalability. In this work, we first characterize the scalability of real PIM architectures using the UPMEM PIM system. We analyze how the overhead of communicating through the host (instead of providing direct communication between the PIM compute units) can become a bottleneck for collective communications that are commonly used in many workloads. To overcome this inter-PIM-bank communication bottleneck, we propose PIMnet, a PIM interconnection network for PIM banks that provides direct connectivity between compute units and removes the overhead of communicating through the host. PIMnet exploits bandwidth parallelism, where communication across the different PIM banks/chips can occur in parallel to maximize communication performance. PIMnet also matches the DRAM packaging hierarchy with a multi-tier network architecture. Unlike traditional interconnection networks, PIMnet is a PIM-controlled network where communication is managed by the PIM logic, optimizing collective communications and minimizing the hardware overhead of PIMnet. Our evaluation of PIMnet shows that it provides up to 85× speedup on collective communications and achieves an 11.8× improvement on real applications compared to the baseline PIM.
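Relating to the UVM benchmark item above, the following is a minimal sketch of the managed-memory programming model that UVM exposes: a single cudaMallocManaged allocation is touched by both CPU and GPU, with the driver migrating pages on demand (optionally guided by a prefetch). The kernel, sizes, and prefetch call are illustrative assumptions and are not drawn from UVMBench.

```cuda
// Minimal Unified Virtual Memory (UVM) sketch: one allocation shared by CPU
// and GPU; the driver migrates pages on demand when either side touches them.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;  // device access to managed pages triggers migration
}

int main() {
    const int n = 1 << 20;
    float* x = nullptr;
    cudaMallocManaged(&x, n * sizeof(float));         // single pointer valid on CPU and GPU

    for (int i = 0; i < n; ++i) x[i] = 1.0f;          // CPU writes; pages reside on the host

    int dev = 0;
    cudaGetDevice(&dev);
    cudaMemPrefetchAsync(x, n * sizeof(float), dev);  // optional: hide migration cost

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);
    cudaDeviceSynchronize();                          // after sync, CPU reads fault pages back

    std::printf("x[0] = %f\n", x[0]);                 // 2.0
    cudaFree(x);
    return 0;
}
```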
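Relating to the in-memory unary sorting item above, one comparison-free way to build a compare-and-swap is to encode values as thermometer-coded unary bit-streams: the bitwise AND of two such streams encodes their minimum and the bitwise OR their maximum, so no numeric comparison is needed. The sketch below only demonstrates this encoding property in software; the stream width and function names are illustrative assumptions, not the paper's circuit.

```cpp
// Thermometer-coded unary CAS: for streams where value v is v ones followed
// by zeros, AND of two streams encodes min(a, b) and OR encodes max(a, b),
// so the comparator needs no subtraction or numeric comparison.
#include <bitset>
#include <cstdio>

constexpr int W = 8;  // stream length; representable values are 0..W

std::bitset<W> encode(int v) {            // v ones in the low positions
    std::bitset<W> s;
    for (int i = 0; i < v; ++i) s.set(i);
    return s;
}

int decode(const std::bitset<W>& s) { return static_cast<int>(s.count()); }

int main() {
    int a = 3, b = 6;
    std::bitset<W> sa = encode(a), sb = encode(b);
    std::bitset<W> lo = sa & sb;          // thermometer code of min(a, b)
    std::bitset<W> hi = sa | sb;          // thermometer code of max(a, b)
    std::printf("min = %d, max = %d\n", decode(lo), decode(hi));  // min = 3, max = 6
    return 0;
}
```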