Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUs.
more »
« less
Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures
Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput that existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
more »
« less
- PAR ID:
- 10304007
- Date Published:
- Journal Name:
- Proceedings of the 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '21)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Task graphs have been studied for decades as a foundation for scheduling irregular parallel applications and incorporated in many programming models including OpenMP. While many high-performance parallel libraries are based on task graphs, they also have additional scheduling requirements, such as synchronization within inner levels of data parallelism and internal blocking communications. In this paper, we extend task-graph scheduling to support efficient synchronization and communication within tasks. Compared to past work, our scheduler avoids deadlock and oversubscription of worker threads, and refines victim selection to increase the overlap of sibling tasks. To the best of our knowledge, our approach is the first to combine gang-scheduling and work-stealing in a single runtime. Our approach has been evaluated on the SLATE high-performance linear algebra library. Relative to the LLVM OMP runtime, our runtime demonstrates performance improvements of up to 13.82%, 15.2%, and 36.94% for LU, QR, and Cholesky, respectively, evaluated across different configurations related to matrix size, number of nodes, and use of CPUs vs GPUsmore » « less
-
One of the major challenges in parallelization is the difficulty of improving application scalability with conventional techniques. HPX provides efficient scalable parallelism by significantly reducing node starvation and effective latencies while controlling the overheads. In this paper, we present a new highly scalable parallel distributed N-Body application using a future-based algorithm, which is implemented with HPX. The main difference between this algorithm and prior art is that a future-based request buffer is used between different nodes and along each spatial direction to send/receive data to/from the remote nodes, which helps removing synchronization barriers. HPX provides an asynchronous programming model which results in improving the parallel performance. The results of using HPX for parallelizing Octree construction on one node and the force computation on the distributed nodes show the scalability improvement on an average by about 45% compared to an equivalent OpenMP implementation and 28% compared to a hybrid implementation (MPI+OpenMP) [1] respectively for one billion particles running on up to 128 nodes with 20 cores per each.more » « less
-
null (Ed.)Wireless local area networks (WLANs) are a key component of the telecommunications infrastructure in our society. While many solutions have been produced to improve their downlink throughput, the techniques for enhancing their uplink throughput remain limited. The stagnation can be attributed to the lack of fine-grained inter-node synchronization due to the hardware limitation of most devices. In this paper, we present an uplink distributed multiple-input-and-multiple-output scheme (termed UD-MIMO) for WLANs to enable concurrent uplink transmission in the absence of fine-grained inter-node synchronization. The enabling technique behind UD-MIMO is a practical solution to decoding uplink packets from asynchronous users. UD-MIMO makes it possible for WLANs to significantly improve their uplink throughput while not requiring tight internode synchronization. We have built a prototype of UD-MIMO on a wireless testbed and demonstrate its compatibility with commercial off-the-shelf Atheros 802.11 client devices (with modified Linux driver). Our experimental results show that, for a WLAN with 8 APs in a conference room, UD-MIMO offers 3.4× throughput compared to interference-avoidance approach.more » « less
-
Technologies such as Multi-Channel DRAM (MCDRAM) or High Bandwidth Memory (HBM) provide significantly more bandwidth than conventional memory. This trend has raised questions about how applications should manage data transfers between levels.This paper focuses on evaluating different usage modes of the MCDRAM in Intel Knights Landing (KNL) manycore processors. We evaluate these usage modes with a sorting kernel and a sortingbased streaming benchmark. We develop a performance model for the benchmark and use experimental evidence to demonstrate the correctness of the model. The model projects near-optimal numbers of copy threads for memory bandwidth bound computations. We demonstrate on KNL up to a 1.9X speedup for sort when the problem does not fit in MCDRAM over an OpenMP GNU sort that does not use MCDRAM.more » « less