Multi-core and many-core processing chips are becoming widespread and are now widely integrated into Beowulf clusters. This poses a challenging problem for distributed simulation, since the algorithms must be extended to operate on platforms that combine shared-memory and distributed-memory hardware. Furthermore, as the number of on-chip cores grows, the challenge of developing solutions without significant contention for shared data structures grows as well. This is especially true for the pending event list data structures, where multiple execution threads attempt to schedule the next event for execution. The problem is further aggravated in parallel simulation, where event executions are generally fine-grained and quickly produce non-trivial contention for the pending event list. This manuscript explores the design of the software architecture and several data structures to manage the pending event sets for execution in a Time Warp synchronized parallel simulation engine. The experiments specifically target multi-core and many-core Beowulf clusters containing 8-core to 48-core processors. These studies include a two-level structure for holding the pending event sets using three different data structures, namely: splay trees, the STL multiset, and ladder queues. Performance comparisons of the three data structures using two architectures for the pending event sets are presented.
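The two-level designs studied in the paper are not reproduced here, but the shared structure whose contention they target can be sketched. Below is a minimal Python baseline, assuming a binary heap in place of the splay tree, STL multiset, or ladder queue actually evaluated: a single mutex-protected, sorted pending event set that every thread must lock to schedule or fetch the next event.

```python
import heapq
import threading

class PendingEventSet:
    """Minimal sketch of a mutex-protected, sorted pending event set.

    Events are (timestamp, payload) tuples kept in a binary heap; a single
    lock serializes all access, which is exactly the contention point that
    multi-level and lock-free designs aim to reduce.
    """

    def __init__(self):
        self._heap = []
        self._lock = threading.Lock()

    def schedule(self, timestamp, payload):
        """Insert an event; the heap keeps the set sorted by timestamp."""
        with self._lock:
            heapq.heappush(self._heap, (timestamp, payload))

    def next_event(self):
        """Remove and return the lowest-timestamp event, or None if empty."""
        with self._lock:
            return heapq.heappop(self._heap) if self._heap else None
```

Every call serializes on the one lock, so with fine-grained events and many threads this baseline degrades quickly, motivating the two-level architectures compared in the paper.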
Lock-free pending event set management in Time Warp
The rapid growth in the parallelism of multi-core processors has opened up new opportunities and challenges for parallel discrete event simulation (PDES). PDES simulators attempt to find parallelism within the pending event set to achieve speedup. Typically the pending event set is sorted to preserve the causal order of the contained events. This sorting amplifies contention for exclusive access to the shared event scheduler, since events are generally scheduled to follow the time-based order of the pending events. In this work we leverage a Ladder Queue data structure to partition the pending events into groups (called buckets) spanning adjacent, short regions of time. We assume that the pending events within any one bucket are causally independent and schedule them for execution without sorting and without consideration of their total time-based order. We use the Time Warp mechanism to recover whenever actual dependencies arise. Because sorting is no longer needed, we further extend our pending event data structure so that it can be organized for lock-free access. Experimental results show consistent speedup for all studied configurations and simulation models. The speedups range from 1.1 to 1.49, with the higher speedups occurring at higher thread counts, where contention for the shared event set becomes more problematic with a conventional mutex locking mechanism.
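The bucketing idea can be illustrated with a short Python sketch. It assumes a fixed bucket width (a real ladder queue sizes and splits its rungs dynamically) and simply groups events into adjacent time intervals; a bucket's contents are handed out unsorted, relying on Time Warp rollback to repair any causal violation that actually occurs.

```python
from collections import defaultdict

BUCKET_WIDTH = 10.0  # assumed fixed width; real ladder queues resize rungs

def bucket_events(events, width=BUCKET_WIDTH):
    """Partition (timestamp, payload) events into adjacent time buckets.

    Buckets are returned in increasing time order, but the events inside
    each bucket are left unsorted: they are treated as causally
    independent and may be executed in any order.
    """
    buckets = defaultdict(list)
    for ts, payload in events:
        buckets[int(ts // width)].append((ts, payload))
    return [buckets[k] for k in sorted(buckets)]
```

Skipping the per-event sort is what removes the total-order bottleneck: threads only need to agree on which bucket is current, not on a global event ordering, which is what makes a lock-free organization feasible.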
- Award ID(s): 0915337
- PAR ID: 10350996
- Date Published:
- Journal Name: ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
- Page Range / eLocation ID: 15 to 26
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. With hardware-based transactions, the protection of shared data accesses and updates can be evaluated at runtime so that only true collisions on shared data force serialization. This paper explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, we explore the application of Intel's hardware-based transactional memory (TSX) to manage shared access to the pending event set by the simulation threads. Conventional locking mechanisms and transactional memory access are compared within the warped Time Warp synchronized parallel simulation kernel. This testing evaluates both forms of transactional memory found in the Intel Haswell processor: Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM). The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, in some cases by as much as 27%.
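Python has no access to the TSX instructions themselves (the RTM intrinsics are `_xbegin`/`_xend` in C), but the optimistic principle the abstract describes, namely execute first and serialize only on an actual conflict, can be illustrated with a seqlock-style version counter. All names below are illustrative and not part of the warped kernel; the lock here plays the role of the HLE fallback path.

```python
import threading

class OptimisticCell:
    """Seqlock-style sketch of the optimistic idea behind HLE/RTM.

    A writer bumps a version counter before and after its update, so the
    counter is odd while a write is in flight. A reader retries only if
    the version changed (or was odd) during its read -- serialization
    happens only on a true collision, like a hardware transaction abort.
    """

    def __init__(self, value=0):
        self._value = value
        self._version = 0
        self._lock = threading.Lock()  # fallback path, as in HLE

    def write(self, value):
        with self._lock:
            self._version += 1   # now odd: a write is in progress
            self._value = value
            self._version += 1   # even again: write complete

    def read(self):
        while True:
            v1 = self._version
            value = self._value
            v2 = self._version
            if v1 == v2 and v1 % 2 == 0:  # no writer ran mid-read
                return value
            # otherwise retry, analogous to a transaction abort
```

Readers never take the lock at all in the common (conflict-free) case, which is the behavior that lets transactional approaches beat a conventional mutex when real collisions are rare.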
-
Parallel computing algorithms benefit from increases in concurrency when the hardware capacity is under-utilized. For likelihood-based molecular evolution inferences this can be due to small problem sizes, or because hardware capacity has increased beyond dataset sizes. A central concept in this domain is a bifurcating tree, which represents evolutionary relationships. The topology of the tree being evaluated directly affects the degree of parallelism that can be exploited by likelihood-based algorithms. For time-reversible evolutionary models we can reroot an unbalanced tree to make it more symmetric without affecting the likelihood. Based on the reduction in the number of concurrent operation sets, we define a best-case theoretical expectation, as a function of tree size and topology, for the speedup due to rerooting: it approaches 2-fold as the number of tip nodes increases for pectinate trees, and reaches much higher values for some random topologies. Empirical results using the NVIDIA CUDA implementation of the BEAGLE library confirm the merit of this approach. For pectinate trees we observe speedups of up to 1.93-fold due to rerooting, and even larger speedups for random trees, for the core likelihood-evaluation function in BEAGLE.
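The best-case expectation for pectinate trees can be checked with a small counting sketch. The Python below uses hypothetical helper names (none come from BEAGLE): a tree is a nested pair of subtrees or a tip label, the number of sequential operation sets is taken as the internal-node height (a node's partial likelihood needs both children first), and a fully unbalanced chain is compared against the same tips rerooted at the middle edge.

```python
def op_sets(tree):
    """Sequential likelihood-operation sets = internal-node height."""
    if isinstance(tree, str):        # tip: no computation required
        return 0
    left, right = tree
    return 1 + max(op_sets(left), op_sets(right))

def pectinate(n):
    """Fully unbalanced (caterpillar) tree with n tips."""
    tree = "t0"
    for i in range(1, n):
        tree = (tree, f"t{i}")
    return tree

def reroot_middle(n):
    """Pectinate tree rerooted at its middle edge: two half-length chains
    that can be evaluated concurrently."""
    half = n // 2
    return (pectinate(half), pectinate(n - half))
```

For 8 tips the chain needs 7 sequential operation sets while the rerooted tree needs only 4, and the ratio tends toward 2 as the tip count grows, matching the stated bound for pectinate trees.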
-
Abstract This paper presents dynamic design techniques—namely, continuous-time event-triggered control (CETC), periodic event-triggered control (PETC) and self-triggered control (STC)—for a class of unstable one-dimensional reaction-diffusion partial differential equations (PDEs) with boundary control and an anti-collocated sensing mechanism. For the first time, global exponential stability (GES) of the closed-loop system is established using a PDE backstepping control design combined with dynamic event-triggered mechanisms for parabolic PDEs—a result not previously achieved even under full-state measurement. When the emulated continuous-time backstepping controller is implemented on the plant using a zero-order hold, our design guarantees $L^{2}$-GES through the integration of novel switching dynamic event triggers and a newly developed Lyapunov functional. While CETC requires the continuous monitoring of the triggering function to detect events, PETC only requires the periodic evaluation of this function. The STC design assumes full-state measurements and, unlike CETC, does not require continuous monitoring of any triggering function. Instead, it computes the next event time at the current event time using only full-state measurements available at the current event time and the immediate previous event time. Thus, STC operates entirely with event-triggered measurements, in contrast to CETC and PETC, which rely on continuous measurements. The well-posedness of the closed-loop systems under all three strategies is established, and simulation results are provided to illustrate the theoretical results.
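The PETC pattern is easiest to see on a toy plant. The Python sketch below uses a scalar unstable ODE as a stand-in for the boundary-controlled PDE (the gains, threshold, and step size are illustrative, not from the paper): the triggering function is evaluated only periodically, and the sample-and-hold control input is refreshed only when it fires.

```python
def simulate_petc(a=1.0, k=3.0, h=0.01, sigma=0.5, T=5.0):
    """Periodic event-triggered control of the unstable scalar plant
    dx/dt = a*x + u with u = -k*x_hat held between events.

    Every h seconds (here also the Euler step) the triggering function
    |x - x_hat| - sigma*|x| is checked; only when it is positive is the
    held state x_hat resampled and the control input updated.
    Returns the final state and the number of triggering events.
    """
    x, x_hat, events = 1.0, 1.0, 0
    for _ in range(int(T / h)):
        if abs(x - x_hat) > sigma * abs(x):   # periodic trigger check
            x_hat = x                         # sample-and-hold update
            events += 1
        u = -k * x_hat                        # control held between events
        x += h * (a * x + u)                  # explicit Euler step
    return x, events
```

Between checks the plant runs open-loop on the stale sample, so stability rests on the trigger bounding the sampling error relative to the state, which is the same mechanism the paper's Lyapunov analysis formalizes for the PDE case.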
-
This paper proposes efficient solutions for k-core decomposition with high parallelism. The problem of k-core decomposition is fundamental in graph analysis and has applications across various domains. However, existing algorithms face significant challenges in achieving work-efficiency in theory and/or high parallelism in practice, and suffer from various performance bottlenecks. We present a simple, work-efficient parallel framework for k-core decomposition that is easy to implement and adaptable to various strategies for improving work-efficiency. We introduce two techniques to enhance parallelism: a sampling scheme to reduce contention on high-degree vertices, and vertical granularity control (VGC) to mitigate scheduling overhead for low-degree vertices. Furthermore, we design a hierarchical bucket structure to optimize performance for graphs with high coreness values. We evaluate our algorithm on a diverse set of real-world and synthetic graphs. Compared to state-of-the-art parallel algorithms, including ParK, PKC, and Julienne, our approach demonstrates superior performance on 23 out of 25 graphs when tested on a 96-core machine. Our algorithm shows speedups of up to 315× over ParK, 33.4× over PKC, and 52.5× over Julienne.
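The parallel framework itself is not reproduced here, but the underlying computation that the sampling, VGC, and bucket structure accelerate is the classic peeling process. A sequential Python sketch, assuming an undirected edge list as input:

```python
from collections import defaultdict

def core_decomposition(edges):
    """Sequential peeling sketch of k-core decomposition.

    Repeatedly removes vertices whose remaining degree is at most the
    current level k; a removed vertex's coreness is k. Parallel versions
    peel each round's vertices simultaneously, which is where contention
    on high-degree vertices arises.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    deg = {v: len(nbrs) for v, nbrs in adj.items()}
    core, remaining, k = {}, set(deg), 0
    while remaining:
        k = max(k, min(deg[v] for v in remaining))   # next peeling level
        frontier = [v for v in remaining if deg[v] <= k]
        while frontier:
            v = frontier.pop()
            if v not in remaining:
                continue                              # already peeled
            core[v] = k
            remaining.discard(v)
            for w in adj[v]:
                if w in remaining:
                    deg[w] -= 1
                    if deg[w] <= k:
                        frontier.append(w)            # cascades this round
    return core
```

Each outer iteration is one "round" of peeling; the paper's hierarchical buckets replace the linear scan for the next level, and its sampling scheme spreads out the degree decrements that here all funnel through `deg[w] -= 1`.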