skip to main content


Title: Lock-free pending event set management in time warp
The rapid growth in the parallelism of multi-core processors has opened up new opportunities and challenges for parallel simulation discrete event simulation (PDES). PDES simulators attempt to find parallelism within the pending event set to achieve speedup. Typically the pending event set is sorted to preserve the causal orders of the contained events. Sorting is a key aspect that amplifies contention for exclusive access to the shared event scheduler and events are generally scheduled to follow the time-based order of the pending events. In this work we leverage a Ladder Queue data structure to partition the pending events into groups (called buckets) arranged by adjacent and short regions of time. We assume that the pending events within any one bucket are causally independent and schedule them for execution without sorting and without consideration of their total time-based order. We use the Time Warp mechanism to recover whenever actual dependencies arise. Due to the lack of need for sorting, we further extend our pending event data structure so that it can be organized for lock-free access. Experimental results show consistent speedup for all studied configurations and simulation models. The speedups range from 1.1 to 1.49 with higher speedups occurring with higher thread counts where contention for the shared event set becomes more problematic with a conventional mutex locking mechanism.  more » « less
Award ID(s):
0915337
PAR ID:
10350996
Author(s) / Creator(s):
;
Date Published:
Journal Name:
ACM SIGSIM Conference on Principles of Advanced Discrete Simulation
Page Range / eLocation ID:
15 to 26
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Multi-core and many-core processing chips are becoming widespread and are now being widely integrated into Beowulf clusters. This poses a challenging problem for distributed simulation as it now becomes necessary to extend the algorithms to operate on a platform that includes both shared memory and distributed memory hardware. Furthermore, as the number of on-chip cores grows, the challenges for developing solutions without significant contention for shared data structures grows. This is especially true for the pending event list data structures where multiple execution threads attempt to schedule the next event for execution. This problem is especially aggravated in parallel simulation, where event executions are generally fine-grained leading quickly to non-trivial contention for the pending event list. This manuscript explores the design of the software architecture and several data structures to manage the pending event sets for execution in a Time Warp synchronized parallel simulation engine. The experiments are especially targeting multi-core and many-core Beowulf clusters containing 8-core to 48-core processors. These studies include a two-level structure for holding the pending event sets using three different data structures, namely: splay trees, the STL multiset, and ladder queues. Performance comparisons of the three data structures using two architectures for the pending event sets are presented. 
    more » « less
  2. Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. With hardware based transactions, the protection of shared data accesses and updates can be evaluated at runtime so that only true collisions to shared data force serialization. This paper explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, we explore the application of Intel’s hardware-based transactional memory (TSX) to manage shared access to the pending event set by the simulation threads. Comparison between conventional locking mechanisms and transactional memory access is performed to evaluate each within the warped Time Warp synchronized parallel simulation kernel. In this testing, evaluation of both forms of transactional memory found in the Intel Haswell processor, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), are evaluated. The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, in some cases as much as 27%. 
    more » « less
  3. Parallel computing algorithms benefit from in- creases in concurrency when the hardware capacity is being under utilized. For likelihood-based molecular evolution in- ferences this can be due to small problem sizes, or because hardware capacity has increased beyond dataset sizes. A central concept in this domain is a bifurcating tree, which represents evolutionary relationships. The topology of the tree being evaluated directly affects the degree of parallelism that can be exploited by likelihood-based algorithms. For time-reversible evolutionary models we can reroot an unbalanced tree in order to make it more symmetric, without affecting the likelihood. Based on the reduction in number of concurrent operation sets, we define a best-case theoretical expectations, based on tree size and topology, for speedup due to rerooting which approaches 2-fold as the number of tip nodes increases for pectinate trees, and much higher values for some random topologies as the number of tip nodes increases. Empirical results using the NVIDIA CUDA implementation of the BEAGLE library confirm the merit of this approach. For pectinate trees we observe speedups of up to 1.93-fold due to rerooting and even larger speedups for random trees for the core likelihood-evaluation function in BEAGLE. 
    more » « less
  4. Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively without explicit synchronization between the concurrently executing simulators. In place of an explicit synchronization mechanism, the concurrent simulators maintain a common virtual clock model and implement a rollback/recovery mechanism to restore causal order when out-of order events are detected. When the critical path of execution of the simulation is balanced across these parallel simulators, this can result in a highly effective, lightweight synchronization mechanism. However, imbalances in the workload across the parallel simulators can result in excessive rollback at some nodes and ultimately result in an overall slowing of the simulation as prematurely computed and transmitted events are processed. On small shared memory multi-core systems, a lowest timestamp first scheduling policy can effectively balance the workload. However, on larger many-core chips, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. For example, the Intel Single-chip Cloud Computer (SCC) provides mechanisms that a running application can use to adjust the frequency/voltage of different regions (called islands) of the chip. These islands are network and processing core centric and thus, in a Time Warp simulation, one can increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback). This paper investigates the run-time control and adjustment of core frequency in an AMD Phenom II X6 multi-core processor to explore and demonstrate that the dynamic run-time control of core frequency can sometimes improve the performance of a Time Warp synchronized parallel simulation. 
    more » « less
  5. Data deduplication has been widely used in storage systems to improve storage efficiency and I/O performance. In particular, content-defined variable-size chunking (CDC) is often used in data deduplication systems for its capability to detect and remove duplicate data in modified files. However, the CDC algorithm is very compute-intensive and inherently sequential. Efforts on accelerating it by segmenting a file and running the algorithm independently on each segment in parallel come at a cost of substantial degradation of deduplication ratio. In this paper, we propose SS-CDC, a two-stage parallel CDC, that enables (almost) full parallelism on chunking of a file without compromising deduplication ratio. Further, SS-CDC exploits instruction-level SIMD parallelism available in today's processors. As a case study, by using Intel AVX-512 instructions, SS-CDC consistently obtains superlinear speedups on a multi-core server. Our experiments using real-world datasets show that, compared to existing parallel CDC methods which only achieve up to a 7.7X speedup on an 8-core processor with the deduplication ratio degraded by up to 40%, SS-CDC can achieve up to a 25.6X speedup with no loss of deduplication ratio. 
    more » « less