-
The field of computer architecture uses quantitative methods to drive the computer system design process. By quantitatively profiling the run time characteristics of computer programs, the principal processing needs of commonly used programs have become well understood, and computer architects can focus their design solutions on those needs. The DESMetrics project is established to follow this quantitative model by profiling the execution of Discrete Event Simulation (DES) models in order to focus optimization efforts within DES execution frameworks (and especially parallel DES engines). In particular, the DESMetrics project is designed to capture the run time characteristics of event execution in DES models. Because DES models tend to have fine grained computational processing requirements, the DESMetrics project focuses on the event dependencies and their exchange between the objects in the simulation. For now, we assume that optimization of the actual event processing is well served by conventional compiler and architecture solutions. However, as will become clear in Section 6, it is possible to identify blocks of events that could potentially be scheduled together, at least within a single simulation object.
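As a rough illustration of what profiling event exchange between simulation objects could look like, the sketch below counts events per sender/receiver pair. The trace layout (sender, receiver, send time, receive time) and the function names are assumptions made for this example; the abstract does not describe the actual DESMetrics trace format or tooling.

```python
from collections import Counter


def profile_event_exchange(trace):
    """Count events sent between each (sender, receiver) pair.

    `trace` is a hypothetical list of
    (sender, receiver, send_time, receive_time) tuples.
    """
    exchange = Counter()
    for sender, receiver, _send_time, _receive_time in trace:
        exchange[(sender, receiver)] += 1
    return exchange


def local_fraction(exchange):
    """Fraction of events that an object sends to itself."""
    total = sum(exchange.values())
    local = sum(n for (s, r), n in exchange.items() if s == r)
    return local / total if total else 0.0


# Hypothetical four-event trace between objects "a" and "b".
trace = [
    ("a", "b", 0.0, 1.0),
    ("a", "b", 1.0, 2.5),
    ("b", "a", 1.0, 3.0),
    ("a", "a", 2.0, 2.1),
]
exchange = profile_event_exchange(trace)
```

A summary such as `local_fraction(exchange)` gives one simple view of how much of the event traffic stays within a single simulation object.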
-
Transactional memory is a concurrency control mechanism that dynamically determines when threads may safely execute critical sections of code. It provides the performance of fine-grained locking mechanisms with the simplicity of coarse-grained locking mechanisms. With hardware based transactions, the protection of shared data accesses and updates can be evaluated at runtime so that only true collisions to shared data force serialization. This paper explores the use of transactional memory as an alternative to conventional synchronization mechanisms for managing the pending event set in a Time Warp synchronized parallel simulator. In particular, we explore the application of Intel’s hardware-based transactional memory (TSX) to manage shared access to the pending event set by the simulation threads. We compare conventional locking mechanisms and transactional memory access within the warped Time Warp synchronized parallel simulation kernel. In this testing, both forms of transactional memory found in the Intel Haswell processor, Hardware Lock Elision (HLE) and Restricted Transactional Memory (RTM), are evaluated. The results show that RTM generally outperforms conventional locking mechanisms and that HLE provides consistently better performance than conventional locking mechanisms, in some cases by as much as 27%.
-
A considerable amount of research on parallel discrete event simulation has been conducted over the past few decades. However, most of this research has targeted the parallel simulation infrastructure, focusing on data structures, algorithms, and synchronization methods for parallel simulation kernels. Unfortunately, distributed environments often have high communication latencies that can reduce the potential performance of parallel simulations. Effective partitioning of the concurrent simulation objects of the real world models can have a large impact on the amount of network traffic necessary in the simulation, and consequently on the overall performance. This paper presents our studies on profiling the characteristics of simulation models and using the collected data to perform partitioning of the models for concurrent execution. Our benchmarks show that Profile Guided Partitioning can result in dramatic performance gains in the parallel simulations. In some of the models, 5-fold improvements of the run time of the concurrently executed simulations were observed.
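To make the link between profile data and network traffic concrete, the sketch below scores a partition by the number of profiled events that cross partition boundaries and applies a simple greedy placement. This greedy heuristic is a stand-in chosen for illustration, not the partitioning algorithm used in the paper; the object names, edge weights, and function names are all hypothetical.

```python
from collections import defaultdict


def cut_weight(edges, assignment):
    """Total profiled event count crossing partition boundaries."""
    return sum(w for (u, v), w in edges.items()
               if assignment[u] != assignment[v])


def greedy_partition(objects, edges, num_parts):
    """Place each object where it already communicates most.

    Partitions are capped at ceil(len(objects) / num_parts) objects
    so that the load stays roughly balanced.
    """
    capacity = -(-len(objects) // num_parts)  # ceiling division
    neighbors = defaultdict(lambda: defaultdict(int))
    for (u, v), w in edges.items():
        neighbors[u][v] += w
        neighbors[v][u] += w

    assignment = {}
    sizes = [0] * num_parts
    for obj in objects:
        affinity = [0] * num_parts
        for other, w in neighbors[obj].items():
            if other in assignment:
                affinity[assignment[other]] += w
        candidates = [p for p in range(num_parts) if sizes[p] < capacity]
        # Prefer high affinity; break ties toward the emptier partition.
        best = max(candidates, key=lambda p: (affinity[p], -sizes[p]))
        assignment[obj] = best
        sizes[best] += 1
    return assignment


# Hypothetical profile: events exchanged between pairs of objects.
objects = ["a", "b", "c", "d"]
edges = {("a", "b"): 10, ("c", "d"): 8, ("b", "c"): 1}

guided = greedy_partition(objects, edges, num_parts=2)
round_robin = {obj: i % 2 for i, obj in enumerate(objects)}
```

Comparing `cut_weight(edges, guided)` with `cut_weight(edges, round_robin)` shows how a profile-aware placement can keep heavily communicating objects on the same node.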
-
The rapid growth in the parallelism of multi-core processors has opened up new opportunities and challenges for parallel discrete event simulation (PDES). PDES simulators attempt to find parallelism within the pending event set to achieve speedup. Typically the pending event set is sorted to preserve the causal order of the contained events. Sorting is a key aspect that amplifies contention for exclusive access to the shared event scheduler, and events are generally scheduled to follow the time-based order of the pending events. In this work we leverage a Ladder Queue data structure to partition the pending events into groups (called buckets) arranged by adjacent and short regions of time. We assume that the pending events within any one bucket are causally independent and schedule them for execution without sorting and without consideration of their total time-based order. We use the Time Warp mechanism to recover whenever actual dependencies arise. Because sorting is no longer needed, we further extend our pending event data structure so that it can be organized for lock-free access. Experimental results show consistent speedup for all studied configurations and simulation models. The speedups range from 1.1 to 1.49, with higher speedups occurring at higher thread counts, where contention for the shared event set becomes more problematic with a conventional mutex locking mechanism.
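The bucketing idea can be sketched as follows. This is a deliberately simplified, single-threaded illustration with one level of fixed-width buckets; a full Ladder Queue has additional tiers, and the lock-free organization described in the abstract is not modeled here. The class and method names are hypothetical.

```python
from collections import defaultdict


class BucketedEventSet:
    """Group pending events into fixed-width time buckets.

    Events inside a bucket are kept in arrival order and are never
    sorted, mirroring the assumption that they are causally independent.
    """

    def __init__(self, bucket_width):
        self.bucket_width = bucket_width
        self.buckets = defaultdict(list)

    def insert(self, timestamp, event):
        index = int(timestamp // self.bucket_width)
        self.buckets[index].append((timestamp, event))

    def pop_lowest_bucket(self):
        """Remove and return all events in the earliest non-empty bucket."""
        if not self.buckets:
            return []
        lowest = min(self.buckets)
        return self.buckets.pop(lowest)


pending = BucketedEventSet(bucket_width=1.0)
pending.insert(3.2, "e1")
pending.insert(0.7, "e2")
pending.insert(0.1, "e3")
pending.insert(1.5, "e4")

first_bucket = pending.pop_lowest_bucket()
```

Note that `first_bucket` holds the events at times 0.7 and 0.1 in arrival order rather than timestamp order; in a Time Warp simulator, any resulting causality violation would be repaired by rollback.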
-
Multi-core and many-core processing chips are becoming widespread and are now being widely integrated into Beowulf clusters. This poses a challenging problem for distributed simulation, as it now becomes necessary to extend the algorithms to operate on a platform that includes both shared memory and distributed memory hardware. Furthermore, as the number of on-chip cores grows, the challenges of developing solutions without significant contention for shared data structures grow. This is especially true for the pending event list data structures, where multiple execution threads attempt to schedule the next event for execution. This problem is especially aggravated in parallel simulation, where event executions are generally fine-grained, leading quickly to non-trivial contention for the pending event list. This manuscript explores the design of the software architecture and several data structures to manage the pending event sets for execution in a Time Warp synchronized parallel simulation engine. The experiments especially target multi-core and many-core Beowulf clusters containing 8-core to 48-core processors. These studies include a two-level structure for holding the pending event sets using three different data structures, namely: splay trees, the STL multiset, and ladder queues. Performance comparisons of the three data structures using two architectures for the pending event sets are presented.
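One way to picture a two-level pending event structure is a per-object queue of events plus a shared schedule queue that tracks each object's earliest event. The sketch below assumes that reading; the abstract does not spell out the actual layout, the example uses binary heaps rather than splay trees, multisets, or ladder queues, and it omits all thread synchronization. All names are hypothetical.

```python
import heapq


class TwoLevelPendingEvents:
    """Per-object event heaps plus a schedule heap of each object's minimum."""

    def __init__(self):
        self.per_object = {}  # object name -> heap of (time, seq, event)
        self.schedule = []    # heap of (time, seq, object name)
        self.seq = 0          # unique tie-breaker for equal timestamps

    def insert(self, obj, timestamp, event):
        heap = self.per_object.setdefault(obj, [])
        entry = (timestamp, self.seq, event)
        self.seq += 1
        heapq.heappush(heap, entry)
        # Only the object's new earliest event is advertised to the scheduler.
        if heap[0] is entry:
            heapq.heappush(self.schedule, (timestamp, entry[1], obj))

    def next_event(self):
        """Pop the globally earliest event, skipping stale schedule entries."""
        while self.schedule:
            timestamp, seq, obj = heapq.heappop(self.schedule)
            heap = self.per_object[obj]
            if heap and heap[0][:2] == (timestamp, seq):
                _, _, event = heapq.heappop(heap)
                if heap:  # advertise the object's next earliest event
                    heapq.heappush(self.schedule, (heap[0][0], heap[0][1], obj))
                return obj, timestamp, event
        return None


events = TwoLevelPendingEvents()
events.insert("a", 5.0, "a5")
events.insert("b", 2.0, "b2")
events.insert("a", 1.0, "a1")
events.insert("b", 4.0, "b4")

order = []
while (item := events.next_event()) is not None:
    order.append(item)
```

The design choice illustrated here is that the shared schedule queue stays small (roughly one live entry per object), so the structure that threads contend for is much smaller than the full pending event set.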
-
Some emerging high performance many-core chips have support to enable software control of an individual core’s operating frequency (and voltage). These controls can potentially be used to optimize execution for either performance (accelerating the critical path) or power savings (green computing). In Time Warp parallel simulators using the Virtual Time synchronization paradigm, some cores may be executing events that are well off the critical path and likely to be undone. In this work, we explore the adjustment of operating frequencies of cores executing on and off the critical path to reduce rollback and power consumption, while maintaining or, in some cases, enhancing performance.
-
Time Warp synchronized parallel discrete event simulators are organized to operate asynchronously and aggressively without explicit synchronization between the concurrently executing simulators. In place of an explicit synchronization mechanism, the concurrent simulators maintain a common virtual clock model and implement a rollback/recovery mechanism to restore causal order when out-of-order events are detected. When the critical path of execution of the simulation is balanced across these parallel simulators, this can result in a highly effective, lightweight synchronization mechanism. However, imbalances in the workload across the parallel simulators can result in excessive rollback at some nodes and ultimately slow the overall simulation as prematurely computed and transmitted events are processed. On small shared memory multi-core systems, a lowest timestamp first scheduling policy can effectively balance the workload. However, on larger many-core chips, conventional load balancing and workload migration will once again become necessary. Fortunately, emerging many-core chips contain some interesting features that can potentially be exploited to improve the performance of parallel simulations. For example, the Intel Single-chip Cloud Computer (SCC) provides mechanisms that a running application can use to adjust the frequency/voltage of different regions (called islands) of the chip. These islands are network and processing core centric and thus, in a Time Warp simulation, one can increase the frequency of the cores executing threads on the critical path (those experiencing infrequent rollback) and decrease the frequency of the cores executing threads off the critical path (those experiencing excessive rollback).
This paper investigates the run-time control and adjustment of core frequency in an AMD Phenom II X6 multi-core processor to explore and demonstrate that the dynamic run-time control of core frequency can sometimes improve the performance of a Time Warp synchronized parallel simulation.
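The policy described above (speed up cores with infrequent rollback, slow down cores with excessive rollback) can be sketched as a simple threshold rule. Everything concrete in this example is an assumption made for illustration: the per-core statistics layout, the 0.3 threshold, and the two frequency values are hypothetical, and the code only computes a plan without touching any hardware frequency controls.

```python
def rollback_fraction(rolled_back, processed):
    """Fraction of processed events that were later rolled back."""
    return rolled_back / processed if processed else 0.0


def choose_frequencies(stats, low_mhz, high_mhz, threshold):
    """Map each core to a frequency based on its rollback fraction.

    `stats` maps core id -> (events rolled back, events processed).
    Cores above `threshold` are treated as off the critical path.
    """
    plan = {}
    for core, (rolled_back, processed) in stats.items():
        fraction = rollback_fraction(rolled_back, processed)
        plan[core] = low_mhz if fraction > threshold else high_mhz
    return plan


# Hypothetical per-core counters sampled during a run.
stats = {0: (5, 100), 1: (60, 100), 2: (0, 0)}
plan = choose_frequencies(stats, low_mhz=800, high_mhz=3200, threshold=0.3)
```

In practice such a plan would be recomputed periodically, since the set of cores on the critical path can shift as the simulation progresses.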
-
Parallel Discrete Event Simulation (PDES) using distributed synchronization supports the concurrent execution of discrete event simulation models on parallel processing hardware platforms. The multi-core/many-core era has provided a low latency “cluster on a chip” architecture for high-performance simulation and modeling of complex systems. Intel Labs has created a research many-core processor, the Single-Chip Cloud Computer (SCC), that offers some interesting opportunities for PDES research and development. The features of most interest in the SCC system are: low-latency messaging hardware, software managed cache coherence, and (user controllable) core independent dynamic frequency and voltage regulation capability. Ideally, each of these features provides interesting opportunities that can be exploited for improving the performance of PDES. This paper reports some preliminary efforts to migrate an optimistically synchronized parallel simulation kernel called WARPED to an SCC emulation system called Rock Creek Communication Environment (RCCE). The WARPED simulation kernel and several test simulation models have been ported to the RCCE environment. Based on initial efforts, some preliminary insights on how to exploit some of the exotic features of SCC for increasing the performance of PDES applications are noted.
-
In recent years, the number of hardware supported threads in desktop processors has increased dramatically. All but the very lowest cost netbooks and embedded processors now have at least dual cores, and soon systems supporting upwards of 8 to 16 hardware threads are likely to be commonplace. Unfortunately, it will be difficult to take full advantage of the parallelism that emerging processors will be able to provide. To help address this issue, we are investigating mechanisms to pre-compute function results in separate threads running concurrently with the main program thread. The concurrent threads are forked automatically and without program modification. A critical component for the success of this idea is the ability to build a background thread that can pre-compute usable results in some effective manner. For some support functions (dynamic memory), exact argument predictions for the function pre-computation are not necessary; for others (trigonometric functions), they are. In work with dynamic memory, we are able to pre-compute memory blocks and show modest speedup: saving approximately 25% of the dynamic memory costs. In studies with predicting argument values to trigonometric functions, we show that learning algorithms are able to successfully predict the next argument values approximately 44% of the time.
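To illustrate what predicting the next argument value means in practice, the sketch below uses a basic stride predictor and measures how often its guess matches the next observed argument. The stride predictor is a simple stand-in chosen for this example; the abstract does not say which learning algorithms were used, and the class name, function name, and sample argument sequence are hypothetical.

```python
class StridePredictor:
    """Predict the next value as the last value plus the last observed step."""

    def __init__(self):
        self.last = None
        self.stride = None

    def predict(self):
        if self.last is None:
            return None  # nothing observed yet
        if self.stride is None:
            return self.last  # only one observation: repeat it
        return self.last + self.stride

    def observe(self, value):
        if self.last is not None:
            self.stride = value - self.last
        self.last = value


def hit_rate(values, tolerance=1e-9):
    """Fraction of values that the predictor guessed before seeing them."""
    predictor = StridePredictor()
    hits = 0
    for value in values:
        guess = predictor.predict()
        if guess is not None and abs(guess - value) <= tolerance:
            hits += 1
        predictor.observe(value)
    return hits / len(values) if values else 0.0


# Hypothetical argument sequence, e.g. angles passed to a sine routine.
angles = [0.0, 0.5, 1.0, 1.5, 2.0]
rate = hit_rate(angles)
```

A background thread could use each correct guess to compute the function result ahead of time, which is why exact predictions matter for functions such as sine or cosine.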