Recent advances in GPU-based manycore accelerators provide the opportunity to process large-scale graphs efficiently on chip. However, real-world graphs exhibit a diverse range of topology and connectivity patterns (e.g., degree distributions), which makes the design of input-agnostic hardware architectures challenging. Network-on-Chip (NoC)-based architectures offer a way to overcome this challenge, as the architectural topology can be used to approximately model the traffic patterns expected to emerge from graph application workloads. In this paper, we first study the mix of long- and short-range on-chip traffic patterns generated by graph workloads, and then use the findings to adapt the design of an optimal NoC-based architecture. In particular, by leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms its traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns of a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate Near-Data Processing (NDP), complementing the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory intensive, off-chip data movement incurs latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, where the main memory layer is not integrated with the logic, this off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with the frequent and irregular memory accesses of graph-based applications. The proposed SWNoC-enabled NDP framework, which integrates 3D memory (such as Micron's HMC) with a massive number of GPU cores, achieves a 29.5% performance improvement and 30.03% lower energy consumption on average compared to a conventional planar mesh-based design with external DRAM.
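The power-law link placement at the heart of the SWNoC idea can be illustrated in a few lines. The sketch below is a minimal model, not the paper's method: it assumes an N x N grid of routers per layer and inserts long-range links with probability proportional to hop distance raised to -alpha; `alpha` and `num_links` are hypothetical parameters chosen for illustration.

```python
import numpy as np

# Minimal sketch of power-law link placement for a small-world NoC.
# Assumption: routers sit on an n x n grid; the probability of adding a
# long-range link between routers i and j decays as d(i, j)**(-alpha),
# where d is the Manhattan (hop) distance. `alpha` and `num_links` are
# illustrative parameters, not values from the paper.
def small_world_links(n=8, alpha=2.0, num_links=32, seed=0):
    rng = np.random.default_rng(seed)
    coords = [(x, y) for x in range(n) for y in range(n)]
    pairs, weights = [], []
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            d = abs(coords[i][0] - coords[j][0]) + abs(coords[i][1] - coords[j][1])
            if d > 1:                      # the grid already connects neighbors
                pairs.append((i, j))
                weights.append(d ** -alpha)
    weights = np.array(weights) / sum(weights)
    idx = rng.choice(len(pairs), size=num_links, replace=False, p=weights)
    return [pairs[k] for k in idx]

links = small_world_links()
print(f"placed {len(links)} long-range links, e.g. {links[:3]}")
```

The small-world property comes from the shape of the weights: most inserted links stay short (large weight at small d), while a few long shortcuts survive to cut the network diameter.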
Measuring Network Latency Variation Impacts to High Performance Computing Application Performance
In this paper, we study the impacts of latency variation, as distinct from latency mean, on application runtime, library performance, and packet delivery. Our contributions include the design and implementation of a network latency injector suitable for most QLogic and Mellanox InfiniBand cards. We fit statistical distributions of latency mean and variation to varying levels of network contention for a range of parallel application workloads, and we use these distributions to characterize how latency variation degrades application performance. The degradation caused by variation in network latency depends on application characteristics and can be significant: observed slowdowns range from none, for applications without communicating processes, to a factor of 3.5 for communication-intensive parallel applications. We support our results with statistical analysis of our experimental observations. For communication-intensive high-performance computing applications, we show statistically significant evidence that changes in performance correlate more strongly with changes in the variation of network latency than with changes in mean network latency alone.
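To make the final claim concrete, here is a toy sketch of the correlation analysis: given per-run measurements of mean latency, latency variation, and application runtime, compare how strongly runtime correlates with each. The synthetic data and coefficients below are assumptions for illustration only, not the paper's measurements.

```python
import numpy as np
from scipy import stats

# Toy sketch: does runtime track latency *variation* more closely than
# latency *mean*? The synthetic data stands in for per-run measurements;
# in the paper these come from the InfiniBand latency injector under
# varying contention levels.
rng = np.random.default_rng(1)
n = 50
mean_latency = rng.uniform(1.0, 10.0, n)   # microseconds (illustrative)
latency_var = rng.uniform(0.1, 5.0, n)
# Assume a communication-bound app whose runtime is driven mostly by variation.
runtime = 100 + 0.5 * mean_latency + 8.0 * latency_var + rng.normal(0, 2, n)

r_mean, p_mean = stats.pearsonr(mean_latency, runtime)
r_var, p_var = stats.pearsonr(latency_var, runtime)
print(f"runtime vs mean latency:      r={r_mean:.2f} (p={p_mean:.3g})")
print(f"runtime vs latency variation: r={r_var:.2f} (p={p_var:.3g})")
```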
- NSF-PAR ID: 10060492
- Date Published:
- Journal Name: ICPE '18 Proceedings of the 2018 ACM/SPEC International Conference on Performance Engineering
- Page Range / eLocation ID: 68 to 79
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like This
Performance variation deriving from hardware and software sources is common in modern scientific and data-intensive computing systems, and synchronization in parallel and distributed programs often exacerbates its impact at scale. The decentralized and emergent effects of such variation are, unfortunately, also difficult to systematically measure, analyze, and predict: modeling assumptions stringent enough to make analysis tractable frequently cannot be guaranteed at meaningful application scales, and longitudinal methods at such scales can require the capture and manipulation of impractically large amounts of data. This paper describes a new, scalable, and statistically robust approach for effective modeling, measurement, and analysis of large-scale performance variation in HPC systems. Our approach avoids the need to reason about complex distributions of runtimes among large numbers of individual application processes by focusing instead on the maximum length of distributed workload intervals. We describe this approach and its implementation in MPI, which makes it applicable to a diverse set of HPC workloads. We also present evaluations of these techniques for quantifying and predicting performance variation, carried out on large-scale computing systems, and discuss the strengths and limitations of the underlying modeling assumptions.
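A minimal sketch of the measurement primitive behind the max-interval idea, written with mpi4py rather than the paper's own MPI implementation: each rank times its local workload interval between synchronization points, and a single max-reduction recovers the quantity being modeled. The dummy workload is an assumption.

```python
from mpi4py import MPI
import time, random

# Sketch of max-interval measurement: each rank times its own work
# interval, and the modeled quantity is the maximum across ranks,
# i.e. the interval that gates collective progress. The sleep is a
# stand-in for real work, not the paper's benchmark.
comm = MPI.COMM_WORLD

def timed_interval():
    comm.Barrier()                             # common start of the interval
    t0 = MPI.Wtime()
    time.sleep(random.uniform(0.01, 0.02))     # stand-in workload
    return MPI.Wtime() - t0

local = timed_interval()
longest = comm.allreduce(local, op=MPI.MAX)    # slowest rank's interval
if comm.rank == 0:
    print(f"max interval across {comm.size} ranks: {longest * 1e3:.2f} ms")
```

Focusing on this single maximum, rather than the full joint distribution of per-rank runtimes, is what keeps the modeling tractable at scale.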
Recently, wireless communication technologies, such as Wireless Local Area Networks (WLANs), have gained increasing popularity in industrial control systems (ICSs) due to their low cost and ease of deployment, but the communication delays associated with these technologies make them unsuitable for critical real-time and safety applications. To address concerns about network-induced delays and bring the advantages of wireless technologies into modern ICSs, wireless network infrastructure based on the Parallel Redundancy Protocol (PRP) has been proposed. Although application-specific simulations and measurements have shown that wireless network infrastructure based on PRP can be a viable solution for critical applications with stringent delay-performance constraints, little has been done to devise an analytical framework that facilitates the adoption of wireless PRP infrastructure in diverse ICSs. Leveraging deterministic network calculus (DNC) theory, we analytically derive worst-case bounds on network-induced delays for critical ICS applications. We show that the problem of worst-case delay bounding for a wireless PRP network can be solved by performing network-calculus-based analysis on its non-feedforward traffic pattern. We derive closed-form expressions for worst-case delays, which have not been reported previously and which allow ICS architects and designers to compute worst-case delay bounds for ICS tasks in their respective application domains. Our analytical results not only provide insights into the impacts of network-induced delays on latency-critical tasks but also allow ICS architects and operators to assess whether a proper wireless PRP network infrastructure can be adopted into their systems.
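As a hedged illustration of the kind of closed-form result DNC yields (this is the standard single-server bound, not the paper's PRP-specific expressions): a flow with token-bucket arrival curve crossing a server that offers a rate-latency service curve has a worst-case delay bounded by

```latex
D_{\max} \;\le\; T + \frac{\sigma}{R},
\qquad \alpha(t) = \sigma + \rho\, t,
\quad \beta(t) = R\,(t - T)^{+},
\quad R \ge \rho .
```

The paper's contribution is deriving bounds of this closed-form style for the non-feedforward traffic pattern of a wireless PRP network, where the standard single-server result does not directly apply.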
Recent work has shown that lightweight virtualization such as Docker containers can be used in HPC to package applications with their runtime environments. In many respects, applications in containers perform similarly to native applications. Other work has shown that containers can have adverse effects on the latency variation of communications with the enclosed application. This latency variation may impact the performance of some HPC workloads, especially those dependent on synchronization between processes. In this work, we measure the latency characteristics of messages between Docker containers and then compare those measurements to the performance of real-world applications. Our specific goals are to: measure the changes in mean and variation of latency with Docker containers, study how this affects the synchronization time of MPI processes, and measure the impact of these factors on real-world applications such as the NAS Parallel Benchmarks (NPB).
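A minimal sketch of the kind of latency probe such a study relies on, assuming mpi4py and two ranks that may be placed in separate containers; the message size and iteration count are illustrative, not the paper's configuration.

```python
from mpi4py import MPI
import numpy as np

# Ping-pong latency probe between two ranks; when the ranks live in
# separate Docker containers, the mean and std of these samples are
# what the container networking layer can perturb. Sizes and counts
# are illustrative assumptions.
comm = MPI.COMM_WORLD
assert comm.size >= 2, "run with at least 2 ranks"
msg = bytearray(8)
samples = []

for _ in range(1000):
    comm.Barrier()
    t0 = MPI.Wtime()
    if comm.rank == 0:
        comm.Send(msg, dest=1); comm.Recv(msg, source=1)
    elif comm.rank == 1:
        comm.Recv(msg, source=0); comm.Send(msg, dest=0)
    samples.append((MPI.Wtime() - t0) / 2)     # one-way estimate

if comm.rank == 0:
    lat = np.array(samples[100:]) * 1e6        # drop warm-up, convert to us
    print(f"latency mean={lat.mean():.2f} us  std={lat.std():.2f} us")
```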
Resource disaggregation is a new architecture for data centers in which resources like memory and storage are decoupled from the CPU, managed independently, and connected through a high-speed network. Recent work has shown that although disaggregated data centers (DDCs) provide operational benefits, applications running on DDCs experience degraded performance due to extra network latency between the CPU and their working sets in main memory. DBMSs are an interesting case study for DDCs for two main reasons: (1) DBMSs normally process data-intensive workloads and require data movement between different resource components; and (2) disaggregation drastically changes the assumption that DBMSs can rely on their own internal resource management. We take the first step to thoroughly evaluate the query execution performance of production DBMSs in disaggregated data centers. We evaluate two popular open-source DBMSs (MonetDB and PostgreSQL) and test their performance with the TPC-H benchmark in a recently released operating system for resource disaggregation. We evaluate these DBMSs with various configurations and compare their performance with that of single-machine Linux with the same hardware resources. Our results confirm that significant performance degradation does occur, but, perhaps surprisingly, we also find settings in which the degradation is minor or where DDCs actually improve performance.
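As a sketch of the timing loop such an evaluation implies (not the paper's harness): run a TPC-H-style query against PostgreSQL and record wall-clock execution times. The connection string and the simplified Q6-style query below are hypothetical assumptions.

```python
import time
import psycopg2  # assumed driver; any DB-API client works similarly

# Sketch of a query-timing loop: execute a TPC-H-style query repeatedly
# and record wall-clock time. The DSN and the simplified Q6-style query
# are illustrative assumptions, not the paper's setup.
DSN = "dbname=tpch user=postgres host=localhost"  # hypothetical
QUERY = """
    SELECT SUM(l_extendedprice * l_discount) AS revenue
    FROM lineitem
    WHERE l_shipdate >= DATE '1994-01-01'
      AND l_shipdate <  DATE '1995-01-01'
      AND l_discount BETWEEN 0.05 AND 0.07
      AND l_quantity < 24;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    times = []
    for _ in range(5):                  # first run warms the cache
        t0 = time.perf_counter()
        cur.execute(QUERY)
        cur.fetchall()
        times.append(time.perf_counter() - t0)
    print(f"best of 5: {min(times):.3f}s  median: {sorted(times)[2]:.3f}s")
```

Running the same loop against single-machine Linux and against a disaggregated OS with identical hardware is what isolates the latency cost of remote memory.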