Electromigration (EM) remains one of the most important reliability concerns for VLSI systems, especially in the nanometer regime, and EM immortality checking is a key step in full-chip EM signoff analysis. In this paper, we propose a new EM immortality check method for multi-segment interconnects that accounts for the temperature gradients induced by Joule heating. Atomic migration driven by such temperature gradients, known as thermal migration, can be a significant force on metal atoms, and its impact grows as technology scales down. Compared to existing methods, the new method is the first to consider the spatial temperature gradient due to Joule heating in multi-segment wires. We derive the analytic solution to the resulting steady-state EM-thermal migration stress distribution problem, and then develop a temperature-aware, voltage-based EM immortality check that accounts for multi-segment thermal migration effects while retaining all the benefits of the recently proposed voltage-based EM immortality method for multi-segment interconnects. Numerical results on an IBM power grid benchmark and on self-synthesized power delivery networks show that the proposed temperature-aware EM immortality check is significantly more accurate than the recently proposed state-of-the-art EM immortality method.
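As context for the voltage-based check referenced above, one commonly cited form of the steady-state criterion (the notation here is a hedged sketch, not necessarily the paper's exact formulation) expresses the EM-induced hydrostatic stress at node $i$ of a multi-segment tree purely in terms of node voltages:

\[
\sigma_i \;=\; \frac{eZ}{\Omega}\,\bigl(\bar{V} - V_i\bigr),
\qquad
\text{immortal if } \max_i |\sigma_i| < \sigma_{\mathrm{crit}},
\]

where $e$ is the electron charge, $Z$ the effective charge number, $\Omega$ the atomic volume, $V_i$ the node voltage, and $\bar{V}$ a wire-volume-weighted average of the node voltages. The method described above augments this steady-state balance with an additional driving term arising from the Joule-heating-induced temperature gradient.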
Long-term reliability management for multitasking GPGPUs
This paper proposes long-term reliability management for spatial multitasking GPU architectures. Specifically, we focus on electromigration (EM)-induced long-term failure of the GPU's power delivery network. A distributed power delivery network model at functional-unit granularity is developed and used for our EM analysis of GPU architectures. We use a recently proposed physics-based EM reliability model and treat the EM-induced time-to-failure at the GPU system level as a reliability resource. For GPU scheduling, we focus on spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. We find that the existing reliability-agnostic thread block scheduler for spatial multitasking is effective in achieving high GPU utilization but yields poor reliability. We develop and implement a long-term reliability-aware thread block scheduler in GPGPU-Sim and compare it against the existing reliability-agnostic scheduler. We evaluate several use cases of spatial multitasking and find that our proposed scheduler achieves up to 30% improvement in long-term reliability.
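To make the scheduling idea concrete, below is a minimal sketch of how a spatial-multitasking partitioner might treat EM lifetime as a budgeted resource. All names here (the SM count, the per-application wear estimates, the weighting parameter) are illustrative assumptions; the scheduler described above operates on thread blocks inside GPGPU-Sim and uses a physics-based EM model of the power delivery network.

```python
# Hypothetical sketch of treating EM lifetime as a schedulable resource when
# partitioning SMs between co-running applications. Names and the wear model
# are assumptions for illustration only.

def partition_sms(num_sms, apps, em_wear, alpha=0.5):
    """Split SMs between apps, trading utilization against EM wear.

    apps    : list of (app_id, demand), demand in (0, 1] = fraction of SMs
              the application could usefully occupy.
    em_wear : dict app_id -> estimated EM wear rate the application imposes
              on the power delivery network (higher = more stress).
    alpha   : weight between utilization-driven and reliability-driven shares.
    """
    total_demand = sum(d for _, d in apps) or 1.0
    # Reliability share: proportionally more SMs for low-wear applications.
    inv_wear = {a: 1.0 / max(em_wear.get(a, 1.0), 1e-9) for a, _ in apps}
    total_inv = sum(inv_wear.values())

    alloc = {}
    for app_id, demand in apps:
        util_share = demand / total_demand
        rel_share = inv_wear[app_id] / total_inv
        share = alpha * util_share + (1 - alpha) * rel_share
        # Rounding may need a final adjustment pass to stay within num_sms.
        alloc[app_id] = max(1, round(share * num_sms))
    return alloc

# Example: the application that stresses the PDN more receives fewer SMs.
print(partition_sms(16, [("A", 0.7), ("B", 0.5)], {"A": 2.0, "B": 1.0}))
```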
- Award ID(s):
- 1854276
- PAR ID:
- 10148005
- Date Published:
- Journal Name:
- International Conference on Synthesis, Modeling, Analysis and Simulation Methods and Applications to Circuit Design (SMACD’19)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
This work proposes a new dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and reliability of commercial multi-core processors, taking into account workload-dependent thermal hot spot stress. The new method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core of a processor. Existing run-time thermal management, which relies on information from on-chip thermal sensors at fixed locations, can lead to suboptimal management decisions because the temperatures those sensors report may not reflect the true hot spots. The new method, called Hot-Trim, uses a machine learning-based approach to characterize the power-density hot spots across each core; a new task mapping/migration scheme is then developed based on the hot spot stresses. Compared to existing works, the new approach is the first to optimize VLSI reliability by exploiting workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and over temperature-based mapping are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor running the PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 to 1.31 °C. In addition, Hot-Trim improves the chip's EM-, NBTI-, and HCI-related reliability by 30.2%, 7.0%, and 31.1%, respectively, over the Linux baseline without any performance degradation. Furthermore, compared to the conventional temperature-based mapping technique, it improves EM- and HCI-related reliability by 29.6% and 19.6%, respectively, while reducing temperature by an additional half a degree.
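As an illustration of the mapping step, the sketch below greedily assigns the tasks with the highest predicted hot-spot power density to the cores with the least accumulated hot-spot stress. The predictor interface and the stress bookkeeping are assumptions for this sketch; Hot-Trim itself trains a machine-learning model to characterize per-core power-density hot spots and integrates with the OS task-migration path.

```python
# A minimal sketch of hot-spot-aware task mapping in the spirit of Hot-Trim.
# The per-workload hot-spot power predictor and the stress metric are
# assumptions made for illustration.

def map_tasks(tasks, cores, predict_hotspot_power, accumulated_stress):
    """Greedy mapping: hottest tasks go to the cores with the least
    accumulated hot-spot stress, spreading aging across the chip.

    tasks                 : list of task ids
    cores                 : list of core ids
    predict_hotspot_power : task_id -> predicted peak power density (W/mm^2)
    accumulated_stress    : dict core_id -> accumulated hot-spot stress metric
    """
    mapping = {}
    # Sort tasks from hottest to coolest predicted hot-spot power density.
    hot_first = sorted(tasks, key=predict_hotspot_power, reverse=True)
    for task in hot_first:
        # Pick the least-stressed core and update its stress bookkeeping.
        core = min(cores, key=lambda c: accumulated_stress[c])
        mapping[task] = core
        accumulated_stress[core] += predict_hotspot_power(task)
    return mapping
```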
-
As the era of high-frequency, single-core processors has come to a close, the new paradigm of many-core processors has come to dominate. In response, asynchronous multitasking runtime systems have been developed as a promising way to efficiently utilize this newly available hardware. Asynchronous multitasking runtime systems work by dividing a problem into a large number of fine-grained tasks. However, as the number of tasks created increases, the overheads associated with task creation and management cannot be ignored. Task inlining, in which the parent thread consumes a child thread, enables the runtime system to balance parallelism against its overhead. Because it is strongly influenced by the underlying processor architecture, the task-inlining decision is dynamic in nature. In this research, we present adaptive techniques for deciding, at runtime, whether a particular task should be inlined. We present two policies: a baseline policy that makes the inlining decision based on a fixed threshold, and an adaptive policy that determines the threshold dynamically at runtime. We also evaluate and justify the performance of these policies on different processor architectures. To the best of our knowledge, this is the first study of the impact of an adaptive runtime policy for task inlining in an asynchronous multitasking runtime system across different processor architectures. From our experiments, we find that the baseline policy improves execution time by 7.61% to 54.09%, and the adaptive policy improves over the baseline policy by up to 74%.
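The two policies can be sketched as follows; the threshold values, the adaptation rule (an exponential moving average of measured spawn overhead), and the cost estimates are assumptions for illustration rather than the runtime's actual implementation.

```python
# Illustrative sketch of a fixed-threshold and an adaptive inlining policy for
# an asynchronous multitasking runtime. Names and the adaptation rule are
# assumptions; the real decision sits on the task-spawn path of the runtime.

FIXED_THRESHOLD_US = 50.0   # baseline policy: inline tasks cheaper than this

def should_inline_baseline(estimated_task_cost_us):
    # Baseline policy: compare the task's estimated cost to a fixed threshold.
    return estimated_task_cost_us < FIXED_THRESHOLD_US

class AdaptiveInliner:
    """Adaptive policy: the threshold tracks observed spawn overhead."""

    def __init__(self, initial_threshold_us=50.0, smoothing=0.1):
        self.threshold_us = initial_threshold_us
        self.smoothing = smoothing

    def observe_spawn_overhead(self, overhead_us):
        # Exponential moving average of measured task-creation/management cost,
        # so the threshold adapts to the processor architecture at runtime.
        self.threshold_us = ((1 - self.smoothing) * self.threshold_us
                             + self.smoothing * overhead_us)

    def should_inline(self, estimated_task_cost_us):
        # Inline (parent executes the child directly) when the task is too
        # small to amortize the spawn overhead observed on this machine.
        return estimated_task_cost_us < self.threshold_us
```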
-
Recent advances in GPU-based manycore accelerators provide the opportunity to efficiently process large-scale graphs on chip. However, real-world graphs have a diverse range of topology and connectivity patterns (e.g., degree distributions) that make the design of input-agnostic hardware architectures a challenge. Network-on-Chip (NoC)-based architectures provide a way to overcome this challenge, as the architectural topology can be used to approximately model the expected traffic patterns that emerge from graph application workloads. In this paper, we first study the mix of long- and short-range traffic patterns generated on-chip by graph workloads, and subsequently use the findings to adapt the design of an optimal NoC-based architecture. In particular, by leveraging emerging three-dimensional (3D) integration technology, we propose the design of a small-world NoC (SWNoC)-enabled manycore GPU architecture, where the placement of the links connecting the streaming multiprocessors (SMs) and the memory controllers (MCs) follows a power-law distribution. The proposed 3D manycore GPU architecture outperforms traditional planar (2D) counterparts in both performance and energy consumption. Moreover, by adopting a joint performance-thermal optimization strategy, we address the thermal concerns of a 3D design without noticeably compromising the achievable performance. The 3D integration technology is also leveraged to incorporate Near Data Processing (NDP) to complement the performance benefits introduced by the SWNoC architecture. As graph applications are inherently memory intensive, off-chip data movement gives rise to latency and energy overheads in the presence of external DRAM. In conventional GPU architectures, where the main memory layer is not integrated with the logic, this off-chip data movement negatively impacts overall performance and energy consumption. We demonstrate that NDP significantly reduces the overheads associated with the frequent and irregular memory accesses of graph-based applications. The proposed SWNoC-enabled NDP framework, which integrates 3D memory (such as Micron's HMC) with a massive number of GPU cores, achieves 29.5% better performance and 30.03% lower energy consumption on average compared to a conventional planar mesh-based design with external DRAM.
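As a rough illustration of the small-world link placement, the sketch below samples long-range links between routers with probability decaying as a power law of their physical distance. The coordinates, exponent, and link budget are assumptions for this sketch; the actual architecture places SM-to-MC links in a 3D NoC under a joint performance-thermal optimization.

```python
# Hypothetical sketch of power-law ("small-world") link placement for a NoC.
# Router coordinates, the exponent, and the link budget are assumptions.

import math
import random

def small_world_links(routers, num_links, exponent=2.0, seed=0):
    """Sample inter-router links with probability ~ 1 / distance**exponent.

    routers : list of (x, y, z) router coordinates
    returns : set of (i, j) index pairs describing the sampled links
    """
    rng = random.Random(seed)
    pairs, weights = [], []
    for i in range(len(routers)):
        for j in range(i + 1, len(routers)):
            d = math.dist(routers[i], routers[j])
            pairs.append((i, j))
            # Closer router pairs are more likely to be connected.
            weights.append(1.0 / max(d, 1.0) ** exponent)

    links = set()
    while len(links) < min(num_links, len(pairs)):
        links.add(rng.choices(pairs, weights=weights, k=1)[0])
    return links
```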
-
This paper describes a new approach to register-pressure-aware instruction scheduling using Ant Colony Optimization (ACO). ACO is a nature-inspired optimization technique that researchers have successfully applied to NP-hard sequencing problems such as the Traveling Salesman Problem (TSP) and its derivatives. In this work, we describe an ACO algorithm for solving the long-standing compiler optimization problem of balancing Instruction-Level Parallelism (ILP) and Register Pressure (RP) in pre-allocation instruction scheduling. Three different cost functions are studied for estimating RP during instruction scheduling. The proposed ACO algorithm is implemented in the LLVM open-source compiler, and its performance is evaluated experimentally on three machines with three different instruction-set architectures: Intel x86, ARM, and AMD GPU. The proposed ACO algorithm is compared to an exact Branch-and-Bound (B&B) algorithm proposed in previous work. On x86 and ARM, both algorithms are evaluated relative to LLVM's generic scheduler, while on the AMD GPU they are evaluated relative to AMD's production scheduler. The experimental results show that, on SPECrate 2017 Floating Point, the proposed algorithm gives geometric-mean improvements of 1.13% and 1.25% in execution speed on x86 and ARM, respectively, relative to the LLVM scheduler. Using PlaidML on an AMD GPU, it gives a geometric-mean improvement of 7.14% in execution speed relative to the AMD scheduler. The proposed ACO algorithm produces approximately the same execution-time results as the B&B algorithm, with each algorithm outperforming the other on a substantial number of hard scheduling regions; ACO gives better results than B&B on many large instances on which B&B times out. Both ACO and B&B outperform the LLVM algorithm on the CPUs and the AMD algorithm on the GPU.
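For readers unfamiliar with ACO, the sketch below shows the basic construct-evaluate-reinforce loop applied to ordering a small dependency DAG. The pheromone model (position-instruction pairs), the parameters, and the cost function interface are assumptions for illustration; the actual algorithm runs inside LLVM's pre-allocation scheduler with register-pressure-aware cost functions.

```python
# Compact, illustrative ACO loop for ordering a dependency DAG of
# "instructions". Parameters, pheromone model, and cost function are
# assumptions made for the sketch.

import random

def aco_schedule(deps, cost_fn, n_ants=10, n_iters=50,
                 evaporation=0.1, alpha=1.0, seed=0):
    """deps    : dict instr -> set of instrs that must be scheduled earlier.
    cost_fn : maps a complete ordering (tuple) to a positive cost
              (e.g., schedule length plus a register-pressure penalty)."""
    rng = random.Random(seed)
    instrs = list(deps)
    # Pheromone on (position, instruction) pairs.
    tau = {(p, i): 1.0 for p in range(len(instrs)) for i in instrs}
    best, best_cost = None, float("inf")

    for _ in range(n_iters):
        orders = []
        for _ in range(n_ants):
            placed, order = set(), []
            while len(order) < len(instrs):
                # Ants may only pick instructions whose predecessors are done.
                ready = [i for i in instrs
                         if i not in placed and deps[i] <= placed]
                weights = [tau[(len(order), i)] ** alpha for i in ready]
                choice = rng.choices(ready, weights=weights, k=1)[0]
                order.append(choice)
                placed.add(choice)
            orders.append(tuple(order))

        # Evaporate, then reinforce the components of good orderings.
        for key in tau:
            tau[key] *= (1 - evaporation)
        for order in orders:
            c = cost_fn(order)
            if c < best_cost:
                best, best_cost = order, c
            for pos, instr in enumerate(order):
                tau[(pos, instr)] += 1.0 / c

    return best, best_cost
```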