The breakdown in Moore’s Law and Dennard Scaling is leading to drastic changes in the makeup and constitution of computing systems. For example, a single chip integrates 10-100s of cores and has a heterogeneous mix of general-purpose compute engines and highly specialized accelerators. Traditionally, computer architects have relied on tools like architectural simulators (e.g., Accel-Sim, gem5, gem5-SALAM, GPGPU-Sim, MGPUSim, Sniper-Sim, and ZSim) to accurately perform early stage prototyping and optimizations for the proposed research. However, as systems become increasingly complex and heterogeneous, architectural tools are straining to keep up. In particular, publicly available architectural simulators are often not very representative of the industry parts they intend to represent. This leads to a mismatch in expectations; when prototyping new optimizations in gem5 users may draw the wrong conclusions about the efficacy of proposed optimizations if the tool’s models do not provide high fidelity. In this work, we focus on the gem5 simulator, the most popular platform for computer system simulation. In recent years gem5 has been used by ∼20% of simulation-based papers published in top-tier computer architecture conferences per year. Moreover, gem5 can run entire systems, including CPUs, GPUs, and accelerators as well as the operating system, runtime, network and other related components (including multiple ISAs). Thus, gem5 has the potential to allow users to study the behavior of the entire heterogeneous systems. Unfortunately, some of gem5’s models do not always provide high accuracy relative to their ”real” counterparts. In particular, although gem5’s GPU model provides high accuracy internally at AMD [9], the publicly available gem5 GPU model is often inaccurate, especially for the memory subsystem. To understand this, we designed a series of microbenchmarks designed to expose the latencies, bandwidths, and sizes of a variety of GPU components on real AMD GPUs. Our results showed that while gem5’s GPU microarchitecture was relatively accurate (within 5-10% in most cases), gem5’s memory subsytem was off by an average of 272% (645% max) for latency and 70% (693% max) for bandwidth. Accordingly, to help bridge this divide, we propose to design and use a new tool, GPU Accuracy Profiler (GAP), to compare and improve the behavior of gem5’s simulated GPUs relative to real GPUs. By iteratively applying fixes and improvements to gem’s GPU model via GAP, we will significantly improved its fidelity relative to real AMD GPUs. Although this work is still ongoing, our preliminary results show significant promise: on average 25% error for latency and 16% error for bandwidth, respectively. Overall, by completing this work we hope to enable more widespread adoption of gem5 as an accurate platform for heterogeneous architecture research.
more »
« less
Improving gem5’s GPUFS Support
With the waning of Moore’s Law and the end of Dennard’s Scaling, systems are turning towards heterogeneity, mixing conventional cores and specialized accelerators to continue scaling performance and energy efficiency. Specialized accelerators are frequently used to improve the efficiency of computations that run inefficiently on conventional, general-purpose processors. As a result, systems ranging from smartphones to data-centers, hyper-scalars, and supercomputers are increasingly using large numbers of accelerators to provide better efficiency than CPU-based solutions. However, heterogeneous systems face key challenges: changes to the underlying technology which threaten continued scaling, as well as the voracious scaling from applications, which require additional research to address. Traditionally, simulators could be used to perform early exploration for this research. However, existing simulators lack important support for these key challenges. Detailed simulation of modern systems can take extremely long times in existing tools and infrastructure. Furthermore, prototyping optimizations at scale can also be challenging, especially for newly proposed accelerators. Although other simulators such as Accel-Sim, SCALE-Sim, and Gemmini enable some early experiments, they are limited in their ability to target a wide variety of accelerators. In comparison, gem5 has support for various CPUs, GPUs, DSPs, and many other important accelerators. However, efficiently simulating large-scale workloads on gem5’s cycle-level models requires prohibitively long times. We aim to enhance gem5’s support to make running these workloads practical while retaining accuracy.
more »
« less
- Award ID(s):
- 1925485
- PAR ID:
- 10468164
- Publisher / Repository:
- 5th gem5 Users' Workshop
- Date Published:
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Computer systems research heavily relies on simulation tools like gem5 to effectively prototype and validate new ideas. However, publicly available simulators struggle to accurately model systems as architectures evolve rapidly. This is a major issue because incorrect simulator models may lead researchers to draw misleading or even incorrect conclusions about their research prototypes from these simulators. Although this challenge pertains to many open source simulators, we focus on the widely used, open source gem5 simulator. In GAP we showed that gem5’s GPGPU models have significant correlation issues versus real hardware. GAP also improved the fidelity of gem5’s AMDGPU model, particularly for cache access latencies and bandwidths. However, one critical issue remains: our microbenchmarks reveal 88% error in memory bandwidth between gem5’s current model and corresponding real AMD GPUs. To narrow this gap, we examined recent patents and gem5’s memory system bottlenecks, then made several improvements including: utilizing a redesigned HBM memory controller, enhancing TLB request coalescing, adding support for multiple page sizes, adding a page walk cache, and improving network bandwidth modeling. Collectively, these optimizations significantly improve gem5’s GPU memory bandwidth by 3.8x: from 153 GB/s to 583 GB/s. Moreover, our address translation enhancements can be ported to other ISAs where similar support is also needed, improving gem5’s MMU support.more » « less
-
Power consumption has increasingly become a first-class design constraint to satisfy requirements for scientific workloads and other widely used workloads, such as machine learning. To meet performance and power requirements, system designers often use architectural simulators, such as gem5, to model component and system-level behavior. However, performance and power modeling tools are often isolated and do not make it accessible to integrate with one another for rapid performance and power system co-design. Although studies have previously explored power modeling with gem5 and validation on real hardware, there are several flaws with this approach. First, power models are sometimes not open source, making it difficult to apply them to different simulated systems. The current interface for implementing power models in gem5 also relies on hard-coded strings provided by the user to model dynamic and static power. This makes defining power models for components cumbersome and restrictive, as gem5’s MathExpr string formula parser has support for limited mathematical operations. Third, previous works only implement one form of power model for one component. This unnecessarily limits users from combining other power models, which may model certain system components with higher accuracy. Instead, we posit that decoupling how power models are integrated with simulators from the design of power models themselves will enable better power modeling in simulators. Accordingly, we extend our prior work on designing and implementing an extensible, generalizable power modeling interface by integrating support for McPAT into it and validating it emits correct power values.more » « less
-
In this work, we set out to find the answers to the following questions: (1) Where are the bottlenecks in a state-of-theart architectural simulator? (2) How much faster can architectural simulations run by tuning system configurations? (3) What are the opportunities in accelerating software simulation using hardware accelerators? We choose gem5 as the representative architectural simulator, run several simulations with various configurations, perform a detailed architectural analysis of the gem5 source code on different server platforms, tune both system and architectural settings for running simulations, and discuss the future opportunities in accelerating gem5 as an important application. Our detailed profiling of gem5 reveals that its performance is extremely sensitive to the size of the Ll cache. Our experimental results show that a RISC-V core with 32KB data and instruction cache improves gem5’s simulation speed by 31%-61% compared with a baseline core with 8KB Ll caches. Our paper is the first step toward building specialized hardware and software environments for accelerating software-based simulators.more » « less
-
In recent years deep neural networks (DNNs) have emerged as an important application domain driving the requirements for future systems. As DNNs get more sophisticated, their compute requirements and the datasets they are trained on continue to grow at a fast rate. For example, Gholami showed that compute in Transformer networks grew 750X over 2 years, while other work projects DNN compute and memory requirements to grow by 1.5X per year. Given their growing requirements and importance, heterogeneous systems often add machine learning (ML) specific features (e.g., TensorCores) to improve their efficiency. However, given ML’s voracious rate of growth and size, there is a growing challenge in performing early-system exploration based on sound simulation methodology. In this work we discuss our efforts to enhance gem5’s support to make these workloads practical to run while retaining accuracy.more » « less
An official website of the United States government
