

Search for: All records

Creators/Authors contains: "Sinclair, Matthew D."

Note: Clicking a Digital Object Identifier (DOI) link will take you to an external site maintained by the publisher. Some full-text articles may not be available without a charge during the publisher's embargo period.

Some links on this page may take you to non-federal websites. Their policies may differ from those of this site.

  1. Chiplets are transforming computer system designs, allowing system designers to combine heterogeneous computing resources at unprecedented scales. Breaking larger, monolithic chips into smaller, connected chiplets helps performance continue scaling, avoids die size limitations, improves yield, and reduces design and integration costs. However, chiplet-based designs introduce an additional level of hierarchy, which causes indirection and non-uniformity. This clashes with typical heterogeneous systems: unlike CPU-based multi-chiplet systems, heterogeneous systems lack significant OS support and the complex coherence protocols that mitigate the impact of this indirection. Thus, exploiting locality across application phases is harder in multi-chiplet heterogeneous systems. We propose CPElide, which utilizes information already available in heterogeneous systems' embedded microprocessor (the command processor) to track inter-chiplet data dependencies and perform implicit synchronization only when necessary, instead of conservatively as in the state-of-the-art HMG. Across 24 workloads, CPElide improves average performance by 13% and 19%, energy by 14% and 11%, and network traffic by 14% and 17% over current approaches and HMG, respectively.
    Free, publicly-accessible full text available November 4, 2025
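    To make the dependency-tracking idea above concrete, here is a minimal Python sketch of a command processor eliding implicit synchronization at kernel boundaries; all class and field names are hypothetical illustrations under our own assumptions, not CPElide's actual implementation.

        from dataclasses import dataclass, field

        @dataclass
        class Kernel:
            reads: set = field(default_factory=set)
            writes: set = field(default_factory=set)

        class CommandProcessor:
            """Tracks which chiplet last wrote each buffer."""
            def __init__(self):
                self.last_writer = {}  # buffer name -> chiplet id

            def flush_and_invalidate(self):
                print("sync: flushing and invalidating chiplet caches")

            def launch_kernel(self, kernel, chiplet):
                # Synchronize only if a buffer this kernel touches was last
                # written by a *different* chiplet; otherwise elide the sync.
                touched = kernel.reads | kernel.writes
                if any(self.last_writer.get(b, chiplet) != chiplet for b in touched):
                    self.flush_and_invalidate()  # conservative path (as in HMG)
                for b in kernel.writes:
                    self.last_writer[b] = chiplet

        cp = CommandProcessor()
        cp.launch_kernel(Kernel(writes={"A"}), chiplet=0)
        cp.launch_kernel(Kernel(reads={"A"}, writes={"B"}), chiplet=0)  # sync elided
        cp.launch_kernel(Kernel(reads={"A"}), chiplet=1)                # sync required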
  2. Large-scale computing systems are increasingly using accelerators such as GPUs to enable peta- and exa-scale levels of compute to meet the needs of Machine Learning (ML) and scientific computing applications. Given the widespread and growing use of ML, including in some scientific applications, optimizing these clusters for ML workloads is particularly important. However, recent work has demonstrated that accelerators in these clusters can suffer from performance variability, which can lead to resource under-utilization and load imbalance. In this work we focus on how cluster schedulers, which are used to share accelerator-rich clusters across many concurrent ML jobs, can embrace performance variability to mitigate its effects. Our key insight is to characterize which applications are more likely to suffer from performance variability and to take that into account when placing jobs on the cluster. We design a novel cluster scheduler, PAL, which uses performance variability measurements and application-specific profiles to improve job performance and resource utilization. PAL also balances performance variability with locality to ensure jobs are spread across as few nodes as possible. Overall, PAL significantly improves GPU-rich cluster scheduling: across traces for six ML workload applications spanning image, language, and vision models with a variety of variability profiles, PAL improves geomean job completion time by 42%, cluster utilization by 28%, and makespan by 47% over existing state-of-the-art schedulers.
    Free, publicly-accessible full text available November 18, 2025
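    As a rough illustration of how a scheduler might trade variability against locality, the following Python sketch scores candidate placements; the scoring function, weights, and node fields are our hypothetical assumptions, not PAL's actual policy.

        from itertools import combinations

        def placement_score(job, nodes, alpha=0.5):
            # Expected slowdown: variability-sensitive jobs suffer more on
            # high-variability nodes (the worst node gates synchronous ML jobs).
            slowdown = job["sensitivity"] * max(n["variability"] for n in nodes)
            # Locality penalty grows with the number of nodes spanned.
            spread = (len(nodes) - 1) / len(nodes)
            return alpha * slowdown + (1 - alpha) * spread

        def place(job, cluster):
            # Prefer the fewest nodes that fit the job; break ties by score.
            for k in range(1, len(cluster) + 1):
                fits = [s for s in combinations(cluster, k)
                        if sum(n["free_gpus"] for n in s) >= job["gpus"]]
                if fits:
                    return min(fits, key=lambda s: placement_score(job, s))
            return None

        cluster = [{"name": "n0", "free_gpus": 4, "variability": 0.05},
                   {"name": "n1", "free_gpus": 4, "variability": 0.20}]
        print(place({"gpus": 4, "sensitivity": 0.8}, cluster))  # picks stable node n0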
  3. In recent years, deep neural networks (DNNs) have emerged as an important application domain driving the requirements for future systems. As DNNs get more sophisticated, their compute requirements and the datasets they are trained on continue to grow at a fast rate. For example, Gholami showed that compute in Transformer networks grew 750X over 2 years, while other work projects DNN compute and memory requirements to grow by 1.5X per year. Given their growing requirements and importance, heterogeneous systems often add machine learning (ML) specific features (e.g., TensorCores) to improve their efficiency. However, given ML's voracious growth in size and compute demand, it is increasingly challenging to perform early system exploration based on sound simulation methodology. In this work we discuss our efforts to enhance gem5's support to make these workloads practical to run while retaining accuracy.
    Free, publicly-accessible full text available June 29, 2025
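    For scale, a quick back-of-the-envelope comparison of the two growth rates cited above (our arithmetic, not the paper's):

        # 750x over 2 years implies roughly sqrt(750) ~= 27x per year for
        # Transformer compute, versus the ~1.5x/year projection for DNNs overall.
        transformer_per_year = 750 ** 0.5
        print(f"~{transformer_per_year:.0f}x/year vs. 1.5x/year")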
  4. The breakdown of Moore's Law and Dennard Scaling is leading to drastic changes in the makeup and constitution of computing systems. For example, a single chip integrates 10-100s of cores and has a heterogeneous mix of general-purpose compute engines and highly specialized accelerators. Traditionally, computer architects have relied on tools like architectural simulators (e.g., Accel-Sim, gem5, gem5-SALAM, GPGPU-Sim, MGPUSim, Sniper-Sim, and ZSim) to accurately perform early-stage prototyping and optimization of proposed research. However, as systems become increasingly complex and heterogeneous, architectural tools are straining to keep up. In particular, publicly available architectural simulators are often not very representative of the industry parts they intend to model. This leads to a mismatch in expectations: when prototyping new optimizations in gem5, users may draw the wrong conclusions about the efficacy of proposed optimizations if the tool's models do not provide high fidelity. In this work, we focus on the gem5 simulator, the most popular platform for computer system simulation. In recent years gem5 has been used by ∼20% of simulation-based papers published in top-tier computer architecture conferences per year. Moreover, gem5 can run entire systems, including CPUs, GPUs, and accelerators as well as the operating system, runtime, network, and other related components (including multiple ISAs). Thus, gem5 has the potential to allow users to study the behavior of entire heterogeneous systems. Unfortunately, some of gem5's models do not always provide high accuracy relative to their "real" counterparts. In particular, although gem5's GPU model provides high accuracy internally at AMD [9], the publicly available gem5 GPU model is often inaccurate, especially for the memory subsystem. To understand this, we designed a series of microbenchmarks to expose the latencies, bandwidths, and sizes of a variety of GPU components on real AMD GPUs. Our results showed that while gem5's GPU microarchitecture was relatively accurate (within 5-10% in most cases), gem5's memory subsystem was off by an average of 272% (645% max) for latency and 70% (693% max) for bandwidth. Accordingly, to help bridge this divide, we propose to design and use a new tool, GPU Accuracy Profiler (GAP), to compare and improve the behavior of gem5's simulated GPUs relative to real GPUs. By iteratively applying fixes and improvements to gem5's GPU model via GAP, we have significantly improved its fidelity relative to real AMD GPUs. Although this work is still ongoing, our preliminary results show significant promise: on average, 25% error for latency and 16% error for bandwidth. Overall, by completing this work we hope to enable more widespread adoption of gem5 as an accurate platform for heterogeneous architecture research.
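    The comparison loop at the heart of a tool like GAP can be sketched in a few lines of Python; the component names and numbers below are illustrative placeholders, not our measured results.

        def percent_error(sim, real):
            return abs(sim - real) / real * 100

        def profile(components, threshold=25.0):
            # Flag any simulated component whose latency error vs. real
            # hardware exceeds the threshold, as a target for model fixes.
            for name, (real_ns, sim_ns) in components.items():
                err = percent_error(sim_ns, real_ns)
                status = "NEEDS FIX" if err > threshold else "ok"
                print(f"{name:10s} real={real_ns:7.1f}ns sim={sim_ns:7.1f}ns "
                      f"err={err:6.1f}% [{status}]")

        # component -> (measured on a real GPU, reported by gem5); fake numbers
        profile({"L1 hit": (120.0, 130.0),
                 "L2 hit": (230.0, 600.0),
                 "DRAM":   (480.0, 1700.0)})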
  5.
  6. In 2018, AMD added support for an updated gem5 GPU model based on their GCN3 architecture. Having a high-fidelity GPU model allows for more accurate research into optimizing modern GPU applications. However, the complexity of obtaining the libraries and drivers this model needs to run GPU applications in gem5 made it difficult to use. This post describes our work on increasing the usability of the GPU model by simplifying the setup process, extending the types of applications that can be run, and optimizing parts of the software stack used by the GPU model.
  7. In the past decade, GPUs have become an important resource for compute-intensive, general-purpose GPU applications such as machine learning, big data analysis, and large-scale simulations. In the future, with the explosion of machine learning and big data, application demands will keep increasing, resulting in more data and computation being pushed to GPUs. However, due to the slowing of Moore’s Law and rising manufacturing costs, it is becoming more and more challenging to add compute resources into a single GPU device to improve its throughput. As a result, spreading work across multiple GPUs is popular in data-centric and scientific applications. For example, Facebook uses 8 GPUs per server in their recent machine learning platform. However, research infrastructure has not kept pace with this trend: most GPU hardware simulators, including gem5, only support a single GPU. Thus, it is hard to study interference between GPUs, communication between GPUs, or work scheduling across GPUs. Our research group has been working to address this shortcoming by adding multi-GPU support to gem5. Here, we discuss the changes that were needed, which included updating the emulated driver, GPU components, and coherence protocol. 
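    As a conceptual sketch (deliberately not gem5's real SimObject API), the driver-side change amounts to tracking per-GPU state rather than assuming a single device:

        class EmulatedDriver:
            """Toy model of an emulated GPU driver extended for multiple GPUs."""
            def __init__(self, num_gpus):
                # One queue/doorbell namespace per GPU instead of one global one.
                self.gpus = [{"id": i, "queues": []} for i in range(num_gpus)]

            def create_queue(self, gpu_id):
                q = {"gpu": gpu_id, "doorbell": len(self.gpus[gpu_id]["queues"])}
                self.gpus[gpu_id]["queues"].append(q)
                return q

        driver = EmulatedDriver(num_gpus=2)
        print(driver.create_queue(0), driver.create_queue(1))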
  8.
  9. Deterministic execution for GPUs is a desirable property, as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety-critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are initially isolated and later made visible in a deterministic order. This allows the GPU to execute deterministically in parallel without serializing its threads for atomic operations, as GPUDet does. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
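    The core idea can be illustrated with a small Python model (ours, not DAB's hardware design): floating-point atomic adds are order-sensitive, so buffering them per thread and draining the buffers in fixed thread-ID order makes the result reproducible.

        import random

        def reduction(num_threads=4, ops=16, seed=0):
            rng = random.Random(seed)
            vals = [[rng.uniform(0, 1) for _ in range(ops)]
                    for _ in range(num_threads)]

            # Baseline: atomics reach memory in an arbitrary interleaving,
            # and FP addition is non-associative, so low bits can vary.
            interleaved = [v for per_thread in vals for v in per_thread]
            random.shuffle(interleaved)          # unseeded: differs per run
            baseline = sum(interleaved)

            # DAB-style: buffer per thread, drain in deterministic ID order.
            deterministic = 0.0
            for t in range(num_threads):
                for v in vals[t]:
                    deterministic += v
            return baseline, deterministic

        b1, d1 = reduction()
        b2, d2 = reduction()
        print(d1 == d2)  # True every run
        print(b1 == b2)  # may be False: depends on the interleaving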
  10. The open-source and community-supported gem5 simulator is one of the most popular tools for computer architecture research. This simulation infrastructure allows researchers to model modern computer hardware at the cycle level, and it has enough fidelity to boot unmodified Linux-based operating systems and run full applications for multiple architectures, including x86, Arm, and RISC-V. The gem5 simulator has been under active development over the nine years since the original gem5 release. In that time, there have been over 7,500 commits to the codebase from over 250 unique contributors, who have improved the simulator by adding new features, fixing bugs, and increasing the code quality. In this paper, we give an overview of gem5's usage and features, describe the current state of the gem5 simulator, and enumerate the major changes since the initial release of gem5. We also discuss how the gem5 simulator has transitioned to a formal governance model to enable continued improvement and community support for the next 20 years of computer architecture research.
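    To give a flavor of what configuring gem5 looks like, here is a minimal syscall-emulation (SE) mode script in the style of gem5's "Learning gem5" tutorial; port names follow recent gem5 releases, and the binary path is the tutorial's hello-world test program.

        import m5
        from m5.objects import *

        # One timing CPU connected to a single memory channel over a crossbar.
        system = System()
        system.clk_domain = SrcClockDomain(clock="1GHz",
                                           voltage_domain=VoltageDomain())
        system.mem_mode = "timing"
        system.mem_ranges = [AddrRange("512MB")]

        system.cpu = TimingSimpleCPU()
        system.membus = SystemXBar()
        system.cpu.icache_port = system.membus.cpu_side_ports
        system.cpu.dcache_port = system.membus.cpu_side_ports

        # x86 builds also need the interrupt controller wired to the bus.
        system.cpu.createInterruptController()
        system.cpu.interrupts[0].pio = system.membus.mem_side_ports
        system.cpu.interrupts[0].int_requestor = system.membus.cpu_side_ports
        system.cpu.interrupts[0].int_responder = system.membus.mem_side_ports

        system.mem_ctrl = MemCtrl()
        system.mem_ctrl.dram = DDR3_1600_8x8()
        system.mem_ctrl.dram.range = system.mem_ranges[0]
        system.mem_ctrl.port = system.membus.mem_side_ports
        system.system_port = system.membus.cpu_side_ports

        # Run a hello-world binary shipped with gem5 in SE mode.
        binary = "tests/test-progs/hello/bin/x86/linux/hello"
        system.workload = SEWorkload.init_compatible(binary)
        process = Process()
        process.cmd = [binary]
        system.cpu.workload = process
        system.cpu.createThreads()

        root = Root(full_system=False, system=system)
        m5.instantiate()
        event = m5.simulate()
        print(f"Exited @ tick {m5.curTick()}: {event.getCause()}")

    Run it as build/X86/gem5.opt path/to/this_script.py, assuming an X86 gem5 build.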