Title: Evaluating Gather and Scatter Performance on CPUs and GPUs
This paper describes a new benchmark tool, Spatter, for assessing memory system architectures in the context of a specific category of indexed accesses known as gather and scatter. These operations are increasingly used to express sparse and irregular data access patterns, and they have widespread utility in many modern HPC applications, including scientific simulations, data mining and analysis computations, and graph processing. However, many traditional benchmarking tools like STREAM, STRIDE, and GUPS characterize only uniform-stride or fully random accesses, despite evidence that modern applications use varied sets of more complex access patterns. Spatter is an open-source benchmark that provides a tunable and configurable framework for benchmarking a variety of indexed access patterns, including the variations of gather/scatter seen in the HPC mini-apps evaluated in this work. The design of Spatter includes backends for OpenMP and CUDA, and experiments show how it can be used to evaluate 1) uniform access patterns for CPUs and GPUs, 2) prefetching regimes for gather/scatter, 3) compiler implementations of vectorization for gather/scatter, and 4) trace-driven "proxy patterns" that reflect the patterns found in multiple applications. The results from Spatter experiments show, for instance, that GPUs typically outperform CPUs for these operations in absolute bandwidth but not as a fraction of peak bandwidth, and that Spatter can better represent the performance of some cache-dependent mini-apps than traditional STREAM bandwidth measurements.
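To make the pattern concrete, here is a minimal OpenMP sketch of the two kernel shapes Spatter times. This is illustrative only and not Spatter's actual implementation; the real benchmark parameterizes the index buffer to reproduce application-derived patterns.

    #include <stddef.h>

    /* Gather: read sparse source locations selected by an index
     * buffer into a dense destination. Illustrative sketch only,
     * not Spatter's actual code. */
    void gather(double *dense, const double *sparse,
                const size_t *idx, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            dense[i] = sparse[idx[i]];
    }

    /* Scatter: write a dense source out to sparse destination
     * locations selected by the index buffer. */
    void scatter(double *sparse, const double *dense,
                 const size_t *idx, size_t n)
    {
        #pragma omp parallel for
        for (size_t i = 0; i < n; i++)
            sparse[idx[i]] = dense[i];
    }

The bandwidth achieved by loops of this shape depends heavily on the structure of idx, which is why a single uniform-stride or random measurement cannot stand in for all applications.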
Award ID(s):
1710371
PAR ID:
10199354
Date Published:
Journal Name:
The International Symposium on Memory Systems (MEMSYS 2020)
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This work is focused on analyzing potential performance improvements of HPC applications using stacked memories like the Hybrid Memory Cube (HMC). We target an HPC sparse direct solver library, SuperLU [4], which performs LU decomposition and is a core piece of simulation codes like NIMROD [1]. To accelerate this library, we are interested in mapping both the computationally intensive Sparse Matrix-Vector (SpMV) kernels, which can be implemented using matrix-matrix multiply (GEMM) calls, and memory-intensive primitives like Scatter and Gather to a reconfigurable fabric tightly integrated with a 3D stacked memory. Here we provide initial results on mapping GEMM to OpenCL-based devices as well as a trace-driven evaluation of SuperLU's memory accesses with a combined FPGA and HMC platform.
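    For context on the kernel being mapped, a minimal CSR sparse matrix-vector multiply is sketched below. SuperLU's actual kernels are supernodal and blocked, calling GEMM on dense submatrices, so this sketch shows only the element-wise gather pattern that makes SpMV memory-intensive.

        #include <stddef.h>

        /* Minimal CSR SpMV: y = A*x. A sketch of the access pattern
         * only; SuperLU works on supernodal blocks and calls GEMM on
         * dense submatrices rather than looping element-by-element. */
        void spmv_csr(size_t nrows, const size_t *rowptr,
                      const size_t *colidx, const double *val,
                      const double *x, double *y)
        {
            for (size_t r = 0; r < nrows; r++) {
                double sum = 0.0;
                for (size_t k = rowptr[r]; k < rowptr[r + 1]; k++)
                    sum += val[k] * x[colidx[k]];  /* indexed gather from x */
                y[r] = sum;
            }
        }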
  2. A new mobile computing paradigm, dubbed the mini-app, has grown rapidly over the past few years since being introduced by WeChat in 2017. In this paradigm, a host app allows its end users to install and run mini-apps inside itself, enabling the host app to build an ecosystem around itself (much like Google Play and the Apple App Store), enrich the host's functionalities, and offer mobile users added convenience without leaving the host app. It has been reported that there are millions of mini-apps in WeChat. However, little is known about these mini-apps at an aggregated level. In this paper, we present MiniCrawler, the first scalable and open-source WeChat mini-app crawler, which has indexed over 1,333,308 mini-apps. It leverages a number of reverse engineering techniques to uncover the interfaces and APIs in WeChat for crawling the mini-apps. With the crawled mini-apps, we then measure their resource consumption, API usage, library usage, obfuscation rate, app categorization, and app ratings at an aggregated level. The details of how we developed MiniCrawler and our measurement results are reported in this paper.
  3. Because of severe limitations in technology scaling, architects have innovated in specializing general-purpose processors for computation primitives (e.g., vector instructions, loop accelerators). The general principle is exposing rich semantics to the ISA. An opportunity to explore is whether richer semantics of memory access patterns could also be used to improve the efficiency of memory and communication. Two important open questions are how to convey higher-level memory information and how to take advantage of this information in hardware. We find that a majority of memory accesses follow a small number of simple patterns, which we term streams (e.g., affine, indirect). Streams can often be decoupled from core execution, and their patterns persist long enough to express useful behavior. Our approach is therefore to express streams as ISA primitives, which we argue can enable: prefetching stream accesses to hide memory latency, semi-binding decoupled access to remove address computation and optimize the memory interface, and finally informing cache policies. In this work, we propose ISA extensions for decoupled streams, which interact with the core through a FIFO-based interface. We implement optimizations for each of the aforementioned opportunities on an aggressive wide-issue OOO core and evaluate with SPEC CPU 2017 and CortexSuite [1, 2]. Across all workloads, we observe about 1.37× speedup and energy efficiency improvement over hardware stride prefetching.
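    A hedged C sketch of the two stream shapes named above; the ISA extensions and FIFO interface themselves are hardware-level mechanisms and are not reproduced here.

        #include <stddef.h>

        /* Affine stream: the address is a linear function of the
         * induction variable, so the full access sequence is known
         * in advance and can be decoupled from core execution. */
        double sum_affine(const double *a, size_t n, size_t stride)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++)
                s += a[i * stride];
            return s;
        }

        /* Indirect stream: addresses come from a second (index)
         * stream, a[b[i]]; decoupling lets hardware fetch b ahead
         * of time and then chase the dependent accesses into a. */
        double sum_indirect(const double *a, const size_t *b, size_t n)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++)
                s += a[b[i]];
            return s;
        }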
  4. The continued growth in the processing power of FPGAs, coupled with high-bandwidth memory (HBM), makes systems like the Xilinx U280 credible platforms for linear solvers, which often dominate the run time of scientific and engineering applications. In this paper, we present Callipepla, an accelerator for a preconditioned conjugate gradient (CG) linear solver. FPGA acceleration of CG faces three challenges: (1) how to support an arbitrary problem and terminate acceleration processing on the fly, (2) how to coordinate long-vector data flow among processing modules, and (3) how to save off-chip memory bandwidth while maintaining double-precision (FP64) accuracy. To tackle these three challenges, we present (1) a stream-centric instruction set for efficient streaming processing and control, (2) vector streaming reuse (VSR) and decentralized vector flow scheduling to coordinate vector data flow among modules and further reduce off-chip memory access latency with a double memory channel design, and (3) a mixed-precision scheme to save bandwidth yet still achieve effective double-precision-quality solutions. To the best of our knowledge, this is the first work to introduce the concept of VSR for data reuse between on-chip modules, reducing unnecessary off-chip accesses and enabling modules to work in parallel in FPGA accelerators. We prototype the accelerator on a Xilinx U280 HBM FPGA. Our evaluation shows that, compared to the Xilinx HPC product XcgSolver, Callipepla achieves a 3.94× speedup, 3.36× higher throughput, and 2.94× better energy efficiency. Compared to an NVIDIA A100 GPU, which has 4× the memory bandwidth of Callipepla, we still achieve 77% of its throughput with 3.34× higher energy efficiency. The code is available at https://github.com/UCLA-VAST/Callipepla.
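    To illustrate the per-iteration dataflow the accelerator must stream, here is a textbook unpreconditioned CG in C: each iteration is one SpMV plus several dot products and AXPYs over long vectors. This is a sketch for orientation only; Callipepla adds a preconditioner, a stream-centric instruction set, and a mixed-precision scheme not shown here.

        #include <stdlib.h>
        #include <math.h>

        static void spmv(size_t n, const size_t *rp, const size_t *ci,
                         const double *v, const double *x, double *y)
        {
            for (size_t r = 0; r < n; r++) {
                double s = 0.0;
                for (size_t k = rp[r]; k < rp[r + 1]; k++)
                    s += v[k] * x[ci[k]];
                y[r] = s;
            }
        }

        static double dot(size_t n, const double *a, const double *b)
        {
            double s = 0.0;
            for (size_t i = 0; i < n; i++) s += a[i] * b[i];
            return s;
        }

        /* Textbook unpreconditioned CG for an SPD matrix in CSR form. */
        void cg(size_t n, const size_t *rp, const size_t *ci,
                const double *v, const double *b, double *x,
                double tol, int maxit)
        {
            double *r  = malloc(n * sizeof *r);
            double *p  = malloc(n * sizeof *p);
            double *Ap = malloc(n * sizeof *Ap);

            spmv(n, rp, ci, v, x, Ap);               /* r = b - A*x */
            for (size_t i = 0; i < n; i++) { r[i] = b[i] - Ap[i]; p[i] = r[i]; }
            double rr = dot(n, r, r);

            for (int it = 0; it < maxit && sqrt(rr) > tol; it++) {
                spmv(n, rp, ci, v, p, Ap);           /* the SpMV each iteration */
                double alpha = rr / dot(n, p, Ap);
                for (size_t i = 0; i < n; i++) {     /* two AXPYs */
                    x[i] += alpha * p[i];
                    r[i] -= alpha * Ap[i];
                }
                double rr_new = dot(n, r, r);
                double beta = rr_new / rr;
                for (size_t i = 0; i < n; i++)       /* direction update */
                    p[i] = r[i] + beta * p[i];
                rr = rr_new;
            }
            free(r); free(p); free(Ap);
        }

    The repeated long-vector traffic among these steps is exactly what vector streaming reuse tries to keep on chip instead of spilling to HBM between modules.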
  5. As multicore systems continue to grow in scale and on-chip memory capacity, on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol, and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache-line-granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core, and indirect memory accesses could be performed by chaining requests made from within the cache rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines that process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA and subsequent exposure to caches. We evaluate using a cycle-level, execution-driven gem5-based simulator with 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedup over an in-order and an OOO core with a state-of-the-art prefetcher design, respectively, with 64% and 49% energy efficiency advantages.