Abstract—Dense linear algebra kernels are critical for wireless, and the oncoming proliferation of 5G only amplifies their importance. Due to the inductive nature of many such algorithms, parallelism is difficult to exploit: parallel regions have fine-grain producer/consumer interaction with iteratively changing dependence distance, reuse rate, and memory access patterns. This causes high overhead both for multi-threading, due to fine-grain synchronization, and for vectorization, due to non-rectangular iteration domains; CPUs, DSPs, and GPUs perform an order of magnitude below peak. Our insight is that if the nature of inductive dependences and memory accesses were explicit in the hardware/software interface, then a spatial architecture could efficiently execute parallel code regions. To this end, we first extend the traditional dataflow model with first-class primitives for inductive dependences and memory access patterns (streams). Second, we develop a hybrid spatial architecture combining systolic and dataflow execution to attain high utilization at low energy and area cost. Finally, we create a scalable design through a novel vector-stream control model which amortizes control overhead both in time and spatially across architecture lanes. We evaluate our design, REVEL, with a full stack (compiler, ISA, simulator, RTL). Across a suite of linear algebra kernels, REVEL outperforms equally provisioned DSPs by 4.6×–37×. Compared to state-of-the-art spatial architectures, REVEL is a mean of 3.4× faster. Compared to a set of ASICs, REVEL is only 2× the power and half the area.
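To see what makes such kernels hard to vectorize or multi-thread, the sketch below shows plain forward substitution on a lower-triangular system. This is a standard kernel written here purely for illustration, not code from the paper: the inner loop's trip count grows with the outer index (a non-rectangular iteration domain), and every produced value is reused by all later iterations, which is the kind of inductive dependence and stream behavior REVEL makes explicit.

```cpp
// Illustrative sketch: solve L*x = b for lower-triangular L (row-major).
void forward_subst(int n, const float* L, const float* b, float* x) {
    for (int i = 0; i < n; ++i) {
        float acc = b[i];
        for (int j = 0; j < i; ++j)        // trip count = i; grows every iteration
            acc -= L[i * n + j] * x[j];    // consumes all earlier results
        x[i] = acc / L[i * n + i];         // this result feeds every later iteration
    }
}
```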
Stream-based memory access specialization for general purpose processors
Because of severe limitations in technology scaling, architects have innovated in specializing general-purpose processors for computation primitives (e.g., vector instructions, loop accelerators). The general principle is exposing rich semantics to the ISA. An open opportunity is whether richer semantics for memory access patterns could likewise improve the efficiency of memory and communication. Two important open questions are how to convey higher-level memory information and how to take advantage of this information in hardware. We find that a majority of memory accesses follow a small number of simple patterns; we term these streams (e.g., affine, indirect). Streams can often be decoupled from core execution, and their patterns persist long enough to express useful behavior. Our approach is therefore to express streams as ISA primitives, which we argue can enable: prefetched stream accesses to hide memory latency, semi-binding decoupled access to remove address computation and optimize the memory interface, and better-informed cache policies. In this work, we propose ISA extensions for decoupled streams, which interact with the core through a FIFO-based interface. We implement optimizations for each of the aforementioned opportunities on an aggressive wide-issue OOO core and evaluate with SPEC CPU 2017 and CortexSuite [1, 2]. Across all workloads, we observe about a 1.37× speedup and energy-efficiency improvement over hardware stride prefetching.
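As a concrete illustration of the two stream classes named above, the loops below exhibit an affine pattern and an indirect pattern. This is ordinary C++ written for this summary; it does not use the proposed ISA extensions. Under the proposed approach, each pattern would be configured once as a decoupled stream and its elements consumed through the FIFO-based interface rather than recomputed as addresses every iteration.

```cpp
#include <cstddef>
#include <cstdint>

// Affine stream: address = a + 4*i; fully described by (base, stride, length).
float sum_affine(const float* a, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}

// Indirect stream: address = a + 4*idx[i], where idx[i] is itself an affine
// stream; the index stream feeds the data stream.
float sum_indirect(const float* a, const std::uint32_t* idx, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[idx[i]];
    return sum;
}
```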
- Award ID(s): 1823562
- PAR ID: 10120212
- Date Published:
- Journal Name: ISCA
- Page Range / eLocation ID: 736 to 749
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
As multicore systems continue to grow in scale and on-chip memory capacity, the on-chip network bandwidth and latency become problematic bottlenecks. Because of this, overheads in data transfer, the coherence protocol, and replacement policies become increasingly important. Unfortunately, even in well-structured programs, many natural optimizations are difficult to implement because of the reactive and centralized nature of traditional cache hierarchies, where all requests are initiated by the core for short, cache-line-granularity accesses. For example, long-lasting access patterns could be streamed from shared caches without requests from the core. Indirect memory access can be performed by chaining requests made from within the cache, rather than constantly returning to the core. Our primary insight is that if programs can embed information about long-term memory stream behavior in their ISAs, then these streams can be floated to the appropriate level of the memory hierarchy. This decentralized approach to address generation and cache requests can lead to better cache policies and lower request and data traffic by proactively sending data before the cores even request it. To evaluate the opportunities of stream floating, we enhance a tiled multicore cache hierarchy with stream engines to process stream requests in last-level cache banks. We develop several novel optimizations that are facilitated by stream exposure in the ISA, and subsequent exposure to caches. We evaluate using a cycle-level execution-driven gem5-based simulator, using 10 data-processing workloads from Rodinia and 2 streaming kernels written in OpenMP. We find that stream floating enables 52% and 39% speedup over an in-order and an OOO core with state-of-the-art prefetcher designs respectively, with 64% and 49% energy-efficiency advantages.
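A minimal sketch of the kind of access pattern that motivates stream floating (shapes and names here are illustrative, not taken from the paper): the core only needs the final values, yet a conventional hierarchy forces it to issue the index load, wait, then issue the dependent data load. If both streams are expressed in the ISA, a stream engine at a last-level-cache bank could chase the indirection itself and push only the results toward the core.

```cpp
#include <cstddef>
#include <cstdint>

// Indirect gather: idx[] is a long-lived affine stream, val[idx[i]] is an
// indirect stream chained off it. With stream floating, the chained load
// could be issued from within the cache rather than round-tripping to the core.
double gather_sum(const double* val, const std::uint32_t* idx, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        std::uint32_t j = idx[i];   // load 1: sequential index stream
        sum += val[j];              // load 2: indirect load dependent on load 1
    }
    return sum;
}
```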
Spatial dataflow architectures (SDAs) are a promising and versatile accelerator platform. They are software-programmable and achieve near-ASIC performance and energy efficiency, beating CPUs by orders of magnitude. Unfortunately, many SDAs struggle to efficiently implement irregular computations because they suffer from an abstraction inversion: they fail to capture coarse-grain dataflow semantics in the application (namely asynchronous communication, pipelining, and queueing) that are naturally supported by the dataflow execution model and existing SDA hardware. Ripple is a language and architecture that corrects the abstraction inversion by preserving dataflow semantics down the stack. Ripple provides asynchronous iterators, shared-memory atomics, and a familiar task-parallel interface to concisely express the asynchronous pipeline parallelism enabled by an SDA. Ripple efficiently implements deadlock-free, asynchronous task communication by exposing hardware token queues in its ISA. Across nine important workloads, compared to a recent ordered-dataflow SDA, Ripple shrinks programs by 1.9×, improves performance by 3×, increases IPC by 58%, and reduces dynamic instructions by 44%.
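The sketch below is not Ripple's API; it is a plain C++ analogue, written here only to illustrate the coarse-grain dataflow semantics the abstract refers to: two pipeline stages that communicate asynchronously through a queue, with the queue standing in for the hardware token queues Ripple exposes in its ISA.

```cpp
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <optional>
#include <queue>
#include <thread>

// Software stand-in for a hardware token queue between two pipeline stages.
template <typename T>
class TokenQueue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
    bool closed_ = false;
public:
    void push(T v) {
        { std::lock_guard<std::mutex> l(m_); q_.push(std::move(v)); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> l(m_); closed_ = true; }
        cv_.notify_all();
    }
    std::optional<T> pop() {                    // blocks until a token or close
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty() || closed_; });
        if (q_.empty()) return std::nullopt;
        T v = std::move(q_.front());
        q_.pop();
        return v;
    }
};

int main() {
    TokenQueue<int> q;
    std::thread producer([&] {                  // stage 1: emits tokens asynchronously
        for (int i = 0; i < 8; ++i) q.push(i * i);
        q.close();
    });
    std::thread consumer([&] {                  // stage 2: consumes whenever tokens arrive
        while (auto v = q.pop()) std::printf("got %d\n", *v);
    });
    producer.join();
    consumer.join();
}
```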
Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage of DLRM inference, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation, which lead to significant stalls in the CPU pipeline. As model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches, and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory access latency suffered by the demand loads, and (2) overlapping computation and memory accesses via hyperthreading, to reduce CPU stalls and minimize the overall execution time. We evaluate our work on single-core and 24-core configurations with the latest recommendation models and recently released production traces. Our integrated techniques speed up inference by up to 1.59× and by 1.4× on average.
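A minimal sketch of the first technique (software prefetching) applied to an embedding-bag gather. The table layout, prefetch distance, and function name are illustrative assumptions rather than the paper's code, and __builtin_prefetch is the GCC/Clang intrinsic.

```cpp
#include <cstddef>
#include <cstdint>

// Sum-pool n_lookups embedding rows of width `dim` from `table`, prefetching
// a future row while the current one is being accumulated so the irregular,
// off-chip load latency overlaps with compute.
void embedding_bag_sum(const float* table, std::size_t dim,
                       const std::uint32_t* idx, std::size_t n_lookups,
                       float* out) {
    constexpr std::size_t kPrefetchDist = 8;    // tuned per platform (assumed value)
    for (std::size_t d = 0; d < dim; ++d) out[d] = 0.0f;
    for (std::size_t i = 0; i < n_lookups; ++i) {
        if (i + kPrefetchDist < n_lookups)       // prefetch the start of a future row;
            __builtin_prefetch(&table[idx[i + kPrefetchDist] * dim]);  // a real kernel
                                                 // would prefetch each line of the row
        const float* row = &table[idx[i] * dim];
        for (std::size_t d = 0; d < dim; ++d) out[d] += row[d];
    }
}
```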
Level-one data cache (L1 DC) and data translation lookaside buffer (DTLB) accesses impact energy usage as they frequently occur, and each L1 DC and DTLB access uses significantly more energy than a register file access. Often, multiple memory operations will reference the same cache line using the same register, such as when iterating through an array. We propose to memoize L1 DC access information, such as the L1 DC data array way and the DTLB way, by associating this information with the register used to access it. When a load or store calculates the memory address, we detect whether the calculated address shares the cache line memoized with the base register. If so, we avoid the L1 DC tag array access and the DTLB access to determine the L1 DC way and instead use the memoized information. In addition, only a single data array way in a set-associative L1 DC needs to be accessed during a load instruction when the L1 DC way has been memoized. Our nonspeculative memoization approach can be applied before a speculative approach, allowing a significant reduction in data access energy usage for existing executables with no ISA modifications.
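The core test is easy to state in software even though it would be performed in hardware. The sketch below (field names and line size are illustrative assumptions) shows the state memoized alongside a base register and the same-line check that decides whether the tag-array and DTLB lookups can be skipped.

```cpp
#include <cstdint>

constexpr unsigned kLineBits = 6;   // assume 64-byte L1 DC lines

struct RegMemo {                    // info memoized with a base register
    std::uint64_t line;             // memoized address >> kLineBits
    std::uint8_t  dc_way;           // L1 DC data-array way holding that line
    std::uint8_t  dtlb_way;         // DTLB way that translated it
    bool          valid;
};

// If the newly computed address falls in the memoized line, the L1 DC tag
// array access and the DTLB access can be skipped and only dc_way read.
inline bool hits_memoized_line(const RegMemo& m, std::uint64_t addr) {
    return m.valid && (addr >> kLineBits) == m.line;
}
```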