Emerging sensing applications create an unprecedented need for energy efficiency in programmable processors. To achieve useful multi-year deployments on a small battery or energy harvester, these applications must avoid off-device communication and instead process most data locally. Recent work has proven coarse-grained reconfigurable arrays (CGRAs) as a promising architecture for this domain. Unfortunately, nearly all prior CGRAs support only computations with simple control flow and no memory aliasing (e.g., affine inner loops), causing an Amdahl efficiency bottleneck as non-trivial fractions of programs must run on an inefficient von Neumann core.RipTide is a co-designed compiler and CGRA architecture that achieves both high programmability and extreme energy efficiency, eliminating this bottleneck. RipTide provides a rich set of control-flow operators that support arbitrary control flow and memory access on the CGRA fabric. RipTide implements these primitives without tagged tokens to save energy; this requires careful ordering analysis in the compiler to guarantee correctness. RipTide further saves energy and area by offloading most control operations into its programmable on-chip network, where they can re-use existing network switches. RipTide’s compiler is implemented in LLVM, and its hardware is synthesized in Intel 22FFL. RipTide compiles applications written in C while saving 25% energy v. the state-of-the-art energy-minimal CGRA and 6.6 × energy v. a von Neumann core.
more »
« less
Snafu: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture
Ultra-low-power (ULP) devices are becoming pervasive, enabling many emerging sensing applications. Energy-efficiency is paramount in these applications, as efficiency determines device lifetime in battery-powered deployments and performance in energy-harvesting deployments. Unfortunately, existing designs fall short because ASICs’ upfront costs are too high and prior ULP architectures are too inefficient or inflexible.We present Snafu, the first framework to flexibly generate ULP coarse-grain reconfigurable arrays (CGRAs). Snafu provides a standard interface for processing elements (PE), making it easy to integrate new types of PEs for new applications. Unlike prior high-performance, high-power CGRAs, Snafu is designed from the ground up to minimize energy consumption while maximizing flexibility. Snafu saves energy by configuring PEs and routers for a single operation to minimize switching activity; by minimizing buffering within the fabric; by implementing a statically routed, bufferless, multi-hop network; and by executing operations in-order to avoid expensive tag-token matching.We further present Snafu-Arch, a complete ULP system that integrates an instantiation of the Snafu fabric alongside a scalar RISC-V core and memory. We implement Snafu in RTL and evaluate it on an industrial sub-28 nm FinFET process across a suite of common sensing benchmarks. Snafu-Arch operates at <1 mW, orders-of-magnitude less power than most prior CGRAs. Snafu-Arch uses 41% less energy and runs 4.4× faster than the prior state-of-the-art general-purpose ULP architecture. Moreover, we conduct three comprehensive case-studies to quantify the cost of programmability in Snafu. We find that Snafu-Arch is close to ASIC designs built in the same technology, using just 2.6× more energy on average.
more »
« less
- Award ID(s):
- 1815882
- PAR ID:
- 10301525
- Date Published:
- Journal Name:
- 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA)
- Page Range / eLocation ID:
- 1027 to 1040
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The architecture of a coarse-grained reconfigurable array (CGRA) processing element (PE) has a significant effect on the performance and energy-efficiency of an application running on the CGRA. This paper presents APEX, an automated approach for generating specialized PE architectures for an application or an application domain. APEX first analyzes application domain benchmarks using frequent subgraph mining to extract commonly occurring computational subgraphs. APEX then generates specialized PEs by merging subgraphs using a datapath graph merging algorithm. The merged datapath graphs are translated into a PE specification from which we automatically generate the PE hardware description in Verilog along with a compiler that maps applications to the PE. The PE hardware and compiler are inserted into a flexible CGRA generation and compilation toolchain that allows for agile evaluation of CGRAs. We evaluate APEX for two domains, machine learning and image processing. For image processing applications, our automatically generated CGRAs with specialized PEs achieve from 5% to 30% less area and from 22% to 46% less energy compared to a general-purpose CGRA. For machine learning applications, our automatically generated CGRAs consume 16% to 59% less energy and 22% to 39% less area than a general-purpose CGRA. This work paves the way for creation of application domain-driven design-space exploration frameworks that automatically generate efficient programmable accelerators, with a much lower design effort for both hardware and compiler generation.more » « less
-
Computing at the extreme edge allows systems with high-resolution sensors to be pushed well outside the reach of traditional communication and power delivery, requiring high-performance, high-energy-efficiency architectures to run complex ML, DSP, image processing, etc. Recent work has demonstrated the suitability of CGRAs for energy-minimal computation, but has focused strictly on energy optimization, neglecting performance. Pipestitch is an energy-minimal CGRA architecture that adds lightweight hardware threads to ordered dataflow, exploiting abundant, untapped parallelism in the complex workloads needed to meet the demands of emerging sensing applications. Pipestitch introduces a programming model, control-flow operator, and synchronization network to allow lightweight hardware threads to pipeline on the CGRA fabric. Across 5 important sparse workloads, Pipestitch achieves a 3.49 × increase in performance over RipTide, the state-of-the-art, at a cost of a 1.10 × increase in area and a 1.05 × increase in energy.more » « less
-
Sorting is a fundamental function in many applications from data processing to database systems. For high performance, sorting-hardware based sorting designs are implemented by conventional binary or emerging stochastic computing (SC) approaches. Binary designs are fast and energy-efficient but costly to implement. SC-based designs, on the other hand, are area and power-efficient but slow and energy-hungry. So, the previous studies of the hardware-based sorting further faced scalability issues. In this work, we propose a novel scalable low-cost design for implementing sorting networks. We borrow the concept of SC for the area- and power efficiency but use weighted stochastic bit-streams to address the high latency and energy consumption issue of SC designs. A new lock and swap (LAS) unit is proposed to sort weighted bit-streams. The LAS-based sorting network can determine the result of comparing different input values early and then map the inputs to the corresponding outputs based on shorter weighted bit-streams. Experimental results show that the proposed design approach achieves much better hardware scalability than prior work. Especially, as increasing the number of inputs, the proposed scheme can reduce the energy consumption by about 3.8% - 93% compared to prior binary and SC-based designs.more » « less
-
Energy-efficiency is a key concern in mobile sensing applications, such as those for tracking social interactions or physical activities. An attractive approach to saving energy is to shape the workload of the system by artificially introducing delays so that the workload would require less energy to process. However, adding delays to save energy may have a detrimental impact on user experience. To address this problem, we present Gratis, a novel paradigm for incorporating workload shaping energy optimizations in mobile sensing applications in an automated manner. Gratis adopts stream programs as a high-level abstraction whose execution is coordinated using an explicit power management policy. We present an expressive coordination language that can specify a broad range of workload-shaping optimizations. A unique property of the proposed power management policies is that they have predictable performance, which can be estimated at compile time, in a computationally efficient manner, from a small number of measurements. We have developed a simulator that can predict the energy with a average error of 7% and delay with a average error of 15%, even when applications have variable workloads. The simulator is scalable: hours of real-world traces can be simulated in a few seconds. Building on the simulator's accuracy and scalability, we have developed tools for configuring power management policies automatically. We have evaluated Gratis by developing two mobile applications and optimizing their energy consumption. For example, an application that tracks social interactions using speaker-identification techniques can run for only 7 hours without energy optimizations. However, when Gratis employs batching, scheduled concurrency, and adaptive sensing, the battery lifetime can be extended to 45 hours when the end-to-end deadline is one minute. These results demonstrate the efficacy of our approach to reduce energy consumption in mobile sensing applications.more » « less
An official website of the United States government

