Title: STRIVE: Enabling Choke Point Detection and Timing Error Resilience in a Low-Power Tensor Processing Unit
Rapid growth in Deep Neural Network (DNN) workloads has increased the energy footprint of Artificial Intelligence (AI) computing. For optimum energy efficiency, we propose operating DNN hardware in the Low-Power Computing (LPC) region. However, operating at LPC increases delay sensitivity to Process Variation (PV), and delay faults are an intriguing consequence of PV. In this paper, we demonstrate that delay variations substantially lower DNN prediction accuracy. To overcome delay faults, we present STRIVE, a post-fabrication fault detection and reactive error reduction technique. We also introduce a time-borrow correction technique to ensure error-free DNN computation.
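To make the failure mode concrete, here is a minimal, illustrative Python sketch (not the STRIVE implementation) of how rare timing errors in MAC units corrupt high-order bits of a dot product; the fault rate and affected bit position are assumed values chosen only for illustration:

```python
# Illustrative sketch only (not the paper's method): a dot product whose
# accumulator occasionally latches a late-arriving high-order bit. The
# fault_rate and fault_bit values below are assumptions for illustration.
import numpy as np

rng = np.random.default_rng(0)

def mac_with_timing_faults(a, w, fault_rate=1e-3, fault_bit=14):
    """Dot product in which each partial sum may suffer a delay fault."""
    acc = 0
    for x, y in zip(a, w):
        acc += int(x) * int(y)
        if rng.random() < fault_rate:   # a signal misses its timing window
            acc ^= 1 << fault_bit       # a high-order bit latches wrong
    return acc

a = rng.integers(-128, 128, size=256)   # int8-like activations
w = rng.integers(-128, 128, size=256)   # int8-like weights
exact = int(np.dot(a, w))
faulty = mac_with_timing_faults(a, w)
print(f"exact={exact}  faulty={faulty}  error={faulty - exact}")
```

A single flipped high-order bit perturbs the accumulated sum by thousands of counts, which is why even rare delay faults visibly degrade prediction accuracy.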
Award ID(s):
2106237
NSF-PAR ID:
10445571
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings ACM IEEE Design Automation Conference
ISSN:
0738-100X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sparse deep neural networks (DNNs) have the potential to deliver compelling performance and energy efficiency without significant accuracy loss. However, their benefits can quickly diminish if their training is oblivious to the target hardware. For example, even a small number of critical connections can incur significant overhead if they translate into long-distance communication on the target hardware. Therefore, hardware-aware sparse training is needed to leverage the full potential of sparse DNNs. To this end, we propose a novel and comprehensive communication-aware sparse DNN optimization framework for tile-based in-memory computing (IMC) architectures. The proposed technique, CANNON, first maps the DNN layers onto the tiles of the target architecture. Then, it replaces the fully connected and convolutional layers with communication-aware sparse connections. After that, CANNON optimizes the communication cost with minimal impact on the DNN accuracy. Extensive experimental evaluations with a wide range of DNNs and datasets show up to 3.0× lower communication energy, 3.1× lower communication latency, and 6.8× lower energy-delay product compared to state-of-the-art pruning approaches, with negligible impact on classification accuracy on IMC-based machine learning accelerators.
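    As a rough illustration of the communication-aware idea, the following Python sketch scores each weight by magnitude per mesh-hop cost and prunes the worst; the mesh layout, cost model, and keep ratio are assumptions for illustration, not CANNON's actual algorithm:

```python
# A rough sketch under assumed names and cost model (not CANNON itself):
# weights are scored by |w| divided by the mesh-hop distance their activations
# must travel, so pruning removes long-distance connections first.
import numpy as np

rng = np.random.default_rng(1)

def hop_distance(src, dst):
    """Manhattan hop count between two tiles on a 2D mesh."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])

def communication_aware_prune(weights, src_tiles, dst_tiles, keep_ratio=0.3):
    """Keep the weights with the best magnitude-per-hop score."""
    scores = np.empty_like(weights)
    for i in range(weights.shape[0]):          # output neurons
        for j in range(weights.shape[1]):      # input neurons
            hops = 1 + hop_distance(src_tiles[j], dst_tiles[i])
            scores[i, j] = abs(weights[i, j]) / hops
    k = int(weights.size * keep_ratio)
    threshold = np.sort(scores, axis=None)[-k]
    return np.where(scores >= threshold, weights, 0.0)

# 8 input neurons spread over a 4x4 mesh feeding 4 output neurons.
w = rng.normal(size=(4, 8))
src = [(j % 4, j // 4) for j in range(8)]
dst = [(i % 4, 3) for i in range(4)]
sparse_w = communication_aware_prune(w, src, dst)
print(f"kept {np.count_nonzero(sparse_w)} of {w.size} weights")
```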
  2. Deep neural networks (DNNs) are gaining popularity in a wide range of domains, ranging from speech and video recognition to healthcare. With this increased adoption comes the pressing need for securing DNN execution environments on CPUs, GPUs, and ASICs. While there are active research efforts in supporting a trusted execution environment (TEE) on CPUs, the exploration in supporting TEEs on accelerators is limited, with only a few solutions available. A key limitation along this line of work is that these secure DNN accelerators narrowly consider a few specific architectures; the design choices and the associated cost for securing these architectures do not transfer to other diverse architectures. This paper strives to address this limitation by developing a design space exploration tool for supporting TEEs on diverse DNN accelerators. We target secure DNN accelerators equipped with cryptographic engines where the cryptographic operations are closely coupled with the data movement in the accelerators. These operations significantly complicate scheduling for DNN accelerators, as the schedule must account for the extra on-chip computation and off-chip memory accesses introduced by the cryptographic operations, and even for potential interactions across DNN layers. We tackle these challenges in our tool, called SecureLoop, by introducing a scheduling search engine with the following attributes: 1) it considers the cryptographic overhead associated with every off-chip data access, 2) it uses an efficient modular arithmetic technique to compute the optimal authentication block assignment for each individual layer, and 3) it uses a simulated annealing algorithm to perform cross-layer optimizations. Compared to conventional schedulers, our tool finds schedules for secure DNN designs with up to 33.2% speedup and 50.2% improvement in energy-delay product.
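    The cross-layer search can be pictured with a toy simulated-annealing loop over per-layer authentication block sizes; the cost model below (fixed tag size plus over-fetch waste) is a hypothetical stand-in, not SecureLoop's, and only the overall search structure follows the abstract's outline:

```python
# Toy stand-in, not SecureLoop's cost model: simulated annealing over
# per-layer authentication block sizes. Each block carries a fixed-size
# MAC tag; blocks that straddle a tile force redundant off-chip fetches.
import math
import random

random.seed(0)
LAYER_TILE_SIZES = [512, 1024, 768, 2048]    # assumed off-chip tile sizes (words)
TAG_WORDS = 8                                # assumed per-block tag overhead

def cost(block_sizes):
    total = 0.0
    for tile, block in zip(LAYER_TILE_SIZES, block_sizes):
        n_blocks = math.ceil(tile / block)
        total += n_blocks * TAG_WORDS        # authentication tag traffic
        total += n_blocks * block - tile     # data fetched but never used
    return total

state = [256] * len(LAYER_TILE_SIZES)
best, best_cost, temp = list(state), cost(state), 100.0
while temp > 0.1:
    cand = list(state)
    i = random.randrange(len(cand))
    cand[i] = int(max(32, min(2048, cand[i] * random.choice([0.5, 2]))))
    delta = cost(cand) - cost(state)
    if delta < 0 or random.random() < math.exp(-delta / temp):
        state = cand
        if cost(state) < best_cost:
            best, best_cost = list(state), cost(state)
    temp *= 0.95                             # cooling schedule
print(f"block sizes per layer: {best}  cost: {best_cost:.0f} words")
```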
  3. Increasing processing requirements in the Artificial Intelligence (AI) realm have led to the emergence of domain-specific architectures for Deep Neural Network (DNN) applications. The Tensor Processing Unit (TPU), a DNN accelerator by Google, has emerged as a front runner, outclassing its contemporaries, CPUs and GPUs, in performance by 15×–30×. TPUs have been deployed in Google data centers to meet these performance demands. However, a TPU's performance enhancement comes with mammoth power consumption. In pursuit of lower energy utilization, this paper proposes PREDITOR, a low-power TPU operating in the Near-Threshold Computing (NTC) realm. PREDITOR uses mathematical analysis to mitigate undetectable timing errors by boosting the voltage of selective multiplier-and-accumulator units at specific intervals, enhancing the performance of the NTC TPU and thereby ensuring high inference accuracy at low voltage. PREDITOR offers up to 3×–5× improved performance compared to leading-edge error-mitigation schemes, with a minor loss in accuracy.
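    A hedged sketch of the mechanism, not PREDITOR's analysis: model per-MAC timing-error rates that fall exponentially with supply voltage, and periodically boost only the most variation-sensitive units. All constants below are assumed for illustration:

```python
# Assumed constants throughout; this models the mechanism (periodic voltage
# boosts on the most variation-sensitive MACs), not PREDITOR's math.
import numpy as np

rng = np.random.default_rng(2)
N_MACS, STEPS = 64, 1000
V_NOM, V_BOOST = 0.45, 0.60                  # near-threshold vs. boosted volts

def error_prob(v):
    """Assumed exponential drop of timing-error rate with voltage."""
    return 1e-2 * np.exp(-12.0 * (v - V_NOM))

# Per-MAC delay sensitivity from process variation (assumed lognormal).
sensitivity = rng.lognormal(mean=0.0, sigma=0.4, size=N_MACS)
critical = np.argsort(sensitivity)[-8:]      # the 8 slowest "choke" MACs

errors_plain = errors_boosted = 0
for step in range(STEPS):
    v_plain = np.full(N_MACS, V_NOM)
    v_boost = v_plain.copy()
    if step % 4 == 0:                        # boost only at intervals
        v_boost[critical] = V_BOOST
    errors_plain += (rng.random(N_MACS) < error_prob(v_plain) * sensitivity).sum()
    errors_boosted += (rng.random(N_MACS) < error_prob(v_boost) * sensitivity).sum()

print(f"timing errors: {errors_plain} without boost, {errors_boosted} with boost")
```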
  4. Multiply-accumulate (MAC) operations are common in data processing and machine learning but costly in terms of hardware usage. Stochastic Computing (SC) is a promising approach for low-cost hardware design of complex arithmetic operations such as multiplication. Computing with deterministic unary bit-streams (defined as bit-streams with all 1s grouped together at the beginning or end of the stream) has recently been suggested to improve the accuracy of SC. Conventionally, SC designs use multiplexer (MUX) units or OR gates to accumulate data in the stochastic domain. MUX-based addition suffers from scaling of data, and OR-based addition from inaccuracy. This work proposes a novel technique for MAC operation on unary bit-streams that allows exact, non-scaled addition of multiplication results. By introducing a relative delay between the products, we control the correlation between bit-streams and eliminate OR-based addition error. We evaluate the accuracy of the proposed technique compared to state-of-the-art MAC designs. After quantization, the proposed technique demonstrates at least a 37% and up to a 100% decrease in mean absolute error for uniformly distributed random input values compared to traditional OR-based MAC designs. Further, we demonstrate that the proposed technique is practical and evaluate the area, power, and energy of three possible implementations.
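    The delayed-OR accumulation is concrete enough to sketch directly. Assuming the products are already available as unary bit-streams (the multiplier design itself is not reproduced here), delaying each stream by the total number of 1s emitted so far guarantees the streams never overlap, so a plain OR yields the exact, non-scaled sum:

```python
# Minimal sketch of delayed-OR accumulation of unary bit-streams. Stream
# generation below is a stand-in; only the accumulation idea is shown.
L = 64                                       # bit-stream length (assumed)

def unary(ones):
    """Unary bit-stream: `ones` 1s grouped at the beginning of the stream."""
    return [1] * ones + [0] * (L - ones)

def delayed_or_accumulate(products):
    """Exact, non-scaled accumulation via relative delays plus OR."""
    out, delay = [0] * L, 0
    for p in products:
        ones = sum(p)
        for t in range(ones):                # shift this stream right by `delay`
            out[delay + t] |= p[t]
        delay += ones                        # next stream starts after this one
    return out

# Products 9/64, 5/64 and 12/64 accumulate to exactly 26/64 (no scaling),
# provided the total number of 1s fits in the stream length.
acc = delayed_or_accumulate([unary(9), unary(5), unary(12)])
assert sum(acc) == 9 + 5 + 12
print(f"accumulated value: {sum(acc)}/{L}")
```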
  5. Binary switches, which are the primitive units of all digital computing and information-processing hardware, are usually benchmarked on the basis of their ‘energy–delay product’, the product of the energy dissipated in completing the switching action and the time taken to complete it. The lower the energy–delay product, the better the switch (supposedly). This approach ignores the fact that lower energy dissipation and faster switching usually come at the cost of poorer reliability (i.e., a higher switching error rate), and hence the energy–delay product alone is not a good metric for benchmarking switches. Here, we show the trade-off between energy dissipation, energy–delay product and error probability for an electronic switch (a metal-oxide-semiconductor field-effect transistor), a magnetic switch (a magnetic tunnel junction switched with spin-transfer torque) and an optical switch (a bistable non-linear mirror). As expected, reducing the energy dissipation and/or energy–delay product generally increases the switching error probability and reduces reliability.
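    A small worked example of the trade-off, using the standard thermal-noise estimate p ≈ exp(-E/kT) for the error probability of a switch that dissipates energy E; the device physics of the three switches in the paper are not modeled:

```python
# Worked toy numbers: p_error ~ exp(-E/kT) for a switch dissipating energy E,
# with an assumed 1 ns delay. Lower E shrinks the EDP but inflates the error.
import math

KT = 4.14e-21            # k_B * T at 300 K, in joules
DELAY = 1e-9             # assumed switching delay, seconds

for e_over_kt in (10, 20, 30, 40):
    energy = e_over_kt * KT
    edp = energy * DELAY
    p_err = math.exp(-e_over_kt)
    print(f"E = {e_over_kt:>2} kT  EDP = {edp:.2e} J*s  p_error ~ {p_err:.1e}")
```

Halving the dissipated energy halves the energy-delay product but squares the error probability's exponent away from reliability, which is the paper's point: the two figures of merit cannot be judged in isolation.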