Search for: All records

Creators/Authors contains: "Angizi, Shaahin"

  1. In this paper, we propose IMA-GNN as an In-Memory Accelerator for centralized and decentralized Graph Neural Network inference, explore its potential in both settings and provide a guideline for the community targeting flexible and efficient edge computation. Leveraging IMA-GNN, we first model the computation and communication latencies of edge devices. We then present practical case studies on GNN-based taxi demand and supply prediction and also adopt four large graph datasets to quantitatively compare and analyze centralized and decentralized settings. Our cross-layer simulation results demonstrate that on average, IMA-GNN in the centralized setting can obtain ~790x communication speed-up compared to the decentralized GNN setting. However, the decentralized setting performs computation ~1400x faster while reducing the power consumption per device. This further underlines the need for a hybrid semi-decentralized GNN approach.
    Free, publicly-accessible full text available June 5, 2024
  2. Recently, Intelligent IoT (IIoT), including various sensors, has gained significant attention due to its capability of sensing, deciding, and acting by leveraging artificial neural networks (ANN). Nevertheless, to achieve acceptable accuracy and high performance in visual systems, a power-delay-efficient architecture is required. In this paper, we propose an ultra-low-power processing in-sensor architecture, namely SenTer, realizing low-precision ternary multi-layer perceptron networks, which can operate in detection and classification modes. Moreover, SenTer supports two activation functions based on user needs and the desired accuracy-energy trade-off. SenTer is capable of performing all the required computations for the MLP's first layer in the analog domain and then submitting its results to a co-processor. Therefore, SenTer significantly reduces the overhead of analog buffers, data conversion, and transmission power consumption by using only one ADC. Additionally, our simulation results demonstrate acceptable accuracy on various datasets compared to the full precision models.
    Free, publicly-accessible full text available June 5, 2024
  3. In this work, we leverage the uni-polar switching behavior of Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) to develop an efficient digital Computing-in-Memory (CiM) platform named XOR-CiM. XOR-CiM converts typical MRAM sub-arrays into massively parallel computational cores with ultra-high bandwidth, greatly reducing energy consumption when processing convolutional layers and accelerating inference for X(N)OR-intensive Binary Neural Networks (BNNs). With inference accuracy similar to digital CiMs, XOR-CiM achieves ∼4.5× higher energy efficiency and a 1.8× speed-up compared to recent MRAM-based CiM platforms.
    Free, publicly-accessible full text available April 5, 2024
  4. In this work, we propose a Parallel Processing-In-DRAM architecture named P-PIM leveraging the high density of DRAM to enable fast and flexible computation. P-PIM enables bulk bit-wise in-DRAM logic between operands in the same bit-line by elevating the analog operation of the memory sub-array based on a novel dual-row activation mechanism. With this, P-PIM can opportunistically perform a complete and inexpensive in-DRAM RowHammer (RH) self-tracking and mitigation technique to protect the memory unit against such a challenging security vulnerability. Our results show that P-PIM achieves ~72% higher energy efficiency than the fastest charge-sharing-based designs. As for the RH protection, with a worst-case slowdown of ~0.8%, P-PIM achieves up to 71% energy savings over the SRAM/CAM-based frameworks and about 90% savings over DRAM-based frameworks.
    Free, publicly-accessible full text available April 1, 2024
  5. Free, publicly-accessible full text available March 1, 2024
  6. Free, publicly-accessible full text available March 1, 2024
  7. In this work, a high-speed and energy-efficient comparator-based Near-Sensor Local Binary Pattern accelerator architecture (NS-LBP) is proposed to execute a novel local binary pattern deep neural network. First, inspired by recent LBP networks, we design an approximate, hardware-oriented, and multiply-accumulate (MAC)-free network named Ap-LBP for efficient feature extraction, further reducing the computation complexity. Then, we develop NS-LBP as a processing-in-SRAM unit and a parallel in-memory LBP algorithm to process images near the sensor in a cache, remarkably reducing the power consumption of data transmission to an off-chip processor. Our circuit-to-application co-simulation results on MNIST and SVHN datasets demonstrate minor accuracy degradation compared to baseline CNN and LBP-network models, while NS-LBP achieves 1.25 GHz and an energy efficiency of 37.4 TOPS/W. NS-LBP reduces energy consumption by 2.2× and execution time by 4× compared to the best recent LBP-based networks.
    Free, publicly-accessible full text available January 1, 2024
  8. Convolutional Neural Networks (CNNs), due to their recent successes, have gained considerable attention in various vision-based applications. They have proven to produce incredible results, especially on big data, which requires high processing capability. However, CNN processing demands have limited their usage in embedded edge devices with constrained energy budgets and hardware. This paper proposes an efficient new architecture, namely Ocelli, which includes a ternary compute pixel (TCP) consisting of a CMOS-based pixel and a compute add-on. The proposed Ocelli architecture offers several features: (I) Because of the compute add-on, TCPs can produce ternary values (i.e., −1, 0, +1) regarding the light intensity as pixels’ inputs; (II) Ocelli realizes analog convolutions enabling low-precision ternary weight neural networks. Since the first layer’s convolution operations are the performance bottleneck of accelerators, Ocelli mitigates the overhead of analog buffers and analog-to-digital converters. Moreover, our design supports a zero-skipping scheme to further reduce power; (III) Ocelli exploits non-volatile magnetic RAMs to store the CNN’s weights, which remarkably reduces the static power consumption; and finally, (IV) Ocelli has two modes, sensing and processing. Once the object is detected, the architecture switches to the typical sensing mode to capture the image. Compared to the conventional pixels, it achieves an average 10% efficiency on its lane detection power consumption compared with existing edge detection algorithms. Moreover, considering different CNN workloads, our design shows more than 23% power efficiency over conventional designs, while it can achieve better accuracy.
    Free, publicly-accessible full text available December 1, 2023
  9. Free, publicly-accessible full text available December 1, 2023
  10. In the Artificial Intelligence of Things (AIoT) era, always-on intelligent and self-powered visual perception systems have gained considerable attention and are widely used. Thus, this paper proposes TizBin, a low-power processing in-sensor scheme with event and object detection capabilities to eliminate power costs of data conversion and transmission and enable data-intensive neural network tasks. Once the moving object is detected, TizBin architecture switches to the high-power object detection mode to capture the image. TizBin offers several unique features, such as analog convolutions enabling low-precision ternary weight neural networks (TWNN) to mitigate the overhead of analog buffer and analog-to-digital converters. Moreover, TizBin exploits non-volatile magnetic RAMs to store NN’s weights, remarkably reducing static power consumption. Our circuit-to-application co-simulation results for TWNNs demonstrate minor accuracy degradation on various image datasets, while TizBin achieves a frame rate of 1000 and efficiency of ∼1.83 TOp/s/W.
    Free, publicly-accessible full text available October 1, 2023
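
Entry 3 above refers to X(N)OR-intensive Binary Neural Network inference. As a rough illustration of why XNOR dominates that workload, the sketch below uses the standard identity that a dot product of two vectors over {-1, +1} equals 2 × popcount(XNOR) − n when +1 is encoded as bit 1 and −1 as bit 0. This is a generic software sketch, not the XOR-CiM design; the function names `encode` and `binary_dot_xnor` are hypothetical and chosen only for this example.

```python
def encode(values):
    """Pack a list of +1/-1 values into an int (bit i = 1 means values[i] == +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits


def binary_dot_xnor(a_bits, b_bits, n):
    """Dot product of two length-n {-1,+1} vectors using XNOR and popcount."""
    mask = (1 << n) - 1                  # keep only the n valid bit positions
    xnor = ~(a_bits ^ b_bits) & mask     # bit is 1 where the two signs agree
    matches = bin(xnor).count("1")       # popcount of agreeing positions
    return 2 * matches - n               # agreements add +1, disagreements add -1


a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
reference = sum(x * y for x, y in zip(a, b))   # ordinary multiply-accumulate
result = binary_dot_xnor(encode(a), encode(b), len(a))
```

Here `result` matches `reference`, so the multiply-accumulate is replaced entirely by one bitwise XNOR and a bit count over packed words.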