skip to main content


Title: A Near-Sensor Processing Accelerator for Approximate Local Binary Pattern Networks
In this work, a high-speed and energy-efficient comparator-based N ear- S ensor L ocal B inary P attern accelerator architecture (NS-LBP) is proposed to execute a novel local binary pattern deep neural network. First, inspired by recent LBP networks, we design an approximate, hardware-oriented, and multiply-accumulate (MAC)-free network named Ap-LBP for efficient feature extraction, further reducing the computation complexity. Then, we develop NS-LBP as a processing-in-SRAM unit and a parallel in-memory LBP algorithm to process images near the sensor in a cache, remarkably reducing the power consumption of data transmission to an off-chip processor. Our circuit-to-application co-simulation results on MNIST and SVHN datasets demonstrate minor accuracy degradation compared to baseline CNN and LBP-network models, while NS-LBP achieves 1.25 GHz and an energy-efficiency of 37.4 TOPS/W. NS-LBP reduces energy consumption by 2.2× and execution time by a factor of 4× compared to the best recent LBP-based networks.  more » « less
Award ID(s):
2216772 2228028 2216773
NSF-PAR ID:
10427746
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
IEEE Transactions on Emerging Topics in Computing
ISSN:
2376-4562
Page Range / eLocation ID:
1 to 11
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Efficient machine learning implementations optimized for inference in hardware have wide-ranging benefits, depending on the application, from lower inference latency to higher data throughput and reduced energy consumption. Two popular techniques for reducing computation in neural networks are pruning, removing insignificant synapses, and quantization, reducing the precision of the calculations. In this work, we explore the interplay between pruning and quantization during the training of neural networks for ultra low latency applications targeting high energy physics use cases. Techniques developed for this study have potential applications across many other domains. We study various configurations of pruning during quantization-aware training, which we term quantization-aware pruning , and the effect of techniques like regularization, batch normalization, and different pruning schemes on performance, computational complexity, and information content metrics. We find that quantization-aware pruning yields more computationally efficient models than either pruning or quantization alone for our task. Further, quantization-aware pruning typically performs similar to or better in terms of computational efficiency compared to other neural architecture search techniques like Bayesian optimization. Surprisingly, while networks with different training configurations can have similar performance for the benchmark application, the information content in the network can vary significantly, affecting its generalizability. 
    more » « less
  2. In this work, we leverage the uni-polar switching behavior of Spin-Orbit Torque Magnetic Random Access Memory (SOT-MRAM) to develop an efficient digital Computing-in-Memory (CiM) platform named XOR-CiM. XOR-CiM converts typical MRAM sub-arrays to massively parallel computational cores with ultra-high bandwidth, greatly reducing energy consumption dealing with convolutional layers and accelerating X(N)OR-intensive Binary Neural Networks (BNNs) inference. With a similar inference accuracy to digital CiMs, XOR-CiM achieves ∼4.5× and 1.8× higher energy-efficiency and speed-up compared to the recent MRAM-based CiM platforms. 
    more » « less
  3. null (Ed.)
    With the recent advances in both machine learning and embedded systems research, the demand to deploy computational models for real-time execution on edge devices has increased substantially. Without deploying computational models on edge devices, the frequent transmission of sensor data to the cloud results in rapid battery draining due to the energy consumption of wireless data transmission. This rapid power dissipation leads to a considerable reduction in the battery lifetime of the system, therefore jeopardizing the real-world utility of smart devices. It is well-established that for difficult machine learning tasks, models with higher performance often require more computation power and thus are not power-efficient choices for deployment on edge devices. However, the trade-offs between performance and power consumption are not well studied. While numerous methods (e.g., model compression) have been developed to obtain an optimal model, these methods focus on improving the efficiency of a single model. In an entirely new direction, we introduce an effective method to find a combination of multiple models that are optimal in terms of power-efficiency and performance by solving an optimization problem in which both performance and power consumption are taken into account. Experimental results demonstrate that on the ImageNet dataset, we can achieve a 20% energy reduction with only 0.3% accuracy drop compared to Squeeze-and-Excitation Networks. Compared to a pruned convolutional neural network for human activity recognition, while consuming 1.7% less energy, our proposed policy achieves 1.3% higher accuracy. 
    more » « less
  4. null (Ed.)
    Neuromorphic Computing has become tremendously popular due to its ability to solve certain classes of learning tasks better than traditional von-Neumann computers. Data-intensive classification and pattern recognition problems have been of special interest to Neuromorphic Engineers, as these problems present complex use-cases for Deep Neural Networks (DNNs) which are motivated from the architecture of the human brain, and employ densely connected neurons and synapses organized in a hierarchical manner. However, as these systems become larger in order to handle an increasing amount of data and higher dimensionality of features, the designs often become connectivity constrained. To solve this, the computation is divided into multiple cores/islands, called processing engines (PEs). Today, the communication among these PEs are carried out through a power-hungry network-on-chip (NoC), and hence the optimal distribution of these islands along with energy-efficient compute and communication strategies become extremely important in reducing the overall energy of the neuromorphic computer, which is currently orders of magnitude higher than the biological human brain. In this paper, we extensively analyze the choice of the size of the islands based on mixed-signal neurons/synapses for 3-8 bit-resolution within allowable ranges for system-level classification error, determined by the analog non-idealities (noise and mismatch) in the neurons, and propose strategies involving local and global communication for reduction of the system-level energy consumption. AC-coupled mixed-signal neurons are shown to have 10X lower non-idealities than DC-coupled ones, while the choice of number of islands are shown to be a function of the network, constrained by the analog to digital conversion (or viceversa) power at the interface of the islands. The maximum number of layers in an island is analyzed and a global bus-based sparse connectivity is proposed, which consumes orders of magnitude lower power than the competing powerline communication techniques. 
    more » « less
  5. null (Ed.)
    The concept of Industry 4.0 introduces the unification of industrial Internet-of-Things (IoT), cyber physical systems, and data-driven business modeling to improve production efficiency of the factories. To ensure high production efficiency, Industry 4.0 requires industrial IoT to be adaptable, scalable, real-time, and reliable. Recent successful industrial wireless standards such as WirelessHART appeared as a feasible approach for such industrial IoT. For reliable and real-time communication in highly unreliable environments, they adopt a high degree of redundancy. While a high degree of redundancy is crucial to real-time control, it causes a huge waste of energy, bandwidth, and time under a centralized approach and are therefore less suitable for scalability and handling network dynamics. To address these challenges, we propose DistributedHART—a distributed real-time scheduling system for WirelessHART networks. The essence of our approach is to adopt local (node-level) scheduling through a time window allocation among the nodes that allows each node to schedule its transmissions using a real-time scheduling policy locally and online. DistributedHART obviates the need of creating and disseminating a central global schedule in our approach, thereby significantly reducing resource usage and enhancing the scalability. To our knowledge, it is the first distributed real-time multi-channel scheduler for WirelessHART. We have implemented DistributedHART and experimented on a 130-node testbed. Our testbed experiments as well as simulations show at least 85% less energy consumption in DistributedHART compared to existing centralized approach while ensuring similar schedulability. 
    more » « less