skip to main content

Attention:

The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 5:00 PM ET until 11:00 PM ET on Friday, June 21 due to maintenance. We apologize for the inconvenience.


Title: VLSI hardware architecture for Gaussian process
Gaussian process (GP) is a popular machine learning technique that is widely used in many application domains, especially in robotics. However, GP is very computation intensive and time consuming during the inference phase, thereby bringing severe challenges for its large-scale deployment in real-time applications. In this paper, we propose two efficient hardware architecture for GP accelerator. One architecture targets for general GP inference, and the other architecture is specifically optimized for the scenario when the data point is gradually observed. Evaluation results show that the proposed hardware accelerator provides significant hardware performance improvement than the general-purpose computing platform.  more » « less
Award ID(s):
1932370
NSF-PAR ID:
10212918
Author(s) / Creator(s):
Date Published:
Journal Name:
Asilomar Conference on Signals, Systems, and Computers
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper we present three hardware architectures designed to accelerate the inference operation of a neuro-inspired sparse coding algorithm. The memory and communication requirement of the three architectures are compared, and we show that one architecture outperforms the other two in scalability. A hardware system consists of an accelerator and a general purpose processor is proposed for the inference and learning operation. Two optimizations are proposed to further improve the overall performance by skipping unnecessary computations and autonomously learning the feature set. 
    more » « less
  2. In this paper, we explore the prospect of accelerating tree-based genetic programming (TGP) by way of modern field-programmable gate array (FPGA) devices, which is motivated by the fact that FPGAs can sometimes leverage larger amounts of data/function parallelism, as well as better energy efficiency, when compared to general-purpose CPU/GPU systems. In our preliminary study, we introduce a fixed-depth, tree-based architecture capable of evaluating type-consistent primitives that can be fully unrolled and pipelined. The current primitive constraints preclude arbitrary control structures, but they allow for entire programs to be evaluated every clock cycle. Using a variety of floating-point primitives and random programs, we compare to the recent TensorGP tool executing on a modern 8 nm GPU, and we show that our accelerator implemented on a 14 nm FPGA achieves an average speedup of 43×. When compared to the popular baseline tool DEAP executing across all cores of a 2-socket, 28-core (56-thread), 14 nm CPU server, our accelerator achieves an average speedup of 4,902×. Finally, when compared to the recent state-of-the-art tool Operon executing on the same 2-processor CPU system, our accelerator executes about 2.4× slower on average. Despite not achieving an average speedup over every tool tested, our single-FPGA accelerator is the fastest in several instances, and we describe five future extensions that could allow for a 32–144× speedup over our current design as well as allow for larger program depths/sizes. Overall, we estimate that a future version of our accelerator will constitute a state-of-the-art GP system for many applications. 
    more » « less
  3. In view of the performance limitations of fully-decoupled designs for neural architectures and accelerators, hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network performance. Nonetheless, such co-design also enlarges the total search space to practically infinity and presents substantial challenges. While the prior studies have been focusing on improving the search efficiency (e.g., via reinforcement learning), they commonly rely on co-searches over the entire architecture-accelerator design space. In this paper, we propose a semi-decoupled approach to reduce the size of the total design space by orders of magnitude, yet without losing optimality. We first perform neural architecture search to obtain a small set of optimal architectures for one accelerator candidate. Importantly, this is also the set of (close-to-)optimal architectures for other accelerator designs based on the property that neural architectures' ranking orders in terms of inference latency and energy consumption on different accelerator designs are highly similar. Then, instead of considering all the possible architectures, we optimize the accelerator design only in combination with this small set of architectures, thus significantly reducing the total search cost. We validate our approach by conducting experiments on various architecture spaces for accelerator designs with different dataflows. Our results highlight that we can obtain the optimal design by only navigating over the reduced search space. 
    more » « less
  4. Genetic programming (GP) is a general, broadly effective procedure by which computable solutions are constructed from high-level objectives. As with other machine-learning endeavors, one continual trend for GP is to exploit ever-larger amounts of parallelism. In this paper, we explore the possibility of accelerating GP by way of modern field-programmable gate arrays (FPGAs), which is motivated by the fact that FPGAs can sometimes leverage larger amounts of both function and data parallelism—common characteristics of GP— when compared to CPUs and GPUs. As a first step towards more general acceleration, we present a preliminary accelerator for the evaluation phase of "tree-based GP"—the original, and still popular, flavor of GP—for which the FPGA dynamically compiles programs of varying shapes and sizes onto a reconfigurable function tree pipeline. Overall, when compared to a recent open-source GPU solution implemented on a modern 8nm process node, our accelerator implemented on an older 20nm FPGA achieves an average speedup of 9.7×. Although our accelerator is 7.9× slower than most examples of a state-of-the-art CPU solution implemented on a recent 7nm process node, we describe future extensions that can make FPGA acceleration provide attractive Pareto-optimal tradeoffs. 
    more » « less
  5. In this article, we present a low-energy inference method for convolutional neural networks in image classification applications. The lower energy consumption is achieved by using a highly pruned (lower-energy) network if the resulting network can provide a correct output. More specifically, the proposed inference method makes use of two pruned neural networks (NNs), namely mildly and aggressively pruned networks, which are both designed offline. In the system, a third NN makes use of the input data for the online selection of the appropriate pruned network. The third network, for its feature extraction, employs the same convolutional layers as those of the aggressively pruned NN, thereby reducing the overhead of the online management. There is some accuracy loss induced by the proposed method where, for a given level of accuracy, the energy gain of the proposed method is considerably larger than the case of employing any one pruning level. The proposed method is independent of both the pruning method and the network architecture. The efficacy of the proposed inference method is assessed on Eyeriss hardware accelerator platform for some of the state-of-the-art NN architectures. Our studies show that this method may provide, on average, 70% energy reduction compared to the original NN at the cost of about 3% accuracy loss on the CIFAR-10 dataset. 
    more » « less