In this paper, we present three hardware architectures designed to accelerate the inference operation of a neuro-inspired sparse coding algorithm. The memory and communication requirements of the three architectures are compared, and we show that one architecture outperforms the other two in scalability. A hardware system consisting of an accelerator and a general-purpose processor is proposed for the inference and learning operations. Two optimizations are proposed to further improve overall performance by skipping unnecessary computations and autonomously learning the feature set.
                            VLSI hardware architecture for Gaussian process
                        
                    
    
            Gaussian process (GP) is a popular machine learning technique that is widely used in many application domains, especially in robotics. However, GP is computation-intensive and time-consuming during the inference phase, which poses severe challenges for its large-scale deployment in real-time applications. In this paper, we propose two efficient hardware architectures for GP acceleration. One architecture targets general GP inference, and the other is specifically optimized for the scenario in which data points are observed gradually. Evaluation results show that the proposed hardware accelerators provide a significant performance improvement over a general-purpose computing platform.
        
    
- Award ID(s): 1932370
- PAR ID: 10212918
- Date Published:
- Journal Name: Asilomar Conference on Signals, Systems, and Computers
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Genetic programming (GP) is a general, broadly effective procedure by which computable solutions are constructed from high-level objectives. As with other machine-learning endeavors, one continual trend for GP is to exploit ever-larger amounts of parallelism. In this paper, we explore the possibility of accelerating GP by way of modern field-programmable gate arrays (FPGAs), which is motivated by the fact that FPGAs can sometimes leverage larger amounts of both function and data parallelism (common characteristics of GP) when compared to CPUs and GPUs. As a first step towards more general acceleration, we present a preliminary accelerator for the evaluation phase of "tree-based GP" (the original, and still popular, flavor of GP), for which the FPGA dynamically compiles programs of varying shapes and sizes onto a reconfigurable function tree pipeline. Overall, when compared to a recent open-source GPU solution implemented on a modern 8nm process node, our accelerator implemented on an older 20nm FPGA achieves an average speedup of 9.7×. Although our accelerator is 7.9× slower than most examples of a state-of-the-art CPU solution implemented on a recent 7nm process node, we describe future extensions that can make FPGA acceleration provide attractive Pareto-optimal tradeoffs.
- In view of the performance limitations of fully-decoupled designs for neural architectures and accelerators, hardware-software co-design has been emerging to fully reap the benefits of flexible design spaces and optimize neural network performance. Nonetheless, such co-design also enlarges the total search space to a practically infinite size and presents substantial challenges. While prior studies have focused on improving the search efficiency (e.g., via reinforcement learning), they commonly rely on co-searches over the entire architecture-accelerator design space. In this paper, we propose a semi-decoupled approach to reduce the size of the total design space by orders of magnitude, yet without losing optimality. We first perform neural architecture search to obtain a small set of optimal architectures for one accelerator candidate. Importantly, this is also the set of (close-to-)optimal architectures for other accelerator designs, based on the property that neural architectures' ranking orders in terms of inference latency and energy consumption on different accelerator designs are highly similar. Then, instead of considering all the possible architectures, we optimize the accelerator design only in combination with this small set of architectures, thus significantly reducing the total search cost. We validate our approach by conducting experiments on various architecture spaces for accelerator designs with different dataflows. Our results highlight that we can obtain the optimal design by only navigating over the reduced search space.
- Latent Gaussian process (GP) models are widely used in neuroscience to uncover hidden state evolutions from sequential observations, mainly in neural activity recordings. While latent GP models provide a principled and powerful solution in theory, the intractable posterior in non-conjugate settings necessitates approximate inference schemes, which may lack scalability. In this work, we propose cvHM, a general inference framework for latent GP models leveraging Hida-Matérn kernels and conjugate computation variational inference (CVI). With cvHM, we are able to perform variational inference of latent neural trajectories with linear time complexity for arbitrary likelihoods. The reparameterization of stationary kernels using Hida-Matérn GPs helps us connect the latent variable models that encode prior assumptions through dynamical systems to those that encode trajectory assumptions through GPs. In contrast to previous work, we use bidirectional information filtering, leading to a more concise implementation. Furthermore, we employ the Whittle approximate likelihood to achieve highly efficient hyperparameter learning.
- This paper proposes a real-size, single-shot, high-speed, and energy-efficient tensorized optical multimodal fusion network (TOMFuN) on an electro-photonic large-scale III–V-on-Si in-memory compute engine. The TOMFuN architecture leverages a memory-efficient and low-complexity self-attention for the embedding network for the text information, and tensor-train and CANDECOMP/PARAFAC decompositions for compressing the model parameters in the large-scale fully connected layers. Compared to full-size counterparts, our proposed network maintains a comparable inference accuracy in multimodal sentiment analysis tasks while requiring 92.8× fewer model parameters and 51.3× fewer hardware resources. Furthermore, the impact of photonic device imperfections on the TOMFuN architecture is investigated. The simulation results show that noise-aware on-chip training exhibits superior robustness. Finally, chip performance analysis shows that our TOMFuN inference accelerator has 230.73 PetaOps computational speed, 6.51 TOPS/W power efficiency, and 2.7 µs latency with the input dimensions of 1024.
 An official website of the United States government