Unlike traditional, tedious CPU-GPU-based training algorithms that use gradient descent, the software-FPGA co-designed learning algorithm quickly solves a system of linear equations to directly calculate optimal values of the hyperparameters of the green granular neural network (GGNN). To reduce both CO2 emissions and energy consumption effectively, a novel green granular convolutional neural network (GGCNN) is developed using a new classifier that employs GGNNs as building blocks together with the new fast software-FPGA co-designed learning. Initial simulation results indicate that the FPGA equation solver code runs faster than the Python equation solver code, so implementing the GGCNN with software-FPGA co-designed learning is feasible. In the future, the GGCNN will be evaluated against a convolutional neural network trained with traditional software-CPU-GPU-based learning in terms of speed, model size, accuracy, CO2 emissions, and energy consumption on popular datasets. New algorithms will be created to divide the inputs into different input groups for building different GGNNs, mitigating the curse of dimensionality.
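The abstract does not give the exact formulation of the linear system, so the NumPy sketch below only illustrates the general idea of obtaining network weights with one direct solve rather than iterative gradient descent; the random hidden mapping and the ridge term are assumptions, not the GGCNN's actual design. The single call to np.linalg.solve is the step a Python or FPGA equation solver would perform.

```python
# Minimal sketch (not the paper's algorithm): computing the output-layer weights
# of a small network in closed form by solving a regularized linear system,
# instead of iterating gradient descent. The random hidden mapping W_in and the
# ridge term lam are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))                        # 200 samples, 8 input features
Y = (X[:, :1] + 0.5 * X[:, 1:2] > 0).astype(float)   # toy binary targets, shape (200, 1)

W_in = rng.normal(size=(8, 32))                      # fixed random hidden weights (assumption)
H = np.tanh(X @ W_in)                                # hidden activations, shape (200, 32)

lam = 1e-3                                           # small ridge term for numerical stability
A = H.T @ H + lam * np.eye(H.shape[1])               # normal equations: (H^T H + lam*I) W = H^T Y
b = H.T @ Y
W_out = np.linalg.solve(A, b)                        # one direct solve -- the step an FPGA solver can accelerate

accuracy = ((H @ W_out > 0.5).astype(float) == Y).mean()
print("training accuracy:", accuracy)
```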
A Green Granular Neural Network with Efficient Software-FPGA Co-designed Learning
A novel green granular neural network (GGNN) with new fast software-FPGA co-designed learning is developed to reduce both CO2 emissions and energy consumption more effectively than popular neural networks trained with traditional software-CPU-GPU-based learning. Unlike traditional, tedious CPU-GPU-based training algorithms that use gradient descent and other methods such as genetic algorithms, the software-FPGA co-designed training algorithm can quickly solve a system of linear equations to directly calculate optimal values of the hyperparameters of the GGNN. Initial simulation results indicate that the FPGA equation solver code ran faster than the Python equation solver code, so implementing the GGNN with software-FPGA co-designed learning is feasible. In addition, the shallow high-speed GGNN is explainable because it can generate interpretable granular If-Then rules. In the future, the GGNN will be evaluated against other machine learning models with traditional software-based learning in terms of speed, model size, accuracy, CO2 emissions, and energy consumption on popular datasets. New algorithms will be created to divide the inputs into different input groups that will be used to build different small-size GGNNs, mitigating the curse of dimensionality. Additionally, an explainable green granular convolutional neural network will be developed using GGNNs as basic building blocks to efficiently solve image recognition problems.
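The abstract does not specify the granule or rule format, so the following sketch only illustrates, under assumed interval-style granules, how If-Then rules over feature intervals can yield an interpretable classification.

```python
# Minimal sketch (assumed rule format, not the GGNN's actual rules): interval-style
# granular If-Then rules. A rule fires when every feature falls inside its interval.
from typing import List, Tuple

Rule = Tuple[List[Tuple[float, float]], str]  # ([(lo, hi) per feature], class label)

rules: List[Rule] = [
    ([(0.0, 0.5), (0.0, 0.5)], "class_A"),   # IF x1 in [0, 0.5] AND x2 in [0, 0.5] THEN class_A
    ([(0.5, 1.0), (0.0, 1.0)], "class_B"),   # IF x1 in [0.5, 1]  AND x2 in [0, 1]   THEN class_B
]

def classify(x: List[float], rules: List[Rule], default: str = "unknown") -> str:
    """Return the label of the first rule whose granules contain x."""
    for intervals, label in rules:
        if all(lo <= xi <= hi for xi, (lo, hi) in zip(x, intervals)):
            return label
    return default

print(classify([0.2, 0.3], rules))  # -> class_A
print(classify([0.8, 0.9], rules))  # -> class_B
```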
- Award ID(s):
- 2234227
- PAR ID:
- 10454162
- Date Published:
- Journal Name:
- IEEE 22nd International Conference on Cognitive Informatics and Cognitive Computing (ICCI*CC'2023)
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Multi-Agent Reinforcement Learning (MARL) is a key technology in artificial intelligence applications such as robotics, surveillance, and energy systems. Multi-Agent Deep Deterministic Policy Gradient (MADDPG) is a state-of-the-art MARL algorithm that has been widely adopted and is considered a popular baseline for novel MARL algorithms. However, existing implementations of MADDPG on CPU and CPU-GPU platforms do not exploit fine-grained parallelism between cooperative agents and handle inter-agent communication sequentially, leading to sub-optimal throughput in MADDPG training. In this work, we develop the first high-throughput MADDPG accelerator on a CPU-FPGA heterogeneous platform. Specifically, we develop dedicated hardware modules that enable parallel training of each agent's internal Deep Neural Networks (DNNs) and support low-latency inter-agent communication using an on-chip agent interconnection network. Our experimental results show that the speed of agent neural network training improves by a factor of 3.6×-24.3× and 1.5×-29.5× compared with state-of-the-art CPU and CPU-GPU implementations. Our design achieves up to a 1.99× and 1.93× improvement in overall system throughput compared with CPU and CPU-GPU implementations, respectively.
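The accelerator itself is hardware, but the fine-grained parallelism it exploits (each agent's networks can be updated independently) can be sketched in plain Python; the toy update step and thread pool below are illustrative assumptions, not the paper's design.

```python
# Minimal sketch (illustrative only): per-agent updates in multi-agent training
# are independent of one another, so they can run concurrently -- the kind of
# fine-grained parallelism the CPU-FPGA accelerator exploits in hardware.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
NUM_AGENTS, OBS_DIM, ACT_DIM = 4, 16, 4
agents = [rng.normal(size=(OBS_DIM, ACT_DIM)) for _ in range(NUM_AGENTS)]  # toy per-agent policies

def update_agent(weights: np.ndarray, batch: np.ndarray) -> np.ndarray:
    """Toy gradient-like step standing in for one agent's DNN update."""
    grad = batch.T @ (batch @ weights)          # placeholder "gradient"
    return weights - 1e-3 * grad

batch = rng.normal(size=(32, OBS_DIM))          # shared batch of observations
with ThreadPoolExecutor(max_workers=NUM_AGENTS) as pool:
    agents = list(pool.map(update_agent, agents, [batch] * NUM_AGENTS))

print("updated", len(agents), "agents in parallel")
```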
-
Molecular dynamics (MD) models require comprehensive computational power to simulate nanoscale phenomena. Traditionally, central processing unit (CPU) clusters have been the standard method of performing these numerically intensive computations. This article investigates the use of graphical processing units (GPUs) to implement large-scale MD models for exploring nanofluidic-substrate interactions. MD models of water nanodroplets over a flat silicon substrate are tracked until the simulation attains steady-state computational performance. Different classes of GPU units from NVIDIA (C2050, K20, and K40) are evaluated for energy-efficiency performance with respect to three green computing measures: simulation completion time, power consumption, and CO2 emissions. The CPU+K40 configuration displayed the lowest energy consumption profile across all the measures. This research demonstrates the use of energy-efficient graphical computing versus traditional CPU computing for high-performance molecular dynamics simulations.
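The three green-computing measures are related by simple arithmetic; the sketch below uses made-up wattages and an assumed grid emission factor, not values from the study.

```python
# Minimal sketch (placeholder numbers, not the study's data): relating the three
# green-computing measures -- completion time, energy consumption, CO2 emissions.
runs = {
    # configuration: (completion time in hours, average power draw in watts) -- assumed values
    "CPU-only": (20.0, 300.0),
    "CPU+K40":  (6.0, 450.0),
}

EMISSION_FACTOR_KG_PER_KWH = 0.4   # assumed grid carbon intensity, kg CO2 per kWh

for name, (hours, watts) in runs.items():
    energy_kwh = watts * hours / 1000.0            # energy = power x time
    co2_kg = energy_kwh * EMISSION_FACTOR_KG_PER_KWH
    print(f"{name}: {hours:.1f} h, {energy_kwh:.2f} kWh, {co2_kg:.2f} kg CO2")
```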
-
Operational ocean forecasting systems (OOFSs) are complex engines that must execute ocean models with high performance to provide timely products and datasets. Significant computational resources are therefore needed to run high-fidelity models, and, historically, the technological evolution of microprocessors has constrained data-parallel scientific computation. Today, graphics processing units (GPUs) offer a rapidly growing and valuable source of computing power rivaling traditional CPU-based machines: the exploitation of thousands of threads can significantly accelerate the execution of many models, ranging from traditional HPC workloads of finite-difference, finite-volume, and finite-element modelling through to the training of deep neural networks used in machine learning (ML) and artificial intelligence. Despite these advantages, GPU usage in ocean forecasting is still limited due to the legacy of CPU-based model implementations and the intrinsic complexity of porting core models to GPU architectures. This review explores the potential use of GPUs in ocean forecasting and how the computational characteristics of ocean models can influence the suitability of GPU architectures for executing the overall value chain. It discusses current approaches to code (and performance) portability from CPU to GPU, including tools that perform code transformation to ease the adaptation of Fortran code for GPU execution (such as PSyclone), the direct use of OpenACC directives (as in ICON-O), the adoption of frameworks that manage parallel execution across different architectures, and the use of new programming languages and paradigms.
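The kernels being ported are typically whole-array, data-parallel operations; the short NumPy stencil below gives the flavor, and an array library with a NumPy-compatible interface (CuPy is one example, named here as an assumption rather than a tool from the review) can run the same expression on a GPU.

```python
# Minimal sketch: a diffusion-style finite-difference update written as whole-array
# operations. Such data-parallel kernels are what GPU ports accelerate; swapping
# NumPy for a GPU array library with the same interface (e.g., CuPy -- an assumption,
# not a tool from the review) leaves the stencil code unchanged.
import numpy as np

nx, ny, dt, kappa = 256, 256, 0.1, 0.2
field = np.zeros((nx, ny))
field[nx // 2, ny // 2] = 1.0                      # point disturbance in the middle

for _ in range(100):
    lap = (np.roll(field, 1, 0) + np.roll(field, -1, 0) +
           np.roll(field, 1, 1) + np.roll(field, -1, 1) - 4.0 * field)
    field = field + dt * kappa * lap               # explicit diffusion step with periodic boundaries

print("total 'heat' conserved:", field.sum())
```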
-
Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving the performance of RF training, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suited to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline on both GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR, and up to a 2× speedup over Nvidia's cuML library.
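The paper's hierarchical layout is not reproduced here; the sketch below only shows the general style of a flattened, array-based tree representation that accelerator-friendly random-forest classifiers traverse, with node fields in parallel arrays whose names are assumptions.

```python
# Minimal sketch (assumed layout, not the paper's hierarchical format): a decision
# tree flattened into parallel node arrays, the style of representation that
# GPU/FPGA random-forest classifiers traverse instead of pointer-based trees.
import numpy as np

# Node i: test feature[i] <= threshold[i]; children indices; feature == -1 marks a leaf.
feature   = np.array([0,   1,   -1,  -1,  -1])
threshold = np.array([0.5, 0.3, 0.0, 0.0, 0.0])
left      = np.array([1,   3,   -1,  -1,  -1])
right     = np.array([2,   4,   -1,  -1,  -1])
leaf_val  = np.array([0.0, 0.0, 1.0, 0.0, 1.0])   # class predicted at each leaf

def predict(x: np.ndarray) -> float:
    """Traverse the flattened tree for a single sample x."""
    node = 0
    while feature[node] != -1:                     # descend until a leaf is reached
        node = left[node] if x[feature[node]] <= threshold[node] else right[node]
    return leaf_val[node]

samples = np.array([[0.2, 0.1], [0.2, 0.9], [0.9, 0.4]])
print([predict(s) for s in samples])               # -> [0.0, 1.0, 1.0]
```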

