NSF PAR Search | NSF Public Access Repository

Special Session: Machine Learning for Embedded System Design

https://doi.org/10.1145/3607888.3608962

Alcorta Lozano, Erika Susana; Gerstlauer, Andreas; Deng, Chenhui; Sun, Qi; Zhang, Zhiru; Xu, Ceyu; Wills, Lisa Wu; Sanchez Lopera, Daniela; Ecker, Wolfgang; Garg, Siddharth; et al. (September 2023, International Conference on Hardware/Software Codesign and System Synthesis)

CoMeFa: Deploying Compute-in-Memory on FPGAs for Deep Learning Acceleration

https://doi.org/10.1145/3603504

Arora, Aman; Bhamburkar, Atharva; Borda, Aatman; Anand, Tanmay; Sehgal, Rishabh; Hanindhito, Bagus; Gaillardon, Pierre-Emmanuel; Kulkarni, Jaydeep; John, Lizy K. (July 2023, ACM Transactions on Reconfigurable Technology and Systems)

Block random access memories (BRAMs) are the storage houses of FPGAs, providing extensive on-chip memory bandwidth to the compute units implemented using logic blocks and digital signal processing slices. We propose modifying BRAMs to convert them to CoMeFa (Compute-in-Memory Blocks for FPGAs) random access memories (RAMs). These RAMs provide highly parallel compute-in-memory by combining computation and storage capabilities in one block. CoMeFa RAMs utilize the true dual-port nature of FPGA BRAMs and contain multiple configurable single-bit bit-serial processing elements. CoMeFa RAMs can be used to compute with any precision, which is extremely important for applications like deep learning (DL). Adding CoMeFa RAMs to FPGAs significantly increases their compute density while also reducing data movement. We explore and propose two architectures of these RAMs: CoMeFa-D (optimized for delay) and CoMeFa-A (optimized for area). Compared to existing proposals, CoMeFa RAMs do not require changing the underlying static RAM technology, for example by simultaneously activating multiple wordlines on the same port, and are practical to implement. CoMeFa RAMs are especially suitable for parallel and compute-intensive applications like DL, but these versatile blocks are also useful in diverse domains such as signal processing and databases, among others. By augmenting an Intel Arria 10–like FPGA with CoMeFa-D (CoMeFa-A) RAMs at the cost of 3.8% (1.2%) area, and with algorithmic improvements and efficient mapping, we observe a geomean speedup of 2.55× (1.85×) across microbenchmarks from various applications and a geomean speedup of up to 2.5× across multiple deep neural networks. Replacing all or some BRAMs with CoMeFa RAMs in FPGAs can make them better accelerators of DL workloads.
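As a rough illustration of why single-bit, bit-serial processing elements can work at any precision, the following Python sketch adds many operand pairs in parallel, one bit position per step. It is not the CoMeFa design itself: the transposed data layout (row i holds bit i of every operand) and the helper names `to_bit_rows`, `from_bit_rows`, and `bit_serial_add` are assumptions made for this example.

```python
def to_bit_rows(values, width):
    """Transpose integers into bit rows: row i holds bit i of every value.

    Assumed layout for this sketch: one column (lane) per processing element.
    """
    return [[(v >> i) & 1 for v in values] for i in range(width)]


def from_bit_rows(rows):
    """Rebuild one integer per lane from least-significant-first bit rows."""
    lanes = len(rows[0])
    return [sum(rows[i][c] << i for i in range(len(rows))) for c in range(lanes)]


def bit_serial_add(a_rows, b_rows):
    """Add two transposed operands with one single-bit full adder per lane.

    Each loop iteration models one step: every lane reads one bit of each
    operand, produces a sum bit, and keeps its carry for the next step.
    The operand width is simply the number of rows, so any precision works.
    """
    lanes = len(a_rows[0])
    carry = [0] * lanes
    out = []
    for a_row, b_row in zip(a_rows, b_rows):
        row = []
        for c in range(lanes):
            a, b, cin = a_row[c], b_row[c], carry[c]
            row.append(a ^ b ^ cin)               # sum bit
            carry[c] = (a & b) | (cin & (a ^ b))  # carry out
        out.append(row)
    out.append(carry[:])  # final carry becomes the extra top bit
    return out


a = [3, 5, 7, 0]
b = [1, 2, 9, 15]
result = from_bit_rows(bit_serial_add(to_bit_rows(a, 4), to_bit_rows(b, 4)))
print(result)  # [4, 7, 16, 15]
```

The number of steps grows with the operand width while the number of lanes sets the parallelism, which is the trade-off that bit-serial compute relies on.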
HLSDataset: Open-Source Dataset for ML-Assisted FPGA Design using High Level Synthesis

https://doi.org/10.1109/ASAP57973.2023.00040

Wei, Zhigang; Arora, Aman; Li, Ruihao; John, Lizy (July 2023, International Conference on Application-specific Systems, Architectures and Processors)

Lightweight ML-based Runtime Prefetcher Selection on Many-core Platforms

Alcorta, Erika S.; Yadwadkar, Neeraja J.; Gerstlauer, Andreas (May 2023, Workshop on Machine Learning for Computer Architecture and Systems)

Koios 2.0: Open-Source Deep Learning Benchmarks for FPGA Architecture and CAD Research

https://doi.org/10.1109/TCAD.2023.3272582

Arora, Aman; Boutros, Andrew; Damghani, Seyed Alireza; Mathur, Karan; Mohanty, Vedant; Anand, Tanmay; Elgammal, Mohamed A.; Kent, Kenneth B.; Betz, Vaughn; John, Lizy K. (May 2023, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems

https://doi.org/10.1145/3545008.3545044

Wu, Qinzhe; Ekanayake, Ashen; Li, Ruihao; Beard, Jonathan; John, Lizy (January 2023, International Conference on Parallel Processing)

Tensor Slices: FPGA Building Blocks For The Deep Learning Era

https://doi.org/10.1145/3529650

Arora, Aman; Ghosh, Moinak; Mehta, Samidh; Betz, Vaughn; John, Lizy K. (December 2022, ACM Transactions on Reconfigurable Technology and Systems)

FPGAs are well-suited for accelerating deep learning (DL) applications owing to the rapidly changing algorithms, network architectures and computation requirements in this field. However, the generic building blocks available on traditional FPGAs limit the acceleration that can be achieved. Many modifications to FPGA architecture have been proposed and deployed, including adding specialized artificial intelligence (AI) processing engines, adding support for smaller precision math like 8-bit fixed point and IEEE half-precision (fp16) in DSP slices, adding shadow multipliers in logic blocks, etc. In this paper, we describe replacing a portion of the FPGA’s programmable logic area with Tensor Slices. These slices have a systolic array of processing elements at their heart that supports multiple tensor operations and multiple dynamically-selectable precisions, and that can be dynamically fractured into individual multipliers and MACs (multiply-and-accumulate). These slices have a local crossbar at the inputs that eases the routing pressure caused by a large block on the FPGA. Adding these DL-specific coarse-grained hard blocks to FPGAs increases their compute density and makes them even better hardware accelerators for DL applications, while still keeping the vast majority of the real estate on the FPGA programmable at a fine grain.
Learning-based Phase-aware Multi-core CPU Workload Forecasting

https://doi.org/10.1145/3564929

Alcorta, Erika S.; Gerstlauer, Andreas (December 2022, ACM Transactions on Design Automation of Electronic Systems)

Predicting workload behavior during workload execution is essential for dynamic resource optimization in multi-processor systems. Recent studies have proposed advanced machine learning techniques for dynamic workload prediction. Workload prediction can be cast as a time series forecasting problem. However, traditional forecasting models struggle to predict abrupt workload changes. These changes occur because workloads are known to go through phases. Prior work has investigated machine learning-based approaches for phase detection and prediction, but such approaches have not been studied in the context of dynamic workload forecasting. In this paper, we propose phase-aware CPU workload forecasting as a novel approach that applies long-term phase prediction to improve the accuracy of short-term workload forecasting. Phase-aware forecasting requires machine learning models for phase classification, phase prediction, and phase-based forecasting that have not been explored in this combination before. Furthermore, existing prediction approaches have only been studied in single-core settings. This work explores phase-aware workload forecasting with multi-threaded workloads running on multi-core systems. We propose different multi-core settings differentiated by the number of cores they access and whether they produce specialized or global outputs per core. We study various advanced machine learning models for phase classification, phase prediction, and phase-based forecasting in isolation and different combinations for each setting. We apply our approach to forecasting of multi-threaded Parsec and SPEC workloads running on an 8-core Intel Core-i9 platform. Our results show that combining GMM clustering with LSTMs for phase prediction and phase-based forecasting yields the best phase-aware forecasting results. An approach that uses specialized models per core achieves an average error of 23% with up to 22% improvement in prediction accuracy compared to a phase-unaware setup.
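The three stages named in the abstract (phase classification, phase prediction, phase-based forecasting) can be pictured with a toy Python sketch. It deliberately swaps in much simpler techniques than the paper's: nearest-centroid labeling stands in for GMM clustering, and a first-order transition count stands in for the LSTM models. The function names, the fixed centroids, and the sample trace are all hypothetical.

```python
from collections import Counter, defaultdict


def classify(sample, centroids):
    """Phase classification: index of the nearest centroid (stand-in for GMM)."""
    return min(range(len(centroids)), key=lambda k: abs(sample - centroids[k]))


def fit(trace, centroids):
    """Learn phase transitions and a per-phase mean from a workload trace."""
    phases = [classify(x, centroids) for x in trace]
    transitions = defaultdict(Counter)
    for cur, nxt in zip(phases, phases[1:]):
        transitions[cur][nxt] += 1  # count observed phase changes
    sums, counts = defaultdict(float), Counter()
    for x, p in zip(trace, phases):
        sums[p] += x
        counts[p] += 1
    means = {p: sums[p] / counts[p] for p in counts}
    return transitions, means


def forecast_next(sample, centroids, transitions, means):
    """Predict the next phase, then forecast with that phase's mean."""
    cur = classify(sample, centroids)
    seen = transitions.get(cur)
    nxt = seen.most_common(1)[0][0] if seen else cur  # phase prediction
    return means.get(nxt, sample)  # fall back to the last value if unseen


# Hypothetical trace that alternates abruptly between a low and a high phase.
trace = [1.0, 5.0, 1.2, 5.2, 0.8, 4.8]
centroids = [1.0, 5.0]
transitions, means = fit(trace, centroids)
print(forecast_next(1.1, centroids, transitions, means))
```

In this toy trace, a forecaster that simply repeats the last value would predict about 1.1, whereas the phase-aware version anticipates the jump to the high phase and forecasts a value near 5.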
Machine Learning-Based Microarchitecture-Level Power Modeling of CPUs

https://doi.org/10.1109/TC.2022.3185572

Kumar, Ajay Krishna; Alsalamin, Sami; Amrouch, Hussam; Gerstlauer, Andreas (June 2022, IEEE Transactions on Computers)

Energy efficiency has emerged as a key concern for modern processor design, especially when it comes to embedded and mobile devices. It is vital to accurately quantify the power consumption of different micro-architectural components in a CPU. Traditional RTL or gate-level power estimation is too slow for early design-space exploration studies. By contrast, existing architecture-level power models suffer from large inaccuracies. Recently, advanced machine learning techniques have been proposed for accurate power modeling. However, existing approaches still require slow RTL simulations, have large training overheads or have only been demonstrated for fixed-function accelerators and simple in-order cores with predictable behavior. In this work, we present a novel machine learning-based approach for microarchitecture-level power modeling of complex CPUs. Our approach requires only high-level activity traces obtained from microarchitecture simulations. We extract representative features and develop low-complexity learning formulations for different types of CPU-internal structures. Cycle-accurate models at the sub-component level are trained from a small number of gate-level simulations and hierarchically composed to build power models for complete CPUs. We apply our approach to both in-order and out-of-order RISC-V cores. Cross-validation results show that our models predict cycle-by-cycle power consumption to within 3% of a gate-level power estimation on average. In addition, our power model for the Berkeley Out-of-Order (BOOM) core trained on micro-benchmarks can predict the cycle-by-cycle power of real-world applications with less than 3.6% mean absolute error.
GAPS: GPU-acceleration of PDE solvers for wave simulation

https://doi.org/10.1145/3524059.3532373

Hanindhito, Bagus; Gourounas, Dimitrios; Fathi, Arash; Trenev, Dimitar; Gerstlauer, Andreas; John, Lizy K. (June 2022, ACM International Conference on Supercomputing (ICS))
