NSF PAR Search | NSF Public Access Repository

Efficient Approaches for GEMM Acceleration on Leading AI-Optimized FPGAs

https://doi.org/10.1109/FCCM60383.2024.00015

Taka, Endri; Gourounas, Dimitrios; Gerstlauer, Andreas; Marculescu, Diana; Arora, Aman (May 2024, IEEE)

Full Text Available
A Hierarchical Classification Method for High-accuracy Instruction Disassembly with Near-field EM Measurements

https://doi.org/10.1145/3629167

Iyer, Vishnuvardhan V; Thimmaiah, Aditya; Orshansky, Michael; Gerstlauer, Andreas; Yilmaz, Ali E (January 2024, ACM Transactions on Embedded Computing Systems)

Electromagnetic (EM) fields have been extensively studied as potent side-channel tools for testing the security of hardware implementations. In this work, a low-cost side-channel disassembler that uses fine-grained EM signals to predict a program's execution trace with high accuracy is proposed. Unlike conventional side-channel disassemblers, the proposed disassembler does not require extensive randomized instantiations of instructions to profile them, instead relying on leakage-model-informed sub-sampling of potential architectural states resulting from instruction execution, which is further augmented by using a structured hierarchical approach. The proposed disassembler consists of two phases: (i) In the feature-selection phase, signals are collected with a relatively small EM probe, performing high-resolution scans near the chip surface, as profiling codes are executed. The measured signals from the numerous probe configurations are compiled into a hierarchical database by storing the min-max envelopes of the probed EM fields and differential signals derived from them, a novel dimension that increases the potency of the analysis. The envelope-to-envelope distances are evaluated throughout the hierarchy to identify optimal measurement configurations that maximize the distance between each pair of instruction classes. (ii) In the classification phase, signals measured for unknown instructions using optimal measurement configurations identified in the first phase are compared to the envelopes stored in the database to perform binary classification with majority voting, identifying candidate instruction classes at each hierarchical stage. Both phases of the disassembler rely on a four-stage hierarchical grouping of instructions by their length, size, operands, and functions. The proposed disassembler is shown to recover ∼97–99% of instructions from several test and application benchmark programs executed on the AT89S51 microcontroller.
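The envelope-based matching described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the gap-based distance, the out-of-envelope `violation` score, the pairwise voting scheme, and all trace values and class names below are assumptions made for illustration, and the four-stage hierarchy and probe-configuration search are omitted.

```python
from collections import Counter
from itertools import combinations


def envelope(traces):
    """Per-sample (min, max) envelope over equal-length profiling traces."""
    return [(min(col), max(col)) for col in zip(*traces)]


def envelope_gap(env_a, env_b):
    """Assumed distance: summed per-sample gap between two envelopes (0 where they overlap)."""
    return sum(
        max(0.0, lo_a - hi_b, lo_b - hi_a)
        for (lo_a, hi_a), (lo_b, hi_b) in zip(env_a, env_b)
    )


def violation(trace, env):
    """How far a measured trace falls outside an envelope, summed over samples."""
    return sum(max(0.0, lo - x, x - hi) for x, (lo, hi) in zip(trace, env))


def classify(trace, envelopes):
    """One binary decision per pair of classes, followed by a majority vote."""
    votes = Counter()
    for a, b in combinations(sorted(envelopes), 2):
        a_fits_better = violation(trace, envelopes[a]) <= violation(trace, envelopes[b])
        votes[a if a_fits_better else b] += 1
    return votes.most_common(1)[0][0]


# Hypothetical profiling traces (arbitrary units) for three instruction classes.
profiles = {
    "ADD": [[0.1, 0.5, 0.2], [0.2, 0.6, 0.3]],
    "MOV": [[0.7, 0.1, 0.9], [0.8, 0.2, 1.0]],
    "NOP": [[0.4, 0.9, 0.5], [0.5, 1.0, 0.6]],
}
envelopes = {name: envelope(traces) for name, traces in profiles.items()}
print(classify([0.75, 0.15, 0.95], envelopes))  # prints MOV
```

In this sketch a larger `envelope_gap` between two classes means they are easier to tell apart, which is the quantity the feature-selection phase would try to maximize when choosing a measurement configuration.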
Full Text Available
Lightweight ML-based Runtime Prefetcher Selection on Many-core Platforms

Alcorta, Erika S.; Yadwadkar, Neeraja J.; Gerstlauer, Andreas (May 2023, Workshop on Machine Learning for Computer Architecture and Systems)

Full Text Available
Learning-based Phase-aware Multi-core CPU Workload Forecasting

https://doi.org/10.1145/3564929

Alcorta, Erika S.; Gerstlauer, Andreas (December 2022, ACM Transactions on Design Automation of Electronic Systems)

Predicting workload behavior during workload execution is essential for dynamic resource optimization in multi-processor systems. Recent studies have proposed advanced machine learning techniques for dynamic workload prediction. Workload prediction can be cast as a time series forecasting problem. However, traditional forecasting models struggle to predict abrupt workload changes. These changes occur because workloads are known to go through phases. Prior work has investigated machine learning-based approaches for phase detection and prediction, but such approaches have not been studied in the context of dynamic workload forecasting. In this paper, we propose phase-aware CPU workload forecasting as a novel approach that applies long-term phase prediction to improve the accuracy of short-term workload forecasting. Phase-aware forecasting requires machine learning models for phase classification, phase prediction, and phase-based forecasting that have not been explored in this combination before. Furthermore, existing prediction approaches have only been studied in single-core settings. This work explores phase-aware workload forecasting with multi-threaded workloads running on multi-core systems. We propose different multi-core settings differentiated by the number of cores they access and whether they produce specialized or global outputs per core. We study various advanced machine learning models for phase classification, phase prediction, and phase-based forecasting in isolation and in different combinations for each setting. We apply our approach to forecasting of multi-threaded PARSEC and SPEC workloads running on an 8-core Intel Core i9 platform. Our results show that combining GMM clustering with LSTMs for phase prediction and phase-based forecasting yields the best phase-aware forecasting results. An approach that uses specialized models per core achieves an average error of 23% with up to 22% improvement in prediction accuracy compared to a phase-unaware setup.
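The three-stage structure named in this abstract (phase classification, phase prediction, phase-based forecasting) can be sketched as below. This is only a structural illustration with deliberately simple stand-ins: nearest-centroid labeling replaces GMM clustering, a most-frequent-successor table replaces the LSTM phase predictor, and a per-phase mean replaces the LSTM forecaster. The series values and centroids are hypothetical.

```python
from collections import Counter, defaultdict


def label_phases(series, centroids):
    """Assign each sample to its nearest centroid (stand-in for GMM clustering)."""
    return [
        min(range(len(centroids)), key=lambda k: abs(x - centroids[k]))
        for x in series
    ]


def fit_transitions(phases):
    """Most frequent successor of each phase (stand-in for an LSTM phase predictor)."""
    counts = defaultdict(Counter)
    for current, following in zip(phases, phases[1:]):
        counts[current][following] += 1
    return {phase: c.most_common(1)[0][0] for phase, c in counts.items()}


def fit_phase_means(series, phases):
    """Mean workload value per phase (stand-in for a per-phase forecaster)."""
    totals, sizes = defaultdict(float), Counter()
    for x, phase in zip(series, phases):
        totals[phase] += x
        sizes[phase] += 1
    return {phase: totals[phase] / sizes[phase] for phase in sizes}


def forecast_next(series, centroids):
    """Predict the next phase first, then forecast with that phase's model."""
    phases = label_phases(series, centroids)
    next_phase = fit_transitions(phases).get(phases[-1], phases[-1])
    return fit_phase_means(series, phases)[next_phase]


# Hypothetical single-core workload metric alternating between two phases.
series = [1.0, 1.1, 0.9, 3.0, 3.2, 2.9, 1.0, 1.1, 0.9, 3.1]
print(forecast_next(series, centroids=[1.0, 3.0]))
```

The point of the design is that the forecast is conditioned on the predicted phase, so an abrupt change between phases does not get averaged away as it would with a single phase-unaware model.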
Full Text Available
Memory Latency Distribution-Driven Regulation for Temporal Isolation in MPSoCs

https://doi.org/10.4230/LIPIcs.ECRTS.2023.4

Saeed, Ahsan; Hoornaert, Denis; Dasari, Dakshina; Ziegenbein, Dirk; Mueller-Gritschneder, Daniel; Schlichtmann, Ulf; Gerstlauer, Andreas; Mancuso, Renato (July 2023, 35th Euromicro Conference on Real-Time Systems (ECRTS 2023))
Papadopoulos, Alessandro V. (Ed.)
Temporal isolation is one of the most significant challenges that must be addressed before Multi-Processor Systems-on-Chip (MPSoCs) can be widely adopted in mixed-criticality systems with both time-sensitive real-time (RT) applications and performance-oriented non-real-time (NRT) applications. Specifically, the main memory subsystem is one of the most prevalent causes of interference, performance degradation and loss of isolation. Existing memory bandwidth regulation mechanisms use static, dynamic, or predictive DRAM bandwidth management techniques to restore the execution time of an application under contention as close as possible to the execution time in isolation. In this paper, we propose a novel distribution-driven regulation whose goal is to achieve a timeliness objective formulated as a constraint on the probability of meeting a certain target execution time for the RT applications. Using existing interconnect-level Performance Monitoring Units (PMU), we can observe the Cumulative Distribution Function (CDF) of the per-request memory latency. Regulation is then triggered to enforce first-order stochastic dominance with respect to a desired reference. Consequently, it is possible to enforce that the overall observed execution time random variable is dominated by the reference execution time. The mechanism requires no prior information about the contending application and treats the DRAM subsystem as a black box. We provide a full-stack implementation of our mechanism on a Commercial Off-The-Shelf (COTS) platform (Xilinx Ultrascale+ MPSoC), evaluate it using real and synthetic benchmarks, experimentally validate that the timeliness objectives are met for the RT applications, and demonstrate that it is able to provide 2.2x more overall throughput for NRT applications compared to DRAM bandwidth management-based regulation approaches.
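The dominance condition mentioned in this abstract can be checked on latency samples as in the following sketch. It assumes the usual convention that an observed latency is dominated by a reference when its empirical CDF is nowhere below the reference CDF, so the observed latency is stochastically no larger. The function names and sample values are hypothetical, and the sketch covers only the check itself, not the PMU-based observation or the regulation mechanism.

```python
from bisect import bisect_right


def empirical_cdf(samples):
    """Return F(t) = fraction of samples less than or equal to t."""
    ordered = sorted(samples)
    n = len(ordered)

    def cdf(t):
        return bisect_right(ordered, t) / n

    return cdf


def is_dominated_by(observed, reference):
    """True if F_observed(t) >= F_reference(t) for every t.

    Both empirical CDFs are step functions that change only at sample
    values, so checking the union of sample values is sufficient.
    """
    f_obs = empirical_cdf(observed)
    f_ref = empirical_cdf(reference)
    points = sorted(set(observed) | set(reference))
    return all(f_obs(t) >= f_ref(t) for t in points)


# Hypothetical per-request memory latencies in nanoseconds.
reference = [100, 120, 150, 200]
within = [90, 110, 140, 180]   # never slower than the reference in distribution
beyond = [90, 110, 140, 260]   # heavier tail than the reference

print(is_dominated_by(within, reference))  # prints True
print(is_dominated_by(beyond, reference))  # prints False
```

A regulator built on this idea would presumably throttle contending traffic whenever the check fails, but that control loop is outside what the abstract specifies.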
Full Text Available
Special Session: Machine Learning for Embedded System Design

https://doi.org/10.1145/3607888.3608962

Alcorta Lozano, Erika Susana; Gerstlauer, Andreas; Deng, Chenhui; Sun, Qi; Zhang, Zhiru; Xu, Ceyu; Wills, Lisa Wu; Sanchez Lopera, Daniela; Ecker, Wolfgang; Garg, Siddharth; et al (September 2023, International Conference on Hardware/Software Codesign and System Synthesis)

Full Text Available
CASPHAr: Cache-Managed Accelerator Staging and Pipelining in Heterogeneous System Architectures

https://doi.org/10.1109/TCAD.2022.3197535

Asri, Mochamad; Gerstlauer, Andreas (August 2022, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
High-Level Simulation of Embedded Software Vulnerabilities to EM Side-Channel Attacks

Thimmaiah, Aditya; Iyer, Vishnuvardhan V.; Gerstlauer, Andreas; Orshansky, Michael (July 2022, International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS))

Full Text Available
Machine Learning-Based Microarchitecture-Level Power Modeling of CPUs

https://doi.org/10.1109/TC.2022.3185572

Kumar, Ajay Krishna; Alsalamin, Sami; Amrouch, Hussam; Gerstlauer, Andreas (June 2022, IEEE Transactions on Computers)

Energy efficiency has emerged as a key concern for modern processor design, especially when it comes to embedded and mobile devices. It is vital to accurately quantify the power consumption of different micro-architectural components in a CPU. Traditional RTL or gate-level power estimation is too slow for early design-space exploration studies. By contrast, existing architecture-level power models suffer from large inaccuracies. Recently, advanced machine learning techniques have been proposed for accurate power modeling. However, existing approaches still require slow RTL simulations, have large training overheads or have only been demonstrated for fixed-function accelerators and simple in-order cores with predictable behavior. In this work, we present a novel machine learning-based approach for microarchitecture-level power modeling of complex CPUs. Our approach requires only high-level activity traces obtained from microarchitecture simulations. We extract representative features and develop low-complexity learning formulations for different types of CPU-internal structures. Cycle-accurate models at the sub-component level are trained from a small number of gate-level simulations and hierarchically composed to build power models for complete CPUs. We apply our approach to both in-order and out-of-order RISC-V cores. Cross-validation results show that our models predict cycle-by-cycle power consumption to within 3% of a gate-level power estimation on average. In addition, our power model for the Berkeley Out-of-Order (BOOM) core trained on micro-benchmarks can predict the cycle-by-cycle power of real-world applications with less than 3.6% mean absolute error.
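The hierarchical composition described in this abstract, where per-component models are summed into a whole-CPU estimate for each cycle, might look like the sketch below. The linear model form, the component names, the weights, and the activity features are all assumptions chosen for illustration; the abstract states only that the learning formulations are low-complexity and trained from gate-level simulations.

```python
def component_power(weights, bias, activity):
    """Assumed linear model: bias plus a weighted sum of activity features."""
    return bias + sum(w * a for w, a in zip(weights, activity))


def cpu_power(models, activity_by_component):
    """Compose the CPU-level estimate for one cycle by summing component estimates."""
    return sum(
        component_power(weights, bias, activity_by_component[name])
        for name, (weights, bias) in models.items()
    )


def power_trace(models, cycles):
    """Cycle-by-cycle power for a trace of per-cycle activity records."""
    return [cpu_power(models, cycle) for cycle in cycles]


# Hypothetical trained models: component name -> (feature weights, bias), arbitrary units.
models = {
    "alu": ([0.5, 0.25], 1.0),
    "lsu": ([0.75], 0.5),
}

# Hypothetical activity features taken from a microarchitecture simulation.
cycles = [
    {"alu": [2, 4], "lsu": [2]},
    {"alu": [0, 0], "lsu": [0]},
]
print(power_trace(models, cycles))  # prints [5.0, 1.5]
```

Keeping one small model per component means each can be trained and validated separately against gate-level data before being composed.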
Full Text Available
GAPS: GPU-acceleration of PDE solvers for wave simulation

https://doi.org/10.1145/3524059.3532373

Hanindhito, Bagus; Gourounas, Dimitrios; Fathi, Arash; Trenev, Dimitar; Gerstlauer, Andreas; John, Lizy K. (June 2022, ACM International Conference on Supercomputing (ICS))

Full Text Available
