Deep learning accelerators are important tools for feeding the growing demand for deep learning applications. The automated design of such accelerators--which is important for reducing development costs--can be viewed as a search over a vast and complex design space that consists of all possible accelerators and all the possible software that could run on them. Unfortunately, this search is complicated by the existence of many ordinal and categorical values, which are critical to explore for the ultimate design but are not handled well by existing search techniques. This paper presents a technique for efficiently searching this space by injecting domain information--in this case information about hardware/software (HW/SW) co-design--into the automated search process. Specifically, this paper introduces a novel Bayesian optimization framework called daBO (domain-aware BO) that accepts domain information as input, including those describing ordinal and categorical values. This paper also introduces Spotlight, a design tool based on daBO, and this paper empirically shows that Spotlight produces accelerator designs and software schedules that are orders of magnitude better than those created by the state-of-the-art. For example, for the ResNet-50 deep learning model, Spotlight produces a HW/SW configuration that reduces delay by 135x over the configuration produced by ConfuciuX, a state-of-the-art HW/SW co-design tool, and Spotlight reduces energy-delay product (EDP) by 44x over an Eyeriss-like accelerator, which is an edge-scale hand-designed accelerator. In the realm of cloud-scale accelerators, Spotlight reduces the EDP of a scaled-up Eyeriss-like accelerator by 23x. Our evaluation shows that Spotlight benefits from the efficiency of daBO, which allows Spotlight to identify accelerator designs and software schedules that prior work cannot identify.
This content will become publicly available on July 7, 2026
UltraScale+ SpinalHDL Wrapper: Streamlining Ideas to Bitstream on UltraScale+ platforms
In an embedded computing landscape that inexorably leans into heterogeneity, System-on-Chips (SoCs) featuring tightly integrated Field Programmable Gate Arrays (FPGAs) are bound to proliferate. In particular, such architectures' high degree of flexibility and control caters well to the real-time community. Despite the appeal, real-time research exploiting HW/SW co-design on such architectures has remained tepid. While the usual suspects, such as the complexity of Hardware Description Languages, can be blamed, recent advancements in tooling (e.g., languages, frameworks) have proven efficient in easing the design of FPGA-located accelerators. However, in the context of SoC with FPGA platforms, these solutions fall short of addressing the next hurdle: integrating the custom accelerators with the rest of the SoC, which requires the tedious implementation of various supporting software resources. This article presents the first iteration of the UltraScale+ SpinalHDL Wrapper, a SpinalHDL library dedicated to supporting HW/SW co-design on SoC with FPGA platforms. The support ranges from assisting during the design of accelerators to automatically inferring and generating ready-to-use software support, such as Linux Kernel modules and Vivado deployment scripts.
- Award ID(s):
- 2403012
- PAR ID:
- 10621333
- Publisher / Repository:
- 19th Workshop on Operating Systems Platforms for Embedded Real-Time Applications (OSPERT 2025)
- Date Published:
- Subject(s) / Keyword(s):
- FPGA, UltraScale+, Hardware/Software co-design, Hardware Construct Languages
- Format(s):
- Medium: X
- Location:
- Brussels, Belgium
- Sponsoring Org:
- National Science Foundation
More Like this
- Artificial intelligence (AI) based wearable applications collect and process a significant amount of streaming sensor data. Transmitting the raw data to cloud processors wastes scarce energy and threatens user privacy. Wearable edge AI devices should ideally balance two competing requirements: (1) maximizing the energy efficiency using targeted hardware accelerators and (2) providing versatility using general-purpose cores to support arbitrary applications. To this end, we present an open-source domain-specific programmable system-on-chip (SoC) that combines a RISC-V core with a meticulously determined set of accelerators targeting wearable applications. We apply the proposed design method to design an FPGA prototype and six real-life use cases to demonstrate the efficacy of the proposed SoC. Thorough experimental evaluations show that the proposed SoC provides up to 9.1x faster execution and up to 8.9x higher energy efficiency than software implementations in FPGA while maintaining programmability.
- The security and performance of FPGA-based accelerators play vital roles in today's cloud services. In addition to supporting convenient access to high-end FPGAs, cloud vendors and third-party developers now provide numerous FPGA accelerators for machine learning models. However, the security of accelerators developed for state-of-the-art Cloud FPGA environments has not been fully explored, since most remote accelerator attacks have been prototyped on local FPGA boards in lab settings, rather than in Cloud FPGA environments. To address existing research gaps, this work analyzes three existing machine learning accelerators developed in Xilinx Vitis to assess the potential threats of power attacks on accelerators in Amazon Web Services (AWS) F1 Cloud FPGA platforms, in a multi-tenant setting. The experiments show that malicious co-tenants in a multi-tenant environment can instantiate voltage sensing circuits as register-transfer level (RTL) kernels within the Vitis design environment to spy on co-tenant modules. A methodology for launching a practical remote power attack on Cloud FPGAs is also presented, which uses an enhanced time-to-digital converter (TDC) based voltage sensor and an auto-triggered mechanism. The TDC is used to capture power signatures, which are then used to identify power consumption spikes and observe activity patterns involving the FPGA shell, DRAM on the FPGA board, or the other co-tenant victim's accelerators. Voltage change patterns related to shell use and accelerators are then used to create an auto-triggered attack that can automatically detect when to capture voltage traces without the need for a hard-wired synchronization signal between victim and attacker. To address the novel threats presented in this work, this paper also discusses defenses that could be leveraged to secure multi-tenant Cloud FPGAs from power-based attacks.
- Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to the cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on 32-bit data width.
- Over a billion mobile consumer system-on-chip (SoC) chipsets ship each year. Of these, the mobile consumer market undoubtedly involving smartphones has a significant market share. Most modern smartphones comprise advanced SoC architectures that are made up of multiple cores, GPS, and many different programmable and fixed-function accelerators connected via a complex hierarchy of interconnects with the goal of running a dozen or more critical software use cases under strict power, thermal and energy constraints. The steadily growing complexity of a modern SoC challenges hardware computer architects on how best to do early-stage ideation. Late SoC design typically relies on detailed full-system simulation once the hardware is specified and accelerator software is written or ported. However, early-stage SoC design must often select accelerators before a single line of software is written. To help frame SoC thinking and guide early-stage mobile SoC design, in this paper we contribute the Gables model that refines and retargets the Roofline model (designed originally for the performance and bandwidth limits of a multicore chip) to model each accelerator on a SoC, to apportion work concurrently among different accelerators (justified by our use case analysis), and calculate a SoC performance upper bound. We evaluate the Gables model with an existing SoC and develop several extensions that allow Gables to inform early-stage mobile SoC design.
