Applications are migrating en masse to the cloud, while ac- celerators such as GPUs, TPUs, and FPGAs proliferate in the wake of Moore’s Law. These trends are in conflict: cloud ap- plicationsrunonvirtualplatforms,butexistingvirtualization techniques have not provided production-ready solutions for accelerators. As a result, cloud providers expose accel- erators by dedicating physical devices to individual guests. Multi-tenancy and consolidation are lost as a consequence. We present AvA, which addresses limitations of existing virtualization techniques with automated construction of hypervisor-managed virtual accelerator stacks. AvA com- bines a DSL for describing APIs and sharing policies, device- agnostic runtime components, and a compiler to generate accelerator-specific components such as guest libraries and API servers. AvA uses Hypervisor Interposed Remote Acceleration (HIRA),anewtechniquetoenablehypervisor- enforcement of sharing policies from the specification. We use AvA to virtualize nine accelerators and eleven framework APIs, including six for which no virtualization support has been previously explored. AvA provides near- native performance and can enforce sharing policies that are not possible with current techniques, with orders of magnitude less developer effort than required for hand-built virtualization support.
more »
« less
Virtualizing a Post-Moore’s Law Analog Mesh Processor: The Case of a Photonic PDE Accelerator
Innovative processor architectures aim to play a critical role in future sustainment of performance improvements under severe limitations imposed by the end of Moore’s Law. The Reconfigurable Optical Computer (ROC) is one such innovative, Post-Moore’s Law processor. ROC is designed to solve partial differential equations in one shot as opposed to existing solutions, which are based on costly iterative computations. This is achieved by leveraging physical properties of a mesh of optical components that behave analogously to lumped electrical components. However, virtualization is required to combat shortfalls of the accelerator hardware. Namely, 1) the infeasibility of building large photonic arrays to accommodate arbitrarily large problems, and 2) underutilization brought about by mismatches in problem and accelerator mesh sizes due to future advances in manufacturing technology. In this work, we introduce an architecture and methodology for light-weight virtualization of ROC which exploits advantages borne from optical computing technology. Specifically, we apply temporal and spatial virtualization to ROC and then extend the accelerator scheduling tradespace with the introduction of spectral virtualization. Additionally, we investigate multiple resource scheduling strategies for a system-on-chip (SoC)-based PDE acceleration architecture and show that virtual configuration management offers a speedup of approximately 2 ×. Finally, we show that overhead from virtualization is minimal, and our experimental results show two orders of magnitude increased speed as compared to microprocessor execution while keeping errors due to virtualization under 10%.
more »
« less
- Award ID(s):
- 1748294
- PAR ID:
- 10389633
- Date Published:
- Journal Name:
- ACM Transactions on Embedded Computing Systems
- ISSN:
- 1539-9087
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
The dominance of machine learning and the ending of Moore’s law have renewed interests in Processor in Memory (PIM) architectures. This interest has produced several recent proposals to modify an FPGA’s BRAM architecture to form a next-generation PIM reconfigurable fabric [1], [2]. PIM architectures can also be realized within today’s FPGAs as overlays without the need to modify the underlying FPGA architecture. To date, there has been no study to understand the comparative advantages of the two approaches. In this paper, we present a study that explores the comparative advantages between two proposed custom architectures and a PIM overlay running on a commodity FPGA. We created PiCaSO, a Processor in/near Memory Scalable and Fast Overlay architecture as a representative PIM overlay. The results of this study show that the PiCaSO overlay achieves up to 80% of the peak throughput of the custom designs with 2.56× shorter latency and 25% – 43% better BRAM memory utilization efficiency. We then show how several key features of the PiCaSO overlay can be integrated into the custom PIM designs to further improve their throughput by 18%, latency by 19.5%, and memory efficiency by 6.2%.more » « less
-
Applications are migrating en masse to the cloud, while accelerators such as GPUs, TPUs, and FPGAs proliferate in the wake of Moore's Law. These trends are in conflict: cloud applications run on virtual platforms, but existing virtualization techniques have not provided production-ready solutions for accelerators. As a result, cloud providers expose accelerators by dedicating physical devices to individual guests. Multi-tenancy and consolidation are lost as a consequence. We present AvA, which addresses limitations of existing virtualization techniques with automated construction of hypervisor-managed virtual accelerator stacks. AvA combines a DSL for describing APIs and sharing policies, device-agnostic runtime components, and a compiler to generate accelerator-specific components such as guest libraries and API servers. AvA uses Hypervisor Interposed Remote Acceleration (HIRA), a new technique to enable hypervisor-enforcement of sharing policies from the specification. We use AvA to virtualize nine accelerators and eleven framework APIs, including six for which no virtualization support has been previously explored. AvA provides near-native performance and can enforce sharing policies that are not possible with current techniques, with orders of magnitude less developer effort than required for hand-built virtualization support.more » « less
-
In the past decade, GPUs have become an important resource for compute-intensive, general-purpose GPU applications such as machine learning, big data analysis, and large-scale simulations. In the future, with the explosion of machine learning and big data, application demands will keep increasing, resulting in more data and computation being pushed to GPUs. However, due to the slowing of Moore’s Law and rising manufacturing costs, it is becoming more and more challenging to add compute resources into a single GPU device to improve its throughput. As a result, spreading work across multiple GPUs is popular in data-centric and scientific applications. For example, Facebook uses 8 GPUs per server in their recent machine learning platform. However, research infrastructure has not kept pace with this trend: most GPU hardware simulators, including gem5, only support a single GPU. Thus, it is hard to study interference between GPUs, communication between GPUs, or work scheduling across GPUs. Our research group has been working to address this shortcoming by adding multi-GPU support to gem5. Here, we discuss the changes that were needed, which included updating the emulated driver, GPU components, and coherence protocol.more » « less
-
Neuromorphic computing is an emerging field with the potential to offer performance and energy-efficiency gains over traditional machine learning approaches. Most neuromorphic hardware, however, has been designed with limited concerns to the problem of integrating it with other components in a heterogeneous System-on-Chip (SoC). Building on a state-of-the-art reconfigurable neuromorphic architecture, we present the design of a neuromorphic hardware accelerator equipped with a programmable interface that simplifies both the integration into an SoC and communication with the processor present on the SoC. To optimize the allocation of on-chip resources, we develop an optimizer to restructure existing neuromorphic models for a given hardware architecture, and perform design-space exploration to find highly efficient implementations. We conduct experiments with various FPGA-based prototypes of many-accelerator SoCs, where Linux-based applications running on a RISC-V processor invoke Pareto-optimal implementations of our accelerator alongside third-party accelerators. These experiments demonstrate that our neuromorphic hardware, which is up to 89× faster and 170× more energy efficient after applying our optimizer, can be used in synergy with other accelerators for different application purposes.more » « less
An official website of the United States government

