-
In this paper, we explore the prospect of accelerating tree-based genetic programming (TGP) by way of modern field-programmable gate array (FPGA) devices, which is motivated by the fact that FPGAs can sometimes leverage larger amounts of data/function parallelism, as well as better energy efficiency, when compared to general-purpose CPU/GPU systems. In our preliminary study, we introduce a fixed-depth, tree-based architecture capable of evaluating type-consistent primitives that can be fully unrolled and pipelined. The current primitive constraints preclude arbitrary control structures, but they allow for entire programs to be evaluated every clock cycle. Using a variety of floating-point primitives and random programs, we compare to the recent TensorGP tool executing on a modern 8 nm GPU, and we show that our accelerator implemented on a 14 nm FPGA achieves an average speedup of 43×. When compared to the popular baseline tool DEAP executing across all cores of a 2-socket, 28-core (56-thread), 14 nm CPU server, our accelerator achieves an average speedup of 4,902×. Finally, when compared to the recent state-of-the-art tool Operon executing on the same 2-processor CPU system, our accelerator executes about 2.4× slower on average. Despite not achieving an average speedup over every tool tested, our single-FPGA accelerator is the fastest in several instances, and we describe five future extensions that could allow for a 32–144× speedup over our current design as well as allow for larger program depths/sizes. Overall, we estimate that a future version of our accelerator will constitute a state-of-the-art GP system for many applications.
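As a rough software model of the fixed-depth idea described above, the sketch below evaluates a full binary tree level by level, the way a fully unrolled pipeline would compute one tree level per stage. The primitive set, the heap-order node layout, and the `left` pass-through primitive (used to fit a shallow program into a deeper tree) are illustrative assumptions, not details taken from the paper.

```python
import operator

# Hypothetical primitive set; the paper's actual primitives are not listed here.
PRIMITIVES = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    "left": lambda a, b: a,  # pass-through so a shallow program fits a deeper tree
}


def evaluate_fixed_depth(ops, leaves):
    """Evaluate a full binary tree stored in heap order.

    ops    -- primitive names for the internal nodes, root first (2**d - 1 entries)
    leaves -- input values for the 2**d leaves, left to right
    """
    depth = len(leaves).bit_length() - 1
    if len(leaves) != 2 ** depth or len(ops) != len(leaves) - 1:
        raise ValueError("expected 2**d leaves and 2**d - 1 internal nodes")
    level = list(leaves)
    # Walk from the deepest internal level up to the root. In hardware, each
    # level would be one pipeline stage with all of its nodes computed at once.
    for d in range(depth - 1, -1, -1):
        first = 2 ** d - 1  # heap index of the first node on level d
        level = [
            PRIMITIVES[ops[first + i]](level[2 * i], level[2 * i + 1])
            for i in range(2 ** d)
        ]
    return level[0]


# (x + y) * (z - w) with x=1, y=2, z=5, w=3
result = evaluate_fixed_depth(["mul", "add", "sub"], [1.0, 2.0, 5.0, 3.0])
```

Because the tree shape is fixed and there is no control flow, every node does the same amount of work for every program, which is what makes a one-program-per-cycle pipeline plausible in this model.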
-
Genetic programming (GP) is a general, broadly effective procedure by which computable solutions are constructed from high-level objectives. As with other machine-learning endeavors, one continual trend for GP is to exploit ever-larger amounts of parallelism. In this paper, we explore the possibility of accelerating GP by way of modern field-programmable gate arrays (FPGAs), which is motivated by the fact that FPGAs can sometimes leverage larger amounts of both function and data parallelism (common characteristics of GP) when compared to CPUs and GPUs. As a first step towards more general acceleration, we present a preliminary accelerator for the evaluation phase of "tree-based GP" (the original, and still popular, flavor of GP), for which the FPGA dynamically compiles programs of varying shapes and sizes onto a reconfigurable function tree pipeline. Overall, when compared to a recent open-source GPU solution implemented on a modern 8 nm process node, our accelerator implemented on an older 20 nm FPGA achieves an average speedup of 9.7×. Although our accelerator is 7.9× slower than most examples of a state-of-the-art CPU solution implemented on a recent 7 nm process node, we describe future extensions that can make FPGA acceleration provide attractive Pareto-optimal tradeoffs.
-
In this article, we introduce a parallelizing approximation-discovery framework, PANDORA, for automatically discovering application- and architecture-specialized approximations of provided code. PANDORA complements existing compilers and runtime optimizers by generating approximations with a range of Pareto-optimal tradeoffs between performance and error, which enables adaptation to different inputs, different user preferences, and different runtime conditions (e.g., battery life). We demonstrate that PANDORA can create parallel approximations of inherently sequential code by discovering alternative implementations that eliminate loop-carried dependencies. For a variety of functions with loop-carried dependencies, PANDORA generates approximations that achieve speedups ranging from 2.3x to 81x, with acceptable error for many usage scenarios. We also demonstrate PANDORA’s architecture-specialized approximations via FPGA experiments, and highlight PANDORA’s discovery capabilities by removing loop-carried dependencies from a recurrence relation with no known closed-form solution.
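To make "eliminating a loop-carried dependency by approximation" concrete, here is a small hand-written illustration. It is not output from PANDORA and not the method the article describes; the recurrence y[i] = a*y[i-1] + x[i], the function names, and the `taps` parameter are all assumptions chosen for the example. The exact loop needs the previous output, while the approximation computes each output from the last `taps` inputs only, so every output can be computed independently.

```python
def exact_recurrence(x, a):
    """Sequential baseline: each y[i] depends on y[i-1] (loop-carried)."""
    y, prev = [], 0.0
    for xi in x:
        prev = a * prev + xi
        y.append(prev)
    return y


def windowed_approximation(x, a, taps):
    """Approximate y[i] as a weighted sum of the last `taps` inputs.

    No output depends on another output, so the outer loop is trivially
    parallel. For |a| < 1 the dropped terms shrink geometrically.
    """
    weights = [a ** k for k in range(taps)]
    return [
        sum(weights[k] * x[i - k] for k in range(min(taps, i + 1)))
        for i in range(len(x))
    ]


x = [1.0] * 50
exact = exact_recurrence(x, 0.5)
approx = windowed_approximation(x, 0.5, taps=8)
max_error = max(abs(p - q) for p, q in zip(exact, approx))
```

For |a| < 1 and inputs bounded by M, the truncation error is at most M·|a|^taps / (1 − |a|), so `taps` acts as the knob that trades error for work in this toy setting.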
-
In this paper, we introduce PANDORA, a framework that complements existing parallelizing compilers by automatically discovering application- and architecture-specialized approximations. We demonstrate that PANDORA creates approximations that extract massive amounts of parallelism from inherently sequential code by eliminating loop-carried dependencies, a long-time goal of the compiler research community. Compared to exact parallel baselines, preliminary results show speedups ranging from 2.3x to 81x with acceptable error for many usage scenarios.
-
FPGAs commonly have significantly lower clock frequencies than many microprocessors and GPUs, due largely to propagation delays incurred by the reconfigurable interconnect. The Stratix 10 HyperFlex architecture reduces this problem by embedding numerous registers throughout the routing resources. However, such Hyper-Registers do not support back-pressure (i.e., pipeline stalls) that is commonly used in FPGA pipelines. In this paper, we present and evaluate pipeline transformations using absorption FIFOs, which avoid back-pressure limitations to enable numerous pipelines to benefit from HyperFlex, while also eliminating potentially expensive stall penalties incurred by existing techniques. We demonstrate that these transformations enable significant clock improvements not only on Stratix 10 but also on devices without HyperFlex, potentially making absorption FIFOs a better high-frequency strategy for any FPGA. In addition, we introduce optimizations that yield additional performance improvements by reducing stall penalties that can increase linearly with pipeline depth when restarting after a stall.
-
Emerging FPGA systems are providing higher external memory bandwidth to compete with GPU performance. However, because FPGAs often achieve parallelism through deep pipelines, traditional FPGA design strategies do not necessarily scale well to large numbers of replicated pipelines that can take advantage of higher bandwidth. We show that sliding-window applications, an important subset of digital signal processing, demonstrate this scalability problem. We introduce a window generator architecture that enables replication to over 330 GB/s, which is an 8.7x improvement over previous work. We evaluate the window generator on the Intel Broadwell+Arria10 system for 2D convolution and show that for traditional convolution (one filter per image), our approach outperforms a 12-core Xeon Broadwell E5 by 81x and a high-end Nvidia P6000 GPU by an order of magnitude for most input sizes, while improving energy by 15.7x. For convolutional neural nets (CNNs), we show that although the GPU and Xeon typically outperform existing FPGA systems, projected performances of the window generator running on FPGAs with sufficient bandwidth can outperform high-end GPUs for many common CNN parameters.
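For readers unfamiliar with sliding-window processing, the sketch below shows the basic pattern in software: rows arrive one at a time, the most recent k rows are buffered, and every k×k window is handed to a consumer (here, a 2D filter). This is only a minimal functional model under assumed names (`sliding_windows`, `filter2d_valid`); it does not reproduce the replicated window generator architecture the paper proposes.

```python
from collections import deque


def sliding_windows(image, k):
    """Yield (row, col, window) for every k x k window of a row-major image."""
    width = len(image[0])
    rows = deque(maxlen=k)  # software stand-in for k line buffers
    for r, row in enumerate(image):
        rows.append(row)
        if len(rows) < k:
            continue  # not enough rows buffered yet
        for c in range(width - k + 1):
            yield r - k + 1, c, [buf[c:c + k] for buf in rows]


def filter2d_valid(image, kernel):
    """Apply a k x k kernel to every window ('valid' region, no kernel flip)."""
    k = len(kernel)
    out_w = len(image[0]) - k + 1
    out = [[0.0] * out_w for _ in range(len(image) - k + 1)]
    for r, c, win in sliding_windows(image, k):
        out[r][c] = sum(
            win[i][j] * kernel[i][j] for i in range(k) for j in range(k)
        )
    return out


image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
filtered = filter2d_valid(image, kernel)
```

Each window is independent of the others, which is why the windows can in principle be spread across many replicated pipelines when enough memory bandwidth is available.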