This content will become publicly available on December 31, 2025

Title: Evolving to Find Optimizations Humans Miss: Using Evolutionary Computation to Improve GPU Code for Bioinformatics Applications
GPUs are used in many settings to accelerate large-scale scientific computation, including simulation, computational biology, and molecular dynamics. However, optimizing codes to run efficiently on GPUs requires developers to have both a detailed understanding of the application logic and significant knowledge of parallel programming and GPU architectures. This paper shows that an automated GPU program optimization tool, GEVO, can leverage evolutionary computation to find code edits that reduce the runtime of three important applications: multiple sequence alignment, agent-based simulation, and molecular dynamics codes, by 28.9%, 29%, and 17.8%, respectively. The paper presents an in-depth analysis of the discovered optimizations, revealing that (1) several of the most important optimizations involve significant epistasis, (2) the primary sources of improvement are application-specific, and (3) many of the optimizations generalize across GPU architectures. In general, the discovered optimizations are not straightforward even for a human GPU expert, showcasing the potential of automated program optimization tools to both reduce the optimization burden for human domain experts and provide new insights for GPU experts.
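The flavor of search involved is worth making concrete. Below is a minimal C++ sketch of a GEVO-style evolutionary loop, not the tool's actual implementation: Edit, fitness, and still_correct are invented placeholders for what would really be compiler-level mutations, a timed GPU run, and an output check against a reference.

```cpp
// Minimal sketch of an evolutionary search over code-edit patches.
// GEVO's real loop mutates a compiler-level representation of the CUDA
// kernels and measures fitness by timing the patched program on a GPU.
#include <algorithm>
#include <random>
#include <vector>

struct Edit { int kind; int loc; };            // hypothetical encoding of one edit
using Patch = std::vector<Edit>;

double fitness(const Patch& p) { return 100.0 + 0.1 * p.size(); }  // placeholder "runtime"
bool still_correct(const Patch&) { return true; }                  // placeholder test oracle

int main() {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> kind(0, 2), loc(0, 999);
    std::vector<Patch> pop(16);                // generation 0: all-empty patches
    for (int gen = 0; gen < 100; ++gen) {
        std::vector<Patch> next = pop;         // elitism: parents survive
        for (Patch child : pop) {              // mutation: append one random edit
            child.push_back({kind(rng), loc(rng)});
            if (still_correct(child)) next.push_back(std::move(child));
        }
        std::sort(next.begin(), next.end(),
                  [](const Patch& a, const Patch& b) {
                      return fitness(a) < fitness(b);  // lower runtime wins
                  });
        next.resize(pop.size());               // truncation selection
        pop = std::move(next);
    }
    return 0;                                  // pop.front() holds the best patch
}
```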
Award ID(s): 2239518
PAR ID: 10565959
Publisher / Repository: ACM
Journal Name: ACM Transactions on Evolutionary Learning and Optimization
Volume: 4
Issue: 4
ISSN: 2688-3007
Page Range / eLocation ID: 1 to 29
Sponsoring Org: National Science Foundation
More Like this
  1. The many-body correlation function is a fundamental computation kernel in modern physics computing applications, e.g., hadron contractions in lattice quantum chromodynamics (QCD). This kernel is both computation and memory intensive, involving a series of tensor contractions, and thus usually runs on accelerators like GPUs. Existing optimizations of many-body correlation mainly focus on individual tensor contractions (e.g., cuBLAS libraries and others). In contrast, this work discovers a new optimization dimension for many-body correlation by exploring the optimization opportunities among tensor contractions. More specifically, it targets general GPU architectures (both NVIDIA and AMD) and optimizes the memory management of many-body correlation by exploiting a set of memory allocation and communication redundancy elimination opportunities: first, GPU memory allocation redundancy: an intermediate output frequently occurs as input in subsequent calculations; second, CPU-GPU communication redundancy: although all tensors are allocated on both CPU and GPU, many of them are used (and reused) on the GPU side only, so many CPU/GPU communications (like those in existing Unified Memory designs) are unnecessary; third, GPU oversubscription: limited GPU memory causes oversubscription issues, and existing memory management usually evicts data that is about to be reused, incurring extra CPU/GPU memory communications. Targeting these memory optimization opportunities, this article proposes MemHC, an optimized systematic GPU memory management framework that aims to accelerate the calculation of many-body correlation functions using a series of new memory reduction designs. These designs involve optimizations for GPU memory allocation, CPU/GPU memory movement, and GPU memory oversubscription, respectively. More specifically, first, MemHC employs duplication-aware management and lazy release of GPU memory back to the host for better data reusability. Second, it implements data reorganization and on-demand synchronization to eliminate redundant (or unnecessary) data transfers. Third, MemHC exploits an optimized Least Recently Used (LRU) eviction policy, called Pre-Protected LRU, to reduce evictions and improve memory hits. Additionally, MemHC is portable across platforms, including NVIDIA GPUs and AMD GPUs. The evaluation demonstrates that MemHC outperforms unified memory management by 2.18× to 10.73×. The proposed Pre-Protected LRU policy outperforms the original LRU policy by up to 1.36×.
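The abstract names Pre-Protected LRU without spelling out its mechanics; one plausible reading is a standard LRU whose victim scan skips entries known to be needed by an upcoming contraction. Here is a C++ sketch under that assumption; PreProtectedLRU, protect, and access are invented names, and integer keys stand in for GPU tensor buffers.

```cpp
// Hedged sketch of a "pre-protected" LRU cache: a plain LRU whose eviction
// scan skips entries flagged as near-reuse. If every resident entry is
// protected, the sketch simply over-fills rather than evicting.
#include <list>
#include <unordered_map>
#include <unordered_set>

class PreProtectedLRU {
    size_t capacity_;
    std::list<int> order_;                                     // front = most recent
    std::unordered_map<int, std::list<int>::iterator> pos_;
    std::unordered_set<int> protected_;                        // near-reuse keys
public:
    explicit PreProtectedLRU(size_t cap) : capacity_(cap) {}
    void protect(int k)   { protected_.insert(k); }
    void unprotect(int k) { protected_.erase(k); }

    void access(int k) {
        auto it = pos_.find(k);
        if (it != pos_.end()) {                                // hit: refresh recency
            order_.erase(it->second);
        } else if (pos_.size() >= capacity_) {                 // miss: evict the
            for (auto r = order_.rbegin(); r != order_.rend(); ++r)
                if (!protected_.count(*r)) {                   // oldest unprotected key
                    pos_.erase(*r);
                    order_.erase(std::next(r).base());
                    break;
                }
        }
        order_.push_front(k);
        pos_[k] = order_.begin();
    }
};

int main() {
    PreProtectedLRU cache(2);
    cache.access(1); cache.protect(1);   // 1 is needed by an upcoming contraction
    cache.access(2); cache.access(3);    // evicts 2, not the protected (older) 1
}
```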
  2. Ease of use and transparent access to elastic resources have attracted many applications away from traditional platforms toward serverless functions. Many of these applications, such as machine learning, could benefit significantly from GPU acceleration. Unfortunately, GPUs remain inaccessible from serverless functions in modern production settings. We present DGSF, a platform that transparently enables serverless functions to use GPUs through general-purpose APIs such as CUDA. DGSF solves provisioning and utilization challenges with disaggregation, serving the needs of a potentially large number of functions through virtual GPUs backed by a small pool of physical GPUs on dedicated servers. Disaggregation allows the provider to decouple GPU provisioning from other resources, and enables significant benefits through consolidation. We describe how DGSF solves GPU disaggregation challenges including supporting API transparency, hiding the latency of communication with remote GPUs, and load-balancing access to heavily shared GPUs. Evaluation of our prototype on six workloads shows that DGSF's API remoting optimizations can improve the runtime of a function by up to 50% relative to unoptimized DGSF. Such optimizations, which aggressively remove GPU runtime and object management latency from the critical path, can enable functions running over DGSF to have a lower end-to-end time than when running on a GPU natively. By enabling GPU sharing, DGSF can reduce function queueing latency by up to 53%. We use DGSF to augment AWS Lambda with GPU support, showing similar benefits.
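As a rough illustration of API remoting, here is a hedged C++ sketch of a client-side shim that forwards a cudaMalloc-shaped call to a remote GPU server. Op, Rpc, send_rpc, and remoteMalloc are all invented; DGSF's actual protocol batches and overlaps such calls to hide the remote-GPU latency the abstract mentions.

```cpp
// Hedged sketch of API remoting: the function links against a shim that
// serializes CUDA-style calls and ships them to a GPU server, instead of
// talking to a local driver.
#include <cstdint>
#include <cstring>
#include <vector>

enum class Op : uint8_t { Malloc, MemcpyH2D, Launch, Free };
struct Rpc { Op op; std::vector<uint8_t> payload; };

// Stand-in for the network round-trip; a real system would batch and
// pipeline these requests rather than block per call.
std::vector<uint8_t> send_rpc(const Rpc&) { return std::vector<uint8_t>(8, 0); }

// Shim with the same shape as cudaMalloc: the returned "device pointer"
// is an opaque handle that is only meaningful on the remote server.
int remoteMalloc(uint64_t* dev_handle, size_t bytes) {
    Rpc r{Op::Malloc, std::vector<uint8_t>(sizeof bytes)};
    std::memcpy(r.payload.data(), &bytes, sizeof bytes);
    std::vector<uint8_t> reply = send_rpc(r);
    std::memcpy(dev_handle, reply.data(), sizeof *dev_handle);
    return 0;
}

int main() {
    uint64_t handle = 0;
    remoteMalloc(&handle, 1 << 20);   // "allocates" 1 MiB on the remote GPU
}
```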
  3.
    Deterministic execution for GPUs is a desirable property, as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety-critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are initially isolated and later made visible in a deterministic order. Unlike GPUDet, this allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations.
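The ordering idea can be simulated on the host in a few lines. In this hedged C++ sketch, each "thread" has already logged its atomic updates in a private buffer, and a commit phase drains the buffers in thread-id order, so the floating-point additions round identically on every run. AtomicOp and commit are invented names; DAB itself realizes this in the GPU memory pipeline, not in software.

```cpp
// Hedged host-side simulation of deterministic atomic buffering.
#include <cstddef>
#include <vector>

struct AtomicOp { std::size_t addr; double val; };

// Commit phase: buffers are applied in thread-id order, fixing the
// order (and thus the rounding) of every atomic update.
void commit(std::vector<double>& mem,
            const std::vector<std::vector<AtomicOp>>& per_thread) {
    for (const auto& buf : per_thread)
        for (const AtomicOp& op : buf)
            mem[op.addr] += op.val;
}

int main() {
    std::vector<double> mem(4, 0.0);
    // "Execution" phase: each thread logged its atomics privately, so the
    // threads never raced on mem and could run fully in parallel.
    std::vector<std::vector<AtomicOp>> buffers = {
        {{0, 1e16}, {1, 1.0}},   // thread 0
        {{0, 1.0}},              // thread 1: its order vs. thread 0 is now fixed
    };
    commit(mem, buffers);
}
```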
  4. Existing deep neural network (DNN) frameworks optimize the computation graph of a DNN by applying graph transformations manually designed by human experts. This approach misses possible graph optimizations and is difficult to scale, as new DNN operators are introduced on a regular basis. We propose TASO, the first DNN computation graph optimizer that automatically generates graph substitutions. TASO takes as input a list of operator specifications and generates candidate substitutions using the given operators as basic building blocks. All generated substitutions are formally verified against the operator specifications using an automated theorem prover. To optimize a given DNN computation graph, TASO performs a cost-based backtracking search, applying the substitutions to find an optimized graph, which can be directly used by existing DNN frameworks. Our evaluation on five real-world DNN architectures shows that TASO outperforms existing DNN frameworks by up to 2.8×, while requiring significantly less human effort. For example, TensorFlow currently contains approximately 53,000 lines of manual optimization rules, while the operator specifications needed by TASO are only 1,400 lines of code.
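A compact sketch of a cost-based backtracking search of this flavor is shown below, assuming a best-first frontier with a relaxation factor alpha that tolerates mildly uphill rewrites before pruning. Graph, Rule, and optimize are invented stand-ins for TASO's real data structures, and the toy rule takes the place of a formally verified substitution.

```cpp
// Hedged sketch of cost-based search over graph substitutions.
#include <functional>
#include <queue>
#include <vector>

struct Graph { double cost; /* operators and edges elided */ };
using Rule = std::function<bool(const Graph&, Graph&)>;  // try one substitution

Graph optimize(Graph g, const std::vector<Rule>& rules, double alpha = 1.05) {
    auto worse = [](const Graph& a, const Graph& b) { return a.cost > b.cost; };
    std::priority_queue<Graph, std::vector<Graph>, decltype(worse)> frontier(worse);
    Graph best = g;
    frontier.push(g);
    while (!frontier.empty()) {
        Graph cur = frontier.top();
        frontier.pop();
        if (cur.cost < best.cost) best = cur;
        if (cur.cost > alpha * best.cost) continue;  // prune, but tolerate
        for (const Rule& rule : rules) {             // mildly uphill rewrites
            Graph next;
            if (rule(cur, next)) frontier.push(next);
        }
    }
    return best;
}

int main() {
    // Toy rule: halve the cost while it is above 1 (stand-in for a real,
    // formally verified substitution).
    std::vector<Rule> rules = {[](const Graph& g, Graph& out) {
        if (g.cost <= 1.0) return false;
        out = Graph{g.cost / 2.0};
        return true;
    }};
    Graph opt = optimize(Graph{16.0}, rules);  // opt.cost == 1.0
    (void)opt;
}
```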
  5. Chi-Wang Shu (Ed.)
    GPU computing is expected to play an integral part in all modern Exascale supercomputers. It is also expected that higher order Godunov schemes will make up a significant fraction of the application mix on such supercomputers. It is, therefore, very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity. Not every algorithm that is used in the space-time update of the solution of hyperbolic PDEs will take well to GPUs. However, we identify a small core of algorithms that take exceptionally well to GPU computing. Based on an analysis of available options, we identify weighted essentially non-oscillatory (WENO) algorithms for spatial reconstruction, along with arbitrary derivative (ADER) algorithms for time extension, followed by a corrector step, as the winning three-part algorithmic combination. Even when a winning subset of algorithms has been identified, it is not clear that they will port seamlessly to GPUs. The low data throughput between CPU and GPU, as well as the very small cache sizes on modern GPUs, implies that we have to think through all aspects of the task of porting an application to GPUs. For that reason, this paper identifies the techniques and tricks needed to make a successful port of this very useful class of higher order algorithms to GPUs. Application codes face a further challenge: the GPU results need to be practically indistinguishable from the CPU results in order for the legacy knowledge bases embedded in these application codes to be preserved during the port to GPUs. This requirement often makes a complete code rewrite impossible. For that reason, it is safest to use an approach based on OpenACC directives, so that most of the code remains intact (as long as it was originally well-written). This paper is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher order Godunov scheme to GPUs. We focus on three broad and high-impact areas where higher order Godunov schemes are used. The first area is computational fluid dynamics (CFD). The second is computational magnetohydrodynamics (MHD), which has an involution constraint that has to be mimetically preserved. The third is computational electrodynamics (CED), which has involution constraints and also extremely stiff source terms. Together, these three diverse uses of higher order Godunov methodology cover many of the most important application areas. In all three cases, we show that the optimal use of algorithms, techniques, and tricks, along with the use of OpenACC, yields superlative speedups on GPUs. As a bonus, we find a most remarkable and desirable result: some higher order schemes, with their larger operation counts per zone, show better speedup than lower order schemes on GPUs. In other words, the GPU is an optimal stratagem for overcoming the higher computational complexity of higher order schemes. Several avenues for future improvement have also been identified. A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs. It is found that GPUs offer a substantial performance benefit over a comparable number of CPUs, especially when all the methods designed in this paper are used.
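To make the directive-based porting style concrete, here is a minimal OpenACC-annotated C++ loop shaped like a one-dimensional flux-difference update. The kernel body is a placeholder, not any of the paper's actual WENO/ADER schemes, and the explicit data clauses are chosen to keep host-device traffic visible, which matters given the low CPU-GPU bandwidth the paper highlights.

```cpp
// Hedged sketch: a flux-difference update annotated for GPU offload.
// Without an OpenACC compiler the pragma is ignored and the loop runs
// on the CPU, which is exactly the "code stays intact" property.
#include <vector>

void update(double* u, const double* flux, double dt_over_dx, int n) {
    // Explicit data clauses make the CPU<->GPU transfers visible.
    #pragma acc parallel loop copy(u[0:n]) copyin(flux[0:n+1])
    for (int i = 0; i < n; ++i)
        u[i] -= dt_over_dx * (flux[i + 1] - flux[i]);
}

int main() {
    std::vector<double> u(1000, 1.0), flux(1001, 0.0);
    update(u.data(), flux.data(), 0.5, 1000);   // one conservative update step
}
```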