Title: Cache Programming for Scientific Loops Using Leases
Cache management is important for exploiting locality and reducing data movement. This paper studies a new type of programmable cache called the lease cache. By assigning leases, software exerts primary control over when and how long data stays in the cache. Previous work has shown an optimal solution for an ideal lease cache. This paper develops and evaluates a set of practical solutions for a physical lease cache emulated in FPGA, using the full suite of PolyBench benchmarks. Compared to automatic caching, lease programming can further reduce data movement by 10% to over 60% when the data size is 16 to 3,000 times the cache size, and the techniques in this paper realize over 80% of this potential. Moreover, lease programming can reduce data movement by another 0.8% to 20% after polyhedral locality optimization.
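To make the lease abstraction concrete, the following is a minimal Python sketch of lease-based caching. It is an illustration, not the paper's FPGA design: the class name LeaseCache, the logical clock that ticks once per access, and the lazy eviction of expired blocks are all simplifying assumptions made here.

class LeaseCache:
    # Illustrative model: leases are counted in logical time,
    # one tick per access; expired blocks are evicted lazily.
    def __init__(self, size):
        self.size = size        # capacity in blocks
        self.expiry = {}        # block -> time its lease ends
        self.clock = 0          # logical time = accesses so far
        self.misses = 0

    def access(self, block, lease):
        # Software passes a lease with each access; this is the
        # primary control the abstract refers to.
        self.clock += 1
        if self.expiry.get(block, -1) < self.clock:
            self.misses += 1    # absent, or lease already expired
            self._make_room()
        self.expiry[block] = self.clock + lease

    def _make_room(self):
        # Drop expired blocks when full; a real design also needs a
        # policy for over-allocation, which this sketch omits.
        if len(self.expiry) >= self.size:
            self.expiry = {b: t for b, t in self.expiry.items()
                           if t >= self.clock}

Under this model, software keeps a block by assigning it a lease at least as long as its forward reuse interval, and releases streaming data immediately with a lease of 0.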
Award ID(s):
2114285
PAR ID:
10432352
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Transactions on Architecture and Code Optimization
ISSN:
1544-3566
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Data movement is a common performance bottleneck, and its chief remedy is caching. Traditional cache management is transparent to the workload: the data to keep in cache is determined by recency information alone, while program information, i.e., future data reuses, is not communicated to the cache. This has changed in a new cache design named Lease Cache. Program control is passed to the lease cache by a compiler technique called Compiler Assigned Reference Lease (CARL). This technique collects the reuse interval distribution for each reference and uses it to compute and assign the lease value for each reference. In this article, we prove that CARL is optimal under certain statistical assumptions. Based on this optimality, we prove miss curve convexity, which is useful for optimizing shared caches, and sub-partitioning monotonicity, which simplifies lease compilation. We evaluate the potential using scientific kernels from PolyBench and show that compiler insertion of up to 34 leases in program code achieves similar or better cache utilization (in a variable-size cache) than the optimal fixed-size caching policy, which has been unattainable with automatic caching but is now within the potential of cache programming for all tested programs and most cache sizes.
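The core of CARL, as described above, is turning each reference's reuse-interval distribution into a lease. Below is a rough Python sketch of that calculation, assuming a simple greedy rule (grow whichever reference's lease buys the most extra hits per unit of extra cache occupancy) and a single-step candidate search; the function name, data layout, and stopping rule are simplifications, not the paper's exact algorithm.

def assign_leases(histograms, occupancy_budget):
    # histograms: ref -> {reuse_interval: count}
    # occupancy_budget: roughly cache size x trace length, i.e., the
    # total block-time the cache can hold.  Returns ref -> lease.
    leases = {r: 0 for r in histograms}
    used = 0.0
    while True:
        best = None  # (ratio, ref, new_lease, extra_cost)
        for r, h in histograms.items():
            L = leases[r]
            bigger = sorted(ri for ri in h if ri > L)
            if not bigger:
                continue
            cand = bigger[0]   # next step only: a simplification
            gain = sum(c for ri, c in h.items() if L < ri <= cand)
            cost = sum(c * (min(ri, cand) - min(ri, L))
                       for ri, c in h.items())
            if best is None or gain / cost > best[0]:
                best = (gain / cost, r, cand, cost)
        if best is None or used + best[3] > occupancy_budget:
            return leases
        _, r, cand, cost = best
        leases[r] = cand
        used += cost

The gain and cost terms reflect the lease semantics: a reuse with interval ri becomes a hit once the lease reaches ri, and each access holds the block in cache for min(ri, lease) time.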
  2. A modern Graphics Processing Unit (GPU) uses L1 Data (L1D) caches to reduce memory bandwidth requirements and latencies. However, the L1D cache can easily be overwhelmed by the many memory requests from GPU function units, which can bottleneck GPU performance. It has been shown that the performance of L1D caches is greatly reduced for many GPU applications because a large number of L1D cache lines are replaced before they are re-referenced. By examining the cache access patterns of these applications, we observe that L1D caches with low associativity have difficulty capturing data locality for GPU applications with diverse reuse patterns. These patterns result in frequent line replacements and low data reuse. To improve the efficiency of L1D caches, we design a Dynamic Line Protection scheme (DLP) that can both preserve valuable cache lines and increase cache line utilization. DLP collects data reuse information from the L1D cache. This information is used to predict protection distances for each memory instruction at runtime, which helps maintain a balance between exploiting data locality and over-protecting cache lines with long reuse distances. When all cache lines in a set are protected, redundant cache misses are bypassed to reduce contention for the set. The evaluation shows that our proposed solution improves cache hits while reducing cache traffic for cache-insufficient applications, achieving up to 137% and an average of 43% IPC improvement over the baseline.
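The protection-distance mechanic described above can be modeled in a few lines. This Python sketch is a functional illustration only; the class name, the per-access aging of counters, and the choice among unprotected victims are assumptions standing in for the paper's hardware policy.

class ProtectedSet:
    # One cache set: a line with a positive protection counter
    # cannot be chosen as a victim.
    def __init__(self, assoc):
        self.lines = [None] * assoc    # cached tags
        self.protect = [0] * assoc     # remaining protection distance

    def access(self, tag, predicted_pd):
        # every access to the set ages all protection counters
        self.protect = [max(0, p - 1) for p in self.protect]
        if tag in self.lines:          # hit: re-protect the line
            self.protect[self.lines.index(tag)] = predicted_pd
            return "hit"
        free = [i for i, p in enumerate(self.protect) if p == 0]
        if not free:                   # all lines protected:
            return "bypass"            # the request skips the L1D
        i = free[0]                    # pick any unprotected line
        self.lines[i] = tag
        self.protect[i] = predicted_pd
        return "miss"

The predicted_pd argument plays the role of DLP's per-instruction protection distance: large enough to cover a line's next reuse, small enough that lines with distant reuses do not pin the whole set.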
  3. The CPU cache has been used to bridge the processor-memory performance gap and enable high-performance computing. As the cache is of limited capacity, for maximum efficacy it should (1) avoid caching data that are less likely to be accessed and (2) identify and cache data that would otherwise cost a program multiple memory accesses to reach. Unfortunately, existing cache architectures are inadequate on both counts. First, to cost-effectively exploit spatial locality, they adopt a relatively large, fixed-size cache line as the caching unit, so much of the space in a cache line can be wasted when data locality is weak. Second, for ease of use, the cache is designed to be transparent to programs, which hinders programs from fully exploiting its performance potential. To address these problems, we propose a high-performance Software Defined Cache (SDC) architecture providing a simple and generic key-value abstraction that (1) allows caching data at a granularity smaller than a cache line and (2) lets programs explicitly insert, retrieve, and invalidate data in the cache with new instructions. By giving a program the ability to explicitly use the cache as a lookaside key-value buffer, SDC enables a much more efficient cache without disruptively changing the existing cache organization and without substantially increasing hardware cost. We have prototyped SDC on the gem5 simulator and evaluated it with various data index structures and workloads. Experimental results show that SDC can improve cache performance for the workloads by up to 5.3× over the current cache design.
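SDC's key-value abstraction, as described above, amounts to three operations. The Python stand-in below mirrors only the interface; in the real design these are new ISA instructions backed by cache hardware, and the class and method names here are illustrative.

class SoftwareDefinedCache:
    # Lookaside key-value buffer with sub-cache-line granularity.
    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.store = {}                # key -> value (bytes)

    def insert(self, key, value):
        # Conservative capacity check; the caller keeps memory as
        # the backing store, so a failed insert is not an error.
        if self.used + len(value) > self.capacity:
            return False
        self.used += len(value) - len(self.store.get(key, b""))
        self.store[key] = value
        return True

    def retrieve(self, key):
        return self.store.get(key)     # None means miss: go to memory

    def invalidate(self, key):
        v = self.store.pop(key, None)
        if v is not None:
            self.used -= len(v)

A data index is the motivating use: caching a (search key, record pointer) pair lets a lookup that would otherwise chase several pointers through memory finish with one retrieve.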
  4. Traditional workload analysis uses discrete time measured in data accesses. An example is the classic independent reference model (IRM). Effective solutions have been developed to model workloads with stochastic access patterns, but they incur a high cost for Zipfian workloads, which may contain millions of items, each accessed with a different frequency. This paper first presents a continuous-time model of locality for workloads with stochastic access patterns. It shows that two previous techniques, by Dan and Towsley and by Denning and Schwartz, can be interpreted as a single model using different discrete times. Using continuous time, it derives a closed-form solution for an item and a general solution that is a differentiable function. In addition, the paper presents an approximation technique that groups items into partitions. When evaluated on Zipfian workloads, it shows that a workload with millions of items can be approximated using a small number of partitions, and that the continuous-time model has greater accuracy and is faster to compute numerically. For the largest data size verifiable using trace generation and simulation, the new techniques reduce the time of locality analysis by six orders of magnitude.
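The flavor of the continuous-time model and the partition approximation can be illustrated with the standard working-set identity: under Poisson accesses, an item with rate lambda has been touched within a window T with probability 1 - exp(-lambda*T). The Python sketch below assumes that identity and a simple geometric grouping of Zipf ranks; the paper's closed-form solution and its exact partitioning may differ.

import math

def zipf_rates(n, s=1.0):
    # Zipfian access rates for n items, normalized to sum to 1.
    w = [1.0 / (k ** s) for k in range(1, n + 1)]
    z = sum(w)
    return [x / z for x in w]

def working_set_size(rates, T):
    # Exact (per-item) expected working-set size over window T.
    return sum(1.0 - math.exp(-r * T) for r in rates)

def partitioned_ws(rates, T, parts=64):
    # Approximate: group items into geometric rank buckets and treat
    # each bucket as one item with the bucket's average rate.
    n, total, lo = len(rates), 0.0, 0
    for p in range(parts):
        hi = min(n, max(lo + 1, round(n ** ((p + 1) / parts))))
        group = rates[lo:hi]
        if group:
            avg = sum(group) / len(group)
            total += len(group) * (1.0 - math.exp(-avg * T))
        lo = hi
    return total

rates = zipf_rates(100_000)
print(working_set_size(rates, 5_000))   # exact: 100,000 terms
print(partitioned_ws(rates, 5_000))     # approximate: 64 terms

The point mirrors the abstract: the per-item sum needs one term per item, while the partitioned sum needs one term per bucket, so millions of items collapse to a handful of partitions at modest cost in accuracy.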