NSF PAR Search | NSF Public Access Repository

PDIP: Priority Directed Instruction Prefetching

https://doi.org/10.1145/3620665.3640394

Godala, Bhargav Reddy; Ramesh, Sankara Prasad; Pokam, Gilles A; Stark, Jared; Seznec, Andre; Tullsen, Dean; August, David I (April 2024, ACM)

Modern server workloads have large code footprints which are prone to front-end bottlenecks due to instruction cache capacity misses. Even with the aggressive fetch directed instruction prefetching (FDIP), implemented in modern processors, there are still significant front-end stalls due to I-Cache misses. A major portion of misses that occur on a BPU-predicted path are tolerated by FDIP without causing stalls. Prior work on instruction prefetching, however, has not been designed to work with FDIP processors. Their singular goal is reducing I-Cache misses, whereas FDIP processors are designed to tolerate them. Designing an instruction prefetcher that works in conjunction with FDIP requires identifying the fraction of cache misses that impact front-end performance (that are not fully hidden by FDIP), and only targeting them. In this paper, we propose Priority Directed Instruction Prefetching (PDIP), a novel instruction prefetching technique that complements FDIP by issuing prefetches for only targets where FDIP struggles - along the resteer path of front-end stall-causing events. PDIP identifies these targets and associates them with a trigger for future prefetch. At a 43.5KB budget, PDIP achieves up to 5.1% IPC speedup on important workloads such as cassandra and a geomean IPC speedup of 3.2% across 16 benchmarks.
Full Text Available
RPG²: Robust Profile-Guided Runtime Prefetch Generation

https://doi.org/10.1145/3620665.3640396

Zhang, Yuxuan; Sobotka, Nathan; Park, Soyoon; Jamilan, Saba; Khan, Tanvir Ahmed; Kasikci, Baris; Pokam, Gilles A; Litz, Heiner; Devietti, Joseph (April 2024, ACM)

Data cache prefetching is a well-established optimization to overcome the limits of the cache hierarchy and keep the processor pipeline fed with data. In principle, accurate, well-timed prefetches can sidestep the majority of cache misses and dramatically improve performance. In practice, however, it is challenging to identify which data to prefetch and when to do so. In particular, data can be easily requested too early, causing eviction of useful data from the cache, or requested too late, failing to avoid cache misses. Competition for limited off-chip memory bandwidth must also be balanced between prefetches and a program's regular "demand" accesses. Due to these challenges, prefetching can both help and hurt performance, and the outcome can depend on program structure, decisions about what to prefetch and when to do it, and, as we demonstrate in a series of experiments, program input, processor microarchitecture, and their interaction as well. To try to meet these challenges, we have designed the RPG² system for online prefetch injection and tuning. RPG² is a pure-software system that operates on running C/C++ programs, profiling them, injecting prefetch instructions, and then tuning those prefetches to maximize performance. Across dozens of inputs, we find that RPG² can provide speedups of up to 2.15×, comparable to the best profile-guided prefetching compilers, but can also respond when prefetching ends up being harmful and roll back to the original code - something that static compilers cannot. RPG² improves prefetching robustness by preserving its performance benefits, while avoiding slowdowns.
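Illustrative sketch (not taken from the paper): the tune-then-roll-back idea in the abstract above can be pictured as a small search loop. Everything here is an assumption made for illustration: the function name `tune_prefetch_distance`, the `measure` callback (which stands in for timing the running program at a given prefetch distance, with `None` meaning no injected prefetches), and the `min_gain` threshold. The actual system works on live binaries and may tune quite differently.

```python
from typing import Callable, Iterable, Optional


def tune_prefetch_distance(
    measure: Callable[[Optional[int]], float],
    candidates: Iterable[int],
    min_gain: float = 0.02,
) -> Optional[int]:
    """Pick a prefetch distance, or return None to roll back.

    `measure(d)` is a hypothetical callback returning the run time
    observed with prefetch distance `d`; `measure(None)` is the run
    time of the original code with no prefetches injected.
    """
    baseline = measure(None)
    best_distance: Optional[int] = None
    best_time = baseline

    # Try each candidate distance and remember the fastest one.
    for distance in candidates:
        elapsed = measure(distance)
        if elapsed < best_time:
            best_distance, best_time = distance, elapsed

    # Keep prefetching only if it beats the baseline by the required
    # margin; otherwise signal a rollback to the original code.
    if best_distance is not None and best_time <= baseline * (1.0 - min_gain):
        return best_distance
    return None


# Synthetic timings (made up): distance 4 helps, distance 16 hurts.
helpful = {None: 1.0, 1: 0.95, 4: 0.7, 16: 1.3}
print(tune_prefetch_distance(lambda d: helpful[d], [1, 4, 16]))

# Synthetic timings where every prefetch distance is slower than baseline.
harmful = {None: 1.0, 1: 1.1, 4: 1.2}
print(tune_prefetch_distance(lambda d: harmful[d], [1, 4]))
```

With the first set of timings the loop keeps distance 4; with the second it returns `None`, which models rolling back to the unmodified code when prefetching is harmful.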
Full Text Available
Online Code Layout Optimizations via OCOLOS

https://doi.org/10.1109/MM.2023.3274758

Zhang, Yuxuan; Khan, Tanvir Ahmed; Pokam, Gilles; Kasikci, Baris; Litz, Heiner; Devietti, Joseph (July 2023, IEEE Micro)

Full Text Available
EMISSARY: Enhanced Miss Awareness Replacement Policy for L2 Instruction Caching

https://doi.org/10.1145/3579371.3589097

Nagendra, Nayana Prasad; Godala, Bhargav Reddy; Chaturvedi, Ishita; Patel, Atmn; Kanev, Svilen; Moseley, Tipp; Stark, Jared; Pokam, Gilles A.; Campanoni, Simone; August, David I. (June 2023, Proceedings of the 50th International Symposium on Computer Architecture (ISCA))

Full Text Available
OCOLOS: Online COde Layout OptimizationS

https://doi.org/10.1109/MICRO56248.2022.00045

Zhang, Yuxuan; Khan, Tanvir Ahmed; Pokam, Gilles; Kasikci, Baris; Litz, Heiner; Devietti, Joseph (October 2022, International Symposium on Microarchitecture (MICRO))

Full Text Available
DMon: Efficient Detection and Correction of Data Locality Problems Using Selective Profiling

Khan, Tanvir Ahmed; Neal, Ian; Pokam, Gilles; Mozafari, Barzan; Kasikci, Baris (July 2021, Proceedings of the Symposium on Operating Systems Principles)

Full Text Available
Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications

Khan, Tanvir Ahmed; Zhang, Dexin; Sriraman, Akshitha; Devietti, Joseph; Pokam, Gilles; Litz, Heiner; Kasikci, Baris (June 2021, Proceedings of the 48th International Symposium on Computer Architecture)
Modern data center applications exhibit deep software stacks, resulting in large instruction footprints that frequently cause instruction cache misses degrading performance, cost, and energy efficiency. Although numerous mechanisms have been proposed to mitigate instruction cache misses, they still fall short of ideal cache behavior, and furthermore, introduce significant hardware overheads. We first investigate why existing I-cache miss mitigation mechanisms achieve sub-optimal performance for data center applications. We find that widely-studied instruction prefetchers fall short due to wasteful prefetch-induced cache line evictions that are not handled by existing replacement policies. Existing replacement policies are unable to mitigate wasteful evictions since they lack complete knowledge of a data center application’s complex program behavior. To make existing replacement policies aware of these eviction-inducing program behaviors, we propose Ripple, a novel software-only technique that profiles programs and uses program context to inform the underlying replacement policy about efficient replacement decisions. Ripple carefully identifies program contexts that lead to I-cache misses and sparingly injects “cache line eviction” instructions in suitable program locations at link time. We evaluate Ripple using nine popular data center applications and demonstrate that Ripple enables any replacement policy to achieve speedup that is closer to that of an ideal I-cache. Specifically, Ripple achieves an average performance improvement of 1.6% (up to 2.13%) over prior work due to a mean 19% (up to 28.6%) I-cache miss reduction.
Full Text Available
Ripple: Profile-Guided Instruction Cache Replacement for Data Center Applications

https://doi.org/10.1109/ISCA52012.2021.00063

Khan, Tanvir Ahmed; Zhang, Dexin; Sriraman, Akshitha; Devietti, Joseph; Pokam, Gilles; Litz, Heiner; Kasikci, Baris (June 2021, 48th Annual International Symposium on Computer Architecture (ISCA))
Full Text Available
Twig: Profile-Guided BTB Prefetching for Data Center Applications

https://doi.org/10.1145/3466752.3480124

Khan, Tanvir Ahmed; Brown, Nathan; Sriraman, Akshitha; Soundararajan, Niranjan K; Kumar, Rakesh; Devietti, Joseph; Subramoney, Sreenivas; Pokam, Gilles A; Litz, Heiner; Kasikci, Baris (October 2021, 54th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO ’21))
Full Text Available
CHiRP: Control-Flow History Reuse Prediction

Mirbagher-Ajorpaz, Samira; Pokam, Gilles A.; Garza, Elba; Jiménez, Daniel A. (October 2020, Proceedings of the 53rd ACM/IEEE International Symposium on Microarchitecture)
Full Text Available
