NSF PAR Search | NSF Public Access Repository

Asynchronous Automata Processing on GPUs

https://doi.org/10.1145/3579453

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait (February 2023, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1) input-stream level, 2) automaton level, and 3) state level. Among these, only state-level parallelism is intrinsic to automata while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly, and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 evaluated applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup.
Full Text Available
Analyzing and Leveraging Decoupled L1 Caches in GPUs

https://doi.org/10.1109/HPCA51647.2021.00047

Ibrahim, Mohamed Assem; Kayiran, Onur; Eckert, Yasuko; Loh, Gabriel H.; Jog, Adwait (February 2021, IEEE International Symposium on High-Performance Computer Architecture (HPCA))
Full Text Available
Analyzing and Leveraging Shared L1 Caches in GPUs

https://doi.org/10.1145/3410463.3414623

Ibrahim, Mohamed Assem; Kayiran, Onur; Eckert, Yasuko; Loh, Gabriel H.; Jog, Adwait (September 2020, Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques)
Full Text Available
Why GPUs are Slow at Executing NFAs and How to Make them Faster

https://doi.org/10.1145/3373376.3378471

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait (March 2020, Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems)
Full Text Available
Analyzing and Leveraging Remote-Core Bandwidth for Enhanced Performance in GPUs

https://doi.org/10.1109/PACT.2019.00028

Ibrahim, Mohamed Assem; Liu, Hongyuan; Kayiran, Onur; Jog, Adwait (September 2019, 28th International Conference on Parallel Architectures and Compilation Techniques (PACT))

Full Text Available
Exploiting Latency and Error Tolerance of GPGPU Applications for an Energy-Efficient DRAM

https://doi.org/10.1109/DSN.2019.00046

Wang, Haonan; Jog, Adwait (June 2019, 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))

Full Text Available
Quantifying Data Locality in Dynamic Parallelism in GPUs

https://doi.org/10.1145/3287318

Tang, Xulong; Pattnaik, Ashutosh; Kayiran, Onur; Jog, Adwait; Kandemir, Mahmut Taylan; Das, Chita (December 2018, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Full Text Available
Architectural Support for Efficient Large-Scale Automata Processing

https://doi.org/10.1109/MICRO.2018.00078

Liu, Hongyuan; Ibrahim, Mohamed; Kayiran, Onur; Pai, Sreepathi; Jog, Adwait (October 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))

Full Text Available
MASK: Redesigning the GPU Memory Hierarchy to Support Multi-Application Concurrency

https://doi.org/10.1145/3173162.3173169

Ausavarungnirun, Rachata; Miller, Vance; Landgraf, Joshua; Ghose, Saugata; Gandhi, Jayneel; Jog, Adwait; Rossbach, Christopher J.; Mutlu, Onur (March 2018, Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '18))

Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, has made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern.
Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead.
Full Text Available
