NSF PAR Search | NSF Public Access Repository

Modeling Speedup in Multi-OS Environments

https://doi.org/10.1109/TPDS.2021.3114984

Tauro, Brian R.; Liu, Conghao; Hale, Kyle C. (June 2022, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Isolating functions at the hardware limit with virtines

https://doi.org/10.1145/3492321.3519553

Wanninger, Nicholas C.; Bowden, Joshua J.; Shetty, Kirtankumar; Garg, Ayush; Hale, Kyle C. (March 2022, Proceedings of the 17th European Conference on Computer Systems (EuroSys 2022))

Full Text Available
Paths to OpenMP in the kernel

https://doi.org/10.1145/3458817.3476183

Ma, Jiacheng; Wang, Wenyi; Nelson, Aaron; Cuevas, Michael; Homerding, Brian; Liu, Conghao; Huang, Zhen; Campanoni, Simone; Hale, Kyle; Dinda, Peter (November 2021, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC '21))

OpenMP implementations make increasing demands on the kernel. We take the next step and consider bringing OpenMP into the kernel. Our vision is that the entire OpenMP application, run-time system, and a kernel framework are interwoven to become the kernel, allowing the OpenMP implementation to take full advantage of the hardware in a custom manner. We compare and contrast three approaches to achieving this goal. The first, runtime in kernel (RTK), ports the OpenMP runtime to the kernel, allowing any kernel code to use OpenMP pragmas. The second, process in kernel (PIK), adds a specialized process abstraction for running user-level OpenMP code within the kernel. The third, custom compilation for kernel (CCK), compiles OpenMP into a form that leverages the kernel framework without any intermediaries. We describe the design and implementation of these approaches, and evaluate them using NAS and other benchmarks.
Full Text Available
Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures

https://doi.org/10.1109/MASCOTS53633.2021.9614292

Nookala, Poornima; Dinda, Peter; Hale, Kyle C.; Chard, Kyle; Raicu, Ioan (November 2021, Proceedings of the 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '21))

Enabling efficient fine-grained task parallelism is a significant challenge for hardware platforms with increasingly many cores. Existing techniques do not scale to hundreds of threads due to the high cost of synchronization in concurrent data structures. To overcome these limitations, we present XQueue, a novel lock-less concurrent queuing system with relaxed ordering semantics that is geared towards realizing scalability up to hundreds of concurrent threads. We demonstrate the scalability of XQueue using microbenchmarks and show that XQueue can deliver concurrent operations with latencies as low as 110 cycles at scales of up to 192 cores (up to 6900× improvement compared to traditional synchronization mechanisms) across our diverse hardware, including x86, ARM, and Power9. The reduced latency allows XQueue to provide orders of magnitude (3300×) better throughput than existing techniques. To evaluate the real-world benefits of XQueue, we integrated XQueue with LLVM OpenMP and evaluated five unmodified benchmarks from the Barcelona OpenMP Task Suite (BOTS) as well as a graph traversal benchmark from the GAP benchmark suite. We compared the XQueue-enabled LLVM OpenMP implementation with the native LLVM and GNU OpenMP versions. Using fine-grained task workloads, XQueue can deliver 4× to 6× speedup compared to native GNU OpenMP and LLVM OpenMP in many cases, with speedups as high as 116× in some cases.
Full Text Available
Coalescent Computing

https://doi.org/10.1145/3476886.3477503

Hale, Kyle C. (August 2021, Proceedings of the ACM Asia-Pacific Workshop on Systems (APSys 2021))

Full Text Available
Task parallel assembly language for uncompromising parallelism

https://doi.org/10.1145/3453483.3460969

Rainey, Mike; Newton, Ryan R.; Hale, Kyle; Hardavellas, Nikos; Campanoni, Simone; Dinda, Peter; Acar, Umut A. (June 2021, PLDI 2021: Proceedings of the 42nd ACM SIGPLAN International Conference on Programming Language Design and Implementation)
Full Text Available
Playing Fetch with CAT: Composing Cache Partitioning and Prefetching for Task-based Query Processing

https://doi.org/10.1145/3465998.3466016

Zeng, Qitian; Hale, Kyle C.; Glavic, Boris (June 2021, Proceedings of the 17th International Workshop on Data Management on New Hardware (DaMoN 2021))
Software prefetching and hardware-based cache allocation techniques (CAT) have been successfully applied in main-memory database engines to fetch data into cache before it is needed and to partition a shared last-level cache (LLC) to prevent concurrent tasks from evicting each other's data. We investigate the interaction of these techniques and demonstrate that while a single prefetching strategy is sufficient, the combination of both techniques is only effective if the cache partitioning strategy adapts the partitioning based on the types of tasks currently sharing an LLC. We present a simple, yet effective, scheme that uses prefetching and adapts cache partition allocations dynamically.
Full Text Available
Modeling Speedup in Multi-OS Environments

https://doi.org/10.1109/MASCOTS.2019.00044

Tauro, Brian R.; Liu, Conghao; Hale, Kyle C. (October 2019, Proceedings of the 27th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS))

Full Text Available
Towards a Practical Ecosystem of Specialized OS Kernels

https://doi.org/10.1145/3322789.3328742

Liu, Conghao; Hale, Kyle C. (June 2019, Proceedings of the 9th International Workshop on Runtime and Operating Systems for Supercomputers)

Specialized operating systems have enjoyed a recent revival driven both by a pressing need to rethink the system software stack in several domains and by the convenience and flexibility that on-demand infrastructure and virtual execution environments offer. Several barriers exist which curtail the widespread adoption of such highly specialized systems, but perhaps the most consequential of them is that these systems are simply difficult to use. In this paper we discuss the challenges faced by specialized OSes, both for HPC and more broadly, and argue that what is needed to make them practically useful is a reasonable development and deployment model that will form the foundation for a kernel ecosystem that allows intrepid developers to discover, experiment with, contribute to, and write programs for available kernel frameworks while safely ignoring complexities such as provisioning, deployment, cross-compilation, and interface compatibility. We argue that such an ecosystem would allow more developers of highly tuned applications to reap the performance benefits of specialized kernels.
Full Text Available
An Evaluation of Asynchronous Software Events on Modern Hardware

https://doi.org/10.1109/MASCOTS.2018.00041

Hale, Kyle; Dinda, Peter (September 2018, 2018 IEEE 26th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS))

Runtimes and applications that rely heavily on asynchronous event notifications suffer when such notifications must traverse several layers of processing in software. Many of these layers necessarily exist in order to support a general-purpose, portable kernel architecture, but they introduce considerable overheads for demanding, high-performance parallel runtimes and applications. Other overheads can arise from a mismatched event programming or system call interface. Whatever the case, the average latency and variance in latency of commonly used software mechanisms for event notifications are abysmal compared to the capabilities of the hardware, which can exhibit orders of magnitude lower latency. We leverage the flexibility and freedom of the previously proposed Hybrid Runtime (HRT) model to explore the construction of low-latency, asynchronous software events uninhibited by interfaces and execution models commonly imposed by general-purpose OSes. We propose several mechanisms in a system we call Nemo, which employs kernel mode-only features to accelerate event notifications by up to 4,000 times, and we provide a detailed evaluation of our implementation using extensive microbenchmarks. We carry out our evaluation both on a modern x64 server and the Intel Xeon Phi. Finally, we propose a small addition to existing interrupt controllers (APICs) that could push the limit of asynchronous events closer to the latency of the hardware cache coherence network.
Full Text Available