NSF PAR Search | NSF Public Access Repository

Certifiably Byzantine-Robust Federated Conformal Prediction

Kang, Mintong; Lin, Zhen; Sun, Jimeng; Xiao, Cao; Li, Bo (July 2024, International Conference on Machine Learning (ICML 2024))

Full Text Available
Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration

Xiang, Lingfeng; Lin, Zhen; Deng, Weishu; Lu, Hui; Rao, Jia (July 2024, The USENIX Advanced Computing Systems Association)

Full Text Available
Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration

Xiang, Lingfeng; Lin, Zhen; Deng, Weishu; Lu, Hui; Rao, Jia; Yuan, Yifan; Wang, Ren (July 2024, The Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI'24))

With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses while using slow memory to accommodate data spilled from fast memory. While the existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is either present in fast memory or slow memory, but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop NOMAD, a new page management mechanism for Linux that features transactional page migration and page shadowing. NOMAD helps move page migration off the critical path of program execution and makes migration completely asynchronous. Evaluations with carefully crafted micro-benchmarks and real-world applications show that NOMAD is able to achieve up to 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach in Linux when under memory pressure. We also compare NOMAD with a recently proposed hardware-assisted, access sampling-based page migration approach and demonstrate NOMAD’s strengths and potential weaknesses in various scenarios.
Full Text Available
Nomad: Non-Exclusive Memory Tiering via Transactional Page Migration

Xiang, Lingfeng; Lin, Zhen; Deng, Weishu; Lu, Hui; Rao, Jia; Yuan, Yifan; Wang, Ren (July 2024, USENIX Association)

Full Text Available
NOMAD: Non-Exclusive Memory Tiering via Transactional Page Migration

Xiang, Lingfeng; Lin, Zhen; Deng, Weishu; Lu, Hui; Rao, Jia; Yuan, Yifan; Wang, Ren (July 2024, The 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI'24))

With the advent of byte-addressable memory devices, such as CXL memory, persistent memory, and storage-class memory, tiered memory systems have become a reality. Page migration is the de facto method within operating systems for managing tiered memory. It aims to bring hot data whenever possible into fast memory to optimize the performance of data accesses while using slow memory to accommodate data spilled from fast memory. While the existing research has demonstrated the effectiveness of various optimizations on page migration, it falls short of addressing a fundamental question: Is exclusive memory tiering, in which a page is either present in fast memory or slow memory, but not both simultaneously, the optimal strategy for tiered memory management? We demonstrate that page migration-based exclusive memory tiering suffers significant performance degradation when fast memory is under pressure. In this paper, we propose non-exclusive memory tiering, a page management strategy that retains a copy of pages recently promoted from slow memory to fast memory to mitigate memory thrashing. To enable non-exclusive memory tiering, we develop NOMAD, a new page management mechanism for Linux that features transactional page migration and page shadowing. NOMAD helps move page migration off the critical path of program execution and makes migration completely asynchronous. Evaluations with carefully crafted micro-benchmarks and real-world applications show that NOMAD is able to achieve up to 6x performance improvement over the state-of-the-art transparent page placement (TPP) approach in Linux when under memory pressure. We also compare NOMAD with a recently proposed hardware-assisted, access sampling-based page migration approach and demonstrate NOMAD’s strengths and potential weaknesses in various scenarios.
Full Text Available
P2CACHE: Exploring Tiered Memory for In-Kernel File Systems Caching

Lin, Zhen; Xiang, Lingfeng; Rao, Jia; Lu, Hui (July 2023, 2023 USENIX Annual Technical Conference (USENIX ATC 23))

Fast, byte-addressable persistent memory (PM) is becoming a reality in products. However, porting legacy kernel file systems to fully support PM requires substantial effort and encounters the challenge of bridging the gap between block-based access granularity and byte-addressability. Moreover, new PM-specific file systems remain far from production-ready, preventing them from being widely used. In this paper, we propose P2CACHE, a novel in-kernel caching mechanism to explore how legacy kernel file systems can effectively evolve in the face of fast, byte-addressable PM. P2CACHE exploits a read/write-distinguishable memory hierarchy upon a tiered memory system involving both PM and DRAM. P2CACHE leverages PM to serve all write requests for instant data durability and strong crash consistency while using DRAM to serve most read I/Os for high I/O performance. Further, P2CACHE employs a simple yet effective synchronization model between PM and DRAM by leveraging device-level parallelism. Our evaluation shows that P2CACHE can significantly increase the performance of legacy kernel file systems (e.g., by 200x for RocksDB on Ext4) while equipping them with instant data durability and strong crash consistency, similar to PM-specialized file systems.
Full Text Available
PyHealth: A Deep Learning Toolkit for Healthcare Applications

https://doi.org/10.1145/3580305.3599178

Yang, Chaoqi; Wu, Zhenbang; Jiang, Patrick; Lin, Zhen; Gao, Junyi; Danek, Benjamin P.; Sun, Jimeng (August 2023, KDD)

Full Text Available
FlashCube: Fast Provisioning of Serverless Functions with Streamlined Container Runtimes

https://doi.org/10.1145/3477113.3487273

Lin, Zhen; Hsieh, Kao-Feng; Sun, Yu; Shin, Seunghee; Lu, Hui (October 2021, The 11th Workshop on Programming Languages and Operating Systems (PLOS '21))

Fast provisioning of serverless functions is salient for serverless platforms. Though lightweight sandboxes (e.g., containers) enclose only necessary files and libraries, a cold launch still requires up to a few seconds to complete. Such slow provisioning prolongs the response time of serverless functions and negatively impacts users’ experiences. This paper analyzes the main reasons for such slowdown and introduces an effective containerization framework, FlashCube. Instead of building a container from scratch, FlashCube quickly and efficiently assembles it through a group of pre-created general container parts (e.g., namespaces, cgroups, and language runtimes). In addition, FlashCube’s user-space implementation makes it easily applicable to existing commodity serverless platforms. Our preliminary evaluation demonstrates that FlashCube can quickly provision containerized functions in less than 10 ms (vs. ∼400 ms using Docker containers).
Full Text Available
Exploring Memory Persistency Models for GPUs

https://doi.org/10.1109/PACT.2019.00032

Lin, Zhen; Alshboul, Mohammad; Solihin, Yan; Zhou, Huiyang (September 2019, 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT))
Full Text Available
Coordinated CTA Combination and Bandwidth Partitioning for GPU Concurrent Kernel Execution

https://doi.org/10.1145/3326124

Lin, Zhen; Dai, Hongwen; Mantor, Michael; Zhou, Huiyang (August 2019, ACM Transactions on Architecture and Code Optimization)
Contemporary GPUs support multiple kernels to run concurrently on the same streaming multiprocessors (SMs). Recent studies have demonstrated that such concurrent kernel execution (CKE) improves both resource utilization and computational throughput. Most of the prior works focus on partitioning the GPU resources at the cooperative thread array (CTA) level or the warp scheduler level to improve CKE. However, significant performance slowdown and unfairness are observed when latency-sensitive kernels co-run with bandwidth-intensive ones. The reason is that bandwidth over-subscription from bandwidth-intensive kernels leads to much aggravated memory access latency, which is highly detrimental to latency-sensitive kernels. Even among bandwidth-intensive kernels, more intensive kernels may unfairly consume much higher bandwidth than less-intensive ones. In this article, we first make a case that such problems cannot be sufficiently solved by managing CTA combinations alone and reveal the fundamental reasons. Then, we propose a coordinated approach for CTA combination and bandwidth partitioning. Our approach dynamically detects co-running kernels as latency sensitive or bandwidth intensive. As both the DRAM bandwidth and L2-to-L1 Network-on-Chip (NoC) bandwidth can be the critical resource, our approach partitions both bandwidth resources coordinately along with selecting proper CTA combinations. The key objective is to allocate more CTA resources for latency-sensitive kernels and more NoC/DRAM bandwidth resources to NoC-/DRAM-intensive kernels. We achieve it using a variation of dominant resource fairness (DRF). Compared with two state-of-the-art CKE optimization schemes, SMK [52] and WS [55], our approach improves the average harmonic speedup by 78% and 39%, respectively. Even compared to the best possible CTA combinations, which are obtained from an exhaustive search among all possible CTA combinations, our approach improves the harmonic speedup by up to 51% and 11% on average.
Full Text Available
