NSF PAR Search | NSF Public Access Repository


GPEmu: A GPU Emulator for Faster and Cheaper Prototyping and Evaluation of Deep Learning System Research

https://doi.org/10.14778/3725688.3725716

Wang, Meng; Waldspurger, Gus; Ananda, Naufal; Huang, Yuyang; Wiharja, Kemas; Bent, John; Sundararaman, Swaminathan; Chidambaram, Vijay; Gunawi, Haryadi S (February 2025, Proceedings of the VLDB Endowment)

Deep learning (DL) system research is often impeded by the limited availability and high cost of GPUs. In this paper, we introduce GPEmu, a GPU emulator for faster and cheaper prototyping and evaluation of deep learning system research without using real GPUs. GPEmu comes with four novel features: time emulation, memory emulation, distributed system support, and sharing support. We support over 30 DL models and 6 GPU models, the largest scale to date. We demonstrate the power of GPEmu by successfully reproducing the main results of nine recent publications and easily prototyping three new micro-optimizations.
Free, publicly-accessible full text available February 1, 2026
Lowering the Pre-training Tax for Gradient-based Subset Training: A Lightweight Distributed Pre-Training Toolkit

Ro, Yeonju; Wang, Zhangyang; Chidambaram, Vijay; Akella, Aditya (July 2023, International Conference on Machine Learning (ICML))

Full Text Available
DINOMO: An Elastic, Scalable, High-Performance Key-Value Store for Disaggregated Persistent Memory

https://doi.org/10.14778/3565838.3565854

Lee, Sekwon; Ponnapalli, Soujanya; Singhal, Sharad; Aguilera, Marcos; Keeton, Kimberly; Chidambaram, Vijay (August 2023, VLDB)

We present Dinomo, a novel key-value store for disaggregated persistent memory (DPM). Dinomo is the first key-value store for DPM that simultaneously achieves high common-case performance, scalability, and lightweight online reconfiguration. We observe that previously proposed key-value stores for DPM had architectural limitations that prevent them from achieving all three goals simultaneously. Dinomo uses a novel combination of techniques such as ownership partitioning, disaggregated adaptive caching, selective replication, and lock-free and log-free indexing to achieve these goals. Compared to a state-of-the-art DPM key-value store, Dinomo achieves at least 3.8× better throughput at scale on various workloads and higher scalability, while providing fast reconfiguration.
Full Text Available
Chipmunk: Investigating Crash-Consistency in Persistent-Memory File Systems

https://doi.org/10.1145/3552326.3567498

LeBlanc, Hayley; Pailoor, Shankara; K R E, Om Saran; Dillig, Isil; Bornholt, James; Chidambaram, Vijay (May 2023, ACM)

We present Chipmunk, a new framework to test persistent-memory (PM) file systems for crash-consistency bugs. Using Chipmunk, we discovered 23 new bugs across five PM file systems; most bugs have been confirmed and fixed by developers. The discovered bugs have serious consequences, including making the file system un-mountable or breaking rename atomicity. We present a detailed study of the bugs found using Chipmunk and discuss important lessons learned for designing and testing PM file systems.
Full Text Available
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Shah, Aashaka; Chidambaram, Vijay; Cowan, Meghan; Maleki, Saeed; Musuvathi, Madan; Mytkowicz, Todd; Nelson, Jacob; Saarikivi, Olli; Singh, Rachee (April 2023, USENIX)

Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as ALLTOALL and ALLREDUCE, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the Nvidia Collective Communication Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11%–2.3x for different batch sizes.
Full Text Available
CheckFreq: Frequent, Fine-Grained DNN Checkpointing

Mohan, Jayashree; Phanishayee, Amar; Chidambaram, Vijay (February 2021, Proceedings of the USENIX Conference on File and Storage Technologies (FAST 2021))
Aguilera, Marcos; Yadgar, Gala (Ed.)
Training Deep Neural Networks (DNNs) is a resource-hungry and time-consuming task. During training, the model performs computation at the GPU to learn weights, repeatedly, over several epochs. The learned weights reside in GPU memory, and are occasionally checkpointed (written to persistent storage) for fault-tolerance. Traditionally, model parameters are checkpointed at epoch boundaries; for modern deep networks, an epoch runs for several hours. An interruption to the training job due to preemption, node failure, or process failure therefore results in the loss of several hours' worth of GPU work on recovery. We present CheckFreq, an automatic, fine-grained checkpointing framework that (1) algorithmically determines the checkpointing frequency at the granularity of iterations using systematic online profiling, (2) dynamically tunes checkpointing frequency at runtime to bound the checkpointing overhead using adaptive rate tuning, (3) maintains the training data invariant of using each item in the dataset exactly once per epoch by checkpointing data loader state using a light-weight resumable iterator, and (4) carefully pipelines checkpointing with computation to reduce the checkpoint cost by introducing two-phase checkpointing. Our experiments on a variety of models, storage backends, and GPU generations show that CheckFreq can reduce the recovery time from hours to seconds while bounding the runtime overhead within 3.5%.
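The resumable-iterator idea in point (3) can be illustrated with a minimal sketch. This is not CheckFreq's implementation: the class name `ResumableIterator`, its `state_dict`/`load_state_dict` methods, and the seeded per-epoch shuffle are all assumptions made for illustration. The sketch only shows how saving the epoch number and position lets a restarted job finish an epoch while still visiting each item exactly once.

```python
import random


class ResumableIterator:
    """Hypothetical sketch: iterate over item indices in a seeded per-epoch
    shuffle, with state that can be checkpointed and restored mid-epoch."""

    def __init__(self, num_items, seed=0):
        self.num_items = num_items
        self.seed = seed
        self.epoch = 0
        self.position = 0  # number of items already handed out this epoch

    def _order(self):
        # The shuffle depends only on (seed, epoch), so it can be rebuilt
        # after a restart. Recomputed on every call for simplicity.
        order = list(range(self.num_items))
        random.Random(self.seed + self.epoch).shuffle(order)
        return order

    def __iter__(self):
        return self

    def __next__(self):
        if self.position >= self.num_items:
            # End of epoch: advance to the next epoch and stop iteration.
            self.epoch += 1
            self.position = 0
            raise StopIteration
        item = self._order()[self.position]
        self.position += 1
        return item

    def state_dict(self):
        # Small enough to be written alongside a model checkpoint.
        return {"seed": self.seed, "epoch": self.epoch, "position": self.position}

    def load_state_dict(self, state):
        self.seed = state["seed"]
        self.epoch = state["epoch"]
        self.position = state["position"]
```

A restarted job would construct a fresh iterator, call `load_state_dict` with the saved state, and continue from the item that follows the last one consumed before the interruption.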
Full Text Available
RainBlock: Faster Transaction Processing in Public Blockchains

Ponnapalli, Soujanya; Shah, Aashaka; Banerjee, Souvik; Malkhi, Dahli; Chidambaram, Vijay; Wei, Michael (July 2021, Proceedings of the USENIX Conference)
Calciu, Irina; Kuenning, Geoff (Ed.)
We present RAINBLOCK, a public blockchain that achieves high transaction throughput without modifying the proof-of-work consensus. The chief insight behind RAINBLOCK is that while consensus controls the rate at which new blocks are added to the blockchain, the number of transactions in each block is limited by I/O bottlenecks. Public blockchains like Ethereum keep the number of transactions in each block low so that all participating servers (miners) have enough time to process a block before the next block is created. By removing the I/O bottlenecks in transaction processing, RAINBLOCK allows miners to process more transactions in the same amount of time. RAINBLOCK makes two novel contributions: the RAINBLOCK architecture that removes I/O from the critical path of processing transactions (txs), and the distributed, multiversioned DSM-TREE data structure that stores the system state efficiently. We evaluate RAINBLOCK using workloads based on public Ethereum traces (including smart contracts). We show that a single RAINBLOCK miner processes 27.4K txs per second (27× higher than a single Ethereum miner). In a geo-distributed setting with four regions spread across three continents, RAINBLOCK miners process 20K txs per second.
Full Text Available
Analyzing and mitigating data stalls in DNN training

https://doi.org/10.14778/3446095.3446100

Mohan, Jayashree; Phanishayee, Amar; Raniwala, Ashish; Chidambaram, Vijay (January 2021, Proceedings of the VLDB Endowment)
Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, GPU generation, etc., on servers that are a part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configs show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data loading library, DNN training time is reduced significantly (by as much as 5X on a single server).
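The abstract does not spell out the differential technique, so the following is only one plausible reading, sketched under stated assumptions: an epoch is timed three ways (compute only with synthetic data, with the dataset fully cached in memory, and with a cold cache), and the stalls are taken as the differences between those runs. The function name `data_stall_breakdown` and its parameters are hypothetical and are not DS-Analyzer's API.

```python
def data_stall_breakdown(compute_only_s, cached_s, cold_s):
    """Hypothetical sketch of a differential stall estimate.

    compute_only_s: epoch time with synthetic data (no fetch, no pre-processing)
    cached_s:       epoch time with the dataset fully cached in memory
    cold_s:         epoch time when data must be fetched from storage
    All three are assumed to be measured by the caller, in seconds.
    """
    # Extra time over pure compute when data is cached: pre-processing stall.
    prep_stall = max(0.0, cached_s - compute_only_s)
    # Extra time over the cached run when reading from storage: fetch stall.
    fetch_stall = max(0.0, cold_s - cached_s)
    return {
        "prep_stall_s": prep_stall,
        "fetch_stall_s": fetch_stall,
        # Share of the cold-cache epoch spent waiting on data.
        "stall_fraction": (prep_stall + fetch_stall) / cold_s,
    }


# Illustrative numbers only, not measurements from the paper.
print(data_stall_breakdown(100.0, 130.0, 200.0))
```

With these made-up timings, half of the cold-cache epoch would be attributed to data stalls, split into 30 s of pre-processing stall and 70 s of fetch stall.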
Full Text Available
SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores

Conway, Alexander; Gupta, Abhishek; Chidambaram, Vijay; Farach-Colton, Martin; Spillane, Richard; Tai, Amy; Johnson, Rob (January 2020, 2020 USENIX Annual Technical Conference)

Modern NVMe solid state drives offer significantly higher bandwidth and lower latency than prior storage devices. Current key-value stores struggle to fully utilize the bandwidth of such devices. This paper presents SplinterDB, a new key-value store explicitly designed for NVMe solid state drives. SplinterDB is designed around a novel data structure (the STBε-tree) that exposes I/O and CPU concurrency and reduces write amplification without sacrificing query performance. STBε-tree combines ideas from log-structured merge trees and Bε-trees to reduce write amplification and CPU costs of compaction. The SplinterDB memtable and cache are designed to be highly concurrent and to reduce cache misses. We evaluate SplinterDB on a number of micro- and macro-benchmarks, and show that SplinterDB outperforms RocksDB, a state-of-the-art key-value store, by a factor of 6–10x on insertions and 2–2.6x on point queries, while matching RocksDB on small range queries. Furthermore, SplinterDB reduces write amplification by 2x compared to RocksDB.
Full Text Available
SplinterDB: Closing the Bandwidth Gap for NVMe Key-Value Stores

Conway, Alexander; Gupta, Abhishek; Chidambaram, Vijay; Farach-Colton, Martin; Spillane, Richard; Tai, Amy; Johnson, Rob (January 2020, Proceedings of the USENIX Conference)

Modern NVMe solid state drives offer significantly higher bandwidth and lower latency than prior storage devices. Current key-value stores struggle to fully utilize the bandwidth of such devices. This paper presents SplinterDB, a new key-value store explicitly designed for NVMe solid-state drives. SplinterDB is designed around a novel data structure (the STBε-tree) that exposes I/O and CPU concurrency and reduces write amplification without sacrificing query performance. STBε-tree combines ideas from log-structured merge trees and Bε-trees to reduce write amplification and CPU costs of compaction. The SplinterDB memtable and cache are designed to be highly concurrent and to reduce cache misses. We evaluate SplinterDB on a number of micro- and macro-benchmarks, and show that SplinterDB outperforms RocksDB, a state-of-the-art key-value store, by a factor of 6–10x on insertions and 2–2.6x on point queries, while matching RocksDB on small range queries. Furthermore, SplinterDB reduces write amplification by 2x compared to RocksDB.
Full Text Available
