NSF PAR Search | NSF Public Access Repository

Splitwise: Efficient Generative LLM Inference Using Phase Splitting

https://doi.org/10.1109/ISCA59077.2024.00019

Patel, Pratyush; Choukse, Esha; Zhang, Chaojie; Shah, Aashaka; Goiri, Íñigo; Maleki, Saeed; Bianchini, Ricardo (June 2024, IEEE)

Full Text Available
TACCL: Guiding Collective Algorithm Synthesis using Communication Sketches

Shah, Aashaka; Chidambaram, Vijay; Cowan, Meghan; Maleki, Saeed; Musuvathi, Madan; Mytkowicz, Todd; Nelson, Jacob; Saarikivi, Olli; Singh, Rachee (April 2023, USENIX)

Machine learning models are increasingly being trained across multiple GPUs and servers. In this setting, data is transferred between GPUs using communication collectives such as ALLTOALL and ALLREDUCE, which can become a significant bottleneck in training large models. Thus, it is important to use efficient algorithms for collective communication. We develop TACCL, a tool that enables algorithm designers to guide a synthesizer into automatically generating algorithms for a given hardware configuration and communication collective. TACCL uses a novel communication sketch abstraction to get crucial information from the designer to significantly reduce the search space and guide the synthesizer towards better algorithms. TACCL also uses a novel encoding of the problem that allows it to scale beyond single-node topologies. We use TACCL to synthesize algorithms for three collectives and two hardware topologies: DGX-2 and NDv2. We demonstrate that the algorithms synthesized by TACCL outperform the Nvidia Collective Communication Library (NCCL) by up to 6.7x. We also show that TACCL can speed up end-to-end training of Transformer-XL and BERT models by 11% to 2.3x for different batch sizes.
Full Text Available
RainBlock: Faster Transaction Processing in Public Blockchains

Ponnapalli, Soujanya; Shah, Aashaka; Banerjee, Souvik; Malkhi, Dahlia; Chidambaram, Vijay; Wei, Michael (July 2021, Proceedings of the USENIX Conference)
Calciu, Irina; Kuenning, Geoff (Ed.)
We present RAINBLOCK, a public blockchain that achieves high transaction throughput without modifying the proof-of-work consensus. The chief insight behind RAINBLOCK is that while consensus controls the rate at which new blocks are added to the blockchain, the number of transactions in each block is limited by I/O bottlenecks. Public blockchains like Ethereum keep the number of transactions in each block low so that all participating servers (miners) have enough time to process a block before the next block is created. By removing the I/O bottlenecks in transaction processing, RAINBLOCK allows miners to process more transactions in the same amount of time. RAINBLOCK makes two novel contributions: the RAINBLOCK architecture that removes I/O from the critical path of processing transactions (txs), and the distributed, multiversioned DSM-TREE data structure that stores the system state efficiently. We evaluate RAINBLOCK using workloads based on public Ethereum traces (including smart contracts). We show that a single RAINBLOCK miner processes 27.4K txs per second (27× higher than a single Ethereum miner). In a geo-distributed setting with four regions spread across three continents, RAINBLOCK miners process 20K txs per second.
Full Text Available
