NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

You, Haoran; Fu, Yichao; Wang, Zheng; Yazdanbakhsh, Amir; Lin, Yingyan Celine (July 2024, Cambridge MA: JMLR)
Lawrence, Neil (Ed.)
Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
more » « less
Full Text Available
When Linear Attention Meets Autoregressive Decoding: Towards More Effective and Efficient Linearized Large Language Models

You, Haoran; Fu, Yichao; Wang, Zheng; Yazdanbakhsh, Amir; Lin, Yingyan Celine (July 2024, Proceedings of Machine Learning Research)
Lawrence, Neil (Ed.)
Autoregressive Large Language Models (LLMs) have achieved impressive performance in language tasks but face two significant bottlenecks: (1) quadratic complexity in the attention module as the number of tokens increases, and (2) limited efficiency due to the sequential processing nature of autoregressive LLMs during generation. While linear attention and speculative decoding offer potential solutions, their applicability and synergistic potential for enhancing autoregressive LLMs remain uncertain. We conduct the first comprehensive study on the efficacy of existing linear attention methods for autoregressive LLMs, integrating them with speculative decoding. We introduce an augmentation technique for linear attention that ensures compatibility with speculative decoding, enabling more efficient training and serving of LLMs. Extensive experiments and ablation studies involving seven existing linear attention models and five encoder/decoder-based LLMs consistently validate the effectiveness of our augmented linearized LLMs. Notably, our approach achieves up to a 6.67 reduction in perplexity on the LLaMA model and up to a 2× speedup during generation compared to prior linear attention methods. Codes and models are available at https://github.com/GATECH-EIC/Linearized-LLM.
more » « less
Full Text Available
ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient Vision Transformer

You, Haoran; Shi, Huihong; Guo, Yipin; Lin, Yingyan Celine (September 2023, NeurIPS 2023 Conference Proceedings)

Vision Transformers (ViTs) have shown impressive performance and have become a unified backbone for multiple vision tasks. However, both the attention mechanism and multi-layer perceptrons (MLPs) in ViTs are not sufficiently efficient due to dense multiplications, leading to costly training and inference. To this end, we propose to reparameterize pre-trained ViTs with a mixture of multiplication primitives, e.g., bitwise shifts and additions, towards a new type of multiplication-reduced model, dubbed ShiftAddViT, which aims to achieve end-to-end inference speedups on GPUs without requiring training from scratch. Specifically, all MatMuls among queries, keys, and values are reparameterized using additive kernels, after mapping queries and keys to binary codes in Hamming space. The remaining MLPs or linear layers are then reparameterized with shift kernels. We utilize TVM to implement and optimize those customized kernels for practical hardware deployment on GPUs. We find that such a reparameterization on (quadratic or linear) attention maintains model accuracy, while inevitably leading to accuracy drops when being applied to MLPs. To marry the best of both worlds, we further propose a new mixture of experts (MoE) framework to reparameterize MLPs by taking multiplication or its primitives as experts, e.g., multiplication and shift, and designing a new latency-aware load-balancing loss. Such a loss helps to train a generic router for assigning a dynamic amount of input tokens to different experts according to their latency. In principle, the faster the experts run, the more input tokens they are assigned. Extensive experiments on various 2D/3D Transformer-based vision tasks consistently validate the effectiveness of our proposed ShiftAddViT, achieving up to 5.18x latency reductions on GPUs and 42.9% energy savings, while maintaining a comparable accuracy as original or efficient ViTs. Codes and models are available at https://github.com/GATECH-EIC/ShiftAddViT.
more » « less
Full Text Available
NetDistiller: Empowering Tiny Deep Learning via In Situ Distillation

https://doi.org/10.1109/MM.2023.3324261

Zhang, Shunyao; Fu, Yonggan; Wu, Shang; Dass, Jyotikrishna; You, Haoran; Lin, Yingyan Celine (November 2023, IEEE Micro)

Full Text Available
Gen-NeRF: Efficient and Generalizable Neural Radiance Fields via Algorithm-Hardware Co-Design

https://doi.org/10.1145/3579371.3589109

Fu, Yonggan; Ye, Zhifan; Yuan, Jiayi; Zhang, Shunyao; Li, Sixu; You, Haoran; Lin, Yingyan (June 2023, The 50th IEEE/ACM International Symposium on Computer Architecture 2023 (ISCA 2023))

Full Text Available
Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference

https://doi.org/10.1109/CVPR52729.2023.01387

You, Haoran; Xiong, Yunyang; Dai, Xiaoliang; Wu, Bichen; Zhang, Peizhao; Fan, Haoqi; Vajda, Peter; Lin, Yingyan Celine (June 2023, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR))

Vision Transformers (ViTs) have shown impressive per-formance but still require a high computation cost as compared to convolutional neural networks (CNNs), one rea-son is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of in-put tokens. Existing efficient ViTs adopt local attention or linear attention, which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling- ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during inference. Our Castling- ViT leverages angular ker-nels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an aux-iliary masked softmax attention to help learn global and lo-cal information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments validate the effectiveness of our Castling- ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on classification and 1.2 higher mAP on detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based at-tentions. Project page is available at here.
more » « less
Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

https://doi.org/10.1145/3579371.3589115

Li, Sixu; Li, Chaojian; Zhu, Wenbo; Yu, Boyang; Zhao, Yang; Wan, Cheng; You, Haoran; Shi, Huihong; Lin, Yingyan (June 2023, The The IEEE/ACM International Symposium on Computer Architecture 2023 (ISCA 2023))

Full Text Available
SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning

https://doi.org/10.1007/978-3-031-20083-0_40

You, Haoran; Li, Baopu; Sun, Zhanyi; Ouyang, Xu; Lin, Yingyan (October 2022, The 2022 European Conference on Computer Vision (ECCV 2022))

Full Text Available
SuperTickets: Drawing Task-Agnostic Lottery Tickets from Supernets via Jointly Architecture Searching and Parameter Pruning

You, Haoran; Li, Baopu; Sun, Zhanyi; Ouyang, Xu; Lin, Yingyan (October 2022, Computer Vision – ECCV 2022: 17th European Conference)

Neural architecture search (NAS) has demonstrated amazing success in searching for efficient deep neural networks (DNNs) from a given supernet. In parallel, lottery ticket hypothesis has shown that DNNs contain small subnetworks that can be trained from scratch to achieve a comparable or even higher accuracy than the original DNNs. As such, it is currently a common practice to develop efficient DNNs via a pipeline of first search and then prune. Nevertheless, doing so often requires a tedious and costly process of search-train-prune-retrain and thus prohibitive computational cost. In this paper, we discover for the first time that both efficient DNNs and their lottery subnetworks (i.e., lottery tickets) can be directly identified from a supernet, which we term as SuperTickets, via a two-in-one training scheme with jointly architecture searching and parameter pruning. Moreover, we develop a progressive and unified SuperTickets identificationcesstab strategy that allows the connectivity of subnetworks to change during supernet training, achieving better accuracy and efficiency trade-offs than conventional sparse training. Finally, we evaluate whether such identified SuperTickets drawn from one task can transfer well to other tasks, validating their potential of simultaneously handling multiple tasks. Extensive experiments and ablation studies on three tasks and four benchmark datasets validate that our proposed SuperTickets achieve boosted accuracy and efficiency trade-offs than both typical NAS and pruning pipelines, regardless of having retraining or not. Codes and pretrained models are available at https://github.com/RICE-EIC/SuperTickets.
more » « less
ViTCoD: Vision Transformer Acceleration via Dedicated Algorithm and Accelerator Co-Design

https://doi.org/10.1109/HPCA56546.2023.10071027

You, Haoran; Sun, Zhanyi; Shi, Huihong; Yu, Zhongzhi; Zhao, Yang; Zhang, Yongan; Li, Chaojian; Li, Baopu; Lin, Yingyan (February 2023, 2023 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs’ self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, without severely hurting the model accuracy (e.g., <=1.5% under 90% pruning ratio); while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD’s algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, 86.0×, 10.1×, and 6.8× over general computing platforms CPUs, EdgeGPUs, GPUs, and prior-art Transformer accelerators SpAtten and Sanger under an attention sparsity of 90%, respectively. Our code implementation is available at https://github.com/GATECH-EIC/ViTCoD.
more » « less
Full Text Available

« Prev Next »

Search for: All records