NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Understanding Performance Implications of LLM Inference on CPUs

https://doi.org/10.1109/IISWC63097.2024.00024

Na, Seonjin; Jeong, Geonhwa; Ahn, Byung Hoon; Young, Jeffrey; Krishna, Tushar; Kim, Hyesoon (September 2024, IEEE)

Full Text Available
Special Session: Neuro-Symbolic Architecture Meets Large Language Models: A Memory-Centric Perspective

https://doi.org/10.1109/CODES-ISSS60120.2024.00012

Ibrahim, Mohamed; Wan, Zishen; Li, Haitong; Panda, Priyadarshini; Krishna, Tushar; Kanerva, Pentti; Chen, Yiran; Raychowdhury, Arijit (September 2024, IEEE)

Full Text Available
TNT: A Modular Approach to Traversing Physically Heterogeneous NOCs at Bare-wire Latency

https://doi.org/10.1145/3597611

Ravi, Gokul Subramanian; Krishna, Tushar; Lipasti, Mikko (September 2023, ACM Transactions on Architecture and Code Optimization)

The ideal latency for on-chip network traversal would be the delay incurred from wire traversal alone. Unfortunately, in a realistic modular network, the latency for a packet to traverse the network is significantly higher than this wire delay. The main limiter to achieving lower latency is the modular quantization of network traversal into hops. Beyond this, the physical heterogeneity in real-world systems further complicate the ability to reach ideal wire-only delay. In this work, we propose TNT or Transparent Network Traversal . TNT targets ideal network latency by attempting source to destination network traversal as a single multi-cycle ‘long-hop’, bypassing the quantization effects of intermediate routers via transparent data/information flow. TNT is built in a modular tile-scalable manner via a novel control path performing neighbor-to-neighbor interactions but enabling end-to-end transparent flit traversal. Further, TNT’s fine grained on-the-fly delay tracking allows it to cope with physical NOC heterogeneity across the chip. Analysis on Ligra graph workloads shows that TNT can reduce NOC latency by as much as 43% compared to the state of the art and allows efficiency gains up to 38%. Further, it can achieve more than 3x the benefits of the best/closest alternative research proposal, SMART [ 43 ].
more » « less
Full Text Available
Efficient Distributed Inference of Deep Neural Networks via Restructuring and Pruning

https://doi.org/10.1609/aaai.v37i6.25815

Abdi, Afshin; Rashidi, Saeed; Fekri, Faramarz; Krishna, Tushar (June 2023, Proceedings of the AAAI Conference on Artificial Intelligence)

In this paper, we consider the parallel implementation of an already-trained deep model on multiple processing nodes (a.k.a. workers). Specifically, we investigate as to how a deep model should be divided into several parallel sub-models, each of which is executed efficiently by a worker. Since latency due to synchronization and data transfer among workers negatively impacts the performance of the parallel implementation, it is desirable to have minimum interdependency among parallel sub-models. To achieve this goal, we propose to rearrange the neurons in the neural network, partition them (without changing the general topology of the neural network), and modify the weights such that the interdependency among sub-models is minimized under the computations and communications constraints of the workers while minimizing its impact on the performance of the model. We propose RePurpose, a layer-wise model restructuring and pruning technique that guarantees the performance of the overall parallelized model. To efficiently apply RePurpose, we propose an approach based on L0 optimization and the Munkres assignment algorithm. We show that, compared to the existing methods, RePurpose significantly improves the efficiency of the distributed inference via parallel implementation, both in terms of communication and computational complexity.
more » « less
Full Text Available
Hardware–Software Co-Design for Real-Time Latency–Accuracy Navigation in Tiny Machine Learning Applications

https://doi.org/10.1109/MM.2023.3317243

Behnam, Payman; Tong, Jianming; Khare, Alind; Chen, Yangyu; Pan, Yue; Gadikar, Pranav; Bambhaniya, Abhimanyu; Krishna, Tushar; Tumanov, Alexey (November 2023, IEEE Micro)

Tiny machine learning (TinyML) applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space, which is insufficient as no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that activates different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of different queries that use the same SuperNet. We propose a hardware–software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel field-programmable gate array implementation and a software scheduler that controls which SubNets to serve and which SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency, 0.98% increase in served accuracy, and achieves up to 78.7% off-chip energy saved across several neural network architectures.
more » « less
Full Text Available
SubGraph Stationary Hardware-Software Inference Co-Design

Behnam, Payman; Tong, Jianming; Khare, Alind; Chen, Yangyu; Pan, Yue; Gadikar, Pranav; Bambhaniya, Abhimanyu Rajeshkumar; Krishna, Tushar; Tumanov, Alexey (June 2023, Proceedings of Machine Learning and Systems)
Song, Dawn; Carbin, Michael; Chen, T (Ed.)
Full Text Available
Demystifying Map Space Exploration for NPUs

https://doi.org/10.1109/IISWC55918.2022.00031

Kao, Sheng-Chun; Parashar, Angshuman; Tsai, Po-An; Krishna, Tushar (November 2022, Proceedings of the IEEE International Symposium on Workload Characterization)

Full Text Available
MicroEdge: a multi-tenant edge cluster system architecture for scalable camera processing

https://doi.org/10.1145/3528535.3565254

Cao, Difei; Yoo, Jinsun; Xu, Zhuangdi; Saurez, Enrique; Gupta, Harshit; Krishna, Tushar; Ramachandran, Umakishore (November 2022, ACM/IFIP Middleware 2022)

Full Text Available
MAGMA: An Optimization Framework for Mapping Multiple DNNs on Multiple Accelerator Cores

https://doi.org/10.1109/HPCA53966.2022.00065

Kao, Sheng-Chun; Krishna, Tushar (April 2022, 2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA))

Full Text Available
DiGamma: Domain-aware Genetic Algorithm for HW-Mapping Co-optimization for DNN Accelerators

https://doi.org/10.23919/DATE54114.2022.9774568

Kao, Sheng-Chun; Pellauer, Michael; Parashar, Angshuman; Krishna, Tushar (March 2022, 2022 Design, Automation & Test in Europe Conference & Exhibition (DATE))

Full Text Available

« Prev Next »

Search for: All records