NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Odin: Learning to Optimize Operation Unit Configuration for Energy-efficient DNN Inferencing

https://doi.org/10.23919/DATE64628.2025.10993275

Narang, Gaurav; Doppa, Janardhan Rao; Pande, Partha Pratim (March 2025, IEEE)

Free, publicly-accessible full text available March 31, 2026
HpT: Hybrid Acceleration of Spatio-Temporal Attention Model Training on Heterogeneous Manycore Architectures

https://doi.org/10.1109/TPDS.2024.3522781

Dahal, Saiman; Dhingra, Pratyush; Thapa, Krishu Kumar; Pande, Partha Pratim; Kalyanaraman, Ananth (March 2025, IEEE Transactions on Parallel and Distributed Systems)

Free, publicly-accessible full text available March 1, 2026
Heterogeneous Manycore In-Memory Computing Architectures

https://doi.org/10.1145/3676536.3697138

Ogbogu, Chukwufumnanya; Narang, Gaurav; Joardar, Biresh Kumar; Doppa, Janardhan Rao; Pande, Partha Pratim (October 2024, ACM)

Full Text Available
Data Pruning-enabled High Performance and Reliable Graph Neural Network Training on ReRAM-based Processing-in-Memory Accelerators

https://doi.org/10.1145/3656171

Ogbogu, Chukwufumnanya; Joardar, Biresh; Chakrabarty, Krishnendu; Doppa, Jana; Pande, Partha Pratim (September 2024, ACM Transactions on Design Automation of Electronic Systems)

Graph Neural Networks (GNNs) have achieved remarkable accuracy in cognitive tasks such as predictive analytics on graph-structured data. Hence, they have become popular in diverse real-world applications. However, GNN training with large real-world graph datasets in edge-computing scenarios is both memory- and compute-intensive. Traditional computing platforms such as CPUs and GPUs do not provide the energy efficiency and low latency required in edge intelligence applications due to their limited memory bandwidth. Resistive random-access memory (ReRAM)-based processing-in-memory (PIM) architectures have been proposed as suitable candidates for accelerating AI applications at the edge, including GNN training. However, ReRAM-based PIM architectures suffer from low reliability due to their limited endurance, and from low performance when they are used for GNN training in real-world scenarios with large graphs. In this work, we propose a learning-for-data-pruning framework, which leverages a trained Binary Graph Classifier (BGC) to reduce the size of the input data graph by pruning subgraphs early in the training process, thereby accelerating GNN training on ReRAM-based architectures. The proposed lightweight BGC model reduces the amount of redundant information in the input graph(s) to speed up the overall training process, improves the reliability of the ReRAM-based PIM accelerator, and reduces the overall training cost. This enables fast, energy-efficient, and reliable GNN training on ReRAM-based architectures. Our experimental results demonstrate that, using this learning-for-data-pruning framework, we can accelerate GNN training and improve the reliability of ReRAM-based PIM architectures by up to 1.6×, and reduce the overall training cost by 100× compared to state-of-the-art data pruning techniques.
Full Text Available
HeTraX: Energy Efficient 3D Heterogeneous Manycore Architecture for Transformer Acceleration

https://doi.org/10.1145/3665314.3670814

Dhingra, Pratyush; Doppa, Jana; Pande, Partha Pratim (August 2024, ACM)

Full Text Available
HuNT: Exploiting Heterogeneous PIM Devices to Design a 3-D Manycore Architecture for DNN Training

https://doi.org/10.1109/TCAD.2024.3444708

Ogbogu, Chukwufumnanya; Narang, Gaurav; Joardar, Biresh Kumar; Doppa, Janardhan Rao; Chakrabarty, Krishnendu; Pande, Partha Pratim (November 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
Dataflow-Aware PIM-Enabled Manycore Architecture for Deep Learning Workloads

https://doi.org/10.23919/DATE58400.2024.10546730

Sharma, Harsh; Narang, Gaurav; Doppa, Janardhan Rao; Ogras, Umit; Pande, Partha Pratim (March 2024, IEEE)

Full Text Available
Block-Wise Mixed-Precision Quantization: Enabling High Efficiency for Practical ReRAM-based DNN Accelerators

https://doi.org/10.1109/TCAD.2024.3409193

Wu, Xueying; Hanson, Edward; Wang, Nansu; Zheng, Qilin; Yang, Xiaoxuan; Yang, Huanrui; Li, Shiyu; Cheng, Feng; Pande, Partha Pratim; Doppa, Janardhan Rao; et al. (June 2024, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
Uncertainty-Aware Online Learning for Dynamic Power Management in Large Manycore Systems

https://doi.org/10.1109/ISLPED58423.2023.10244486

Narang, Gaurav; Ayoub, Raid; Kishinevsky, Michael; Doppa, Janardhan Rao; Pande, Partha Pratim (August 2023, IEEE)

Full Text Available
Dynamic Power Management in Large Manycore Systems: A Learning-to-Search Framework

https://doi.org/10.1145/3603501

Narang, Gaurav; Deshwal, Aryan; Ayoub, Raid; Kishinevsky, Michael; Doppa, Janardhan Rao; Pande, Partha Pratim (July 2023, ACM Transactions on Design Automation of Electronic Systems)

The complexity of manycore systems-on-chip (SoCs) is growing faster than our ability to manage them to reduce overall energy consumption. Further, as SoC design moves toward 3D architectures, the cores' power density increases, leading to unacceptably high peak chip temperatures. In this paper, we consider the optimization problem of dynamic power management (DPM) in manycore SoCs for an allowable performance penalty (say 5%) and an admissible peak chip temperature. We employ a machine learning (ML)-based DPM policy, which selects the voltage/frequency (V/F) levels for different clusters of cores as a function of application workload features such as core computation and inter-core traffic. We propose a novel learning-to-search (L2S) framework to automatically identify an optimized sequence of DPM decisions from a large combinatorial space for joint energy-thermal optimization for one or more given applications. The optimized DPM decisions are given to a supervised learning algorithm to train a DPM policy, which mimics the corresponding decision-making behavior. Our experiments on two different manycore architectures, designed using wireless interconnect and monolithic 3D, demonstrate that the principles behind the L2S framework apply to more than one configuration. Moreover, L2S-based DPM policies achieve up to 30% energy-delay product (EDP) savings and reduce the peak chip temperature by up to 17 °C compared to state-of-the-art ML methods, for an allowable performance overhead of only 5%.
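The abstract above reports results as energy-delay product (EDP) savings under a fixed performance budget (5%). EDP is simply energy multiplied by execution time. The sketch below shows, under stated assumptions, how such a metric and budget check can be computed for a set of candidate V/F settings. The names `Measurement`, `edp_savings`, `within_perf_budget`, and `pick_best`, as well as all numbers in the usage example, are hypothetical illustrations. `pick_best` is a plain exhaustive single-step selection, not the paper's L2S framework, and it ignores the peak-temperature constraint that the paper also considers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Measurement:
    """Hypothetical measurement of one run: total energy and execution time."""
    energy_j: float  # total energy in joules
    delay_s: float   # execution time in seconds


def edp(m: Measurement) -> float:
    """Energy-delay product: energy multiplied by execution time."""
    return m.energy_j * m.delay_s


def edp_savings(baseline: Measurement, candidate: Measurement) -> float:
    """Fractional EDP reduction of the candidate relative to the baseline."""
    return 1.0 - edp(candidate) / edp(baseline)


def within_perf_budget(baseline: Measurement, candidate: Measurement,
                       max_overhead: float = 0.05) -> bool:
    """True if the candidate's slowdown versus the baseline stays within budget."""
    return candidate.delay_s <= baseline.delay_s * (1.0 + max_overhead)


def pick_best(baseline: Measurement, candidates: Dict[str, Measurement],
              max_overhead: float = 0.05) -> Optional[str]:
    """Hypothetical greedy choice: lowest-EDP candidate that meets the budget.

    This is an exhaustive one-step search for illustration only; it does not
    model temperature and is not the learning-to-search method from the paper.
    """
    feasible = {name: m for name, m in candidates.items()
                if within_perf_budget(baseline, m, max_overhead)}
    if not feasible:
        return None
    return min(feasible, key=lambda name: edp(feasible[name]))
```

Usage with made-up numbers: with a baseline of `Measurement(100.0, 10.0)` (EDP 1000), a candidate `Measurement(65.0, 10.4)` is 4% slower, so it fits a 5% budget and saves about 32.4% EDP, while `Measurement(50.0, 10.6)` is 6% slower and is rejected by `within_perf_budget`, even though its EDP is lower.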
Full Text Available
