NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

HISIM: Analytical Performance Modeling and Design Space Exploration of 2.5D/3D Integration for AI Computing

https://doi.org/10.1109/TCAD.2025.3531348

Wang, Zhenyu; Nalla, Pragnya Sudershan; Sun, Jingbo; Goksoy, A Alper; Mandal, Sumit K; Seo, Jae-sun; Chhabria, Vidya A; Zhang, Jeff; Chakrabarti, Chaitali; Ogras, Umit Y; et al (January 2025, IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems)

Full Text Available
DTRL: Decision Tree-based Multi-Objective Reinforcement Learning for Runtime Task Scheduling in Domain-Specific System-on-Chips

https://doi.org/10.1145/3609108

Basaklar, Toygun; Goksoy, A Alper; Krishnakumar, Anish; Gumussoy, Suat; Ogras, Umit Y (October 2023, ACM Transactions on Embedded Computing Systems)

Domain-specific systems-on-chip (DSSoCs) combine general-purpose processors and specialized hardware accelerators to improve performance and energy efficiency for a specific domain. The optimal allocation of tasks to processing elements (PEs) with minimal runtime overheads is crucial to achieving this potential. However, this problem remains challenging as prior approaches suffer from non-optimal scheduling decisions or significant runtime overheads. Moreover, existing techniques focus on a single optimization objective, such as maximizing performance. This work proposes DTRL, a decision-tree-based multi-objective reinforcement learning technique for runtime task scheduling in DSSoCs. DTRL trains a single global differentiable decision tree (DDT) policy that covers the entire objective space quantified by a preference vector. Our extensive experimental evaluations using our novel reinforcement learning environment demonstrate that DTRL captures the trade-off between execution time and power consumption, thereby generating a Pareto set of solutions using a single policy. Furthermore, comparison with state-of-the-art heuristic–, optimization–, and machine learning-based schedulers shows that DTRL achieves up to 9× higher performance and up to 3.08× reduction in energy consumption. The trained DDT policy achieves 120 ns inference latency on Xilinx Zynq ZCU102 FPGA at 1.2 GHz, resulting in negligible runtime overheads. Evaluation on the same hardware shows that DTRL achieves up to 16% higher performance than a state-of-the-art heuristic scheduler.
more » « less
Full Text Available
CANNON: C ommunication- A ware Sparse N eural N etwork O ptimizatio n

https://doi.org/10.1109/TETC.2023.3289778

Goksoy, A. Alper; Li, Guihong; Mandal, Sumit K.; Ogras, Umit Y.; Marculescu, Radu (January 2023, IEEE Transactions on Emerging Topics in Computing)

Sparse deep neural networks (DNNs) have the potential to deliver compelling performance and energy efficiency without significant accuracy loss. However, their benefits can quickly diminish if their training is oblivious to the target hardware. For example, fewer critical connections can have a significant overhead if they translate into long-distance communication on the target hardware. Therefore, hardware-aware sparse training is needed to leverage the full potential of sparse DNNs. To this end, we propose a novel and comprehensive communication-aware sparse DNN optimization framework for tile-based in-memory computing (IMC) architectures. The proposed technique, CANNON first maps the DNN layers onto the tiles of the target architecture. Then, it replaces the fully connected and convolutional layers with communication-aware sparse connections. After that, CANNON optimizes the communication cost with minimal impact on the DNN accuracy. Extensive experimental evaluations with a wide range of DNNs and datasets show up to 3.0× lower communication energy, 3.1× lower communication latency, and 6.8× lower energy-delay product compared to state-of-the-art pruning approaches with a negligible impact on the classification accuracy on IMC-based machine learning accelerators.
more » « less
Full Text Available

Search for: All records