NSF PAR Search | NSF Public Access Repository


Edge-Based Heuristics for Optimizing Shortcut-Augmented Topologies for HPC Interconnects

https://doi.org/10.3390/electronics11172778

Fuad, Kazi Ahmed; Zeng, Kai; Chen, Lizhong (September 2022, Electronics)

Interconnection network topology is critical for the overall performance of HPC systems. While many regular and irregular topologies have been proposed in the past, recent work has shown the promise of shortcut-augmented topologies that offer multi-fold reduction in network diameter and hop count over conventional topologies. However, the large number of possible shortcuts creates an enormous design space for this new type of topology, and existing approaches are extremely slow and do not find globally optimal shortcuts. In this paper, we propose an efficient heuristic approach, called EdgeCut, which generates high-quality shortcut-augmented topologies. EdgeCut can identify more globally useful shortcuts by evaluating candidates from the perspective of edges instead of vertices. An additional implementation is proposed that approximates the costly all-pairs shortest paths calculation, thereby further speeding up the scheme. Quantitative comparisons with prior work show that the proposed approach achieves a 1982× reduction in search time while generating better or equivalent topologies in 94.9% of the evaluated cases.
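As a rough illustration of the two metrics named in the abstract above (network diameter and hop count), the following sketch measures both on a small ring before and after adding two shortcuts. This is not EdgeCut: the 8-node ring, the hand-picked shortcuts, and the helper name `hop_stats` are assumptions chosen only for illustration, and the exact all-pairs breadth-first search shown here is the costly step that the abstract says the authors approximate.

```python
from collections import deque


def hop_stats(n, edges):
    """Return (diameter, average hop count) of an undirected, connected graph.

    Hypothetical helper: runs one breadth-first search per source node,
    i.e. an exact all-pairs shortest-path computation on an unweighted graph.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    diameter = 0
    total_hops = 0
    for src in range(n):
        dist = [-1] * n          # -1 marks "not reached yet"
        dist[src] = 0
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if dist[v] < 0:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if min(dist) < 0:
            raise ValueError("graph is disconnected")
        diameter = max(diameter, max(dist))
        total_hops += sum(dist)

    # Average over all ordered pairs of distinct nodes.
    return diameter, total_hops / (n * (n - 1))


# Assumed toy topology: an 8-node ring, then the same ring with two shortcuts.
ring = [(i, (i + 1) % 8) for i in range(8)]
base = hop_stats(8, ring)
augmented = hop_stats(8, ring + [(0, 4), (2, 6)])
print("ring:      diameter=%d, avg hops=%.3f" % base)
print("augmented: diameter=%d, avg hops=%.3f" % augmented)
```

Even on this toy ring, choosing which shortcuts to add changes both metrics, which hints at why the design space grows quickly with network size.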
Full Text Available
UVMBench: A Comprehensive Benchmark Suite for Researching Unified Virtual Memory in GPUs

Gu, Yongbin; Wu, Wenxuan; Li, Yunfan; Chen, Lizhong (July 2021, International Conference on Scientific Computing)

The recent introduction of Unified Virtual Memory (UVM) in GPUs offers a new programming model that allows GPUs and CPUs to share the same virtual memory space, which shifts complex memory management from programmers to the GPU driver/hardware and enables kernel execution even when memory is oversubscribed. However, UVM may also incur considerable performance overhead due to tracking and data migration, along with special handling of page faults and page table walks. As UVM is attracting significant attention from the research community to develop innovative solutions to these problems, in this paper we propose a comprehensive UVM benchmark suite named UVMBench to facilitate future research on this important topic. The proposed UVMBench consists of 32 representative benchmarks from a wide range of application domains. The suite also features a unified programming implementation and diverse memory access patterns across benchmarks, thus allowing thorough evaluation and comparison with the current state of the art. A set of experiments has been conducted on real GPUs to verify and analyze the behavior of the benchmark suite under various scenarios.
Full Text Available
Polymorphic Accelerators for Deep Neural Networks

https://doi.org/10.1109/TC.2020.3048624

Azizimazreah, Arash; Chen, Lizhong (January 2021, IEEE Transactions on Computers)

Deep neural networks (DNNs) come in many forms, such as convolutional neural networks, multilayer perceptrons, and recurrent neural networks, to meet the diverse needs of machine learning applications. However, existing DNN accelerator designs, when used to execute multiple neural networks, suffer from underutilization of processing elements, heavy feature map traffic, and large area overhead. In this paper, we propose a novel approach, Polymorphic Accelerators, to address the flexibility issue fundamentally. We introduce the abstraction of logical accelerators to decouple the fixed mapping with physical resources. Three procedures are proposed that work collaboratively to reconfigure the accelerator for the network currently being executed and to enable cross-layer data reuse among logical accelerators. Evaluation results show that the proposed approach achieves significant improvement in data reuse, inference latency, and performance, e.g., 1.52x and 1.63x increases in throughput compared with the state-of-the-art flexible dataflow approach and resource partitioning approach, respectively. This demonstrates the effectiveness and promise of the polymorphic accelerator architecture.
Full Text Available
A Deep Reinforcement Learning Framework for Architectural Exploration: A Routerless NoC Case Study

https://doi.org/10.1109/HPCA47549.2020.00018

Lin, Ting-Ru; Penney, Drew; Pedram, Massoud; Chen, Lizhong (February 2020, IEEE International Symposium on High Performance Computer Architecture (HPCA))
Machine learning applied to architecture design presents a promising opportunity with broad applications. Recent deep reinforcement learning (DRL) techniques, in particular, enable efficient exploration in vast design spaces where conventional design strategies may be inadequate. This paper proposes a novel deep reinforcement learning framework, taking routerless networks-on-chip (NoCs) as an evaluation case study. The new framework successfully resolves problems with prior design approaches, which are either unreliable due to random searches or inflexible due to severe design space restrictions. The framework learns (near-)optimal loop placement for routerless NoCs with various design constraints. A deep neural network is developed using parallel threads that efficiently explore the immense routerless NoC design space with a Monte Carlo search tree. Experimental results show that, compared with a conventional mesh, the proposed DRL routerless design achieves a 3.25x increase in throughput, a 1.6x reduction in packet latency, and a 5x reduction in power. Compared with the state-of-the-art routerless NoC, DRL achieves a 1.47x increase in throughput, a 1.18x reduction in packet latency, a 1.14x reduction in average hop count, and 6.3% lower power consumption.
Full Text Available
EquiNox: Equivalent NoC Injection Routers for Silicon Interposer-Based Throughput Processors

https://doi.org/10.1109/HPCA47549.2020.00043

Li, Yunfan; Chen, Lizhong (February 2020, IEEE International Symposium on High Performance Computer Architecture (HPCA))
Throughput-oriented many-core processors demand a highly efficient network-on-chip (NoC) architecture for data transfer. The recent advent of silicon interposers, stacked memory, and 2.5D integration has further increased data transfer rates. This greatly intensifies the traffic bottleneck in the NoC but, at the same time, brings a significant new opportunity to utilize wiring resources in the interposer. In this paper, we propose a novel concept called Equivalent Injection Routers (EIRs) which, together with interposer links, transform the few-to-many traffic pattern into a many-to-many pattern, thus fundamentally solving the bottleneck problem. We have developed EquiNox as a design example. We utilize N-Queen and Monte Carlo Tree Search (MCTS) methods to help select EIRs by comprehensively considering topological, architectural, and physical aspects. Evaluation results show that, compared with prior work, the proposed EquiNox is able to reduce execution time by 23.5%, energy consumption by 18.9%, and energy-delay product (EDP) by 32.8%, under similar hardware cost.
Full Text Available
Characterizing On-Chip Traffic Patterns in General-Purpose GPUs: A Deep Learning Approach

https://doi.org/10.1109/ICCD46524.2019.00016

Li, Yunfan; Penney, Drew; Ramamurthy, Abhishek; Chen, Lizhong (November 2019, IEEE 37th International Conference on Computer Design (ICCD))
Architectural optimizations in general-purpose graphics processing units (GPGPUs) often exploit workload characteristics to reduce power and latency while improving performance. This paper finds, however, that prevailing assumptions about GPGPU traffic pattern characterization are inaccurate. These assumptions must therefore be re-evaluated, and more appropriate new patterns must be identified. This paper proposes a methodology to classify GPGPU traffic patterns, combining a convolutional neural network (CNN) for feature extraction and a t-distributed stochastic neighbor embedding (t-SNE) algorithm to determine traffic pattern clusters. A traffic pattern dataset is generated from common GPGPU benchmarks, transformed using heat mapping, and iteratively refined to ensure appropriate and highly accurate labels. The proposed classification model achieves 98.8% validation accuracy and 94.24% test accuracy. Furthermore, traffic in 96.6% of examined kernels can be classified into the eight identified traffic pattern categories.
Full Text Available
Express Link Placement for NoC-Based Many-Core Platforms

https://doi.org/10.1145/3337821.3337877

Li, Yunfan; Zhu, Di; Chen, Lizhong (August 2019, International Conference on Parallel Processing)

Full Text Available
Tolerating Soft Errors in Deep Learning Accelerators with Reliable On-Chip Memory Designs

https://doi.org/10.1109/NAS.2018.8515692

Azizimazreah, Arash; Gu, Yongbin; Gu, Xiang; Chen, Lizhong (October 2018, 2018 IEEE International Conference on Networking, Architecture and Storage (NAS))

Full Text Available
CART: Cache Access Reordering Tree for Efficient Cache and Memory Accesses in GPUs

Gu, Yongbin; Chen, Lizhong (October 2018, IEEE International Conference on Computer Design)

Full Text Available
