The architecture of a coarse-grained reconfigurable array (CGRA) processing element (PE) has a significant effect on the performance and energy efficiency of an application running on the CGRA. This paper presents APEX, an automated approach for generating specialized PE architectures for an application or an application domain. APEX first analyzes application domain benchmarks using frequent subgraph mining to extract commonly occurring computational subgraphs. APEX then generates specialized PEs by merging subgraphs using a datapath graph merging algorithm. The merged datapath graphs are translated into a PE specification from which we automatically generate the PE hardware description in Verilog along with a compiler that maps applications to the PE. The PE hardware and compiler are inserted into a flexible CGRA generation and compilation toolchain that allows for agile evaluation of CGRAs. We evaluate APEX for two domains, machine learning and image processing. For image processing applications, our automatically generated CGRAs with specialized PEs use 5% to 30% less area and 22% to 46% less energy than a general-purpose CGRA. For machine learning applications, our automatically generated CGRAs consume 16% to 59% less energy and 22% to 39% less area than a general-purpose CGRA. This work paves the way for the creation of application domain-driven design-space exploration frameworks that automatically generate efficient programmable accelerators, with a much lower design effort for both hardware and compiler generation.
                            Canal: A Flexible Interconnect Generator for Coarse-Grained Reconfigurable Arrays
                        
                    
    
            The architecture of a coarse-grained reconfigurable array (CGRA) interconnect has a significant effect on not only the flexibility of the resulting accelerator, but also its power, performance, and area. Design decisions that have complex trade-offs need to be explored to maintain efficiency and performance across a variety of evolving applications. This paper presents Canal, a Python-embedded domain-specific language (eDSL) and compiler for specifying and generating reconfigurable interconnects for CGRAs. Canal uses a graph-based intermediate representation (IR) that allows for easy hardware generation and tight integration with place and route tools. We evaluate Canal by constructing both a fully static interconnect and a hybrid interconnect with ready-valid signaling, and by conducting design space exploration of the interconnect architecture by modifying the switch box topology, the number of routing tracks, and the interconnect tile connections. Through the use of a graph-based IR for CGRA interconnects, the eDSL, and the interconnect generation system, Canal enables fast design space exploration and creation of CGRA interconnects. 
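
To make the idea of a graph-based interconnect IR concrete, here is a minimal Python sketch. It is not Canal's actual API: the names `Side`, `Node`, `InterconnectGraph`, and `build_disjoint_switch_box` are hypothetical, and the disjoint switch box topology (track t connects only to track t on the other sides) is an assumption chosen for illustration. The sketch shows how a switch box topology and the number of routing tracks can be expressed as nodes and directed edges, and how the fan-in of a node indicates the size of the multiplexer a hardware generator would need to emit.

```python
from dataclasses import dataclass, field
from enum import Enum


class Side(Enum):
    NORTH = 0
    SOUTH = 1
    EAST = 2
    WEST = 3


@dataclass(frozen=True)
class Node:
    """One routable resource: a switch box port on a tile side and track (hypothetical)."""
    x: int
    y: int
    side: Side
    track: int
    is_input: bool


@dataclass
class InterconnectGraph:
    """Directed graph in which an edge src -> dst means 'src can drive dst'."""
    edges: dict = field(default_factory=dict)

    def connect(self, src, dst):
        self.edges.setdefault(src, set()).add(dst)
        self.edges.setdefault(dst, set())  # register dst even if it drives nothing

    def fan_in(self, node):
        # Number of drivers of a node; more than one implies a multiplexer.
        return sum(node in dsts for dsts in self.edges.values())


def build_disjoint_switch_box(graph, x, y, num_tracks):
    """Assumed topology: an input on track t drives the outputs on track t
    of the other three sides of the same tile."""
    for t in range(num_tracks):
        for src_side in Side:
            src = Node(x, y, src_side, t, True)
            for dst_side in Side:
                if dst_side is not src_side:
                    graph.connect(src, Node(x, y, dst_side, t, False))


graph = InterconnectGraph()
build_disjoint_switch_box(graph, x=0, y=0, num_tracks=5)
east_out = Node(0, 0, Side.EAST, 2, False)
mux_inputs = graph.fan_in(east_out)  # drivers of the east output on track 2
```

Changing `num_tracks` or swapping in a different topology function changes only the graph, which is the kind of design space exploration the abstract describes; a place-and-route tool could search the same graph for paths.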
- Award ID(s): 2238006
- PAR ID: 10493554
- Publisher / Repository: IEEE
- Date Published:
- Journal Name: IEEE Computer Architecture Letters
- Volume: 22
- Issue: 1
- ISSN: 1556-6056
- Page Range / eLocation ID: 45 to 48
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- 
            Chiplet-based architectures have been proposed to scale computing systems for deep neural networks (DNNs). Prior work has shown that for chiplet-based DNN accelerators, the electrical network connecting the chiplets poses a major challenge to system performance, energy consumption, and scalability. Emerging interconnect technologies such as silicon photonics can potentially overcome the challenges facing electrical interconnects, as photonic interconnects provide high bandwidth density, superior energy efficiency, and ease of implementing the broadcast and multicast operations that are prevalent in DNN inference. In this paper, we propose a chiplet-based architecture named SPRINT for DNN inference. SPRINT uses a global buffer to simplify the data transmission between storage and computation, and includes two novel designs: (1) a reconfigurable photonic network that can support diverse communications in DNN inference with minimal implementation cost, and (2) a customized dataflow that exploits the ease of broadcast and multicast in photonic interconnects to support highly parallel DNN computations. Simulation studies using the ResNet50 DNN model show that SPRINT reduces execution time by 46% and energy consumption by 61% compared to other state-of-the-art chiplet-based architectures with electrical or photonic interconnects.
- 
            As the technology node of VLSI designs advances to sub-10 nm, two interconnect-centric metrics of a circuit, the interconnect complexity (either number of interconnects or wirelength/WL) and congestion, become critically important across all design stages alongside conventional resource or function-unit (FU)-centric metrics like area/number-of-FUs and leakage power. High-level synthesis (HLS), one of the earliest and most impactful design stages, rarely monitors interconnect metrics, which makes their recovery at later stages very difficult. HLS algorithms and tools typically perform FU-centric minimization via operation scheduling and module selection (S&MS) and binding. As a consequence, they mostly overlook interconnect-based metrics. In this paper, we explore whether this can adversely affect interconnect metrics, and more generally explore the correlation between FU-centric optimization in S&MS and the resulting interconnect metrics co-optimized (along with FU metrics) in the later binding stage(s). For this purpose we develop a probabilistic analysis for post-scheduling binding to estimate interconnect metrics, and verify its accuracy by comparison to empirical results across different scheduling techniques that generate different degrees of FU optimization. Based on both empirical and analytical results, we predict how interconnect metrics will pan out with different degrees of FU optimization. Finally, based on our analysis, we also provide suggestions to improve interconnect metrics for whatever FU optimization degree an available S&MS technique can achieve.
- 
            Transistor sizes have shrunk drastically over the years, and interconnects have likewise been scaled down. Today, conventional copper (Cu)-based interconnects face a significant impediment to further scaling since their electrical conductivity decreases at smaller dimensions, which also worsens the signal delay and energy consumption. As a result, alternative scalable materials such as semi-metals and 2D materials are being investigated as potential Cu replacements. In this paper, we experimentally showed that CoPt can provide better resistivity than Cu at thin dimensions and proposed hybrid poly-Si with a CoPt coating for local routing in standard cells for compactness. We evaluated the performance gain for DRAM/eDRAM, and the area vs. performance trade-off for a D-Flip-Flop (DFF) using hybrid poly-Si with a thin film of CoPt. We gained up to a 3-fold reduction in delay and a 15.6% reduction in cell area with the proposed hybrid interconnect. We also studied the system-level interconnect design using NbAs, a topological semi-metal with high electron mobility at the nanoscale, and demonstrated its advantages over Cu in terms of resistivity, propagation delay, and slew rate. Our simulations revealed that NbAs could reduce the propagation delay by up to 35.88%. We further evaluated the potential system-level performance gain for NbAs-based interconnects in cache memories and observed an instructions per cycle (IPC) improvement of up to 23.8%.
- 
            Gate-exhaustive and cell-aware tests are generated based on input patterns of cells in a design. While the tests provide thorough testing of the cells, the interconnects between them are tested only as input and output lines of cells. This paper defines cell-based faults that allow the interconnects to be tested more thoroughly within a uniform framework that only targets input patterns of cells. In contrast to a real cell that is part of the design, a dummy cell is used for defining interconnect-aware faults. Using a gate-level description of the circuit, a dummy cell contains an interconnect, an output gate of the real cell that drives it, and an input gate of the real cell that it drives. Experimental results for benchmark circuits show that many of the interconnect-aware faults are not detected accidentally by gate-exhaustive tests, and that the quality of the test set is improved by targeting interconnect-aware faults. Here, quality is measured by the numbers of detections of single stuck-at faults in a gate-level representation of the circuit.