Modern artificial intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced technology nodes, together with die sizes approaching the reticle limit, put such systems out of reach for monolithic designs. With recent innovations in advanced packaging technologies, chiplet-based architectures have therefore gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerators and the absence of a system- and package-level co-design methodology make it difficult for designers to find the optimal design point with respect to power, performance, area, and manufacturing cost (PPAC). This paper presents Chiplet-Gym, a reinforcement learning (RL)-based optimization framework that explores the design space of chiplet-based AI accelerators, encompassing resource allocation, placement, and packaging architecture. We analytically model the PPAC of a chiplet-based AI accelerator and integrate the model into an OpenAI Gym environment to evaluate candidate design points. We also explore non-RL optimization approaches and combine the two to ensure the robustness of the optimizer. The optimizer-suggested design point achieves 1.52× the throughput, 0.27× the energy, and 0.89× the cost of its monolithic counterpart at iso-area.
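To make the RL formulation concrete, here is a minimal sketch of what a Gym-style design-space environment could look like, written against the maintained `gymnasium` package. The knob encoding, the placeholder PPAC model, and the reward weights are illustrative assumptions for this sketch, not the paper's actual analytical models, which the abstract does not detail.

```python
# Minimal sketch of a Gym-style environment for chiplet design-space
# exploration. State encoding, action set, and PPAC reward weights are
# illustrative assumptions, not the paper's actual analytical models.
import numpy as np
import gymnasium as gym
from gymnasium import spaces

class ChipletDesignEnv(gym.Env):
    """Each episode proposes one design: a discrete setting per knob
    (e.g., chiplet count, placement option, packaging choice); the
    reward scores the resulting PPAC estimate."""

    def __init__(self, n_knobs=4, levels=8):
        self.n_knobs, self.levels = n_knobs, levels
        self.action_space = spaces.MultiDiscrete([levels] * n_knobs)
        self.observation_space = spaces.Box(0.0, 1.0, shape=(n_knobs,),
                                            dtype=np.float32)

    def _obs(self, state):
        return (state / (self.levels - 1)).astype(np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.state = self.np_random.integers(0, self.levels,
                                             size=self.n_knobs)
        return self._obs(self.state), {}

    def step(self, action):
        self.state = np.asarray(action)
        # Placeholder analytical PPAC model: swap in real estimators.
        perf = self.state.sum() / (self.n_knobs * (self.levels - 1))
        power = 0.5 * perf ** 2        # assumed power/performance trend
        cost = 0.2 + 0.3 * perf        # assumed packaging-cost trend
        reward = perf - 0.5 * power - 0.3 * cost
        return self._obs(self.state), reward, True, False, {}
```

Any standard RL agent can then be trained against this environment; each episode is a one-shot design evaluation, matching the optimizer's search-over-design-points framing.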
                    This content will become publicly available on May 31, 2026
                            
Efficient and Robust Edge AI: Software, Hardware, and the Co-design
Artificial intelligence (AI) provides versatile capabilities in applications such as image classification and voice recognition that are most useful in edge or mobile computing settings. Shrinking these sophisticated algorithms into small form factors with minimal computing resources and power budgets requires innovation at several layers of abstraction: software, algorithm, architecture, circuit, and device. However, improvements to system efficiency may impact robustness, and vice versa. Therefore, a co-design framework is often necessary to customize a system for its given application. For example, a system that prioritizes efficiency might adopt circuit-level innovations that introduce process variation or signal noise, which software-level redundancy must then compensate for. In this tutorial, we first examine various methods of improving efficiency and robustness in edge AI and their tradeoffs at each level of abstraction. Then, we outline co-design techniques for designing efficient and robust edge AI systems, using federated learning as a specific example to illustrate the effectiveness of co-design.
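Since the tutorial uses federated learning as its running co-design example, a minimal federated-averaging (FedAvg) sketch may help fix ideas. The linear model, client shards, learning rate, and the `local_step`/`fedavg` helpers below are illustrative assumptions, not the tutorial's actual setup.

```python
# Minimal FedAvg sketch: edge clients take local gradient steps, the
# server averages the resulting weights. Model and data are illustrative.
import numpy as np

def local_step(w, X, y, lr=0.1):
    """One gradient step of linear regression on a client's local data."""
    grad = 2 * X.T @ (X @ w - y) / len(y)
    return w - lr * grad

def fedavg(clients, w, rounds=20):
    """clients: list of (X, y) shards that never leave the edge device."""
    for _ in range(rounds):
        updates = [local_step(w, X, y) for X, y in clients]
        # Averaging is also a robustness knob: it tolerates noisy
        # per-client updates (e.g., from low-precision edge hardware).
        w = np.mean(updates, axis=0)
    return w

rng = np.random.default_rng(0)
w_true = np.array([2.0, -1.0])
clients = []
for _ in range(5):
    X = rng.normal(size=(40, 2))
    clients.append((X, X @ w_true + 0.1 * rng.normal(size=40)))
print(fedavg(clients, np.zeros(2)))  # approaches w_true
```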
- PAR ID: 10611488
- Publisher / Repository: ACM
- Date Published:
- Journal Name: ACM Transactions on Embedded Computing Systems
- Volume: 24
- Issue: 3
- ISSN: 1539-9087
- Page Range / eLocation ID: 1 to 34
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Realizing increasingly complex artificial intelligence (AI) functionalities directly on edge devices calls for unprecedented energy efficiency in edge hardware. Compute-in-memory (CIM) based on resistive random-access memory (RRAM) [1] promises to meet this demand by storing AI model weights in dense, analogue, non-volatile RRAM devices and by performing AI computation directly within the RRAM, thus eliminating power-hungry data movement between separate compute and memory units [2–5]. Although recent studies have demonstrated in-memory matrix-vector multiplication on fully integrated RRAM-CIM hardware [6–17], it remains a goal for an RRAM-CIM chip to simultaneously deliver high energy efficiency, the versatility to support diverse models, and software-comparable accuracy. Although efficiency, versatility, and accuracy are all indispensable for broad adoption of the technology, the interrelated trade-offs among them cannot be addressed by isolated improvements at any single abstraction level of the design. Here, by co-optimizing across all hierarchies of the design, from algorithms and architecture to circuits and devices, we present NeuRRAM, an RRAM-based CIM chip that simultaneously delivers versatility in reconfiguring CIM cores for diverse model architectures, energy efficiency two times better than previous state-of-the-art RRAM-CIM chips across various computational bit-precisions, and inference accuracy comparable to software models quantized to four-bit weights across various AI tasks, including 99.0% accuracy on MNIST [18] and 85.7% on CIFAR-10 [19] image classification, 84.7% accuracy on Google speech command recognition [20], and a 70% reduction in image-reconstruction error on a Bayesian image-recovery task. (A toy sketch of noisy in-memory matrix-vector multiplication follows this list.)
- The National Ecological Observatory Network (NEON) is a continental-scale observatory with sites across the US collecting standardized ecological observations; it will operate for multiple decades. To maximize the utility of NEON data, we envision edge computing systems that gather, calibrate, aggregate, and ingest measurements in an integrated fashion. Edge systems will employ machine learning methods to cross-calibrate, gap-fill, and provision data in near-real time to the NEON Data Portal and to high-performance computing (HPC) systems running ensembles of Earth system models (ESMs) that assimilate the data. For the first time, gridded EC data products and response functions promise to offset pervasive observational biases by evaluating, benchmarking, optimizing parameters, and training new machine learning parameterizations within ESMs, all at the same model-grid scale. Leveraging open-source software for EC data analysis, we are already building software infrastructure to integrate near-real-time data streams into the International Land Model Benchmarking (ILAMB) package for use by the wider research community. We present a perspective on the design and integration of end-to-end infrastructure for data acquisition, edge computing, HPC simulation, analysis, and validation, where artificial intelligence (AI) approaches are used throughout the distributed workflow to improve accuracy and computational performance. (A toy gap-filling sketch follows this list.)
- This review explores the intersection of bio-plausible artificial intelligence, in the form of spiking neural networks (SNNs), with the analog in-memory computing (IMC) domain, highlighting their collective potential for low-power edge computing environments. Through detailed investigation at the device, circuit, and system levels, we highlight the pivotal synergies between SNNs and IMC architectures. We also emphasize the critical need for comprehensive system-level analyses that consider the interdependencies among algorithm, device, circuit, and system parameters, which are crucial for optimal performance. An in-depth analysis identifies key system-level bottlenecks arising from device limitations, which can be addressed with SNN-specific algorithm-hardware co-design techniques. This review underscores the imperative of holistic device-to-system design-space co-exploration, highlighting the critical hardware and algorithm research needed for low-power neuromorphic solutions. (A minimal spiking-neuron sketch follows this list.)
- Problems arising in Earth's mantle convection involve finding the solution to Stokes systems with large viscosity contrasts. These systems contain localized features that, even with adaptive mesh refinement, result in linear systems on the order of 10⁹ or more unknowns. One common approach to preconditioning the velocity block of these systems is to apply an algebraic multigrid (AMG) V-cycle (as is done in the ASPECT software, for example); however, we find that AMG lacks robustness with respect to problem size and the number of parallel processes, and we see iteration counts increase with refinement when using AMG. In contrast, the geometric multigrid (GMG) method, by using information about the geometry of the problem, offers a more robust option. Here we present a matrix-free GMG V-cycle that works on adaptively refined, distributed meshes, and we compare it against the current AMG preconditioner (Trilinos ML) used in the ASPECT software. We demonstrate the robustness of GMG with respect to problem size and show scaling up to 114,688 cores and 217 billion unknowns. All computations are run using the open-source finite element library deal.II. (A two-grid V-cycle sketch follows this list.)
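For the NeuRRAM abstract above, a toy simulation of analog in-memory matrix-vector multiplication shows why device noise and quantization create the efficiency/accuracy tension the chip addresses. The 4-bit conductance quantization and Gaussian read-noise level are assumptions for illustration only, not NeuRRAM's measured device characteristics.

```python
# Sketch of analog in-memory MVM: weights live in a crossbar as
# conductances, and read noise perturbs every product. The 4-bit
# quantization and noise level are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(1)

def quantize(w, bits=4):
    """Snap weights to the discrete conductance levels a cell can store."""
    levels = 2 ** bits - 1
    scale = np.abs(w).max() / (levels / 2)
    return np.round(w / scale) * scale

def analog_mvm(G, x, noise=0.02):
    """Crossbar MVM: per-column currents sum the products; additive
    Gaussian noise stands in for device-to-device variation."""
    return (G + noise * rng.normal(size=G.shape)) @ x

W = rng.normal(size=(8, 16))   # layer weights programmed as conductances
x = rng.normal(size=16)        # input activations driven on the rows
y_exact = W @ x
y_analog = analog_mvm(quantize(W), x)
rel_err = np.linalg.norm(y_analog - y_exact) / np.linalg.norm(y_exact)
print(f"relative error from 4-bit quantization + noise: {rel_err:.3f}")
```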
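For the NEON abstract above, a toy gap-filling step illustrates the kind of imputation an edge pipeline might perform. Real NEON processing uses far richer machine learning models; the synthetic temperature/flux data and the simple linear fit here are purely illustrative.

```python
# Toy gap-filling sketch: predict missing sensor values from a
# correlated covariate with a linear fit. Data is synthetic.
import numpy as np

rng = np.random.default_rng(2)
temp = rng.normal(20, 5, size=200)               # covariate stream
flux = 0.8 * temp + rng.normal(0, 1, size=200)   # target with gaps
mask = rng.random(200) < 0.1                     # ~10% of samples missing
observed = ~mask

# Fit on observed pairs, then impute the gaps in near-real time.
a, b = np.polyfit(temp[observed], flux[observed], 1)
flux_filled = flux.copy()
flux_filled[mask] = a * temp[mask] + b
print(f"filled {mask.sum()} gaps with slope {a:.2f}, intercept {b:.2f}")
```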
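For the SNN/IMC review above, a minimal leaky integrate-and-fire (LIF) neuron shows the sparse, event-driven behavior that makes SNNs attractive for low-power hardware. The time constant, threshold, and input statistics are illustrative choices, not values from the review.

```python
# Minimal LIF neuron: the membrane potential leaks toward rest,
# integrates input current, and emits a spike on crossing a threshold.
import numpy as np

def lif(current, tau=20.0, v_th=1.0, v_reset=0.0, dt=1.0):
    """Leaky integrate-and-fire over a sequence of input currents."""
    v, spikes = 0.0, []
    for i in current:
        v += dt * (-v / tau + i)   # leaky integration step
        if v >= v_th:
            spikes.append(1)       # spike event, then reset
            v = v_reset
        else:
            spikes.append(0)
    return np.array(spikes)

rng = np.random.default_rng(3)
inp = 0.08 + 0.04 * rng.random(200)   # noisy constant drive
spikes = lif(inp)
# Output is sparse and binary, which is what IMC hardware exploits.
print("spikes:", spikes.sum(), "of", spikes.size, "time steps")
```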
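For the mantle-convection abstract above, a two-grid V-cycle on the 1D Poisson problem illustrates the geometric multigrid idea: smooth, restrict the residual to a coarser grid, solve there, prolong the correction back, and smooth again. The paper's matrix-free GMG for adaptive, distributed Stokes systems in deal.II is far more involved; this sketch only shows the core cycle.

```python
# Two-grid V-cycle for -u'' = f on [0, 1] with zero Dirichlet BCs.
import numpy as np

def jacobi(u, f, h, iters=3, w=2 / 3):
    """Damped Jacobi smoothing on the interior points."""
    for _ in range(iters):
        u[1:-1] = (1 - w) * u[1:-1] + w * 0.5 * (u[:-2] + u[2:]
                                                 + h * h * f[1:-1])
    return u

def residual(u, f, h):
    r = np.zeros_like(u)
    r[1:-1] = f[1:-1] - (2 * u[1:-1] - u[:-2] - u[2:]) / (h * h)
    return r

def restrict(r):
    """Full-weighting restriction onto every other grid point."""
    rc = np.zeros((len(r) + 1) // 2)
    rc[1:-1] = 0.25 * r[1:-2:2] + 0.5 * r[2:-1:2] + 0.25 * r[3::2]
    return rc

def v_cycle(u, f, h):
    u = jacobi(u, f, h)                  # pre-smooth
    rc = restrict(residual(u, f, h))     # residual to the coarse grid
    m = len(rc)
    # Direct solve of the coarse-grid correction equation (spacing 2h).
    A = (np.diag(2.0 * np.ones(m - 2)) - np.diag(np.ones(m - 3), 1)
         - np.diag(np.ones(m - 3), -1)) / (2 * h) ** 2
    e = np.zeros(m)
    e[1:-1] = np.linalg.solve(A, rc[1:-1])
    # Prolong the correction back to the fine grid and apply it.
    u = u + np.interp(np.arange(len(u)), np.arange(0, len(u), 2), e)
    return jacobi(u, f, h)               # post-smooth

n, h = 65, 1.0 / 64
f, u = np.ones(n), np.zeros(n)
for k in range(8):
    u = v_cycle(u, f, h)
    print(k, np.linalg.norm(residual(u, f, h)))  # drops each cycle
```

The residual norm shrinks by a roughly constant factor per cycle independent of iteration count, which is the mesh-independent robustness the paper seeks from GMG at vastly larger scale.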