This content will become publicly available on August 4, 2026

Title: Toward Weight Sharing Paradigm for Efficient AI: Training and Inference Serving
Deep neural networks are increasingly required to operate across diverse hardware platforms, latency constraints, and power budgets, which motivates the need for specialized models for each scenario. However, designing and training a separate model per scenario or serving a large ensemble of models is often impractical. Weight sharing has emerged as a promising paradigm to address this challenge by training a single "SuperNet" that subsumes many sub-models (SubNets), and by reusing weights across those SubNets at both training and inference time. This paper provides an abridged survey of our recent advances that leverage weight sharing for efficient AI, covering both training and inference serving. In centralized once-for-all training, Delayed ε-Shrinking (DεS) improves training efficiency by strategically scheduling the introduction of smaller SubNets during training. In the federated setting, SuperFedNas co-trains a SuperNet across distributed clients and decouples training from search, which enables one-shot specialization to many deployment targets at minimal cost. ∇QDARTS integrates quantization into differentiable architecture search, jointly finding neural architectures, weights, and low-precision settings to yield highly efficient models in a single search. For inference serving, SuperServe introduces a weight-shared model with dynamic SubNet routing (SubNetAct) to instantaneously switch among a spectrum of accuracy-latency operating points, coupled with a scheduler (SlackFit) for unpredictable workloads. Finally, SUSHI co-designs model, system, and accelerator to exploit weight-shared SuperNets on tinyML devices, caching SubGraphs on FPGA to reduce latency and energy. Together, these works demonstrate that the weight sharing paradigm can dramatically improve the efficiency of both training and inference serving of deep models across a range of scenarios.
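The core idea in the abstract, many SubNets reusing the weights of one SuperNet, can be illustrated with a toy sketch. The code below is not taken from the paper and does not reproduce DεS, SuperFedNas, SubNetAct, or SlackFit. It assumes a single fully connected layer in which a SubNet of a given width uses the first rows of the shared weight matrix, and a hypothetical table of profiled latencies per width. The names `SuperLinear` and `pick_width` are illustrative only.

```python
import random


class SuperLinear:
    """Toy fully connected layer; each SubNet reuses a prefix of the shared weights."""

    def __init__(self, in_features, max_out_features, seed=0):
        rng = random.Random(seed)
        # One weight matrix is stored once and shared by every SubNet.
        self.weight = [
            [rng.uniform(-1.0, 1.0) for _ in range(in_features)]
            for _ in range(max_out_features)
        ]

    def forward(self, x, width):
        """Run the SubNet that keeps only the first `width` output units."""
        if not 1 <= width <= len(self.weight):
            raise ValueError("width must be between 1 and max_out_features")
        return [
            sum(w * xi for w, xi in zip(row, x))
            for row in self.weight[:width]
        ]


def pick_width(latency_ms_by_width, budget_ms):
    """Pick the widest SubNet whose (hypothetical) profiled latency fits the budget.

    Falls back to the narrowest SubNet when nothing fits.
    """
    feasible = [w for w, lat in latency_ms_by_width.items() if lat <= budget_ms]
    return max(feasible) if feasible else min(latency_ms_by_width)


layer = SuperLinear(in_features=4, max_out_features=8)
x = [1.0, 2.0, 3.0, 4.0]

small = layer.forward(x, width=2)  # narrow SubNet
full = layer.forward(x, width=8)   # the full SuperNet layer

# Assumed latency profile (milliseconds) for three SubNet widths.
profile = {2: 1.0, 4: 2.0, 8: 4.0}
chosen = pick_width(profile, budget_ms=2.5)
```

Because the narrow SubNet reads the same stored rows as the full layer, its outputs equal the first entries of the full layer's outputs, and switching operating points requires no additional parameters to be loaded. Selecting the widest SubNet that fits a latency budget is only one simple policy, chosen here for illustration.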
Award ID(s):
2420977
PAR ID:
10656319
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Association for Computing Machinery
Date Published:
Journal Name:
ACM SIGOPS Operating Systems Review
Volume:
59
Issue:
1
ISSN:
0163-5980
Page Range / eLocation ID:
34 to 45
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The increasing popularity of deep learning models has created new opportunities for developing AI-based recommender systems. Designing recommender systems using deep neural networks requires careful architecture design, and further optimization demands extensive co-design efforts on jointly optimizing model architecture and hardware. Design automation, such as Automated Machine Learning (AutoML), is necessary to fully exploit the potential of recommender model design, including model choices and model-hardware co-design strategies. We introduce a novel paradigm that utilizes weight sharing to explore abundant solution spaces. Our paradigm creates a large supernet to search for optimal architectures and co-design strategies to address the challenges of data multi-modality and heterogeneity in the recommendation domain. From a model perspective, the supernet includes a variety of operators, dense connectivity, and dimension search options. From a co-design perspective, it encompasses versatile Processing-In-Memory (PIM) configurations to produce hardware-efficient models. Our solution space’s scale, heterogeneity, and complexity pose several challenges, which we address by proposing various techniques for training and evaluating the supernet. Our crafted models show promising results on three Click-Through Rates (CTR) prediction benchmarks, outperforming both manually designed and AutoML-crafted models with state-of-the-art performance when focusing solely on architecture search. From a co-design perspective, we achieve 2 × FLOPs efficiency, 1.8 × energy efficiency, and 1.5 × performance improvements in recommender models. 
  2. Tiny machine learning (TinyML) applications increasingly operate in dynamically changing deployment scenarios, requiring optimization for both accuracy and latency. Existing methods mainly target a single point in the accuracy/latency tradeoff space, which is insufficient as no single static point can be optimal under variable conditions. We draw on a recently proposed weight-shared SuperNet mechanism to enable serving a stream of queries that activates different SubNets within a SuperNet. This creates an opportunity to exploit the inherent temporal locality of different queries that use the same SuperNet. We propose a hardware–software co-design called SUSHI that introduces a novel SubGraph Stationary optimization. SUSHI consists of a novel field-programmable gate array implementation and a software scheduler that controls which SubNets to serve and which SubGraph to cache in real time. SUSHI yields up to a 32% improvement in latency, a 0.98% increase in served accuracy, and up to 78.7% savings in off-chip energy across several neural network architectures.
  3. The rise of deep neural networks offers new opportunities in optimizing recommender systems. However, optimizing recommender systems using deep neural networks requires delicate architecture fabrication. We propose NASRec, a paradigm that trains a single supernet and efficiently produces abundant models/sub-architectures by weight sharing. To overcome the data multi-modality and architecture heterogeneity challenges in the recommendation domain, NASRec establishes a large supernet (i.e., search space) to search the full architectures. The supernet incorporates a versatile choice of operators and dense connectivity to minimize the human effort needed to find priors. The scale and heterogeneity in NASRec impose several challenges, such as training inefficiency, operator imbalance, and degraded rank correlation. We tackle these challenges by proposing single-operator any-connection sampling, operator-balancing interaction modules, and post-training fine-tuning. Our crafted models, NASRecNet, show promising results on three Click-Through Rates (CTR) prediction benchmarks, indicating that NASRec outperforms both manually designed models and existing NAS methods with state-of-the-art performance. Our work is publicly available here.
  4. A growing number of applications need to support a library of models with diverse latency-accuracy trade-offs on a Pareto frontier, especially in the health-care domain. This work presents an end-to-end system for training and serving weight-sharing models. On the training end, we leverage recent research in creating a family of models on the latency-accuracy Pareto frontier that share weights, reducing the total number of unique parameters. On the serving (inference) end, we propose a novel accelerator, FastSwitch, that exploits weight reuse across different models, thereby providing fast real-time switching between different models.
  5. Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolution layers (CNN) and transformers (ViT), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can present system challenges for latency and energy efficiency due to their diverse nature in dataflow and memory access patterns. In this work, we leverage architecture heterogeneity from Neural Processing Units (NPU) and Compute-In-Memory (CIM) and explore diverse execution schemas to efficiently execute these hybrid models. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework to automate the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training in our NAS framework to resolve gradient conflicts between sampled subnets caused by different types of blocks in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built with NPU performance results measured on real silicon, and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN-ViT models with fine granularity and achieves significant (up to 1.34%) top-1 accuracy improvement on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% overall latency and 41.72% energy improvements by introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems. 