Title: Cross-Feature Transfer Learning for Efficient Tensor Program Generation
Tuning tensor program generation involves navigating a vast search space to find optimal program transformations and measurements for a program on the target hardware. The complexity of this process is further amplified by the exponentially many combinations of transformations, especially in heterogeneous environments. This research addresses these challenges by introducing a novel approach that learns the joint neural-network and hardware feature space, facilitating knowledge transfer to new, unseen target hardware. A comprehensive analysis is conducted on the existing state-of-the-art dataset, TenSet, including a thorough examination of test-split strategies and the proposal of methodologies for dataset pruning. Leveraging an attention-inspired technique, we tailor the tuning of tensor programs to embed both neural-network and hardware-specific features. Notably, our approach reduces the dataset size by up to 53% compared to the baseline without compromising Pairwise Comparison Accuracy (PCA). Furthermore, our proposed methodology achieves competitive or improved mean inference times with only 25–40% of the baseline tuning time across various networks and target hardware. The attention-based tuner can effectively utilize schedules learned from previous hardware program measurements to optimize tensor program tuning on previously unseen hardware, achieving a top-5 accuracy exceeding 90%. This research represents a significant advancement in autotuning tensor program generation, addressing the complexities of heterogeneous environments and showing promising results in both efficiency and accuracy.
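The headline metric, Pairwise Comparison Accuracy (PCA), scores how often a learned cost model ranks two candidate tensor programs in the same order as their measured run times. The abstract does not spell out its exact formulation, so the sketch below uses the common pairwise definition; the function name and the example latencies are illustrative assumptions, not the paper's code.

```python
import itertools
import numpy as np

def pairwise_comparison_accuracy(predicted, measured):
    """Fraction of candidate pairs whose relative order the cost model predicts correctly."""
    predicted = np.asarray(predicted, dtype=float)
    measured = np.asarray(measured, dtype=float)
    correct, total = 0, 0
    for i, j in itertools.combinations(range(len(measured)), 2):
        if measured[i] == measured[j]:
            continue  # ties carry no ordering information
        total += 1
        if (predicted[i] - predicted[j]) * (measured[i] - measured[j]) > 0:
            correct += 1
    return correct / total if total else float("nan")

# A model that preserves the ranking of four candidate schedules scores 1.0.
print(pairwise_comparison_accuracy([1.2, 0.8, 2.0, 1.5], [10.0, 7.0, 21.0, 14.0]))
```

Because the metric only depends on relative ordering of candidate schedules, a pruned training set can leave PCA unchanged even though absolute cost predictions shift, which is consistent with the 53% dataset reduction reported above.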
Award ID(s):
2211982 2211983
PAR ID:
10536154
Publisher / Repository:
MDPI
Date Published:
Journal Name:
Applied Sciences
Volume:
14
Issue:
2
ISSN:
2076-3417
Page Range / eLocation ID:
513
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Throughout its lifecycle, an LLM incurs significantly higher carbon emissions during inference than during training. Inference requests vary in batch size, prompt length, and the number of generated tokens, while cloud providers deploy heterogeneous GPU configurations to meet diverse service-level objectives. Unlike training, inference exhibits lower and highly variable hardware utilization, making equation-based carbon models unreliable. Existing network-based estimators also lack accuracy, as they fail to account for the distinct prefill and decode phases, hardware-specific features, and realistic request distributions. We propose LLMCO2, a graph neural network (GNN)-based model, which improves the accuracy of LLM inference carbon footprint estimation by ~67% over prior approaches. Source code is available at https://github.com/fuzhenxiao/LLMCO2
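LLMCO2 itself is a GNN and is not reproduced here; the toy sketch below only illustrates the point the abstract makes about treating prefill and decode as separate phases before converting energy to operational carbon. The per-token energy figures, the grid-intensity default, and the function name are invented for illustration.

```python
def inference_carbon_gco2(prefill_energy_j, decode_energy_j, grid_intensity_gco2_per_kwh=400.0):
    """Convert per-phase inference energy (joules) into operational carbon (gCO2e)."""
    kwh = (prefill_energy_j + decode_energy_j) / 3.6e6  # 1 kWh = 3.6e6 J
    return kwh * grid_intensity_gco2_per_kwh

# Featurize a request per phase: prefill cost scales with prompt length,
# decode cost with the number of generated tokens (per-token joules are made up).
prompt_tokens, gen_tokens, batch = 512, 128, 4
prefill_j = 0.02 * prompt_tokens * batch  # hypothetical J/token during prefill
decode_j = 0.15 * gen_tokens * batch      # hypothetical J/token during decode
print(f"{inference_carbon_gco2(prefill_j, decode_j):.4f} gCO2e")
```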
  2. Piracy and overproduction of hardware intellectual properties are growing concerns for the semiconductor industry under the fabless paradigm. Although chip designers have attempted to secure their designs against these threats by means of logic locking and obfuscation, the increasing number of powerful oracle-guided attacks makes it ever harder to evaluate the security of a design and its associated overhead. In particular, as many so-called "provable" logic locking techniques are exposed to a novel attack surface, overcoming these attacks may impose a huge overhead on the circuit. Thus, in this paper, after investigating the shortcomings of state-of-the-art graph neural network models in logic locking and refuting the use of Hamming distance as a proper key-accuracy metric, we employ two machine learning models: a decision tree to predict the security degree of locked benchmarks, and a convolutional neural network to assign a low-overhead, secure locking scheme to a given circuit. We first build multi-label datasets by running different attacks on benchmarks locked with existing logic locking methods, to evaluate their security and compute the imposed area overhead. Then, we design and train a decision tree model to learn the features of the created dataset and predict the security degree of each given locked circuit. Furthermore, we utilize a convolutional neural network model to extract more features, obtain higher accuracy, and account for overhead. Finally, we test our trained models against unseen benchmarks. The experimental results reveal that the convolutional neural network model is better at extracting features from unseen, large datasets, which makes it well suited to assigning secure, low-overhead logic locking to a given netlist.
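As a rough illustration of the first model described above, the sketch below trains a scikit-learn decision tree on placeholder features of locked netlists, with a binary resisted/broken label standing in for the paper's security degree. The feature set, labels, and data are hypothetical, not the authors' dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
# Hypothetical per-netlist features: key length, gate count, locking-scheme id,
# fraction of key gates on critical paths, normalized area overhead.
X = rng.random((200, 5))
# Hypothetical label standing in for "security degree": 1 = resisted the attack, 0 = broken.
y = rng.integers(0, 2, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```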
  3. High-performance kernel libraries are critical to exploiting accelerators and specialized instructions in many applications. Because compilers are difficult to extend to support diverse and rapidly evolving hardware targets, and automatic optimization is often insufficient to guarantee state-of-the-art performance, these libraries are commonly still coded and optimized by hand, at great expense, in low-level C and assembly. To better support development of high-performance libraries for specialized hardware, we propose a new programming language, Exo, based on the principle of exocompilation: externalizing target-specific code generation support and optimization policies to user-level code. Exo allows custom hardware instructions, specialized memories, and accelerator configuration state to be defined in user libraries. It builds on the idea of user scheduling to externalize hardware mapping and optimization decisions. Schedules are defined as composable rewrites within the language, and we develop a set of effect analyses which guarantee program equivalence and memory safety through these transformations. We show that Exo enables rapid development of state-of-the-art matrix-matrix multiply and convolutional neural network kernels, for both an embedded neural accelerator and x86 with AVX-512 extensions, in a few dozen lines of code each.
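The sketch below is ordinary Python/NumPy, not Exo syntax; it only illustrates the underlying idea that a schedule is a semantics-preserving rewrite of a loop nest, with a brute-force numerical check standing in for the static effect analyses the abstract describes.

```python
import numpy as np

def matmul_naive(A, B):
    """Reference loop nest: the algorithm before any scheduling."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            for k in range(K):
                C[i, j] += A[i, k] * B[k, j]
    return C

def matmul_tiled(A, B, tile=4):
    """The same computation after a split-and-reorder rewrite of the i/j loops."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for i in range(i0, min(i0 + tile, M)):
                for j in range(j0, min(j0 + tile, N)):
                    for k in range(K):
                        C[i, j] += A[i, k] * B[k, j]
    return C

# Stand-in for a static effect analysis: confirm the rewrite preserved semantics.
A, B = np.random.rand(6, 5), np.random.rand(5, 7)
assert np.allclose(matmul_naive(A, B), matmul_tiled(A, B))
```

In Exo the rewrite would be expressed as user-level scheduling code and the equivalence guaranteed statically rather than by testing.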
  4.
    To deploy powerful deep neural networks (DNNs) on smart but resource-limited IoT devices, many prior works have proposed compressing DNNs to reduce network size and computation complexity with negligible accuracy degradation, e.g., weight quantization, network pruning, and convolution decomposition. However, conventional DNN compression methods generate a smaller but fixed network from a relatively large background model to achieve acceleration on resource-limited hardware, and such optimization lacks the ability to adjust the network structure in real time to adapt to dynamic computing-resource allocation and workloads. In this paper, we mainly review our two prior works [13], [15] that tackle this challenge, discussing how to construct a dynamic DNN by means of either uniform or non-uniform sub-net generation methods. Moreover, to generate multiple non-uniform sub-nets, [15] needs to fully retrain the background model for each sub-net individually, referred to as the multi-path method. To reduce the training cost, in this work we further propose a single-path sub-net generation method that can sample multiple sub-nets in different epochs within one training round. The constructed dynamic DNN, consisting of multiple sub-nets, provides the ability to trade off inference accuracy and latency at run time according to hardware resources and environment requirements. Finally, we study dynamic DNNs with different sub-net generation methods on both the CIFAR-10 and ImageNet datasets, and present run-time tuning of accuracy and latency on both GPU and CPU.
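As a toy illustration of the single-path idea described above, the PyTorch sketch below samples a different hidden-layer width in each epoch of a single training round. The model, width ratios, and data are placeholders, not the uniform/non-uniform sub-net generation methods of [13], [15].

```python
import torch
import torch.nn as nn

class SlimmableMLP(nn.Module):
    """Toy model whose hidden layer can run at a fraction of its full width."""
    def __init__(self, in_dim=32, hidden=256, out_dim=10):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)

    def forward(self, x, width_ratio=1.0):
        h = torch.relu(self.fc1(x))
        keep = int(h.shape[1] * width_ratio)
        mask = torch.zeros_like(h)
        mask[:, :keep] = 1.0          # zero out the channels pruned by this sub-net
        return self.fc2(h * mask)

model = SlimmableMLP()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
ratios = [0.25, 0.5, 0.75, 1.0]

for epoch in range(4):
    ratio = ratios[epoch % len(ratios)]            # one sub-net width sampled per epoch
    x, y = torch.randn(64, 32), torch.randint(0, 10, (64,))
    loss = loss_fn(model(x, width_ratio=ratio), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    print(f"epoch {epoch}: width ratio {ratio:.2f}, loss {loss.item():.3f}")
```

At deployment, the same weights can then be run at whichever width ratio fits the currently available hardware budget, which is the accuracy/latency trade-off the abstract refers to.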
  5. Accurate modeling of the gain spectrum in erbium-doped fiber amplifiers (EDFAs) is essential for optimizing optical network performance, particularly as networks evolve toward multi-vendor solutions. In this work, we propose a generalized few-shot transfer learning architecture based on a semi-supervised self-normalizing neural network (SS-NN) that leverages internal EDFA features—such as VOA input/output power and attenuation—to improve gain spectrum prediction. Our SS-NN model employs a two-phase training strategy comprising unsupervised pre-training with noise-augmented measurements and supervised fine-tuning with a custom-weighted MSE loss. Furthermore, we extend the framework with transfer learning (TL) techniques that enable both homogeneous (same feature space) and heterogeneous (different feature sets) model adaptation across booster, pre-amplifier, and ILA EDFAs. To address feature mismatches in heterogeneous TL, we incorporate a covariance matching loss to align second-order feature statistics between the source and target domains. Extensive experiments conducted across 26 EDFAs in the COSMOS and Open Ireland testbeds demonstrate that the proposed approach significantly reduces the number of measurements required on the system while achieving lower mean absolute errors and improved error distributions compared to benchmark methods.
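The covariance matching loss is described only at a high level above; the minimal sketch below follows the standard CORAL-style formulation, which is an assumption rather than necessarily the paper's exact loss, with randomly generated stand-ins for source- and target-domain EDFA features.

```python
import torch

def covariance_matching_loss(source_feats, target_feats):
    """CORAL-style loss aligning second-order statistics of two feature batches.

    Both inputs are (batch, feature_dim) tensors living in a shared embedding space.
    """
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.shape[0] - 1)

    d = source_feats.shape[1]
    diff = covariance(source_feats) - covariance(target_feats)
    return (diff ** 2).sum() / (4 * d * d)

# Example: embedded features from a booster-EDFA model (source) vs. an ILA-EDFA model (target).
src = torch.randn(128, 16)
tgt = torch.randn(96, 16) * 2.0 + 1.0
print(covariance_matching_loss(src, tgt))
```

Adding this term to the fine-tuning objective penalizes mismatched feature covariances, which is how the second-order statistics of the two domains are aligned during heterogeneous transfer.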