Title: Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
This paper investigates how to efficiently deploy vision transformers on edge devices for small workloads. Recent methods reduce the latency of transformer neural networks by removing or merging tokens, with small accuracy degradation. However, these methods are not designed with edge deployment in mind: they do not leverage information about latency-workload trends to improve efficiency. We address this shortcoming in our work. First, we identify factors that affect ViT latency-workload relationships. Second, we determine a token pruning schedule by leveraging non-linear latency-workload relationships. Third, we demonstrate a training-free token pruning method utilizing this schedule. We show other methods may increase latency by 2-30%, while we reduce latency by 9-26%. For similar latency (within 5.2% or 7 ms) across devices, we achieve 78.6%-84.5% ImageNet1K accuracy, while the state of the art, Token Merging, achieves 45.8%-85.4%.
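The core observation above is that ViT latency does not scale smoothly with token count on accelerators; it rises in steps, so pruning just one more token past a step boundary can yield an outsized latency saving. Below is a minimal Python sketch of that idea, assuming offline latency profiling, a step-detection tolerance, and CLS-attention token scoring; these specifics are illustrative assumptions, not the paper's exact method.

import torch

def pick_token_budget(latency_by_count, max_count, tol_ms=0.5):
    """Pick the largest token count that sits below the next latency 'step'.

    latency_by_count: {num_tokens: latency_ms}, measured offline on the
    target edge device. A jump larger than tol_ms marks a step boundary.
    """
    counts = sorted(latency_by_count)
    budget = counts[0]
    for lo, hi in zip(counts, counts[1:]):
        if hi > max_count:
            break
        if latency_by_count[hi] - latency_by_count[lo] > tol_ms:
            return lo                      # stay just below the step
        budget = hi
    return budget

@torch.no_grad()
def prune_tokens(tokens, cls_attn, budget):
    """Keep the CLS token plus the (budget - 1) patch tokens it attends to most.

    tokens:   (B, N, D) with the CLS token at index 0.
    cls_attn: (B, N-1) attention weights from CLS to each patch token.
    """
    keep = cls_attn.topk(budget - 1, dim=1).indices + 1    # skip CLS at 0
    keep = torch.cat([torch.zeros_like(keep[:, :1]), keep], dim=1)
    return tokens.gather(1, keep.unsqueeze(-1).expand(-1, -1, tokens.size(-1)))

# Example: profiled latencies show a step between 160 and 176 tokens,
# so the schedule settles on a budget of 160.
lat = {128: 4.1, 144: 4.2, 160: 4.3, 176: 5.6, 192: 5.7}
print(pick_token_budget(lat, max_count=197))               # -> 160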
Award ID(s):
2104709
PAR ID:
10638751
Author(s) / Creator(s):
Publisher / Repository:
The Computer Vision Foundation.
Date Published:
Subject(s) / Keyword(s):
computer vision, token pruning
Format(s):
Medium: X
Location:
Tucson Arizona
Sponsoring Org:
National Science Foundation
More Like this
1. With increasingly deployed cameras and rapid advances in computer vision, large-scale live video analytics becomes feasible. However, analyzing videos is compute-intensive, and live video analytics must be performed in real time. In this paper, we design an edge server system for live video analytics. We propose to perform configuration adaptation without profiling videos online, selecting configurations with a prediction model based on object-movement features. In addition, we reduce latency through resource orchestration on the video analytics servers. The key idea of resource orchestration is to batch inference tasks that use the same CNN model and to schedule tasks based on a priority value that estimates their impact on total latency. We evaluate our system with two video analytics applications, road traffic monitoring and pose detection. The experimental results show that our profiling-free adaptation reduces the workload by 80% compared with state-of-the-art adaptation without lowering accuracy. The average serving latency is reduced by up to 95% compared with profiling-based adaptation.
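As a concrete illustration of the orchestration idea in the first abstract, the sketch below batches queued frames per CNN model and serves the queue with the highest estimated impact on total latency first. The priority formula (oldest waiting time times queue length) and the class layout are illustrative assumptions, not the paper's implementation.

import time
from collections import defaultdict

class BatchScheduler:
    """Toy scheduler: one queue per CNN model; each round, the queue with the
    highest estimated latency impact is executed as one batched inference."""

    def __init__(self, models):
        self.models = models                  # {model_name: batch-inference fn}
        self.queues = defaultdict(list)       # {model_name: [(arrival_ts, frame)]}

    def submit(self, model_name, frame):
        self.queues[model_name].append((time.time(), frame))

    def _priority(self, name):
        # Older and longer queues hurt total latency most, so serve them first;
        # this also keeps one busy model from starving the others.
        queue = self.queues[name]
        oldest_wait = time.time() - queue[0][0]
        return oldest_wait * len(queue)

    def step(self):
        """Run one scheduling round; returns the batch results, or None."""
        pending = [name for name, queue in self.queues.items() if queue]
        if not pending:
            return None
        name = max(pending, key=self._priority)
        batch = [frame for _, frame in self.queues.pop(name)]
        return self.models[name](batch)       # one batched CNN inference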
2. Efficient deployment of Deep Neural Networks (DNNs) on edge devices (i.e., FPGAs and mobile platforms) is very challenging, especially given the recent growth in DNN model size and complexity. Model compression strategies, including weight quantization and pruning, are widely recognized as effective ways to significantly reduce computation and memory intensity, and have been applied to many DNNs on edge devices. However, most state-of-the-art work focuses on ad-hoc optimizations, and a thorough study comprehensively revealing the potentials and constraints of different edge devices under different compression strategies is lacking. In this paper, we qualitatively and quantitatively compare the energy efficiency of FPGA-based and mobile-GPU-based DNN execution and provide a detailed analysis. Based on the observations from this analysis, we propose a unified optimization framework that uses block-based pruning to reduce weight storage and accelerate inference on both mobile devices and FPGAs, achieving high hardware performance and energy-efficiency gains while maintaining accuracy.
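A minimal NumPy sketch of the block-based pruning the second abstract describes, under generic assumptions: the weight matrix is partitioned into fixed-size blocks and the lowest-L2-norm blocks are zeroed, so the surviving nonzeros stay contiguous and hardware-friendly. The block size and magnitude criterion are placeholders, not the framework's configuration.

import numpy as np

def block_prune(weight, block=(4, 4), sparsity=0.5):
    """Zero out the lowest-norm (bh x bw) blocks until `sparsity` is reached."""
    bh, bw = block
    rows, cols = weight.shape
    assert rows % bh == 0 and cols % bw == 0, "pad weights to a block multiple"
    # View the matrix as a grid of blocks and score each block by its L2 norm.
    grid = weight.reshape(rows // bh, bh, cols // bw, bw)
    norms = np.sqrt((grid ** 2).sum(axis=(1, 3)))
    k = int(norms.size * sparsity)
    cutoff = np.sort(norms, axis=None)[k]
    mask = (norms >= cutoff)[:, None, :, None]   # broadcast back to elements
    return (grid * mask).reshape(rows, cols)

# Example: prune a 64x64 layer to 50% block sparsity.
w = np.random.randn(64, 64).astype(np.float32)
w_pruned = block_prune(w, block=(4, 4), sparsity=0.5)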
3. In the IoT and smart systems era, the massive amount of data generated by IoT and smart devices is often sent directly to cloud infrastructure for processing, analysis, and storage. In handling this big data, conventional cloud infrastructure faces many challenges, e.g., scarce bandwidth, high latency, real-time constraints, high power consumption, and privacy issues. Edge-centric computing is emerging as a synergistic solution to these issues by processing and analyzing data closer to its source, at the network's edge. This in turn enables real-time, in-situ data analytics and processing, which is imperative for many real-world IoT and smart systems, such as smart cars. Since edge computing is still in its infancy, innovative solutions, models, and techniques are needed to support real-time, in-situ data processing and analysis on edge computing platforms. In this work, we introduce a novel and efficient FPGA-HLS-based hardware accelerator for a PCA+SVM model for real-time processing and analysis on edge computing platforms, inspired by our previous work on PCA+SVM models for edge computing applications. That work demonstrated that combining principal component analysis (PCA) and support vector machines (SVM) yields high classification accuracy in many fields. Machine learning techniques such as SVM can be used for many edge tasks, e.g., anomaly detection and health monitoring, and dimensionality reduction techniques such as PCA are often used to reduce data size, which in turn is vital for memory-constrained edge devices and platforms. Furthermore, our previous work demonstrated that many traits of FPGAs, including parallel processing, low latency, and stable throughput regardless of workload, make them suitable for real-time processing in edge computing applications. Our proposed FPGA-HLS-based PCA+SVM hardware IP achieves up to 254x speedup over its embedded-software counterpart while meeting the small-area and low-power requirements of edge computing applications. Our experimental results show great potential for FPGA-based architectures to support real-time processing in edge computing applications.
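As a software reference point for the third abstract's PCA+SVM model, here is a scikit-learn sketch; the dataset, component count, and kernel are placeholders rather than the paper's configuration. PCA shrinks each sample before the SVM sees it, which is what makes the combination attractive for memory-constrained edge platforms, and an accelerator like the one described targets this kind of pipeline.

from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# PCA reduces each 64-dim sample to 16 components before classification,
# cutting the memory footprint the SVM must operate on.
model = make_pipeline(PCA(n_components=16), SVC(kernel="rbf"))
model.fit(X_train, y_train)
print(f"test accuracy: {model.score(X_test, y_test):.3f}")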
4. It is challenging to deploy 3D Convolutional Neural Networks (3D CNNs) on mobile devices, especially when both real-time execution and high inference accuracy are required, because the increasingly large model size and complex structure of 3D CNNs usually demand tremendous computation and memory resources. Weight pruning has been proposed to mitigate this challenge. However, existing pruning is either incompatible with modern parallel architectures, resulting in long inference latency, or subject to significant accuracy degradation. This paper proposes Mobile-3DCNN, an end-to-end 3D CNN acceleration framework based on pruning/compilation co-design, consisting of two parts: a novel fine-grained structured pruning enhanced by prune/Winograd adaptive selection (which is mobile-hardware-friendly and achieves high pruning accuracy), and a set of compiler optimization and code generation techniques enabled by our pruning (to fully transform the pruning benefit into real performance gains). The evaluation demonstrates that Mobile-3DCNN outperforms state-of-the-art end-to-end DNN acceleration frameworks that support 3D CNN execution on mobile devices, Alibaba Mobile Neural Networks and PyTorch Mobile, with speedups of up to 34x and minor accuracy degradation, proving that it is possible to execute high-accuracy, large 3D CNNs on mobile devices in real time (or even ultra-real time).
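The fourth abstract's fine-grained structured pruning must stay compiler-friendly; one simplified way to see this is to make every kernel in a 3D convolution layer keep the same small set of positions, so code generation can emit a single dense path for all kernels. The shared-pattern heuristic below is an assumption for illustration, not Mobile-3DCNN's prune/Winograd selection algorithm.

import numpy as np

def pattern_prune_3d(weights, keep=8):
    """weights: (out_ch, in_ch, kd, kh, kw). Keep `keep` positions per kernel,
    using one position pattern shared by every kernel in the layer."""
    out_ch, in_ch, kd, kh, kw = weights.shape
    flat = weights.reshape(out_ch * in_ch, kd * kh * kw)
    # Score each kernel position by its total magnitude across all kernels.
    scores = np.abs(flat).sum(axis=0)
    pattern = np.argsort(scores)[-keep:]          # shared positions to keep
    mask = np.zeros(kd * kh * kw, dtype=flat.dtype)
    mask[pattern] = 1.0
    return (flat * mask).reshape(weights.shape)

# Example: prune a 3x3x3 3D conv layer to 8 of 27 positions per kernel.
w = np.random.randn(16, 8, 3, 3, 3).astype(np.float32)
w_pruned = pattern_prune_3d(w, keep=8)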
5. Binary neural networks (BNNs) substitute complex arithmetic operations with simple bit-wise operations. The binarized weights and activations in BNNs drastically reduce memory requirements and energy consumption, making BNNs attractive for edge ML applications with limited resources. However, the severe memory and energy constraints of low-power edge devices call for reducing BNN models beyond binarization. Weight pruning is a proven way to reduce the size of many neural network (NN) models, but the binary nature of BNN weights makes it difficult to identify insignificant weights to remove. In this paper, we present a pruning method based on latent weights, with layer-level pruning sensitivity analysis, that reduces the over-parameterization of BNNs, allowing for accuracy gains while drastically reducing model size. Our method uses a heuristic that distinguishes weights by their latent weights, the real-valued vector used to compute the pseudogradient during backpropagation. Tested with three different convolutional NNs on the MNIST, CIFAR-10, and Imagenette datasets, it yields a 33%-46% reduction in operation count with no accuracy loss, improving on previous work in accuracy, model size, and total operation count.
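A hypothetical sketch of the fifth abstract's latent-weight heuristic: since binarized weights are all +1 or -1 and carry no magnitude information, the real-valued latent weights behind them are used to rank significance, and the fraction closest to zero is masked out. The per-layer ratio would come from the sensitivity analysis; here it is simply a parameter.

import torch

def prune_by_latent(latent, ratio):
    """latent: real-valued latent weights behind a binarized layer.
    Returns (binary_weights, mask) with the lowest-|latent| fraction removed."""
    k = int(latent.numel() * ratio)
    if k == 0:
        return torch.sign(latent), torch.ones_like(latent)
    threshold = latent.abs().flatten().kthvalue(k).values
    mask = (latent.abs() > threshold).float()
    binary = torch.sign(latent) * mask         # pruned positions contribute 0
    return binary, mask

# Example: prune 40% of a layer whose sensitivity analysis tolerates it.
latent = torch.randn(128, 64)
binary, mask = prune_by_latent(latent, ratio=0.40)
print(f"kept {int(mask.sum())} of {mask.numel()} weights")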