This content will become publicly available on November 7, 2026

Title: Tiny Models, Tough Limits: EdgeTinyBench for Border-Control AI on Ten-Watt Platforms
Emerging applications such as disaster-response drones, in-vehicle assistants, and field medical devices require on-device language intelligence when cloud links are unreliable, privacy is mandatory, and sub-second latency is non-negotiable. We benchmark seven SLMs (DistilBERT, MobileBERT, ALBERT, MiniLM, Phi-3 Mini, MobileLLaMA, and TinyLLaMA) across four mission-aligned use cases (Watchlist Screening, Threat Detection, Document Triage, Multilingual Routing) on five border-relevant datasets (e.g., GTD, FLORES-200). Under controlled edge-like constraints (mobile-class CPU, 1–8 GB shared memory, intermittent networking), we report task quality (accuracy/F1 or ROUGE), batch-1 inference latency, and peak memory, and we introduce a reproducible, edge-budgeted evaluation protocol for security-critical scenarios. We also outline a path to multimodal edge workloads by pairing compact audio/vision encoders with SLM back ends.
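As a rough illustration of the kind of batch-1 latency and peak-memory probe such an edge-budgeted protocol calls for, the sketch below times one of the benchmarked SLM families on CPU. The model id, input text, and repeat count are illustrative assumptions, not the paper's actual harness, and tracemalloc only sees Python-level allocations.

```python
# Hedged sketch of a batch-1 latency / peak-memory probe; model id,
# input, and repetition count are placeholders, not the paper's setup.
import time
import tracemalloc

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_ID = "distilbert-base-uncased"  # stand-in for one of the seven SLMs

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID)
model.eval()

inputs = tokenizer("Traveler document flagged for secondary review.",
                   return_tensors="pt")

tracemalloc.start()
latencies = []
with torch.no_grad():
    for _ in range(20):  # repeat to smooth CPU timing noise
        start = time.perf_counter()
        model(**inputs)
        latencies.append(time.perf_counter() - start)
_, peak_bytes = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"median batch-1 latency: {sorted(latencies)[len(latencies)//2]*1e3:.1f} ms")
print(f"peak traced memory: {peak_bytes/2**20:.1f} MiB (Python-level only)")
```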
Award ID(s):
2244283
PAR ID:
10653860
Publisher / Repository:
Springer Verlag
Subject(s) / Keyword(s):
Small Language Models · Edge AI · On-Device Inference · Quantization · Low-Latency NLP · Security-Critical NLP
Format(s):
Medium: X
Location:
Shanghai, China
Sponsoring Org:
National Science Foundation
More Like this
  1. User-associated contents play an increasingly important role in modern network applications. With growing deployments of edge servers, the capacity of content storage in edge clusters increases significantly, which provides great potential to satisfy content requests with much shorter latency. However, the sheer number of contents also makes it difficult to locate content across edge servers in different locations, because indexing all contents consumes a large amount of DRAM on each edge server. In this work, we explore the opportunity of efficiently indexing user-associated contents and propose a scalable content-sharing mechanism for edge servers, called EdgeCut, that significantly reduces content access latency by allowing many edge servers to share their cached contents. We design a compact and dynamic data structure called Ludo Locator that returns the IP address of the edge server that stores the requested user-associated content. We have implemented a prototype of EdgeCut in a real network environment running in a public geo-distributed cloud. The experiment results show that EdgeCut reduces content access latency by up to 50% and reduces cloud traffic by up to 50% compared to existing solutions. The memory cost is less than 50 MB for 10 million mobile users. Simulations using real network latency data show EdgeCut's advantages over existing solutions at large scale.
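To make the lookup interface concrete, here is a hedged stand-in that maps a content key to the IP address of the edge server holding it. A plain consistent-hash ring substitutes for the paper's compact Ludo Locator structure, and the server IPs and keys are invented for illustration.

```python
# Functional sketch of a content-key -> edge-server-IP locator. The real
# Ludo Locator is a compact dynamic hashing structure; this hash ring is
# only an interface-level stand-in, not the paper's data structure.
import bisect
import hashlib

class HashRingLocator:
    def __init__(self, server_ips):
        self._ring = sorted((self._h(ip), ip) for ip in server_ips)
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _h(s: str) -> int:
        # Stable 64-bit hash of a string.
        return int.from_bytes(hashlib.sha1(s.encode()).digest()[:8], "big")

    def locate(self, content_key: str) -> str:
        """Return the IP of the edge server responsible for content_key."""
        i = bisect.bisect(self._keys, self._h(content_key)) % len(self._ring)
        return self._ring[i][1]

locator = HashRingLocator(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
print(locator.locate("user123/photo.jpg"))  # -> one of the three edge servers
```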
With the emergence of wearable devices and other embedded systems, deploying large language models (LLMs) on edge platforms has become an urgent need. However, this is challenging because of their high computational and memory demands. Although recent low-bit quantization methods (e.g., BitNet, DeepSeek) compress weights to as low as 1.58 bits with minimal accuracy loss, edge deployment is still constrained by limited on-chip resources, power budgets, and the often-neglected long latency of the prefill stage. We present TeLLMe, the first table-lookup-based ternary LLM accelerator for low-power edge FPGAs that fully supports both prefill and autoregressive decoding using 1.58-bit weights and 8-bit activations. TeLLMe incorporates several novel techniques, including (1) a table-lookup-based ternary matrix multiplication (TLMM) engine utilizing grouped activations and online precomputation for low resource utilization and high throughput; (2) a fine-grained analytic URAM-based weight buffer management scheme for efficient loading and compute engine access; (3) a streaming dataflow architecture that fuses floating-point element-wise operations with linear computations to hide latency; (4) a reversed-reordered prefill-stage attention with fused attention operations for high memory efficiency; and (5) a resource-efficient specialized decoding-stage attention. Under a 5 W power budget, TeLLMe delivers up to 25 tokens/s decoding throughput and 0.45–0.96 s time-to-first-token (TTFT) for 64–128-token prompts, marking a significant energy-efficiency advance in LLM inference on edge FPGAs.
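To make the table-lookup idea concrete, the hedged NumPy sketch below shows ternary matmul via online precomputation: for each group of g activations, the dot products against all 3^g ternary patterns are computed once per token, after which every weight row reduces to table lookups. The group size and shapes are illustrative, not TeLLMe's FPGA mapping.

```python
# Hedged sketch of table-lookup ternary matmul (TLMM) with grouped
# activations and online precomputation; dimensions are illustrative.
import itertools
import numpy as np

G = 4                                   # activation group size (assumption)
PATTERNS = np.array(list(itertools.product([-1, 0, 1], repeat=G)))  # 81 x 4

def tlmm(w_ternary: np.ndarray, x_int8: np.ndarray) -> np.ndarray:
    """w_ternary: (out, in) in {-1,0,1}; x_int8: (in,) int8 activations."""
    groups = x_int8.reshape(-1, G).astype(np.int32)       # (in/G, G)
    # Online precomputation: one 81-entry partial-sum table per group.
    tables = groups @ PATTERNS.T                           # (in/G, 81)
    # Encode each ternary weight group as a base-3 table index.
    digits = w_ternary.reshape(w_ternary.shape[0], -1, G) + 1  # in {0,1,2}
    pows = 3 ** np.arange(G - 1, -1, -1)
    keys = (digits * pows).sum(axis=-1)                    # (out, in/G)
    # The matmul becomes a gather plus a sum over groups.
    return tables[np.arange(tables.shape[0]), keys].sum(axis=-1)

# Sanity check against a direct matmul on random data.
rng = np.random.default_rng(0)
W = rng.integers(-1, 2, size=(8, 16))
x = rng.integers(-128, 128, size=16, dtype=np.int8)
assert np.array_equal(tlmm(W, x), W @ x.astype(np.int32))
```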
  3. Smart IoT-based systems often desire continuous execution of multiple latency-sensitive Deep Learning (DL) applications. Edge servers serve as the cornerstone of such IoT-based systems; however, their resource limitations hamper the continuous execution of multiple (multi-tenant) DL applications. The challenge is that DL applications rely on bulky neural network (NN) models that cannot be simultaneously maintained in the limited memory space of the edge. Accordingly, the main contribution of this research is to overcome the memory contention challenge, thereby meeting the latency constraints of the DL applications without compromising their inference accuracy. We propose an efficient NN model management framework, called Edge-MultiAI, that ushers the NN models of the DL applications into the edge memory such that the degree of multi-tenancy and the number of warm starts are maximized. Edge-MultiAI leverages NN model compression techniques, such as model quantization, and dynamically loads NN models for DL applications to stimulate multi-tenancy on the edge server. We also devise a model management heuristic for Edge-MultiAI, called iWS-BFE, that applies Bayesian theory to predict the inference requests of multi-tenant applications and uses the predictions to choose the appropriate NN models for loading, hence increasing the number of warm-start inferences. We evaluate the efficacy and robustness of Edge-MultiAI under various configurations. The results reveal that Edge-MultiAI can increase the degree of multi-tenancy on the edge by at least 2× and the number of warm starts by ≈60% without any major loss in the inference accuracy of the applications.
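As a simplified stand-in for the warm-start idea behind such a heuristic, the sketch below keeps the models most likely to be requested resident in edge memory, evicting the least likely when space runs out. A plain frequency count substitutes for the paper's Bayesian request predictor, and the model names and sizes are invented.

```python
# Simplified warm-start model cache; the frequency-based likelihood below
# is a stand-in for iWS-BFE's Bayesian request prediction, not the paper's
# algorithm. Capacities and model sizes are illustrative.
from collections import Counter

class WarmModelCache:
    def __init__(self, capacity_mb: int, model_sizes_mb: dict):
        self.capacity = capacity_mb
        self.sizes = model_sizes_mb
        self.resident = set()
        self.used = 0
        self.requests = Counter()

    def _evict_least_likely(self):
        # Evict the resident model with the lowest estimated request rate.
        victim = min(self.resident, key=lambda m: self.requests[m])
        self.resident.remove(victim)
        self.used -= self.sizes[victim]

    def infer(self, model: str) -> str:
        self.requests[model] += 1
        if model in self.resident:
            return "warm-start"            # model already in memory
        while self.used + self.sizes[model] > self.capacity and self.resident:
            self._evict_least_likely()
        self.resident.add(model)
        self.used += self.sizes[model]
        return "cold-start"                # model had to be loaded

cache = WarmModelCache(100, {"detector": 60, "classifier": 50, "ocr": 30})
for m in ["detector", "ocr", "detector", "classifier", "detector"]:
    print(m, cache.infer(m))
```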
  4. Natural-language interaction between passengers and autonomous vehicles is essential for trust, safety, and user experience, but deploying Large Language Models (LLMs) on automotive edge platforms is constrained by compute, memory, energy, and privacy. We present Pi-talk, an edge-only system that enables real-time passenger–vehicle dialogue using a Small Language Model (SLM) running entirely on embedded hardware. Pi-talk performs multimodal fusion of onboard camera, ultrasonic distance, and navigation context via a lightweight encoder–adapter module that aligns modalities into compact semantic tokens for a pre-trained SLM. The SLM produces context-aware explanations of driving decisions, route options, and situational updates without cloud connectivity. Safety is enforced through a real-time safety envelope that gates responses and actions using distance thresholds and timing constraints. We further adapter-tune the SLM (on-device or offline) and deploy it with INT8 quantization and an Open Neural Network Exchange (ONNX) runtime to achieve efficient batch-1 inference on Raspberry Pi-class hardware. We evaluate task quality (evaluation loss), end-to-end latency, CPU utilization, and memory footprint, and include ablations contrasting unimodal vs. fused inputs. Results show that Pi-talk sustains few-second, edge-only inference while meeting stringent resource and latency limits and maintaining the safety envelope required for autonomous operation. To our knowledge, Pi-talk is among the first edge-only, multimodal passenger–vehicle dialogue systems that both fine-tune and run a small language model entirely on Raspberry Pi-class, CPU-only hardware while enforcing an explicit runtime safety envelope.
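A minimal sketch of the INT8-plus-ONNX-Runtime deployment step described above follows; the model file name and input tensor names are placeholders, not Pi-talk's artifacts.

```python
# Hedged sketch of post-training INT8 quantization and batch-1 CPU
# inference with ONNX Runtime; "slm.onnx" and the input names are
# placeholders for an exported SLM, not the paper's released model.
import numpy as np
import onnxruntime as ort
from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic quantization compresses weights to INT8 without calibration data.
quantize_dynamic("slm.onnx", "slm.int8.onnx", weight_type=QuantType.QInt8)

# Batch-1, CPU-only inference, matching Raspberry Pi-class hardware.
sess = ort.InferenceSession("slm.int8.onnx",
                            providers=["CPUExecutionProvider"])
tokens = np.array([[101, 2023, 2003, 1037, 3231, 102]], dtype=np.int64)
outputs = sess.run(None, {"input_ids": tokens,
                          "attention_mask": np.ones_like(tokens)})
print(outputs[0].shape)
```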
  5. Low-latency and low-power edge AI is crucial for Virtual Reality and Augmented Reality applications. Recent advances demonstrate that hybrid models, combining convolutional layers (CNNs) and transformers (ViTs), often achieve a superior accuracy/performance tradeoff on various computer vision and machine learning (ML) tasks. However, hybrid ML models can present system challenges for latency and energy efficiency due to their heterogeneous dataflow and memory access patterns. In this work, we leverage the architectural heterogeneity of Neural Processing Units (NPUs) and Compute-In-Memory (CIM) and explore diverse execution schemas to efficiently execute these hybrid models. We introduce H4H-NAS, a two-stage Neural Architecture Search (NAS) framework to automate the design of efficient hybrid CNN/ViT models for heterogeneous edge systems featuring both NPU and CIM. We propose a two-phase incremental supernet training scheme in our NAS framework to resolve gradient conflicts between sampled subnets caused by the different block types in a hybrid model search space. Our H4H-NAS approach is also powered by a performance estimator built with NPU performance results measured on real silicon and CIM performance based on industry IPs. H4H-NAS searches hybrid CNN/ViT models at fine granularity and achieves a significant top-1 accuracy improvement (up to 1.34%) on ImageNet. Moreover, results from our algorithm/hardware co-design reveal up to 56.08% latency and 41.72% energy improvements from introducing heterogeneous computing over baseline solutions. Overall, our framework guides the design of hybrid network architectures and system architectures for NPU+CIM heterogeneous systems.
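To illustrate the shape of such a hardware-aware search loop, the sketch below samples hybrid CNN/ViT subnets, scores each with an accuracy proxy and a latency estimator, and keeps the best under a budget. All functions and numbers are placeholders for the paper's trained supernet and silicon-calibrated estimator.

```python
# Illustrative hardware-aware NAS loop; accuracy_proxy and
# latency_estimate_ms are toy stand-ins, not H4H-NAS components.
import random

STAGES = 4
CHOICES = ["conv_block", "vit_block"]          # per-stage hybrid search space

def sample_subnet():
    return tuple(random.choice(CHOICES) for _ in range(STAGES))

def accuracy_proxy(subnet):                    # stand-in for supernet eval
    return 70.0 + 1.5 * subnet.count("vit_block") + random.uniform(-0.5, 0.5)

def latency_estimate_ms(subnet):               # stand-in for NPU/CIM estimator
    return sum(3.0 if b == "conv_block" else 5.0 for b in subnet)

best = None
for _ in range(200):
    net = sample_subnet()
    acc, lat = accuracy_proxy(net), latency_estimate_ms(net)
    if lat <= 16.0 and (best is None or acc > best[1]):  # latency budget
        best = (net, acc, lat)
print("best subnet under 16 ms:", best)
```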