skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on December 5, 2026

Title: Pi-talk: Edge-Only, Adapter-Tuned Multimodal Small Language Model for Safe, Real-Time In-Vehicle Dialogue
Natural-language interaction between passengers and autonomous vehicles is essential for trust, safety, and user experience, but deploying Large Language Models (LLMs) on automotive edge platforms is constrained by compute, memory, energy, and privacy. We present Pi-talk, an edge-only system that enables real-time passenger–vehicle dialogue using a Small Language Model (SLM) running entirely on embedded hardware. Pi-talk performs multimodal fusion of onboard camera, ultrasonic distance, and navigation context via a lightweight encoder–adapter module that aligns modalities into compact semantic tokens for a pre-trained SLM. The SLM produces context-aware explanations of driving decisions, route options, and situational updates without cloud connectivity. Safety is enforced through a real-time safety envelope that gates responses and actions using distance thresholds and timing constraints. We further adapter-tune the SLM (on-device or offline) and deploy it with INT8 quantization and an Open Neural Network Exchange (ONNX) runtime to achieve efficient batch = 1 inference on Raspberry-Pi–class hardware. We evaluate task quality (evaluation loss), end-to-end latency, CPU utilization, and memory footprint, and include ablations contrasting unimodal vs. fused inputs. Results show that Pi-talk sustains few-second, edge-only inference while meeting stringent resource and latency limits and maintaining the safety envelope required for autonomous operation. To our knowledge, Pi-talk is among the first edgeonly, multimodal passenger–vehicle dialogue systems that both fine-tune and run a small language model entirely on Raspberry Pi–class, CPU-only hardware with an explicit while enforcing a runtime safety envelope.  more » « less
Award ID(s):
2244283
PAR ID:
10653861
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
IEEE
Date Published:
Format(s):
Medium: X
Location:
Boca Raton, FL
Sponsoring Org:
National Science Foundation
More Like this
  1. Emerging applications—disaster response drones, in-vehicle assistants, and field medical devices—require on-device language intelligence when cloud links are unreliable, privacy is mandatory, and subsecond latency is nonnegotiable. We benchmark seven SLMs (DistilBERT, MobileBERT, ALBERT, MiniLM, Phi-3 Mini, MobileLLaMA and TinyLLaMA) across four mission-aligned use cases (Watchlist Screening, Threat Detection, Document Triage, Multilingual Routing) on five border-relevant datasets (e.g., GTD, FLORES-200). Under controlled edge-like constraints (mobile-class CPU, 1–8 GB shared memory, intermittent networking), we report task quality (accuracy/F1 or ROUGE), batch-1 inference latency, and peak memory, and we introduce a reproducible, edge-budgeted evaluation protocol for security-critical scenarios. We also outline a path to multimodal edge workloads by pairing compact audio/vision encoders with SLM back ends. 
    more » « less
  2. Autonomous cyber-physical systems must be able to operate safely in a wide range of complex environments. To ensure safety without limiting mitigation options, these systems require detection of safety violations by mitigation trigger deadlines. As a result of these system’s complex environments, multimodal prediction is often required. For example, an autonomous vehicle (AV) operates in complex traffic scenes that result in any given vehicle having the ability to exhibit several plausible future behavior modes (e.g., stop, merge, turn, etc.); therefore, to ensure collision avoidance, an AV must be able to predict the possible multimodal behaviors of nearby vehicles. In previous work, model predictive runtime verification (MPRV) successfully detected future violations by a given deadline, but MPRV only considers a single mode of prediction (i.e., unimodal prediction). We design multimodal model predictive runtime verification (MMPRV) to extend MPRV to consider multiple modes of prediction, and we introduce Predictive Mission-Time Linear Temporal Logic (PMLTL) as an extension of MLTL to support the evaluation of probabilistic multimodal predictions. We examine the correctness and real-time feasibility of MMPRV through two AV case studies where MMPRV utilizes (1) a physics-based multimodal predictor on the F1Tenth autonomous racing vehicle and (2) current state-of-the-art deep neural network multimodal predictors trained and evaluated on the Argoverse motion forecasting dataset. We found that the ability to meet real-time requirements was a challenge for the latter, especially when targeting an embedded computing platform. 
    more » « less
  3. null (Ed.)
    This paper presents a hardware prototype and a framework for a new communication-aware model compression for distributed on-device inference. Our approach relies on Knowledge Distillation (KD) and achieves orders of magnitude compression ratios on a large pre-trained teacher model. The distributed hardware prototype consists of multiple student models deployed on Raspberry-Pi 3 nodes that run Wide ResNet and VGG models on the CIFAR10 dataset for real-time image classification. We observe significant reductions in memory footprint (50×), energy consumption (14×), latency (33×) and an increase in performance (12×) without any significant accuracy loss compared to the initial teacher model. This is an important step towards deploying deep learning models for IoT applications. 
    more » « less
  4. With the burgeoning of autonomous driving, the edgedeployed integrated CPU/GPU (iGPU) platform gains significant attention from both academia and industries. NVIDIA issues a series of Jetson iGPU platforms that perform well in terms of computation capability, power consumption, and mobile size. However, these iGPU platforms typically contain very limited physical memory, which could be the bottleneck of these autonomous driving and edge computing applications. Although the introduction of the Unified Memory (UM) model in GPU programming can reduce the memory footprint, the programming legacy of the UM model initializes data on the CPU side by default as the conventional copyand- execute model does, which causes significant latency of application execution. In this paper, we propose an enhanced unified memory management model (eUMM), which delivers a prefetch-enhanced data initialization method on the GPU side of the iGPU platform. We evaluate eUMM on the representative Jetson TX2 and Xavier AGX platforms and demonstrate that eUMM not only reduces the initialization latency significantly but also benefits the following kernel computation and the entire application execution latency. 
    more » « less
  5. null (Ed.)
    Deep neural networks (DNNs) are increasingly used for real-time inference, requiring low latency, but require significant computational power as they continue to increase in complexity. Edge clouds promise to offer lower latency due to their proximity to end-users and having powerful accelerators like GPUs to provide the computation power needed for DNNs. But it is also important to ensure that the edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments have significant interactions with the CPU, to transfer data to the GPU, for CPU-GPU synchronization on inference task completions, etc. These result in overheads. We present a DNN inference framework with a set of software primitives that reduce the overhead for DNN inference, increase GPU utilization and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA effectively, reducing the CPU cycles spent to transfer the data to the GPU. A second primitive uses asynchronous ‘events’ for faster task completion notification. GPU runtimes typically preclude fine-grained user control on GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model-loading and execution, thus allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×. 
    more » « less