Spatial Audio Processing with Large Language Model on Wearable Devices

Mishra, Ayushi; Bai, Yang; Narayanasamy, Priyadarshan; Garg, Nakul; Roy, Nirupam


Integrating spatial context into large language models (LLMs) has the potential to revolutionize human-computer interaction, particularly in wearable devices. In this work, we present a novel system architecture that incorporates spatial speech understanding into LLMs, enabling contextually aware and adaptive applications for wearable technologies. Our approach leverages microstructure-based spatial sensing to extract precise Direction of Arrival (DoA) information using a monaural microphone. To address the lack of existing datasets for microstructure-assisted speech recordings, we synthetically create a dataset called OmniTalk from the LibriSpeech dataset. The extracted spatial information is fused with linguistic embeddings from OpenAI’s Whisper model, allowing each modality to learn complementary contextual representations. The fused embeddings are aligned with the input space of the LLaMA-3.2 3B model, which is fine-tuned with the lightweight adaptation technique LoRA to optimize for on-device processing.
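The fusion, alignment, and LoRA steps named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not state how the modalities are fused, so concatenation is assumed here, and every name (`fuse`, `w_proj`, `lora_linear`), dimension, and weight value is hypothetical. Only the LoRA update rule itself, a frozen weight plus a scaled low-rank product, follows the standard formulation.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def fuse(speech_frames, spatial_vec):
    """Assumed fusion rule: append one spatial (DoA) vector to every speech frame."""
    return [list(frame) + list(spatial_vec) for frame in speech_frames]


def lora_linear(x, w_frozen, a, b, alpha):
    """Standard LoRA layer: y = x W + (alpha / r) * (x A) B.

    W stays frozen; only the low-rank factors A (d_in x r) and B (r x d_out)
    would be trained. The rank r is the number of rows of B.
    """
    scale = alpha / len(b)
    base = matmul(x, w_frozen)
    delta = matmul(matmul(x, a), b)
    return [[p + scale * q for p, q in zip(rb, rd)] for rb, rd in zip(base, delta)]


# Toy inputs (all values hypothetical): 2 speech frames of width 3,
# and a 2-dimensional DoA encoding.
speech = [[1.0, 0.0, 2.0], [0.5, 1.0, 0.0]]
doa = [0.0, 1.0]

fused = fuse(speech, doa)            # 2 frames of width 5
w_proj = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0]]
tokens = matmul(fused, w_proj)       # projected to a toy "LLM" width of 2

w_frozen = [[1.0, 0.0], [0.0, 1.0]]  # stands in for a frozen LLM weight
a = [[1.0], [1.0]]                   # rank-1 factor A
b_init = [[0.0, 0.0]]                # B starts at zero, so the adapter is a no-op
b_trained = [[0.5, 0.0]]             # pretend value after some training

before = lora_linear(tokens, w_frozen, a, b_init, alpha=2.0)
after = lora_linear(tokens, w_frozen, a, b_trained, alpha=2.0)
```

With B initialized to zero, the adapted layer reproduces the frozen layer exactly, which is why LoRA fine-tuning starts from the pretrained model's behavior. In this toy case the adapter adds 2 trainable rows of rank 1 instead of updating the full weight matrix, which is the property that makes the technique attractive for on-device use.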

Award ID(s): 2238433

PAR ID: 10589180

Author(s) / Creator(s): Mishra, Ayushi; Bai, Yang; Narayanasamy, Priyadarshan; Garg, Nakul; Roy, Nirupam

Publisher / Repository: International Conference on Machine Learning (ICML)

Date Published: 2025-07-13

Format(s): Medium: X

Location: Vancouver, Canada

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript
Conference Paper:
The DOI is not currently available.
