This content will become publicly available on May 17, 2026

Title: BAT: Learning to Reason about Spatial Sounds with Large Language Models
Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper, we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.
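The architecture described above follows a general pattern: a spatial audio encoder produces embeddings that are projected into the token space of a language model and prepended to the question. A minimal PyTorch sketch of that integration pattern appears below; the module names, dimensions, and projection design are illustrative assumptions, not the paper's actual implementation.

    # Minimal sketch of the encoder-to-LLM integration pattern described above.
    # Module names, dimensions, and the projection design are illustrative
    # assumptions, not the paper's actual implementation.
    import torch
    import torch.nn as nn

    class SpatialAudioToLLM(nn.Module):
        def __init__(self, audio_dim=768, llm_dim=4096):
            super().__init__()
            # Stand-in for a Spatial-AST-style binaural encoder (assumed interface).
            self.audio_encoder = nn.TransformerEncoder(
                nn.TransformerEncoderLayer(d_model=audio_dim, nhead=8, batch_first=True),
                num_layers=2,
            )
            # Project audio embeddings into the LLM's token-embedding space.
            self.projector = nn.Linear(audio_dim, llm_dim)

        def forward(self, binaural_patches, text_embeds):
            # binaural_patches: (batch, num_patches, audio_dim) spectrogram patch embeddings
            # text_embeds:      (batch, num_text_tokens, llm_dim) embedded question tokens
            audio_feats = self.audio_encoder(binaural_patches)
            audio_tokens = self.projector(audio_feats)
            # Prepend projected audio tokens to the question before feeding the LLM.
            return torch.cat([audio_tokens, text_embeds], dim=1)

    model = SpatialAudioToLLM()
    fused = model(torch.randn(1, 64, 768), torch.randn(1, 16, 4096))
    print(fused.shape)  # torch.Size([1, 80, 4096])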
Award ID(s):
2505865
PAR ID:
10631895
Author(s) / Creator(s):
Publisher / Repository:
https://doi.org/10.48550/arXiv.2402.01591
Date Published:
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Smart IoT speakers, while connected over a network, currently only produce sounds that come directly from the individual devices. We envision a future where smart speakers collaboratively produce a fabric of spatial audio, capable of perceptually placing sound in a range of locations in physical space. This could provide audio cues in homes, offices, and public spaces that are flexibly linked to various positions. The perception of spatialized audio relies on binaural cues, especially the time difference and the level difference of incident sound at a user's left and right ears. Traditional stereo speakers cannot create the perception of spatialization for a user when playing binaural audio due to auditory crosstalk, as each ear hears a combination of both speaker outputs. We present Xblock, a novel time-domain, pose-adaptive crosstalk cancellation technique that creates a spatial audio percept over a pair of speakers using knowledge of the user's head pose and the speaker positions. We build a prototype smart speaker IoT system empowered by Xblock, explore the effectiveness of Xblock through signal analysis, and discuss planned perceptual user studies and future work.
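The crosstalk cancellation described in item 1 above is conventionally formulated as inverting the 2x2 matrix of speaker-to-ear transfer functions so that each ear receives only its intended binaural channel. Below is a toy frequency-domain sketch of that formulation; the delay and gain values are illustrative assumptions, not Xblock's pose-adaptive time-domain filters.

    # Toy frequency-domain crosstalk cancellation: invert the 2x2 speaker-to-ear
    # transfer matrix H(f) so each ear receives only its intended channel.
    # The delays and gains below are illustrative assumptions.
    import numpy as np

    fs = 48000
    n_fft = 1024
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)

    def toy_path(delay_s, gain):
        # Each speaker-to-ear path is modeled as a pure delay plus attenuation.
        return gain * np.exp(-2j * np.pi * freqs * delay_s)

    # H[:, ear, speaker]: ipsilateral (direct) and contralateral (crosstalk) paths.
    H = np.empty((len(freqs), 2, 2), dtype=complex)
    H[:, 0, 0] = toy_path(2.0e-3, 1.0)   # left speaker  -> left ear
    H[:, 0, 1] = toy_path(2.3e-3, 0.7)   # right speaker -> left ear (crosstalk)
    H[:, 1, 0] = toy_path(2.3e-3, 0.7)   # left speaker  -> right ear (crosstalk)
    H[:, 1, 1] = toy_path(2.0e-3, 1.0)   # right speaker -> right ear

    # Cancellation filters C(f) = H(f)^-1, lightly regularized for stability.
    eye = np.eye(2)
    C = np.linalg.inv(H + 1e-3 * eye)

    # With these filters, H(f) @ C(f) is close to the identity at every frequency,
    # so the ears receive approximately the intended binaural signals.
    residual = np.abs(H @ C - eye).max()
    print(f"max deviation from identity: {residual:.3e}")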
  2. We introduce multimodal neural acoustic fields for synthesizing spatial sound and enabling the creation of immersive auditory experiences from novel viewpoints and in entirely unseen environments, both virtual and real. Extending the concept of neural radiance fields to acoustics, we develop a neural network-based model that maps an environment's geometric and visual features to its audio characteristics. Specifically, we introduce a novel hybrid transformer-convolutional neural network to accomplish two core tasks: capturing the reverberation characteristics of a scene from audio-visual data, and generating spatial sound in an unseen environment from signals recorded at sparse positions and orientations within the original scene. By learning to represent spatial acoustics in a given environment, our approach enables the creation of realistic, immersive auditory experiences, thereby enhancing the sense of presence in augmented and virtual reality applications. We validate the proposed approach on both synthetic and real-world visual-acoustic data and demonstrate that our method reproduces nonlinear acoustic effects such as reverberation and improves spatial audio quality compared to existing methods. Furthermore, we conduct subjective user studies and demonstrate that the proposed framework significantly improves audio perception in immersive mixed reality applications.
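Item 2 above maps an environment's features and a source/listener configuration to its acoustic characteristics. A minimal sketch of that field-style mapping follows, predicting a short impulse response from scene features and poses; the plain MLP and all dimensions are illustrative assumptions, whereas the paper's model is a hybrid transformer-convolutional network trained on audio-visual data.

    # Minimal sketch of an acoustic-field-style mapping:
    # (scene features, source pose, listener pose) -> a short room impulse response.
    # The plain MLP and all dimensions are illustrative assumptions.
    import torch
    import torch.nn as nn

    class ToyAcousticField(nn.Module):
        def __init__(self, scene_dim=256, pose_dim=6, ir_len=2048):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(scene_dim + 2 * pose_dim, 512),
                nn.ReLU(),
                nn.Linear(512, 512),
                nn.ReLU(),
                nn.Linear(512, ir_len),  # predicted impulse response samples
            )

        def forward(self, scene_feats, source_pose, listener_pose):
            x = torch.cat([scene_feats, source_pose, listener_pose], dim=-1)
            return self.net(x)

    field = ToyAcousticField()
    ir = field(torch.randn(1, 256), torch.randn(1, 6), torch.randn(1, 6))
    # Rendered audio at the listener ~= dry source filtered with the predicted IR.
    dry = torch.randn(1, 1, 16000)
    wet = torch.nn.functional.conv1d(dry, ir.unsqueeze(0), padding=ir.shape[-1] - 1)
    print(ir.shape, wet.shape)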
  3. Locomotion generates adventitious sounds which enable detection and localization of predators and prey. Such sounds contain brisk changes or transients in amplitude. We investigated the hypothesis that ill-understood temporal specializations in binaural circuits subserve lateralization of such sound transients, based on different time of arrival at the ears (interaural time differences, ITDs). We find that Lateral Superior Olive (LSO) neurons show exquisite ITD-sensitivity, reflecting extreme precision and reliability of excitatory and inhibitory postsynaptic potentials, in contrast to Medial Superior Olive neurons, traditionally viewed as the ultimate ITD-detectors. In vivo, inhibition blocks LSO excitation over an extremely short window, which, in vitro, required synaptically evoked inhibition. Light and electron microscopy revealed inhibitory synapses on the axon initial segment as the structural basis of this observation. These results reveal a neural vetoing mechanism with extreme temporal and spatial precision and establish the LSO as the primary nucleus for binaural processing of sound transients. 
  4. Standard training for Multi-modal Large Language Models (MLLMs) involves concatenating non-textual information, such as vision or audio, with a text prompt. This approach may not encourage deep integration of modalities, limiting the model's ability to leverage the core language model's reasoning capabilities. This work examines the impact of interleaved instruction tuning in an audio MLLM, where audio tokens are interleaved within the prompt. Using the Listen, Think, and Understand (LTU) model as a testbed, we conduct an experiment with the Synonym and Hypernym Audio Reasoning Dataset (SHARD), our newly created benchmark for audio-based semantic reasoning focused on synonym and hypernym recognition. Our findings show that even zero-shot interleaved prompting improves performance on our reasoning tasks, and that a small amount of fine-tuning with interleaved training prompts improves the results further, though at the expense of the MLLM's audio labeling ability.
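Item 4 above contrasts concatenating audio tokens ahead of a text prompt with interleaving them inside it. The sketch below illustrates the two prompt layouts at the string level; the audio placeholder and the template wording are assumptions, not LTU's actual prompt format.

    # Sketch of the two prompt layouts contrasted above. "<audio>" marks where the
    # projected audio embeddings are spliced into the token stream; the template
    # strings are illustrative assumptions, not LTU's actual prompt format.

    def concatenated_prompt(question: str) -> str:
        # Standard MLLM training: all audio tokens first, then the text prompt.
        return "<audio> " + question

    def interleaved_prompt(question_parts: list[str]) -> str:
        # Interleaved instruction tuning: audio tokens appear inside the prompt,
        # adjacent to the text that refers to them.
        return " <audio> ".join(question_parts)

    print(concatenated_prompt("Is the sound in the clip a hypernym of 'dog'?"))
    print(interleaved_prompt(["Is the sound of", "a hypernym of 'dog'?"]))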
  5. Understanding the behavioral and neural dynamics of social interactions is a goal of contemporary neuroscience. Many machine learning methods have emerged in recent years to make sense of the complex video and neurophysiological data that result from these experiments. Less focus has been placed on understanding how animals process acoustic information, including social vocalizations. A critical step toward bridging this gap is determining the senders and receivers of acoustic information in social interactions. While sound source localization (SSL) is a classic problem in signal processing, existing approaches are limited in their ability to localize animal-generated sounds in standard laboratory environments. Advances in deep learning methods for SSL are likely to help address these limitations; however, there are currently no publicly available models, datasets, or benchmarks to systematically evaluate SSL algorithms in the domain of bioacoustics. Here, we present the VCL Benchmark: the first large-scale dataset for benchmarking SSL algorithms in rodents. We acquired synchronized video and multi-channel audio recordings of 767,295 sounds with annotated ground-truth sources across 9 conditions. The dataset provides benchmarks that evaluate SSL performance on real data, simulated acoustic data, and a mixture of real and simulated data. We intend for this benchmark to facilitate knowledge transfer between the neuroscience and acoustic machine learning communities, which have had limited overlap.
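Item 5 above concerns sound source localization from multi-channel audio. A classical baseline for this task is time-difference-of-arrival estimation between microphone pairs via GCC-PHAT; the sketch below shows that estimator on a simulated 1 ms delay under toy assumptions (noise-free signal, a single pair of microphones), and is not the benchmark's reference implementation.

    # Minimal GCC-PHAT sketch: estimate the time difference of arrival (TDOA) of a
    # signal between two microphones, a classical building block for sound source
    # localization. The simulated 1 ms delay and noise-free signal are toy assumptions.
    import numpy as np

    fs = 44100
    true_delay = 0.001  # seconds: mic 2 hears the source 1 ms after mic 1
    t = np.arange(0, 0.5, 1.0 / fs)
    source = np.random.default_rng(0).standard_normal(t.size)

    mic1 = source
    mic2 = np.roll(source, int(round(true_delay * fs)))  # delayed copy

    def gcc_phat(x, y, fs):
        n = x.size + y.size
        X = np.fft.rfft(x, n=n)
        Y = np.fft.rfft(y, n=n)
        cross = X * np.conj(Y)
        cross /= np.abs(cross) + 1e-12          # PHAT weighting: keep phase only
        cc = np.fft.irfft(cross, n=n)
        max_shift = n // 2
        cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
        shift = np.argmax(np.abs(cc)) - max_shift
        return shift / fs

    print(f"estimated TDOA: {gcc_phat(mic2, mic1, fs) * 1000:.3f} ms")  # close to 1 ms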