Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret our surroundings based on sound. In this paper we present BAT, which combines the spatial sound perception ability of a binaural acoustic scene analysis model with the natural language reasoning capabilities of a large language model (LLM) to replicate this innate ability. To address the lack of existing datasets of in-the-wild spatial sounds, we synthesized a binaural audio dataset using AudioSet and SoundSpaces 2.0. Next, we developed SpatialSoundQA, a spatial sound-based question-answering dataset, offering a range of QA tasks that train BAT in various aspects of spatial sound perception and reasoning. The acoustic front-end encoder of BAT is a novel spatial audio encoder named Spatial Audio Spectrogram Transformer, or Spatial-AST, which by itself achieves strong performance across sound event detection, spatial localization, and distance estimation. By integrating Spatial-AST with the LLaMA-2 7B model, BAT transcends standard Sound Event Localization and Detection (SELD) tasks, enabling the model to reason about the relationships between the sounds in its environment. Our experiments demonstrate BAT's superior performance on both spatial sound perception and reasoning, showcasing the immense potential of LLMs in navigating and interpreting complex spatial audio environments.
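The abstract above describes BAT's architecture only at a high level. As a rough, purely illustrative sketch (module names, dimensions, and the fusion scheme are assumptions, not the paper's released code), audio embeddings from a spatial encoder could be projected into the LLM's embedding space and prepended to the question tokens:

```python
import torch
import torch.nn as nn

class SpatialAudioToLLM(nn.Module):
    """Toy illustration: project spatial-audio embeddings into an LLM's
    token-embedding space and prepend them to the text sequence.
    All dimensions and module names here are hypothetical."""
    def __init__(self, audio_dim=768, llm_dim=4096):
        super().__init__()
        self.proj = nn.Linear(audio_dim, llm_dim)  # audio -> LLM embedding space

    def forward(self, audio_embeds, text_embeds):
        # audio_embeds: (batch, n_audio_tokens, audio_dim)
        # text_embeds:  (batch, n_text_tokens, llm_dim)
        audio_tokens = self.proj(audio_embeds)
        # Prefix the question's token embeddings with the audio tokens.
        return torch.cat([audio_tokens, text_embeds], dim=1)

# Random tensors stand in for real encoder / tokenizer outputs.
fuse = SpatialAudioToLLM()
audio = torch.randn(1, 8, 768)   # e.g. pooled spatial-audio patch embeddings
text = torch.randn(1, 16, 4096)  # e.g. embedded question tokens
print(fuse(audio, text).shape)   # torch.Size([1, 24, 4096])
```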
Demo of Spatial Audification in OpenSpace: MMS Mission
This is an audio demo; listen with headphones. The audio begins around the 0:55 mark. In Collins et al. 2024, we demonstrated a spatial audification of data from NASA's Magnetospheric Multiscale (MMS) mission produced with open-source tools in Python. In that demo, however, the sound sources for each satellite are placed in a static and representative position. Here, we use OpenSpace to associate each audio stream with its respective spacecraft, so that the audification may be experienced with spatial fidelity on a flexible timescale. This proof-of-concept uses the Open Sound Control protocol to send positional data of the sound sources from OpenSpace to SuperCollider, a method also used in Elmquist et al. 2024.
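As a minimal sketch of the OpenSpace-to-SuperCollider link described above, the snippet below streams a sound source's position over Open Sound Control using the python-osc package. The host, port (SuperCollider's default language port, 57120), and address pattern are assumptions; the demo's actual message schema may differ.

```python
import time
from pythonosc.udp_client import SimpleUDPClient

# Hypothetical setup: SuperCollider listening on its default port 57120.
client = SimpleUDPClient("127.0.0.1", 57120)

# Placeholder positions for one MMS spacecraft; in the demo these would
# come from OpenSpace rather than a hard-coded list.
positions = [(1.0, 0.5, -2.0), (1.1, 0.4, -1.9), (1.2, 0.3, -1.8)]

for x, y, z in positions:
    # The address pattern "/mms1/pos" is illustrative, not the demo's actual schema.
    client.send_message("/mms1/pos", [x, y, z])
    time.sleep(0.1)  # throttle updates
```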
- Award ID(s):
- 2218996
- PAR ID:
- 10562842
- Publisher / Repository:
- Zenodo
- Date Published:
- Format(s):
- Medium: X
- Right(s):
- Creative Commons Attribution 4.0 International
- Sponsoring Org:
- National Science Foundation
More Like this
- Smart IoT Speakers, while connected over a network, currently only produce sounds that come directly from the individual devices. We envision a future where smart speakers collaboratively produce a fabric of spatial audio, capable of perceptually placing sound in a range of locations in physical space. This could provide audio cues in homes, offices and public spaces that are flexibly linked to various positions. The perception of spatialized audio relies on binaural cues, especially the time difference and the level difference of incident sound at a user’s left and right ears. Traditional stereo speakers cannot create the spatialization perception for a user when playing binaural audio due to auditory crosstalk, as each ear hears a combination of both speaker outputs. We present Xblock, a novel time-domain pose-adaptive crosstalk cancellation technique that creates a spatial audio perception over a pair of speakers using knowledge of the user’s head pose and speaker positions. We build a prototype smart speaker IoT system empowered by Xblock, explore the effectiveness of Xblock through signal analysis, and discuss future perceptual user studies and future work. (A sketch of the interaural time-difference cue appears after this list.)
- This dataset provides estimates of total Irrigation Water Use (IWU) by crop, county, water source, and year for the Continental United States. Total irrigation from Surface Water Withdrawals (SWW), total Groundwater Withdrawals (GWW), and nonrenewable Groundwater Depletion (GWD) is provided for 20 crops and crop groups from 2008 to 2020 at the county spatial resolution. In total, there are nearly 2.5 million data points in this dataset (3,142 counties; 13 years; 3 water sources; and 20 crops). This dataset supports the paper by Ruess et al. (2024) "Total irrigation by crop in the Continental United States from 2008 to 2020", Scientific Data, doi: 10.1038/s41597-024-03244-w. When using, please cite as: Ruess, P.J., Konar, M., Wanders, N., and Bierkens, M.F.P. (2024) Total irrigation by crop in the Continental United States from 2008 to 2020, Scientific Data, doi: 10.1038/s41597-024-03244-w. (A quick size check appears after this list.)
- Devices from smartphones to televisions are beginning to employ dual-purpose displays, where the display serves as both a video screen and a loudspeaker. In this paper we demonstrate a method to generate localized sound-radiating regions on a flat-panel display. An array of force actuators affixed to the back of the panel is driven by appropriately filtered audio signals so the total response of the panel due to the actuator array approximates a target spatial acceleration profile. The response of the panel to each actuator individually is initially measured via a laser vibrometer, and the required actuator filters for each source position are determined by an optimization procedure that minimizes the mean squared error between the reconstructed and targeted acceleration profiles. Since the single-actuator panel responses are determined empirically, the method does not require analytical or numerical models of the system’s modal response, and thus is well-suited to panels having the complex boundary conditions typical of television screens, mobile devices, and tablets. The method is demonstrated on two panels with differing boundary conditions. When integrated with display technology, the localized audio source rendering method may transform traditional displays into multimodal audio-visual interfaces by colocating localized audio sources and objects in the video stream. (A least-squares illustration of the actuator-filter fit appears after this list.)
- This work, situated at Rensselaer's Collaborative-Research Augmented Immersive Virtual Environment Laboratory (CRAIVELab), uses panoramic image datasets for spatial audio display. A system is developed for the room-centered immersive virtual reality facility to analyze panoramic images on a segment-by-segment basis, using pre-trained neural network models for semantic segmentation and object detection, thereby generating audio objects with respective spatial locations. These audio objects are then mapped with a series of synthetic and recorded audio datasets and populated within a spatial audio environment as virtual sound sources. The resulting audiovisual outcomes are then displayed using the facility's human-scale panoramic display, as well as the 128-channel loudspeaker array for wave field synthesis (WFS). Performance evaluation indicates effectiveness for real-time enhancements, with potential for large-scale expansion and rapid deployment in dynamic immersive virtual environments. (A pixel-to-angle mapping sketch appears after this list.)
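For the Xblock entry above, the binaural cues it builds on can be illustrated with a textbook approximation. The sketch below computes the interaural time difference using Woodworth's spherical-head formula with an assumed head radius; it is not Xblock's pose-adaptive cancellation algorithm, only an illustration of the cue being reproduced.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s at room temperature
HEAD_RADIUS = 0.0875     # m, assumed average head radius

def itd_woodworth(azimuth_deg):
    """Interaural time difference (seconds) from Woodworth's spherical-head
    approximation: ITD = (a / c) * (theta + sin(theta))."""
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))

for az in (0, 30, 60, 90):
    print(f"azimuth {az:3d} deg -> ITD {itd_woodworth(az) * 1e6:6.1f} microseconds")
```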
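For the irrigation dataset entry, the stated size follows directly from the listed dimensions; the snippet below reproduces that arithmetic. The commented filtering lines are purely hypothetical, since the actual file and column names in the Zenodo deposit are not given here.

```python
# Data points implied by the stated dimensions:
# counties x years x water sources x crops.
n_points = 3142 * 13 * 3 * 20
print(n_points)  # 2450760 -> "nearly 2.5 million data points"

# Hypothetical usage with pandas; file name and column names are assumptions.
# import pandas as pd
# df = pd.read_csv("irrigation_water_use.csv")
# corn_gwd_2020 = df[(df["crop"] == "corn") & (df["source"] == "GWD") & (df["year"] == 2020)]
```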
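The actuator-filter optimization described in the dual-purpose display entry is, in its simplest single-frequency form, a least-squares fit of actuator weights to a target acceleration profile. The sketch below uses random stand-in data to show that fit; the paper's actual procedure designs filters from measured laser-vibrometer responses and is not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: panel response at 200 surface points for 16 actuators.
# In the paper these responses are measured with a laser vibrometer.
n_points, n_actuators = 200, 16
H = rng.standard_normal((n_points, n_actuators))

# Target spatial acceleration profile: a single localized source region.
target = np.zeros(n_points)
target[40:60] = 1.0

# Actuator weights minimizing the mean squared error ||H @ w - target||^2.
w, *_ = np.linalg.lstsq(H, target, rcond=None)

mse = np.mean((H @ w - target) ** 2)
print(f"reconstruction MSE: {mse:.4f}")
```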
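For the CRAIVELab entry, one concrete step is turning a detected object's pixel position in a panoramic image into a direction for a virtual sound source. The mapping below assumes a standard equirectangular convention; the facility's actual coordinate conventions may differ.

```python
def panorama_pixel_to_angles(px, py, width, height):
    """Map a pixel in an equirectangular panorama to (azimuth, elevation) in
    degrees, assuming azimuth spans -180..180 left to right and elevation
    spans 90..-90 top to bottom."""
    azimuth = (px / width) * 360.0 - 180.0
    elevation = 90.0 - (py / height) * 180.0
    return azimuth, elevation

# Example: center of a detected object's bounding box in a 4096 x 2048 panorama.
print(panorama_pixel_to_angles(3072, 512, 4096, 2048))  # (90.0, 45.0)
```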
