Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or weakly correspond to visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data.
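The abstract above mentions a retrieval-augmented generation framework: conditioning the audio generator on a training clip whose content is similar to the query video. As a rough, purely illustrative sketch of the retrieval step only — the embedding dimensions, the cosine-similarity criterion, and the name `retrieve_exemplar` are all assumptions for this example, not details from the paper:

```python
import numpy as np

def retrieve_exemplar(query_emb: np.ndarray, bank_embs: np.ndarray) -> int:
    """Return the index of the candidate clip whose embedding is closest
    to the query embedding under cosine similarity.

    query_emb : (d,) embedding of the query video
    bank_embs : (n, d) embeddings of n candidate training clips
    """
    q = query_emb / np.linalg.norm(query_emb)
    b = bank_embs / np.linalg.norm(bank_embs, axis=1, keepdims=True)
    sims = b @ q                       # cosine similarity with each candidate
    return int(np.argmax(sims))

# Toy usage: three orthogonal candidate embeddings; the query points mostly
# along candidate 1, and cosine similarity is scale-invariant, so the
# retrieved index is 1.
bank = np.array([[1.0, 0.0, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.0, 1.0]])
query = np.array([0.1, 2.0, 0.1])
idx = retrieve_exemplar(query, bank)   # → 1
```

In the full system, the audio paired with the retrieved clip would condition the generator; this sketch covers only the nearest-neighbor lookup.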
Visual to Sound: Generating Natural Sound for Videos in the Wild
As two of the five traditional human senses (sight, hearing, taste, smell, and touch), vision and sound are basic sources through which humans understand the world. Often correlated during natural events, these two modalities combine to jointly affect human perception. In this paper, we pose the task of generating sound given visual input. Such capabilities could help enable applications in virtual reality (generating sound for virtual scenes automatically) or provide additional accessibility to images or videos for people with visual impairments. As a first step in this direction, we apply learning-based methods to generate raw waveform samples given input video frames. We evaluate our models on a dataset of videos containing a variety of sounds (such as ambient sounds and sounds from people/animals). Our experiments show that the generated sounds are fairly realistic and have good temporal synchronization with the visual inputs.
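The abstract describes applying learning-based methods to map input video frames to raw waveform samples. As a deliberately toy illustration of that idea — a linear least-squares fit from frame features to waveform samples, which is an assumption for this sketch and far simpler than the neural models the paper actually trains:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: n clips, each with a d-dim frame feature and a t-sample waveform.
n, d, t = 200, 8, 16
W_true = rng.normal(size=(d, t))                  # hidden frame->waveform map
X = rng.normal(size=(n, d))                       # per-clip frame features
Y = X @ W_true + 0.01 * rng.normal(size=(n, t))   # noisy "ground-truth" audio

# Fit the mapping by ordinary least squares (the simplest learned regressor).
W_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Generate waveform samples for a held-out frame feature.
x_new = rng.normal(size=(d,))
wave = x_new @ W_hat                              # (t,) predicted samples
```

The point is only the shape of the problem (visual features in, waveform samples out); real systems replace the linear map with deep networks trained on large video corpora.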
- Award ID(s): 1633295
- PAR ID: 10066894
- Date Published:
- Journal Name: IEEE Conference on Computer Vision and Pattern Recognition
- ISSN: 2163-6648
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Lovable robots in movies regularly beep, chirp, and whirr, yet robots in the real world rarely deploy such sounds. Despite preliminary work supporting the perceptual and objective benefits of intentionally-produced robot sound, relatively little research is ongoing in this area. In this paper, we systematically evaluate transformative robot sound across multiple robot archetypes and behaviors. We conducted a series of five online video-based surveys, each with N ≈ 100 participants, to better understand the effects of musician-designed transformative sounds on perceptions of personal, service, and industrial robots. Participants rated robot videos with transformative sound as significantly happier, warmer, and more competent in all five studies, as more energetic in four studies, and as less discomforting in one study. Overall, results confirmed that transformative sounds consistently improve subjective ratings but may convey affect contrary to the intent of affective robot behaviors. In future work, we will investigate the repeatability of these results through in-person studies and develop methods to automatically generate transformative robot sound. This work may benefit researchers and designers who aim to make robots more favorable to human users.
To investigate preferences for mobile and wearable sound awareness systems, we conducted an online survey with 201 DHH participants. The survey explores how demographic factors affect perceptions of sound awareness technologies, gauges interest in specific sounds and sound characteristics, solicits reactions to three design scenarios (smartphone, smartwatch, head-mounted display) and two output modalities (visual, haptic), and probes issues related to social context of use. While most participants were highly interested in being aware of sounds, this interest was modulated by communication preference, that is, a preference for sign or oral communication or both. Almost all participants wanted both visual and haptic feedback, and 75% preferred to have that feedback on separate devices (e.g., haptic on smartwatch, visual on head-mounted display). Other findings related to sound type, full captions vs. keywords, sound filtering, notification styles, and social context provide direct guidance for the design of future mobile and wearable sound awareness systems.
The Sound Travels research team will share a recording that exemplifies affective associations made with specific sounds by visitors to free-choice learning environments (a science museum, a park, a zoo, and a botanical garden). This recording reflects direct collaboration with visitors and demonstrates the variation in how people make sense of sound, both in identifying its sources and in describing its effects on their emotional and cognitive states. Our US-based, federally funded project explores the impacts of ambient and designed sound on STEM learning and leisure experiences. Beyond addressing our research questions, we embrace the larger goals of seeking meaningful input from professionals in and visitors to these spaces and directly informing educational design practice. Our methods include multiple stationary ambient recordings within spaces of interest, a post-experience visitor questionnaire, and a "sound search" instrument in which visitors record video clips during their experience to represent sounds that make them feel curious, energized, uneasy, and peaceful. Together, the resulting data reveal not only how visitors are affected by sound but also how visitors experience and notice sound in context, and in what ways a person's embodied and culturally informed associations with sound relate to their experiences of learning and leisure.
Antona, M.; Stephanidis, C. (Eds.)
Environmental sounds can provide important information about surrounding activity, yet recognizing sounds can be challenging for Deaf and Hard-of-Hearing (DHH) individuals. Prior work has examined the preferences of DHH users for various sound-awareness methods. However, these preferences have been observed to vary along some demographic factors. Thus, in this study we investigate the preferences of a specific group of DHH users: current assistive listening device users. Through a survey of 38 participants, we investigated their challenges and requirements for sound-awareness applications, as well as which types of sounds and what aspects of the sounds are of importance to them. We found that users of assistive listening devices still often miss sounds and rely on other people to obtain information about them. Participants indicated that the importance of awareness of different types of sounds varied according to the environment and the form factor of the sound-awareness technology. Congruent with prior work, participants reported that the location and urgency of the sound were of importance, as well as the confidence of the technology in its identification of that sound.