We explore computational strategies for matching human vocal imitations of birdsong to actual birdsong recordings. We recorded human vocal imitations of birdsong and subsequently analysed these data using three categories of audio features for matching imitations to original birdsong: spectral, temporal, and spectrotemporal. These exploratory analyses suggest that spectral features can help distinguish imitation strategies (e.g. whistling vs. singing) but are insufficient for distinguishing species. Similarly, although temporal features are correlated between human imitations and natural birdsong, they too are insufficient for distinguishing species. Spectrotemporal features showed the greatest promise, in particular when used to extract a representation of the pitch contour of birdsong and human imitations. This finding suggests a link between the task of matching human imitations to birdsong and retrieval tasks in the music domain such as query-by-humming and cover song retrieval; we borrow from such existing methodologies to outline directions for future research.
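As a rough illustration of the spectrotemporal approach described above, the sketch below (not taken from the study) extracts pitch contours with pYIN and aligns an imitation against a birdsong recording using dynamic time warping, in the spirit of query-by-humming systems. The file names and frequency ranges are placeholder assumptions.

```python
# Minimal sketch, assuming librosa is available: compare the pitch contour of a human
# imitation to that of a birdsong recording with dynamic time warping (DTW).
# "imitation.wav" / "birdsong.wav" and the frequency ranges are placeholder assumptions.
import numpy as np
import librosa

def pitch_contour(path, fmin, fmax, sr=22050):
    """Extract a z-scored log-frequency pitch contour from the voiced frames."""
    y, sr = librosa.load(path, sr=sr)
    f0, voiced, _ = librosa.pyin(y, fmin=fmin, fmax=fmax, sr=sr)
    contour = np.log2(f0[voiced])                 # log frequency ~ perceived pitch
    return (contour - contour.mean()) / (contour.std() + 1e-8)

imitation = pitch_contour("imitation.wav", fmin=80.0, fmax=2000.0)    # human vocal range
birdsong = pitch_contour("birdsong.wav", fmin=500.0, fmax=10000.0)    # typical bird range

# DTW alignment of the two contours; a lower normalized cost suggests a better match.
D, wp = librosa.sequence.dtw(imitation[np.newaxis, :], birdsong[np.newaxis, :],
                             metric="euclidean")
print("normalized DTW cost:", D[-1, -1] / len(wp))
```

Z-scoring the log-frequency contour discards the absolute register (humans imitate far below most birds), so only the shape of the contour drives the match.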
This content will become publicly available on April 6, 2026
Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations
We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketch-like sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of the input controls from a vocal imitation while retaining adherence to the input text prompt and audio quality comparable to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation.
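For intuition about the control signals and the median-filter trick described in the abstract, here is a hedged sketch (not the authors' implementation) that computes loudness, brightness, and pitch tracks with librosa and coarsens them with a randomly sized median filter; the file name, pitch range, and kernel sizes are assumptions.

```python
# Hedged sketch, not the authors' code: compute the three control signals named in the
# abstract (loudness, brightness, pitch) and coarsen them with a randomly sized median
# filter, approximating the training-time trick for sketch-like imitations.
import numpy as np
import librosa
from scipy.signal import medfilt

def control_signals(y, sr, hop_length=512):
    """Frame-wise loudness (RMS), brightness (spectral centroid), and pitch (pYIN)."""
    loudness = librosa.feature.rms(y=y, hop_length=hop_length)[0]
    brightness = librosa.feature.spectral_centroid(y=y, sr=sr, hop_length=hop_length)[0]
    f0, _, _ = librosa.pyin(y, fmin=60.0, fmax=2000.0, sr=sr, hop_length=hop_length)
    pitch = np.nan_to_num(f0)                       # unvoiced frames -> 0
    return np.stack([loudness, brightness, pitch])

def random_median_filter(controls, rng, max_kernel=31):
    """Median-filter each control track with a randomly chosen odd kernel width."""
    filtered = []
    for track in controls:
        k = int(rng.choice(np.arange(1, max_kernel + 1, 2)))
        filtered.append(medfilt(track, kernel_size=k))
    return np.stack(filtered)

y, sr = librosa.load("vocal_imitation.wav", sr=44100)      # placeholder input file
controls = control_signals(y, sr)
smoothed = random_median_filter(controls, np.random.default_rng(0))
```

Wider kernels keep only the coarse shape of each control, which is what lets a model trained this way accept loose, sketch-like gestures at inference time.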
- Award ID(s): 2222369
- PAR ID: 10638308
- Publisher / Repository: IEEE
- Date Published:
- ISBN: 979-8-3503-6874-1
- Page Range / eLocation ID: 1 to 5
- Format(s): Medium: X
- Location: Hyderabad, India
- Sponsoring Org: National Science Foundation
More Like this
- This work introduces Text2FX, a method that leverages CLAP embeddings and differentiable digital signal processing to control audio effects, such as equalization and reverberation, using open-vocabulary natural language prompts (e.g., "make this sound in-your-face and bold"). Text2FX operates without retraining any models, relying instead on single-instance optimization within the existing embedding space, thus enabling a flexible, scalable approach to open-vocabulary sound transformations through interpretable and disentangled FX manipulation. We show that CLAP encodes valuable information for controlling audio effects and propose two optimization approaches using CLAP to map text to audio effect parameters. While we demonstrate with CLAP, this approach is applicable to any shared text-audio embedding space. Similarly, while we demonstrate with equalization and reverberation, any differentiable audio effect may be controlled. We conduct a listener study with diverse text prompts and source audio to evaluate the quality and alignment of these methods with human perception. Demos and code are available at anniejchu.github.io/text2fx. (A minimal sketch of the single-instance optimization loop follows this list.)
- Generating realistic audio for human actions is critical for applications such as film sound effects and virtual reality games. Existing methods assume complete correspondence between video and audio during training, but in real-world settings, many sounds occur off-screen or weakly correspond to visuals, leading to uncontrolled ambient sounds or hallucinations at test time. This paper introduces AV-LDM, a novel ambient-aware audio generation model that disentangles foreground action sounds from ambient background noise in in-the-wild training videos. The approach leverages a retrieval-augmented generation framework to synthesize audio that aligns both semantically and temporally with the visual input. Trained and evaluated on Ego4D and EPIC-KITCHENS datasets, along with the newly introduced Ego4D-Sounds dataset (1.2M curated clips with action-audio correspondence), the model outperforms prior methods, enables controllable ambient sound generation, and shows promise for generalization to synthetic video game clips. This work is the first to emphasize faithful video-to-audio generation focused on observed visual content despite noisy, uncurated training data. (A retrieval-conditioning sketch follows this list.)
- Pleasure in music has been linked to predictive coding of melodic and rhythmic patterns, subserved by connectivity between regions in the brain's auditory and reward networks. Specific musical anhedonics derive little pleasure from music and have altered auditory-reward connectivity, but no difficulties with music perception abilities and no generalized physical anhedonia. Recent research suggests that specific musical anhedonics experience pleasure in nonmusical sounds, suggesting that the implicated brain pathways may be specific to music reward. However, this work used sounds with clear real-world sources (e.g., babies laughing, crowds cheering), so positive hedonic responses could be based on the referents of these sounds rather than the sounds themselves. We presented specific musical anhedonics and matched controls with isolated short pleasing and displeasing synthesized sounds of varying timbres with no clear real-world referents. While the two groups found displeasing sounds equally displeasing, the musical anhedonics gave substantially lower pleasure ratings to the pleasing sounds, indicating that their sonic anhedonia is not limited to musical rhythms and melodies. Furthermore, across a large sample of participants, mean pleasure ratings for pleasing synthesized sounds predicted significant and similar variance in six dimensions of musical reward considered to be relatively independent, suggesting that pleasure in sonic timbres plays a role in eliciting reward-related responses to music. We replicate the earlier findings of preserved pleasure ratings for semantically referential sounds in musical anhedonics and find that pleasure ratings of semantic referents, when presented without sounds, correlated with ratings for the sounds themselves. This association was stronger in musical anhedonics than in controls, suggesting the use of semantic knowledge as a compensatory mechanism for affective sound processing. Our results indicate that specific musical anhedonia is not entirely specific to melodic and rhythmic processing, and suggest that timbre merits further research as a source of pleasure in music.
- Acoustic behavior is widespread across vertebrates, including fishes. We report robust acoustic displays during aggressive interactions for a laboratory colony of Danionella dracula, a miniature and transparent species of teleost fish closely related to zebrafish (Danio rerio), which are hypothesized to be sonic based on the presence of a hypertrophied muscle associated with the male swim bladder. Males produce bursts of pulsatile sounds and a distinct postural display – extension of a hypertrophied lower jaw, a morphological trait not present in other Danionella species – during aggressive but not courtship interactions. Females show no evidence of sound production or jaw extension in such contexts. Novel pairs of size-matched or -mismatched males were combined in resident–intruder assays where sound production and jaw extension could be linked to individuals. In both dyad contexts, resident males produced significantly more sound pulses than intruders. During heightened sonic activity, the majority of the highest sound producers also showed increased jaw extension. Residents extended their jaw more than intruders in size-matched but not -mismatched contexts. Larger males in size-mismatched dyads produced more sounds and jaw extensions compared with their smaller counterparts, and sounds and jaw extensions increased with increasing absolute body size. These studies establish D. dracula as a sonic species that modulates putatively acoustic and postural displays during aggressive interactions based on residency and body size, providing a foundation for further investigating the role of multimodal displays in a new model clade for neurogenomic and neuroimaging studies of aggression, courtship and other social interactions.
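Referring back to the Text2FX entry above: a minimal, self-contained sketch of the single-instance optimization idea, in which one differentiable effect parameter is tuned so the processed audio moves toward a text prompt in a shared text-audio embedding space. The embed_text/embed_audio toys and the soft low-pass stand in for a real CLAP-style encoder and a real differentiable EQ or reverb; nothing here is the authors' code.

```python
# Hedged sketch, not the authors' implementation: single-instance optimization of one
# differentiable effect parameter against a text prompt in a shared embedding space.
# embed_text / embed_audio are toy stand-ins for a CLAP-style encoder, and soft_lowpass
# stands in for a real differentiable EQ or reverb.
import torch
import torch.nn.functional as F

def embed_text(prompt: str) -> torch.Tensor:
    # Toy stand-in: a deterministic pseudo-embedding derived from the prompt text.
    g = torch.Generator().manual_seed(abs(hash(prompt)) % (2**31))
    return torch.rand(64, generator=g)

def embed_audio(audio: torch.Tensor) -> torch.Tensor:
    # Toy stand-in: magnitude spectrum pooled to 64 bins (differentiable w.r.t. audio).
    spec = torch.fft.rfft(audio).abs()
    return F.adaptive_avg_pool1d(spec[None, None, :], 64)[0, 0]

def soft_lowpass(audio: torch.Tensor, cutoff01: torch.Tensor) -> torch.Tensor:
    # A tiny differentiable "effect": frequency-domain low-pass with a soft knee.
    spec = torch.fft.rfft(audio)
    freqs = torch.linspace(0.0, 1.0, spec.shape[-1])
    mask = torch.sigmoid((cutoff01 - freqs) * 50.0)
    return torch.fft.irfft(spec * mask, n=audio.shape[-1])

def text2fx_style_opt(audio, prompt, steps=200, lr=1e-2):
    text_emb = embed_text(prompt).detach()
    cutoff = torch.tensor(0.5, requires_grad=True)          # the single effect parameter
    opt = torch.optim.Adam([cutoff], lr=lr)
    for _ in range(steps):
        processed = soft_lowpass(audio, cutoff.clamp(0.01, 0.99))
        loss = -F.cosine_similarity(embed_audio(processed), text_emb, dim=-1)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return soft_lowpass(audio, cutoff.detach().clamp(0.01, 0.99)), float(cutoff)

audio = torch.randn(48000)                                  # placeholder 1 s clip @ 48 kHz
processed, cutoff = text2fx_style_opt(audio, "make this sound warm and muffled")
```

Swapping the toy encoders for a real shared text-audio embedding model and the low-pass for a bank of differentiable effects recovers the overall structure described in the abstract.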
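And for the AV-LDM entry: a hedged sketch of the retrieval step in a retrieval-augmented conditioning scheme, in which the ambient-audio embedding closest to a clip-level video embedding is fetched from a pre-built bank and concatenated as extra conditioning. All tensors are random placeholders; the encoders and the generator itself are omitted.

```python
# Hedged sketch, not the authors' code: nearest-neighbor retrieval of an ambient-audio
# embedding to condition generation alongside the video embedding. All tensors below
# are random placeholders standing in for learned video/audio encoders and a real bank.
import torch
import torch.nn.functional as F

def retrieve_ambient(video_emb: torch.Tensor, bank: torch.Tensor, k: int = 1):
    """Return the k ambient embeddings in the bank most similar to the video embedding."""
    sims = F.cosine_similarity(bank, video_emb[None, :], dim=-1)
    idx = sims.topk(k).indices
    return bank[idx], idx

video_emb = torch.randn(512)              # placeholder clip-level video embedding
ambient_bank = torch.randn(10_000, 512)   # placeholder bank of ambient-audio embeddings
ambient_cond, idx = retrieve_ambient(video_emb, ambient_bank)
conditioning = torch.cat([video_emb, ambient_cond[0]])   # joint conditioning vector
```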