NSF PAR Search | NSF Public Access Repository


Sketch2Sound: Controllable Audio Generation via Time-Varying Signals and Sonic Imitations

https://doi.org/10.1109/ICASSP49660.2025.10888184

Flores García, Hugo; Nieto, Oriol; Salamon, Justin; Pardo, Bryan; Seetharaman, Prem (April 2025, IEEE)

We present Sketch2Sound, a generative audio model capable of creating high-quality sounds from a set of interpretable time-varying control signals: loudness, brightness, and pitch, as well as text prompts. Sketch2Sound can synthesize arbitrary sounds from sonic imitations (i.e., a vocal imitation or a reference sound-shape). Sketch2Sound can be implemented on top of any text-to-audio latent diffusion transformer (DiT), and requires only 40k steps of fine-tuning and a single linear layer per control, making it more lightweight than existing methods like ControlNet. To synthesize from sketchlike sonic imitations, we propose applying random median filters to the control signals during training, allowing Sketch2Sound to be prompted using controls with flexible levels of temporal specificity. We show that Sketch2Sound can synthesize sounds that follow the gist of input controls from a vocal imitation while retaining the adherence to an input text prompt and audio quality compared to a text-only baseline. Sketch2Sound allows sound artists to create sounds with the semantic flexibility of text prompts and the expressivity and precision of a sonic gesture or vocal imitation.
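The random median filtering mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give window sizes, edge handling, or sampling scheme, so the function names, the `max_window` parameter, and the truncated-window edge behavior below are all assumptions made for illustration, not the authors' implementation.

```python
import random
from statistics import median


def median_filter(signal, window):
    """Sliding median over `signal` with an odd `window` size.

    Assumption: windows are simply truncated at the sequence edges.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(signal)
    return [
        median(signal[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]


def randomly_smooth(signal, max_window=9, rng=random):
    """Pick a random odd window in [1, max_window] and median-filter the signal.

    A window of 1 leaves the control curve untouched; larger windows keep
    only its coarse shape. `max_window` is a hypothetical parameter.
    """
    window = rng.choice(range(1, max_window + 1, 2))
    return median_filter(signal, window), window


if __name__ == "__main__":
    loudness = [0.1, 0.1, 0.9, 0.1, 0.2, 0.8, 0.8, 0.7, 0.1, 0.1]
    smoothed, window = randomly_smooth(loudness, rng=random.Random(0))
    print(window, smoothed)
```

Sampling a different window per training example would expose a model to control curves ranging from detailed to coarse, which is one plausible reading of "flexible levels of temporal specificity".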
Free, publicly-accessible full text available April 6, 2026
Soundata: Reproducible use of audio datasets

https://doi.org/10.21105/joss.06634

Fuentes, Magdalena; Plaja-Roglans, Genís; Cortès-Sebastià, Guillem; Khandelwal, Tanmay; Miron, Marius; Serra, Xavier; Bello, Juan Pablo; Salamon, Justin (June 2024, Journal of Open Source Software)

Full Text Available
Bridging High-Quality Audio and Video Via Language for Sound Effects Retrieval from Visual Queries

https://doi.org/10.1109/WASPAA58266.2023.10248113

Wilkins, Julia; Salamon, Justin; Fuentes, Magdalena; Bello, Juan Pablo; Nieto, Oriol (October 2023, IEEE)

Full Text Available
Chirping up the Right Tree: Incorporating Biological Taxonomies into Deep Bioacoustic Classifiers

https://doi.org/10.1109/ICASSP40776.2020.9052908

Cramer, Jason; Lostanlen, Vincent; Farnsworth, Andrew; Salamon, Justin; Bello, Juan Pablo (May 2020, ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))
Class imbalance in the training data hinders the generalization ability of machine listening systems. In the context of bioacoustics, this issue may be circumvented by aggregating species labels into super-groups of higher taxonomic rank: genus, family, order, and so forth. However, different applications of machine listening to wildlife monitoring may require different levels of granularity. This paper introduces TaxoNet, a deep neural network for structured classification of signals from living organisms. TaxoNet is trained as a multitask and multilabel model, following a new architectural principle in end-to-end learning named "hierarchical composition": shallow layers extract a shared representation to predict a root taxon, while deeper layers specialize recursively to lower-rank taxa. In this way, TaxoNet is capable of handling taxonomic uncertainty, out-of-vocabulary labels, and open-set deployment settings. An experimental benchmark on two new bioacoustic datasets (ANAFCC and BirdVox-14SD) leads to state-of-the-art results in bird species classification. Furthermore, on a task of coarse-grained classification, TaxoNet also outperforms a flat single-task model trained on aggregate labels.
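The label aggregation described in the abstract above (grouping species into genus or family) can be sketched as follows. This shows only the aggregation of species-level scores into coarser ranks, not TaxoNet's hierarchical architecture. The three-species toy taxonomy and the function name `aggregate` are assumptions chosen for illustration.

```python
from collections import defaultdict

# Toy taxonomy for illustration only: species -> (genus, family).
TAXONOMY = {
    "Setophaga petechia": ("Setophaga", "Parulidae"),
    "Setophaga striata": ("Setophaga", "Parulidae"),
    "Catharus ustulatus": ("Catharus", "Turdidae"),
}
RANK_INDEX = {"genus": 0, "family": 1}


def aggregate(species_probs, rank):
    """Sum species-level probabilities into super-groups at the given rank."""
    idx = RANK_INDEX[rank]
    totals = defaultdict(float)
    for species, prob in species_probs.items():
        totals[TAXONOMY[species][idx]] += prob
    return dict(totals)


if __name__ == "__main__":
    probs = {
        "Setophaga petechia": 0.5,
        "Setophaga striata": 0.25,
        "Catharus ustulatus": 0.25,
    }
    print(aggregate(probs, "genus"))
    print(aggregate(probs, "family"))
```

Coarser ranks pool the examples of rare species, which is why aggregation can soften class imbalance at the cost of granularity.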
Full Text Available
Robust sound event detection in bioacoustic sensor networks

https://doi.org/10.1371/journal.pone.0214168

Lostanlen, Vincent; Salamon, Justin; Farnsworth, Andrew; Kelling, Steve; Bello, Juan Pablo (October 2019, PLOS ONE)
McLoughlin, Ian (Ed.)
Full Text Available
Adaptive Pooling Operators for Weakly Labeled Sound Event Detection

https://doi.org/10.1109/TASLP.2018.2858559

McFee, Brian; Salamon, Justin; Bello, Juan Pablo (November 2018, IEEE/ACM Transactions on Audio, Speech, and Language Processing)

Full Text Available
HistoryTracker: Minimizing Human Interactions in Baseball Game Annotation

https://doi.org/10.1145/3290605.3300293

Ono, Jorge Piazentin; Gjoka, Arvi; Salamon, Justin; Dietrich, Carlos; Silva, Claudio T. (April 2019, CHI '19 Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems)

The sports data tracking systems available today are based on specialized hardware (high-definition cameras, speed radars, RFID) to detect and track targets on the field. While effective, implementing and maintaining these systems poses a number of challenges, including high cost and the need for close human monitoring. On the other hand, the sports analytics community has been exploring human computation and crowdsourcing in order to produce tracking data that is trustworthy, cheaper and more accessible. However, state-of-the-art methods require a large number of users to perform the annotation, or place too much burden on a single user. We propose HistoryTracker, a methodology that facilitates the creation of tracking data for baseball games by warm-starting the annotation process using a vast collection of historical data. We show that HistoryTracker helps users to produce tracking data in a fast and reliable way.
Full Text Available
Per-Channel Energy Normalization: Why and How

https://doi.org/10.1109/LSP.2018.2878620

Lostanlen, Vincent; Salamon, Justin; Cartwright, Mark; McFee, Brian; Farnsworth, Andrew; Kelling, Steve; Bello, Juan Pablo (January 2019, IEEE Signal Processing Letters)

Full Text Available
Scaper: A library for soundscape synthesis and augmentation

https://doi.org/10.1109/WASPAA.2017.8170052

Salamon, Justin; MacConnell, Duncan; Cartwright, Mark; Li, Peter; Bello, Juan Pablo (October 2017, Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA-17))

Sound event detection (SED) in environmental recordings is a key topic of research in machine listening, with applications in noise monitoring for smart cities, self-driving cars, surveillance, bioacoustic monitoring, and indexing of large multimedia collections. Developing new solutions for SED often relies on the availability of strongly labeled audio recordings, where the annotation includes the onset, offset and source of every event. Generating such precise annotations manually is very time consuming, and as a result existing datasets for SED with strong labels are scarce and limited in size. To address this issue, we present Scaper, an open-source library for soundscape synthesis and augmentation. Given a collection of isolated sound events, Scaper acts as a high-level sequencer that can generate multiple soundscapes from a single, probabilistically defined, 'specification'. To increase the variability of the output, Scaper supports the application of audio transformations such as pitch shifting and time stretching individually to every event. To illustrate the potential of the library, we generate a dataset of 10,000 soundscapes and use it to compare the performance of two state-of-the-art algorithms, including a breakdown by soundscape characteristics. We also describe how Scaper was used to generate audio stimuli for an audio labeling crowdsourcing experiment, and conclude with a discussion of Scaper's limitations and potential applications.
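The idea of generating many strongly labeled soundscapes from one probabilistic specification, as described in the abstract above, can be sketched without any audio at all. The code below deliberately does not use Scaper's actual API. The specification keys, the uniform distributions, and the function `sample_soundscape` are hypothetical and only show how a single specification can yield multiple annotation sets with onsets, offsets, and labels.

```python
import random


def sample_soundscape(spec, rng):
    """Draw one set of event annotations from a probabilistic specification.

    Assumed (hypothetical) spec keys:
      duration        total soundscape length in seconds
      n_events        (min, max) number of events, inclusive
      labels          candidate source labels
      event_duration  (min, max) event length in seconds
    """
    events = []
    for _ in range(rng.randint(*spec["n_events"])):
        length = rng.uniform(*spec["event_duration"])
        onset = rng.uniform(0.0, spec["duration"] - length)
        events.append({
            "label": rng.choice(spec["labels"]),
            "onset": onset,
            "offset": onset + length,
        })
    return sorted(events, key=lambda e: e["onset"])


if __name__ == "__main__":
    spec = {
        "duration": 10.0,
        "n_events": (1, 3),
        "labels": ["siren", "dog_bark"],
        "event_duration": (0.5, 2.0),
    }
    rng = random.Random(0)
    for _ in range(3):
        print(sample_soundscape(spec, rng))
```

Because every sampled event carries its own onset, offset, and label, the strong annotations come for free with each generated soundscape.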
Full Text Available
Fusing shallow and deep learning for bioacoustic bird species classification

https://doi.org/10.1109/ICASSP.2017.7952134

Salamon, Justin; Bello, Juan Pablo; Farnsworth, Andrew; Kelling, Steve (March 2017, IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP))

Automated classification of organisms to species based on their vocalizations would contribute tremendously to abilities to monitor biodiversity, with a wide range of applications in the field of ecology. In particular, automated classification of migrating birds' flight calls could yield new biological insights and conservation applications for birds that vocalize during migration. In this paper we explore state-of-the-art classification techniques for large-vocabulary bird species classification from flight calls. In particular, we contrast a “shallow learning” approach based on unsupervised dictionary learning with a deep convolutional neural network combined with data augmentation. We show that the two models perform comparably on a dataset of 5428 flight calls spanning 43 different species, with both significantly outperforming an MFCC baseline. Finally, we show that by combining the models using a simple late-fusion approach we can further improve the results, obtaining a state-of-the-art classification accuracy of 0.96.
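The late-fusion step in the abstract above can be sketched as combining the class probabilities of two models and taking the most likely class. The abstract does not state the fusion rule, so the weighted average below, the `weight_a` parameter, and the function name `late_fusion` are assumptions for illustration.

```python
def late_fusion(probs_a, probs_b, weight_a=0.5):
    """Fuse two models' class-probability vectors by weighted averaging.

    Returns (index of the winning class, fused probability vector).
    Weighted averaging is an assumed fusion rule, not the paper's stated one.
    """
    if len(probs_a) != len(probs_b):
        raise ValueError("both models must score the same set of classes")
    fused = [
        weight_a * a + (1.0 - weight_a) * b
        for a, b in zip(probs_a, probs_b)
    ]
    best = max(range(len(fused)), key=fused.__getitem__)
    return best, fused


if __name__ == "__main__":
    shallow = [0.5, 0.25, 0.25]   # e.g. dictionary-learning model output
    deep = [0.0, 0.75, 0.25]      # e.g. convolutional network output
    print(late_fusion(shallow, deep))
```

Fusing at the output level keeps the two models independent, so each can be trained and tuned separately before their predictions are combined.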
Full Text Available
