ABSTRACT To mitigate the effects of Radio Frequency Interference (RFI) on the data analysis pipelines of 21 cm interferometric instruments, numerous inpainting techniques have been developed. In this paper, we examine the qualitative and quantitative errors introduced into the visibilities and power spectrum due to inpainting. We perform our analysis on simulated data as well as real data from the Hydrogen Epoch of Reionization Array (HERA) Phase 1 upper limits. We also introduce a convolutional neural network that is capable of inpainting RFI-corrupted data. We train our network on simulated data and show that it can inpaint real data without needing to be retrained. We find that techniques that incorporate high wavenumbers in delay space in their modelling are best suited for inpainting over narrowband RFI. We show that, with our fiducial parameters, discrete prolate spheroidal sequences (dpss) and clean provide the best performance for intermittent RFI, while Gaussian process regression (gpr) and least squares spectral analysis (lssa) provide the best performance for larger RFI gaps. However, we caution that these qualitative conclusions are sensitive to the chosen hyperparameters of each inpainting technique. We show that all inpainting techniques reliably reproduce foreground-dominated modes in the power spectrum. Since the inpainting techniques should not be capable of reproducing noise realizations, we find that the largest errors occur in the noise-dominated delay modes. We show that, as the noise level of the data decreases, clean and dpss are most capable of reproducing the fine frequency structure in the visibilities.
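As a rough, illustrative sketch of one of the techniques named above (not the paper's actual pipeline), the snippet below inpaints RFI-flagged channels of a single spectrum by least-squares fitting a small discrete prolate spheroidal sequence (dpss) basis to the unflagged channels and evaluating the fit over the flagged ones. The channel count, half-bandwidth, number of modes, and flag mask are all placeholder assumptions.

```python
# Hypothetical illustration: DPSS-based inpainting of RFI-flagged channels
# in a single spectrum. Parameters are placeholders chosen for the example.
import numpy as np
from scipy.signal.windows import dpss

n_freq = 1024                 # number of frequency channels (assumed)
n_modes = 16                  # number of DPSS basis vectors to keep (assumed)
half_bandwidth = 8.0          # time half-bandwidth product (assumed)

# Simulated "data": smooth foreground-like spectrum plus noise.
rng = np.random.default_rng(0)
freqs = np.linspace(0.0, 1.0, n_freq)
vis = np.exp(-(freqs - 0.3) ** 2 / 0.02) + 0.01 * rng.standard_normal(n_freq)

# Flag mask: True where data are good, False where RFI was excised.
flags = np.ones(n_freq, dtype=bool)
flags[400:430] = False        # a contiguous RFI gap
flags[700] = False            # narrowband RFI

# DPSS basis: each row is one sequence spanning the full band.
basis = dpss(n_freq, half_bandwidth, Kmax=n_modes)   # shape (n_modes, n_freq)

# Least-squares fit of the basis to the unflagged channels only.
A = basis[:, flags].T                                 # (n_good, n_modes)
coeffs, *_ = np.linalg.lstsq(A, vis[flags], rcond=None)

# Evaluate the model everywhere and fill only the flagged channels.
model = basis.T @ coeffs
inpainted = np.where(flags, vis, model)
```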
VampNet: Music Generation via Masked Acoustic Token Modeling
We introduce VampNet, a masked acoustic token modeling approach to music synthesis, compression, inpainting, and variation. We use a variable masking schedule during training which allows us to sample coherent music from the model by applying a variety of masking approaches (called prompts) during inference. VampNet is non-autoregressive, leveraging a bidirectional transformer architecture that attends to all tokens in a forward pass. With just 36 sampling passes, VampNet can generate coherent high-fidelity musical waveforms. We show that by prompting VampNet in various ways, we can apply it to tasks like music compression, inpainting, outpainting, continuation, and looping with variation (vamping). Appropriately prompted, VampNet is capable of maintaining style, genre, instrumentation, and other high-level aspects of the music. This flexible prompting capability makes VampNet a powerful music co-creation tool. Code and audio samples are available online.
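As a hedged, schematic illustration of masked acoustic token modeling (not the released VampNet code), the sketch below trains a bidirectional transformer over codec tokens with a variable masking ratio, so the model learns to infill masked positions in a single forward pass. The vocabulary size, model dimensions, and masking schedule are assumptions made for the example.

```python
# Schematic of masked acoustic token modeling; placeholders throughout,
# not the VampNet implementation.
import torch
import torch.nn as nn

vocab_size = 1024          # acoustic codebook size (assumed)
mask_id = vocab_size       # extra token id used as the [MASK] symbol
seq_len, d_model = 256, 512

class BidirectionalTokenModel(nn.Module):
    """Non-autoregressive model: every position attends to all others."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(vocab_size + 1, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        return self.head(self.encoder(self.embed(tokens)))

def training_step(model, tokens):
    """One step with a masking ratio drawn at random (variable schedule)."""
    ratio = torch.empty(1).uniform_(0.1, 1.0).item()
    mask = torch.rand_like(tokens, dtype=torch.float) < ratio
    inputs = tokens.masked_fill(mask, mask_id)
    logits = model(inputs)
    # Loss only on masked positions: the model learns to infill them.
    return nn.functional.cross_entropy(logits[mask], tokens[mask])

model = BidirectionalTokenModel()
tokens = torch.randint(0, vocab_size, (4, seq_len))   # fake codec tokens
loss = training_step(model, tokens)
```

At inference, prompts would correspond to choosing which positions to mask before iteratively re-predicting them; that loop is omitted here.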
- Award ID(s): 2222369
- PAR ID: 10543854
- Publisher / Repository: 24th International Society for Music Information Retrieval Conference
- Date Published:
- Subject(s) / Keyword(s): Generative modeling; Music; Audio
- Format(s): Medium: X
- Location: 24th International Society for Music Information Retrieval Conference, Milan, Italy
- Sponsoring Org: National Science Foundation
More Like this
- This work describes the design of real-time dance-based interaction with a humanoid robot, where the robot seeks to promote physical activity in children by taking on multiple roles as a dance partner. It acts as a leader by initiating dances but can also act as a follower by mimicking a child’s dance movements. Dances in the leader role are produced by a sequence-to-sequence (S2S) Long Short-Term Memory (LSTM) network trained on children’s music videos taken from YouTube. On the other hand, a music orchestration platform is implemented to generate background music in the follower mode as the robot mimics the child’s poses. In doing so, we also incorporated the largely unexplored paradigm of learning-by-teaching by including multiple robot roles that allow the child to both learn from and teach to the robot. Our work is among the first to implement a largely autonomous, real-time full-body dance interaction with a bipedal humanoid robot that also explores the impact of the robot roles on child engagement. Importantly, we also incorporated in our design formal constructs taken from autism therapy, such as the least-to-most prompting hierarchy, reinforcements for positive behaviors, and a time delay to make behavioral observations. We implemented a multimodal child engagement model that encompasses both affective engagement (displayed through eye gaze focus and facial expressions) as well as task engagement (determined by the level of physical activity) to determine child engagement states. We then conducted a virtual exploratory user study to evaluate the impact of mixed robot roles on user engagement and found no statistically significant difference in the children’s engagement in single-role and multiple-role interactions. While the children were observed to respond positively to both robot behaviors, they preferred the music-driven leader role over the movement-driven follower role, a result that can partly be attributed to the virtual nature of the study. Our findings support the utility of such a platform in practicing physical activity but indicate that further research is necessary to fully explore the impact of each robot role.
- This work introduces a transformer-based image and video tokenizer leveraging Binary Spherical Quantization (BSQ). The method projects high-dimensional visual embeddings onto a lower-dimensional hypersphere followed by binary quantization. BSQ offers three key benefits: (1) parameter efficiency without requiring an explicit codebook, (2) scalability to arbitrary token dimensions, and (3) high compression capability—up to 100× compression of visual data with minimal distortion. The tokenizer architecture includes a transformer encoder-decoder with block-wise causal masking to handle variable-length video inputs. The resulting model, BSQ-ViT, achieves state-of-the-art visual reconstruction performance on image and video benchmarks while delivering 2.4× higher throughput compared to previous best methods. Additionally, BSQ-ViT supports video compression via autoregressive priors for adaptive arithmetic coding, achieving results comparable to leading video compression standards. Furthermore, it enables masked language models to achieve competitive image synthesis quality relative to GAN- and diffusion-based approaches. (An illustrative sketch of the quantization step appears after this list.)
- There are many benefits for a community when there is a vibrant local music scene (e.g., increased mental & physical well-being, increased economic activity) and there are many factors that contribute to an environment in which a live music scene can thrive (e.g., available performance spaces, helpful government policies). In this paper, we explore using an estimate of the live music event rate (LMER) as a rough indicator to measure the strength of a local music scene. We define LMER as the number of music shows per 100,000 people per year and then explore how this indicator is (or is not) correlated with 28 other socioeconomic indicators. To do this, we analyze a set of 308,051 music events from 2019 across 1,139 cities in the United States. Our findings reveal that factors related to transportation (e.g., high walkability), population (high density), economics (high employment rate), age (high proportion of individuals age 20-29), and education (bachelor's degree or higher) are strongly correlated with having a high number of live music events. Conversely, we did not find statistically significant evidence that other indicators (e.g., racial diversity) are correlated. (An illustrative LMER computation appears after this list.)
- We introduce VoiceCraft, a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts. VoiceCraft employs a Transformer decoder architecture and introduces a token rearrangement procedure that combines causal masking and delayed stacking to enable generation within an existing sequence. On speech editing tasks, VoiceCraft produces edited speech that is nearly indistinguishable from unedited recordings in terms of naturalness, as evaluated by humans; for zero-shot TTS, our model outperforms prior SotA models including VALL-E and the popular commercial model XTTS-v2. Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music, and our model performs consistently well compared to other models and real recordings. In particular, for speech editing evaluation, we introduce a high-quality, challenging, and realistic dataset named RealEdit. (A sketch of the delayed-stacking idea appears after this list.)
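For the BSQ-ViT entry above, here is a minimal, assumption-laden sketch of the binary spherical quantization step it describes: project an embedding to a lower dimension, normalize it onto the unit hypersphere, and binarize each coordinate with a straight-through gradient. The dimensions and module layout are illustrative, not the paper's implementation.

```python
# Hedged sketch of a binary spherical quantization step; shapes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F

class BinarySphericalQuantizer(nn.Module):
    def __init__(self, embed_dim=768, code_dim=18):
        super().__init__()
        self.down = nn.Linear(embed_dim, code_dim)   # project down
        self.up = nn.Linear(code_dim, embed_dim)     # project back up

    def forward(self, x):
        z = F.normalize(self.down(x), dim=-1)        # point on the unit hypersphere
        scale = 1.0 / (z.shape[-1] ** 0.5)
        q = torch.sign(z) * scale                    # nearest binary corner on the sphere
        q = z + (q - z).detach()                     # straight-through gradient
        return self.up(q), (q > 0)                   # reconstruction + binary code

quant = BinarySphericalQuantizer()
patches = torch.randn(2, 196, 768)                   # e.g. ViT patch embeddings (toy)
recon, codes = quant(patches)                        # codes index an implicit 2**18 codebook
```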
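For the live-music-scene entry above, the following is a small illustrative computation of the LMER indicator (shows per 100,000 people per year) and a Pearson correlation against one socioeconomic indicator. The dataframe and column names are hypothetical, not the paper's dataset.

```python
# Illustrative LMER calculation and correlation on made-up data.
import pandas as pd
from scipy.stats import pearsonr

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "events_2019": [1200, 300, 4500, 80],        # live music events in a year
    "population": [650_000, 120_000, 2_700_000, 45_000],
    "walkability": [72.0, 40.5, 88.1, 25.3],     # placeholder indicator
})

# LMER: music shows per 100,000 people per year.
cities["lmer"] = cities["events_2019"] / cities["population"] * 100_000

r, p = pearsonr(cities["lmer"], cities["walkability"])
print(f"Pearson r = {r:.2f}, p = {p:.3f}")
```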
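For the VoiceCraft entry above, the sketch below illustrates one ingredient of its token rearrangement, delayed stacking, in a generic form: codebook k of the codec token grid is shifted right by k frames so that the codebooks of a given frame are not emitted at the same step. The shapes and pad id are assumptions; the actual procedure additionally involves causal masking and is not reproduced here.

```python
# Hedged sketch of delayed stacking of neural-codec tokens (toy shapes).
import numpy as np

def delay_stack(codes: np.ndarray, pad_id: int) -> np.ndarray:
    """codes: (n_codebooks, n_frames) -> (n_codebooks, n_frames + n_codebooks - 1)."""
    n_cb, n_frames = codes.shape
    out = np.full((n_cb, n_frames + n_cb - 1), pad_id, dtype=codes.dtype)
    for k in range(n_cb):
        out[k, k:k + n_frames] = codes[k]        # shift codebook k right by k frames
    return out

codes = np.arange(12).reshape(4, 3)              # 4 codebooks, 3 frames (toy data)
stacked = delay_stack(codes, pad_id=-1)
```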