Training personalized speech enhancement models is innately a zero-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, one may train a personalized speech enhancement model using self-supervised learning. One straightforward approach to model personalization is to use the target speaker's noisy recordings as pseudo-sources. A pseudo denoising model then learns to remove injected training noises and recover the pseudo-sources. However, this approach is volatile, as it depends on the quality of the pseudo-sources, which may be too noisy. To remedy this, we propose a data purification step that refines the self-supervised approach. We first train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources. Then, we convert the predictor's estimates into weights that adjust each frame's contribution towards training the personalized model. We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement. Our approach may be seen as privacy-preserving, as it does not rely on any clean speech recordings or speaker embeddings.
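The purification step described above can be sketched as a frame-wise weighting of the denoising loss. The logistic SNR-to-weight mapping, its parameters (`alpha`, `snr_mid`), and the plain MSE loss below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def snr_to_frame_weights(snr_db, alpha=0.5, snr_mid=0.0):
    """Map per-frame SNR estimates (dB) to [0, 1] weights with a logistic
    curve, so that high-SNR (cleaner) frames contribute more to training.
    `alpha` (slope) and `snr_mid` (midpoint) are hypothetical choices."""
    return 1.0 / (1.0 + np.exp(-alpha * (snr_db - snr_mid)))

def purified_loss(est_frames, pseudo_source_frames, snr_db):
    """Frame-weighted MSE between the denoiser output and the noisy
    pseudo-source, down-weighting frames the SNR predictor flags as noisy.
    `est_frames`/`pseudo_source_frames` are (T, F) feature frames."""
    w = snr_to_frame_weights(snr_db)                                        # (T,)
    per_frame = np.mean((est_frames - pseudo_source_frames) ** 2, axis=-1)  # (T,)
    return float(np.sum(w * per_frame) / (np.sum(w) + 1e-8))
```

With this weighting, frames whose estimated SNR is low barely influence the personalized model, which is the intent of the data purification step.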
This content will become publicly available on May 6, 2026
Yong Kyu Yoon, Ratree Wayland, Lori J Altmann, Saeyeong Jeon, Sunghyun Hwang & Kevin Tang. 2025. Smart pseudo-palate for linguistic and biomedical applications. U.S. Patent No. 12,290,378. https://image-ppubs.uspto.gov/dirsearch-public/print/downloadPdf/12290378
In one aspect, the disclosure relates to a smart pseudo-palate for use in a Smart Electropalatograph (EPG) for Linguistic and Medical Applications (SELMA) system. In one aspect, the pseudo-palate is constructed from a thin, flexible polymer membrane with an embedded electrode array. The pseudo-palate is configured to detect tongue contacts during speech while causing minimal disturbance or interference with speech motion. The disclosed pseudo-palate in the SELMA system is integrated with a microcontroller, wireless electronic module, and external readout app. The disclosure, in another aspect, relates to integration of the pseudo-palate with a smart sports/health mouth guard containing a series of sensors for monitoring head impacts, body temperature, and heart rate. The SELMA system is capable of automated detection of neurological conditions and brain injury including, but not limited to, concussion and neurological movement disorders, using acoustic, articulatory, and other biosignals from the device via deep data analysis.
- Award ID(s):
- 2037266
- PAR ID:
- 10639674
- Publisher / Repository:
- United States Patent
- Date Published:
- Format(s):
- Medium: X
- Institution:
- United States Patent
- Sponsoring Org:
- National Science Foundation
More Like this
-
Current leading mispronunciation detection and diagnosis (MDD) systems achieve promising performance via end-to-end phoneme recognition. One challenge of such end-to-end solutions is the scarcity of human-annotated phonemes on natural L2 speech. In this work, we leverage unlabeled L2 speech via a pseudo-labeling (PL) procedure and extend the fine-tuning approach based on pre-trained self-supervised learning (SSL) models. Specifically, we use Wav2vec 2.0 as our SSL model, and fine-tune it using original labeled L2 speech samples plus the created pseudo-labeled L2 speech samples. Our pseudo labels are dynamic and are produced by an ensemble of the online model on-the-fly, which ensures that our model is robust to pseudo label noise. We show that fine-tuning with pseudo labels achieves a 5.35% phoneme error rate reduction and 2.48% MDD F1 score improvement over a labeled-samples-only fine-tuning baseline. The proposed PL method is also shown to outperform conventional offline PL methods. Compared to the state-of-the-art MDD systems, our MDD solution produces a more accurate and consistent phonetic error diagnosis. In addition, we conduct an open test on a separate UTD-4Accents dataset, where our system recognition outputs show a strong correlation with human perception, based on accentedness and intelligibility.
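The dynamic, on-the-fly pseudo-labeling described above can be sketched with an EMA teacher as one simple realization of an "online ensemble" (the paper's actual ensembling and CTC-style objective may differ; all shapes, the cross-entropy loss, and the mixing weight `lam` are placeholders):

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    """Exponential moving average of student weights into the teacher,
    so pseudo labels evolve with training rather than being fixed offline."""
    for pt, ps in zip(teacher.parameters(), student.parameters()):
        pt.mul_(decay).add_(ps, alpha=1 - decay)

def pseudo_label_step(student, teacher, labeled, unlabeled, opt, lam=0.5):
    """One fine-tuning step mixing labeled L2 speech with on-the-fly
    pseudo-labeled speech. `labeled` is (features, targets); `unlabeled`
    is (features, None). Illustrative per-frame classification losses."""
    x_l, y_l = labeled
    x_u, _ = unlabeled
    with torch.no_grad():
        y_u = teacher(x_u).argmax(dim=-1)  # dynamic pseudo labels
    loss = torch.nn.functional.cross_entropy(student(x_l), y_l) \
         + lam * torch.nn.functional.cross_entropy(student(x_u), y_u)
    opt.zero_grad()
    loss.backward()
    opt.step()
    ema_update(teacher, student)
    return loss.item()
```

Because the teacher lags the student smoothly, label noise from any single student checkpoint is damped, which is the robustness argument made above.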
-
An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
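The two-module pipeline above can be sketched at a high level: the alignment module produces pseudo-text for untranscribed speech, the synthesis module trains on those (pseudo-text, speech) pairs, and real text is used only at inference. The `aligner`/`synth` interfaces below are hypothetical placeholders, not the paper's actual APIs:

```python
def train_unsupervised_tts(speech_corpus, text_corpus, aligner, synth):
    """Sketch of the unsupervised TTS pipeline.
    `aligner.fit` sees the speech and text collections separately (no
    parallel pairs); `aligner.transcribe` emits pseudo-text; `synth`
    trains on pseudo-text and later consumes real text."""
    aligner.fit(speech_corpus, text_corpus)  # unsupervised alignment
    pseudo_pairs = [(aligner.transcribe(wav), wav) for wav in speech_corpus]
    synth.fit(pseudo_pairs)                  # train on pseudo-text targets
    return lambda sentence: synth.generate(sentence)  # infer on real text
```

The key design choice this mirrors is that the synthesis module never needs human transcriptions: pseudo-text stands in for them during training.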
-
Kirchler, Michael; Weitzel, Utz (Ed.) To improve the stability of the banking system, the Dodd-Frank Act mandates that central banks conduct periodic evaluations of banks' financial conditions. An intensely debated aspect of these 'stress tests' regards how much of the information generated by stress tests should be disclosed to financial markets. This paper uses an environment constructed from a model by Goldstein and Leitner (2018) to gain behavioral insight into the policy tradeoffs associated with disclosure. Experimental results indicate that the effects of variations in disclosure conditions are sensitive to overbidding for bank assets. Absent overbidding, however, optimal disclosure robustly improves risk sharing even when banks behave non-optimally.
-
Smart speaker voice assistants (VAs) such as Amazon Echo and Google Home have been widely adopted due to their seamless integration with smart home devices and the Internet of Things (IoT) technologies. These VA services raise privacy concerns, especially due to their access to our speech. This work considers one such use case: the unaccountable and unauthorized surveillance of a user's emotion via speech emotion recognition (SER). This paper presents DARE-GP, a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech. DARE-GP does this by using a constrained genetic programming approach to learn the spectral frequency traits that depict target users' emotional content, and then generating a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP provides: a) real-time protection of previously unheard utterances, b) protection against previously unseen black-box SER classifiers, c) preservation of speech transcription, and d) operation in a realistic acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary. The evaluations in this work culminate with acoustic evaluations against two off-the-shelf commercial smart speakers, using a small-form-factor (Raspberry Pi) device integrated with a wake-word system to evaluate the efficacy of its real-world, real-time deployment.
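The deployment side of the universal perturbation above is simple to sketch: a single fixed waveform `delta` is added to any utterance, regardless of length or content. DARE-GP *learns* `delta` with genetic programming; this sketch only shows the additive, utterance-independent application, and the clipping budget `eps` is an illustrative assumption:

```python
import numpy as np

def apply_universal_perturbation(speech, delta, eps=0.01):
    """Add a fixed (universal) perturbation `delta` to an utterance.
    `speech` is a mono waveform in [-1, 1]; `delta` is tiled or trimmed
    to the utterance length and clipped to an imperceptibility budget."""
    d = np.clip(delta, -eps, eps)         # bound perturbation magnitude
    d = np.resize(d, speech.shape)        # tile/trim to utterance length
    return np.clip(speech + d, -1.0, 1.0) # keep a valid waveform range
```

Because the same `delta` works on previously unheard utterances, no per-utterance optimization is needed at playback time, which is what makes real-time protection feasible.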
