This paper presents a novel zero-shot learning approach to personalized speech enhancement through the use of a sparsely active ensemble model. Optimizing speech denoising systems for a particular test-time speaker can improve performance and reduce run-time complexity. However, test-time model adaptation may be challenging if collecting data from the test-time speaker is not possible. To this end, we propose using an ensemble model wherein each specialist module denoises noisy utterances from a distinct partition of training set speakers. A gating module inexpensively estimates test-time speaker characteristics in the form of an embedding vector and selects the most appropriate specialist module for denoising the test signal. Grouping the training set speakers into non-overlapping, semantically similar groups is non-trivial and ill-defined. To do this, we first train a Siamese network using noisy speech pairs to maximize or minimize the similarity of its output vectors depending on whether the utterances derive from the same speaker or not. Next, we perform k-means clustering on the latent space formed by the averaged embedding vectors per training set speaker. In this way, we designate speaker groups and train specialist modules optimized around partitions of the complete training set. Our experiments show that ensemble models made up of low-capacity specialists can outperform high-capacity generalist models with greater efficiency and improved adaptation to unseen test-time speakers.
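As a rough sketch of the grouping and gating steps described in that abstract, the snippet below averages per-speaker Siamese embeddings, clusters them with k-means, and routes a test-time embedding to the specialist attached to the nearest cluster. The function names, array shapes, and the use of scikit-learn are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of speaker grouping and specialist selection; the data
# layout and helper names are assumptions made for illustration only.
import numpy as np
from sklearn.cluster import KMeans

def group_speakers(per_speaker_embeddings, n_groups=4):
    """Cluster averaged per-speaker Siamese embeddings into specialist groups."""
    # per_speaker_embeddings: dict of speaker_id -> (num_utterances, dim) array
    speaker_ids = list(per_speaker_embeddings)
    means = np.stack([per_speaker_embeddings[s].mean(axis=0) for s in speaker_ids])
    km = KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit(means)
    # Each resulting group of speakers would then be used to train one specialist denoiser.
    return dict(zip(speaker_ids, km.labels_)), km.cluster_centers_

def select_specialist(test_embedding, cluster_centers):
    """Gating step: choose the specialist whose cluster centre is nearest."""
    distances = np.linalg.norm(cluster_centers - test_embedding, axis=1)
    return int(np.argmin(distances))
```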
Modeling speaker-specific vocal tract kinematics from gestural scores
The theory of Task Dynamics provides a method of predicting articulatory kinematics from a discrete phonologically-relevant representation (“gestural score”). However, because the implementations of that model (e.g., Nam et al., 2004) have generally used a simplified articulatory geometry (Mermelstein et al., 1981) whose forward model (from articulator to constriction coordinates) can be analytically derived, quantitative predictions of the model for individual human vocal tracts have not been possible. Recently, methods of deriving individual speaker forward models from real-time MRI data have been developed (Sorensen et al., 2019). This has further allowed development of task dynamic models for individual speakers, which make quantitative predictions. Thus far, however, these models (Alexander et al., 2019) could only synthesize limited types of utterances due to their inability to model temporally overlapping gestures. An updated implementation is presented, which can accommodate overlapping gestures and incorporates an optimization loop to improve the fit of modeled articulatory trajectories to the observed ones. Using an analysis-by-synthesis approach, the updated implementation can be utilized: (1) to refine the hypothesized speaker-general gestural parameters (target, stiffness) for individual speakers; (2) to test different degrees of temporal overlap among multiple gestures, such as those in a CCVC syllable. [Work supported by NSF, Grant 1908865.]
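For readers unfamiliar with the underlying gesture dynamics, here is a minimal sketch of the standard critically damped point-attractor equation that Task Dynamics assigns to a single gesture, with target and stiffness as the tunable parameters mentioned above. The numeric values, the simple Euler integration, and the omission of gesture blending and speaker-specific forward models are simplifying assumptions, not the updated implementation described in the abstract.

```python
# Illustrative point-attractor dynamics for one gesture (unit mass, critical damping):
#   z'' = -k (z - target) - b z',  with b = 2 * sqrt(k)
import numpy as np

def simulate_gesture(target, stiffness, z0=10.0, duration=0.3, dt=0.001):
    """Return the tract-variable trajectory for a single, non-overlapping gesture."""
    b = 2.0 * np.sqrt(stiffness)      # critical damping: approach the target without overshoot
    z, v = z0, 0.0
    trajectory = []
    for _ in range(int(duration / dt)):
        a = -stiffness * (z - target) - b * v
        v += a * dt
        z += v * dt
        trajectory.append(z)
    return np.array(trajectory)

# Example: a constriction moving from 10 mm toward a 2 mm target.
traj = simulate_gesture(target=2.0, stiffness=400.0)
```

Fitting such modeled trajectories to articulator positions recovered from real-time MRI is, in spirit, what the optimization loop described above adjusts the target and stiffness parameters to do for each individual speaker.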
- Award ID(s): 1908865
- PAR ID: 10475755
- Publisher / Repository: American Institute of Physics
- Date Published:
- Journal Name: The Journal of the Acoustical Society of America
- Volume: 150
- Issue: 4_Supplement
- ISSN: 0001-4966
- Page Range / eLocation ID: A188 to A189
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
A Continuous Articulatory Gesture Based Liveness Detection for Voice Authentication on Smart Devices
Voice biometrics is drawing increasing attention for user authentication on smart devices. However, voice biometrics is vulnerable to replay attacks, in which adversaries try to spoof voice authentication systems using pre-recorded voice samples collected from genuine users. To this end, we propose VoiceGesture, a liveness detection solution for voice authentication on smart devices such as smartphones and smart speakers. Taking advantage of audio hardware advances on smart devices, VoiceGesture uses the built-in speaker and microphone pair as a Doppler radar to sense articulatory gestures for liveness detection during voice authentication. Experiments with 21 participants and different smart devices show that VoiceGesture achieves over 99% and around 98% detection accuracy for text-dependent and text-independent liveness detection, respectively. Moreover, VoiceGesture is robust to different device placements and low audio sampling frequencies, and it supports medium-range liveness detection on smart speakers in various use scenarios, including smart homes and smart vehicles.
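As a hedged illustration of the Doppler-sensing idea that VoiceGesture builds on, the sketch below shows one way to quantify spectral energy displaced away from a near-ultrasound carrier in a microphone recording; articulator motion during live speech would shift energy into these sidebands, while a loudspeaker replaying a static recording would not. The carrier frequency, band widths, and the threshold-free energy ratio are assumptions for illustration and do not reproduce the published VoiceGesture pipeline.

```python
# Hypothetical Doppler-sideband measurement around an emitted carrier tone.
import numpy as np

FS = 48_000        # assumed sampling rate (Hz)
CARRIER = 20_000   # assumed near-ultrasound carrier played by the loudspeaker (Hz)

def doppler_energy_ratio(recorded, fs=FS, carrier=CARRIER, band_hz=200, guard_hz=20):
    """Energy in Doppler sidebands around the carrier, relative to the carrier itself."""
    windowed = recorded * np.hanning(len(recorded))
    spectrum = np.abs(np.fft.rfft(windowed))
    freqs = np.fft.rfftfreq(len(recorded), d=1.0 / fs)
    carrier_bins = np.abs(freqs - carrier) <= guard_hz
    sideband_bins = (np.abs(freqs - carrier) > guard_hz) & (np.abs(freqs - carrier) <= band_hz)
    return spectrum[sideband_bins].sum() / (spectrum[carrier_bins].sum() + 1e-9)
```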
Miller, B.; Martin, C. (Eds.)
Assessment continues to be an important conversation point within Science, Technology, Engineering, and Mathematics (STEM) education scholarship and practice (Krupa et al., 2019; National Research Council, 2001). There are guidelines for developing and evaluating assessments (e.g., AERA et al., 2014; Carney et al., 2022; Lavery et al., 2019; Wilson & Wilmot, 2019). There are also Standards for Educational & Psychological Testing (Standards; AERA et al., 2014) that discuss important relevant frameworks and information about using assessment results and interpretations. Quantitative assessments are used as part of daily STEM instruction, STEM research, and STEM evaluation; therefore, having robust assessments is necessary (National Research Council, 2001). An aim of this editorial is to give readers a few relevant ideas about modern assessment research, to offer some guidance for the use of quantitative assessments, and to frame validation and assessment research as equity-forward work.
Expectation is a powerful mechanism in native-language processing. Less is known about its role in non-native language processing, especially for expectations at the discourse level. This study presents evidence from a story-continuation task, adapted from previous work with native speakers (Rohde et al., 2006), probing next-mention and coherence expectations among Japanese- and Korean-speaking learners of English. As in previous work, verbal aspect (perfective/imperfective) in a context sentence describing a transfer-of-possession event (e.g., Ron gave/was giving a towel to Patrick) modulated participants' choices of next referents in their continuations. However, this effect was diminished in the non-native compared to the native-speaker group, despite comparable performance on an independent task assessing knowledge of verbal aspect in English, and previous evidence for significant effects of aspect on referential patterns in native Japanese and Korean processing (Ueno & Kehler, 2010; Kim et al., 2013). The two groups of speakers were equally sensitive to a cue that does not require predictive processing, the referential form of the story-continuation prompt, in that both groups were significantly more likely to establish reference to the discourse topic/Source of the transfer event for pronoun-initial continuations than for name-initial ones. Moreover, recency played a stronger role in non-native speakers' referential choices than in those of native speakers. These results suggest that while native speakers engage in proactive discourse processing, non-native speakers are less able to do so, being sufficiently burdened by the reactive processes required for information integration that they have only a Reduced Ability to Generate Expectations (RAGE).
Tiede, Mark; Whalen, Doug; Gracco, Vincent (Eds.)
This paper investigates the relative timing of onset consonant and vowel gestures in Tibetan as spoken in the Tibetan diaspora. According to the coupled oscillator model of articulatory timing (Browman & Goldstein 2000, Nam & Saltzman 2003), the most readily available coupling modes among gestures are in-phase (synchronous) or anti-phase (sequential) timing, with competition among these modes also giving rise to a stable timing pattern. The model predicts that other timing relations, i.e., “eccentric timing”, are possible but not as readily available. Data gathered using electromagnetic articulography (EMA) show relative C-V timing consistent with either competitive coupling or eccentric timing. Competitive coupling is a plausible explanation for CV syllables in a tone language (Gao 2008), but acoustic analysis showed that some speakers do not produce a pitch contrast corresponding to tone. In the apparent absence of a tone gesture, we conclude that these speakers exhibit eccentric C-V timing.
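To make the coupling modes discussed in that abstract concrete, the sketch below implements a small competitive-coupling scenario in the spirit of Nam & Saltzman (2003): each onset consonant's planning oscillator is coupled in-phase with the vowel and anti-phase with the other consonant, and the system relaxes to a compromise relative phase. The sine-based phase dynamics, unit coupling strengths, and integration settings are illustrative assumptions rather than the model used in the paper.

```python
# Phase dynamics for coupled planning oscillators: each oscillator i is pulled
# toward a target relative phase targets[i][j] with respect to oscillator j.
import numpy as np

def settle_phases(coupling, targets, steps=20000, dt=0.001, omega=2 * np.pi):
    """Integrate d(phi_i)/dt = omega + sum_j a_ij * sin(phi_j - phi_i + psi_ij)."""
    n = len(coupling)
    rng = np.random.default_rng(0)
    phi = rng.uniform(0, 2 * np.pi, n)
    for _ in range(steps):
        dphi = np.full(n, omega)
        for i in range(n):
            for j in range(n):
                if coupling[i][j]:
                    dphi[i] += coupling[i][j] * np.sin(phi[j] - phi[i] + targets[i][j])
        phi += dphi * dt
    return phi % (2 * np.pi)

# Oscillators: [C1, C2, V]. C1 and C2 are in-phase (0) with V but anti-phase (pi)
# with each other, so their settled phases end up straddling the vowel symmetrically.
coupling = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
targets = [[0, np.pi, 0], [np.pi, 0, 0], [0, 0, 0]]
phases = settle_phases(coupling, targets)
```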