Title: Joint word segmentation and phonetic category induction
We describe a model that jointly performs word segmentation and induces vowel categories from formant values. Vowel induction performance improves slightly over a baseline model that does not segment; segmentation performance decreases slightly from a baseline using entirely symbolic input. Our high joint performance in this idealized setting implies that problems in unsupervised speech recognition reflect the phonetic variability of real speech sounds in context.
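The paper's model is a joint Bayesian model, but the vowel-induction half can be illustrated in isolation: given idealized (F1, F2) formant pairs, cluster them into vowel categories. The sketch below is an assumption-laden stand-in using plain k-means (the function name, the choice of k-means, and all parameters are illustrative, not from the paper):

```python
import numpy as np

def induce_vowel_categories(formants, k, iters=50, seed=0):
    """Toy vowel-category induction: cluster (F1, F2) formant pairs
    with k-means. Illustrative only; the paper's model induces
    categories jointly with word segmentation."""
    rng = np.random.default_rng(seed)
    X = np.asarray(formants, dtype=float)
    # Initialize centers from k distinct data points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each formant pair to its nearest center.
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Move each center to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return labels, centers
```

On well-separated synthetic formants (e.g. /i/-like points near (300, 2300) Hz and /a/-like points near (700, 1200) Hz), such a clusterer recovers the two categories; the interesting question in the paper is how much joint segmentation helps beyond this baseline.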
Award ID(s):
1421695
PAR ID:
10057882
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the conference - Association for Computational Linguistics. Meeting
ISSN:
0736-587X
Page Range / eLocation ID:
59-65
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We propose a system for the lateral transfer of information from end-to-end neural networks that recognize articulatory feature classes to similarly structured networks that recognize phone tokens. The system connects recurrent layers of feature detectors pre-trained on a base language to recurrent layers of a phone recognizer for a different target language, an approach inspired primarily by the progressive neural network scheme. Initial experiments used detectors trained on Bengali speech for four articulatory feature classes—consonant place, consonant manner, vowel height, and vowel backness—attached to phone recognizers for four other Asian languages (Javanese, Nepali, Sinhalese, and Sundanese). While the results do not yet show consistent performance improvements across different low-resource settings for target languages, irrespective of their genealogical or phonological relatedness to Bengali, they do suggest the need for further trials with different language sets, altered data sources and data configurations, and slightly altered network setups.
  2. Purpose: Delayed auditory feedback (DAF) interferes with speech output. DAF causes distorted and disfluent productions and errors in the serial order of produced sounds. Although DAF has been studied extensively, the specific patterns of elicited speech errors are somewhat obscured by relatively small speech samples, differences across studies, and uncontrolled variables. The goal of this study was to characterize the types of serial order errors that increase under DAF in a systematic syllable sequence production task, which used a closed set of sounds and controlled for speech rate. Method: Sixteen adult speakers repeatedly produced CVCVCV (C = consonant, V = vowel) sequences, paced to a “visual metronome,” while hearing self-generated feedback with delays of 0–250 ms. Listeners transcribed recordings, and speech errors were classified based on the literature surrounding naturally occurring slips of the tongue. A series of mixed-effects models were used to assess the effects of delay for different error types, for error arrival time, and for speaking rate. Results: DAF had a significant effect on the overall error rate for delays of 100 ms or greater. Statistical models revealed significant effects (relative to zero delay) for vowel and syllable repetitions, vowel exchanges, vowel omissions, onset disfluencies, and distortions. Serial order errors were especially dominated by vowel and syllable repetitions. Errors occurred earlier on average within a trial for longer feedback delays. Although longer delays caused slower speech, this effect was mediated by the run number (time in the experiment) and small compared with those in previous studies. Conclusions: DAF drives a specific pattern of serial order errors. The dominant pattern of vowel and syllable repetition errors suggests possible mechanisms whereby DAF drives changes to the activity in speech planning representations, yielding errors. 
These mechanisms are outlined with reference to the GODIVA (Gradient Order Directions Into Velocities of Articulators) model of speech planning and production. Supplemental Material: https://doi.org/10.23641/asha.19601785 
  3. Recent advancements in deep learning have led to drastic improvements in speech segregation models. Despite their success and growing applicability, few efforts have been made to analyze the underlying principles that these networks learn in order to perform segregation. Here we analyze the role of harmonicity in two state-of-the-art deep neural network (DNN)-based models: Conv-TasNet and DPT-Net [1], [2]. We evaluate their performance on mixtures of natural speech versus slightly manipulated inharmonic speech, in which the harmonics are slightly frequency-jittered. We find that performance deteriorates significantly if even one source is slightly harmonically jittered; for example, an imperceptible 3% harmonic jitter degrades the performance of Conv-TasNet from 15.4 dB to 0.70 dB. Training the model on inharmonic speech does not remedy this sensitivity and instead results in worse performance on natural speech mixtures, making inharmonicity a powerful adversarial factor in DNN models. Furthermore, additional analyses reveal that DNN algorithms deviate markedly from biologically inspired algorithms [3], which rely primarily on timing cues rather than harmonicity to segregate speech.
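The core manipulation — displacing each partial of a harmonic sound by a small random amount — can be sketched on a synthetic harmonic complex. This is a minimal illustration, assuming jitter is drawn uniformly within ±jitter·f0 per partial; the paper's exact jitter procedure on natural speech may differ:

```python
import numpy as np

def harmonic_complex(f0, n_harmonics, dur, sr=16000, jitter=0.0, rng=None):
    """Synthesize a harmonic complex tone. With jitter=0 the partials sit
    at exact multiples of f0; jitter=0.03 displaces each partial by a
    random offset of up to +/-3% of f0, the kind of small inharmonicity
    described above."""
    rng = rng if rng is not None else np.random.default_rng(0)
    t = np.arange(int(dur * sr)) / sr
    sig = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        fk = k * f0 + rng.uniform(-jitter, jitter) * f0  # jittered partial
        sig += np.sin(2 * np.pi * fk * t)
    return sig / n_harmonics  # normalize so |sig| <= 1
```

Comparing a separation model's output on `harmonic_complex(f0, ..., jitter=0.0)` versus `jitter=0.03` mixtures is the spirit of the test reported above.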
  4. Abstract

    Computational models of infant word‐finding typically operate over transcriptions of infant‐directed speech corpora. It is now possible to test models of word segmentation on speech materials, rather than transcriptions of speech. We propose that such modeling efforts be conducted over the speech of the experimental stimuli used in studies measuring infants' capacity for learning from spoken sentences. Correspondence with infant outcomes in such experiments is an appropriate benchmark for models of infants. We demonstrate such an analysis by applying the DP‐Parse model of Algayres and colleagues to auditory stimuli used in infant psycholinguistic experiments by Pelucchi and colleagues. The DP‐Parse model takes speech as input, and creates multiple overlapping embeddings from each utterance. Prospective words are identified as clusters of similar embedded segments. This allows segmentation of each utterance into possible words, using a dynamic programming method that maximizes the frequency of constituent segments. We show that DP‐Parse mimics American English learners' performance in extracting words from Italian sentences, favoring the segmentation of words with high syllabic transitional probability. This kind of computational analysis over actual stimuli from infant experiments may be helpful in tuning future models to match human performance.
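DP‐Parse itself operates on speech embeddings with dynamic programming, but the syllabic‐transitional‐probability principle it is compared against can be illustrated symbolically: estimate P(next syllable | current syllable) from a corpus and hypothesize word boundaries where that probability dips. The function names and threshold below are illustrative assumptions, not part of DP‐Parse:

```python
from collections import Counter

def transitional_probs(corpus):
    """P(next syllable | current syllable) from a syllabified corpus,
    where each utterance is a list of syllable strings."""
    firsts, pairs = Counter(), Counter()
    for utt in corpus:
        firsts.update(utt[:-1])          # syllables in "current" position
        pairs.update(zip(utt, utt[1:]))  # adjacent syllable bigrams
    return {p: n / firsts[p[0]] for p, n in pairs.items()}

def segment(utterance, tps, threshold=0.75):
    """Hypothesize a word boundary wherever the transitional probability
    between adjacent syllables falls below the threshold."""
    words, current = [], [utterance[0]]
    for a, b in zip(utterance, utterance[1:]):
        if tps.get((a, b), 0.0) < threshold:
            words.append(current)  # low TP: close off the current word
            current = []
        current.append(b)
    words.append(current)
    return words
```

In a Saffran-style toy corpus where within-word syllable pairs always co-occur (TP = 1) and cross-word pairs vary (TP &lt; 1), this recovers word boundaries at the low-TP junctures, which is the statistical cue the abstract says infants (and DP‐Parse) exploit.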

     
  5.
    This paper analyzes the musical surrogate encoding of Seenku (Mande, Burkina Faso) syllable structure on the balafon, a resonator xylophone used by the Sambla ethnicity. The elements of syllable structure that are encoded include vowel length, sesquisyllabicity, diphthongs, and nasal codas. Certain elements, like vowel length and sesquisyllabicity, involve categorical encoding through conscious rules of surrogate speech, while others, like diphthongs and nasal codas, vary between being treated as simple or complex. Beyond these categorical encodings, subtler aspects of rhythmic structure find their way into the speech surrogate through durational differences; these include duration differences from phonemic distinctions like vowel length in addition to subphonemic differences due to phrasal position. I argue that these subconscious durational differences arise from a “phonetic filter”, which mediates between the musician’s inner voice and their non-verbal behavior. Specifically, syllables encoded on the balafon may be timed according to the perceptual center (p-center) of natural spoken rhythm, pointing to a degree of phonetic detail in a musician’s inner speech. 