Title: Autosegmental Neural Nets: Should Phones and Tones be Synchronous or Asynchronous?
Phones, the segmental units of the International Phonetic Alphabet (IPA), are used for lexical distinctions in most human languages; tones, the suprasegmental units of the IPA, are used in perhaps 70%. Many previous studies have explored cross-lingual adaptation of automatic speech recognition (ASR) phone models, but few have explored the multilingual and cross-lingual transfer of synchronization between phones and tones. In this paper, we test four Connectionist Temporal Classification (CTC)-based acoustic models, differing in the degree of synchrony they impose between phones and tones. Models are trained and tested multilingually in three languages, then adapted and tested cross-lingually in a fourth. Both synchronous and asynchronous models are effective in both multilingual and cross-lingual settings. Synchronous models achieve a lower error rate in the joint phone+tone tier, but asynchronous training results in a lower tone error rate.
Award ID(s):
1910319
NSF-PAR ID:
10273578
Author(s) / Creator(s):
;
Date Published:
Journal Name:
Interspeech
Page Range / eLocation ID:
1027 to 1031
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Only a handful of the world’s languages have the abundant resources that enable practical applications of speech processing technologies. One of the methods to overcome this problem is to use the resources existing in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how recognition of individual phones improves in a multilingual setting. To that end, we select a phonetically diverse set of languages, and perform a series of monolingual, multilingual and crosslingual (zero-shot) experiments. The ASR is trained to recognize International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, considers Javanese as a tone language. Notably, as little as 10 hours of target-language training data tremendously reduces ASR error rates. Our analysis uncovered that even the phones that are unique to a single language can benefit greatly from adding training data from other languages, an encouraging result for the low-resource speech community.
  2. People who grow up speaking a language without lexical tones typically find it difficult to master tonal languages after childhood. Accumulating research suggests that much of the challenge for these second language (L2) speakers has to do not with identification of the tones themselves, but with the bindings between tones and lexical units. The question that remains open is how much of these lexical binding problems are problems of encoding (incomplete knowledge of the tone-to-word relations) vs. retrieval (failure to access those relations in online processing). While recent work using lexical decision tasks suggests that both may play a role, one issue is that failure on a lexical decision task may reflect a lack of learner confidence about what is not a word, rather than non-native representation or processing of known words. Here we provide complementary evidence using a picture-phonology matching paradigm in Mandarin in which participants decide whether or not a spoken target matches a specific image, with concurrent event-related potential (ERP) recording to provide potential insight into differences in L1 and L2 tone processing strategies. As in the lexical decision case, we find that advanced L2 learners show a clear disadvantage in accurately identifying tone-mismatched targets relative to vowel-mismatched targets. We explore the contribution of incomplete or uncertain lexical knowledge to this performance disadvantage by examining individual data from an explicit tone knowledge post-test. Results suggest that explicit tone word knowledge and confidence explain some but not all of the errors in picture-phonology matching. Analysis of ERPs from correct trials shows some differences in the strength of L1 and L2 responses, but does not provide clear evidence of differences in processing that could explain the L2 disadvantage for tones. In sum, these results converge with previous evidence from lexical decision tasks in showing that advanced L2 listeners continue to have difficulties with lexical tone recognition, and in suggesting that these difficulties reflect problems both in encoding lexical tone knowledge and in retrieving that knowledge in real time.
  3. Period-doubled voice consists of two alternating periods with multiple frequencies and is often perceived as rough with an indeterminate pitch. Past pitch-matching studies in period-doubled voice found that the perceived pitch was lower as the degree of amplitude and frequency modulation between the two alternating periods increased. The perceptual outcome also differed across f0s and modulation types: a lower f0 prompted earlier identification of a lower pitch, and the matched pitch dropped more quickly in frequency- than amplitude-modulated tokens (Sun & Xu, 2002; Bergan & Titze, 2001). However, it is unclear how listeners perceive period doubling when identifying linguistic tones. In an artificial language learning paradigm, this study used resynthesized stimuli with alternating amplitudes and/or frequencies of varying degrees, based on a production study of period-doubled voice (Huang, 2022). Listeners were native speakers of English and Mandarin. We confirm the positive relationship between the modulation degree and the proportion of low tones heard, and find that frequency modulation biased listeners to choose more low-tone options than amplitude modulation. However, a higher f0 (300 Hz) leads to a low-tone percept in more amplitude-modulated tokens than a lower f0 (200 Hz). Both English and Mandarin listeners behaved similarly, suggesting that pitch perception during period doubling is not language-specific. Furthermore, period doubling is predicted to signal low tones in languages, even when the f0 is high. 
  4. Lexical tones are widely believed to be a formidable learning challenge for adult speakers of nontonal languages. While difficulties, as well as rapid improvements, are well documented for beginning second language (L2) learners, research with more advanced learners is needed to understand how tone perception difficulties impact word recognition once learners have a substantial vocabulary. The present study narrows in on difficulties suggested in previous work, which found a dissociation in advanced L2 learners between highly accurate tone identification and largely inaccurate lexical decision for tone words. We investigate a “best-case scenario” for advanced L2 tone word processing by testing performance in nearly ideal listening conditions, with words spoken clearly and in isolation. Under such conditions, do learners still have difficulty in lexical decision for tone words? If so, is it driven by the quality of lexical representations or by L2 processing routines? Advanced L2 and native Chinese listeners made lexical decisions while an electroencephalogram was recorded. Nonwords had a first syllable with either a vowel or tone that differed from that of a common disyllabic word. As a group, L2 learners performed less accurately when tones were manipulated than when vowels were manipulated. Subsequent analyses showed that this was the case even in the subset of items for which learners showed correct and confident tone identification in an offline written vocabulary test. Event-related potential results indicated N400 effects for both nonword conditions in L1, but only vowel N400 effects in L2, with tone responses intermediate between those of real words and vowel nonwords. These results are evidence of the persistent difficulty most L2 learners have in using tones for online word recognition, and indicate it is driven by a confluence of factors related to both L2 lexical representations and processing routines. We suggest that this tone nonword difficulty has real-world implications for learners: It may result in many toneless word representations in their mental lexicons, and is likely to affect the efficiency with which they can learn new tone words.
  5. Objective: Over the past decade, we developed and studied a face-to-face video-based analysis-of-practice professional development (PD) model. In a cluster randomized trial, we found that the face-to-face model enhanced elementary science teacher knowledge and practice and resulted in important improvements to student science achievement (student treatment effect, d = 0.52; Taylor et al., 2017; Roth et al., 2018). The face-to-face PD model is expensive and difficult to scale. In this paper, we present the results of a two-year design-based research study to translate the face-to-face PD into a facilitated online PD experience. The purpose is to create an effective, flexible, and cost-efficient PD model that will reach a broader audience of teachers. Perspective/Theoretical Framework: The face-to-face PD model is grounded in situated cognition and cognitive apprenticeship frameworks. Teachers engage in learning science content and effective science teaching practices in the context in which they will be teaching. There are scaffolded opportunities for teachers to learn from analysis of model videos by experienced teachers, to try teaching model units, to analyze video of their own teaching efforts, and ultimately to develop their own unit, with guidance. The PD model attends to the key features of effective PD as described by Desimone (2009) and others. We adhered closely to the design principles of the face-to-face model as described by Authors, 2019. Methods: We followed a design-based research approach (DBR; Cobb et al., 2003; Shavelson et al., 2003) to examine the online program components and how they promoted or interfered with the development of teachers’ knowledge and reflective practice. Of central interest was the examination of mechanisms for facilitating teacher learning (Confrey, 2006). To accomplish this goal, design researchers engaged in iterative cycles of problem analysis, design, implementation, examination, and redesign (Wang & Hannafin, 2005) in phase one of the project before studying its effect. Data: Three small pilot groups of teachers engaged in both synchronous and asynchronous components of the larger online course, which began implementation with a 10-week summer course that leads into study groups of participants meeting through one academic year. We iteratively designed, tested, and revised 17 modules across three pilot versions. On average, pilot groups completed one module every two weeks. Pilot 1 began the work in May 2019, Pilot 2 began in August 2019, and Pilot 3 began in October 2019. Pilot teachers responded to surveys and took part in interviews related to the PD. The PD facilitators took extensive notes after each iteration. The development team met weekly to discuss revisions. We revised all modules between each pilot group and used what we learned to inform our development of later modules within each pilot. For example, we applied what we learned from testing Module 3 with Pilot 1 to the development of Module 3 for Pilot 2, and also applied what we learned from Module 3 with Pilot 1 to the development of Module 7 for Pilot 1. Results: We found that community building required the same incremental trust-building activities that occur in face-to-face PD. Teachers began with low-risk activities and gradually engaged in activities that required greater vulnerability (sharing a video of themselves teaching a model unit for analysis and critique by the group). We also identified how to contextualize technical tools with instructional prompts to allow teachers to productively interact with one another about science ideas asynchronously. As part of that effort, we crafted crux questions to surface teachers’ confusions or challenges related to content or pedagogy. We called them crux questions because they revealed teachers’ uncertainty and deepened learning during the discussion. Facilitators leveraged asynchronous responses to crux questions in the synchronous sessions to push teacher thinking further than would have otherwise been possible in a 2-hour synchronous video-conference. Significance: Supporting teachers with effective, flexible, and cost-efficient PD is difficult under the best of circumstances. In the era of COVID-19, online PD has taken on new urgency. NARST members will gain insight into the translation of an effective face-to-face PD model to an online environment.