<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>The Influence of Tone on the Alignment of Speech and Co-Speech Gesture</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2022 May</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10354564</idno>
					<idno type="doi">doi: 10.21437/SpeechProsody.2022-63</idno>
					<title level='j'>Proceedings of Speech Prosody</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Kathryn Franich and Hermann Keupdjio</author><author>Chairs: Sónia Frota and Marina Vigário</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Evidence continues to accrue suggesting that co-speech gestures form an integrated part of the prosodic system of languages. Several studies have highlighted a tight link between the timing of gestures of the hands and head with syllables bearing prosodic prominence. Most work to date has examined this relationship in Indo-European languages, where gestures appear to be crucially timed with respect to pitch-accented syllables. Less work has examined the timing of co-speech gestures in tonal languages, where pitch plays quite a different role within the phonological system. Here, we examine the influence of tone on the timing of manual co-speech gestures in Medʉmba, a Grassfields Bantu language spoken in Cameroon. We investigate 1) whether certain tones are more likely than others to associate with manual gestures in the language; and 2) whether the fine timing of the speech-gesture relationship is influenced by the tone or relative f0 of the syllable it co-occurs with. Our findings indicated no preference for any one tone to occur with co-speech gestures. However, gesture apexes were found to align significantly later with respect to the accompanying syllable's vowel for low-toned syllables as compared with syllables of other tones.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.1.">Speech, Gesture, and Prominence</head><p>Recent work on the timing of co-speech gestures in several Indo-European languages points to the close link between speech and co-speech gesture as parallel modalities through which elements of prosodic structure and pragmatic information are expressed <ref type="bibr">[1,</ref><ref type="bibr">2,</ref><ref type="bibr">3,</ref><ref type="bibr">4,</ref><ref type="bibr">5,</ref><ref type="bibr">6]</ref>. In particular, a number of studies have now shown that gestures of the hands and head are preferentially timed to co-occur with prosodically 'prominent' events in speech, such as pitch-accented syllables, in languages such as English <ref type="bibr">[4,</ref><ref type="bibr">7,</ref><ref type="bibr">3,</ref><ref type="bibr">5]</ref>, French <ref type="bibr">[8]</ref>, Italian <ref type="bibr">[9]</ref>, and Catalan <ref type="bibr">[10]</ref>. Not only are pitch-accented syllables more likely to align with gestures than non-accented syllables, but the precise timing of speech and manual gesture has been argued to revolve around the relative timing of the gestural apex (the point of maximum extension of the articulators, e.g. the fingers) and the aligning vowel's f0 peak <ref type="bibr">[3,</ref><ref type="bibr">7]</ref>, though other speech-based landmarks, such as the vowel onset or perceptual center, have also been posited <ref type="bibr">[7]</ref>. More recently, Im &amp; Baumann <ref type="bibr">[11]</ref> have identified a probabilistic relationship between specific pitch accents (e.g. L+H*, H*, !H*, and L*) and the likelihood of gesture co-occurrence in English. These findings mirror earlier findings by Baumann &amp; R&#246;hr <ref type="bibr">[12]</ref> which suggest that pitch accents can be ordered in their relative prominence based on their pitch attributes. Based on a recent cross-linguistic investigation of perceived prominence, Cole et al. 
<ref type="bibr">[13]</ref> suggest that some aspects of the acoustic signal, including higher peak f0, may be interpreted as prominence-lending regardless of the relationship between pitch patterns, structural prominence, and pragmatic function in a given language. Findings from this study and others also indicate that increased intensity and duration are associated with perceived prominence cross-linguistically; these two variables have also been found to correlate with co-speech gesture presence <ref type="bibr">[14,</ref><ref type="bibr">15]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.2.">Gesture and f0 in a Tone Language</head><p>Less is known about the constraints on the alignment of co-speech gesture in languages in which tone plays a lexically contrastive role, and where the relationship between pitch and prominence is less straightforward. Aside from informing the typology of speech-gesture relations more broadly, understanding the relationship between tone, pitch, and gesture in tone languages has the potential to inform our understanding of the nature of prominence itself, and the relationship between articulatory constraints at the level of speech and gesture. For example, the results from Cole et al. suggest that, in spite of its primary role in cuing lexical contrasts, tone and/or f0 might still have a role to play in cuing prominence in tonal languages; by extension, we might also expect that co-speech gestures will be more likely to coincide with syllables which bear high tones or relatively higher f0. Furthermore, recent work by Pouw et al. <ref type="bibr">[16]</ref> has suggested that the link between co-speech gesture and higher f0 (as well as higher amplitude) may be driven by biomechanical factors leading to coupling of manual and articulatory movements. Such a link might be expected to be found across many different languages, perhaps even universally.</p><p>In the present work, we seek to understand whether a language which utilizes lexical tone may show a similar bias towards certain tonal and pitch patterns being aligned to co-speech gestures. Specifically, we investigate the alignment of co-speech gestures in Med&#649;mba, a Grassfields Bantu language spoken in Cameroon, in naturally-occurring conversational contexts. Similar to other Bantu languages, it has a two-tone system, with both high (H) and low (L) tones. 
Rising and falling 'contour' tones can also arise through a variety of morphological and syntactic processes (see <ref type="bibr">[17]</ref> for further detail) but are reducible to sequences of high + low (for falling) and low + high (for rising) (1).</p><p>(1) H tone: m&#603;&#769;n 'child'; L tone: m&#603;&#768;n 'person/someone'; HL falling: m&#603;&#770;n s&#225;&#331;&#601; 'chief's child'; LH rising: m&#603;&#780;n n&#593;&#769; 'that person'</p><p>Table <ref type="table">1</ref>: Four tonal patterns found in Med&#649;mba</p><p>If higher pitch is associated with greater prominence for Med&#649;mba speakers, as was found for the languages investigated by Cole et al., then we might expect that gestures would be more likely to align with high tones, or with vowels with higher f0, than with low tones, or vowels with lower f0, in the language. Similarly, given that rises have been identified as more perceptually prominent than falls <ref type="bibr">[12]</ref>, it may be that syllables bearing LH rising tones will be more likely to occur with gestures in Med&#649;mba than those bearing HL falling contours or level H or L tones.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.3.">Tone, f0, and Timing</head><p>Aside from the broad tendency for certain pitch accent types to associate with co-speech gestures cross-linguistically, the possibility also exists that more subtle patterns in timing will arise based on the tone or fundamental frequency of vowels. For example, size of pitch excursion has been found to influence perceived prominence (with larger excursion size associated with greater perceived prominence) <ref type="bibr">[18]</ref>, and therefore may also exert an influence on co-speech gesture timing. Previous work has also shown that tone influences the perceptual center, or 'perceived moment of occurrence' <ref type="bibr">[19]</ref>, of a syllable in Med&#649;mba, such that p-centers occur later in low-toned syllables than in high-toned syllables <ref type="bibr">[20]</ref>. This pattern is thought to relate to the tendency for low tones to have pitch contours which are slightly falling in certain positions, leading to the illusion that their vowels are somewhat longer than those found with high tones (vowel duration is one factor which has been found to influence p-center location; <ref type="bibr">[21]</ref>). Finally, as mentioned previously, pitch peak has been argued to be the acoustic landmark to which gesture apexes align most closely in some languages; if this is also the case in Med&#649;mba, then we would expect that the timing of the pitch accent peak would predict gesture timing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.4.">Other Acoustic Variables</head><p>In addition to tone and f0, we also examine how other acoustic variables, including vowel intensity and duration, predict gesture occurrence. Both of these variables have been found to associate with production of lexical stress, phrase-level accent, and contrastive focus across a number of languages <ref type="bibr">[22,</ref><ref type="bibr">23]</ref>, making them good candidates as predictors of gesture alignment.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data Collection</head><p>Results are drawn from a corpus of naturally-occurring speech from four Med&#649;mba speakers (two identifying as men and two as women) collected through interviews in Bangangt&#233;, Cameroon, in January of 2020. Participants responded to a series of questions about local customs around major events such as marriage ceremonies or the birth of a child. Participants were video- and audio-recorded in a quiet room on a Zoom Q8 recorder positioned approximately 2.5 meters from the participant. A separate time-aligned audio track was recorded of each participant's speech with an AKG C520 head-mounted microphone. The camera captured the participant's head, upper body, and lap. Participants were recorded for an average of 45 minutes. The data analyzed in the present work are based on 4565 vowels, 669 of which aligned to a gesture. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Data Annotation and Preparation</head><p>Audio data were transcribed and glossed by the second author and subject to forced alignment using FAVE-align <ref type="bibr">[24]</ref>; alignment was subsequently checked for accuracy and adjusted as necessary by the first author. Video data were annotated for several manual gestural landmarks by trained student annotators from the University of Delaware according to the MIT gesture studies coding manual <ref type="bibr">[24]</ref>. Gestures were annotated with the sound muted so as to avoid possible bias from the audio signal. Landmarks included gesture preparation, stroke, hold, and recovery. In addition, the point of peak velocity (the period within the stroke in which the hands moved with greatest velocity) was coded based on the visualized extent of change in position of the hands between video frames and the amount of blurring of the hands (Figure <ref type="figure">1</ref>). Peak velocity of the hands has been found to immediately precede the timing of the gesture apex <ref type="bibr">[7]</ref> (typically the point of maximum extension of the fingers; Figure <ref type="figure">1</ref>) and was a more viable landmark to use for apex calculation than the point of maximum extension, which was harder to assess consistently through video data. We henceforth refer to this landmark as the gestural 'apex.' Inter-annotator reliability was achieved by having annotators work in pairs and checking to ensure that annotations between partners differed by no more than one 30 ms video frame. Annotated apexes were time-aligned to the speech signal in Praat <ref type="bibr">[25]</ref>. Given the existing cross-linguistic evidence that vowels, rather than syllable onsets, more closely approximate the timing of gesture apexes, the time between each manual apex and the onset of the temporally closest vowel was calculated. 
Only results for monosyllabic words, which accounted for around 70% of the gesture-aligned data overall, were considered here. Gestures occurring more than two standard deviations (around 200 ms) from a vowel were considered outliers and excluded. Target vowels were then analyzed for several f0-related variables using ProsodyPro <ref type="bibr">[26]</ref>. We note that all gesture types were considered for the present study, including 'beat gestures,' and more meaningful iconic and deictic gestures, since all of these gesture types can potentially be associated with prosodic prominence <ref type="bibr">[27]</ref>.</p></div>
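<div xmlns="http://www.tei-c.org/ns/1.0"><p>The apex-to-vowel measure and the outlier criterion described above can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, "temporally closest vowel" is assumed here to mean the vowel whose onset is nearest in time to the apex, and the two-standard-deviation criterion is assumed to apply to deviation from the mean lag. Times are in seconds.</p>
<eg><![CDATA[
```python
from bisect import bisect_left
from statistics import mean, stdev


def nearest_onset_lag(apex, onsets):
    """Signed lag (apex minus onset) to the nearest vowel onset.

    `onsets` must be sorted in ascending order. A positive lag means
    the apex follows the vowel onset; a negative lag means it precedes it.
    Assumption: "closest vowel" = vowel with the nearest onset.
    """
    i = bisect_left(onsets, apex)
    # Only the onset just before and just after the apex can be nearest.
    candidates = onsets[max(i - 1, 0):i + 1]
    nearest = min(candidates, key=lambda t: abs(apex - t))
    return apex - nearest


def drop_outliers(lags, n_sd=2.0):
    """Keep lags within n_sd sample standard deviations of the mean lag.

    Assumption: the exclusion criterion is applied to the lag distribution.
    """
    m, sd = mean(lags), stdev(lags)
    return [x for x in lags if abs(x - m) <= n_sd * sd]


if __name__ == "__main__":
    vowel_onsets = [0.10, 0.50, 0.90]   # hypothetical vowel onsets (s)
    apexes = [0.05, 0.56, 1.20]         # hypothetical gesture apexes (s)
    for a in apexes:
        print(a, round(nearest_onset_lag(a, vowel_onsets), 3))
```
]]></eg>
<p>Under these assumptions, a lag distribution with a standard deviation of roughly 100 ms would yield the approximately 200 ms cutoff mentioned in the text.</p></div>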
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Proportion of Tones Aligning with Apexes</head><p>Table 2 below provides the percentages of each tone found among syllables aligned to apexes compared with the percentages of tones not aligned with apexes. A Chi-square test of independence revealed that there was no significant difference between proportions (&#967;&#178;(3, N = 5685) = .03, p = .99). Gesture presence vs. absence was modeled using mixed effects logistic regression with the lme4 package in R statistical software. Predictor variables included Tone, Mean and Max f0 (converted from raw Hz to semitones), Mean Intensity, and Duration of vowels. Tone was treated as a categorical variable with four levels (High, Low, Rising, Falling) and sum-coded. The other variables were treated as continuous and mean-centered to avoid collinearity. Two-way interactions between Tone and each of the four continuous predictors were also included. By-subject random intercepts were included for all models. Results revealed that gesture presence was predicted to a significant degree by both vowel Intensity (B = 0.32, z = 3.39, p &lt; .001) and Duration (B = 0.24, z = 4.11, p &lt; .001) (Figures <ref type="figure">2</ref> and <ref type="figure">3</ref>). No significant effects were found for Tone, Mean or Max f0, or any of the interactions (ps &gt; .05), though there was a marginal effect of Mean f0 (B = -0.63, z = -1.76, p = .08); interestingly, vowels aligned to gestures trended toward having lower mean f0, rather than higher f0.</p><p>Six separate linear mixed effects models investigated the effects of tonal and pitch-related acoustic variables on the timing between gesture apexes and vowel onsets. These variables included Tone, treated as a categorical variable with four levels (High, Low, Rising, Falling), as well as Maximum f0, Minimum f0, f0 Excursion Size, Time to f0 Peak, and f0 Peak Velocity. Raw Hz values of f0 were converted to semitones. 
For all models, Vowel Duration was also included as a covariate.</p><p>Continuous predictors were all mean-centered and Tone was sum-coded. By-subject random intercepts were also included in all models.</p><p>Post-hoc comparisons were conducted using the emmeans package for R. Results revealed there was a significant effect of Tone on apex-to-vowel timing (Figure <ref type="figure">4</ref>): gestures coinciding with L-tone vowels had apexes which were initiated significantly later relative to vowel onsets than gestures coinciding with H-tone vowels (B = 14.78, t = 2.03, p &lt; .05) or than those associated with HL falling-tone vowels (B = 23.72, t = 2.61, p &lt; .01). There was no significant difference in timing between H, LH rising, and HL falling tones (ps &gt; .05). There was a significant effect of Vowel Duration, with longer vowels eliciting later apexes overall (B = 22.09, t = 7.56, p &lt; .001). No significant effects were found for Maximum f0, Minimum f0, f0 Excursion Size, Time to f0 Peak, or f0 Peak Velocity on apex-to-vowel timing (ps &gt; .05). A marginal effect of Minimum f0 was found.</p></div>
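<div xmlns="http://www.tei-c.org/ns/1.0"><p>Two of the computations reported in this section can be sketched in a few lines of Python, offered as an illustration rather than as the authors' analysis code. The Hz-to-semitone conversion uses the standard formula 12 log2(f / ref); the reference frequency is not stated in the text, so the 1 Hz default below is an assumption. The Chi-square helpers compute a test statistic from a contingency table and the upper-tail p-value for three degrees of freedom, which is what a table of gesture presence (two levels) by Tone (four levels) gives. All function names are hypothetical.</p>
<eg><![CDATA[
```python
import math


def hz_to_semitones(f0_hz, ref_hz=1.0):
    """Convert f0 in Hz to semitones relative to ref_hz.

    Assumption: the 1 Hz reference is illustrative; the paper does not
    state which reference was used.
    """
    return 12.0 * math.log2(f0_hz / ref_hz)


def chi2_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 df.

    Closed form: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2).
    """
    return (math.erfc(math.sqrt(x / 2.0))
            + math.sqrt(2.0 * x / math.pi) * math.exp(-x / 2.0))


if __name__ == "__main__":
    # One octave above the reference is 12 semitones.
    print(hz_to_semitones(200.0, ref_hz=100.0))
    # p-value implied by the reported statistic of .03 with 3 df.
    print(round(chi2_sf_df3(0.03), 4))
```
]]></eg>
<p>Evaluating the tail probability at the reported statistic of .03 gives a value just above .99, in line with the p-value reported above.</p></div>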
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Discussion</head><p>Our findings have revealed several patterns of interest in the alignment of speech and co-speech gesture in Med&#649;mba. In contrast with results from many Indo-European languages, co-speech gestures in Med&#649;mba are not preferentially aligned to vowels bearing higher f0, or to vowels bearing high or rising tones. Indeed, the proportions of the tones found among the vowels with gestures aligned to them were almost identical to those of vowels with no gestures aligned to them. If anything, the trending effect of Mean f0 found in Section 3.2 would seem to suggest that gestures are preferred to occur with lower f0, rather than higher f0. We can therefore conclude that there is no universal notion of pitch 'prominence' when it comes to the alignment of co-speech gesture across languages; the interpretation of pitch in relation to prominence appears to be conditioned by language-specific factors. We did, however, find that both Mean Intensity and Duration predicted gesture occurrence, in the same direction as has been found for other languages: vowels which were louder and longer were more likely to occur with gestures. Indeed, duration has been identified as a correlate of phrase-level prominence previously in Med&#649;mba <ref type="bibr">[27]</ref>, and though the specific prosodic and information-structural factors which influence intensity in the language are not yet clear, it is unsurprising that this variable should be associated with gesture occurrence.</p><p>Our findings also call into question the notion that the relationship between gesture and pitch is conditioned by universal biomechanical factors related to the coupling of manual and articulatory variables, although our results do support the proposal by Pouw et al. <ref type="bibr">[16]</ref> that manual gesture may entrain some aspects of phonation captured in the speech amplitude envelope. 
Though the lack of a significant relationship between f0 and gesture occurrence in our study may be attributable to the less constrained nature of our conversational sample (as compared with the more tightly-controlled experiments conducted by <ref type="bibr">Pouw et al.</ref>), the fact that our finding trended in the opposite direction from what is predicted from that study seems to suggest a genuine difference between Med&#649;mba and English in that regard. The finding that gesture is associated with increased intensity but marginally decreased f0 in Med&#649;mba is also consistent with work which has suggested that the relationship between intensity and f0 cannot be boiled down solely to increased subglottal pressure causing faster vocal fold vibration <ref type="bibr">[28,</ref><ref type="bibr">29]</ref>.</p><p>Our findings furthermore highlight the more subtle influence that tone has on speech-gesture timing. Specifically, gestures accompanying low-toned syllables were found to be initiated later with respect to the vowel onset as compared with those accompanying high- or falling-toned syllables. These findings mirror previous findings suggesting that the p-center of low-toned syllables in Med&#649;mba is later than that for high-toned syllables <ref type="bibr">[20]</ref>. The lateness of low-tone p-centers was previously attributed to the slight fall in f0 which occurs on low-toned syllables, particularly before pause. In light of this, it is interesting that falling-toned syllables patterned in the opposite way from low-toned syllables with respect to gesture alignment, showing quite early alignment (even before the vowel was initiated). This suggests that manual gestures in Med&#649;mba are not universally timed with respect to the syllable/vowel, but to some other landmark. Surprisingly, no other pitch-related landmarks, including Excursion Size and Time to f0 Peak, predicted gestural apex timing independent of Vowel Duration. 
Another possibility not explored here is that the Time to Peak Velocity of pitch movement could be an important factor which differentiates the two types of tones, as falling tones tend to demonstrate a steep fall shortly after the vowel is initiated, while low tones have a more gradual fall which occurs later in the vowel. Future work will need to explore this possibility.</p><p>There are, of course, many outstanding questions as to the nature of co-speech gesture alignment in Med&#649;mba. Evidence suggests that Med&#649;mba exhibits stem-initial prominence which is independent of tone <ref type="bibr">[29]</ref>; stem-initial syllables are likely good candidates for gesture alignment, and future work will need to investigate this relationship, as well as the influence of information-structural factors on gesture occurrence. Going forward, we propose that co-speech gesture may provide an important tool for investigating the notion of prominence cross-linguistically. Given the observed links between perceived prominence and gesture, this relationship could prove particularly helpful for understanding prominence in languages, such as Med&#649;mba, which lack canonical acoustic correlates of word stress, and where contrastive focus and other aspects of information structure are encoded through means other than acoustic marking.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Conclusions</head><p>This work presents a study of tone, f0, and other acoustic factors in conditioning the occurrence of co-speech gesture in Med&#649;mba. We have shown that the occurrence of manual gestures in Med&#649;mba is not strongly mediated by either tone or f0 of accompanying speech, as has been found for many better-studied Indo-European languages. Nonetheless, the fine timing of manual gestures is found to be influenced by the tone of the accompanying vowel, with gestures initiated later relative to the onset of low-toned vowels as compared with vowels of other tones. We also find evidence that gesture occurrence is predicted by both increased duration and greater average intensity, similar to findings from other languages.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Acknowledgements</head></div></body>
		</text>
</TEI>
