English Pages 235 [229] Year 2023
Prosody, Phonology and Phonetics
Robert Fuchs Editor
Speech Rhythm in Learner and Second Language Varieties of English
Prosody, Phonology and Phonetics Series Editors Daniel J. Hirst, CNRS Laboratoire Parole et Langage, Aix-en-Provence, France Hongwei Ding, School of Foreign Languages, Shanghai Jiao Tong University, Shanghai, China Qiuwu Ma, School of Foreign Languages, Tongji University, Shanghai, China
The series will publish studies in the general area of Speech Prosody with a particular (but non-exclusive) focus on the importance of phonetics and phonology in this field. The topic of speech prosody is today a far larger area of research than is often realised. The number of papers on the topic presented at large international conferences such as Interspeech and ICPhS is considerable and regularly increasing. The proposed book series would be the natural place to publish extended versions of papers presented at the Speech Prosody Conferences, in particular the papers presented in Special Sessions at the conference. This could potentially involve the publication of 3 or 4 volumes every two years ensuring a stable future for the book series. If such publications are produced fairly rapidly, they will in turn provide a strong incentive for the organisation of other special sessions at future Speech Prosody conferences.
Editor Robert Fuchs Department of English Universität Hamburg Hamburg, Germany
ISSN 2197-8700 ISSN 2197-8719 (electronic) Prosody, Phonology and Phonetics ISBN 978-981-19-8939-1 ISBN 978-981-19-8940-7 (eBook) https://doi.org/10.1007/978-981-19-8940-7 © Springer Nature Singapore Pte Ltd. 2023 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Singapore Pte Ltd. The registered company address is: 152 Beach Road, #21-01/04 Gateway East, Singapore 189721, Singapore
Contents

A Synthesis of Research on Speech Rhythm in Native, Learner and Second Language Varieties of English—Introduction to the Volume
Robert Fuchs

Second Language Varieties of English

Investigating (Rhythm) Variation in Indian English: An Integrated Approach
Olga Maxwell and Elinor Payne

Rhythmic Contrast in Marathi English and Telugu English
Giuliana Regnoli

Rhythmic Patterns of Malaysian English Speakers
Stefanie Pillai, Anussyia Muthiah, and Wan Ahmad Wan Aslynn

Learner Varieties of English

Speech Rhythm, Length of Residence and Language Experience: A Longitudinal Investigation
Donald White and Peggy Mok

(Re-)viewing the Acquisition of Rhythm in the Light of L2 Phonological Theories
Lukas Sönning

Monolingual-Bilingual (Non-)convergence in L3 Rhythm
Christina Domene Moreno and Barış Kabak

Measuring Rhythm

Rhythm Metrics and the Perception of Rhythmicity in Varieties of English as a Second Language
Robert Fuchs

Novel Methods for Characterising L2 Speech Rhythm
Chris Davis and Jeesun Kim
A Synthesis of Research on Speech Rhythm in Native, Learner and Second Language Varieties of English—Introduction to the Volume Robert Fuchs
Abstract This chapter critically discusses recent research on speech rhythm in native, learner and second language varieties of English. Impressionistic descriptions have often claimed that Second Language or ‘Outer Circle’ varieties of English are syllable-timed, in contrast to the stress-timed rhythm of Native or ‘Inner Circle’ varieties of English. With a view to the reconceptualisation of speech rhythm as a continuum between stress- and syllable-timing, this claim can be rephrased as Second Language (or Outer Circle) varieties of English being more syllable-timed than Native (or Inner Circle) varieties of English. In a synthesis of recent empirical research, results from 18 studies are compared. Overall, the claim of a tendency towards syllable-timing can be confirmed not only for Outer Circle Englishes, but also, more tentatively, for Learner or ‘Expanding Circle’ varieties of English. However, several caveats are needed to qualify these conclusions, relating to the selection of varieties of English studied by recent research and the methodologies employed in the field. The chapter concludes with a discussion of how the subsequent chapters in the volume contribute towards addressing these concerns and thus advance the study of speech rhythm in native, learner and second language varieties of English.

Keywords Synthesis · Speech rhythm · Syllable-timing · Kachru · Circle model · Inner Circle · Outer Circle · Expanding Circle
1 Speech Rhythm in Recent Research

In the study of phonetics and phonology, investigations of speech rhythm have featured prominently over the last two decades. There is an abundance of relevant publications, investigating the role of speech rhythm in linguistic core areas such as
R. Fuchs, Department of English, University of Hamburg, Überseering 35, 22297 Hamburg, Germany. © Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_1
First Language Acquisition (Goswami, 2019; He, 2018), Second Language Acquisition (Ordin & Polyanskaya, 2015; Oñate, 2019; Van Maastricht et al., 2019), accentedness (Van Maastricht et al., 2021) and intelligibility (Vaysse et al., 2021), sociolinguistics (Torgersen & Szakay, 2012; Young, in press) as well as cross-linguistic comparisons (Mok & Dellwo, 2008; Prieto et al., 2012). Other studies explored the conceptualisation of speech rhythm (Gibbon, in press; Nolan & Jeon, 2014; Tilsen, 2019), its role in neurolinguistic processing (Alexandrou et al., 2018; Kawasaki et al., 2013; Peelle & Davis, 2012), dysarthria (Hernandez et al., 2020; Liss et al., 2009), speech therapy (Daigmorte et al., 2022) and general cognitive impairment (Meilán et al., 2020). Research has also explored the relevance of speech rhythm for other modalities, such as children’s reading ability (Holliman et al., 2010; Ozernov-Palchik & Patel, 2018), gestures (Tilsen, 2009) and musical aptitude (Harding et al., 2019; Magne et al., 2016), as well as the evolutionary relevance of speech rhythm in primates (Pereira et al., 2020) and other animals (Ravignani et al., 2019).

The currently dominant linguistic approach to the study of speech rhythm relies on duration-based rhythm metrics. In this approach, greater durational variability of syllables, vowels or consonants (technically, vocalic and consonantal intervals) is associated with a tendency towards stress-timing, whereas less durational variability is associated with a tendency towards syllable-timing. From the outset, varieties of English were at the heart of this approach to speech rhythm. Both Low et al.’s (2000) normalised pairwise variability index for vocalic intervals (nPVI-V) and Deterding’s (2001) Variability Index were introduced in the context of a comparison of Singapore English (SinE) with British English (BrE).
Furthermore, both studies showed that SinE has a more ‘even’ rhythm (less durational variability) than BrE, giving credence to previous claims describing SinE as more syllable-timed than BrE.
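The nPVI-V referred to above can be made concrete with a short sketch. The function below follows the commonly cited definition of the normalised pairwise variability index: the absolute difference between successive interval durations, each normalised by the mean of the pair, averaged over all pairs and multiplied by 100. The function name and the example durations are hypothetical and serve only as illustration; no particular segmentation procedure is implied.

```python
from statistics import mean


def npvi(durations):
    """Normalised pairwise variability index for a sequence of interval durations.

    Each pair of successive durations contributes |a - b| divided by the
    mean of the pair; the average over all pairs is scaled by 100.
    """
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    if any(d <= 0 for d in durations):
        raise ValueError("durations must be positive")
    pair_terms = [
        abs(a - b) / ((a + b) / 2)
        for a, b in zip(durations, durations[1:])
    ]
    return 100 * mean(pair_terms)


# Hypothetical vocalic interval durations in seconds (illustration only).
fairly_even = [0.10, 0.11, 0.10, 0.12, 0.11]
alternating = [0.06, 0.18, 0.05, 0.20, 0.07]

print(round(npvi(fairly_even), 1))  # lower value: less durational variability
print(round(npvi(alternating), 1))  # higher value: more durational variability
```

Because every difference is normalised by the mean of its own pair, multiplying all durations by the same factor leaves the value unchanged, which is why the index is described as normalised.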
2 Speech Rhythm and Kachru’s Circles

2.1 The Circle Model

Since the publication of these two seminal studies, a wealth of research has applied these and other duration-based metrics to a diverse set of languages. At the same time, a focus on varieties of English prevails in the field. Intriguingly, when the available evidence coming from impressionistic descriptions and empirical studies on various varieties of English is considered in its entirety, an association between the sociolinguistic status of varieties of English and their speech rhythm emerges. Particularly relevant in this connection is a widely used model that groups varieties of English into three classes, based on the sociolinguistic context and their domains of usage (Kachru’s 1985 circle model):
• English as a Native Language (ENL), the so-called Inner Circle, where English is the main language and dominates daily life in public and private domains (e.g. British English and Australian English);
• English as a Second Language (ESL), the so-called Outer Circle, where English is widely used as a second language, but is usually much more dominant in public domains, such as education and administration, than in private domains; English is a marker of education and status, but not necessarily known by the majority of the population (e.g. Indian English, Nigerian English);
• English as a Foreign Language (EFL), the so-called Expanding Circle, where English is rarely used for communication among permanent residents of the country, not even in specific domains (e.g. English in Japan or Spain).

This model has been widely criticised (see, for example, Bruthiaux, 2003; Westphal & Wilson, 2020) because it imposes a hierarchy on the dialects involved, disregards the sociolinguistic complexity and heterogeneity within countries and because borderline cases such as Singapore (with a shift towards English as an L1; Buschfeld, 2019) or the Netherlands (with a shift towards English as an L2; Edwards, 2016; Edwards & Fuchs, 2020) are not adequately accounted for. Still, the model continues to be used in current research and, in its simplicity, captures differences between what might be called archetypes of distinct sociolinguistic contexts of the use of English across countries. With regard to speech rhythm, the notable generalisation that emerges through the lens of the circle model is that Outer Circle varieties are often described as syllable-timed and Inner Circle varieties as stress-timed (Crystal, 1995: 176–177; Mesthrie, 2008: 317).
Against the background of the remarks above referring to the reconceptualisation of speech rhythm as a gradient feature, this generalisation can be appropriately rephrased as Outer Circle varieties being more syllable-timed than Inner Circle varieties (or, alternatively, Inner Circle varieties being more stress-timed than Outer Circle varieties). Crucially, this divergence in speech rhythm is probably not primarily caused by the sociolinguistic differences between Inner Circle and Outer Circle varieties of English as described above; for example, being used predominantly as a second language does not necessarily make a variety of English more syllable-timed (although this might be a contributing factor, see Sect. 2.3 for a more detailed discussion). Rather, an important cause of a tendency towards syllable-timing among Outer Circle Englishes is likely to be prosodic transfer from local languages (Rasier & Hiligsmann, 2007), which in turn are often more syllable-timed than British English and other Inner Circle Englishes.
2.2 A Synthesis of Empirical Research on Speech Rhythm Across the Three Circles

A rhythmic difference between Outer and Inner Circle Englishes has not only been alleged in impressionistic descriptions, but also in studies such as Low et al. (2000)
and Deterding (2001), already mentioned above, which both focused on SinE. In order to substantiate the general claim of rhythmic differences between Outer and Inner Circle Englishes, a synthesis of existing empirical studies on a greater number of varieties of English is required and will be presented in the following (building on and extending the synthesis presented by Fuchs, 2016: 88–90). This synthesis considers all empirical studies published by early 2022, retrieved through a search on Google Scholar. Given the multitude of speech rhythm metrics, a broad synthesis would need to conduct separate comparisons of each rhythm metric and speaking style (read vs. spontaneous speech). Here, the scope of the analysis is restricted to studies applying the nPVI-V metric to read speech, which, along with VarcoV, is the most widely used reliable rhythm metric (for an earlier synopsis of results based on other rhythm metrics, see Fuchs, 2016: 88–92). Eighteen studies that fit these criteria were identified. Exact mean or median nPVI-V figures were used where available and otherwise estimated from charts. Results from studies that used a modified formula to compute the nPVI-V were transformed for comparability with the canonical formula of Low et al. (2000) (e.g. Behrman et al. [2019] appear to have omitted division by 2). In addition to Outer and Inner Circle varieties of English, results on Expanding Circle Englishes as well as a few languages other than English were included for comparison. Finally, the classification of a given variety of English as belonging to one or another group is not always straightforward. SinE is traditionally classified as an Outer Circle variety, but has been undergoing a shift towards being used as a first language. 
Moreover, Maori New Zealand English is classified here as an Outer Circle variety due to its origin as an indigenous strand variety (see Schneider, 2007) and associated historical and perhaps contemporary prosodic transfer from a syllable-timed first language, but could also be considered an Inner Circle variety.
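The rescaling mentioned above for studies that used a modified formula can be illustrated with the canonical nPVI definition. The following is a sketch under the assumption that the omitted step is the division by 2 inside the per-pair normalisation. The symbols $d_k$ (duration of the $k$-th vocalic interval), $m$ (number of intervals) and the primed variant $\mathrm{nPVI}'$ are notation introduced here for illustration, not taken from the studies concerned.

```latex
\mathrm{nPVI} = \frac{100}{m-1} \sum_{k=1}^{m-1}
  \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right|
\qquad
\mathrm{nPVI}' = \frac{100}{m-1} \sum_{k=1}^{m-1}
  \left| \frac{d_k - d_{k+1}}{d_k + d_{k+1}} \right|
  = \frac{\mathrm{nPVI}}{2}
```

Under this assumption, a value obtained with the modified formula would be doubled to make it comparable with values based on the canonical formula.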
2.3 Discussion

Overall, the synthesis of the available studies clearly indicates more variability in vocalic durations for Inner Circle varieties than for Outer Circle varieties (see Figs. 1 and 2). Durational variability for Outer Circle Englishes is on par with results for languages such as Mandarin Chinese, French and Spanish, which are often referred to as syllable-timed. This pattern suggests that Inner Circle varieties, as a group, tend towards stress-timing and Outer Circle varieties tend towards syllable-timing. Expanding Circle varieties also appear to show lower durational variability than Inner Circle Englishes, and thus a tendency towards syllable-timing.

The causal factor of a tendency towards syllable-timing in Outer Circle varieties is often identified as cross-linguistic influence (also sometimes called transfer) from locally spoken syllable-timed languages, rather than the status as Outer Circle Englishes. Expanding Circle Englishes are not necessarily thought to predominantly tend towards syllable-timing, but the present synthesis does point in this direction. Except for German, the first languages relevant in the studies are, for
Fig. 1 Synthesis of results from 18 studies on rhythmic variation (nPVI-V) in read speech across Inner, Outer and Expanding Circle varieties of English as well as selected other languages. Higher scores indicate a tendency towards stress-timing, and lower scores a tendency towards syllable-timing. For varieties where several studies provide data, the author(s) is/are indicated in brackets (for variety labels and sources, see Appendix)
the most part, traditionally classified as syllable- or mora-timed. It seems plausible that researchers would prioritise investigating Expanding Circle Englishes for which there is reason to believe that they tend towards syllable-timing rather than those which are thought to tend towards stress-timing, so that the available studies might present a skewed picture due to a possible selection bias. For Outer Circle varieties, this bias is less likely to be relevant, because these are almost invariably impressionistically described as syllable-timed. Finally, publication bias, more narrowly
Fig. 2 Summary of results on rhythmic variation (nPVI-V) in read speech across Inner, Outer and Expanding Circle varieties of English
defined (Thornton & Lee, 2000), could have skewed the results for both Outer and Expanding Circle varieties in that researchers might not submit for publication, or otherwise struggle to publish, results that do not confirm an expected tendency towards syllable-timing in the variety of English they analysed.

However, beyond questions relating to the strategic choice of research questions by researchers as well as publication bias, there is also a possible substantive explanation for a tendency towards syllable-timing in Expanding Circle Englishes. Language learners have been found to rely on a relatively more syllable-timed rhythm in the acquisition of a stress-timed language, even if their L1 is relatively stress-timed (Ordin & Polyanskaya, 2015; Ordin et al., 2011). With increasing proficiency, timing patterns shift towards the more stress-timed rhythm of the target language. Consequently, it is conceivable that Expanding Circle Englishes really do exhibit a tendency towards syllable-timing, especially when beginning and intermediate proficiency learners are considered, while selection and publication bias might further contribute to available evidence pointing in this direction.

Furthermore, if language learners generally tend towards a more syllable-timed rhythm, this factor might also reinforce the relatively syllable-timed rhythm of Outer Circle Englishes, in addition to prosodic transfer from local syllable-timed L1s. While many of the speakers studied in the publications synthesised here are very proficient, a tendency towards syllable-timing might be a legacy of the early stages of the development of some Outer Circle Englishes, which, over time, might have been further reinforced by successive generations of speakers learning the local variety of English as an L2.
Notwithstanding the general finding of a rhythmic contrast between Outer and Expanding Circle Englishes, on the one hand, and Inner Circle Englishes, on the other hand, this result needs to be qualified by several caveats. This synthesis relied on data for a single rhythm metric in a single speech style, and comparisons moving beyond these limits would help strengthen the robustness of the conclusions.
Furthermore, there is considerable variation within and overlap between Inner and Outer Circle varieties in terms of durational variability, with Inner Circle varieties in particular showing a large degree of variation. On the one hand, such variation is expected both in terms of theory and methodology. The theoretical reason relates to the conceptualisation of speech rhythm as a continuum rather than distinct rhythm classes, such that individual languages and their varieties are expected to be distributed across the continuum (see Sect. 1). However, even for varieties for which data is available from several studies (e.g. American English [AmE]), there is considerable variation. Such variation is known to be caused by the choice of reading material and differences in the segmentation of vocalic and consonantal intervals, where a discrete boundary is imposed on the gradual transition between vowels and consonants (see Fuchs, 2016: 53–57). Finally, the considerable variation within Inner Circle varieties of English revealed in Figs. 1 and 2 could also partially be due to some of these varieties being genuinely (relatively) syllable-timed. This is most clearly the case for Welsh Valleys and Bristol English, which White and Mattys (2007) chose to study specifically because they were expected to be more syllable-timed than standard British English. It is conceivable that there are several other non-standard varieties spoken in Inner Circle countries that are likewise relatively syllable-timed. In fact, claims of a tendency towards syllable-timing relating to a particular (often Outer Circle) variety are usually made in an explicit or implicit comparison with standard British or American English. 
If, by contrast, British and American English are conceived of as the combination of all standard and non-standard varieties of English spoken in the United Kingdom and the United States, the alleged contrast between (relatively syllable-timed) Outer Circle Englishes and (relatively stress-timed) Inner Circle Englishes might be less clear-cut than previously thought (also see Sect. 2.2 above on the classification of Maori New Zealand English as an Outer Circle variety).
3 Advancing the State of the Art

In summary, what emerges from this synthesis of studies on speech rhythm in varieties of English across Kachru’s three circles is that Outer Circle Englishes, and possibly also Expanding Circle Englishes, tend to be more syllable-timed than Inner Circle Englishes. At the same time, there is a clear need (i) for additional data on varieties of English that have so far not been investigated, (ii) for additional, independent studies on those varieties that have already received some attention, and (iii) for more research on the methodological underpinnings of speech rhythm research. The studies in the present volume all address one or more of these research desiderata.

This volume is structured into three parts. The first part comprises three contributions on speech rhythm in Second Language, or Outer Circle, varieties of English. The second part comprises three chapters on speech rhythm in learner, or Expanding Circle, varieties of English, and the third part contains two chapters on methodological issues in the measurement of rhythm.
In their chapter entitled “Investigating (Rhythm) Variation in Indian English: An Integrated Approach”, Olga Maxwell and Elinor Payne explore patterns of convergence and divergence among speakers of Indian English with contrasting L1 backgrounds. They not only apply well-established rhythm metrics such as VarcoV and rPVI-C for holistic rhythm measures, but also investigate specific temporal measures such as durational contrasts between short (or lax) and long (or tense) vowels and stress-conditioned durational variation, as well as phrase-final lengthening, tone density and pitch accent type. While they provide evidence of several cases of L1-based prosodic variation, they also point to ‘overarching pan-Indian prosody properties’ that emerge from their study. In “Rhythmic Contrast in Marathi English and Telugu English”, Giuliana Regnoli explores rhythmic patterns in an English-speaking Indian diasporic community in Germany, a country where English is mostly used in the traditional EFL sense in order to facilitate international communication and where English, unlike in India, is rarely used in educational contexts. Using the two rhythm metrics %V and VtoV, she shows that speakers with L1 Telugu employ a more syllable-timed rhythm than speakers with L1 Marathi, further contributing to our understanding of L1-based variation in the Englishes spoken by Indians. Variation in speech rhythm is also the focus of the chapter “Rhythmic Patterns of Malaysian English Speakers”, contributed by Stefanie Pillai, Anussyia Muthiah, Abdul Rahman and Wan Ahmad Wan Aslynn. In this study, the authors investigate data from the three dominant ethnic groups in Malaysia, Malays, Chinese and Indians, who traditionally speak or spoke distinct languages (superseded by a shift to English in some cases), which might have given rise to variation in speech rhythm within Malaysian English. 
Using both read and spontaneous speech, the authors apply the rhythm metrics nPVI-V and VarcoV and find no clear evidence of ethnically based variation in speech rhythm in Malaysian English. The following chapter, entitled “Speech Rhythm, Length of Residence and Language Experience: A Longitudinal Investigation” marks a transition from the studies in part 1 on Second Language, or Outer Circle, varieties to part 2 on learner, or Expanding Circle, varieties of English. In this chapter, Donald White and Peggy Mok present a longitudinal investigation of the effects of length of residence and language experience on the speech rhythm of L1 Cantonese-speaking Hong Kongers during extended periods of residence in L1 English-speaking (ENL) countries. Having had (syllable-timed) Hong Kong English at their disposal in their linguistic repertoires before moving abroad, these speakers show evidence of an increase in durational variability (i.e. a move towards more stress-timing) as well as an increase in speech rate during their residence in ENL environments. The chapter entitled “(Re-)viewing the Acquisition of Rhythm in the Light of L2 Phonological Theories”, contributed by Lukas Sönning, explores to what extent the acquisition of speech rhythm by L1 German-speaking learners of English can be accounted for by established theories of L2 speech learning, specifically Major’s (2001) Ontogeny Phylogeny Model and James’ (1988) Linguistic Theory of L2 Phonological Development. Using the rhythm metrics nPVI-V, VarcoV and %V, as well as an innovative approach exploring rhythmic variation within sentences,
he finds that neither of the two theories considered in this study provides a fully satisfactory account of the empirical evidence.

In “Monolingual-Bilingual (Non-)convergence in L3 Rhythm”, Christina Domene Moreno and Barış Kabak investigate potential convergence in rhythm in the speech of Turkish-German bilinguals and German monolinguals in their L3 English and their L1/L2 German. Using a range of rhythm metrics and pitch measures, they find evidence of cross-linguistic influence in the bilinguals’ speech from the background language Turkish on their German and English speech. However, cross-linguistic influence in this direction cannot account for all of the results, which the authors argue provides support for the view that multilinguals have at their disposal a combined language system in which all background languages are interconnected.

Part 3 of the volume, concerned with ways of measuring rhythm, consists of two chapters. In “Rhythm Metrics and the Perception of Rhythmicity in Varieties of English as a Second Language”, Robert Fuchs asks to what degree duration-based rhythm metrics reflect listeners’ perception of rhythmicity. Results from a binary judgement task indicate that regularity ratings provided by listeners can be explained to a large degree by a combination of the rhythm metrics nPVI-V, VarcoV and VarcoC, as well as speech rate.

The volume concludes with a chapter on “Novel Methods for Characterising L2 Speech Rhythm”, in which Chris Davis and Jeesun Kim compare measures of general timing properties in L2 speech that focus on speech energy, i.e. the Spectral-Amplitude Modulation Phase Hierarchy (S-AMPH) model (Leong, 2012), Allan Factor (AF) analysis (Falk & Kello, 2017) and the Multiscale Coefficient of Variation (MSCV) analysis (Abney et al., 2017). These measures are applied to read speech from L1 Korean and L1 French speakers of English and investigated in conjunction with foreign accent ratings.
Acknowledgements The author would like to thank Alina Scheller and Judith Szislo for their assistance in compiling and evaluating relevant previous studies for this chapter.
Appendix
| Variety label | Variety long name | Circle | nPVI-V | Source |
|---|---|---|---|---|
| AmE [Behrmann] | American English | Inner | 51 | Behrman et al. (2019) |
| AmE [Choe] | American English | Inner | 66 | Choe (2019) |
| AmE [Ding] | American English | Inner | 69.48 | Ding et al. (2020) |
| AmE [Li&Post] | American English | Inner | 51.9 | Li and Post (2014) |
| AmE [Liu&Takeda] | American English | Inner | 64.8 | Liu and Takeda (2021) |
| AusE [Kawase] | Australian English | Inner | 72 | Kawase et al. (2016) |
| AusE [Nguyen] | Australian English | Inner | 57 | Nguyen (2018) |
| BrE [Ding&Xu] | British English | Inner | 66.74 | Ding and Xu (2016) |
| BrE [Fuchs 2016] | British English | Inner | 61.3 | Fuchs (2016) |
| BrE [Fuchs in press] | British English | Inner | 62.2 | Fuchs (in press) |
| BrE [Low] | British English | Inner | 78 | Low et al. (2000) |
| BrE [White] | British English | Inner | 73 | White and Mattys (2007) |
| BristolE | Bristol English | Inner | 41 | White et al. (2007) |
| Dutch | Dutch | Other | 82 | White and Mattys (2007) |
| French | French | Other | 50 | White and Mattys (2007) |
| GermanE (high prof.) | German English | Expanding | 45.5 | Li and Post (2014) |
| GermanE (lower interm. prof.) | German English | Expanding | 40.1 | Li and Post (2014) |
| HKE (high prof.) [Law] | Hong Kong English | Outer | 49.37 | Law et al. (2020) |
| HKE (low prof.) [Law] | Hong Kong English | Outer | 43.53 | Law et al. (2020) |
| HKE [Setter] | Hong Kong English | Outer | 53.5 | Setter (pc), cited in Fuchs (2016: 88–90) |
| IndE | Indian English | Outer | 55.6 | Fuchs (2016) |
| JapE | Japanese English | Expanding | 56.3 | Liu and Takeda (2021) |
| JapE (exp.) | Japanese English | Expanding | 57 | Kawase et al. (2016) |
| JapE (inexp.) | Japanese English | Expanding | 50 | Kawase et al. (2016) |
| KorE | Korean English | Expanding | 56 | Choe (2019) |
| KorE (beginners) | Korean English | Expanding | 51.3 | White and Mattys (2007) |
| KorE (post-intervention) | Korean English | Expanding | 52.22 | Choe (2022) |
| KorE (pre-intervention) | Korean English | Expanding | 54.68 | Choe (2022) |
| Mandarin Chinese | Mandarin Chinese | Other | 48.2 | Ding and Xu (2016) |
| MandarinE (high prof.) | Mandarin English | Expanding | 50.6 | Li and Post (2014) |
| MandarinE (low/interm. prof.) | Mandarin English | Expanding | 40.3 | Li and Post (2014) |
| MandarinE [Ding&Xu] | Mandarin English | Expanding | 48.81 | Ding and Xu (2016) |
| MandarinE [Liu&Takeda] | Mandarin English | Expanding | 59.07 | Liu and Takeda (2021) |
| Maori NZE | Maori New Zealand English | Outer | 45.5 | Szakay (2006) |
| NigE | Nigerian English | Outer | 55.1 | Fuchs (in press) |
| OrkneyE | Orkney English | Inner | 70 | White et al. (2007) |
| PakE | Pakistani English | Outer | 54.8 | Fuchs (in press) |
| Pakeha NZE | Pakeha New Zealand English | Inner | 57.5 | Szakay (2006) |
| PhiE | Philippine English | Outer | 56.1 | Fuchs (in press) |
| ScoE | Scottish English | Inner | 73.45 | Lowit and Kuschmann (2012) |
| ShetlandE | Shetland English | Inner | 77 | White et al. (2007) |
| SinE | Singapore English | Outer | 46 | Low et al. (2000) |
| SpaE [Berhmann] | Spanish English | Expanding | 51.5 | Behrman et al. (2019) |
| SpaE [White] | Spanish English | Expanding | 66 | White and Mattys (2007) |
| Spanish | Spanish | Other | 36 | White and Mattys (2007) |
| ThaE | Thai English | Expanding | 59.6 | Sarmah et al. (2009) |
| Thai | Thai | Other | 54.5 | Sarmah et al. (2009) |
| Vietnamese | Vietnamese | Other | 50 | Nguyen (2018) |
| VietnE (advanced) | Vietnamese English | Expanding | 56 | Nguyen (2018) |
| VietnE (beginner) | Vietnamese English | Expanding | 44 | Nguyen (2018) |
| Welsh Valleys Eng | Welsh Valley English | Inner | 42 | White et al. (2007) |
References
Abney, D. H., Kello, C. T., & Balasubramaniam, R. (2017). Introduction and application of the multiscale coefficient of variation analysis. Behavior Research Methods, 49(5), 1571–1581.
Alexandrou, A. M., Saarinen, T., Kujala, J., & Salmelin, R. (2018). Cortical tracking of global and local variations of speech rhythm during connected natural speech perception. Journal of Cognitive Neuroscience, 30(11), 1704–1719.
Behrman, A., Ferguson, S. H., Akhund, A., & Moeyaert, M. (2019). The effect of clear speech on temporal metrics of rhythm in Spanish-accented speakers of English. Language and Speech, 62(1), 5–29.
Bruthiaux, P. (2003). Squaring the circles: Issues in modelling English worldwide. International Journal of Applied Linguistics, 13(2), 159–178.
Buschfeld, S. (2019). Children’s English in Singapore: Acquisition, properties, and use. Routledge.
Choe, W. K. (2019). The realization of English rhythm by Busan Korean speakers. Phonetics and Speech Sciences, 11(4), 81–87.
Choe, W. K. (2022). The effect of pronunciation teaching on the realization of English rhythm by Korean learners of English. Phonetics and Speech Sciences, 14(2), 19–28.
Crystal, D. (1995). Documenting rhythmical change. In J. Windsor Lewis (Ed.), Studies in general and English phonetics: Essays in honour of Professor J. D. O’Connor (pp. 174–179). Routledge.
Daigmorte, C., Tallet, J., & Astésano, C. (2022). On the foundations of rhythm-based methods in speech therapy. In Proceedings of the 11th International Conference on Speech Prosody (pp. 47–51).
Deterding, D. (2001). The measurement of rhythm: A comparison of Singapore and British English. Journal of Phonetics, 29(2), 217–230. https://doi.org/10.1006/jpho.2001.0138
Ding, H., Lin, B., Wang, L., Wang, H., & Fang, R. (2020). A comparison of English rhythm produced by native American speakers and Mandarin ESL Primary School learners. In Proceedings of Interspeech 2020 (pp. 4481–4485).
Ding, H., & Xu, X. (2016). L2 English rhythm in read speech by Chinese students. In Proceedings of Interspeech 2016 (pp. 2696–2700).
Edwards, A. (2016). English in the Netherlands: Functions, forms and attitudes. Benjamins.
Edwards, A., & Fuchs, R. (2020). Varieties of English in the Netherlands and Germany. In R. Hickey (Ed.), English in the German-Speaking World (pp. 267–293). Cambridge University Press.
Falk, S., & Kello, C. T. (2017). Hierarchical organization in the temporal structure of infant-direct speech and song. Cognition, 163, 80–86.
Fuchs, R. (2016). Speech rhythm in varieties of English. Evidence from educated Indian English and British English. Springer.
Fuchs, R. (To appear). Analysing the speech rhythm of New Englishes: A guide to researchers and a case study on Pakistani, Philippine, Nigerian and British English. In G. Wilson & M. Westphal (Eds.), New Englishes, new methods. Benjamins.
Gibbon, D. (in press). The rhythms of rhythm. Journal of the International Phonetic Association.
Goswami, U. (2019). Speech rhythm and language acquisition: An amplitude modulation phase hierarchy perspective. Annals of the New York Academy of Sciences, 1453(1), 67–78.
Harding, E. E., Sammler, D., Henry, M. J., Large, E. W., & Kotz, S. A. (2019). Cortical tracking of rhythm in music and speech. NeuroImage, 185, 96–101.
He, L. (2018). Development of speech rhythm in first language: The role of syllable intensity variability. The Journal of the Acoustical Society of America, 143(6), EL463–EL467.
Hernandez, A., Yeo, E. J., Kim, S., & Chung, M. (2020). Dysarthria detection and severity assessment using rhythm-based metrics. In Proceedings of Interspeech 2020 (pp. 2897–2901).
Holliman, A. J., Wood, C., & Sheehy, K. (2010). Does speech rhythm sensitivity predict children’s reading ability 1 year later? Journal of Educational Psychology, 102(2), 356–366.
James, A. R. (1988). The acquisition of a second language phonology. Narr.
Kachru, B. (1985). Standards, codification and sociolinguistic realism: The English Language in the Outer Circle. In R. Quirk & H. G. Widdowson (Eds.), English in the world: Teaching and learning the language and literatures (pp. 11–30). Cambridge University Press.
Kawasaki, M., Yamada, Y., Ushiku, Y., Miyauchi, E., & Yamaguchi, Y. (2013). Inter-brain synchronization during coordination of speech rhythm in human-to-human social interaction. Scientific Reports, 3(1), 1–8.
Kawase, S., Kim, J., & Davis, C. (2016). The influence of second language experience on Japanese-accented English rhythm. In Proceedings of 8th International Conference on Speech Prosody (pp. 746–750).
Law, W. L., Dmitrieva, O., & Francis, A. (2020). Convergence of L1 and L2 speech rhythm in Cantonese-English bilingual speakers. In Proceedings of the 10th International Conference on Speech Prosody (pp. 547–550).
Leong, V. (2012). Prosodic rhythm in the speech amplitude envelope: Amplitude modulation phase hierarchies (AMPHs) and AMPH models. Doctoral dissertation, University of Cambridge.
Li, A., & Post, B. (2014). L2 acquisition of prosodic properties of speech rhythm: Evidence from L1 Mandarin and German learners of English. Studies in Second Language Acquisition, 36(2), 223–255.
Low, E. L., Grabe, E., & Nolan, F. (2000). Quantitative characterizations of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43, 377–401.
Liss, J. M., White, L., Mattys, S. L., Lansford, K., Lotto, A. J., Spitzer, S. M., & Caviness, J. N. (2009). Quantifying speech rhythm abnormalities in the dysarthrias. Journal of Speech, Language and Hearing Research, 52(5), 1334–1352.
Liu, S., & Takeda, K. (2021). Mora-timed, stress-timed, and syllable-timed rhythm classes: Clues in English speech production by bilingual speakers. Acta Linguistica Academica, 68(3), 350–369.
Lowit, A., & Kuschmann, A. (2012). Characterizing intonation deficit in motor speech disorders: An autosegmental–metrical analysis of spontaneous speech in hypokinetic dysarthria, ataxic dysarthria and foreign accent syndrome. Journal of Speech Language and Hearing Research, 55(5), 1472–1484.
Magne, C., Jordan, D. K., & Gordon, R. L. (2016). Speech rhythm sensitivity and musical aptitude: ERPs and individual differences. Brain and Language, 153, 13–19.
Major, R. C. (2001). Foreign accent: The ontogeny and phylogeny of second language phonology. Erlbaum.
Meilán, J. J., Martínez-Sánchez, F., Martínez-Nicolás, I., Llorente, T. E., & Carro, J. (2020). Changes in the rhythm of speech difference between people with nondegenerative mild cognitive impairment and with preclinical dementia. Behavioural Neurology, 2020, 4683573.
Mesthrie, R. (2008). Synopsis: The phonology of English in Africa and South and Southeast Asia. In R. Mesthrie (Ed.), Varieties of English. Africa, South and Southeast Asia (pp. 307–319). de Gruyter.
Mok, P., & Dellwo, V. (2008). Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. In Proceedings of the 4th Speech Prosody Conference (pp. 423–426).
Nguyen, A. T. T. (2018). L2 English rhythm by Vietnamese speakers: A rhythm metric study (P. Robertson & B. Čubrović, Eds.). 12(1), 22–44.
Nolan, F., & Jeon, H. S. (2014). Speech rhythm: A metaphor? Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1658), 20130396.
Oñate, K. C. (2019). Testing the effect of synchronous speech tasks in the production of L2 speech rhythm in learners of Spanish as a second language. Doctoral dissertation, Pontificia Universidad Catolica de Chile.
Ordin, M., & Polyanskaya, L. (2015). Acquisition of speech rhythm in a second language by learners with rhythmically different native languages. The Journal of the Acoustical Society of America, 138(2), 533–544.
Ordin, M., Polyanskaya, L., & Ulbrich, C. (2011). Acquisition of timing patterns in second language. In Proceedings of Interspeech 2011 (pp. 1129–1132).
Ozernov-Palchik, O., & Patel, A. D. (2018). Musical rhythm and reading development: Does beat processing matter? Annals of the New York Academy of Sciences, 1423(1), 166–175.
Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3, 320.
Pereira, A. S., Kavanagh, E., Hobaiter, C., Slocombe, K. E., & Lameira, A. R. (2020). Chimpanzee lip-smacks confirm primate continuity for speech-rhythm evolution. Biology Letters, 16(5), 20200232.
Prieto, P., del Mar Vanrell, M., Astruc, L., Payne, E., & Post, B. (2012). Phonotactic and phrasal properties of speech rhythm. Evidence from Catalan, English, and Spanish. Speech Communication, 54(6), 681–702.
Rasier, L., & Hiligsmann, P. (2007). Prosodic transfer from L1 to L2. Theoretical and methodological issues. Nouveaux cahiers de linguistique française, 28(2007), 41–66.
Ravignani, A., Dalla Bella, S., Falk, S., Kello, C. T., Noriega, F., & Kotz, S. A. (2019). Rhythm in speech and animal vocalizations: A cross-species perspective. Annals of the New York Academy of Sciences, 1453(1), 79–98.
Sarmah, P., Gogoi, D. V., & Wiltshire, C. R. (2009). Thai English: Rhythm and vowels. English World-Wide, 30(2), 196–217.
Schneider, E. W. (2007). Postcolonial English: Varieties around the world. Cambridge University Press.
Szakay, A. (2006). Rhythm and pitch as markers of ethnicity in New Zealand English. In Proceedings of the 11th Australasian International Conference on Speech Science & Technology, University of Auckland (pp. 421–426).
Thornton, A., & Lee, P. (2000). Publication bias in meta-analysis: Its causes and consequences. Journal of Clinical Epidemiology, 53(2), 207–216.
Tilsen, S. (2009). Multitimescale dynamical interactions between speech rhythm and gesture. Cognitive Science, 33(5), 839–879.
Tilsen, S. (2019). Space and time in models of speech rhythm. Annals of the New York Academy of Sciences, 1453(1), 47–66.
Torgersen, E. N., & Szakay, A. (2012). An investigation of speech rhythm in London English. Lingua, 122(7), 822–840.
Van Maastricht, L., Krahmer, E., Swerts, M., & Prieto, P. (2019). Learning direction matters: A study on L2 rhythm acquisition by Dutch learners of Spanish and Spanish learners of Dutch. Studies in Second Language Acquisition, 41(1), 87–121.
Van Maastricht, L., Zee, T., Krahmer, E., & Swerts, M. (2021). The interplay of prosodic cues in the L2: How intonation, rhythm, and speech rate in speech by Spanish learners of Dutch contribute to L1 Dutch perceptions of accentedness and comprehensibility. Speech Communication, 133, 81–90.
Vaysse, R., Farinas, J., Astésano, C., & André-Obrecht, R. (2021). Automatic extraction of speech rhythm descriptors for speech intelligibility assessment in the context of head and neck cancers. In Proceedings of Interspeech (pp. 1912–1916).
Westphal, M., & Wilson, G. (2020). New Englishes, new methods: Focus on corpus linguistics. Anglistik: International Journal of English Studies, 31, 47–65.
White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
White, L., Mattys, S. L., Series, L., & Gage, S. (2007). Rhythm metrics predict rhythmic discrimination. In Proceedings of the 16th International Congress of Phonetic Sciences (pp. 1009–1011).
Young, N. J. (in press). The sociolectal and stylistic variability of rhythm in Stockholm. Language and Speech, 0023830920969727.
Second Language Varieties of English
Investigating (Rhythm) Variation in Indian English: An Integrated Approach
Olga Maxwell and Elinor Payne
I believe it’s time for us to have an Indian variety which is considered to be a standard variety or a normal variety, not as something as a second rated or third rated, which is at par with Australian, American or British or New Zealand [English]. Participant
Abstract We consider heterogeneity versus homogeneity in a range of temporal and f0-related measures associated with prominence marking and the percept of rhythm, in IndE for speakers of different L1s. A complex picture emerges: while some features show a degree of pan-Indian convergence, IndE rhythm is not monolithic in nature, but variegated through L1 influence. We find (i) convergence of low vocalic and consonantal variability except for the L1-Tamil sub-variety; (ii) convergence on a clear but more subtle marking of lexical prominence alternations, but achieved through different means (with most speakers employing vowel modification, and L1-Tamil speakers employing consonant lengthening); (iii) pan-Indian phrase final lengthening of vowels, but an absence of compensatory shortening of coda consonants in L1-Tamil speakers; (iv) pan-Indian convergence on greater accentual density, but L1-mediated heterogeneity in pitch accent types, with L1 Indo-Aryan speakers showing more rises, and differences in tonal alignment for Tamil speakers. Thus, while there appears to be pan-Indian functional convergence, the phonetic exponent of this is L1-sensitive. We discuss also how convergent features may arise from either the uniform adoption of characteristic English features (e.g. durational marking of lexical stress) or result from prior convergence of areal features in L1s (e.g. greater accentual density).
O. Maxwell
School of Languages and Linguistics, University of Melbourne, Babel Building (139), Parkville, VIC 3010, Australia
E. Payne
Phonetics Lab, University of Oxford, 41, Wellington Square, Oxford OX1 2JF, UK
© Springer Nature Singapore Pte Ltd. 2023
R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_2
Keywords Prosody · Indian English · Speech timing · Rhythm · Varieties of English · L1 influence · Intonation · Macro-rhythm
1 Introduction
The term ‘Indian English’ (IndE) is conventionally applied to the variety(/ies) of English as used by speakers in the Indian subcontinent as well as by the Indian diaspora around the world. English is one of the official languages of India and is used in a wide range of domains, including the media, education, business, government, literary writing and many more. While it is mostly considered to be an L2 variety, it may also be claimed as an L1 for some speakers, “especially those who are displaced from their regions” (Sailaja, 2012, p. 360), and end up, for example, working and living in an environment where the L1 of their background heritage is not widely spoken. Despite the wide use and generalised application of the term, the notion of what is described as IndE is somewhat ‘elusive’ (Schneider, 2007) owing to a variety of factors, including the vast linguistic diversity and complex multilingualism to be found within India (Gargesh, 2006). These factors are also shaped by rapidly changing socio-economic conditions, for example through patterns of internal migration from rural communities to urban centres which bring about new instances of language contact, through increased access to education and through the use of social media, which extends knowledge of other L1s and English (Maxwell et al., 2021). Previous research on IndE phonological features indicates strong evidence for the influence of indigenous languages (e.g. Bansal, 1970; Maxwell & Fletcher, 2009; Thundy, 1976; Wiltshire, 2005, 2020), potentially leading to the existence of multiple sub-varieties of IndE, depending on a speaker’s L1. However, these substrate features may themselves be convergent (in the sense of being identifiable as ‘areal features’ in segmental phonology (Masica, 2005) and prosody (Khan, 2016)).
There may be other unified target features that emerge in a process of standardisation of IndE (Sirsa & Redford, 2013) due in part to its ‘self-replicating nature’, where English is taught to Indians by Indians (Wiltshire & Harnsberger, 2006). To add to this complexity, educational background plays a significant role in distinguishing basilectal and acrolectal speakers of IndE (Pandey, 2016). As we can see in the literature, there is an ongoing debate regarding the status and nature of IndE (Mukherjee, 2007; Sailaja, 2012, 2021; Schneider, 2007) and whether we are dealing with essentially a single pan-Indian variety or multiple sub-varieties (Fuchs, 2016; Gargesh, 2004; Maxwell, 2014; Mukherjee, 2007; Sailaja, 2012, 2021; Wiltshire, 2020; Wiltshire & Harnsberger, 2006). This complex picture calls for more fine-grained phonetic research that considers various strata and substrata of Indian society and includes a wide range of L1 backgrounds. Although it has been established that what we might term ‘varieties of IndE’ have certain shared phonological features irrespective of L1, this does not necessarily mean that IndE is a single homogenous variety. Recent experimental research has
found evidence for the variable influence of specific L1s (Maxwell & Fletcher, 2009; Payne & Maxwell, 2019; Puri, 2013, 2018; Wiltshire & Harnsberger, 2006) with the added complexity that the extent of L1 influence may also vary depending on the feature in question (Maxwell, 2014; Sirsa & Redford, 2013). In contrast to the fairly large number of studies on the segmental phonology of IndE, research on prosodic aspects has been more limited, especially with regard to features involving duration and timing. Earlier work on prosody was often preliminary, only auditory and/or based on a limited number of L1 backgrounds and/or speakers (e.g. Bansal, 1990; Latha, 1978). Additionally, there is a relative dearth of empirical studies on intonational features, such as post-lexical prominence and rise alignment (see Maxwell, 2014; Maxwell & Payne, 2018; Maxwell et al., 2018; Moon, 2002; Pandey & Féry, 2016). Work on rhythmic properties has claimed that IndE is more ‘syllable-timed’ (a misnomer used, in brief, to express lesser variability in vocalic and consonantal intervals), across various L1s and locations (Fuchs, 2016; Krivokapić, 2013) although, as with rhythm research on other languages, results differ across metrics and speech style, and other parameters such as f0 and intensity are also reported to be important (Fuchs, 2016). The pull and push between convergence and L1-mediated divergence are relevant for prosody too. There is evidence that IndE, depending on register and variety, may show certain temporal properties characteristic of longstanding varieties of English (such as British English) (Fuchs, 2016; Maxwell & Payne, 2018; Payne & Maxwell, 2018), and the L1 influence may be more pronounced for some of the features but not all (Maxwell, 2014; Sirsa & Redford, 2013).
In addition, any uniform approach to intonation and prosody more generally, whether based on prosodic features of a native English variety or not, is less likely to be taught explicitly compared with segmental features. Given this relative lack of conscious training towards a prescribed norm, we may plausibly expect prosody to show greater L1 influence than other linguistic features. The goal of this chapter is to add to the body of research on IndE prosody and examine the possible effects of L1 on rhythm in IndE, as spoken by speakers of diverse L1 backgrounds (Telugu, Tamil, Bengali and Hindi). The present study employs a broad integrated approach to the investigation of speech rhythm and includes a wide range of parameters to better capture variation in IndE prosody and determine both pan-Indian features and those features that contribute to heterogeneity. It examines several temporal phenomena and global tonal patterns (the regularity of high and low alternations in the intonational contour) while acknowledging that other phonetic cues to prominence alternations (e.g. vowel quality, intensity), not investigated here, may also play a role in the rhythm percept. The structure of the chapter is as follows. Section 2 presents our approach to the analysis of speech rhythm and provides a rationale for the use of this particular prosodic aspect as a case study. Section 3 provides an overview of previous work, with Sect. 3.1 describing prosodic features for the L1s of the speakers in this study and Sect. 3.2 reporting on previous findings for the relevant features of IndE prosody. This is followed by predictions (Sect. 4) and methods applied in this study (Sect. 5). Section 6 is devoted to our analysis of holistic rhythm measures and ‘tonal’ rhythm. Section 7 summarises the findings and discusses those in relation to previous work
on IndE, L1s and our approach to the analysis of rhythmic properties in IndE. The concluding section provides the key points of the discussion and highlights a few challenges and directions for future research.
2 Approach
2.1 Rationale for Using Rhythm/Prosody as a Case Study
2.1.1 Multi-systemic Approach to Rhythm
Varieties of British (BrE) and American English (AmE) have been described as having relatively ‘uneven-timed’ rhythm, with prosodic prominences typically associated with durational cues including, but not limited to, greater durational variability at the syllable level, and durational marking of prosodic heads and boundaries (Dauer, 1983; Grabe & Low, 2002; Prieto et al., 2012; Roach, 1982). The rhythmic percept resulting from this kind of timing pattern has often been referred to in the literature as ‘stress-timed’, as opposed to ‘syllable-timed’ (following, e.g. Abercrombie, 1967; Pike, 1945; see White & Mattys, 2007 for an overview). These terms are based on the (now discredited) notion of perceived rhythmic differences of this kind being rooted in different types of unit-based isochrony, with inter-stress isochrony and syllable-based isochrony being (mis)-credited for giving rise to stress-timed and syllable-timed rhythm, respectively. Despite the lack of evidence for either type of unit-based isochrony in the acoustic signal (see Arvaniti, 2012), the terms have persisted as short-hand for broad groupings of percept, usually with the acknowledgement that the properties giving rise to these percepts are multiple and not limited to durational characteristics, and that languages do not fall into neat typological classes with respect to these properties. In this vein, recent experimental studies on IndE (e.g. Fuchs, 2016; Krivokapić, 2013; Sirsa & Redford, 2013) have described IndE as being more ‘syllable-timed’, although results are inconsistent (Fuchs, 2013, 2016; Krivokapić, 2013; Payne & Maxwell, 2018) (see Sect. 3.1.1). The use and interpretation of rhythm metrics, initially developed for capturing presumed typological differences in the rhythm percept (cf. Grabe & Low, 2002; Nazzi et al., 1998) have attracted much critical scrutiny (cf.
Arvaniti, 2012; Prieto et al., 2012 on final lengthening; Payne, 2021; Post & Payne, 2018) that questions underlying assumptions about the source of the rhythm percept, in particular (i) the implied primacy, or even uniqueness, of the role of duration; and (ii) the validity of categorising cross-language variation into rhythm ‘classes’, whatever the basis for the percept. Nevertheless, in capturing temporal variability, the metrics do tell us something about prosodic variability and the signalling of prominence across speech styles and across languages, and about at least one aspect of the speech signal that influences the rhythm percept. Furthermore,
the resulting scores are continuous in nature, and do not easily lend themselves to a categorical classification of rhythm in the first place. Following Post and Payne’s multi-systemic approach (2018; see also Payne, 2021), we maintain that the so-called rhythm metrics, when applied and interpreted with caution and in the context of the specific speech context being investigated, can still provide a useful measure of cross-linguistic or cross-variety variation in the signalling of prominence. As such, they can form part of a broader approach to evaluating the structure and phonetics of prominence, as has been advocated by critics of the classification approach to rhythm based on metrics (cf. Arvaniti, 2012; Jun, 2014). This broader approach is to be understood as ‘multi-systemic’ insofar as it acknowledges the role of multiple sources of variation (be these phonotactics and syllable structure, lexical stress markers, contrastive length, prosodic prominence and phrasing). It is also ‘multi-parametric’ in the sense that it incorporates a range of phonetic parameters, and not just temporal, in the signalling of prominence alternations, e.g. vowel quality, intensity and f0 (see Arvaniti, 2012). Furthermore, it recognises that the mapping of structural properties onto phonetic cues may differ between different levels (e.g. word vs. phrase, cf. Beckman & Edwards, 1994; Sluijter & van Heuven, 1996; Suomi et al., 2003) and that this mapping may differ in significant ways cross-linguistically (e.g. the use of duration vs. f0 for domain-initial strengthening). Finally, the source of variation is not limited to linguistic/phonological factors, but can, and indeed should, incorporate performance factors, thereby explicitly acknowledging the large roles that speech style and context and task-type can play (cf. Payne et al., 2010, for variation between Adult Directed Speech and Child Directed Speech; and Kochanski et al. (2005), for variation due to task differences).
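To make concrete the kind of temporal variability that such metrics capture, the following minimal sketch computes the normalised Pairwise Variability Index (nPVI; cf. Grabe & Low, 2002) over a sequence of interval durations. The function name and the example durations are invented for illustration and are not data from this study; segmentation into vocalic intervals is assumed to have been done beforehand.

```python
from typing import Sequence


def npvi(durations: Sequence[float]) -> float:
    """Normalised Pairwise Variability Index over successive interval durations.

    Each pair of neighbouring intervals contributes the absolute difference
    of their durations divided by their mean; the result is the average of
    these terms, multiplied by 100.
    """
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)


# Hypothetical vocalic interval durations in milliseconds (illustrative only).
even = [100, 100, 100, 100]          # no variability between neighbours
alternating = [60, 140, 60, 140]     # strong long/short alternation

print(round(npvi(even), 1))          # 0.0
print(round(npvi(alternating), 1))   # 80.0
```

Because the score is a continuous value, it orders speech samples along a scale of durational variability rather than assigning them to a class, which is consistent with the cautious interpretation adopted here.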
2.2 Broad Approach to Rhythm That Includes f0-Based Macro-rhythm
Within this broad, multi-systemic framework, our approach to examining the extent of rhythmic variation in IndE thus naturally incorporates alternations in pitch, and therefore necessitates an intonational analysis. In the Autosegmental-Metrical (AM) model of intonational phonology (Beckman & Venditti, 2011; Ladd, 2008; Pierrehumbert, 1980), an intonational tune is composed of pitch accents (pitch movements marking post-lexical prominence) and boundary tones (pitch movements marking the edges of prosodic units). The prosodic typology proposed in Jun (2005) and based on the AM theory has been used to describe a number of typologically distinct languages focusing on prominence and phrasing. These two main parameters are considered at both word- and phrase-level prosody: where at the word level, languages are categorised as tone, stress or neither; and at the phrase level, the prominence marking is linked to head-, edge- or head- and edge-prominence.
In other words, this earlier model (Jun, 2005) combined two traditions of prosodic typology (speech rhythm and word prosody) with phrasal prosody. The ‘traditional’ typology of speech rhythm was reflected in the types of lexical prosodic units, such as feet, morae and syllables, and was based on purely durational indices. The model did not, however, allow for the comparison of “purely tonal aspects of prosody” (Jun, 2014, p. 521) and could not account for differences and similarities across languages/dialects on the basis of more global pitch patterns or perceived rhythm based on regular pitch movements (e.g. Andreeva et al., 2007; Dilley & McAuley, 2008). The parameter of macro-rhythm, proposed by Jun (2012, 2014), refers to the rhythm created by a prosodic unit larger than a word and hence functions at a phrase level, similar to the parameter of prominence. The degree of macro-rhythm (strong, weak or medium) is determined on the basis of (a) the number of phrase medial pitch movement types that correspond to pitch accents or phrase tones (languages with a smaller inventory have less variability in the range of pitch contours and hence are considered more macro-rhythmic, e.g. Egyptian Arabic with a single pitch accent L+H* has stronger macro-rhythm than Lebanese Arabic which has four types of pitch accents (Chahal & Hellmuth, 2014)); (b) the type of most common phrase medial pitch accent or phrase tone (e.g. languages that employ rising or falling tones are more macro-rhythmic than those with level pitch accents or phrasal tones); (c) the frequency of sub-tonal units (languages where every word is accented or each phrase forms a smaller prosodic unit marked by boundary tone/s are more macro-rhythmic). Macro-rhythm is of particular interest to the study of IndE rhythm where we expect some transfer of L1 tonal features, and will help throw light on the relationship between the rhythmic properties of word- and phrasal-level prosody.
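The three criteria (a) to (c) can be illustrated with a simple descriptive tally over annotated pitch-accent labels. The sketch below is only that: the function name, the dictionary keys and the example label sequences are hypothetical, and the degree of macro-rhythm in Jun's typology is established through phonological analysis, not through a score of this kind.

```python
from collections import Counter
from typing import Optional, Sequence


def tonal_summary(accents: Sequence[Optional[str]]) -> dict:
    """Tally pitch-accent labels for one phrase, one entry per word.

    None marks a word that carries no pitch accent.
    """
    labelled = [a for a in accents if a is not None]
    counts = Counter(labelled)
    return {
        # (a) how many distinct phrase-medial pitch movement types occur
        "inventory_size": len(counts),
        # (b) which type is the most common one (None if nothing is accented)
        "most_common": counts.most_common(1)[0][0] if counts else None,
        # (c) how frequent tonal units are, here as accents per word
        "accents_per_word": len(labelled) / len(accents) if accents else 0.0,
    }


# Hypothetical annotations (illustrative only).
every_word_rises = ["L+H*", "L+H*", "L+H*", "L+H*"]
mixed_inventory = ["H*", None, "L+H*", None, "L*"]

print(tonal_summary(every_word_rises))
print(tonal_summary(mixed_inventory))
```

On these invented inputs, the first phrase has a single accent type on every word, the pattern described above as more macro-rhythmic, while the second has three accent types spread over three of five words.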
One of the most salient features shared across a number of South Asian languages, specifically those that belong to Indo-Aryan and Dravidian language families, is a repetitive rising pitch contour on accented words (see Keane, 2014—Tamil; Hayes & Lahiri, 1991; Khan, 2014—Bengali; Patil et al., 2008—Hindi). These rising pitch movements are associated with a smaller prosodic unit, the size of a prosodic word, where the high tone demarcates the right edge. Essentially, every smaller prosodic unit has a rising pitch. Following Jun’s typology, these languages can be classified as head and edge prominence with a strong macro-rhythm. This is in comparison to some varieties of British, American and Australian English where a phrase-medial rise in English corresponds to a rising pitch accent (Arvaniti & Garding, 2007; Ladd, 2008; Pierrehumbert, 1980). In addition, these varieties have a large pitch accent inventory (not limited to a rise) and variable accentuation patterns and hence can be classified as having medium macro-rhythm. To add to this complexity, the question of whether lexical stress exists in such languages as Hindi, Bengali and Tamil also remains a moot point in the current literature. For Tamil, Keane (2006) found duration and intensity not to be reliable
correlates of stress. Word-level prominence was found to be marked mostly with f0, indicating either that Tamil does not distinguish prominence on a word level or that the initial syllable has an abstract lexical stress serving as an anchor for the pitch accent. For Hindi, Ohala (1999) suggests that stress is not contrastive, is phonetically weak and plays a marginal role. We would expect that in languages with phonetically weak stress, pitch will play a more important role in marking word-level prominence and will be as important to consider as duration and amplitude in languages that have distinct phonetic cues to mark lexical stress (such as English). In summary, our approach combines durational and pitch-related measures to investigate variation in prominence signalling in IndE and evaluate the extent to which this could be attributed to contact effects, i.e. the transfer of properties in L1 languages. A lack of L1-based variation, on the other hand, would suggest a pan-Indian standardised variety that is homogeneous even at the level of fine prosodic detail. In order to evaluate this, we first need to consider key prosodic properties, including both global rhythmic indices and individual aspects that contribute to the rhythm percept, in the relevant L1s. We also consider the existing body of research on rhythm in IndE.
3 Previous Work
3.1 L1s
The purpose of this study is to investigate a set of temporal and tonal phenomena that are well-documented in long-standing varieties of English across a set of IndE varieties, in particular as differentiated by L1: Hindi, Bengali (Indo-Aryan), Telugu and Tamil (Dravidian). For other, segmental aspects of the phonology of these languages, see Khan (2010) for Bengali; for Hindi, Ohala (1999); for Telugu, Bhaskararao and Ray (2017); and for Tamil, Keane (2004). Our hypothesis is that there will be observable transfer effects of properties specific to the L1s in question leading to diversification (sub-varieties) of IndE. The four L1s in question are phonologically distinct, both segmentally and prosodically. The variable of L1 is of particular interest in light of an ongoing debate in the literature concerning L1 effects on IndE (see Fuchs, 2016; Maxwell, 2014; Maxwell & Payne, 2018; Payne & Maxwell, 2018; Sirsa & Redford, 2013; Wiltshire, 2020). Shared ‘areal’ features in phonology are documented, such as retroflexion and gemination (Masica, 2005), and more specifically to prosody, a preponderance of tonal rises and smaller prosodic constituents (Keane, 2006; Khan, 2016). Nevertheless, it is known that South Asian languages vary considerably among themselves, which should be of little surprise given that they span different language families and are spoken over a huge geographical area. Furthermore, the vast majority of IndE speakers are bilingual or multilingual with IndE as an L2 or L3. Thus, IndE is in constant contact with many other structurally and phonetically divergent languages and is exposed to the influence of their prosody. In this section, we outline the relevant
O. Maxwell and E. Payne
features of the L1 backgrounds under investigation here (Hindi, Bengali, Telugu and Tamil), and consider how these might influence IndE. We look first at holistic rhythm measures for the relevant L1s and then consider individual structural and phonetic properties.
3.1.1 L1 Rhythm
A handful of studies report analyses of rhythm in some but not all of the languages in question: Hindi has been described as ‘syllable-timed’ (Dauer, 1983), Telugu as ‘mora-timed’ (Murty et al., 2007), while Tamil has been variably described as ‘stress-timed’ (Grabe & Low, 2002; Marthandan, 1983), ‘syllable-timed’ (Ravisankar, 1994), as neither ‘stress’- nor ‘syllable-timed’ (Balasubramanian, 1972), or as ‘mora’-timed (Ramus et al., 1999). As Keane (2006) suggests, the diversity of reported rhythm may in part be due to the existence of different varieties and registers of spoken Tamil. Keane (2006) reports greater ‘stress-timing’ for formal Tamil, at least with respect to scores for %V and consonantal variability. This brings to the fore yet again the confound of speech style (cf. Arvaniti, 2012; Kochanski et al., 2005), with rhythm measures being sensitive to changes in style and task, and thus not uniquely determined by ‘inherent’ structural properties of a language or language variety. When conducting a cross-variety analysis, factors such as speech style and task need to be held constant as far as possible in order to reveal any variation which is truly language-specific.
3.1.2 Individual Properties of L1s
From existing work on the L1s in question, we can gauge cross-language and cross-language family differences with respect to ‘stress’ (or prominence) placement, phonological quantity and some related phonotactic properties. In contrast to English, the notion of ‘stress’ is a phonetically elusive phenomenon in many South Asian languages, with native speakers reporting difficulty in its auditory identification with disagreement over the presence and type of acoustic cues, and over the location of prominence. These languages lack lexically contrastive stress—a major cue of which in English is variation in duration. However, something akin to stress or prominence appears to serve what may be considered a post-lexical role in both Indo-Aryan and Dravidian languages. The L1s are typically characterised as having smaller prosodic constituents than English, so the marking of prosodic boundaries (via tonal, durational or other) is an important source of prominence variation, especially in the absence of lexically contrastive prominence. There are also broad differences between the Indo-Aryan and Dravidian languages investigated. The placement of stress is said to be derived from syllable structure and weight in Hindi (Hussain, 1997), but scholars disagree over whether stress is acoustically cued (see Dyrud, 2001; Nair, 2001; Ohala, 1991). For Bengali, Das (2001) and Shaw (1984) suggest that lexical stress depends on syllable weight. However, this is a minority view, and
Investigating (Rhythm) Variation in Indian English: An Integrated …
most studies on Bengali agree that stress at the word level is phonetically weak, and completely predictable, as it always falls on the first syllable (Hayes & Lahiri, 1991; Kawasaki & Shattuck-Hufnagel, 1988; Khan, 2014; Michaels & Nelson, 2004; Selkirk, 2006). Despite its phonetically weak realisation, word-level stress is phonologically significant (Khan, 2014) because it plays an important role in intonational phonology. However, there is potentially substantial dialectal variation in prosody in Bengali with some varieties reportedly showing final prominence and even tone (see Gope & Mahanta, 2016, with regard to Sylheti). Although there is no lexically contrastive stress or pitch in Telugu (Bhaskararao & Ray, 2017), non-lexical prominences are discernible, the phonetic cues of which are vowel and syllable duration, together with pitch (Balusu, 2001). In terms of the placement of these prominences, these are said to be derived from phonological properties of syllable weight and position, although precise claims vary (see Kolachina, 2016, for a summary). For example, Sailaja (1985) reports that stress falls on the first syllable of non-compound words unless the vowel in the first syllable is short, when it falls on the second syllable (see also Sitapati, 1936). This would be consistent with evidence (Prabhakar Babu, 1971) that Telugu speakers place stress on the first syllable also in their English. However, this is not consistent with findings from perceptual or judgement studies about stress placements. For example, Krishnamurti (2003) found variation in judgements for Telugu trisyllabic stems with light syllables (CVCVCV), with twice as many listeners perceiving stress on the second syllable. In Tamil too, there is disagreement over the placement of prominence and reportedly no durational cues. 
For example, prominence-lending f0 cues to initial syllables have been reported by Andronov (1973) and Keane (2006), while non-initial syllables have been reported as showing spectral reduction. Another potential source of temporal variability is phonologically contrastive consonant and/or vowel quantity, a characteristic of many South Asian languages. Here, there is a potentially important distinction between the Indo-Aryan and Dravidian languages. In Bengali and Hindi, only consonants have a contrastive quantity (length distinctions), while in Tamil and Telugu, both vowels and consonants do. Hence, all other things being equal, we would expect greater vocalic variability in Telugu and Tamil than in Hindi and Bengali. A further difference occurs with Hindi: although Hindi vowels lack contrastive length, duration is a phonetic exponent of some quality contrasts (as in English) (see Ohala & Ohala, 1992) leading to the prediction of greater vocalic variability in Hindi than in Bengali. Similarly, there are greater restrictions on contrastive vowel length in Telugu than in Tamil, with vowels in Telugu always phonologically short before a geminate consonant. Hence, we would predict the greatest vocalic variability in Tamil. All the languages under investigation here exhibit sandhi processes involving either gemination, vowel syncope or epenthesis, further complicating the potential influence on IndE. There is little work on the existence or extent of phrase-final lengthening in South Asian languages, although Sirsa and Redford (2013) report greater phrase-final lengthening in Telugu than in Hindi (along with evidence for this being replicated in the associated variety of IndE).
Limited experimental research is available on intonation in South Asian languages. Hindi is possibly the most described (Féry & Kentner, 2010; Genzel & Kügler, 2010; Harnsberger, 1994; Moore, 1965; Nair, 2001; Ohala, 1986; Patil et al., 2008), closely followed by Tamil (especially Keane, 2006, 2007, 2014; Ravisankar, 1994) and Bengali (Hayes & Lahiri, 1991; Khan, 2014; Michaels & Nelson, 2004). Despite being one of the major languages spoken in the subcontinent, to our knowledge no work has been conducted on intonation in Telugu. Recent experimental research within the Autosegmental-Metrical framework suggests that South Asian languages (especially Indo-Aryan and Dravidian) have a number of shared salient features. As mentioned earlier, an utterance has a characteristic repetitive rising contour, where the L and the H tones demarcate the edge of the minor prosodic unit, and L* is the most commonly used pitch accent. Unlike in long-standing varieties of English (e.g. BrE), where the nuclear-accented word is always right-headed, the strongest word is always a left-most non-clitic word. Moreover, pragmatic focus has little reflection on the prosodic structure and does not necessarily lead to differences in accent placement or phrasing. Furthermore, in South Asian languages, the prosodic word constitutes a lower level prosodic constituent (a phonological/accentual phrase—e.g. Khan (2014), Keane (2014)). This is in contrast to BrE and AmE, where a lower level prosodic constituent may include several words. It remains to be seen how this could influence phrasing in IndE. It is, however, important to mention that there are potentially a number of differences in the intonational phonologies of these languages that could be reflected in the prosody across the speakers of IndE. 
First, Tamil has a smaller tonal inventory with only two types of pitch accents (Keane, 2014) compared to Bengali (Hayes & Lahiri, 1991; Khan, 2014), which also includes a rising pitch accent (L + H* in Kolkata Bengali and L* + H in Bangladeshi Bengali) and has a much larger inventory of boundary tones marking the edges of intonational phrases. Bengali may also have an additional level of prosodic phrasing, an intermediate phrase (Khan, 2014). Second, the canonical representation for the rising gesture on accented words in Bengali, Hindi and Tamil is modelled as an L* pitch accent followed by a high phrase tone (e.g. Harnsberger, 1994; Hayes & Lahiri, 1991; Keane, 2014). However, recent work by Khan (2016) on South Asian languages indicates differences in the phonetic alignment of the rise, most likely as a result of language-specific syllable structure, segmental composition, phonological vowel and other processes, which could also lead to differences in the phonetic or phonological representation of the rise, at least for some of the languages. Khan found that in Tamil and Telugu, the H tones were timed earlier compared to Nepali, Bengali, Hindi and Assamese, with Tamil exhibiting the earliest peak alignment. Table 1 summarises the aspects discussed in this section, outlining the similarities or salient features in South Asian languages and differences across the four L1s.
Table 1 Summary of similarities and differences in L1/language family

Stress placement and concept
Indo-Aryan: Bengali: Word-initial prominence (Hayes & Lahiri, 1991; Khan, 2014). Hindi: Acoustic cues, weight-sensitive (Dyrud, 2001; Nair, 2001).
Dravidian: Telugu: Percept varies, penultimate or final syllable, word/vowel length-dependent. Tamil: Disagreement; no durational cues but f0 gives prominence to initial syllables (e.g. Andronov, 1973; Keane, 2006).

Quantity and syllable structure
Indo-Aryan: Bengali/Hindi: Consonants only. Hindi: Duration is a phonetic exponent of some vowel quality contrasts.
Dravidian: Tamil/Telugu: Consonants and vowels. Telugu: Vowels are short before a geminate.

Pitch accents and tonal inventory
Indo-Aryan: Bengali: A more expanded tone and pitch accent inventory compared to the other L1s (Hayes & Lahiri, 1991; Khan, 2014).
Dravidian: Tamil: A relatively small inventory of tones and pitch accents; earlier peaks on accentual rises in Tamil (Khan, 2016).

Salient features
Indo-Aryan and Dravidian: Post-lexical role of prominence where a smaller prosodic constituent often has a characteristic rise; repetitive rises (Keane, 2014; Khan, 2016).

Sandhi phenomena
Indo-Aryan and Dravidian: Gemination; vowel syncope and epenthesis.
3.2 Indian English
3.2.1 Rhythm in IndE
As mentioned above, IndE has often been classified as more ‘syllable-timed’ than, e.g. BrE (see Bansal, 1970; Chaudhary, 1989; Fuchs, 2016; Gargesh, 2004; Krivokapić, 2013; Nihalani et al., 1979; Trudgill & Hannah, 2008; Wells, 1982). In addition, and in line with this reported percept, vowels retain their quality and frequently also their duration in weak positions, thus presenting different temporal patterns than in most long-standing varieties of English. This section summarises the findings of more recent experimental studies that have examined rhythmic properties in IndE. Exploring a range of phonetic parameters, Fuchs (2013, 2016) investigated variability in loudness, intensity, f0, syllable durations, vocalic intervals and the variability of voiced and sonorant durations in IndE. His findings confirm that ‘educated’ IndE can be thought of as more ‘syllable-timed’ than BrE. Combined variability in intensity and duration was found to be smaller in IndE, prompting Fuchs to suggest that duration may play a less important role in IndE than BrE when marking prominence. Furthermore, IndE speakers do not vary the average intensity in longer utterances to the same extent BrE speakers do. Fuchs suggests this could potentially be the feature that contributes to the perception of IndE, for some listeners at least, as
being more monotonous. The rhythmic properties analysed by Fuchs also showed differences across metrics and between read and spontaneous speech styles. Results on the variability of average and maximum intensity of vocalic intervals showed a lack of support for the idea that variability in intensity is greater for BrE. Despite the fact that unstressed vowels were shorter than stressed vowels in IndE speech, vowel elision in unstressed syllables was less frequent and unstressed vowels were also longer for IndE speakers, leading to the impression of syllable-timing. Vocalic intervals at the edges of intonational phrases were not lengthened to the same degree as those in BrE. Krivokapić (2013) compared the rhythmic properties of IndE with AmE. Even though both varieties showed mixed results, her findings indicated a more ‘syllable-timed’ rhythm for IndE. Sirsa and Redford (2013) specifically investigated the effect of L1 on IndE and examined phrase-final lengthening, several timing metrics (nPVI, %V and ΔC), and speech rate in IndE and L1s for the same speakers with two L1 backgrounds (Hindi and Telugu). Their findings show no effect of L1 on any of the rhythm and speech rate measures in English productions but indicate that the patterns in IndE were distinct from both Hindi and Telugu, and from the rhythmic properties of BrE. The following section provides a brief review of previous work on phrasal prominence (accentuation) and tonal patterns in IndE intonation.
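For reference, the timing metrics named above (nPVI, %V and consonantal variability, ΔC) are commonly defined along the following lines, after Ramus et al. (1999) and Grabe and Low (2002). This is a sketch of the usual formulations rather than the exact implementation of any study reviewed here. The symbols are introduced for illustration only: $v_i$ and $c_j$ are the durations of the $n$ vocalic and $p$ consonantal intervals in an utterance, $\bar{c}$ is the mean consonantal interval duration, and $d_k$ is the duration of the $k$-th of $m$ successive intervals of one type. Whether the standard deviation is computed over $p$ or $p-1$ is an assumption that varies between implementations.

```latex
\begin{align}
\%V &= 100 \times \frac{\sum_{i=1}^{n} v_i}{\sum_{i=1}^{n} v_i + \sum_{j=1}^{p} c_j} \\
\Delta C &= \sqrt{\frac{1}{p} \sum_{j=1}^{p} \left( c_j - \bar{c} \right)^2} \\
\mathrm{nPVI} &= \frac{100}{m-1} \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{\left( d_k + d_{k+1} \right)/2} \right|
\end{align}
```

Because nPVI normalises each pairwise difference by the local mean duration, it is less sensitive to overall speech rate than the raw standard deviation ΔC.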
3.2.2 Prominence and Tonal Patterns in IndE
Intonation in IndE is one of the most understudied aspects of this variety/ies (Gargesh, 2004; Grice et al., 2021; Maxwell, 2014). While recent studies have made significant contributions in this area, they have been based on a small number of L1 speakers and often presented preliminary results (Moon, 2002; Maxwell, 2014; Puri, 2013, 2018; Wiltshire & Harnsberger, 2006). Important features such as phrasing, prosodic structure, and tune and tone inventory have received little attention to date. One of the most noticeable features of IndE is the use of phrasal prominence and accentuation. Earlier claims, mostly based on auditory or descriptive research, state that IndE speakers accent most words in an utterance, highlighting both content and function words (Bansal, 1970, 1990; Gargesh, 2004; Gumperz, 1982; Latha, 1978). Researchers examining varieties of IndE based on L1, however, found that accentuation patterns and nuclear accent placement may reflect speakers’ L1s or regions. For example, while Rajasthani (Dhamija, 1976) and Punjabi speakers (Sethi, 1980) prefer placing greater prominence on the last word in a phrase, Telugu speakers place greater prominence on the first word (Prabhakar Babu, 1971). Maxwell (2014) found that in simple declaratives, Kannada L1 speakers of English placed accents on a greater number of words than Bengali L1 speakers. However, the sample size was small, with only 4 speakers for each L1, and the findings may not be generalisable to a larger sample or to different tasks and styles. Further, Maxwell’s (2014) findings pointed to a similarity among speakers that has been noted for IndE generally (Bansal, 1990; Gumperz, 1982; Latha, 1978; Wiltshire & Harnsberger, 2006), wherein accentuation
patterns in IndE differ from those reported for BrE, AmE or Australian English. So, while research indicates higher accentual density for IndE, earlier claims that IndE speakers accent every word and place accents on all function words may not be the case. Moon (2002) examined the phonetic cues to accentual and focal prominence in English spoken by IndE speakers (Hindi and Telugu L1s). The findings on duration suggested that, unlike AmE speakers, IndE speakers of both L1 groups did not rely on duration as a cue to focal prominence. Also, f0 results showed differences between AmE and IndE as well as between the two L1 groups. While similar values were recorded for maximum f0 across Telugu-English, Hindi-English and AmE, greater lowering of f0 at the beginning of the accented vowel was reported in Hindi-English compared to Telugu-English and AmE. These findings potentially indicate differences in phonological categories (a rising pitch accent vs. a high pitch accent) or differences in the phonetic realisation of the pitch accent (tonal alignment and scaling) for IndE speakers of these L1 backgrounds. Maxwell (2014) reported that L1 speakers of Bengali and Kannada (Dravidian) distinguished accentual and nuclear focal prominence. Despite this distinction between accent and focus, differences between the two L1 groups, and in comparison with other IndE varieties, were found both across the phonetic parameters and in terms of the extent of manipulation for each parameter. For all speakers, syllable duration was a reliable cue to accentual and focal prominence, unlike for the Telugu-English and Hindi-English speakers in Moon’s (2002) study. Duration was closely followed by f0 height as an indicative prominence marker, with f0 excursion being a more accurate measure than the absolute f0 height, especially for Kannada-English speakers. Contrary to the long-standing varieties of English, vowel quality was not found to be a reliable cue to focal prominence for either L1 background.
Moreover, even the distinction between stressed and accented syllables appeared to be rather marginal, being maintained for some vowels but not others, with a large degree of inter-speaker variation regardless of L1. Of particular relevance to our study are the pitch accent types in IndE, namely their phonological representation, phonetic realisation and proportional distribution. Previous research suggests that speakers of IndE whose L1 is Indo-Aryan (Hindi, Bengali and Gujarati) show more frequent use of rising pitch movements on accented words than speakers of L1 Dravidian languages (Tamil, Telugu and Kannada) (Maxwell, 2014; Maxwell & Fletcher, 2014; Maxwell & Payne, 2018; Moon, 2002; Wiltshire & Harnsberger, 2006). Maxwell (2014) also reported L1-based differences in the distribution of accentual rises depending on the prosodic context, nuclear versus prenuclear. There were fewer pitch accents and frequent nuclear rising accents in IndE spoken by L1 Bengali speakers compared to the nuclear high accents of L1 Kannada speakers. These findings contributed to differences in nuclear tunes between the two L1 groups. Variation in pitch accent type has also been attributed to the age of L2 acquisition. Puri (2013, 2018) found that late Hindi-English bilinguals used rises on every non-final word, while simultaneous Hindi-English bilinguals used a wider pitch accent inventory in prenuclear contexts, including H* and H* + L accents.
In summary, previous research has found IndE to be more ‘syllable-timed’ than BrE or AmE, and in line with previous work on rhythm, reports differences across metrics and speech styles. Further, unstressed vowels were found to be shorter than stressed vowels in IndE speech, with a less frequent vowel elision in unstressed syllables, contributing to the impression of ‘syllable-timing’ in IndE. In addition, accentual density was reported to be higher compared to BrE or AusE. Previous studies also revealed a number of differences based on the speakers’ L1 background, indicative of heterogeneity. Speakers of IndE with an Indo-Aryan language background such as Hindi, Gujarati and Bengali have been suggested to have more frequent use of rising pitch on accented words, while speakers of L1 Telugu produce shallow rises on nuclear focal accents (in comparison to sharp rises for L1 Hindi speakers), suggesting potential differences in pitch accent types and the preference for a nuclear falling tune.
4 Predictions
As described above, our approach to investigating rhythm follows that of Post and Payne (2018) which identifies multiple sources and phonetic exponents of prominence variability. Within this broad framework, we also seek to incorporate (tonal) macro-rhythm (after Jun, 2014) with the aim of building an integrated account of prosodic prominence alternation. This presupposes that multiple structural and phonetic properties in any given language (variety) will give rise to a holistic rhythmic percept, which may be further conditioned by performance factors (such as style, register, speech task and speaker-specific characteristics). Against this theoretical backdrop, we expect that any identified differences between L1s will exert some influence on the speakers’ L2 IndE, and thus find heterogeneity in IndE rhythm and prominence alternation more generally. In addition to examining holistic temporal measures (the so-called rhythm metrics), we also analyse specific properties likely to contribute to global timing variability in the production of IndE. The temporal phenomena selected for investigation in this study are (a) Phrase-final lengthening; (b) The duration of ‘tense’ and ‘lax’ vowels; (c) Stress-conditioned durational variation. We predict different outcomes for these three features, based on their presence or absence in L1s. Phrase-final lengthening (PFL) is a widely attested phenomenon cross-linguistically, although it is particularly marked in long-standing varieties of English. Little has been reported for the L1s investigated here, although given the different degrees of PFL reported for Telugu and Hindi, we expect to find evidence of a different degree of lengthening also across different varieties of IndE more generally. The use of duration as a phonetic exponent of a broader vowel quality distinction in the production of the English vowel system, e.g. vowels that are phonetically long
in Standard Southern British English (SSBE), such as in ‘beat’, ‘cart’, ‘ought’ and ‘suit’, and vowels that are phonetically short in SSBE, such as in ‘bit’, ‘met’, ‘cat’, ‘sun’, ‘put’ and ‘hot’, may vary under the influence of L1 vowel systems. Because Telugu and Tamil have contrastive vowel lengths, we expect L1 speakers of these languages to use duration to mark these vowel quality contrasts. L1 Hindi speakers may also use duration given that, although Hindi does not have a quantity contrast in vowels, vowel quality differences may be accompanied by durational differences, as in English. L1 Bengali speakers, lacking either allophonic or contrastive vowel duration, are predicted to be the least likely to use duration in signalling English vowel differences. Since none of the L1s concerned has lexically contrastive stress, we do not expect speakers of any particular language background to exhibit stress-conditioned reduction more than any other. However, given the different placement of non-lexical prominence in the L1s, which may or may not have temporal exponents, there may be some variation with duration more likely used to signal lexical stress. More generally, we expect vocalic variability to be greater in the IndE with Dravidian language L1s, especially Tamil-English (since Tamil has fewer phonotactic restrictions on long vowels than Telugu). One source of difference is the presence versus absence of durational vowel reduction. Although there is evidence of some degree of spectral and durational vowel reduction in at least some Indian L1s, e.g. 
Tamil (see, for example, Keane, 2006, for a summary of relevant work), where, it would seem, it may perform a demarcative function for word boundaries, the evidence is somewhat inconsistent, at least for duration, and Keane concludes that the durational marking of initial syllables in Tamil is not a ‘general phenomenon.’ By contrast, unstressed vowel reduction in (long-standing varieties of) English is very marked and systematic. Insofar as IndE speakers do phonetically cue lexically stressed syllables, we expect their use of durational cues to be less robust, contributing to a lower vocalic variability than, for example, in SSBE. The availability of long vowels in Dravidian languages may induce IndE speakers with an L1 Tamil or Telugu background to adopt these vowels to produce the longer vowels in English (such as /i:/). However, we might therefore conceivably expect the Dravidian IndE versions of such vowels to be even longer, since long vowels are contrastive in their L1s, thus requiring a more robust phonetic differentiation with short vowels. Thus, even though the phonotactic properties of the text analysed are the same (in that the same passage is used), we expect to find an elevated vocalic variability and %V for Tamil and Telugu-English. In all four L1s, the prosodic structure (phonological/accentual phrase) predetermines accentuation on every prosodic word. This could explain earlier claims that speakers of IndE accent every word, including function words. This may not be entirely accurate as we may have more of a hybrid prosodic system, in view of more recent experimental research (see Sect. 3.2). We predict that speakers will not accent every word in an utterance, but will exhibit patterns different from American or British English varieties, leading to a higher accentual density. Given our understanding of the historic development of IndE, therefore, there should be no differences in accentual density as a function of L1.
In light of more recent experimental research, we also expect that speakers will use a range of pitch patterns on accented words, not necessarily limited to the rise, and we may see differences based on L1 and language family. Previous research has shown that speakers of Indo-Aryan L1s tend to use rises on accented words more frequently. We predict therefore that this will be the case for speakers of both Hindi and Bengali L1s. Conversely, L1 Tamil speakers may have fewer pitch accents in their inventory compared to L1 Bengali speakers, since, unlike Bengali, Tamil has only one pitch accent, L* (Keane, 2014). Telugu is the only L1 in question without a description of its intonational phonology, hence any predictions about Telugu tonal inventory would be premature. As for the tonal alignment of the accentual rise, we predict that the intonational phonology of IndE will have a closer resemblance to the long-standing English varieties than to the L1s, at least for the speakers in this study. Given the speakers’ educational and linguistic backgrounds, they are more likely to produce a rising pitch accent/s, and not a low pitch accent followed by a high phrase tone—the latter being more indicative of an L1-like system. Limited work has been done on the phonetic alignment of accentual rise in the L1s, even though it is a feature that contributes to cross-linguistic variation in South Asian languages (Khan, 2016). Therefore, we can only make tentative predictions about the phonological presentation of the rising gesture on accented words for all four L1s. Bengali-English (Maxwell, 2014; Maxwell & Fletcher, 2014) and Hindi-English (Moon, 2002) may show a more delayed peak alignment compared to the other L1 groups, also taking into account the alignment patterns in Tamil and Telugu. Further, we would expect early timing of f0 peaks for L1 Tamil speakers based on H alignment in Tamil (Khan, 2016). 
In summary, assuming that L1s influence the phonetic detail and prosody of IndE, we can make a series of predictions about the extent and type of heterogeneity in IndE rhythm and prominence alternation. For temporal measures, given the proficiency of the speakers, we expect speakers from all L1 backgrounds to make some temporal distinction in their production of English (allophonically) long and short vowels. However, we expect this to be more robustly implemented by speakers of Dravidian L1 languages, since these languages have a phonological contrast between long and short vowels. Among Indo-Aryan L1s, we expect more robust implementation of this for L1 Hindi speakers of IndE than L1 Bengali, because Hindi has more marked allophonically variable duration in vowels. Again because of their proficiency, we expect speakers from all L1 backgrounds to make some use of temporal marking of stressed versus unstressed syllables, although not to the same extent as, e.g. SSBE. We also predict that this aspect will not show any variation as a function of L1 because contrastive lexical stress, and therefore the temporal marking of lexical stress, is absent in all L1s under investigation. Evidence, albeit limited, of variation in the extent of phrase-final lengthening between L1s (Sirsa & Redford, 2013), informs a prediction that degree of phrase-final lengthening will vary in IndE between speakers of different L1s, and in particular that those with L1 Telugu may have greater lengthening than those with L1 Hindi. In terms of holistic temporal measures (i.e. the so-called ‘rhythm metrics’), we predict little difference from reported scores for SSBE,
since %V is arguably most strongly influenced by phonotactics, and the phonotactics of SSBE and IndE are the same. However, %V is also arguably dependent on actual vowel durations, and given the existence of contrastively long vowels in Telugu and Tamil, we might also predict some subtle variation between L1 backgrounds for %V, with higher scores for Dravidian L1 backgrounds, and especially Tamil, which has fewer phonotactic restrictions on long vowels. With regard to vocalic variability, although we predict IndE speakers to employ durational cues to prosody, we expect this to be to a lesser extent than in SSBE, and therefore for VarcoV and nPVI-V to be lower. Among IndE with different L1 backgrounds, we expect L1 Tamil speakers to have higher global vocalic variability, on account of the predicted use of longer vowels in general (as argued above). We predict consonant variability to be relatively high, compared to ‘syllable-timed’ languages, and close to SSBE measures, because this is determined in large part by the phonotactics of English. With regard to ‘tonal’ rhythm, we also expect to find evidence for both homogeneity across the L1s and variability as a function of L1. This is substantiated by the findings of more recent experimental research within the AM model of intonational analysis on IndE and the L1s (at least for Hindi, Bengali and Tamil). Accent placement is one of the parameters that is expected to show the least variation across the four L1 backgrounds. This could be explained by the fact that the prosodic structures of the L1s in question include a smaller level prosodic unit, the size of a prosodic word, which may in turn contribute to a similarity in the patterns of accent placement in IndE. Related to this, we also predict higher accentual density in IndE compared to the patterns commonly found in BrE or AmE. 
However, more recent work on intonation in IndE suggests that this variety has more of a hybrid system (Maxwell, 2014; Maxwell & Payne, 2018; Puri, 2013, 2018), where some but not all function words (i.e. pronouns or prepositions that do not have any pragmatic emphasis) may be accented, leading to somewhat higher macro-rhythm. Further, unlike in the speakers’ L1s, rising pitch movements on accented words (accentual rises) are more likely to represent a rising pitch accent category (either L* + H or L + H*), found in other long-standing varieties of English, rather than a low pitch accent followed by a high phrase tone, more characteristic for the L1s in this study. As for potential differences, intonational contours produced by the L1 speakers of the two Indo-Aryan languages (Bengali and Hindi) are expected to show a higher frequency of accentual rises as compared to the L1 speakers of other L1s (in view of the findings reported for IndE), especially in nuclear position. In addition, Bengali L1 speakers may have a wider pitch accent inventory (different shape types on accented words). Finally, limited work is available on the precise phonetic alignment of the accentual rise in South Asian languages, but from what we know, compared to Hindi and Bengali, the H tone of the rise in Tamil is timed earlier in relation to the segmental material. This difference in alignment could lead to earlier peaks and potentially a different mapping of the rise in English for the speakers with L1 Tamil background.
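The holistic temporal measures referred to in these predictions (%V, consonantal variability, VarcoV and nPVI-V) can be illustrated with a minimal Python sketch. It assumes that durations of vocalic and consonantal intervals are already available as plain lists of seconds. The function names, the example durations, and the use of the population standard deviation are assumptions made for illustration, not a description of the scripts used in this study.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total interval duration that is vocalic, in percent."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta(durations):
    """Delta measure (e.g. Delta-C): standard deviation of interval durations.

    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    return pstdev(durations)


def varco(durations):
    """Varco measure (e.g. VarcoV): standard deviation normalised by the mean."""
    return 100.0 * pstdev(durations) / mean(durations)


def npvi(durations):
    """Normalised pairwise variability index over successive intervals."""
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    # Each term is the absolute difference of a neighbouring pair,
    # divided by the mean duration of that pair.
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


if __name__ == "__main__":
    # Hypothetical interval durations in seconds, for illustration only.
    vowels = [0.08, 0.15, 0.06, 0.12]
    consonants = [0.07, 0.11, 0.09, 0.10]
    print("%V     :", round(percent_v(vowels, consonants), 1))
    print("DeltaC :", round(delta(consonants), 3))
    print("VarcoV :", round(varco(vowels), 1))
    print("nPVI-V :", round(npvi(vowels), 1))
```

Under these definitions, a variety with little durational contrast between neighbouring vowels yields a low nPVI-V and VarcoV, which is the direction predicted above for IndE relative to SSBE.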
O. Maxwell and E. Payne
5 Method
5.1 Speakers
Six female and two male speakers of IndE were recorded at the University of Hyderabad, India, in 2017. All speakers were enrolled in a university degree at the time of data collection, had started learning English at the age of 4–7 years, identified as bi- or multilingual, and were aged 22–30 years. None of the speakers had ever lived outside of India. Participants represented four L1 backgrounds (2 speakers each): Tamil, Telugu, Hindi and Bengali. The speakers with Dravidian L1s came from the same cities (i.e. Telugu: Tirupathi in Andhra Pradesh; Tamil: Madurai in Tamil Nadu). As for the Indo-Aryan L1 speakers, while these spoke a form of standard Bengali/Hindi, there were some biographical details that could have influenced their speech. The L1 Hindi speakers came from different parts of the Hindi Belt (Mathura and Patna). For L1 Bengali, both speakers were from West Bengal, but one of them was born in Assam (Guwahati) and had also spent a significant period of time there (Fig. 1).
5.1.1 Materials and Analysis
Fig. 1 Places of birth represented by speakers’ L1 (represented by different colours)
The speakers were asked to read “The North Wind and the Sun” passage three times in a neutral voice, i.e. as if telling a story. The part of the story analysed consisted of 87 intonational phrases (IPs) for Telugu-English, 79 IPs for Tamil-English, 68 IPs for Bengali-English, and 77 IPs for Hindi-English (differences across varieties resulting from differences in prosodic phrasing in delivery, which may potentially be influenced by the size of accentual phrases in the L1 concerned). The speakers were recorded in a quiet room using a Zoom H4nSP audio recorder with an external lapel microphone. The recordings were made at a sampling rate of 44.1 kHz. For each speaker, the sound file deemed to have the most natural reading was selected for further analysis and converted into a mono .wav file.
The selected recordings were segmented and annotated using the WebMAUS services (Munich Automated Segmentation web platform; Wilkenmann et al., 2017). The segmentation was manually corrected and annotated in Praat (Boersma & Weenink, 1992–2017), following standard criteria (see, for example, Peterson & Lehiste, 1960) identifying acoustic cues to segment boundaries provided by discontinuities in amplitude, periodicity and frequency, as evidenced in the waveform and spectrogram. Extra care was taken at phrase boundaries, particularly those followed by a pause, where final segments may not finish abruptly, but acoustic energy can be seen to ‘tail off’. In these cases, a subjective judgement was made based on the overall levels of acoustic energy present, on the assumption that at the extreme end of such ‘tailing off’, the energy present would not be perceptually salient and therefore could be discounted for our purposes.
Additional Praat annotation included the following information on a series of tiers: a CV tier with consonantal and vocalic intervals; a syllable tier with syllable boundaries and degrees of prosodic prominence; an intonational phrase tier marking the boundaries of intonational phrases (IPs); and a tonal tier with pitch accents and boundary tones.
Three degrees of prosodic prominence were identified for labelling: unstressed (U), stressed (S) and nuclear stressed (SS). ‘U’ and ‘S’ were guided by English dictionary-definitions of prominence pattern, although allowing for adjustment where the prominence pattern was clearly different in IndE, or if intonational phrasing placed prominences elsewhere. ‘SS’ was identified as being the most prominent syllable phrase-finally in the intonational phrase, as judged by the authors (one of whom is a native speaker of British English, and the other is a near-native speaker of Australian English). Syllables and segmental intervals at the beginning and end of IPs were also marked. Duration measurements were automatically extracted, using a script for consonantal and vocalic intervals. The following values were then calculated for each speaker’s productions: (i) %V; (ii) normalised variability in vocalic intervals: VarcoV and nPVI-V; (iii) non-normalised variability in consonantal intervals: rPVI-C. Duration measurements were also extracted for vowels and syllables, and ANOVAs carried out in SPSS with each of these as the dependent variable, and the following independent variables: (i) STRESS (unstressed; stressed); (ii) VOWEL TARGET IN SBE (‘short’; ‘long’; two abutting short vowels);
(iii) PHRASE POSITION (‘final’; ‘non-final’). The tonal tier was used for intonational analysis, following the AM framework (Ladd, 2008; Pierrehumbert, 1980). Auditory and visual analyses were performed to identify pitch movements corresponding to pitch accents, phrase accents and boundary tones. On the tonal tier, the following additional acoustic landmarks were identified for each accentual rise: syllable onset, syllable offset, low (L) and high (H) turning points associated with the accented syllable and word offset. The H target was identified as the first highest peak in the vicinity of the accented syllable. The L target was labelled just before the elbow of the rising gesture, located either at the onset or within the vowel of the accented syllable. Although scaling (the magnitude of the rise) was not examined in this dataset, pitch movements annotated as rises excluded any of the instances with a shallow rise, thus distinguishing rises from simple high tones (H*). The first set of analyses on tonal phenomena included an examination of f0 movements on accented words to determine pitch accent types and their proportional distribution for each speaker. The second set of analyses focused on the rising gesture associated with the accented word. In order to examine the tonal alignment of the accentual rise, several temporal distances were obtained relative to the accented syllable and word boundaries. These measurements are illustrated in Fig. 2 and are specified below. Five linear mixed effects models (LMM) with the post-hoc Tukey tests were used to predict the location of the tone targets in relation to the segmental landmarks (a-e). We used the lme4 package (Bates et al., 2016) with fixed variables for L1 and WORD TYPE (number of post-accented syllables, from 0 to 2), and random variables for TARGET WORD and SPEAKER. 
Except for the measurement of f0 peak alignment relative to word offset, likelihood ratio tests showed that the interaction between L1 and WORD TYPE could be dropped without loss of fit. In addition, the fixed variable of WORD TYPE was not included when modelling the L timing relative to syllable onset (as it was not relevant).
(a) distance between the L and the accented syllable onset (SyllOn);
(b) distance between the H and the accented syllable onset (SyllOn);
(c) distance between the H and the accented syllable offset (SyllOff);
(d) distance between the H and the accented word offset (WOff);
(e) distance between the L and the H targets;
(f) combined duration of the accented syllable and post-accented syllable or syllables.
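The measurements (a) to (f) can be sketched in code as simple differences between annotated time points. The following is a minimal illustration only, not the script used in the study: the class and field names are hypothetical, times are assumed to be in seconds, and the sign convention (tone time minus landmark time) is an assumption chosen so that negative values mean the tone target precedes the landmark.

```python
from dataclasses import dataclass


@dataclass
class RiseLandmarks:
    """Landmark times (s) for one accentual rise; field names are illustrative."""
    syll_on: float   # accented syllable onset (SyllOn)
    syll_off: float  # accented syllable offset (SyllOff)
    low: float       # L turning point
    high: float      # H turning point
    word_off: float  # accented word offset (WOff)


def alignment_ms(r: RiseLandmarks) -> dict:
    """Return measurements (a)-(f) in milliseconds.

    Assumed sign convention: tone time minus landmark time, so a negative
    value means the tone target occurs before the landmark.
    """
    to_ms = 1000.0
    return {
        "a_L_re_SyllOn": (r.low - r.syll_on) * to_ms,
        "b_H_re_SyllOn": (r.high - r.syll_on) * to_ms,
        "c_H_re_SyllOff": (r.high - r.syll_off) * to_ms,
        "d_H_re_WOff": (r.high - r.word_off) * to_ms,
        "e_rise_duration": (r.high - r.low) * to_ms,
        # Accented syllable onset to word offset spans the accented
        # syllable plus any post-accented syllables in the word.
        "f_accented_plus_post": (r.word_off - r.syll_on) * to_ms,
    }


def peak_within_accented_syllable(r: RiseLandmarks) -> bool:
    """True when the H peak is reached no later than the accented syllable offset."""
    return r.high <= r.syll_off


# Hypothetical rise: the peak falls 50 ms after the accented syllable offset.
example = RiseLandmarks(syll_on=1.00, syll_off=1.25, low=1.05, high=1.30, word_off=1.50)
measures = alignment_ms(example)
```

Keeping all six values in one record per rise makes it straightforward to feed each measure into a separate model as a dependent variable.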
Fig. 2 Schematic representation of durational measurements relative to syllable onset, syllable and word offset
Investigating (Rhythm) Variation in Indian English: An Integrated …
6 Results
6.1 Temporal Measures
6.1.1 Holistic Rhythm Metrics
Table 2 gives the mean metric scores for IndE across L1s spoken and, also, for ease of comparison, for Spanish (a stereotypical ‘syllable-timed’ language) and Southern Standard British English (SSBE), conventionally described as ‘stress-timed’. The most evident differences between Spanish, at one end of the timing spectrum, and SSBE, at the other end, are a higher %V in Spanish (purported to arise at least in part from the simpler syllable structure in Spanish) and much greater variability in both vocalic and (non-normalised) consonantal intervals in SSBE. Looking first at %V, we see IndE has a lower score than Spanish, irrespective of background L1. On the whole, as predicted, scores are similar or slightly above reported scores for SSBE, with the slightly elevated scores suggesting some vowels at least may be proportionately longer for IndE speakers of these L1s. IndE with L1 Hindi has a lower score than SSBE which may be due to a lack of global variability in vocalic duration (as evidenced by a low VarcoV), although if that were the case we would also expect a low %V also for IndE with L1 Bengali, something we do not find. Looking at vocalic variability more closely, there are distinctions to be drawn between ‘global’ and ‘sequential’ variability. As predicted, the global measure (VarcoV) shows low variability for IndE, with scores even markedly lower than for Spanish, across all L1s except L1 Tamil, also as predicted. Curiously, nPVI-V, which reflects variability in successive intervals, is higher than for Spanish, although still much lower than SSBE. IndE with L1 Tamil stands out as being very similar to SSBE for the global measure of vocalic variability, VarcoV (though not for successive variability). nPVI-V is sensitive to unstressed vowel reduction, which tends to alternate in (near-)successive sequences, hence the relatively low scores for IndE suggest less unstressed vowel reduction than in SSBE, as per our prediction. 
For the global measures to be high, there needs to be some other, non-alternating source of vocalic variability for IndE with L1 Tamil.

Table 2 Mean metric scores for IndE as a function of L1 spoken

Metric    Spanish**   IndE (L1 Bengali)   IndE (L1 Hindi)   IndE (L1 Tamil)   IndE (L1 Telugu)   SSBE**
%V        48          41                  34                43                38                 38
VarcoV    41          33                  34                65                34                 64
nPVI-V    36          50                  54                55                52                 73
rPVI-C    43          54                  55                65                58                 70

** From Grabe and Low (2002)
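For readers unfamiliar with the four metrics in Table 2, the sketch below shows how they are conventionally defined over a sequence of interval durations. It is an illustration under stated assumptions, not the extraction script used in the study: the function names are hypothetical, durations are assumed to be in seconds, Varco is computed here with the population standard deviation, and the raw PVI is scaled to milliseconds.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: vocalic share of the total vocalic + consonantal duration, in percent."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def varco(durations):
    """Varco: standard deviation of the durations divided by their mean, times 100.

    Assumption: population standard deviation (pstdev) is used.
    """
    return 100.0 * pstdev(durations) / mean(durations)


def npvi(durations):
    """Normalised PVI: mean of |d1 - d2| / ((d1 + d2) / 2) over successive pairs, times 100."""
    pairs = zip(durations, durations[1:])
    return 100.0 * mean(abs(a - b) / ((a + b) / 2) for a, b in pairs)


def rpvi(durations, scale=1000.0):
    """Raw PVI: mean absolute difference between successive intervals.

    Assumption: `scale` converts seconds to milliseconds.
    """
    pairs = zip(durations, durations[1:])
    return scale * mean(abs(a - b) for a, b in pairs)


# Hypothetical interval durations (s) for one short utterance.
vowels = [0.08, 0.12, 0.06, 0.15]
consonants = [0.10, 0.07, 0.13]
scores = {
    "%V": percent_v(vowels, consonants),
    "VarcoV": varco(vowels),
    "nPVI-V": npvi(vowels),
    "rPVI-C": rpvi(consonants),
}
```

The contrast between VarcoV and nPVI-V discussed above follows from these definitions: Varco pools all intervals regardless of order, whereas the PVI only compares each interval with its immediate successor.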
Fig. 3 Mean vowel duration (s) by target type in IndE across L1s, showing 95% confidence intervals
Looking at (non-normalised) consonantal variability, this is higher than in Spanish, though typically not as high as for SSBE, as predicted, with the exception of IndE with L1 Tamil. It should be noted that researchers have expressed scepticism as to the usefulness and reliability of metrics based on consonantal interval variability (cf. White & Mattys, 2007), reflecting as they supposedly do properties that are more purely phonotactic in nature than dependent on prosodic structure, but we report them here for completeness. Furthermore, given that the phonotactics of IndE do not differ from those of SSBE, the lower variability in IndE points to this measure being sensitive to something other than phonotactics. Arguably consonantal measures are potentially of relevance to the rhythm percept, insofar as they are not necessarily ‘blind’ to prosodic structure (e.g. domain-initial strengthening).
6.1.2 Specific Temporal Measures
Short Versus Long Vowels
Figure 3 shows the mean duration of vowels in IndE for different L1s. ‘V’ (N = 664) is a target short vowel in SSBE, e.g. ‘bit’; ‘V:’ (N = 189) is a target long vowel, e.g. ‘beat’; and ‘VV’ (N = 84) is a sequence of abutting short vowel targets (i.e. across a word boundary in the same IP) in SSBE, e.g. ‘the other’. There is a significant main effect of segment type across L1s (F(2, 29.6), p < 0.001), with V significantly shorter than V: and VV for all L1 backgrounds (p < 0.001 for each L1 background). The mean difference is greatest for speakers with an L1 background in Hindi, Telugu and Tamil, as we predicted. Further, spectral analysis is needed to examine whether these IndE speakers use durational distinctions as the only or most robust means to signal the vowel contrast. Interestingly, there is also a significant interaction of L1*segment type (p < 0.05; F(6, 2.22)). L1 speakers of Bengali and Hindi make no durational distinction between target V: and target VV. For speakers of Dravidian L1s, however, target V: appears to be a different durational category; a long vowel, though distinct from a short vowel, is nevertheless significantly shorter than two abutting short vowel targets (VV) (L1 Telugu: p < 0.05; L1 Tamil: p < 0.001), with the difference particularly noteworthy for L1 Tamil. We hypothesise that this is due to the influence of a distinct phonological category for long vowels in Telugu and Tamil.
Fig. 4 Mean syllable duration (s) for different degrees of stress/prominence across L1s, showing 95% confidence intervals
Stress-Conditioned Durational Variation
Across L1 backgrounds, there is a main effect of stress (F(2, 210.8), p < 0.001). Looking at individual L1 backgrounds, as predicted, unstressed syllables are significantly shorter than stressed syllables in IndE for all L1s (English with L1 Bengali: p < 0.001; with L1 Hindi: p < 0.001; with L1 Telugu: p < 0.001; Tamil-English: p < 0.001). Looking specifically at vowel duration, we find that there is not only a main effect of stress there too (F(2, 15.6), p < 0.001), but also a significant interaction of stress*L1 (F(6, 2.2), p < 0.05). Pairwise comparisons reveal vowels are shorter when unstressed (Bengali-English: p < 0.001; Hindi-English: p < 0.001; Telugu-English: p < 0.001) except for IndE with L1 Tamil (see Fig. 5). We conjecture that this difference between syllable and vowel patterns in IndE for L1 Tamil speakers, with stressed syllable lengthening being achieved mostly by consonantal lengthening, rather than vocalic lengthening, could be due to different phonotactic constraints in Tamil.
If we compare syllables in a stressed context with those bearing nuclear prominence, we see that few speakers make a durational distinction in IndE. When the unit of measurement is vowel duration, there is a durational difference between stressed and nuclear prominence only for speakers with L1 Telugu; when the unit is syllable duration (see Fig. 4), it is only for speakers with L1 Tamil. IndE speakers with an Indo-Aryan L1 do not distinguish these levels of prominence durationally (if at all).
At first sight, it would appear from these findings that, as predicted, speakers of IndE use durational distinctions at the syllable level to mark lexical stress, much as in long-standing varieties of English such as BrE, albeit the means by which speakers from different language backgrounds achieve this differ (i.e. vocalic or consonant lengthening). Thus, at least on the surface, prominence marking (and the lack of lexical stress) in the relevant L1 does not appear to strongly influence prominence marking in IndE. However, a more nuanced picture emerges when we break down measurements for unstressed and stressed syllable duration according to position in the phrase (see Fig. 6a–d). Although in the IndE of speakers with all 4 L1 backgrounds stressed syllables are longer than unstressed syllables for all positions in the phrase, there is an interaction of phrase position and stress (F(6, 6.1), p < 0.001), with a reduced difference in phrase-initial position in IndE with Indo-Aryan L1 background (Fig. 6a, b). Furthermore, in IndE with Bengali L1 (Fig. 6b), unstressed syllables, while shorter than stressed syllables, are nevertheless longer when phrase-initial than when phrase-medial (p < 0.001). Given that in South Asian languages each prosodic word is said to form a small phrase (see Sect. 3.1.2), and that word-level prominence in Bengali is initial, this pattern of behaviour at the phrase-level in Bengali-English could be interpreted as paralleling the word-initial prominence of Bengali. That L1 speakers of Bengali lengthen word-initial syllables even when unstressed in IndE suggests the transfer of a prosodic patterning from Bengali. There is a similar pattern for unstressed syllables in IndE with L1 Hindi (Fig. 6a), although a reverse pattern for stressed syllables (which are longer when phrase-medial), leading to a clearer marking of stress in this position (stressed syllables longer, and unstressed syllables shorter, than in initial position).
If we compare IndE with Dravidian L1s (Fig. 6c, d), we see that, although stressed syllables are longer than unstressed syllables in all positions, there is no discernible difference in the contrast made between the initial and medial positions, and the longest duration for unstressed syllables is phrase-finally. Thus, while the IndE of speakers from all L1 backgrounds shows the temporal marking of lexical stress and final position, there are interactions between these two which appear to show a subtle influence of L1s that merits closer investigation, including close analysis of the L1s in question. IndE with a Bengali background would appear to prioritise the marking of phrase-initial position, perhaps at the expense of a robust lexical stress marking in this position, while IndE with a Hindi background would appear to suppress the stress contrast in the initial position.
Fig. 5 Mean vowel duration (s) for different degrees of stress/prominence across L1s, showing 95% confidence intervals
Fig. 6 a–d Mean syllable duration (s) by position in phrase and stress for each L1 background (a: Hindi, b: Bengali, c: Telugu, and d: Tamil), showing 95% confidence intervals
A further potential confound in this is an apparent difference in phrasing, which has consequences for the location of the IP boundaries and the presence of post-nuclear stresses (hence, the noteworthy lack of phrase-final unaccented stressed syllables in Tamil-English).
Phrase-Final Lengthening
The final source of durational variation investigated was the presence of phrase-final lengthening, known to be a marked feature of British and American varieties of English. For vowel duration, there is evidence of phrase-final lengthening in the IndE of all speakers, regardless of L1 (see Fig. 7a) (main effect of phrase position, F(2, 26.1), p < 0.001). However, if we look at syllable duration as a whole (Fig. 7b), while there is a tendency for final syllables to be longer for all speakers, the difference is only statistically significant for IndE speakers with L1 Tamil. We conjecture that there is possibly a mechanism of compensatory shortening in coda consonants for the other speakers which attenuates the lengthening effect on the syllable as a whole, and we conjecture further that this may be due to durational constraints in syllable codas in the L1s spoken. Tamil, on the other hand, permits long vowels to be followed by long consonants, so a Tamil L1 background would not be expected to exert the same influence. This mirrors the finding reported in section “Stress-Conditioned Durational Variation” above that consonant lengthening is what marks the distinction between unstressed and stressed syllables in IndE with L1 Tamil.
Fig. 7 Mean (a) vowel duration (ms) and (b) syllable duration (s) by phrase position and L1, showing 95% confidence intervals
In summary, our results show comparable %V scores for IndE and SSBE, and considerably lower vocalic variability scores for IndE, with the marked exception of global variability (VarcoV) for IndE with L1 Tamil, which shows as much variability as SSBE. Consonant variability is also higher for IndE with L1 Tamil (although lower than for SSBE). IndE speakers use duration to mark allophonically short versus long vowels in English, but speakers with L1 Tamil or Telugu use a markedly different durational category than for abutting VV, which we have suggested is due to the presence of a separate phonological category for long vowels in these languages. IndE speakers from all L1 backgrounds mark lexical stress through increased syllable duration, although this appears to be achieved through consonant lengthening for L1 Tamil, rather than through vowel lengthening (as for all other L1s). A similar pattern holds for the durational marking of phrase-final syllables. There is also evidence of possible L1 influence on how temporal cues for lexical stress and position in the phrase interact. Further investigation is needed of the L1s in question to verify whether this is indeed due to L1 influence, and if so, determine the precise nature of the influence.
Fig. 8 Proportion of speakers’ accented syllables (A, grey) in relation to proportion of stressed and unstressed syllables combined (S + U, blue), presented as a function of L1. Accented (A) proportions as labelled in the figure, in speaker order: DS (Beng-Eng) 35%, SP (Beng-Eng) 38%, GN (Hin-Eng) 33%, HM (Hin-Eng) 33%, DN (Tam-Eng) 35.5%, KR (Tam-Eng) 42.5%, BR (Tel-Eng) 38%, BN (Tel-Eng) 39%
6.2 ‘Tonal’ Rhythm
6.2.1 Pitch Accent Types and Accentuation
Accentuation Density and Patterning
Figure 8 illustrates the ratio of accented syllables versus combined stressed and unstressed syllables. The results in this subsection are presented by speaker to accurately capture inter-speaker differences in accentuation and the use of pitch shapes on accented words. As predicted, IndE speakers in this study did not place accents on every word. However, there were many instances where prepositions and pronouns were accented, suggesting higher accentual density in IndE than in long-standing varieties such as BrE or AmE. There was no consistent pattern based on L1; the data only showed subtle individual differences. Figure 8 shows a marginal but not significant difference for Hindi L1 (the lowest ratio of accented syllables, 33%). It is important to note that speaker KR (L1 Tamil), who has the highest ratio of accented words, started learning English much later than the other speakers. This resonates with previous research showing that the age of onset of learning English is relevant when looking at prosodic features (Puri, 2018; Sirsa & Redford, 2013).
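As a rough illustration of how a per-speaker accent proportion of this kind could be derived from a syllable-level annotation, the sketch below counts labels and reports the accented share. It is a hypothetical example: the label coding ('A' for accented, 'S' for stressed but unaccented, 'U' for unstressed) and the function name are assumptions made for illustration and do not reproduce the authors' tier labels or procedure.

```python
from collections import Counter


def accent_share(labels):
    """Percentage of syllables labelled 'A' (accented) among all syllables.

    Assumed coding: 'A' = accented, 'S' = stressed but unaccented,
    'U' = unstressed. Raises ValueError for an empty label list.
    """
    if not labels:
        raise ValueError("no syllable labels supplied")
    counts = Counter(labels)
    return 100.0 * counts["A"] / sum(counts.values())


# Hypothetical syllable labels for two speakers.
speakers = {
    "speaker_1": ["A", "U", "S", "U", "A", "U", "U", "S"],
    "speaker_2": ["A", "U", "U", "U"],
}
shares = {name: accent_share(labels) for name, labels in speakers.items()}
```

Computing the share per speaker rather than per L1 group mirrors the by-speaker presentation adopted in this subsection.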
Pitch Accent Types
The results also confirm our prediction that speakers of IndE will use a number of pitch accents (not limited to the rising pitch movement on accented words commonly found in their L1s). First, unlike in the speakers’ L1s, four types of pitch accents have been observed in the data, two of which were monotonal (L* and H*), and two bitonal (L* + H and H* + L). Figure 9 presents a proportional distribution of pitch accent types for each speaker as a function of L1. The rising (L* + H) and the high (H*) pitch accents seem to be the most common and are produced by all speakers. The H* + L and L* pitch accents occur less frequently, with the L* often restricted to a nuclear position. The falling pitch accent H* + L is distinguished from the H* by an early peak, generally realised within the onset of the accented syllable, and a sharp fall in f0 throughout the accented vowel.
Fig. 9 Proportional distribution of pitch accent types by speaker and L1
As shown in Fig. 9, all speakers produce L* + H and H* accents. However, the proportion of high (H*) pitch accents is greater for both L1 Tamil speakers (DN: 49%, KR: 52%) and one L1 Telugu speaker (BN: 62%) in comparison to one L1 Bengali and both L1 Hindi speakers (SP: 30%, GN: 16%, and HM: 45%). The most frequent use of the L* + H accent is observed for L1 Bengali speaker SP (61%) and L1 Hindi speaker GN (68%), as predicted for Indo-Aryan language backgrounds. Differences in pitch accent types and distribution between the two L1 Bengali speakers could be due to sociolinguistic factors (speaker DS grew up in Assam, with exposure to Assamese and different Bengali dialects spoken in the North-East, in addition to the exposure to other languages since early childhood). As for L1 Hindi, it is not entirely clear what could be contributing to the differences in pitch accent types for speaker HM (higher use of H* and the lack of L*), whose patterning is closer to that for the speakers of Dravidian languages. Possible factors could be the length of residency in Hyderabad (~5 years), as well as greater mobility across India compared to the other L1 Hindi and L1 Bengali speakers. A further examination of distributional differences in prenuclear versus nuclear positions in the data shows a much more consistent pattern across the L1 groups.
The results reveal that L1 speakers of Hindi and Bengali use the rising pitch accent (L* + H) in both prenuclear and nuclear positions in their IndE, while for the L1 speakers of Dravidian languages, especially DN and KR (Tamil L1) and BR (Telugu L1), this pitch accent is mostly restricted to prenuclear positions, with a predominant use of the high pitch accent on nuclear-accented words. The findings also suggest differences in pitch accent inventory across the speakers and potentially L1s. Five of the speakers (IndE with L1 Telugu and IndE with L1 Tamil speakers as well as speaker HM— L1 Hindi) do not produce L*, while for speaker BN (L1 Telugu) the pitch accent inventory seems to include only two pitch accents. Further, there is a possibility that
some speakers use their L1 inventory with different mapping. This would at least explain the high frequency of the H* accent in IndE with L1 Tamil. These findings reveal not only differences in pitch accent types, and their distribution depending on the position in the phrase, but also suggest potential differences in nuclear tune types (a combination of a nuclear pitch accent and boundary tones) in declarative intonation as a function of L1. L1 Bengali and Hindi speakers show a consistent and frequent use of nuclear rise-falls, in their IndE, compared with nuclear falls used by the other speakers. The following section will examine the phonetic alignment of the rising pitch accent in order to justify its phonological category (L* + H) posited for the speakers of IndE in this study.
6.3 Tonal Alignment of the Rise
Figure 10a–d illustrates the alignment of the L target, the H target and the rise (temporal interval between the low and the high tone targets) relative to several segmental landmarks, presented by the speakers’ L1 (purple: Telugu, aqua blue: Tamil, green: Hindi, and pink: Bengali). When examining the interval between the accented syllable onset and the L target, the results show a positive effect of L1 on the alignment of the L tone target relative to the accented syllable onset (Fig. 10a) in IndE. The likelihood ratio test between the full and null models predicting the effect of L1 on L alignment reached significance (χ2 (3) = 12.19, p < 0.001). However, post-hoc tests confirm significant differences between two groups only: speakers with L1 Hindi and speakers with L1 Tamil (z = −2.598, p < 0.01). The L tone targets produced by IndE speakers with L1 Tamil occurred earlier in relation to the syllable onset, often timed with the first consonant in the accented syllable onset. For some of the speakers with L1 Hindi, Bengali and Telugu, there was a substantial degree of variation in the realisation of L targets relative to syllable onset, as shown by wide whiskers in Fig. 10a. This was especially pronounced for the speakers of IndE with L1 Telugu. Interestingly, the figure clearly shows greater consistency in the timing of the low tone for IndE speakers with L1 Tamil.
Despite some patterning based on L1, the results for the measures of the H tone relative to syllable onset (Fig. 10b) and rise duration (the temporal interval between the L and the H targets, Fig. 10d) show no significant effect of L1. This could reflect inter-speaker variation in the duration of segmental intervals, and the syllabic structure of the accented syllables, since these were not controlled for segmental composition. As illustrated in Fig. 10d, the median values for rise duration (L to H interval) across the speakers with Hindi, Tamil and Bengali L1 are similar. This is despite L1-based differences in the alignment of the L and the H targets examined separately (Fig. 10a, b).
Fig. 10 a–d Boxplots and whiskers for the distribution (in ms) of (a) the L tone target relative to syllable onset, (b) the H tone target relative to syllable onset, (c) the H tone target relative to syllable offset, and (d) rise duration, presented as a function of L1
In contrast, the temporal interval between the f0 peak (H tone) and the accented syllable offset (Fig. 10c) shows a more distinctive pattern; and this measure is of particular interest in determining a pitch accent category of the rise (L* + H vs. L + H*). The vertical line next to the ‘0’ point on the x-axis indicates the end of the accented syllable, and values to the left of the zero represent H peak alignment within the accented syllable, while the values to the right correspond to the realisation of the peak in the post-accented syllable. The likelihood ratio test between the full and null models predicting the effect of L1 and WORD TYPE (number of post-accented syllables) on peak alignment reached significance (χ2 (2) = 23.59, p < 0.001). Post-hoc tests reveal that L1 Tamil speakers produced significantly earlier peaks compared to the IndE speakers with L1 Bengali (z = 3.387, p = 0.004) and L1 Telugu (z = 3.15, p = 0.009). As can be seen in the figure, there was a lot of variation for both L1 Hindi speakers and a lack of a clear pattern. This needs to be examined in a larger sample of data. Finally, in order to determine whether the H tone in the rising gesture is part of a bitonal pitch accent (L* + H or L + H*) or is a phrase accent, demarcating the edge of a prosodic unit (similar to the speakers’ L1s: L* + Hp), we examined peak alignment relative to word offset. The results confirm our prediction: the temporal interval between H alignment and word offset strongly correlates with the combined length of accented and post-accented syllable/s (R2 = 0.56, p < 0.0001), thus demonstrating that the longer the post-accented material is, the greater the difference is between the
p > 0.01). London English VtoV values were 0.4% lower than those of Marathi English and 7% lower than those of Telugu English (see Fig. 5). Notwithstanding the absence of significant differences, this result is in keeping with the previously observed trend for more syllable-timed varieties to have higher VtoV values as a result of their more complex consonantal clusters (Pettorino & Pellegrino, 2016).
Fig. 3 VtoV and %V measurements for London English (triangles), Marathi English (circles) and Telugu English (crosses)
Fig. 4 Mean and standard deviation of %V measurements for London English, Marathi English and Telugu English
Fig. 5 Mean and standard deviation of VtoV measurements in London English, Marathi English and Telugu English
5.2 Hypothesis II: Rhythmic Contrast Between Marathi English and Telugu English
Hypothesis II predicted that Telugu English would be more syllable-timed than Marathi English. VtoV and %V scores for individual speakers are shown in Fig. 6, with circles for Marathi English and crosses for Telugu English. As in Sect. 5.1, the separation between the two groups is rather clear-cut. One Marathi English speaker (0.272) and one Telugu English speaker (0.269) have higher VtoV values than the other members of their respective groups, indicating slower speech rates.5 Two Marathi English speakers had very similar scores (AK: VtoV 0.206, %V 42.7; PC: VtoV 0.207, %V 42.7), so it appears that there are only four circles in the figure. %V values were significantly lower in the Marathi English group (43.0) than in the Telugu English one (46.3, p < 0.0001). Average %V values were 7.6% lower in Marathi English than in Telugu English; results reveal that Telugu English is significantly more syllable-timed than Marathi English (Fig. 7). The VtoV measurements of Telugu English (0.227) are 6.5% higher than those of Marathi English (0.213). Although not statistically significant (p > 0.01), as in 5.1, this result is in line with previous work, according to which more syllable-timed languages have higher VtoV values than stress-timed ones, as a result of their more complex consonantal clusters (Pettorino & Pellegrino, 2016; Pettorino et al., 2013). The average VtoV values are shown in Fig. 8.
5 In keeping with the terminology used in speech rhythm research, the term ‘speech rate’ has been used here as a measure indicating how many articulatory units (such as vocalic intervals and syllables) are realised per time unit excluding pauses. However, research focusing on non-native proficiency has shed light on the distinction between ‘articulatory rate’ and ‘speech rate’, with the former being used in the sense of articulatory units per time unit excluding pauses and the latter including pauses. For further details, see Gut (2009).
Fig. 6 VtoV, %V measurements for Marathi English (circles) and Telugu English (crosses)
Fig. 7 Mean and standard deviation of %V measurements for Marathi and Telugu English
Fig. 8 Mean and standard deviation values of VtoV measurements in Marathi and Telugu English
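The two metrics reported in Sect. 5 can be illustrated with a minimal Python sketch. It assumes that %V is the share of total utterance duration taken up by vocalic intervals, and that VtoV is the mean time between the onsets of successive vowels; the latter is an assumption based on how the measure is usually described, not a definition given in this section. The function names, the tuple layout and the toy segmentation are hypothetical and are not taken from the study.

```python
from statistics import mean

# Each interval is (label, start_s, end_s); "V" = vocalic, "C" = consonantal.
# Assumption: one pause-free stretch in which V and C intervals alternate.

def percent_v(intervals):
    """Percentage of the total duration taken up by vocalic intervals."""
    total = sum(end - start for _, start, end in intervals)
    vocalic = sum(end - start for label, start, end in intervals if label == "V")
    return 100.0 * vocalic / total

def vtov(intervals):
    """Mean time (s) between the onsets of successive vocalic intervals."""
    onsets = [start for label, start, _ in intervals if label == "V"]
    return mean(later - earlier for earlier, later in zip(onsets, onsets[1:]))

# Made-up segmentation, for illustration only (not data from the study).
toy_utterance = [
    ("C", 0.00, 0.08), ("V", 0.08, 0.18),
    ("C", 0.18, 0.30), ("V", 0.30, 0.38),
    ("C", 0.38, 0.45), ("V", 0.45, 0.60),
]

print(round(percent_v(toy_utterance), 1))  # 55.0
print(round(vtov(toy_utterance), 3))       # 0.185
```

Because VtoV is an absolute duration, it grows when speech slows down, which is why the text treats higher VtoV values as indicating slower speech. A real analysis would also need to decide how to treat pauses and adjacent vocalic intervals, which this sketch leaves out.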
6 Discussion

The results of this study concur with earlier descriptions of Outer Circle varieties of English as more syllable-timed than Inner Circle varieties (Fuchs, 2016; Gut, 2005; Krivokapić, 2013). The duration-based metrics used here indicate that Marathi English and Telugu English have a more syllable-timed rhythm than London English. Specifically, with regard to inter-group variation, London English has lower %V and VtoV values than Marathi English and Telugu English. Crucially, the difference in average %V and VtoV values between Telugu English and Marathi English was larger than the difference between Marathi English and London English. The comparison suggests that timing in IndE varieties differs significantly from timing in BrE varieties. Moreover, the mean %V value for London English (41.1) corresponds to previous results for BrE (38, White & Mattys, 2007; 41.1, Grabe & Low, 2002). Intra-group variation in London English is minimal for both metrics (st.dev. VtoV: 0.01).

The results also shed new light on Telugu English, which appears to be more syllable-timed than Marathi English. VtoV measurements for Marathi English are lower than those of Telugu English, and the proportion of vocalic durations over total utterance duration was also significantly lower in Marathi English than in Telugu English. The comparison of the mean %V score of Telugu English (46.38) with those of Telugu (51.9, Pettorino & Pellegrino, 2016; 51.2, Sirsa & Redford, 2013) may support the assumption of first language influence on the timing patterns of this variety. However, this conclusion is controversial. As noted in Sect. 2, earlier investigations have suggested either that similarities across Indian languages may account for similarities in the IndE produced by speakers with different backgrounds (Maxwell & Fletcher, 2009; Pickering & Wiltshire, 2000; Wiltshire & Harnsberger, 2006), or that L1 effects on IndE are minimal and may reflect the influence of sociolinguistic factors (Sirsa & Redford, 2013; Wiltshire & Moon, 2003).

Since the present study controlled for the age of acquisition of the target language and focused on educated speakers who had attended, with some exceptions, English-medium private schools, it is unlikely that the differences between the Telugu and Marathi English groups can be attributed to divergent proficiency levels. As a matter of fact, a sizeable minority of Indians is first exposed to English at the age of six in primary school, and this seems to be early enough for them to acquire an ‘accentless’ variety of English (Flege & Fletcher, 1992; Long, 1990). Thus, the present results prompt two questions: do regional varieties of IndE represent separate varieties with a distinct and stable phonology? Or are subtle L1 effects caused by regional identities and regional variation? Although the results of the present study do not provide answers to these questions, there seems to be evidence that such differences are due to sociolinguistic factors including identity (Regnoli, 2021), regional variation and regional socio-political pressures (Sirsa & Redford, 2013; Wiltshire, 2005, 2020; Wiltshire & Moon, 2003).

With regard to VtoV values, the results of the experiments were statistically too weak to support any major conclusions concerning the classification of the speech rate of Marathi English and Telugu English. Nevertheless, it would appear that more syllable-timed varieties exhibit higher VtoV values (corresponding to slower speech) as a result of their more complex consonantal clusters. Most probably, this is due to the fact that the measure is not adjusted for speech rate (see also Pettorino & Pellegrino, 2016; Pettorino et al., 2013).

Moreover, both groups included two Indians who had been living in Heidelberg long enough to learn German to some degree (A2–B1 level). Interestingly, the %V values of the L2 German speakers were the lowest in each group (PD 42.4, AK 42.7, Marathi English; DV 45.4, PG 45.6, Telugu English). This result may suggest some German influence on the timing patterns of those Marathi English and Telugu English speakers, although further research is needed to test this hypothesis. Finally, the timing differences between Marathi and Telugu English found in this study could account for the folk perception in the community (described in Sect. 3.2), according to which Southern Indians have a distinctive accent.
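The claim that the two IndE groups lie further apart than Marathi English and London English can be checked for %V with the group means quoted in the text (46.3 and 43.0 in Sect. 5.2, 41.1 in this section). The following is only a worked illustration of that arithmetic, in percentage points; the rounded Telugu English mean from Sect. 5.2 is used rather than the 46.38 given above, and the corresponding VtoV comparison cannot be reproduced here because the London English VtoV mean is not restated in this section.

```latex
\begin{aligned}
\%V_{\text{Telugu}} - \%V_{\text{Marathi}} &= 46.3 - 43.0 = 3.3 \\
\%V_{\text{Marathi}} - \%V_{\text{London}} &= 43.0 - 41.1 = 1.9
\end{aligned}
```

On these figures, the gap between the two IndE groups is roughly 1.7 times the gap between Marathi English and London English.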
7 Conclusion and Outlook

The results of the present small-scale study on rhythmic contrasts in IndE suggest that Telugu English is significantly more syllable-timed than both Marathi English and London English. Crucially, the rhythmic difference between the two IndEs is larger than the difference between Marathi English and London English. This result shows that the %V/VtoV model captures an important aspect of the rhythmic classification of English varieties. Significantly, it also implies that the Telugu English group exhibits a distinctive pattern at the rhythmic level in relation to Marathi English. Although further research is needed, this may confirm the folk perception in the community according to which Southern Indians have a distinctive accent when speaking English. Furthermore, considering that the speakers in this study were university students, it is likely that they speak a near-standard variety of IndE. As regards the IndE/London English rhythmic contrast, the results also substantiate previous descriptions of IndE as more syllable-timed than BrE (Fuchs, 2014, 2016; Sirsa & Redford, 2013).

Future research may investigate the linguistic stereotypes in this student community by taking into account more integrated models that include parameters other than duration (i.e. intensity, loudness and sonority; Fuchs, 2016; Galves et al., 2002; Low, 1998) and by extending the analysis to pitch, F0 and other stress correlates (Cumming, 2010; Fuchs, 2016; He, 2012). Moreover, starting from speakers’ folk descriptions of Southern IndE varieties (see Sect. 3.2), an investigation of rhythm in their first languages would hopefully (i) shed new light on the acquisition of L1 transfer and (ii) contribute to the controversial debate on the rhythmic classification of Indian languages.
Appendix A: Details on Subjects

Marathi Speakers
Speaker | English education | Years in HD | Age | Sex | Faculty | German
AC | English-medium Government schools | 1 | 24 | F | Applied Computer Science | No
PC | Marathi-medium Private schools | 2 | 26 | M | International Business and Engineering | No
PD | English-medium Government schools | 0.8 | 24 | M | International Business and Engineering | Yes
SN | English-medium Private schools | 0.1 | 26 | M | Engineering | No
AK | Marathi until 5th grade then English-medium Private schools | 2 | 27 | M | Engineering | Yes

Telugu Speakers
Speaker | English education | Years in HD | Age | Sex | Faculty | German
DS | English-medium Private schools | 1 | 24 | F | Information Technology | No
JD | English-medium Private schools | 2 | 25 | M | Information Technology | No
SA | English-medium Private schools | 1 | 25 | M | Information Technology | No
DV | English-medium Private schools | 2 | 27 | M | Mechanical Engineering | Yes
PG | English-medium Private schools | 2 | 23 | M | International Business and Engineering | Yes

London Speakers
Speaker | English education | Age | Sex | Faculty
KJ | Public School | 29 | F | MA Applied Imagination (Arts and Management)
SMD | Public School | 27 | M | Politics and International Studies
LJ | Public School | 27 | M | Jazz Institute
MV | Public School | 25 | M | Political Science of International Studies
DJ | Public School | 27 | M | Manufactured Engineering
Appendix B Reading passage, excerpt from Desai (2001: 5 [1980]) “That is the risk of coming home to Old Delhi”, she announced in the hard voice that had started up the prickle of distrust that ran over the tips of the hairs of Tara’s arms, rippling them. “Old Delhi does not change. It only decays. My students tell me it is a great cemetery, every house a tomb. Nothing but sleeping graves. Now New Delhi, they say is different. That is where things happen. The way they describe it, it sounds like a nest of fleas. So much happens there, it must be a jumping place. I never go. Baba never goes. And here, here nothing happens at all. Whatever happened, happened long ago—in the time of the Tughlaqs, the Khiljis, the Sultanate, the Moghuls—that lot.” She snapped her fingers in time to her words, smartly. “And then the British built New Delhi and moved everything out. Here we are left rocking on the backwaters, getting duller and greyer, I suppose. Anyone who isn’t dull and grey goes away—to New Delhi, to England, to Canada, the Middle East. They don’t come back”.
References

Abercrombie, D. (1967). Elements of general phonetics. Edinburgh University Press.
Arvaniti, A. (2009). Rhythm, timing and the timing of rhythm. Phonetica, 66(1–2), 46–63. https://doi.org/10.1159/000208930
Bertinetto, P. M. (1977). “Syllabic Blood”, ovvero l’italiano come lingua ad isocronismo sillabico. Studi di Grammatica Italiana, 6, 69–96.
Boersma, P., & Weenink, D. (2017). Praat: Doing phonetics by computer [Computer Programme]. Version 6.0.29. Retrieved May 24, 2017, from http://www.praat.org
Carter, P. M. (2005). Quantifying rhythmic differences between Spanish, English, and Hispanic English. In R. Gees (Ed.), Theoretical and experimental approaches to romance linguistics: Selected papers from the 34th linguistic symposium on romance languages (pp. 63–75). John Benjamins.
CIEFL. (1972). The sound system of Indian English monograph (Vol. 7). CIEFL.
Chaudhary, S. C. (1989). Some aspects of the phonology of Indian English. Jayaswal Press.
Consulate General of India. Retrieved July 28, 2017, from http://www.cgimunich.com/pages.php?id=12618
Crystal, D. (1995). Documenting rhythmical change. In J. Windsor Lewis (Ed.), Studies in general and English phonetics: Essays in honour of Professor J. D. O’Connor (pp. 174–179). Routledge.
Cumming, R. E. (2010). The language-specific integration of pitch and duration. Ph.D. dissertation, University of Cambridge.
Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11, 51–62.
Dellwo, V., Diez, F. G., & Gavalda, N. (2009). The development of measurable speech rhythm in Spanish speakers of English. Actes de XI Simposio Internacional de Comunicacion Social, 594–597.
Desai, A. (2001 [1980]). Clear light of day. Vintage.
Flege, J. E., & Fletcher, K. L. (1992). Talker and listener effects on the perception of degree of foreign accent. Journal of the Acoustical Society of America, 91, 370–389.
Fought, C. (2003). Chicano English in context. Palgrave/Macmillan.
Fuchs, R. (2014). Integrating variability in loudness and duration in a multidimensional model of speech rhythm: Evidence from Indian English and British English. Speech Prosody, 7, 290–294.
Fuchs, R. (2016). Speech rhythm in varieties of English. Springer.
Galves, A., Garcia, J., Duarte, D., & Galves, C. (2002). Sonority as a basis for rhythmic class discrimination. In Proceedings of Speech Prosody 2002, Aix-en-Provence (pp. 323–326).
Gargesh, R. (2004). Indian English: Phonology. In E. W. Schneider, K. Burridge, B. Kortmann, R. Mesthrie, & C. Upton (Eds.), A handbook of varieties of English (Vol. 1, pp. 992–1002). Mouton de Gruyter.
German Federal Statistical Office. Retrieved September 28, 2017, from https://www.destatis.de/DE/Publikationen/Thematisch/Bevoelkerung/MigrationIntegration/AuslaendBevoelkerung.html?nn=68748
Ghatage, M. M. (2013). Pronunciation problems of the Marathi speakers. Language in India, 13, 107–115.
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In C. Gussenhoven & N. Warner (Eds.), Laboratory phonology (Vol. 7, pp. 515–546). Mouton de Gruyter.
Gut, U. (2005). Nigerian English prosody. English World-Wide, 26(2), 153–177.
Gut, U. (2009). Non-native speech. A corpus-based analysis of phonological and phonetic properties of L2 English and German. Peter Lang.
He, L. (2012). Syllabic intensity variation as a quantification of speech rhythm: Evidence from both L1 and L2. In Speech Prosody, 6th International Conference, Shanghai (pp. 466–469).
HISA, Heidelberg Indian Students Association. Retrieved November 11, 2017, from http://hisaheidelberg.com
Kachru, B. B. (1983). The Indianization of English. The English language in India. Oxford University Press.
Kachru, B. B. (1986). The alchemy of English. University of Illinois Press.
Kayte, S., Mundada, M., & Kayte, C. (2015). Marathi text-to-speech synthesis using natural language processing. IOSR Journal of VLSI and Signal Processing, 5(6), 63–67.
Keane, E. (2004). Illustrations of the IPA: Tamil. Journal of the International Phonetic Association, 34, 111–116.
Khan, S. D. (2006). The intonation of South Asian languages: Towards a comparative analysis. In M. Menon & S. Syed (Eds.), Proceedings of FASAL-6, Amherst (pp. 23–36).
Krishnamurti, B., & Gwynn, J. P. L. (1985). A grammar of modern Telugu. Oxford University Press.
Krivokapić, J. (2013). Rhythm and convergence between speakers of American and Indian English. Laboratory Phonology, 1, 39–65.
Long, M. H. (1990). Maturational constraints on language development. Studies in Second Language Acquisition, 12(3), 251–285.
Low, E. E. K. (1998). Prosodic prominence in Singapore English. Ph.D. dissertation, University of Cambridge.
Marcus, S. M. (1981). Acoustic determinants of Perceptual-Center (P-Center). Perception and Psychophysics, 30(3), 405–408.
Maxwell, O. (2014). The intonational phonology of Indian English. University of Melbourne.
Maxwell, O., & Fletcher, J. (2009). Acoustic and durational properties of Indian English vowels. World Englishes, 28, 52–70.
Mohanan, T. (1989). Syllable structure in Malayalam. Linguistic Inquiry, 20, 589–625.
Morton, J., Marcus, S. M., & Frankish, C. (1976). Perceptual-Centres (P-Centers). Psychological Review, 83(5), 405–408.
Murty, L., Otake, T., & Cutler, A. (2007). Perceptual tests of rhythmic similarity: I. Mora rhythm. Language and Speech, 50(1), 77–99.
Nespor, M., Shukla, M., & Mehler, J. (2011). Stress-timed vs. syllable-timed languages. In M. van Oostendorp, C. J. Ewen, E. Hume, & K. Rice (Eds.), The Blackwell companion to phonology. Five volumes (pp. 1147–1159). Wiley-Blackwell.
Nishihara, T., & van de Weijer, J. (2011). On syllable-timed rhythm and stress-timed rhythm in World Englishes: Revisited. Bulletin of Miyagi University of Education, 46, 155–163.
Pettorino, M., Maffia, M., Pellegrino, E., Vitale, M., & De Meo, A. (2013). VtoV: A perceptual cue for rhythm identification. In P. Mertens & A. C. Simon (Eds.), Proceedings of the Prosody-Discourse Interface Conference 2013 (IDP 2013), Leuven (pp. 101–106).
Pettorino, M., & Pellegrino, E. (2016). %V and VtoV: An acoustic perceptual approach to the rhythmic classification of languages. In C. Bardel & A. De Meo (Eds.), Parler les langues romanes/Parlare le lingue romanze/Hablar las lenguas romances/Falando línguas românicas (pp. 13–28). Il Torcoliere.
Pickering, L., & Wiltshire, C. (2000). Pitch accent in Indian-English teaching discourse. World Englishes, 19, 173–183.
Pike, K. L. (1945). The intonation of American English. University of Michigan Press.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
Rao, G. U. M. (1996). A nonlinear analysis of syllable structure and vowel harmony in Telugu. PILC Journal of Dravidic Studies, 6, 55–84.
Reddy, N. K. (1979). Problems of syllable-division in Telugu. Work in Progress, 11, 135–140.
Regnoli, G. (2016). Indexicality and contextualisation. Linguistic, cultural and social stances of Indian English speakers in Heidelberg. MA dissertation, University of Naples L’Orientale.
Regnoli, G. (2021). Accent variation in Indian English: A folk linguistic study. Peter Lang.
Roach, P. (1982). On the distinction between “stress-timed” and “syllable-timed” languages. In D. Crystal (Ed.), Linguistic controversies (pp. 73–79). Edward Arnold.
Sailaja, P. (1999). Syllabic structure of Telugu. In J. J. Ohala, Y. Hasegawa, M. Ohala, D. Granville, & A. C. Bailey (Eds.), Proceedings of the XIVth International Congress of Phonetic Sciences (ICPhS), San Francisco (pp. 743–746).
Sailaja, P. (2009). Indian English. Edinburgh University Press.
Sailaja, P. (2012). Indian English: Features and sociolinguistic aspects. Language and Linguistics Compass, 6(6), 359–370.
Savithri, S. R. (2009). Speech rhythm in Indian languages. Talk at 41st ISHACON, Pune. http://ishaindia.org.in/web_020209/Rathna_oration.pdf
Sharma, D. (2005). Dialect stabilization and speaker awareness in non-native varieties of English. Journal of Sociolinguistics, 9, 194–224.
Sirsa, H., & Redford, M. A. (2013). The effects of native language on Indian English sounds and timing patterns. Journal of Phonetics, 41, 393–406.
Takeyasu, H., & Hattori, N. (2011). Language discrimination using low-pass filtered songs: Perception of different rhythm classes. International Conference of Phonetic Sciences, XVII, 1946–1949. Hong Kong.
Wiltshire, C. (2005). The “Indian English” of Tibeto-Burman language speakers. English World-Wide, 26(3), 291–303.
Wiltshire, C., & Harnsberger, J. D. (2006). The influence of Gujarati and Tamil L1s on Indian English: A preliminary study. World Englishes, 25, 91–104.
Wiltshire, C., & Moon, R. (2003). Phonetic stress in Indian English vs. American English. World Englishes, 22, 291–303.
White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
Wiltshire, C. (2020). Uniformity and variability in the Indian English accent. Cambridge University Press.
Yardi, V. V. (1998). English pronunciation for Marathi speakers. Saket Publication.
Rhythmic Patterns of Malaysian English Speakers
Stefanie Pillai, Anussyia Muthiah, and Wan Ahmad Wan Aslynn

Abstract Previous research on Malaysian English (MalE) has indicated that different ethnic groups produce some English segments differently, possibly because they have different first languages. However, thus far, there has not been any published study on the rhythmic patterns of different ethnic groups in Malaysia. The present study examines the rhythmic properties of speakers from three ethnic groups: Malays, Chinese and Indians. Since studies have shown that speaking contexts can affect rhythm, this study also investigates the extent to which different speaking styles (read and spontaneous speech) affect rhythm in MalE. The data comprised audio recordings of 12 female speakers from the three ethnic groups. The speakers, who were between 40 and 45 years old, were all fluent speakers of English based on their educational and professional backgrounds. The speakers were recorded in two speaking contexts. In the first one, they read a passage, and in the second context, they talked about themselves and their families. Two metrics were used to examine rhythm in both these speaking contexts: a normalised Pairwise Variability Index (nPVI) and VarcoV (the standard deviation of vocalic intervals divided by their mean). The results were compared across the three ethnic groups. Based on the two metrics, there were no significant differences among the three groups. There were also no significant differences between the two speaking contexts for all three groups. The findings suggest that there may be a common rhythmic pattern in MalE that cuts across ethnic groups.

Keywords Ethnic groups · Chinese · First language · Indians · Malay · Malaysian English · Pairwise Variability Index (nPVI) · Rhythm · Speaking contexts · VarcoV

S. Pillai (corresponding author), Faculty of Languages & Linguistics, Universiti Malaya, 50603 Kuala Lumpur, Malaysia
A. Muthiah, Centre for Foundation Studies, Universiti Tunku Abdul Rahman, 43000 Kajang, Selangor, Malaysia
W. A. Wan Aslynn, Department of Audiology and Speech-Language Pathology, Kulliyyah of Allied Health Sciences, International Islamic University Malaysia, 25200 Kuantan, Pahang, Malaysia
© Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_4
1 Introduction

Malaysia is a multilingual country with over 90 languages, and most Malaysians are at the very least bilingual, speaking, for example, a regional Malay dialect (e.g. Penang Malay) and/or their heritage languages (e.g. Hokkien). For many Malaysian Chinese, Mandarin is becoming the first language (Sim, 2012). For some urban Malaysian Indians, English has become their first language, mainly as a result of English-medium education, which began during the British administration of the country and continued until the early 1970s (David, 2005; Schiffman, 1995). Because English is taught in schools, many Malaysians can speak English, albeit with varying degrees of proficiency despite having learnt it from primary school. Many Malaysians speak the colloquial form of English. This colloquial variety, sometimes referred to as Manglish, tends to be seen as a deviant or ‘bad’ form of English. There is, in fact, a tendency to use the term Malaysian English (MalE) to refer solely to the colloquial variety of MalE rather than as an umbrella term comprising the various sub-varieties of spoken and written forms of English used in Malaysia (Pillai & Ong, 2018).
2 Malaysian English

Perhaps because of this perception, most descriptions of spoken MalE focus on the colloquial and learner varieties of MalE rather than on the acrolectal variety. There is still a dearth of studies on the variety used by Malaysians who are dominant or first language (L1) speakers of English. In general, there still appears to be an assumption that the speech of fluent speakers does not or should not differ much from standard spoken British English. In terms of research on spoken MalE, studies on the pronunciation of MalE have mainly focused on the segmental and auditory descriptions of its vowels and consonants (e.g. Baskaran, 2004; Phoon & Maclagan, 2009; Pillai, Mohd. Don, Knowles & Tang, 2010; Tan & Low, 2010). In contrast, the prosodic features of MalE, including its rhythm, are still relatively under-researched. Whilst previous studies have shown that there are differences in the way that Malaysians with different language backgrounds produce English sounds (e.g. Phoon et al., 2013), whether this extends to rhythmic patterns remains understudied. Traditionally, MalE is described as being syllable-timed (Baskaran, 2008; Tan & Low, 2014; Tongue, 1974). Context-wise, Baskaran (2008) suggests that informal spoken MalE is likely to be more syllable-timed than the variety used in more formal contexts (e.g. on television news). However, this has not been fully explored.
Thus far, published studies have mainly focused on the rhythmic patterns of Malay speakers of MalE, and not on the other ethnic groups in Malaysia. The present study attempts to fill this gap by examining the rhythmic patterns of MalE among the three main ethnic groups in Malaysia: Malay, Chinese and Indian. Ethnic categories are state-defined, with Malays (along with indigenous groups) making up around 69% of the population. Malaysian Chinese and Indians comprise 23% and 7%, respectively, of the total population of about 32.4 million people (www.dosm.gov.my). The majority of Malaysian Chinese and Indians came to Malaysia in the early nineteenth century. The Chinese sub-groups include Hokkien, Cantonese, Hakka and Teochew, while the majority of Malaysian Indians are Tamils from South India. Because studies have shown that particular speaking contexts can affect rhythm, this study also investigates rhythmic patterns in two speaking contexts: read text and spontaneous speech.

In the following sections, we explore previous research on rhythm with a focus on some of the metrics used to measure rhythm, on the main languages spoken by the ethnic groups (e.g. Malay, Mandarin, Cantonese, Tamil and English), as well as studies on the rhythm of second language (L2) speakers. We then present the research methods and findings of this study. This is followed by a discussion of the findings. The paper concludes by summarising the findings and recommending further studies on the rhythm of MalE.
3 Measurements of Rhythm

Ramus et al. (1999) focused on the durational variability of consonants and vowels. They measured the percentage of total vocalic intervals over the entire duration of an utterance (%V) and the standard deviation of vocalic (ΔV) and consonantal (ΔC) intervals in an utterance. According to Ramus et al. (1999), the values for both ΔC and %V are acceptable indicators of the rhythmic patterns of any given language, as stress-timed languages were found to have high ΔC and low %V, and syllable-timed ones the inverse. These metrics, however, did not account for differences in speaking rates. Further, as Arvaniti (2009) points out, Grabe and Low (2002) showed that languages can be categorised differently when different metrics are used.

Grabe and Low (2002) proposed a Pairwise Variability Index (PVI), which is obtained by calculating the difference in duration between successive intervals. Using the PVI, they found higher nPVI-V scores for languages like English and German, which are considered to be stress-timed. Lower scores were found for typically syllable-timed languages like French and Spanish. The issue was with languages that were placed between stress- and syllable-timed languages, such as Malay and Tamil. Based on this finding, Grabe and Low (2002) suggested that there may be no outright distinction between the two categories of rhythm.

The categorisation of languages as stress- or syllable-timed is also affected by the speaking context used to elicit data. Arvaniti (2009), for instance, argues that metrics like the nPVI are affected by whether the data comprise read (e.g. Low, 1994, 1998; Low et al., 2000) or spontaneous speech (Deterding, 2001; Nokes & Hay, 2012; Torgersen & Szakay, 2012).

Other metrics were developed in an attempt to address the problem of categorising rhythm. Among them are VarcoC and VarcoV, proposed by Dellwo (2006): the standard deviation of consonantal or vocalic interval durations divided by the mean consonantal (VarcoC) or vocalic (VarcoV) duration, respectively. VarcoC and VarcoV aim to control for varying speech rates across speakers. White and Mattys (2007) report that VarcoV is a good predictor of the difference between stress- and syllable-timed languages. Knight (2011) also found that vowel-based metrics were more reliable than consonant-based ones in a study that applied seven metrics (nPVI-V, rPVI-C, ΔV, ΔC, VarcoV, VarcoC and %V) to the same material.

Fuchs (2016), on the other hand, went further and included other acoustic features (e.g. duration, loudness and sonority) in a multidimensional model which included both production and perception data. The significance of this model is that it provides insights into the various acoustic correlates of rhythm, beyond vocalic and consonantal intervals, that appear to be more prominent in a language. For example, Fuchs (2016) found that loudness and duration are used together in British English (BrE) as cues to prominence, whereas Indian English (IndE) uses one over the other: “increases in duration are less often accompanied by increases in loudness” (Fuchs, 2016: 209). This multidimensional model presents a way forward for future research on rhythm, especially for comparisons among varieties of a language and among different languages.
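The duration-based metrics described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the procedure used in any of the cited studies: it assumes the commonly cited formulations (the nPVI as 100 times the mean of the absolute difference between successive durations divided by their mean; Varco as 100 times the standard deviation divided by the mean), and it uses the population standard deviation, which is a choice made here for simplicity. Function names and the toy durations are hypothetical.

```python
from statistics import mean, pstdev

def rpvi(durations):
    """Raw PVI: mean absolute difference between successive durations."""
    return mean([abs(a - b) for a, b in zip(durations, durations[1:])])

def npvi(durations):
    """Normalised PVI: each difference is divided by the mean of its pair."""
    terms = [abs(a - b) / ((a + b) / 2)
             for a, b in zip(durations, durations[1:])]
    return 100 * mean(terms)

def delta(durations):
    """Delta-V or Delta-C: standard deviation of interval durations."""
    return pstdev(durations)

def varco(durations):
    """VarcoV or VarcoC: standard deviation as a percentage of the mean."""
    return 100 * pstdev(durations) / mean(durations)

# Made-up vocalic durations in seconds, for illustration only.
even_vowels = [0.10, 0.10, 0.10, 0.10]          # no durational contrast
alternating_vowels = [0.10, 0.20, 0.10, 0.20]   # short/long alternation

print(npvi(even_vowels), varco(even_vowels))
print(round(npvi(alternating_vowels), 2), round(varco(alternating_vowels), 2))
```

In this toy example the evenly timed sequence yields zero on both metrics, while the alternating sequence yields clearly higher values. This matches the direction described above, where greater durational contrast between successive vowels (higher nPVI-V, higher VarcoV) is associated with stress-timing. Because both metrics divide by a mean duration, doubling every duration leaves them unchanged, which is the sense in which they control for speech rate.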
4 Rhythm in Malay

The rhythmic pattern of Malay is usually described as being syllable-timed (Maris, 1980; Mohd. Onn, 1980; Teoh, 1994). However, Grabe and Low (2002) could not position Malay in either category of timing. This finding, which was based on PVI measurements taken from the speech sample of a single speaker of Malay, runs contrary to other observations. For example, Wan Aslynn (2012) found that the rhythm of Malay was on the syllable-timed end of the continuum, based on recordings of twenty Malay speakers reading a list of ten Malay sentences. Deterding (2011) also found that Malay is more syllable-timed than, for instance, BrE. He does, nevertheless, express concerns about the use of the PVI to reach this conclusion due to issues of syllabification, the determination of vowel onset and offset, and speaking rate.
5 Rhythm in Chinese

Studies on rhythm in Chinese languages have mainly focused on Mandarin and Cantonese. Mandarin (Grabe & Low, 2002; Lin & Wang, 2007) and Cantonese (Mok & Dellwo, 2008) have been categorised as syllable-timed. For instance, Mok and Dellwo (2008) compared recordings of six native speakers each of Hong Kong Cantonese and Beijing Mandarin reading and re-telling (a semi-spontaneous speech context) the North Wind and the Sun (NWS) story. They found that all the rhythmic measures they used indicated that both Cantonese and Mandarin were syllable-timed, with the former being more syllable-timed. The %V values, for example, were higher for both these Chinese languages than for English, German, Italian and French. The %V values were also lower for the semi-spontaneous speech context than for read speech in both Cantonese and Mandarin.
6 Rhythm in Tamil

Unlike descriptions of Cantonese and Mandarin, previous accounts of rhythm in Tamil have been inconclusive. Keane (2006) points out that this is indicative of the challenge of categorising a language as either stress- or syllable-timed. Keane (2006: 309) used a variety of metrics to examine the rhythm of colloquial and formal Tamil and found that the standard deviation (SD) and PVI of consonantal intervals were significantly higher for formal Tamil. However, there was no significant difference for the nPVI-V. Keane (2006) suggests that these findings may be due to cross-word consonant clusters and the slower speech rate in formal Tamil. As with Malay, Grabe and Low (2002) found that Tamil had a vocalic nPVI value that placed it among languages that were neither clearly syllable- nor stress-timed.
7 Rhythm in Varieties of English The issues raised in relation to the discrepant findings on rhythm by different metrics can be seen in attempts to classify different varieties of English. ‘New’ varieties of English, including MalE, tend to be classified as syllable-timed compared to ‘native’ varieties (typically the ‘inner circle’ varieties) (Mesthrie & Bhatt, 2008). Fuchs (2016), for example, found educated IndE to be more syllable-timed compared to BrE. InE had lower nPVI-V and VarcoV values for both read and spontaneous speech, and there were significant differences in the values between the two varieties of English for both speaking contexts. Among the characteristic that led to the perception of IndE being more syllable-timed was that there was less variation in vocalic durations compared to BrE. Similarly, in Singapore English (SgE), successive vowels were found to not differ much in terms of duration (Low et al., 2000). SgE was found to have smaller PVI values compared to BrE (Low et al., 2000). Deterding (2001) and Tan and Low (2014) also found that SgE was syllable-timed. However, Tan and Low (2014) found that MalE, which is perceived to be similar to SgE, was even more syllable-timed based on their PVI values. There was less vowel reduction in MalE compared to SgE in both read and spontaneous speech. Thus, there were differences in terms of how syllable-timed these two groups of Malay
speakers were. Both groups of speakers were undergraduate students in their home countries, and were similar in age, but what distinguished them was their medium of instruction in schools, i.e. English in Singapore and Malay in Malaysia. Tan and Low (2014: 211) suggest that “the difference in the educational and social environments of Malaysia and Singapore” could account for the different rhythmic patterns in MalE and SgE found in their study. Based on their findings, it was felt that the PVI was better at distinguishing differences in rhythmic patterns between these two varieties of English. The PVI was also able to show that Maori English is more syllable-timed than Pakeha English in New Zealand (Szakay, 2006).
In other varieties of English, the results are not always conclusive. Mok and Dellwo (2008) found contradictory results for Cantonese English and Mandarin English using different metrics to measure rhythm. With VarcoC and %V, the values suggested that these two varieties of English were more syllable-timed. Other metrics, such as ΔC and nPVI-V, indicated that they were more stress-timed. These discrepancies were attributed to a lower speaking rate said to be common among L2 speakers and the lengthening of syllables (Mok & Dellwo, 2008). In general, it does appear that Cantonese speakers of English are likely to sound more syllable-timed because of the lack of vowel reduction and of differentiation between the durations of syllables (Setter, 2006).
Other studies have indicated the possible influence of one’s L1 on L2. Sarmah et al. (2009) found that L2 speakers of American English whose L1 was Thai had rhythmic values (based on PVI-V and %V values) that were very different from those of L1 English speakers. Gut (2003) also found that speakers’ L1 seemed to influence the production of German as an L2.
However, Gut (2012) cautions that different speakers may yield different results when it comes to distinguishing native and nonnative varieties of the same language because speakers who are acquiring an L2 may or may not transfer the rhythmic patterns of their L1 to their L2. In fact, Li and Post (2014) point out that research in L2 prosody fails to give consistent evidence for rhythmic differences in L2 speech. This may in part be due to the metrics used.
8 Methods
The following sections describe the method used in this study to examine the rhythmic patterns among three ethnic groups in two speaking contexts.
8.1 Speakers
Twelve female Malaysian speakers from three different ethnic groups were selected for this study: four Malays, four Chinese and four Indians. These ethnic categories are based on state-defined ones. The reason for selecting these ethnic groups is that they are the three major ethnic groups in Malaysia.
The average age of the speakers was 43 years. Purposeful sampling was used to select the speakers based on the following criteria:
• Born and grew up in Malaysia.
• Primary and secondary education in national schools in Malaysia.
• Fluent speakers of English based on their tertiary qualifications (Bachelor’s degree and higher in English language teaching or English Literature, and Linguistics), job experience (English language lecturers at a university) and frequency of using English in different contexts.
Based on these criteria, a total of 14 out of 30 potential speakers who were identified agreed to be recorded. Almost all of the potential speakers were females, and the 14 speakers who consented to participate in the study were all females. For this study, we only selected those who identified their ethnic group as Malay, Indian and Chinese, and hence, two of the 14 speakers were removed.
The L1 of all four Malay speakers was Malay. Cantonese was the L1 of two of the Chinese speakers, while Hokkien was the L1 of the remaining two Chinese speakers. Language shift to English among Malaysian Indians was mentioned in the Introduction section and thus, not surprisingly, three of the Indian speakers said that English was their L1. Only one of the speakers said that Tamil was her L1. English was the home and dominant language for the speakers who declared English as their L1 (see Pillai, 2006). These speakers grew up speaking English rather than their South Indian heritage languages and are not fluent in these languages (Tamil for two of them and Malayalam for the other). They are the first generation in their families to use English as an L1. For all the other speakers, especially the non-Malay speakers, English was used extensively at home with their children.
All twelve speakers of this study are similar in terms of educational backgrounds and professions as all of them are English language lecturers with more than 10 years of experience of teaching English at the time of the recordings.
8.2 Data
Since differences have been reported in the prosodic features of read speech and conversational speech (e.g. Howell & Kadi-Hani, 1991), this study examined rhythm in two speaking contexts: read text and spontaneous speech. For the read speech context, the subjects were recorded reading the NWS passage while the spontaneous speech consisted of a short interview with the speakers about themselves and their families.
8.3 Procedure
All recordings were carried out in a quiet room. The Kay Elemetrics Computerized Speech Lab (CSL) Model 4500 was used to record the speakers at a sampling rate of 44,100 Hz using a high-quality dynamic microphone placed a few inches from the mouth of the speakers. Written consent was obtained from all speakers prior to the recordings. For the read text, speakers were provided with the text to read through once so that they were familiar with it before being recorded. As for the spontaneous speech context, questions were posed to the speaker to elicit responses about themselves and their families.
8.4 Segmentation of Speech
The NWS passage consisted of five sentences, 113 words and 142 syllables. All five sentences were analysed from the read text of each speaker. These recordings, which had been transcribed orthographically, were further examined and annotated using Praat version 5.3.82 (Boersma & Weenink, 2014). The recordings for each speaker were segmented in text grids using Praat into the following elements: text, vocalic and consonantal units. As shown in Fig. 1, tier 1 contains the text and tiers 2 and 3 show the segmented consonantal and vocalic elements respectively. The locations of boundaries for vocalic and consonantal units were identified and labelled based on the wideband spectrograms (White & Mattys, 2007). Durations were measured from left to right for vocalic and intervocalic intervals. Vowels in sequences, such as in fricative-vowel and vowel-nasal, were identified following the criteria used by Grabe and Low (2002) where applicable. In fricative-vowel sequences, such as in the word ‘sun’, the vowel was measured from the point where the noise pattern of the voiceless fricative /s/ ended. Vowel-nasal sequences (e.g. disputing) were segmented by observing the formant movement of both the nasal and the vowel. Similar to Deterding (2001) and Tan and Low (2014), phrase-final syllables were excluded for both read and informal spontaneous speech to avoid the effect of phrase-final syllable lengthening on the PVI measurements. Further, pauses between intonational phrases were excluded from the analysis.
The 12 interviews consisted of 33 utterances each. These utterances were determined from the time speakers started speaking until the point that they paused or were silent. The utterances produced by the speakers could be a word, a short phrase, or a sentence. The whole interview was analysed since these interviews were short, with an average of 1 min and 49 s per interview. The recording for each speaker was segmented in the same way as the read speech.
Other considerations included taking into account contracted forms, for example, the contraction of the words they are to they’re (Deterding, 2001). Further, in spontaneous speech, it is common for speakers to hesitate (e.g. silent or filled pauses), but hesitations were not measured. Repetitions of lexical items in a stretch of speech were also excluded from the analysis of the current study. The first correct lexical item produced by the speaker was taken into
Fig. 1 Screenshot of annotations in Praat for read speech
consideration. Other forms of interruptions, such as laughter and also silent pauses of three hundred milliseconds and above, were also excluded from the analysis.
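The exclusion rules described in this section can be sketched in code. The following is a minimal illustration only, not the procedure actually used in the study: it assumes the annotated intervals have already been exported from the text grids as (label, start, end) records, and the labels "V", "C" and "" (silence), the `Interval` class and both function names are hypothetical. Dropping the last vocalic interval of each phrase is used here as a simple stand-in for excluding the phrase-final syllable.

```python
from dataclasses import dataclass

PAUSE_THRESHOLD_S = 0.300  # silent pauses of 300 ms and above are excluded


@dataclass
class Interval:
    label: str    # hypothetical labels: "V" vocalic, "C" consonantal, "" silence
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def split_into_phrases(intervals):
    """Split one annotated tier into phrases at silences of 300 ms or longer."""
    phrases, current = [], []
    for iv in intervals:
        if iv.label == "":
            # A long silence closes the current phrase; shorter silences
            # are skipped without splitting (an assumption of this sketch).
            if iv.duration >= PAUSE_THRESHOLD_S and current:
                phrases.append(current)
                current = []
            continue  # silences themselves are never measured
        current.append(iv)
    if current:
        phrases.append(current)
    return phrases


def vocalic_durations(phrase):
    """Vocalic interval durations of a phrase, minus the last one.

    Dropping the final vocalic interval approximates the exclusion of the
    phrase-final syllable (a simplification made for this sketch).
    """
    vowels = [iv.duration for iv in phrase if iv.label == "V"]
    return vowels[:-1]


# Invented toy tier: two phrases separated by a 450 ms pause.
tier = [
    Interval("C", 0.00, 0.08), Interval("V", 0.08, 0.20),
    Interval("C", 0.20, 0.30), Interval("V", 0.30, 0.45),
    Interval("", 0.45, 0.90),
    Interval("V", 0.90, 1.00), Interval("C", 1.00, 1.10),
    Interval("V", 1.10, 1.30),
]
for phrase in split_into_phrases(tier):
    print([round(d, 3) for d in vocalic_durations(phrase)])
```

Hesitations, repetitions and laughter would need their own labels in a fuller version; they are left out here because the text does not say how they were marked in the annotation.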
8.5 Measurements
The vocalic intervals and syllable durations were measured based on the spectrograms and the auditory segmentation. In this study, two metrics, nPVI-V and VarcoV, were used to measure vocalic variability in the two speaking styles produced by the speakers from the three ethnic groups. As previously explained, the PVI is used to measure the differences in duration of successive intervals. The nPVI is the mean of the absolute differences between successive intervals, each divided by the mean duration of the two intervals concerned, multiplied by 100 (Grabe & Low, 2002); nPVI-V is the normalised Pairwise Variability Index for vocalic intervals. VarcoV is calculated as the standard deviation of vocalic interval duration divided by mean vocalic interval duration and then multiplied by 100. These two metrics were selected as they have been found to be “robust to variation in speech rate and relatively robust to variation in sentences, speakers and transcribers” (Fuchs, 2016: 56).
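As an illustration of the two metrics, the sketch below computes nPVI-V and VarcoV from a list of vocalic interval durations. It follows the standard formulation of the nPVI in Grabe and Low (2002), in which each absolute difference between successive intervals is divided by the mean of that pair, and the definition of VarcoV given in the text. The function names are hypothetical, and the use of the population standard deviation is an assumption, since the text does not state which variant was used.

```python
from statistics import mean, pstdev


def npvi(durations):
    """Normalised Pairwise Variability Index for a sequence of durations."""
    if len(durations) < 2:
        raise ValueError("nPVI needs at least two intervals")
    # |d_k - d_(k+1)| divided by the mean of the pair, for each successive pair
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:])]
    return 100 * mean(terms)


def varco(durations):
    """Standard deviation divided by the mean, multiplied by 100.

    Uses the population SD (an assumption); swap in statistics.stdev
    for the sample SD.
    """
    return 100 * pstdev(durations) / mean(durations)


# Invented durations in ms: strictly alternating long and short vowels.
alternating = [100, 50, 100, 50]
print(round(npvi(alternating), 2), round(varco(alternating), 2))

# Perfectly even durations give zero variability on both metrics.
even = [80, 80, 80, 80]
print(npvi(even), varco(even))
```

Because both metrics are normalised by a mean duration, multiplying every duration by a constant (for example, speaking uniformly more slowly) leaves the values unchanged, which is the sense in which they are described as robust to speech rate.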
9 Findings
Table 1 presents the overall nPVI-V and VarcoV values for the Malay, Chinese and Indian speakers in the two speaking contexts. A one-way ANOVA indicated that there were no significant differences between the mean nPVI-V values among the
three ethnic groups for read [F(2, 9) = 0.64, p = 0.55] and spontaneous speech [F(2, 9) = 0.38, p = 0.69]. As shown in Table 1, the average nPVI-V values were slightly higher in read speech for the Chinese speakers. The Indian speakers had lower average nPVI-V values in spontaneous speech. Consistent with these findings, no significant differences were found between the nPVI-V for read (M = 56.46, SD = 2.87) and spontaneous speech (M = 54.48, SD = 5.36): t(11) = 1.22, p = 0.25 based on a two-tailed correlated samples t-test.
A one-way ANOVA indicated that there were no significant differences between the mean VarcoV values among the three ethnic groups for read [F(2, 9) = 2.08, p = 0.18] and spontaneous speech [F(2, 9) = 0.59, p = 0.57]. As can be seen in Table 1, the difference between the average VarcoV values for both speech contexts among Chinese speakers was 7.93 with a higher VarcoV value in spontaneous speech. Higher VarcoV values were also found for the other two groups with a difference of 1.07 for the Indian group and 8.76 for the Malay one. No significant differences were found in terms of vocalic variability for read (M = 62.82, SD = 5.96) and spontaneous speech (M = 68.73, SD = 10.04): t(11) = 1.95, p = 0.08 based on a two-tailed correlated samples t-test. Figures 2 and 3 show the cross comparisons of VarcoV and nPVI-V for read and spontaneous speech respectively. The two metrics, nPVI-V and VarcoV, were not significantly correlated in either read (r = 0.01, p = 0.97) or spontaneous speech (r = 0.56, p = 0.06).
Table 1 Average nPVI-V and standard deviation values
| Speakers | Read speech: nPVI-V | Read speech: VarcoV | Spontaneous speech: nPVI-V | Spontaneous speech: VarcoV |
| --- | --- | --- | --- | --- |
| Chinese 1 | 53.56 | 77.70 | 50.39 | 71.80 |
| Chinese 2 | 56.00 | 62.12 | 57.09 | 65.80 |
| Chinese 3 | 55.62 | 59.22 | 56.82 | 67.05 |
| Chinese 4 | 51.25 | 62.51 | 56.83 | 88.62 |
| Average | 54.11 (2.19) | 65.39 (8.34) | 55.28 (3.27) | 73.32 (10.53) |
| Indian 1 | 57.68 | 64.99 | 43.88 | 49.01 |
| Indian 2 | 59.22 | 66.63 | 59.88 | 80.24 |
| Indian 3 | 60.46 | 66.23 | 54.40 | 66.31 |
| Indian 4 | 61.30 | 61.19 | 58.03 | 67.74 |
| Average | 59.67 (1.57) | 64.76 (2.48) | 54.05 (7.15) | 65.82 (12.84) |
| Malay 1 | 56.38 | 58.53 | 50.87 | 68.53 |
| Malay 2 | 54.28 | 61.97 | 48.41 | 70.07 |
| Malay 3 | 53.84 | 53.34 | 52.59 | 56.73 |
| Malay 4 | 57.92 | 59.36 | 64.61 | 72.92 |
| Average | 55.61 (1.90) | 58.30 (3.61) | 54.12 (7.20) | 67.06 (7.13) |
| Overall average | 56.46 (54.96) | 62.82 (5.96) | 54.48 (5.60) | 68.73 (10.04) |
Note Standard deviations (SD) are in parentheses
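For readers who wish to retrace the within-speaker comparisons, the sketch below recomputes a correlated-samples (paired) t statistic and a correlation coefficient from the per-speaker values in Table 1. It is an illustration under stated assumptions rather than the authors' analysis script: the function names are hypothetical, the correlation is assumed to be Pearson's r (the text does not name the coefficient), and p-values are not computed because that would require the t distribution.

```python
from math import sqrt
from statistics import mean, stdev

# Per-speaker values copied from Table 1, in the order
# Chinese 1-4, Indian 1-4, Malay 1-4.
NPVI_READ = [53.56, 56.00, 55.62, 51.25, 57.68, 59.22,
             60.46, 61.30, 56.38, 54.28, 53.84, 57.92]
NPVI_SPON = [50.39, 57.09, 56.82, 56.83, 43.88, 59.88,
             54.40, 58.03, 50.87, 48.41, 52.59, 64.61]
VARCO_READ = [77.70, 62.12, 59.22, 62.51, 64.99, 66.63,
              66.23, 61.19, 58.53, 61.97, 53.34, 59.36]


def paired_t(x, y):
    """t statistic for paired samples: mean difference over its standard error."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


def pearson_r(x, y):
    """Pearson product-moment correlation coefficient."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


t = paired_t(NPVI_READ, NPVI_SPON)    # 12 speakers, so df = 11
r = pearson_r(NPVI_READ, VARCO_READ)  # nPVI-V against VarcoV, read speech
print(f"t(11) = {t:.2f}, r = {r:.2f}")
```

With the Table 1 values, this gives t ≈ 1.22 for nPVI-V across the two speaking contexts and r ≈ 0.01 for the two metrics in read speech, in line with the figures reported in the text.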
Table 2 nPVI-V and VarcoV values in some varieties of English
| Variety | nPVI-V: Read | nPVI-V: Spontaneous speech | VarcoV: Read | VarcoV: Spontaneous speech |
| --- | --- | --- | --- | --- |
| BrE (Fuchs, 2016) | 61.3 | 58.3 | 53.2 | 51.7 |
| IndE (Fuchs, 2016) | 55.6 | 52.4 | 46.3 | 45.7 |
| MalE (present study) | 56.46 | 54.48 | 62.82 | 68.73 |
| MalE (Tan & Low, 2014) | 41.21 | − | 37.49 | − |
| SgE (Tan & Low, 2014) | 47.3 | − | 44.15 | − |
Fig. 2 VarcoV and nPVI-V for read speech for all speakers
Fig. 3 VarcoV and nPVI-V for spontaneous speech for all speakers
10 Discussion
The analysis of nPVI-V and VarcoV values indicates no significant differences among the three ethnic groups for both read text and spontaneous speech, which suggests that the three groups had similar rhythmic patterns. There was also no significant difference between the two speaking contexts, suggesting that both had similar rhythmic patterns. There was a tendency for the nPVI-V and VarcoV values to be slightly lower (but not significantly) in the spontaneous speaking context indicating less vocalic variability in this speech context compared to the read text. As previously mentioned, there is a tendency for ‘new’ varieties of English to be categorised as syllable-timed compared to a variety like BrE, suggesting that there is less variability in these varieties. However, the values for both speaking contexts were higher than those reported in previous studies. Tan and Low (2014), for example, reported much lower PVI and VarcoV values for Malaysian and Singapore English (see Table 2). The PVI-V value for IndE for both read and spontaneous speech (Fuchs, 2016), on the other hand, is closer to the ones in this study. However, this was not the case for VarcoV. The PVI-V values for IndE reported by Fuchs (2016) are lower than those he reports for BrE. While it is acknowledged that Fuchs’s results are not based on exactly the same type of data as the one in this study (see Table 2), they can provide an overview of where MalE is positioned on the stress- and syllable-timed continuum. The figures in Table 2 suggest that MalE has a similar rhythmic quality to IndE based on nPVI-V values, indicating that, like IndE, MalE has less vocalic variation than BrE and is therefore more syllable-timed. The differences between the current study and other studies on MalE are the age group of the speakers, educational backgrounds, dominant home language and professions. These factors may have affected the rhythmic patterns found in both speech styles in this study.
Firstly, the speakers in Tan and Low (2014) were students in their twenties whereas the speakers of the current study are English lecturers with an average age of 43. Secondly, the dominant home language for each speaker is different as some use English as their dominant home language. Thirdly, they are from different educational and professional backgrounds. The results of the present study suggest that despite different declared L1s, including English, the Malaysian speakers in this study exhibited similar rhythmic patterns.
11 Conclusion
In sum, it can be concluded that all three ethnic groups exhibited similar rhythmic patterns to each other, suggesting a common pattern for MalE. There was also no discernible difference in the rhythmic patterns of the two speaking contexts: speakers seemed to have similar patterns when reading a text and speaking spontaneously. Despite the higher PVI and VarcoV values than in a previous study, MalE can still be classified as being more syllable-timed.
However, the two metrics used in this study were not able to capture the prominent features in the acoustic signal which were regulated across the data in order to describe the rhythmic property of MalE. In line with Arvaniti (2009), a more robust speech feature which is able to withstand the influence of those independent variables might be more accurate in characterising the rhythm of any language. This may mean that researchers might need to go beyond segmental duration measurements. Wan Aslynn (2012) and Fuchs (2014) suggest that features like syllable duration, inter-intensity minima duration and single vocalic duration are worth looking into in order to underpin the regularity across speakers and sentences. In addition, the results of this study need to be treated carefully due to the size of the sample and the speaking contexts used to elicit the data. The status of English for the speakers (i.e. fluent bi-/multilinguals and dominant speakers of English) may also have influenced the findings. Thus, future studies on rhythm in Malaysian English should include a bigger and more diverse sample. For example, such studies should look at whether there are differences between fluent speakers and English language learners or speakers for whom English is an L1.
References
Arvaniti, A. (2009). Rhythm, timing and the timing of rhythm. Phonetica, 66, 46–63.
Baskaran, L. M. (2004). Malaysian English. Morphology and syntax. In B. Kortmann, K. Burridge, R. Mesthrie, E. W. Schneider, & C. Upton (Eds.), A handbook of varieties of English (Vol. 2, pp. 1073–1085). Mouton de Gruyter.
Baskaran, L. M. (2008). Malaysian English – Phonology. In R. Mesthrie (Ed.), Varieties of English: Africa, South and Southeast Asia (pp. 278–291). Mouton de Gruyter.
Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer (version 5.3.82). Retrieved from http://fon.hum.uva.nl/praat/.
David, M. K. (2005). Reasons for language shift in Peninsular Malaysia. Journal of Modern Languages, 15(1), 1–11.
Dellwo, V. (2006). Rhythm and speech rate: A variation coefficient for deltaC. In P. Karnowski, & I. Szigeti (Eds.), Language and language processing: Proceedings of the 38th Linguistic Colloquium (pp. 231–241). Piliscsaba 2003. Frankfurt: Peter Lang.
Deterding, D. (2001). The measurement of rhythm: A comparison of Singapore and British English. Journal of Phonetics, 29(2), 217–230.
Deterding, D. (2011). Measurements of the rhythm of Malay. Proceedings of the 17th International Congress of Phonetic Sciences, Hong Kong, 17–21 August 2011, 576–579.
Fuchs, R. (2014). Integrating variability in loudness and duration in a multidimensional model of speech rhythm. Evidence from Indian English and British English. In Proceedings of speech prosody 7. Dublin (pp. 290–294). Retrieved from https://www.isca-speech.org/archive_v0/SpeechProsody_2014/pdfs/47.pdf
Fuchs, R. (2016). Speech rhythm in varieties of English: Evidence from educated Indian English and British English. Springer.
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In C. Gussenhoven, & N. Warner (Eds.), Laboratory phonology (Vol. 7, pp. 515–546). Mouton de Gruyter.
Gut, U. (2003). Prosody in second language speech production: The role of the native language. Fremdsprachen Lehren Und Lernen, 32, 133–152.
Gut, U. (2012). Rhythm in L2 speech. In D. Gibbon (Ed.), Speech and language technology (pp. 83–94). Cambridge University Press.
Howell, P., & Kadi-Hani, K. (1991). Comparison of prosodic properties between read and spontaneous speech material. Speech Communication, 10(2), 161–169.
Keane, E. L. (2006). Rhythmic characteristics of colloquial and formal Tamil. Language and Speech, 49(3), 299–332.
Knight, R. A. (2011). Assessing the temporal reliability of rhythm metrics. Journal of the International Phonetic Association, 41(3), 271–281.
Li, A., & Post, B. (2014). L2 Acquisition of prosodic properties of speech rhythm: Evidence from L1 Mandarin and German learners of English. Studies in Second Language Acquisition, 36(2), 223–255.
Lin, H., & Wang, Q. (2007). Mandarin rhythm: An acoustic study. Chinese Linguistics and Computing, 17(3), 127–140.
Low, E. L. (1994). Intonation patterns in Singapore English. Unpublished master’s thesis. Department of Linguistics, University of Cambridge, Cambridge, United Kingdom.
Low, E. L. (1998). Prosodic prominence in Singapore English. Unpublished Ph.D. thesis. University of Cambridge, Cambridge, United Kingdom.
Low, E. L., Grabe, E., & Nolan, F. (2000). Quantitative characterizations of speech rhythm: ‘Syllable-timing’ in Singapore English. Language and Speech, 43(4), 377–401.
Maris, Y. (1980). The Malay sound system. Kuala Lumpur: Fajar Bakti.
Mesthrie, R., & Bhatt, R. M. (2008). World Englishes: The study of new linguistic varieties. Key topics in sociolinguistics. Cambridge University Press.
Mohd. Onn, F. (1980). Aspects of Malay phonology and morphology: A generative approach. Universiti Kebangsaan Malaysia.
Mok, P., & Dellwo, V. (2008). Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. Speech Prosody 2008, Campinas/Brazil, 6–9 May 2008, 423–426.
Nokes, J., & Hay, J. (2012). Acoustic correlates of rhythm in New Zealand English: A diachronic study. Language Variation and Change, 24(1), 1–31.
Phoon, H. S., & Maclagan, M. (2009). The phonology of Malaysian English: A preliminary study. In L. J. Zhang, R. Rubdy, & A. Lubna (Eds.), Englishes and literatures-in-English in a globalised world, Proceedings of the 13th International Conference on English in Southeast Asia (pp. 46–60). Singapore: National Institute of Education, Nanyang Technological University.
Phoon, H. S., Abdullah, A. C., & Maclagan, M. (2013). The consonant realizations of Malay-, Chinese- and Indian-Influenced Malaysian English. Australian Journal of Linguistics, 33(1), 3–30.
Pillai, S. (2006). Malaysian English as a first language. In M. K. David (Ed.), Language choices and discourse of Malaysian families: Case studies of families in Kuala Lumpur, Malaysia (pp. 61–75). Petaling Jaya: SIRD.
Pillai, S., Mohd. Don, Z., Knowles, G., & Tang, J. (2010). Malaysian English: An instrumental analysis of vowel contrasts. World Englishes, 29(2), 159–172.
Pillai, S., & Ong, L. T. (2018). English(es) in Malaysia. Asian Englishes, 20(2), 147–157.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73(3), 265–292.
Sarmah, P., Gogoi, D. V., & Wiltshire, C. (2009). Thai English: Rhythm and vowels. English World-Wide, 30(2), 196–217.
Schiffman, H. (1995). Language shift in the Tamil communities of Malaysia and Singapore. Southwest Journal of Linguistics, 14(1–2), 151–165.
Setter, J. (2006). Speech rhythm in World Englishes: The case of Hong Kong. TESOL Quarterly, 40(4), 763–782.
Sim, T. W. (2012). Why are the native languages of the Chinese Malaysians in decline? Journal of Taiwanese Vernacular, 4(1), 63–95.
Szakay, A. (2006). Rhythm and pitch as markers of raciality in New Zealand English. In P. Warren, & C. Watson (Eds.), Proceedings of the 11th Australasian International Conference on Speech Science & Technology (pp. 421–426). Auckland: University of Auckland.
Tan, R. S. K., & Low, E. L. (2010). How different are the monophthongs of Malay speakers of Malaysian and Singapore English? English World-Wide, 31(2), 162–189.
Tan, R. S. K., & Low, E. L. (2014). Rhythmic patterning in Malaysian and Singapore English. Language and Speech, 57(2), 196–214.
Teoh, B. S. (1994). The sound system of Malay revisited. Kuala Lumpur: Dewan Bahasa dan Pustaka.
Tongue, R. K. (1974). The English of Singapore and Malaysia. Eastern Universities Press.
Torgersen, E., & Szakay, A. (2012). An investigation of speech rhythm in London English. Lingua, 122(7), 822–840.
Wan Aslynn, W. A. (2012). Instrumental phonetic study of the rhythm of Malay. Unpublished Ph.D. thesis. Newcastle University, United Kingdom.
White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
Learner Varieties of English
Speech Rhythm, Length of Residence and Language Experience: A Longitudinal Investigation
Donald White and Peggy Mok
Abstract This study is an ongoing longitudinal investigation of second language (L2) speech rhythm in newly arrived immigrants. Seven Cantonese-first-language (L1)-L2 English students lived abroad for two years in English-speaking countries, including Canada (3), the United States (2), the United Kingdom (1), and Australia (1). The ages of the students at immigration (age of arrival) ranged from 16 to 20 (two students were 16, four were 17, one was 18, and one was 20). All seven participants were raised in Hong Kong and attended the same secondary school there, in which Cantonese was the medium of instruction. Additionally, all of the participants had studied English continuously from early childhood, which is compulsory in the Hong Kong education system. The students have been recorded five times over a two-year period (once before emigration, and then at approximately six-month intervals). In these recordings, the participants read two passages (The Rainbow, and The North Wind and The Sun), a collection of 14 sentences containing target words, recited target words in carrier sentences, and engaged in casual conversations with the first author. In addition, the participants were surveyed on their use of L1 and L2 speech during their time abroad. Eight speech rhythm metrics were used to analyse the data. The results of these measurements suggest that significant increases in durational variability and speech rate occurred during the first year abroad.
Keywords Speech Rhythm · Second Language Acquisition · Cantonese · English · Longitudinal
D. White · P. Mok (✉)
The Chinese University of Hong Kong, Hong Kong, China
© Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_5
1 Foreign Accent and L2 Acquisition
Most people who learn a second language (L2) speak with a foreign accent (FA) that persists even after many years of practice and successful communication. Among researchers of L2 speech acquisition, there is a large body of work that focuses on the nature of FA and its segmental phonetic correlates. Flege’s Speech Learning Model (SLM) (1995) asserts that the cognitive mechanisms used in L1 speech acquisition are available throughout one’s life for the acquisition of additional languages. According to the SLM, the cross-linguistic influence of L1 phonological categories is the main source of FA, and a large number of segmental investigations have supported this hypothesis to varying degrees, including Munro and Derwing (2008), Tsukada et al. (2005), and Flege et al. (1997), among many others. The SLM does not address the role of suprasegmentals, which is perhaps one reason why there have been comparatively fewer prosodic studies of FA. Despite their relatively smaller number, these studies collectively indicate that prosodic cues are crucial to the comprehension and intelligibility of L2 speech, and the perception of L2 accents. Several of the earliest studies in this area focused on intonation (deBot, 1983; Grover et al., 1987; Wayland, 1997). In Munro (1995), for example, after native and non-native (Chinese) English utterances were altered with a low-pass filter to the point of incomprehensibility, native English listeners were still able to distinguish between the L1 and L2 accents. At least part of this ability was attributed to durational differences between the two accents; however, because the intonational contours were not manipulated, it is impossible to determine how much of this ability was based strictly upon rhythmic cues.
Since the late 1990s, speech rhythm has increasingly been examined as an isolated factor in studies of L2 speech. Several of these studies have manipulated durational rhythmic patterns in order to determine their importance. While the designs and results have varied, they collectively suggest that speech rhythm plays a vital role in the perception of L2 speech.
One of the earliest and most elegant examples is Tajima et al. (1997), which found that the perception of L2 English spoken by a Chinese L2 speaker was significantly facilitated by artificially adjusting the temporal patterns of vowels and consonants to match those of a native English speaker. Conversely, when the L2 temporal patterns were applied to a recording of the native English speaker, the intelligibility of this speech diminished significantly among native English listeners. A similar rhythm-switching technique was also used in Boula de Mareüil and Vieru-Dimulescu (2006), which found that the duration of stressed vowels contributes to foreign accent, even in rhythmically and typologically similar languages such as Spanish and Italian. The role of speech rhythm in perception of foreign accent was also examined in Fuchs (2015). Sentences spoken by speakers of L1 British English and L2 Indian English were manipulated to test the relative strength of three perceptual cues: segments, intonation, and speech rhythm. Although speech rhythm seemed to play the smallest role among these three cues, its effect was nevertheless significant in the discrimination of the two accents, as judged by both L1 British and L2 Indian English speakers. Taken together, these findings suggest that it is desirable for L2 speakers to adjust the durational patterns of their speech, especially among those who wish to reduce FA and augment intelligibility. While most L2 speakers share a common desire to make themselves understood, this motivation is perhaps most crucial for new immigrants living in an L2-ambient environment. L2 speakers in this situation have been studied extensively. As
mentioned above, this work was originally focused on segmentals and global accents rather than prosody; however, an increasing number of studies have investigated prosody more recently. In previous segmental work, researchers demonstrated the correlation of three factors with L2 pronunciation among new immigrants: Length of Residence (LOR), Language Experience (LE), and Age of Arrival (or Age of Acquisition) (AOA). LOR measures the duration that learners have lived in an L2-ambient environment. LE measures the quantity and quality of L2 interaction among new immigrants. AOA is the age of L2 speakers when they first began to speak their L2 and/or when they first immigrated to the L2-ambient environment. (There is some ambiguity in the meaning of this acronym. Because the arrival of new immigrants so often coincides with their first L2 acquisition, previous studies are not always in agreement about what the second “A” stands for. In the present study, all of the participants began studying English as a second language before beginning primary school in Hong Kong; therefore, “AOA” will refer to their age of arrival.) In previous work, the strongest effects of AOA are evident in children before the age of twelve (Baker & Trofimovich, 2006; Flege, 1988; Flege et al., 2006; Tsukada et al., 2005). Although the influence of AOA has been demonstrated clearly and strongly by these and other previous studies (Asher & Garcia, 1969; Oyama, 1976; Tahta et al., 1981), its effect is expected to be marginal among the participants of the present study, who range in age from 16 to 20.
1.1 Foreign Accent and Speech Rhythm
In L2 speakers older than twelve, Flege (1988) found that the effects of LOR were limited to a short initial improvement that leveled off sometime during the first year after immigration. This initial burst of improvement has subsequently been observed in a number of phonetic investigations (Flege et al., 1995; Riney & Flege, 1998; Winitz et al., 1995). While it is not yet clear whether a similar first year “burst” effect of LOR exists at the suprasegmental level of speech, there has been a recent upsurge in studies investigating this phenomenon in the domain of L2 speech rhythm. Quené and Orr (2014) tracked the rhythmic development of 18 students at the University of Utrecht, where English is the lingua franca for all students. The participants came from several different L1 backgrounds including Dutch, Russian, Vietnamese, German, and English. The intensity contours of the students were compared in sentences recited five times over the course of three years. Generally, the results indicated that the rhythmic patterns of all students became more alike during their time at the university, suggesting that mutual comprehensibility was facilitated by this adjustment. In a cross-sectional study by Saito (2015), stress, intonation, and speech rate were found to affect both intelligibility and FA in L2 English spoken by Japanese L1 speakers living in Canada. The findings suggested that the effects of LOR on prosody extended well beyond the first year after immigration. Stress placement was significantly more accurate among speakers who had lived in Canada for more than five years compared to participants with a shorter LOR; a similar effect
was observed with speech rate. Saito (2015) also found that both of these prosodic domains contributed to increased intelligibility. Polyanskaya et al. (2016) resynthesized L2 English utterances spoken by L1 speakers of French to determine whether speech rate or speech rhythm had a greater effect on FA. Native English speakers rated a wide variety of these resynthesized sentences, which in some cases controlled for the idiosyncrasies of L2 speech rhythm and in others for the idiosyncrasies of L2 speech rate. They found that while both of these prosodic factors influenced the perception of FA among native English speakers, the effect of speech rhythm was stronger. Kawase et al. (2016) used durational speech rhythm metrics to investigate the L2 English of four Japanese immigrants to Australia. The LOR of two of the participants was less than six months while for the other two, it was more than a year. The long-LOR speakers were found to have significantly higher durational variability in both pairwise and global measurements. Furthermore, the long-LOR participants were closer than the short-LOR speakers to the rhythmic scores of two Australian L1 English speakers who were also measured. Finally, in a recent study of L2 rhythm, Maastricht et al. (2018) compared L2 rhythmic development in Spanish speakers of Dutch with that of Dutch speakers of Spanish. In both groups, the rhythmic effects of the L1 were evident; however, the effects were stronger for the Spanish L1 speakers than for the Dutch L1. The authors’ titular conclusion, “Learning Direction Matters”, entails that the effect on FA and comprehensibility may be higher for speakers acquiring an L2 with greater stress-timing than their L1. Taken together, these studies of L2 prosody suggest that the effects of LOR may result in a suprasegmental “burst” during the first year, similar to the segmental one proposed by Flege.
1.2 Foreign Accent and Language Experience As discussed in Sect. 1, LOR alone does not sufficiently account for changes in FA and intelligibility when L2 speakers live abroad. LOR is a simple, one-dimensional quantification of time. It does little to explain how time was spent during the time abroad, or the kind of language experience (LE) that people had during that time. In fact, LOR is nearly meaningless unless LE is also taken into consideration: because of the various cultural enclaves in international cities, it is possible for immigrants to carry on speaking their L1 in the new countries as though they had never left the old ones (in fact, this seems to have been an important factor for several participants in the present study). Collentine and Freed (2004) identify three typical contexts in which an L2 is learned: At Home (AH) denotes when an L2 is learned as a foreign language in one’s own country, typically in a classroom setting; Immersion (IM) is usually a summer program that takes place, like AH, in an L1-ambient environment, but in a context where everyone has agreed to use the L2 completely and intensively over a long period of time; finally, “Study Abroad” (SA) describes the situation when learners go overseas to learn the language in an L2-ambient environment. For all of the participants in the present study then, the beginning of the observation period marks
the end of AH and the beginning of a SA context (however, please see Sect. 2 for a brief clarification on the use of “SA” to describe their learning context). Comparisons of these two contexts in the literature suggest that there is an advantage to learning in an SA context, but this advantage is dependent upon several factors related to LE. In order to quantify the influence of these factors, Freed et al. (2004) authored the “Language Contact Profile”, a survey for participants of SA investigations. This survey was designed to gauge holistic language development and not the phonetic correlates associated with SA; however, there are many questions in the survey that address spoken interaction between the participants and L1 speakers both inside and outside of the classroom. Freed and her collaborators usually measure spoken language in terms of global fluency and speech rate (words per minute). The effect of language contact on these variables is, therefore, what these studies are focused on rather than on segmental or suprasegmental phenomena. The effects of contact with L1 speakers have not been conclusive in the literature, but several studies have examined them. In a comparison of SA and AH contexts, Segalowitz and Freed (2004) found that L1 English speakers of L2 Spanish who spent a semester studying in Spain had an advantage over their counterparts who studied Spanish at a university in the United States. In particular, oral proficiency of the SA group was significantly improved at the end of the semester, particularly in terms of fluency, and the elimination of non-Spanish “fillers” from their speech (such as “like, I mean, ah, um” etc.). When the results of the “Language Contact Profile” were also taken into consideration, however, there was no correlation between this advantage and contact with L1 Spanish speakers outside of the classroom. 
Conversely, Hernandez (2010) conducted a similar study of American students learning Spanish in Spain for a semester, and found a significant correlation between contact with L1 Spanish speakers and improved fluency. Dewey et al. (2012), which examined 204 American L2 Japanese speakers studying in Japan, contained similar findings. In a self-report survey of language contact and L2 proficiency gains, there was a significant correlation between interaction with L1 Japanese friends and improved performance in L2 Japanese. There has been some discussion of LE in phonetic studies of L2 acquisition. Two findings have relevant implications for the present study. First, Purcell and Sutter (1980) investigated L2 English acquisition by L1 Japanese, Thai, Persian, and Arabic speakers living in New York City. There was a significant inverse correlation between FA and the amount of workplace and/or school communication in the L2. Secondly, in a study of Korean children and adults learning L2 English in North America, Flege et al. (2006) posited an advantage for the children related to LE. Both groups of participants had studied English extensively before leaving Korea, but the adult participants tended not to continue their studies after they had immigrated. As Flege et al. (2006) point out, “…the [Korean] adults’ longer study of English in Korea may have represented a disadvantage with respect to the [Korean] children. The [Korean] adults may also have been disadvantaged somewhat by the fact that they had received somewhat less education in English-medium schools in North America than the [Korean] children had” (161).
To the best of our knowledge, no study to date has examined the relationship between LE and L2 Speech Rhythm in a SA context. Furthermore, there have been very few longitudinal studies that have tracked the rhythmic development of new immigrants with respect to LOR. To address this gap in the literature, the present study will investigate the rhythmic patterns of seven participants during their first two years abroad, and survey their use of language during this time. There are three main research questions and corresponding hypotheses that have motivated the present study. Research Question 1: Does immigration to an English-ambient environment correlate with significant changes in speech rhythm patterns among L2 English speakers from Hong Kong? Hypothesis 1: It is expected that there will be significant changes but it is unclear which rhythmic correlates will be affected the most. Research Question 2: In which direction will these changes in rhythmic patterns manifest? Hypothesis 2: It is expected that the rhythmic patterns will change in the direction of stress-timing, i.e. all Varco and PVI metrics are expected to increase, and Percent V is expected to decrease. In addition, speech rate (not a speech rhythm metric per se) is expected to increase. Research Question 3: How do these changes in rhythmic patterns correlate with communication patterns after immigration? Hypothesis 3: It is expected that the rhythmic changes will correlate positively with communication in L2 English, and that communication in L1 Cantonese will inhibit these rhythmic changes.
2 Method
2.1 Participants
All seven participants attended the same Hong Kong Aided secondary school prior to emigration, which occurred between 2011 and 2015. (“Aided” schools are managed by charitable and/or religious organizations, and funded by the government.) The school was located in Kowloon (a district of Hong Kong) and used Cantonese as a medium of instruction. The participants did not attend the same primary school; however, Cantonese was also the main medium of instruction during their primary education. Table 1 details their ages at the time of emigration, and their destination. In every case, the reason for leaving Hong Kong was to continue their studies at secondary schools or universities overseas. The details of each participant’s living situation will be discussed in turn.

Table 1 AOA and destination of the participants

| Participant | AOA (years; months) | Destination |
|---|---|---|
| CANGirl | 17;9 | Markham, Canada |
| CANGirl 2 | 17;10 | Toronto, Canada |
| CANBoy | 17;11 | Markham, Canada |
| CANUSABoy | 18;5 | Comox, BC, Canada* |
| USAGirl | 16;7 | Wausau, WI, USA* |
| AUSBoy | 20;5 | Sydney, Australia |
| UKBoy | 16;11 | Cambridge, UK* |

* Moved at the end of first year; see details below

It is important to note that the term “SA” does not precisely describe the learning context of the present study. Strictly speaking, the studies discussed in Sect. 1 used “SA” to denote a situation in which students go abroad for the express purpose of learning their L2. In contrast, the participants in the present investigation went abroad to study in schools where L1 English was the medium of instruction. Their main goal, in every case, was to enroll in tertiary institutions (eventually or immediately) outside of Hong Kong. Improvement in their L2 English was certainly a consideration for most if not all participants; however, this goal was secondary and, in some cases, incidental.
2.1.1
CANGirl
CANGirl left Hong Kong in 2011 and immigrated to Markham, Ontario, Canada. During the two-year observation period she lived with a friend of her mother who had immigrated to Canada several years earlier. The woman’s two daughters were also living in the house. CANGirl rented a room and shared the common areas with these three women. She attended secondary school in Markham, where she completed her secondary school diploma in 2013. (Demographically, one relevant fact about Markham is that it is home to a high concentration of Hong Kong immigrants in Canada. At the time CANGirl moved there, Markham’s population was just over 300,000, 16% of whom were native Cantonese speakers, and 45% of whom were ethnically Chinese (Statistics Canada, 2012)).
2.1.2
CANGirl 2
CANGirl 2 left Hong Kong in 2013 and immigrated to Scarborough in the east end of Toronto, Canada. For the entire observation period, she lived in an apartment with her older brother and sister who had immigrated to Canada a few years previously. During the observation period she attended a community college in Toronto.
2.1.3
CANBoy
CANBoy left Hong Kong in 2011 and immigrated to Markham, Ontario, Canada. In his first year abroad, he boarded with a Mandarin-speaking woman and her three bilingual children who spoke English for the most part. During this time, he attended secondary school in Markham. In the second year, he moved in with a Cantonese-speaking family, but the language of the household was largely English, especially among the children. (In both households, there was little communication between CANBoy and his hosts.) He received his secondary diploma in 2013.
2.1.4
CANUSABoy
CANUSABoy immigrated to Comox, British Columbia, Canada in late 2013. He lived with an English-speaking family during the first year of observation and attended secondary school. There was another boarder living in the same household who spoke Spanish as a first language. At the end of the first year, CANUSABoy was accepted to college in San Jose, California, USA, and moved there. While attending college, he lived with two Cantonese-L1 roommates.
2.1.5
USAGirl
USAGirl immigrated to Wausau, Wisconsin, USA in 2014. She lived with an English-speaking family in which there were three children (two boys and one girl). She attended secondary school along with the children in this household. After her first year, USAGirl moved to Murphy, North Carolina, USA, to live with her mother and stepfather, who immigrated a year after she did. The language of this household was Cantonese.
2.1.6
AUSBoy
AUSBoy immigrated to Sydney, Australia in 2013 to attend university. During his first month in Sydney, he lived with an English-speaking family of four, after which he began boarding with a 70-year-old native English speaker. There was one other boarder in the house, who was from Guangdong Province in China, and whose L1 was also Cantonese. Nevertheless, most of the communication between AUSBoy and his fellow boarder was in English, because the majority of their interaction occurred at the dinner table with their host mother. During his second year in Sydney, AUSBoy moved into an apartment with three L1 Mandarin speakers, but there was very little interaction between him and these three roommates.
2.1.7
UKBoy
UKBoy immigrated to Cambridge, UK in 2015. During the first year of observation, he lived with an English-speaking family, with three children (all girls), and another boarder who was a native Spanish speaker. He attended a secondary school in Cambridge that offered foundation courses to prepare for university. At the end of his first year, UKBoy moved to Exeter, UK to attend university.
2.2 Recordings
In every case except for one, the participants were recorded before emigration (T1) and then at approximately six months (T2), 1 year (T3), 1.5 years (T4), and 2 years (T5) after immigration. The exception is AUSBoy, who was recorded at 3 months, 9 months, 15 months, and two years after moving to Australia. Because of the differences between the school schedules of Hong Kong and Australia, AUSBoy returned to Hong Kong for his December (summer) holiday three months after his departure. Depending on various circumstances, the interviews were sometimes conducted in person and sometimes remotely over Skype. In every case, the participants were recorded using a Zoom H2 recorder, with digital sampling at 44.1 kHz, which was placed approximately 20 cm from the participants’ faces. (When interviews were conducted over Skype, the recorder was mailed to the participant, who operated it.) They were recorded reading three passages from which the data for the present study are taken: “The North Wind and the Sun”, “The Rainbow”, and fourteen sentences composed by the authors (see Appendix 1). Several criteria determined which utterances were suitable for analysis. First, the utterances had to be at least five syllables in length and spoken within the same breath group. Second, any utterance with a pause or false start was rejected. Finally, the utterance had to meet these criteria in all five recordings. For each participant, the total numbers of acceptable utterances per recording were as follows: CANGirl: 18, CANGirl 2: 20, CANBoy: 9, CANUSABoy: 6, USAGirl: 18, AUSBoy: 20, UKBoy: 13. The measurement of identical utterances at the five time points was motivated by suggestions made by Wiget et al. (2010), who evaluated the efficacy of speech rhythm metrics in longitudinal investigations. It is worth noting that in the cases of CANBoy and CANUSABoy, the relatively small numbers of utterances were due largely to a lack of fluency during their pre-emigration interviews.
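The selection criteria described in this section amount to a filter applied per recording, followed by an intersection across the five time points. The following is a minimal sketch of that logic, not the authors' procedure: the `Utterance` record, its field names, and the idea of flagging pauses and false starts by hand are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    # All field names are hypothetical; they are not taken from the study.
    text_id: str           # identifies the same sentence or phrase across recordings
    n_syllables: int       # syllable count of the utterance
    has_pause: bool        # True if the breath group is broken by a pause
    has_false_start: bool  # True if the speaker restarted the utterance


def is_usable(u: Utterance) -> bool:
    """Apply the per-recording criteria: at least five syllables, no pause, no false start."""
    return u.n_syllables >= 5 and not u.has_pause and not u.has_false_start


def select_utterances(recordings: dict[str, list[Utterance]]) -> list[str]:
    """Return the ids of utterances that are usable in every recording (e.g. T1 to T5)."""
    usable_sets = [
        {u.text_id for u in utterances if is_usable(u)}
        for utterances in recordings.values()
    ]
    if not usable_sets:
        return []
    return sorted(set.intersection(*usable_sets))
```

Because an utterance must pass in all recordings, a single disfluent reading at any time point removes it from the analysis, which is consistent with the small utterance counts reported for the less fluent participants.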
2.3 Segmentation and Speech Rhythm Metrics
After the suitable utterances were identified and isolated, they were segmented in Praat (Boersma, 2001) on two tiers. The first demarcated vocalic (vowel) and consonantal boundaries; syllable boundaries were established on the second tier.
Segmenting syllables is a somewhat controversial process because it requires commitment to a hypothesis regarding the composition of a syllable. In this process, the authors adhered to the Maximum Onset Principle (Kahn, 1976) to determine syllable boundaries; however, this did not preclude a number of choices that were essentially judgment calls based on careful listening and observation of spectrograms. These cases were, for the most part, instances when the final coda consonant was resyllabified across a word boundary. In these cases, the final consonant was considered part of the first syllable in the second word. After segmentation was complete, rhythmic scores were tabulated for eight durational rhythmic metrics: VarcoC, VarcoV, VarcoS (Dellwo, 2006; White & Mattys, 2007; Mok and Dellwo, 2008), nPVI-C, nPVI-V (Low, Grabe and Nolan, 2000), nPVI-S (Deterding, 2001; Mok and Dellwo, 2008), PercentV (%V; Ramus et al., 1999), and Speech Rate in syllables per second (s/s). The Varco and PVI metrics both measure durational variability, and a higher score generally indicates a greater tendency toward stress-timing. The Varco metrics quantify the global variability of consonantal, syllabic, and vocalic intervals as the standard deviation of their durations (the Delta metrics of Ramus et al., 1999) normalized for variations in speech rate. The PVI metrics also measure durational variability, but they gauge the differences between adjacent pairs of like segments. In other words, the duration of a consonantal interval, for example, is compared with the durations of the consonantal intervals that immediately precede and follow it. The small “n” at the beginning of the PVI metrics denotes normalization for variations in speech rate. We did not use the non-normalized versions of these metrics because there were very extensive changes in speech rate within participants, across participants, and throughout the entire observation period.
Finally, PercentV is the overall percentage of vocalic content in a given utterance. In contrast to the other rhythm metrics, decrease in PercentV indicates a tendency toward stress-timing. In the context of the present study, it is relevant to point out that the metrics are generally better equipped to detect broad differences in speech rhythm rather than fine-grained differences between the various accents of English. In other words, we expect that there will be rhythmic changes in the direction of stress-timing regardless of the ambient English accent; the L2 English of L1 Cantonese speakers will be, on the whole, more syllable-timed than all native English accents.
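For readers who want to see how the metrics relate to the segmented intervals, the sketch below computes them from lists of interval durations in seconds. It follows the standard published definitions (Varco as the standard deviation divided by the mean, nPVI as the mean normalized difference between successive intervals), but it is an illustration under stated assumptions rather than the authors' script: the function names are hypothetical, and the use of the sample standard deviation is an assumption, since the chapter does not say which variant was used.

```python
from statistics import mean, stdev


def varco(durations: list[float]) -> float:
    """Rate-normalized global variability: 100 * SD / mean of the interval durations."""
    # Assumption: sample standard deviation (n - 1 denominator).
    return 100 * stdev(durations) / mean(durations)


def npvi(durations: list[float]) -> float:
    """Normalized Pairwise Variability Index over successive pairs of like intervals."""
    ratios = [
        abs(a - b) / ((a + b) / 2)  # difference of a pair, normalized by the pair's mean
        for a, b in zip(durations, durations[1:])
    ]
    return 100 * mean(ratios)


def percent_v(vocalic: list[float], consonantal: list[float]) -> float:
    """Percentage of the utterance duration that is vocalic."""
    v = sum(vocalic)
    return 100 * v / (v + sum(consonantal))


def speech_rate(n_syllables: int, total_duration_s: float) -> float:
    """Speech rate in syllables per second (s/s)."""
    return n_syllables / total_duration_s


# Usage with made-up durations (seconds), not data from the study:
vowels = [0.10, 0.20, 0.10, 0.20]
consonants = [0.08, 0.12, 0.09, 0.11]
scores = {
    "VarcoV": varco(vowels),
    "nPVI-V": npvi(vowels),
    "PercentV": percent_v(vowels, consonants),
}
```

Passing consonantal or syllabic durations to `varco` and `npvi` gives the C and S variants. Both functions divide by a mean duration, which is what makes them insensitive to overall changes in speech rate.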
2.4 Statistical Analysis
Three sets of statistical tests were carried out, two across participants and one within participants. First, a linear mixed model analysis was run in RStudio (version 1.1.463) using the lme4 package (Bates et al., 2015). The model included random intercepts for participants and a fixed effect for time (T1–T5). Eight different linear mixed models were designed, one for each metric. The main purpose of these tests was to examine differences between T1 and the subsequent measurements (T2–T5). Second, to investigate pairwise differences between times, the mixed effects model was followed by a Type III Analysis of Variance Table with Satterthwaite’s method, which included a post hoc Tukey HSD to compare each pair. The third statistical test was a within-participant paired t test, which we conducted for the following pairs of time points: T1–T2, T1–T3, T1–T5, T2–T3, and T3–T5.
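One way to write out the mixed model described here is given below, for the score $y_{ij}$ of utterance $j$ spoken by participant $i$. The notation is ours, and the coding of time is an assumption: T1 is taken as the reference level, which matches the stated aim of comparing T1 with T2–T5.

```latex
y_{ij} = \beta_0 + \sum_{k=2}^{5} \beta_k \, \mathbb{1}[\text{time}_{ij} = \mathrm{T}k] + u_i + \varepsilon_{ij},
\qquad u_i \sim \mathcal{N}(0, \sigma_u^2), \quad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^2)
```

Under this coding, $\beta_0$ estimates the mean score at T1, each $\beta_k$ estimates the change from T1 to T$k$, and $u_i$ is the random intercept for participant $i$. In lme4 syntax a model of this shape would be written along the lines of `score ~ time + (1 | participant)`, where the variable names are hypothetical.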
2.4.1
Language Experience Survey
In order to assess their language use during their time abroad, the students also answered a questionnaire about their experiences. This questionnaire was a modified version of the “Language Contact Profile” (Freed et al., 2004). The main focus of the questions is the degree of L1 and L2 use while living abroad, as well as their living situation (detailed above). The participants were asked to estimate the number of days per week, and the number of hours per day that they communicated in their L2 with different groups of speakers, such as friends, teachers, service industry workers, strangers etc.
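The two survey estimates (days per week and hours per day) can be combined into a single weekly figure per interlocutor group. The sketch below shows one plausible way to do this; the multiplication rule, the function names, and the example numbers are assumptions for illustration, since the chapter does not state how the estimates were aggregated.

```python
def weekly_hours(days_per_week: float, hours_per_day: float) -> float:
    """Estimated hours per week for one interlocutor group (assumed: days x hours)."""
    if not 0 <= days_per_week <= 7:
        raise ValueError("days_per_week must be between 0 and 7")
    if not 0 <= hours_per_day <= 24:
        raise ValueError("hours_per_day must be between 0 and 24")
    return days_per_week * hours_per_day


def total_weekly_hours(responses: dict[str, tuple[float, float]]) -> float:
    """Sum the weekly estimates over all groups, e.g. friends, teachers, strangers."""
    return sum(weekly_hours(days, hours) for days, hours in responses.values())


# Hypothetical responses for one participant: group -> (days per week, hours per day)
example = {"friends": (5, 2.0), "teachers": (5, 1.0), "strangers": (2, 0.5)}
english_hours = total_weekly_hours(example)
```

Keeping the per-group estimates separate until the final sum makes it possible to report either a total or a breakdown by group.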
3 Results The Results section is divided into two parts. First, the results for speech rhythm will be presented, beginning with across-participants, and followed by within-participant scores. Second, the results of the LE survey will be presented.
3.1 Across-Participants Results
The means for the across-participants results are presented in Table 2. Table 3 details the results of the linear mixed effects model. All of the measurements of durational variability increased between T1 and T2, and several of these increased significantly after one year (T3). In most cases, these increases were maintained during the second year abroad; however, not all of the differences with T1 were significant. In contrast to the measures of durational variability, the expected decreases in PercentV were marginal and not significant. Finally, the speech rate results increased significantly across participants after six months, and these significant differences were maintained at all subsequent time points. In the repeated measures ANOVAs, speech rate was the only metric that reached significance (F(4, 103) = 18.748, p < 0.001). Among the pairwise comparisons for Tukey HSD, there was one significant difference for speech rate between T1 (M = 4.34, SD = 0.74) and T5 (M = 5.17, SD = 0.79).
Table 2 Mean results across participants

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 48.73 | 52.81 | 53.15* | 53.71* | 52.88* |
| VarcoS | 39.5 | 41.62 | 43.57* | 43.84* | 43.63* |
| VarcoV | 41.25 | 42.79 | 45.69* | 45.05 | 45.75 |
| nPVI-C | 56.23 | 60.65 | 60.72 | 63.67* | 61.08* |
| nPVI-S | 50.06 | 50.34 | 53.07 | 53.08 | 52.32 |
| nPVI-V | 45.25 | 47.47 | 49.38* | 48.53 | 50.19 |
| PercentV | 49.05 | 47.85 | 48.05 | 48.9 | 48.42 |
| Speech Rate | 4.34 | 4.81* | 4.97* | 5.08* | 5.18* |

* = significantly different from T1
3.2 Within-Participant Results
The within-participant results are presented below. First, the results for each metric are presented in Fig. 1. Following these figures, each participant’s results are presented in Tables 4–10 and discussed in turn. Paired t tests were carried out for the following pairs: T1–T2, T1–T3, T1–T5, T2–T3, and T3–T5. The first year was compared more thoroughly than the second year because we expected that rhythmic development would be greater during that time. In order to reduce the risk of a Type I error, a Bonferroni correction was applied, in which our alpha level of 0.05 was divided by the number of comparisons (5). The threshold for significance in the tables below, therefore, is 0.01. In Tables 4–10, a significant difference from T1 is represented by “*”, a significant difference from T2 is represented by “†”, and a significant difference from T3 is represented by “‡”.
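The within-participant procedure can be illustrated with a short sketch: a paired t statistic computed over the same utterances at two time points, and a Bonferroni-corrected threshold for five comparisons. This is an illustrative sketch, not the authors' analysis code; the function names are hypothetical, the p value itself is left to a statistics package, and the small tolerance in the comparison is our own choice so that a reported p of exactly 0.01 counts as significant, as it does in the chapter.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t(first: list[float], second: list[float]) -> tuple[float, int]:
    """Return (t, df) for a paired t test on matched scores from two time points.

    Differences are taken as first - second, so an increase over time gives
    a negative t, matching the sign convention of the reported statistics.
    """
    if len(first) != len(second):
        raise ValueError("both time points must contain the same utterances")
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


def bonferroni_threshold(alpha: float = 0.05, n_comparisons: int = 5) -> float:
    """Divide the alpha level by the number of comparisons."""
    return alpha / n_comparisons


def is_significant(p: float, alpha: float = 0.05, n_comparisons: int = 5) -> bool:
    """Compare a p value with the corrected threshold (0.01 for five comparisons)."""
    # Tolerance guards against floating-point error when p equals the threshold.
    return p <= bonferroni_threshold(alpha, n_comparisons) + 1e-12


# Two p values reported in this chapter for CANGirl's speech rate:
t2_vs_t3 = is_significant(0.01)   # reported as significant
t1_vs_t3 = is_significant(0.015)  # reported as not significant after correction
```

With 18 matched utterances the test has 17 degrees of freedom, which is why CANGirl's statistics are reported as t(17).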
3.2.1
CANGirl
There was a significant increase in the speech rate of CANGirl between T1 (Mean = 4.22, Standard Deviation = 0.72) and T2 (M = 4.83, SD = 0.78), t(17) = −7.439, p < 0.001; however, there was a significant decrease between T2 and T3 (M = 4.56, SD = 0.70), t(17) = 2.887, p = 0.01. Although this T3 score was still greater than her speech rate at T1, the T1–T3 comparison did not reach significance after correction (p = 0.015). There were also significant increases in speech rate between T3 and T5 (M = 5.03, SD = 0.88), t(17) = −5.278, p < 0.001, and between T1 and T5, t(17) = −6.558, p < 0.001. The only other significant result for CANGirl was an increase in VarcoC between T1 (M = 42.4, SD = 11.24) and T3 (M = 50.31, SD = 14.49), t(17) = −2.917, p = 0.01.
CANGirl’s other Varco measurements had (for the most part) marginal increases in the expected direction. The same was true for the nPVI measurements, except for nPVI-S, which fell in the second year. PercentV was flat.

Table 3 Results of mixed effects model

| Metric | Predictor | Estimate | Std. Error | t | p |
|---|---|---|---|---|---|
| VarcoC | Intercept | 48.95 | 1.62 | 30.18 | < 0.001 |
| | T2 | 4.26 | 2.1 | 2.03 | 0.052 |
| | T3 | 4.44 | 2.0 | 2.22 | 0.03* |
| | T4 | 5.07 | 2.03 | 2.5 | 0.02* |
| | T5 | 4.21 | 2.0 | 2.1 | 0.04* |
| VarcoS | Intercept | 39.31 | 2.04 | 19.31 | < 0.001 |
| | T2 | 2.11 | 1.57 | 1.35 | 0.18 |
| | T3 | 4.02 | 1.62 | 2.48 | 0.02* |
| | T4 | 4.25 | 1.74 | 2.44 | 0.02* |
| | T5 | 4.03 | 1.79 | 2.25 | 0.04* |
| VarcoV | Intercept | 40.72 | 2.03 | 20.1 | < 0.001 |
| | T2 | 1.56 | 2.11 | 0.74 | 0.46 |
| | T3 | 4.32 | 2.13 | 2.03 | 0.04* |
| | T4 | 3.65 | 2.14 | 1.7 | 0.09 |
| | T5 | 4.23 | 2.22 | 1.9 | 0.07 |
| nPVI-C | Intercept | 56.4 | 1.96 | 28.85 | < 0.001 |
| | T2 | 4.69 | 2.85 | 1.64 | 0.12 |
| | T3 | 4.69 | 2.72 | 1.73 | 0.1 |
| | T4 | 7.72 | 2.9 | 2.67 | 0.02* |
| | T5 | 5.08 | 2.78 | 1.83 | 0.08 |
| nPVI-S | Intercept | 48.86 | 3.5 | 13.94 | < 0.001 |
| | T2 | 0.4 | 2.24 | 0.18 | 0.86 |
| | T3 | 3.15 | 2.25 | 1.4 | 0.16 |
| | T4 | 2.68 | 2.4 | 1.12 | 0.27 |
| | T5 | 1.74 | 2.62 | 0.66 | 0.52 |
| nPVI-V | Intercept | 44.22 | 2.48 | 17.85 | < 0.001 |
| | T2 | 2.2 | 2.28 | 0.97 | 0.34 |
| | T3 | 4.1 | 2.28 | 1.8 | 0.07* |
| | T4 | 3.4 | 2.29 | 1.48 | 0.14 |
| | T5 | 4.61 | 2.37 | 1.95 | 0.059 |
| PercentV | Intercept | 49.17 | 1.29 | 38.17 | < 0.001 |
| | T2 | -1.18 | 0.86 | -1.36 | 0.18 |
| | T3 | -0.97 | 1.0 | -0.98 | 0.35 |
| | T4 | -0.22 | 1.13 | -0.19 | 0.85 |
| | T5 | -0.7 | 1.04 | -0.66 | 0.52 |
| Speech rate | Intercept | 4.29 | 0.15 | 27.99 | < 0.001 |
| | T2 | 0.54 | 0.21 | 2.63 | 0.04* |
| | T3 | 0.69 | 0.21 | 3.35 | 0.01* |
| | T4 | 0.8 | 0.24 | 3.36 | 0.01* |
| | T5 | 0.88 | 0.17 | 5.1 | 0.002* |

* = significantly different from T1
3.2.2
CANGirl 2
CANGirl 2’s speech rate results suggest an increase in the first six months that remained level for her remaining time abroad. Her speech rate increased significantly from T1 (M = 4.56, SD = 0.65) to T2 (M = 4.93, SD = 0.66), t(19) = −5.21, p < 0.001. The increase from T1 was also significant at T3 (M = 4.94, SD = 0.87), t(19) = −3.672, p = 0.002, and at T5 (M = 5.03, SD = 0.69), t(19) = −4.386, p = 0.002. Her VarcoS score increased significantly from T1 (M = 46.5, SD = 11.89) to T5 (M = 51.49, SD = 10.32), t(19) = −2.879, p = 0.01. There were also increases in the T2–T3 and T1–T3 comparisons, but they were not significant. This suggests that the increase in global syllabic variability began in the second half of her first year abroad, aligning with the across-participants results. Her PercentV scores increased significantly (i.e. not in the expected direction) in the second year from T3 (M = 41.14, SD = 6.3) to T5 (M = 44.92, SD = 6.84), t(19) = −3.117, p = 0.006; however, this was preceded by a decrease of similar size in the T2–T3 comparison, from 43.35 to 41.14, which was not significant. Taken together, her PercentV scores remained more or less flat during her time abroad. Her other Varco scores and nPVI scores increased generally during the first year and remained at around those levels for the second year. However, none of these comparisons was significant.
3.2.3
CANBoy
CANBoy’s speech rate scores increased significantly from T1 (M = 4.06, SD = 0.61) to T2 (M = 5.32, SD = 0.84), t(8) = −3.661, p = 0.006. There was also a significant increase between T1 and T3 (M = 5.6, SD = 0.61), t(8) = −7.291, p < 0.001, and between T1 and T5 (M = 5.5, SD = 0.65), t(8) = −4.688, p = 0.002. This suggests an increase in the first six months that leveled off during the rest of the investigation period. For his Varco scores there was no discernible trend throughout the observation period; for his nPVI scores, there was a general increase.
Fig. 1 Mean results for all metrics at T1–T5, one line per participant (CANGirl, CANGirl 2, CANBoy, CANUSABoy, USAGirl, AUSBoy, UKBoy). Panels: VarcoC, nPVI-C, VarcoS, nPVI-S, VarcoV, nPVI-V, PercentV, and Speech Rate (s/s)
Table 4 Mean results for CANGirl

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 42.4 | 47.52 | 50.31* | 50.03 | 45.47 |
| VarcoS | 33.05 | 33.24 | 34.51 | 31.95 | 32.07 |
| VarcoV | 40.67 | 39.73 | 42.53 | 44.22 | 44.19 |
| nPVI-C | 48.69 | 51.76 | 53.51 | 56.31 | 51.84 |
| nPVI-S | 40.28 | 38.96 | 40.53 | 35.1 | 34.45 |
| nPVI-V | 43.55 | 47.5 | 46.79 | 47.39 | 49.48 |
| PercentV | 52.5 | 49.88 | 53.25 | 53.68 | 52.45 |
| Speech Rate | 4.22 | 4.83* | 4.56† | 5.1 | 5.03*‡ |

Table 5 Mean results for CANGirl 2

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 48.57 | 50.25 | 55.32 | 54.88 | 52.19 |
| VarcoS | 46.5 | 47.3 | 51.5 | 52.82 | 51.49* |
| VarcoV | 44.19 | 43.94 | 47.56 | 47.8 | 48.1 |
| nPVI-C | 56.39 | 60.13 | 63.53 | 67.74 | 63.34 |
| nPVI-S | 64.25 | 60.59 | 62.49 | 68.13 | 67.05 |
| nPVI-V | 47.56 | 49.51 | 53.34 | 49.01 | 55.7 |
| PercentV | 43.29 | 43.35 | 41.14 | 45.26 | 44.92‡ |
| Speech Rate | 4.56 | 4.93* | 4.94* | 5.25 | 5.03* |

Table 6 Mean results for CANBoy

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 51.56 | 54.85 | 55.05 | 55.1 | 56.7 |
| VarcoS | 34.05 | 44.28 | 47.75 | 45.7 | 42.76 |
| VarcoV | 33.83 | 43.09 | 43.16 | 46.2 | 40.56 |
| nPVI-C | 56.14 | 59.95 | 61.28 | 62.53 | 61.07 |
| nPVI-S | 42.81 | 50.25 | 56.7 | 54.69 | 53.15 |
| nPVI-V | 37.62 | 38.87 | 41.74 | 45.35 | 41.65 |
| PercentV | 53.9 | 51.42 | 48.93 | 49.14 | 50.67 |
| Speech Rate | 4.06 | 5.32* | 5.6* | 6.25 | 5.5* |

3.2.4
AUSBoy
AUSBoy’s speech rate scores increased significantly from T2 (M = 4.78, SD = 0.84) to T3 (M = 5.43, SD = 0.87), t(19) = −8.411, p < 0.001. There were also significant increases from T1 (M = 4.67, SD = 0.67) to T3, t(19) = −7.142, p < 0.001; and from T1 to T5 (M = 5.63, SD = 0.87), t(19) = −10.415, p < 0.001. This pattern suggests an increase that began between the 3-month and 9-month mark during his first year, and continued during the rest of the investigation period. Additionally, VarcoV increased significantly from T1 (M = 46.0, SD = 13.09) to T5 (M = 56.06, SD = 15.93), t(19) = −2.85, p = 0.01.
AUSBoy’s PercentV scores declined gradually over the entire investigation period (i.e. in the expected direction). There was a significant decrease from T1 (M = 48.24, SD = 7.15) to T2 (M = 45.05, SD = 7.25), t(19) = 3.08, p = 0.006. The decrease between T1 and T5 (M = 43.25, SD = 5.1) was also significant, t(19) = 5.548, p < 0.001. In his T1–T3 PercentV comparison, there was a decrease from 48.24 to 45.05, which was not significant. AUSBoy’s nPVI-S scores increased generally, but not significantly, during the investigation period. His other Varco scores and nPVI-V increased generally during the investigation period; however, his nPVI-C scores were flat.

Table 7 Mean results for AUSBoy

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 50.13 | 49.6 | 49.24 | 47.9 | 54.01 |
| VarcoS | 44.6 | 46.13 | 47.78 | 48.48 | 52.04 |
| VarcoV | 47 | 48.94 | 53.21 | 52.43 | 56.06* |
| nPVI-C | 58.19 | 57.25 | 56.87 | 54.49 | 58.03 |
| nPVI-S | 56.2 | 57.03 | 58.56 | 60.17 | 65.1 |
| nPVI-V | 54.36 | 55.31 | 55.75 | 57 | 59.56 |
| PercentV | 48.24 | 45.71* | 45.05 | 44.8 | 43.25* |
| Speech Rate | 4.67 | 4.78 | 5.43*† | 5.33 | 5.63* |

Table 8 Mean results for CANUSABoy

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 48.31 | 53.08 | 50.95 | 58.64 | 56 |
| VarcoS | 40.32 | 40.65 | 41.6 | 46.83 | 41.55 |
| VarcoV | 46.02 | 43.41 | 43.41 | 42.21 | 41.49 |
| nPVI-C | 53.76 | 66.15 | 63.67 | 68.48 | 70.13 |
| nPVI-S | 43.47 | 44.14 | 46.39 | 44.69 | 37.76 |
| nPVI-V | 48.63 | 41.53 | 42.66 | 45.93 | 47.66 |
| PercentV | 46.91 | 45.85 | 48.47 | 48.17 | 46.12 |
| Speech Rate | 3.71 | 4.96* | 4.82* | 4.55 | 4.92* |

Table 9 Mean results for USAGirl

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 49.58 | 59.75* | 58.88 | 58.27 | 55.7 |
| VarcoS | 36.77 | 38.6 | 38 | 39.77 | 39.66 |
| VarcoV | 37.86 | 41.91 | 46.42 | 41.48 | 42.05 |
| nPVI-C | 58.59 | 71.73* | 69.84 | 72.44 | 69.18 |
| nPVI-S | 48.95 | 48.61 | 52.65 | 52.84 | 49.02 |
| nPVI-V | 39.86 | 47.73 | 53.41* | 46.83 | 46.51 |
| PercentV | 51.18 | 53.82 | 53.57 | 55.83 | 54.21 |
| Speech Rate | 3.86 | 4.61* | 4.87* | 4.34 | 5.03* |

Table 10 Mean results for UKBoy

| Metric | T1 | T2 | T3 | T4 | T5 |
|---|---|---|---|---|---|
| VarcoC | 52.66 | 57.85 | 51.53 | 56.39 | 54.46 |
| VarcoS | 36.99 | 40.33 | 43.15 | 42.29 | 41.63 |
| VarcoV | 36.33 | 36.5 | 37.38 | 36.03 | 39.13 |
| nPVI-C | 61.38 | 61.62 | 57.95 | 68.16 | 59.74 |
| nPVI-S | 41.9 | 45.33 | 48.63 | 47.02 | 45.48 |
| nPVI-V | 41.2 | 40.61 | 39.91 | 42.07 | 40.44 |
| PercentV | 49.05 | 45.47 | 47.64 | 44.8 | 47.68 |
| Speech rate | 4.77 | 4.48 | 4.69 | 4.87 | 4.99 |
3.2.5 CANUSABoy
CANUSABoy’s speech rate scores increased significantly from T1 (M = 3.71, SD = 0.8) to T2 (M = 4.96, SD = 1.16), t(5) = −4.497, p = 0.006, and from T1 to T3 (M = 4.82, SD = 0.75), t(5) = −4.232, p = 0.008. Additionally, from T1 to T5 there was a noteworthy increase (p = 0.028). Overall, these results suggest an increase in speech rate during the first six months that leveled off during the rest of the investigation period. Although not significant, there were increases in the consonantal measures of variability. Their vocalic and syllabic counterparts showed no such trend. PercentV was also flat.
3.2.6 USAGirl
USAGirl’s speech rate scores increased significantly from T1 (M = 3.86, SD = 0.52) to T2 (M = 4.61, SD = 0.62), t(17) = −9.759, p < 0.001; from T1 to T3 (M = 4.87, SD = 0.8), t(17) = −10.254, p < 0.001; and from T1 to T5 (M = 5.03, SD = 0.71), t(17) = −11.883, p < 0.001. This suggests an increase in the first six months that leveled off during the rest of the investigation period. The p value of the T2–T3 speech rate comparison was also notable (p = 0.018). In the first six months abroad, two measurements of USAGirl’s consonantal variability increased significantly: VarcoC from T1 (M = 49.58, SD = 12.92) to T2 (M = 59.75, SD = 16.12), t(17) = −3.194, p = 0.005; and nPVI-C from T1 (M = 58.59, SD = 14.4) to T2 (M = 71.73, SD = 17.74), t(17) = −3.224, p = 0.005. In addition, nPVI-V increased significantly from T1 (M = 39.86, SD = 12.11) to T3 (M = 53.41, SD = 15.32), t(17) = −4.194, p = 0.001. Finally, two other results were notable because they moved in a direction contrary to expectations: PercentV increased in both the T1–T3 and T1–T5 comparisons, though not significantly (p = 0.04 and 0.023, respectively).
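The test statistics reported in this section have n − 1 degrees of freedom and are negative when scores rise over time, which would be consistent with paired-samples t-tests computed as earlier minus later scores. The following Python sketch illustrates that calculation under this assumption. The function name `paired_t` and the sample values are hypothetical and are not taken from the study's data.

```python
import math
import statistics


def paired_t(before, after):
    """Paired-samples t statistic for two equal-length lists of scores.

    Returns (t, df). Differences are taken as 'before minus after',
    so an increase over time yields a negative t value.
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(before, after)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)        # sample SD (n - 1 in the denominator)
    standard_error = sd_diff / math.sqrt(len(diffs))
    return mean_diff / standard_error, len(diffs) - 1


# Hypothetical per-sentence speech rates (syllables per second), NOT the study's data.
rates_t1 = [3.5, 4.0, 3.8, 4.2, 3.6, 3.9]
rates_t2 = [4.4, 4.9, 4.5, 5.0, 4.6, 4.8]

t_stat, df = paired_t(rates_t1, rates_t2)
print(f"t({df}) = {t_stat:.3f}")
```

The p value would then be read from the t distribution with `df` degrees of freedom, for which a statistics package is normally used.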
3.2.7 UKBoy
There were no significant results for UKBoy.
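For readers who wish to reproduce scores of the kind reported in the tables above, the following Python sketch computes duration-based metrics from lists of interval durations, using commonly cited definitions: Varco as 100 × SD/mean, nPVI as the mean normalised difference between successive intervals, and PercentV as the vocalic share of total duration. The function names, the choice of the sample standard deviation, and the example durations are assumptions made for illustration; the segmentation procedure and exact formulas used in the study are described outside this section.

```python
import statistics


def varco(durations):
    """Rate-normalised variability: 100 * SD / mean of the interval durations."""
    return 100 * statistics.stdev(durations) / statistics.mean(durations)


def npvi(durations):
    """Normalised Pairwise Variability Index over successive interval pairs."""
    pairs = list(zip(durations, durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * sum(terms) / len(terms)


def percent_v(vocalic, consonantal):
    """Percentage of the total utterance duration that is vocalic."""
    total_vocalic = sum(vocalic)
    return 100 * total_vocalic / (total_vocalic + sum(consonantal))


def speech_rate(n_syllables, total_duration_s):
    """Syllables per second."""
    return n_syllables / total_duration_s


# Hypothetical interval durations in seconds for one short utterance.
vocalic = [0.08, 0.15, 0.06, 0.20]
consonantal = [0.10, 0.07, 0.12, 0.09]

print(f"VarcoV   = {varco(vocalic):.2f}")
print(f"VarcoC   = {varco(consonantal):.2f}")
print(f"nPVI-V   = {npvi(vocalic):.2f}")
print(f"PercentV = {percent_v(vocalic, consonantal):.2f}")
print(f"Rate     = {speech_rate(4, sum(vocalic) + sum(consonantal)):.2f} syll/s")
```

Because Varco and nPVI divide by a local or global mean duration, they are less sensitive to overall tempo than raw standard deviations, which is why speech rate is reported as a separate score.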
3.3 Language Experience Survey Results
A considerable amount of residential information from the LE survey is detailed above (see Sects. 2.1.1–2.1.7). The survey was divided into two parts, year one and year two. The reason for this division is that the living situations of several participants changed at the end of their first year abroad. Figure 2 shows the average estimated amount of English and Cantonese communication (hours per week) with L1 speakers of each language; Fig. 3 shows the average estimated frequency (days per week) of English communication with strangers who spoke L1 English. Figure 2, therefore, is primarily a measurement of language contact, while Fig. 3 gives an indication of motivation for speaking English among the participants. In Fig. 2, the amount of English use varies from participant to participant. When the first year is compared to the second, there were sizable reductions in the use of English for CANBoy and UKBoy, as well as an increase for CANGirl. Cantonese use was similarly varied; however, one trend that several participants had in common was a noticeable increase in Cantonese use during the second year. The reasons for these trends will be addressed individually in Sect. 4. Figure 3 suggests that there was also wide variability in the willingness of the participants to interact with English-speaking strangers. In fact, some of the participants indicated that they spent large periods of time without any interaction with strangers at all.
Fig. 2 Estimated time speaking English and Cantonese to L1 speakers (hours per week, Year 1 and Year 2, by participant)
Fig. 3 Estimated frequency of communication with L1 English strangers (days per week, Year 1 and Year 2)
4 Discussion
Our first research question and hypothesis asserted that the speech rhythm patterns of the participants would change significantly during the two-year observation period. It is clear that this hypothesis has been largely, though not entirely, borne out by our results. Across participants, a majority of the metrics changed significantly. Nevertheless, the individual results suggest that the rhythmic developments of some participants were much greater than others. In the case of UKBoy, for example, there were no significant changes at all. Our second research question and hypothesis asserted that the direction of rhythmic development would be toward stress-timing. Again, this assertion was largely borne out, but with a few exceptions. The measurements of durational variability increased across participants, which indicates general development in the direction of stress-timing. At the individual level as well, most of the significant changes were increases in durational variability of one kind or another. The exception at both levels was PercentV, which remained nearly equal at all time points across participants, and in some individual cases actually increased significantly, exactly opposite to our expectations. In fact, AUSBoy was the only participant whose PercentV scores decreased significantly throughout the observation period. Our third and final research question and hypothesis asserted that rhythmic changes would correlate positively with the participants’ estimated time spent speaking English, and negatively with their estimated time spent speaking Cantonese.
At the present stage of this study, we will not address this question through statistical analysis. We are still in the process of collecting other related results that will eventually be incorporated into our linear mixed model regression. The discussion below is, therefore, anecdotal and makes no statistical claims. The participant who seemed to display the greatest rhythmic development during her first year abroad was USAGirl. She spoke English the most and Cantonese the least among all of the participants. There were a number of factors that led to these patterns of language use. First, USAGirl was the participant most isolated from Cantonese influence during her first year abroad. She lived in Wausau, Wisconsin, a small city of 40,000 people where very few (if any) Cantonese speakers live. All of USAGirl’s interaction with local residents was conducted in English, and the small amount of Cantonese that she spoke during her first year took place entirely online with friends and family in Hong Kong. Second, as Figs. 2 and 3 show, USAGirl is a talkative person and spent a great deal of time communicating with her host family, fellow students, teachers, and even strangers. In her second year abroad, she left Wausau and moved to Murphy, North Carolina to live with her mother and stepfather, both L1 Cantonese speakers. Despite the increased use of Cantonese during this time, in many cases the rhythmic changes that occurred during the first year seemed to endure. In USAGirl’s case, then, the results suggest that it may have been the increased use of English, rather than the decreased use of Cantonese, that had a greater effect on her rhythmic development. AUSBoy’s communication patterns with L1 English and L1 Cantonese speakers were similar to those of USAGirl. Judging from his results in Fig. 2, it would seem reasonable to assume that he was also isolated from Cantonese; however, he was living in Sydney, Australia, where over 40,000 Hong Kong immigrants reside.
It would seem, therefore, that AUSBoy’s isolation from Cantonese was somewhat self-imposed. In fact, even though his housemate was an L1 Cantonese speaker, he communicated with him mostly in English. It is also clear that AUSBoy spent a good deal of time communicating with L1 English speakers while he lived in Sydney. When compared to USAGirl, not as many of AUSBoy’s scores suggested rhythmic development toward stress-timing. He did, however, undergo significant changes in both VarcoV and PercentV (as noted above). One other participant was somewhat isolated from Cantonese during his first year abroad: CANUSABoy, who initially immigrated to Comox, BC, Canada, a very small city of 15,000 people. In spite of this environment, he still estimated that he spent about 16 h per week communicating in Cantonese. There seemed to be two reasons that his use of Cantonese did not decrease: first, a few of his classmates were L1 Cantonese speakers; and, second, he apparently communicated more frequently online with friends in Hong Kong. His L1 communication increased even further during the second year, after moving to San Jose, as both of his roommates were L1 Cantonese speakers. In any case, the greater isolation during the first year did not seem to correlate with greater stress-timing: CANUSABoy’s speech rhythm scores remained largely unchanged over the entire observation period. In contrast to the relative isolation from Cantonese experienced by the three participants above, CANGirl, CANGirl 2, and CANBoy experienced much more exposure to their L1 during the observation period. The presence of Cantonese in Markham (see Sect. 2.1.1) was a factor for CANGirl and CANBoy. CANGirl 2 estimated that she communicated seven times more frequently in Cantonese than in English. The uniformity of exposure to Cantonese did not, however, translate into a uniformity of speech rhythm results. Despite a common increase in fluency during the observation period, there were distinct rhythmic developments for all three participants. First, there were no developmental trends in the rhythmic patterns of CANBoy. It seemed that he was generally reluctant to communicate in English while he was living in Canada, with an average of about 8 h per week during his first year, and just 2 h per week in his second year. Additionally, he did not communicate with strangers in English. In fact, before he left Hong Kong, CANBoy was extremely shy about communicating in English, and he had difficulty overcoming this shyness during his first two years abroad. Second, as CANGirl’s survey results indicate, she is more extroverted than CANBoy. During both years, her estimated communication with L1 English speakers was between 10 and 20 h per week. Although CANGirl was boarding with an L1 Cantonese-speaking woman and her two daughters, there was limited interaction with them, especially on weekdays. Although the FA of CANGirl 2 seemed to change very little during the two-year observation period, there was one significant change in her speech rhythm: an increase in consonantal variability (as measured by VarcoC) over the first year. In contrast to the moderate Cantonese communication of CANBoy and CANGirl, CANGirl 2 continued to speak Cantonese throughout the observation period as though she had never left Hong Kong. She estimated that her communication with L1 Cantonese speakers was 35 h per week, most of which took place in her household with her two siblings.
This L1 communication notwithstanding, the syllabic variability of CANGirl 2’s L2 English also increased during the first year abroad (as measured by VarcoS). It was our impression that she also seemed to reduce her FA more effectively than the other two participants living in Canada. First, at the end of the observation period, she largely replaced /t/ with alveolar flaps in phonologically appropriate contexts. This replacement occurred both within words and across word boundaries, which gives an impression of increased fluency and more native-like speech. Second, the contours of her intonation sounded much more appropriate than those of her two Canadian counterparts at the end of the observation period. Finally, there is UKBoy, whose results contained no significant changes throughout the entire observation period. It seems that UKBoy’s speech rate was already quite high before leaving Hong Kong: his T1 speech rate was already at 4.77, which is just below the mean level of T2 in the across-participants results (4.81). Despite his higher rate of speaking, we felt that UKBoy’s speech changed very little in comparison to several of the other participants. Based on impressionistic listening, it seemed that his accent and comprehensibility were essentially the same through all time points. This perhaps could be attributed to the fact that he interacted very little with native English speakers during the two-year observation period. In his interviews, UKBoy spoke quite candidly about what he perceived as a lack of academic seriousness among the L1 English speakers he met, especially at Exeter
University. Among the L1 speakers he encountered, UKBoy’s impression was that they were much more committed to the consumption of alcohol than to attaining a university diploma. Since UKBoy is very serious and pragmatic about his education, and generally abstains from the consumption of alcohol, he consciously decided that it was in his best interest to befriend L2 English speakers, many of whom speak L1 Cantonese or Mandarin. In some cases, there seems to be a relationship between changes in speech rhythm and the amount of interaction with L1 English speakers and/or L1 Cantonese speakers. These impressions are not statistically validated, and there are notable exceptions, but in general, a greater isolation from Cantonese coupled with greater interaction with L1 English speakers resulted in the most comprehensive rhythmic modifications. At this point in the study, we cannot make definite conclusions about the correlations among these variables. In order to test these impressions statistically, the next stage of this investigation will involve native-speaker judgments for FA, intelligibility, and comprehensibility among the participants. These variables and a number of segmental phonetic correlates not reported in the present study will all be taken into consideration in a mixed model regression analysis, which will measure the influences of all factors statistically. Native speaker judgments will also shed light on the amount of perceived FA before emigration. This factor is not addressed by the present study, but it may help to explain the differences in rhythmic development among the three participants who lived in the Toronto area. Specifically, if CANGirl 2’s pre-emigration FA was not as strong as that of her two counterparts living in Markham, this might suggest an explanation for rhythmic development that occurred in spite of intense, daily communication in Cantonese. 
Another possibility for future research on these data would involve additional rhythmic metrics outside the domain of duration. Recent work suggests that L2 rhythmic patterns are also manifested in pitch and loudness (Fuchs, 2014) and sonority (Fuchs & Wunder, 2015). The inclusion of these metrics may offer a more holistic understanding of the correlation between speech rhythm development and immigration to an L1-ambient environment, which has been demonstrated clearly by the present study.
5 Conclusion
In conclusion, this longitudinal study has found that L2 English speech rhythm may develop in the direction of stress-timing after L1 Cantonese speakers immigrate to an English-ambient environment. Although these developments were neither consistent nor uniform among the seven participants, a majority of them displayed elements of significantly more stress-timed speech during their first two years abroad. These results align with previous studies of L2 speakers living in an L1-ambient environment. The speech rhythm developments towards stress-timing occurred mostly within the first year and in a number of cases remained at this level during the second.
This parallels the initial burst (see Sect. 1.1) observed by Flege and his collaborators. Finally, these results reinforce the notion that LOR is not a reliable metric when it is considered in isolation. The patterns of language use among the students in the present study were very dependent on context, and their contexts were all over the map, both literally and figuratively. In studies of this kind, therefore, the inclusion of language experience data is advisable in order to gain a deeper understanding of the time spent in an L2-ambient environment.
Appendix 1: Sentences
1. When the cup is empty, they fill it with water for the class. When it is full, they will give the water to the class, and then go back to fill it again.
2. She was happy about the email she received on the weekend from her sister. The letter was full of things that made her laugh.
3. There was no doubt about it: the elevator was better than the stairs because it was very hard for her to climb so many steps.
4. Her aunt lived outside of Toronto in a city called Markham. She usually drove her car if she went out after dark.
5. It was hard to write the letter to Mr. Jones. How could he write the letter when what he wanted to do was shout at him out loud?
6. The first day of school is a special day. It is special because the children feel happy about the new year. They also like to play outside at recess.
7. The boy sent the girl an email to ask her out for dinner. He told her that he would pick her up in his car at six o’clock, but she wondered how he could drive so fast.
8. I doubt that it will matter how fast you can run. If you ask me, it is much better to learn how to walk quickly, even if people laugh at you.
9. Nobody was madder than that crazy fool. The children used to laugh at him, which made me feel sad.
10. She doesn’t like to feel around in the dark for her glasses. She usually will ask her husband for help.
11. If any student doesn’t finish her homework, the teacher will shout at him. It doesn’t matter how many; even one bad student will make his voice very loud.
12. He was proud of his son for taking the stairs because he thought the elevator was for lazy people.
13. He gave his dog a bath every day. The only problem was that his dog did not like to take a bath.
14. When he had finished singing, he bowed to the audience. After he had bowed, he walked off the stage.
References
Asher, J., & Garcia, R. (1969). The optimal age to learn a foreign language. The Modern Language Journal, 53(5), 334–341.
Baker, W., & Trofimovich, P. (2006). Perceptual paths to accurate production of L2 vowels: The role of individual differences. International Review of Applied Linguistics in Language Teaching, 44, 231–250.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting linear mixed-effects models using lme4. Journal of Statistical Software, 67, 1–48.
Boersma, P. (2001). Praat, a system for doing phonetics by computer. Glot International, 5, 341–345.
Boula de Mareüil, P., & Vieru-Dimulescu, B. (2006). The contribution of prosody to the perception of foreign accent. Phonetica, 63, 247–267.
Collentine, J., & Freed, B. (2004). Learning context and its effect on second language acquisition. Studies in Second Language Acquisition, 26, 153–171.
de Bot, K. (1983). Visual feedback of intonation: Effectiveness and induced practice behaviour. Language and Speech, 26, 331–350.
Dellwo, V. (2006). Rhythm and speech rate: A variation coefficient for ΔC. In P. Karnowski & I. Szigeti (Eds.), Language and language processing (pp. 231–241). Peter Lang.
Deterding, D. (2001). The measurement of rhythm: A comparison of Singapore and British English. Journal of Phonetics, 29(2), 217–230.
Dewey, D., Brown, J., & Eggett, D. (2012). Japanese language proficiency, social networking, and language use during study abroad: Learners’ perspectives. The Canadian Modern Language Review, 68, 111–137.
Flege, J. E. (1988). Factors affecting degree of perceived FA in English sentences. Journal of the Acoustical Society of America, 84, 70–79.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–269). York Press.
Flege, J. E., Munro, M. J., & MacKay, I. R. A. (1995). Factors affecting strength of perceived FA in a second language. Journal of the Acoustical Society of America, 97, 3125–3134.
Flege, J. E., Bohn, O.-S., & Jang, S. (1997). Effects of experience on non-native speakers’ production and perception of English vowels. Journal of Phonetics, 25, 437–470.
Flege, J., Birdsong, D., Bialystok, E., Mack, M., Sung, H., & Tsukada, K. (2006). Degree of FA in English sentences produced by Korean children and adults. Journal of Phonetics, 34, 153–175.
Freed, B., Dewey, D., Segalowitz, N., & Halter, R. (2004). Language contact profile. Studies in Second Language Acquisition, 26, 349–356.
Fuchs, R. (2015). You’re not from around here are you? A dialect discrimination experiment with speakers of British and Indian English. In E. Delais-Roussarie, M. Avanzi, & S. Herment (Eds.), Prosody and Language in Contact (pp. 123–148). Springer.
Fuchs, R., & Wunder, E. M. (2015). A sonority-based account of speech rhythm in Chinese learners of English. In U. Gut, R. Fuchs, & E. M. Wunder (Eds.), Universal or diverse paths to English phonology (pp. 165–184). de Gruyter.
Fuchs, R. (2014). Towards a perceptual model of speech rhythm: Integrating the influence of f0 on perceived duration. In H. Li, H. Meng, B. Ma, E. Chng, & L. Xie (Eds.), Proceedings of Interspeech 2014 (pp. 1949–1953).
Grover, C., Jamieson, D. G., & Dobrovolsky, M. B. (1987). Intonation in English, French and German: Perception and production. Language and Speech, 30, 277–296.
Hernandez, T. (2010). The relationship among motivation, interaction, and the development of second language oral proficiency in a study-abroad context. The Modern Language Journal, 94, 600–617.
Kahn, D. (1976). Syllable-based generalizations in English phonology. Doctoral dissertation, MIT.
Kawase, S., Kim, J., & Davis, C. (2016). The influence of second language experience on Japanese-accented English rhythm. Proceedings of Speech Prosody, 2016, 746–750.
Maastricht, L., Krahmer, E., Swerts, M., & Prieto, P. (2018). Learning direction matters: A study on L2 rhythm acquisition by Dutch learners of Spanish and Spanish learners of Dutch. Studies in Second Language Acquisition, in press.
Mok, P., & Dellwo, V. (2008). Comparing native and non-native speech rhythm using acoustic rhythmic measures: Cantonese, Beijing Mandarin and English. In 4th conference on speech prosody (pp. 423–426). Campinas, Brazil.
Munro, M. (1995). Nonsegmental factors in foreign accent: Ratings of filtered speech. Studies in Second Language Acquisition, 17, 17–34.
Munro, M., & Derwing, T. (2008). Segmental acquisition in adult ESL learners: A longitudinal study of vowel production. Language Learning, 58(3), 479–502.
Oyama, S. (1976). A sensitive period for the acquisition of a nonnative phonological system. Journal of Psycholinguistic Research, 5, 261–285.
Polyanskaya, L., Ordin, M., & Busa, M. G. (2016). Relative salience of speech rhythm and speech rate on perceived foreign accent in a second language. Language and Speech, 60(3), 333–355.
Purcell, E. T., & Suter, R. W. (1980). Predictors of pronunciation accuracy: A reexamination. Language Learning, 30, 271–287.
Quené, H., & Orr, R. (2014). Long-term convergence of speech rhythm in L1 and L2 English. In Proceedings of speech prosody 7 (pp. 342–345).
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73(3), 265–292.
Riney, T. J., & Flege, J. E. (1998). Changes over time in global foreign accent and liquid identifiability and accuracy. Studies in Second Language Acquisition, 20, 213–244.
Saito, K. (2015). Experience effects on the development of late second language learners’ oral proficiency. Language Learning, 65(3), 563–595.
Segalowitz, N., & Freed, B. (2004). Context, contact, and cognition in oral fluency acquisition: Learning Spanish in at home and study abroad contexts. Studies in Second Language Acquisition, 26, 173–199.
Statistics Canada. (2012). 2011 Census.
Tahta, S., Wood, M., & Lowenthal, K. (1981). FA: Factors relating to transfer of accent from the first to the second language. Language and Speech, 24, 265–272.
Tajima, K., Port, R., & Dalby, J. (1997). Effects of temporal correction on intelligibility of foreign-accented English. Journal of Phonetics, 25, 1–24.
Tsukada, K., Birdsong, D., Bialystok, E., Mack, M., Sung, H., & Flege, J. (2005). A developmental study of English vowel production and perception by native English adults and children. Journal of Phonetics, 33, 263–290.
Wayland, R. (1997). Non-native production of Thai: Acoustic measurements and accentedness ratings. Applied Linguistics, 18, 345–373.
White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
Wiget, L., White, L., Schuppler, B., Grenon, I., Rauch, O., & Mattys, S. (2010). How stable are acoustic metrics of contrastive speech rhythm? Journal of the Acoustical Society of America, 127(3), 1559–1569.
Winitz, H., Gillespie, B., & Starcev, J. (1995). The development of English speech patterns of a 7-year-old Polish speaking child. Journal of Psycholinguistic Research, 24, 117–143.
(Re-)viewing the Acquisition of Rhythm in the Light of L2 Phonological Theories
Lukas Sönning
Abstract Previous work on non-native speech rhythm has often drawn on L2 phonological theory for the interpretation of findings. The explicit confrontation of theory-derived hypotheses with data remains scarce, however. This paper illustrates how a hypothetico-deductive approach can contribute to our understanding of L2 speech rhythm. We consider cross-sectional data on prominence alternations in German learner speech from the viewpoint of two dynamic frameworks: The Ontogeny Phylogeny Model (OPM) and the Linguistic Theory of L2 Phonological Development (LTD). While both theories deal with L1-independent, universal forces in L2 acquisition, the OPM further considers the role of L1 transfer, similarity, and markedness. The predictions we formulate based on the two models lead us to pursue distinct methodological strategies. While our reading of the OPM prompts us to measure speech rhythm as a single, global category of speech, the LTD suggests a more nuanced, componential approach to L2 rhythm. Our application of the OPM confronts us squarely with the limited utility of rhythm metrics for L2 speech research and points to a number of issues at the theory-data interface. Overall, the LTD generates more informative predictions and provides a richer framework for the empirical study of prominence grading in L2 speech.
Keywords Speech rhythm · German Learner English · Rhythm metrics · Vowel duration · Vowel reduction · L2 phonology · L2 acquisition · Theory · Interlanguage development
I wish to thank the editor, two anonymous reviewers, and Ole Schützler for helpful comments and attention to detail.
L. Sönning, Institute of English and American Studies, University of Bamberg, An der Universität 9, 96047 Bamberg, Germany
© Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_6
1 Introduction
Current empirical research on speech rhythm in second language (L2) acquisition can be described as predominantly descriptive in the sense that data analysis is not explicitly guided by theories of L2 phonological acquisition. While previous work has frequently turned to theoretical frameworks for a post-hoc interpretation or explanation of findings (e.g. Li & Post, 2014; Ordin & Polyanskaya, 2014), this contrasts with the approach taken in analytical studies, which rely on theory to generate hypotheses and then confront these with data (see, e.g. Colantoni et al., 2015: 31). The aim of this paper is to illustrate how analytical approaches can contribute to our understanding of the L2 acquisition of speech rhythm. While it has been argued that existing models of L2 phonology are ill-suited to account for prosodic phenomena (Li & Post, 2014: 224), this view may require qualification: Among others, the Ontogeny Phylogeny Model (OPM; Major, 2001) and the Linguistic Theory of L2 Phonological Development (LTD; James, 1988) provide rich frameworks for studying prominence alternations in speech. This paper aims to illustrate that the OPM and LTD not only encourage new methodological approaches beyond the well-trodden path of rhythm metrics but also constitute unified frameworks for the (contrastive) study of speech rhythm across different varieties of English. After a survey of theoretical and empirical approaches to speech rhythm (Sect. 2), Sect. 3 offers a contrastive analysis of prosodic properties of English and German. In Sect. 4, the central tenets of the OPM and LTD are outlined. Section 5 then presents our case study, a set of cross-sectional data on German Learner English (GLE). Following this, both models are applied to the development of speech rhythm in GLE. To this end, theoretical assumptions of the OPM (Sect. 6) and LTD (Sect. 7) are translated into predictions about timing patterns in learner speech, which are then compared to empirical data. Section 8 closes with a general summary and discussion, which recapitulate methodological and theoretical implications for research on L2 speech rhythm.
2 Speech Rhythm Research
2.1 Theoretical Approaches
Over the past 80 years, the notion of speech rhythm has been approached from different perspectives. In the following, three views will be discussed: The isochrony view, the phonological view, and the prosodic view. One of the earliest conceptualizations is based on the notion of isochrony and focuses on the presumed temporal regularity of prominent units (James & Arthur, 1940; Pike, 1945; Abercrombie, 1967). In this traditional isochrony view of rhythm, languages fall into two broad classes, ‘stress-timed’ and ‘syllable-timed’. It is
assumed that in ‘stress-timed’ languages, a perceived regularity applies to the duration of feet (i.e. inter-stress intervals) while in ‘syllable-timed’ languages it is syllables that tend to be of equal duration. Acoustic studies have not corroborated the existence of such isochronous patterns in spoken language (e.g. Bolinger, 1965; Roach, 1982 for English; Borzone de Manrique & Signorini, 1983 for Spanish; Wenk & Wiolland, 1982 for French). Nevertheless, a number of perceptual studies have reported that listeners are able to discriminate between languages traditionally assigned to different classes (e.g. Nazzi et al., 1998; Ramus et al. 2003; but also see White et al., 2012). In the absence of empirical evidence for the durational equalization of feet or syllables, the notion of isochrony has given way to alternative accounts. In what is commonly referred to as the phonological view of rhythm, Dauer (1983) proposed that rhythmic differences between languages reflect a number of lower-level phonological properties. These include syllable structure, length as a distinctive feature in vowels, and the (non-)existence of vowel reduction. The percept of different rhythm classes, it is argued, results from a combination of phonetic and phonological properties. In general, stress-timed languages show vowel reduction in unstressed syllables and greater phonotactic complexity, thus permitting a larger variety of onset and coda clusters. Recent endeavors have extended this componential view of rhythm to include structural properties at the prosodic level of representation (e.g. Prieto et al., 2012; White, 2014; White et al., 2012). In this prosodic view, a focus has been on the durational marking of prosodic heads and edges, that is, local lengthening effects that are observable in prominent elements and at the boundaries of intonation phrases. In general, accented and phrase-final syllables are lengthened relative to unaccented and non-final syllables, respectively. 
It has been argued that the degree of prosodic length marking contributes to perceived differences between rhythm classes. Prieto et al. (2012), for instance, noted that the lengthening effect in accented and final syllables is much greater in English than in Spanish or Catalan. Utterance-final lengthening has also been observed to affect perceptual discrimination between languages such as English and Spanish (White et al., 2012). In summary, our current understanding of speech rhythm suggests that the labels ‘stress-timed’ and ‘syllable-timed’¹ may be considered as cover terms for a range of phonological and prosodic properties, or components, that are shared by languages traditionally assigned to the same end of the continuum. The componential perspective puts forward a set of features for our contrastive analysis of English and German in Sect. 3. First, however, let us turn our attention to instrumental approaches to speech rhythm, that is, different attempts by researchers to quantify the rhythmic properties of speech.

1 While the terms ‘stress-timing’ and ‘syllable-timing’ are not descriptively adequate, they may be considered, at a general level of classification and comparison, a useful shorthand description. In the interest of simplicity, these labels will be used to refer to rhythm prototypes whose phonological and prosodic profiles are characteristic of language varieties that have been traditionally assigned to these rhythm classes. I will use single quotation marks as a signal to distance myself from the original, literal meaning of these terms.
L. Sönning
2.2 Empirical Approaches

Recent empirical work on speech rhythm relies quite strongly on the application of rhythm metrics to measure rhythmic characteristics of speech (see Fuchs, 2016, Chap. 3, for a comprehensive overview and discussion). These measures build on insights gained from a componential view of rhythm and condense the degree of prominence variation in an utterance into a single score. This score expresses differences on a continuous scale and thereby offers a more fine-grained description than a binary distinction between ‘stress-timed’ and ‘syllable-timed’. In general, rhythm metrics differ along the following lines:

(i) Focal acoustic correlate: While rhythm metrics can be applied to various acoustic correlates of prominence such as intensity and fundamental frequency, most work has so far relied on durational measurements.
(ii) Unit of analysis: In order for rhythm metrics to be applied, speech must be segmented into units, which then form the basis for analysis. These units can be vocalic and consonantal intervals or higher-level structures such as syllables or feet.
(iii) Level of comparison: To assess variation in prominence, comparisons can be made locally, that is, between adjacent units, or globally, across all units of analysis. Only the former level of comparison takes into account the linear arrangement of units.
(iv) Quantification: At the local level, units of analysis can be compared in absolute terms, where the focus is on absolute differences (e.g. a difference of 50 ms), or in relative terms, where relative differences are used (e.g. the duration of two units differs by a factor of 1.2, or by 20 percent). At the global level, metrics can be subdivided into dispersion measures, which quantify prominence variation in a batch of units (e.g. expressed as a standard deviation) and proportion measures, which document the share of certain unit types in the utterance.

Members of the family of rhythm metrics arise from different combinations of these attributes. To illustrate, let us briefly discuss some commonly used scores, which rely on durational measurements of vocalic intervals. Global measurements include the proportion of vocalic intervals (%V) in an utterance and the standard deviation of vocalic interval durations, originally as a raw (ΔV, Ramus et al., 1999) and now usually as a rate-normalized measure (VarcoV, Dellwo & Wagner, 2003). Low %V values and high VarcoV values indicate a high degree of vowel reduction and/or accentual lengthening (i.e. ‘stress-timing’ properties).
The group of local measures, which rely on differences between successive interval durations, includes Low and Grabe’s (1995) Pairwise Variability Index (PVI), which is an average of the absolute differences of successive intervals. Thus, a higher degree of temporal variability is reflected in higher PVI values. A normalized version, the nPVI, was proposed by Low et al. (2000) to adjust for differences in speech rate. Table 1 gives a summary of the nPVI-V, %V, and VarcoV. We will encounter these metrics again in the next section, which offers a contrastive analysis of the rhythm profiles of English and German.
Table 1 Comparison of rhythm metrics focusing on the duration of vocalic intervals

Metric   Correlate  Unit of analysis   Level   Quantification        Reference
nPVI-V   Duration   Vocalic intervals  Local   Absolute differences  Low et al., 2000
%V       Duration   Vocalic intervals  Global  Proportion            Ramus et al., 1999
VarcoV   Duration   Vocalic intervals  Global  Dispersion            Dellwo & Wagner, 2003
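The three metrics summarized in Table 1 can be illustrated with a minimal Python sketch. It follows the commonly cited definitions (%V as a proportion of total duration, VarcoV as the standard deviation divided by the mean, nPVI-V as the mean pairwise difference scaled by the pair mean, each multiplied by 100). The function names and the duration values are hypothetical, and the use of the population standard deviation is an assumption; published implementations may differ in such details.

```python
from statistics import mean, pstdev

def percent_v(vocalic, consonantal):
    """%V: share of total duration taken up by vocalic intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100 * sum(vocalic) / total

def varco(durations):
    """Varco: standard deviation relative to the mean, times 100.
    Assumption: population SD; some implementations use the sample SD."""
    return 100 * pstdev(durations) / mean(durations)

def npvi(durations):
    """nPVI: mean difference between successive intervals, each
    difference scaled by the mean of the pair, times 100."""
    pairs = list(zip(durations, durations[1:]))
    return 100 * mean(abs(a - b) / ((a + b) / 2) for a, b in pairs)

# Hypothetical interval durations in ms (not data from the study)
vocalic = [100, 50, 100, 50]
consonantal = [80, 70, 90, 60]

print(round(percent_v(vocalic, consonantal), 1))
print(round(varco(vocalic), 1))
print(round(npvi(vocalic), 1))
```

Because the nPVI compares only neighbouring intervals, reordering the same durations changes its value, whereas %V and VarcoV stay the same. This is the local versus global distinction drawn in the text.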
3 Contrastive Analysis: English and German Speech Rhythm

English and German are both considered ‘stress-timed’ (Giegerich, 1992; Kohler, 1995). From a componential perspective, then, the two languages share a number of phonological and prosodic properties that are characteristic of this rhythm class. This section compares English and German in terms of the phonological and prosodic features discussed above and offers a survey of relevant quantitative work.

Let us first turn to phonetic and phonological components. Both languages have a complex syllable structure (König & Gast, 2009: 38, 42; Maddieson, 2013) and phonetically distinguish stressed and unstressed syllables in terms of quality and quantity. Both have the short central vowel [ə] and show schwa deletion and syllabic consonants as extreme forms of reduction. However, the distribution of schwa vowels in German is restricted (Kaltenbacher, 1998). In simple lexemes, they only occur in stem-final syllables (Hase [ˈhaːzə]) and inflectional affixes (ge-dacht [gəˈdaxt]; denk-e [ˈdɛŋkə]). Differences in [ə]-distribution are also found in complex lexemes. In both languages, morphophonological processes apply to derived words such as photography/Fotografie. In German, vowel reduction as a result of stress shift can be observed as a shortening of long vowels (Foto [ˈfoːto] – Fotograf [fotoˈgraːf] – Fotografie [fotograˈfiː]) but vowels are never reduced to schwa in these contexts. The morphophonology of English, on the other hand, produces unstressed vowels that are shortened and centralized to(wards) schwa (photo [ˈfəʊtəʊ] – photograph [ˈfəʊtəgrɑːf] – photography [fəˈtɑgrəfi]). In general, therefore, unstressed vowels in polysyllabic lexemes show a higher degree of reduction in English.

In connected speech, function words can undergo reduction in both languages (und [ʊnt] → [(ə)n(t)]; and [ænd] → [(ə)n(d)]). In German, however, these reduction processes are stylistically marked: they only occur in informal speech (Kohler, 1995; Wesener, 1999). In English, the weak form of function words (which involves [ə] in many cases) is the unmarked variant, even in formal speech. Thus, while both languages show reduction in function words, a centralization of vowel quality is much more common in English.

At the prosodic level, accentual and final lengthening have been identified as key correlates of rhythm classes. While accentual lengthening is observed crosslinguistically, its magnitude varies between languages.

Table 2 Summary of the contrastive analysis

Feature                                   English  German
Phonological components
  Complex syllable structure              ++       ++
  Vowel reduction: Length                 ++       ++
  Vowel reduction: Quality                ++       +
  Phonological vowel length distinctions  (+)      +
Prosodic components
  Accentual lengthening                   ++       +
  Final lengthening                       +        +

In connected speech utterances, four levels of syllable prominence are often distinguished: (i) unstressed, (ii) secondary stressed and unaccented, (iii) primary stressed and unaccented, and
(iv) accented (Vanderslice & Ladefoged, 1972; Gussenhoven, 2004: 20; Fletcher, 2010: 530). The distinction between prominence grading at the lexical and postlexical level is commonly captured by the labels ‘stress’ and ‘accent’, respectively. In both languages, accented syllables are longer. Comparing the duration of stressed to unstressed syllables, Delattre (1956: 189) reports a ratio of 1.60 for English and 1.44 for German. Similar values were presented by Li (2014), who compared AmE and German speech and observed ratios of 1.55 and 1.43, respectively. In terms of durational marking of prosodic edges, English and German behave similarly, as shown by the lengthening effects in English (1.53) and German (1.50) observed by Delattre (1965). These results were corroborated by Li (2014), who found ratios of 1.63 and 1.67, respectively. Delattre (1965) further reported on the combined effect of accentual and final lengthening, which was greater in English for both open (2.78 vs. 2.25) and closed syllables (2.63 vs. 2.06). Table 2 gives a summary of the structural profiles suggested by a componential contrastive analysis. While similarities outweigh differences, ‘stress-timing’ properties that are more pronounced in English include the reduction of vowel quality in unstressed syllables and the degree of accentual lengthening. As mentioned in the preceding section, rhythm metrics aim to quantify these properties. Given the similarities between English and German, we would expect the two languages to exhibit similar, more ‘stress-timed’ scores relative to languages that have been traditionally classed as ‘syllable-timed’ (such as Spanish, for instance), with prominence grading being perhaps slightly more pronounced in English. Since the focus in this paper is on durational properties of vocalic intervals, our literature survey concentrates on the metrics summarized in Table 1 (nPVI-V, %V, and VarcoV). 
Figure 1 offers a graphical summary of measurements reported in 15 studies (see Appendix 1 for details). For each metric, the y-axis is arranged to reflect ‘syllable-timing’ values at the bottom and ‘stress-timing’ values at the top. Differences in materials and tasks contribute to the variation among empirical estimates (see Arvaniti, 2012). Data points from the same study and/or condition, however, are directly comparable and therefore connected with lines. Figure 1 demonstrates some gross trends:
[Figure 1: three panels (nPVI-V, %V, VarcoV), each plotting reported values for German, English, and Spanish]
Fig. 1 Graphical summary of rhythm metrics reported in empirical work on German, English, and Spanish (15 studies; see Appendix 1 for details).²
• English/German versus Spanish: Overall, nPVI-V and %V pattern in the expected direction: Spanish scores tend toward the lower end. VarcoV appears to be less successful at differentiating between representatives of different rhythm classes.
• English versus German: As expected, differences between English and German are minor. On average, English shows slightly higher durational variability of vocalic intervals (nPVI-V).

The next section addresses theoretical work on L2 phonological acquisition. Existing contributions will be examined from the viewpoint of ‘rhythmic acquisition’, with an eye to whether or not they are capable of accounting for the acquisition of suprasegmental prominence variation.
2 Images with the symbols in the figure caption have been published under the Creative Commons Attribution 4.0 licence (CC BY 4.0, http://creativecommons.org/licenses/by/4.0) in the accompanying OSF project (https://osf.io/25kq4/).

4 Speech Rhythm and L2 Phonological Theories

4.1 Structural Scope of Theoretical Contributions

Theoretical contributions to the field of L2 phonological research can be grouped along several lines, including their scope, by which we mean the types of structures to which they extend. Table 3 lists several frameworks and indicates whether a particular approach covers segmental and/or prosodic units (see Sönning, 2020: 5–35). Speech rhythm, a special case of the latter level of analysis, is listed separately. The overview suggests that segmental structures receive more extensive coverage. As for speech rhythm, several contributions offer guidance for the study of L2 speech. It should be noted, however, that the influential family of perception-based models (e.g. Best, 1995; Flege, 1995) is concerned exclusively with individual segments.

Table 3 Structural scope of L2 phonological theories

Contribution                                      Reference
Contrastive Analysis Hypothesis                   Lado, 1957
Articulatory Settings                             Honikman, 1964
Speech Learning Model                             Flege, 1995
Perceptual Assimilation Model                     Best, 1995
Desensitization Hypothesis                        Bohn, 1995
Feature Competition Model                         Hancin-Bhatt, 1994
Markedness Differential Hypothesis                Eckman, 1977
Structural Conformity Hypothesis                  Eckman, 1991
UG Model of Stress Acquisition                    Archibald, 1994
Natural Model of L2 Phonological Acquisition      Dziubalska-Kołaczyk, 1990
Naturalness Differential Hypothesis               Schmid, 1997
Functional Model of Phonological Acquisition      Boersma, 1998
Gradual Diffusion Model                           Gatbonton, 1978
Model of Sociolinguistic Variation                Fasold & Preston, 2007
Linguistic Theory of L2 Phonological Development  James, 1988
Ontogeny Phylogeny Model                          Major, 2001
Dynamic Systems Theory                            De Bot et al., 2007
Model of Segmental Acquisition                    Colantoni & Steele, 2008
Phonological Interference Model                   Brown, 1998
Similarity Differential Rate Hypothesis           Major & Kim, 1996

[The Segmental, Prosodic, and Rhythm columns, which mark the coverage of each framework with ● or (●), are not recoverable from the source layout.]

Note Parentheses indicate that while the framework may be argued to extend to rhythm, this would require the stipulation of non-trivial auxiliary assumptions or premises, for which the literature gives little guidance

In the following, the tenets of Major’s (2001) Ontogeny Phylogeny Model and James’
(1988) Linguistic Theory of L2 Phonological Development will be discussed and applied to rhythmic properties of German Learner speech.³
4.2 The Ontogeny Phylogeny Model (Major, 2001)

Major’s (2001) Ontogeny Phylogeny Model (OPM) combines theoretical insights into transfer, similarity, and typological markedness into a formal model outlining the dynamic nature of interlanguage (IL) development. The OPM rests on two basic assumptions: (i) a learner’s IL consists of three structural components (L1, L2, U) and (ii) the relationship between these components changes systematically over time. Thus, it is held that every structure found in learner speech is attributable to one of three sources: It may be a transferred L1 structure (L1), a target language structure (L2), or a universal structure that is not part of L1 or L2 (U). The latter component is defined by Major (2001: 83) as ‘the universal set of properties of the human language capacity and the resulting universal characteristics of languages, […] [including] anatomical, functional and processing properties of the human mind’. The OPM stipulates an organized interplay of L1, L2, and U over the course of IL development, which depends on the type of structure that is acquired (and also on speaking style). These assumptions are expressed as four ‘corollaries’, which are shown graphically in Fig. 2. The basic chronological assumption states that, over the course of five hypothetical developmental stages, L1 influence decreases while L2 structures increase; U first increases and then decreases. As panels (b) and (c) show, this interplay follows a different pattern for similar and marked structures. Compared to ‘normal’ language structures, i.e. units that do not classify as marked or as similar to an L1 counterpart, marked structures are acquired at a slower rate (7 vs. 5 stages). While equivalent L2 trajectories are posited for similar and marked structures, the relative influence of L1 and U differs: In the acquisition of similar structures, L1 transfer is more persistent; U, on the other hand, exerts no notable influence.
For marked structures, transfer is assumed to decrease rapidly, while U rises to exert considerable influence. The OPM thus brings together two well-documented constraints on interlanguage—transfer and universals—and states that their weight depends on developmental stage and properties of the focal structure. Whether L1 influence or language-universal biases (or both) are observable in learner speech therefore depends on characteristics of the learner and the structure.
3 A reviewer raised the question of why these two models were chosen. Since the model proposed by Major (2001) may be considered a unification of several contributions including Lado (1957) and Eckman (1977, 1991), it covers the explanatory notions proposed in those accounts (i.e. transfer, markedness, and language universals). The only remaining model that is directly applicable to the acquisition of rhythm, then, is Archibald (1994), which is restricted to prominence grading at the lexical level, however.
[Figure 2: three panels, (a) Normal structure, (b) Similar structure, (c) Marked structure, each plotting the degree of influence of L1, L2, and U against developmental stage]
Fig. 2 Corollaries of the OPM: Interplay of transferred structures (L1), target language structures (L2), and language universals (U) in the course of L2 phonological acquisition of normal (left), similar (center) and marked structures (right); from Sönning (2020: 31).
4.3 The Linguistic Theory of L2 Phonological Development (James 1988)

James’ (1988) Linguistic Theory of L2 Phonological Development (LTD) focuses on the interplay of different levels of phonological representation in the course of L2 development. Three levels—the lexical, prosodic, and rhythmic—are posited to interact systematically. For the prosodic level of representation, a non-linear framework similar to metrical phonology is employed (see James, 1986 for details). As illustrated in Fig. 3, it comprises seven layers, with binary strength values (s-marks) assigned to constituent nodes at each level. These add up to determine the structural weight of a syllable. At the rhythmic level of representation, the units at each layer are described with the generalized scheme (proclitic) head (enclitic), the head being the obligatory element. These rhythmic features reflect speech rate: Proclitics (P) show increased tempo; heads (H) and enclitics (E) are characterized by decreased tempo. The suprasyllabic rhythmic structure for the example sentence is shown in Fig. 4. Similar to prosodic s/w-marks, rate features add up to determine the tempo, or duration, of a syllable. The bottom-up acquisition of rhythmic structure thus yields increased temporal differentiation in the speech stream.

[Figure 3: tree diagram assigning s/w marks at the Sentence (Root), Clause, Phrase, Word, Formative, and Syllable layers for the example sentence ‘He sprang over the gate and ran to the kitchen’]
Fig. 3 Illustration of supra-syllabic s/w-marking at five hierarchical levels of prosodic representation (after James, 1986, 1988)

[Figure 4: grid assigning the labels P, H, and E at the Sentence, Clause, Phrase, Word, Formative, and Syllable layers for the same example sentence]
Fig. 4 Illustration of rhythmic structure: proclitics (P, faster tempo), heads (H, slower tempo), and enclitics (E, slower tempo) (after James, 1986, 1988)

James (1988) posits that L2 phonological acquisition follows a universal progression from the lexical to the prosodic to the rhythmic level. The lower levels provide
substance for the acquisition of higher-level structure: Higher-level s-marks rest on lower-level s-marks and the same holds for heads. The LTD also predicts acquisition sequences for the prosodic and rhythmic levels. At the prosodic level, s-marked units are acquired before w-marked units due to ‘logical priority’ (James, 1988: 155); at the rhythmic level, the properties of peaks are acquired earlier than those of proclitics. The bottom-up advancement of L2 phonological acquisition generates developmental predictions: With acquisition of the rhythmical level requiring sufficient prosodic structure, and the prosodic representation in turn commencing after the lexical level has been established, rhythmic and prosodic patterning are expected to emerge relatively late. In general, then, strength and rate asymmetries grow during IL development. With structural strength being cumulative, an increasing s/w-differentiation will yield a gradual increase in strength effects in syllables and segments. The same is true for rate values that are determined based on rhythmical structure. The postulated suprasegmental acquisition sequence ‘peaks before proclitics’ and ‘s-marked units before w-marked units’ suggests a bias toward overarticulation at early stages of L2 acquisition; backgrounding or reduction of w-marked units and proclitics is expected to emerge at a later stage. Before we discuss the empirical application of the OPM and the LTD in more detail, Sect. 5 introduces the data used in the following analyses.
5 Data

A total of 87 speakers were recorded: 62 German learners and 25 native speakers of English (11 American and 15 British English subjects⁴). The data were collected as part of a more comprehensive investigation of phonological variation in GLE (Sönning, 2020) and therefore not specifically designed for the purposes of the present study. The sampling of instructional-setting German learners aimed at capturing a broad range of proficiency levels. Biographical details of subjects are given in Table 4.

4 All BrE and AmE informants reported that both of their parents’ native language was English. Neither group can be considered as representing a well-defined variety of English. Nevertheless, with the reported minimal education level of all subjects being a bachelor’s degree, the native speakers recorded in this study may be described as speaking an educated variety of English. Most BrE informants (M Age = 27; SD = 6) had grown up in the London area and the Midlands. Native speakers of AmE (M Age = 24; SD = 5) were predominantly from the northeastern part of the US.

Table 4 Descriptive statistics for the sample of 62 German learners

Variable  Distribution
Gender    39 female (63%); 23 male (37%)
Age       Mean = 18; SD = 4; Min = 11; Max = 30
Grade     6 (n = 4); 7 (12); 8 (1); 9 (3); 10 (6); 11 (1); 12 (9); tertiary education (26)
AOL       Age at onset of learning: Mean = 10; SD = 1.5; Min = 3; Max = 13
FAR       Foreign accent rating, 12-point scale (scores from 1 to 12): Mean = 6.0; SD = 2.5; Min = 1.6; Max = 10.8

A foreign accent rating (FAR) was used to obtain a global accentedness score
for each learner. Two British English native speakers rated the degree of foreign accent on a 12-point scale from 1 (‘strong foreign accent’) to 12 (‘native speaker level’), based on 4 utterances per speaker. The averages of the raw scores, which range from 1.6 to 10.8, were converted to z-scores, which, by definition, have mean 0 and standard deviation 1.

As for the recordings, a reading task was used to elicit 10 sentences embedded in short question–answer sequences (see Appendix 2). This aimed at eliciting consistent accentual patterns. Participants were given time to familiarize themselves with the materials and then asked to read both turns of the dialogue; they were allowed to correct themselves and re-read a sequence. The sentences included 105 vocalic intervals in total, producing 6509 measurements for learners (1 missing) and 2618 measurements for native speakers (7 missing).

The acoustic analysis was carried out in Praat (Boersma & Weenink 2014) and the data were segmented manually following the principles outlined in Machač and Skarnitzl (2009). Specifically, the boundaries of vocalic intervals were determined using the onset and offset of the second formant and changes in waveform amplitude. Onset /w j r l/ were assigned to consonantal intervals; coda /l/ was labeled as consonantal (except when syllabic); coda /r/ was treated as consonantal, but r-coloring was treated as part of the nucleus /ɝ ɚ/. The interval between stop release and the onset of voicing was treated as part of the consonantal stretch. To facilitate the rescaling of the durational measurements (see below), deleted vowels were coded as having a duration of 2 ms. The complete data are available from TROLLing (Sönning, 2022).
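Two bookkeeping steps from this section can be sketched in Python: the measurement counts implied by 105 vocalic intervals per speaker, and the conversion of averaged accent ratings to z-scores. The speaker counts, interval count, and missing values come from the text; the ratings, the function names, and the choice of the sample standard deviation are assumptions made for illustration only.

```python
from statistics import mean, stdev

N_INTERVALS = 105  # vocalic intervals per speaker, as stated in the text

def expected_measurements(n_speakers, n_missing):
    """Number of available measurements for a group of speakers."""
    return n_speakers * N_INTERVALS - n_missing

def z_scores(values):
    """Standardize values to mean 0 and SD 1.
    Assumption: the sample standard deviation is used."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

learner_count = expected_measurements(62, 1)  # learners, 1 missing
native_count = expected_measurements(25, 7)   # native speakers, 7 missing

# Hypothetical ratings by two raters for four learners (1 to 12 scale)
rater_a = [2, 5, 7, 11]
rater_b = [3, 6, 8, 10]
averaged = [mean(pair) for pair in zip(rater_a, rater_b)]
standardized = z_scores(averaged)

print(learner_count, native_count)
print([round(z, 2) for z in standardized])
```

The two counts reproduce the figures of 6509 and 2618 measurements reported above.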
6 L2 Rhythm in German Learner English: An OPM Perspective

6.1 Working Assumptions

In order to derive OPM predictions about prominence grading in GLE, assumptions must be made about (i) the status of English speech rhythm as a normal, similar, or marked category of speech, and (ii) the nature of L1, L2, and U influences on speech production.
A classification of English speech rhythm as ‘similar’ to German speech rhythm may seem warranted in the light of our contrastive analysis. It should be noted, however, that in the field of L2 phonology research the notion of similarity is rooted in perception-based models of L2 phonological acquisition. Perceptual similarity statements rely on the assumption that listeners are able to selectively and contrastively perceive relevant units in the speech stream. It is unclear whether (nonnative) listeners can (and do) consciously attend to structures above the level of the segment and whether they are able to make similarity judgements. For the present, we will therefore exclude perceptual similarity as a relevant structural property.

As for markedness, ‘stress-timing’ can be considered more marked than ‘syllable-timing’ on several grounds. Research into the L1 acquisition of ‘stress-timed’ languages indicates that children develop from ‘syllable-timed’ to ‘stress-timed’ speech (Allen & Hawkins, 1980; Cruttenden, 1979), which has received support from acoustic studies using rhythm metrics (Grabe et al., 1999; Bunta & Ingram, 2007; Payne et al., 2011). Certain properties of ‘stress-timed’ languages appear to be more marked than those of ‘syllable-timed’ languages. Thus, the reduction of weak syllables has been observed to emerge relatively late (Allen & Hawkins, 1980), as children selectively attend to stressed syllables (Blasdell & Jensen, 1970; Risley & Reynolds, 1970). It has also been noted that children acquiring a ‘syllable-timed’ language show adult-like timing patterns at a younger age, both at the word (Vihman et al., 2006) and utterance level (Grabe et al., 1999). Research on rhythm development in L1 acquisition has shown parallels between children from typologically different languages (Grabe et al., 1999 for English, German and French; Payne et al., 2011 for English, Spanish, and Catalan).
This suggests that ‘syllable-timing’ properties may, in general, be considered the default setting in L1 acquisition. Biases toward ‘syllable-timing’ properties in World Englishes are also consistent with their status as the unmarked type of rhythmic organization. Thus, Nishihara & van de Weijer (2012) consider ‘asymmetric borrowing’, that is, the tendency in varieties of English to adopt ‘syllable-timed’ properties instead of those typical for L1 varieties such as American or British English, as an indication of markedness. In light of these observations, the type of prominence variation found in L1 English will be considered, collectively, as a marked property of speech. Next, we need to state our assumptions about the nature of L1, L2, and U influence in the acquisition of English rhythm by German learners. Given the findings of our contrastive analysis, L1 and L2 should produce similar surface patterns, with L2 showing a slightly higher degree of prominence grading. The role of U, on the other hand, can be derived from markedness considerations. Assuming that U reflects universal influences that are also operative in L1 acquisition, its effect should surface in a tendency toward prominence-leveling in speech production. This assumption is coherent with previous empirical research on reduction phenomena in interlanguage phonology, which suggests this to be an area of difficulty in learner speech (Aoyama & Guion, 2007; Flege & Bohn, 1989; Gut, 2006). It has also been noted that overarticulation of unstressed syllables is a general feature of non-native speech (Barry, 2007).
[Figure 5: schematic curve of prominence grading across developmental stages, on a vertical axis running from ‘syllable-timed’ (bottom) to ‘stress-timed’ (top)]
Fig. 5 Predictions based on the OPM: U-shaped developmental pattern.
In summary, the application of the OPM to rhythmic properties of GLE will rely on the following assumptions:

• English speech rhythm is a marked category of speech.
• L1 and L2 will surface in a tendency toward prominence variation, i.e. ‘stress-timing’ properties.
• U will surface in a tendency toward prominence leveling, i.e. ‘syllable-timing’ properties.
6.2 Predictions

Based on the tenets of the OPM and the assumptions stated in the preceding section, we can formulate the following expectations: Initial transfer from L1 should surface in (near) target-like prominence variation. The pronounced influence of U, which is expected due to the relatively more marked status of ‘stress-timing’ patterns, is predicted to surface in a tendency toward prominence-leveling at intermediate stages. The final increase in L2-like patterns and the decreasing influence of U should yield more temporal differentiation between units in the speech stream. In short, we expect prominence grading to follow a U-shaped pattern (see Fig. 5).
6.3 Method and Data

For the application of the OPM to speech rhythm in GLE, we will rely on the rhythm metrics described in Table 1.⁵ The next paragraph gives information about statistical procedures and may be skipped without loss of continuity.

5 The choice of these particular metrics was motivated by the following considerations: (i) in the interest of simplicity, the focus in the present study is restricted to the analysis of vocalic intervals, (ii) these metrics are widely used in the literature comparing different languages (see Fig. 1 and Appendix 1), and (iii) they feature prominently in previous work on speech rhythm in German Learner English (e.g. Ordin et al. 2011; Li & Post 2014; Ordin & Polyanskaya 2015).

Given the controlled method of elicitation (the same 105 syllables were produced by each speaker), the data are highly structured. Measurements are grouped by syllable, and syllables in turn are nested in 10 sentences. Further, measurements are clustered by speaker. Accordingly, the data were analyzed with a hierarchical (mixed-effects) model. The nPVI-V data include 95 adjacent comparisons per speaker. These were analyzed with a hierarchical linear regression model with random intercepts for speaker (level 2) and adjacent pair (level 2). The adjacent pairs are nested in sentences and random intercepts for sentences were therefore included at level 3. As in all other models reported in this study, proficiency (as measured by the foreign accent rating) is a property of the individual speaker (i.e. a between-speaker variable); it is therefore a level-2 predictor. The %V scores are based on 10 proportions per subject (one for each sentence). These proportions were transformed to logits (i.e. log odds) and analyzed with a hierarchical linear regression model with random intercepts for speaker (level 2) and sentence (level 2). Scores were back-transformed to percentages for presentation and interpretation. VarcoV values require different treatment, as they are measures of dispersion rather than location. To preserve statistical uncertainty in VarcoV estimates, the standard deviation was modeled using a hierarchical linear regression model with random intercepts for subject (level 2) and sentence (level 2). This is to say that the variation of measurements (rather than their central tendency) was the outcome of interest. In line with the rationale behind VarcoV, the durational measurements were rate-normalized, i.e. converted to express duration relative to the average vocalic interval duration for each subject. The statistical analyses were carried out in R (R Core Team 2016), relying on the ‘brms’ package (Bürkner 2016), which in turn builds on the Bayesian inference engine Stan (Stan Development Team 2016). The posterior distributions generated by the models were processed in R and the packages ‘lattice’ (Sarkar 2008) and ‘latticeExtra’ (Sarkar & Andrews 2016) were used for data visualization.
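Two of the data transformations mentioned above, the logit transformation of %V proportions and the rate-normalization of durations, can be sketched as follows. This is a minimal Python illustration under stated assumptions rather than the author's R code (which is available in the online appendix); the function names and all numbers are hypothetical.

```python
import math
from statistics import mean

def logit(p):
    """Log odds of a proportion p, with 0 < p < 1."""
    return math.log(p / (1 - p))

def inv_logit(x):
    """Back-transform a logit to a proportion."""
    return 1 / (1 + math.exp(-x))

def rate_normalize(durations):
    """Express each duration relative to the speaker's mean duration."""
    m = mean(durations)
    return [d / m for d in durations]

# Hypothetical %V scores for three sentences from one speaker
percent_v_by_sentence = [42.0, 45.5, 39.0]
logits = [logit(v / 100) for v in percent_v_by_sentence]
back = [100 * inv_logit(x) for x in logits]

# Hypothetical vocalic interval durations (ms) from one speaker
durations_ms = [60, 120, 90, 30]
relative = rate_normalize(durations_ms)

print([round(x, 3) for x in logits])
print([round(v, 1) for v in back])
print(relative)
```

Working on the logit scale keeps model predictions inside the 0 to 100 percent range once they are back-transformed, and the rate-normalized durations have a mean of 1 for every speaker, so that their spread is comparable across speech rates.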
Technical information about model parameters and priors is deferred to the online appendix (https://osf.io/25kq4/), which also includes the complete R code. For each rhythm metric, we will compare three candidate models. These encode three possible relationships between prominence variation and proficiency level: (i) no systematic relationship, (ii) a straight-line trend, and (iii) a U-shaped trend. Model (iii) is the one suggested by the OPM and the other two are simpler descriptions, which will serve as a point of reference. Our primary concern is to determine, for each rhythm metric, which of these patterns receives most support from the data. Such a ranking can be established using information criteria.6 These can be re-expressed as Akaike weights, a heuristic and more intuitive measure of the relative goodness of a model. Such weights range from 0 to 1 and can be interpreted as the probability that a given model is the best one in the set (Burnham & Anderson, 2002: 75; McElreath, 2016: 197–201). The type of information criterion we will rely on is LOOIC (Vehtari et al., 2017), whose scores are then translated into Akaike weights. In addition, the
6 The purpose of information criteria is to provide an assessment of how well the model (in our case, the pattern: horizontal vs. linear vs. U-shaped trend) is likely to generalize to new observations (i.e. other speakers from the population of L1 German learners of English). Information criteria help the researcher guard against ‘overfitting’, that is, reporting and interpreting idiosyncratic features of the sample in hand, which may not replicate in a new sample of observations. They report what is referred to as the out-of-sample predictive accuracy, with lower values signaling higher accuracy, that is, a higher goodness rating.
L. Sönning
Table 5 Model comparison results
Metric   Model   Pattern         LOOIC      (SE)    Akaike weight
nPVI-V   (i)     None            5337.9     (141)   0.19  |||||||
         (ii)    Straight-line   5336.6     (141)   0.35  ||||||||||||||
         (iii)   Curvilinear     5336.1     (141)   0.46  ||||||||||||||||||
%V       (i)     None            −642.1     (41)    0.24  ||||||||||
         (ii)    Straight-line   −643.5     (41)    0.49  |||||||||||||||||||
         (iii)   Curvilinear     −642.4     (41)    0.27  |||||||||||
VarcoV   (i)     None            11,428.4   (134)   0.02  |
         (ii)    Straight-line   11,421.5   (134)   0.60  ||||||||||||||||||||||||
         (iii)   Curvilinear     11,422.5   (134)   0.38  |||||||||||||||
data are shown graphically, with native speaker values added for comparison. The way scores pattern across proficiency levels is captured by a flexible regression line, more specifically a cubic B-spline7 (Fahrmeir et al. 2013: 426–431). The trend and its uncertainty, then, are purely data-based, which will allow us to appreciate visually the degree to which the data support different candidate models.
6.4 Results
Model comparison results are shown in Table 5, where a semi-graphical representation of Akaike weights is added (a single bar denotes 0.025 units of weight). The metrics differ in the extent to which they allow us to differentiate between the three patterns. While %V fails to identify a best candidate, nPVI-V and VarcoV provide some indication against ‘no trend’. However, the data do not discriminate well between models (ii) and (iii). While these comparisons do not suggest a single best model, a visualization of the patterns in the data is still revealing. Figure 6 shows, for each rhythm metric, estimates for the 62 learners. The raw, data-based trend is superimposed, with error bands denoting 50% and 90% uncertainty intervals. The patterns provide virtually no indication of a U-shaped profile.
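The step from information-criterion scores to Akaike weights can be sketched in a few lines of Python. The sketch uses the standard formula (each model's score is expressed as a difference from the best score, and exp(−0.5 × difference) is normalized to sum to 1). Whether the published weights were computed in exactly this way is an assumption; the rounded LOOIC values from Table 5 reproduce them to within about 0.01.

```python
import math

def akaike_weights(ic_scores):
    """Convert information-criterion scores (lower = better) into weights summing to 1."""
    best = min(ic_scores)
    relative = [math.exp(-0.5 * (score - best)) for score in ic_scores]
    total = sum(relative)
    return [r / total for r in relative]

# Rounded LOOIC values from Table 5, in the order (i), (ii), (iii).
table5 = {
    "nPVI-V": [5337.9, 5336.6, 5336.1],
    "%V": [-642.1, -643.5, -642.4],
    "VarcoV": [11428.4, 11421.5, 11422.5],
}

for metric, looic in table5.items():
    for label, weight in zip(["(i)", "(ii)", "(iii)"], akaike_weights(looic)):
        # One bar per 0.025 units of weight, as in the table.
        bars = "|" * round(weight / 0.025)
        print(f"{metric:7s} {label:5s} {weight:.2f} {bars}")
```

Because the weights depend only on differences between scores, the large absolute LOOIC values (and their sign in the case of %V) play no role in the ranking.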
7 Three knots were chosen to keep the flexibility at a reasonable level. See R code for further details.
[Figure 6: three panels (nPVI-V, VarcoV, %V), each plotting the metric against foreign accent rating (z-score); NS marks the native-speaker boxplots.]
Fig. 6 Rhythm metrics by proficiency level with flexible regression lines. Error bands denote 50% and 90% uncertainty intervals. Boxplots at the right margin show the distribution of scores for the 25 native speakers recorded in this study.
6.5 Discussion
To recapitulate, we considered the acquisition of English speech rhythm by German learners from the viewpoint of Major’s (2001) OPM. A contrastive analysis and a survey of markedness properties of rhythm types led us to postulate a U-shaped trend in the overall degree of temporal variability of vocalic intervals. This expectation is rooted in the OPM assumption that initial stages of L2 acquisition should show a disproportionate influence of the L1. Having identified several rhythmic parallels between English and German, transfer from L1 was expected to surface in near-target timing patterns, followed by a U-induced reduction of temporal variability, that is, more ‘syllable-timed’ speech at intermediate levels. However, no evidence was found for the hypothesized curvilinear trajectory and nPVI-V and VarcoV scores showed that, compared to native speakers, on average, all developmental stages are characterized by a lack of temporal variability among vocalic intervals. These findings are consistent with previous work on timing patterns in GLE (Li & Post, 2014; Ordin & Polyanskaya, 2015), in which a monotonic increase in temporal variability across proficiency levels was reported. The present study has extended the empirical scope toward lower-proficiency levels by including early-stage instructional-setting learners. Against the backdrop of the OPM, we would have expected this sub-population of learners to not have progressed beyond the stage of L1 influence. In light of the present findings, then, there is growing indication of a linear (i.e. straight-line) increase in durational variability across different (if not all) levels of pronunciation ability in GLE. A U-shaped trajectory, on the other hand, does not seem to provide an adequate description.8
8 It should be noted that existing research, including the present study, offers limited information on genuinely developmental patterns due to its cross-sectional nature. In order to make reliable statements about change across different stages of L2 development, longitudinal data would be required. Nevertheless, current knowledge about cross-sectional patterns consistently points to a steady increase in temporal prominence grading across proficiency levels.

Instead of questioning the OPM, we first need to cast a critical eye on the set of assumptions we had to state and rely on to formulate predictions. The expectation of a curvilinear trend rests on a contrastive analysis, which revealed similar rhythmic profiles in English and German. We therefore expected L1 transfer of the full set of phonetic, phonological, and prosodic components to yield timing patterns that are close to those of native speakers. We must recognize, however, that we are not able to pin down the precise point along the rhythmic continuum that would characterize the hypothetical first stage of the OPM, that is, a full L1 transfer scenario. This greatly compromises our ability to distinguish between different explanations of the observed patterns, specifically, the delineation of L1 and U influence. We cannot fully rule out the possibility that L1 transfer might also yield at least a certain degree of prominence leveling, given that German is somewhat less ‘stress-timed’ than English. Our survey of the empirical literature (see Fig. 1) suggests that these concerns are valid: The deviation of learner scores from those of native speakers is within the range of variation that has been observed between English and German, that is, about 10 to 20 points on each of the nPVI-V and VarcoV scales. We are thus facing considerable uncertainty when it comes to interpreting the observed patterns in terms of IL components: The steady upward cline could reflect (i) L1 slowly giving way to L2 structures, with U playing no role; (ii) U gradually giving way to L2 structures, with L1 playing no role; or (iii) simultaneous influence of L1 and U gently giving way to L2. Li and Post (2014: 244) partly steered clear of this interpretive dilemma by also recording utterances of German learners in their L1. They observed that lower-intermediate learners showed a drop in ‘stress-timedness’ compared to their L1 control values.
This amounted to just under 5 points on each of the nPVI-V and VarcoV scales. Assuming perfect comparability between the English and German materials, these data suggest that L1 alone may not be able to fully account for the observed patterns, leaving (ii) and (iii) as possible structural constellations. Given the data in the present study, however, any claim of Major’s model not providing an adequate account of the acquisition of prominence variation by German learners is poorly supported. Only via interpolation with findings in the literature may we arrive at the interpretation that a clear facilitative effect of L1 ‘stress-timing’ properties does not seem to be borne out by the data. What, then, can be learned from taking an OPM perspective on the acquisition of speech rhythm by German learners? In order to extract informative predictions from L2 phonological frameworks, we needed to state a number of auxiliary assumptions. This is necessary whenever a theoretical model leaves considerable interpretive leeway. An example in the case of the OPM is U as a structural speech rhythm component and markedness as a property of rhythm classes. We stated that U would operate to produce prominence-leveling and that ‘stress-timed’ properties of speech rhythm are, collectively, a marked category of speech. Both attempts to add rhythmic substance to OPM components arguably arrived at equally vague formulations. We
thus did not manage to make a very general theory concrete for the phenomenon under study. Rather, we allowed an equal level of fuzziness to enter our predictions by relying on imprecise and weak links between theory and predicted pattern. To a large extent, this unsatisfactory exchange between theory and data may be due to the fact that we have attempted to apply a model that has been formulated based on segmental phenomena to a much more complex category of speech. It is questionable whether notions such as markedness and similarity can be meaningfully applied to an assemblage of features that collectively produce the percept of speech rhythm. This would suggest that linguistic constraints that have been observed in the acquisition of segmental structures may not be directly extended to speech rhythm. Similar reservations may be voiced about the way we measured speech rhythm. By relying on rhythm metrics, we treated speech rhythm conceptually and empirically as a complex but single structure (or ‘category of speech’), when in fact it is currently understood as an ensemble of lower-level properties (see Sect. 2). In other words, we may have blindly followed the lead of rhythm metrics by coercing a multidimensional construct into a single descriptive category. While this allowed us to (unsuccessfully) operate from an OPM perspective, we may have confused the logical relationship between construct (speech rhythm) and measurement (rhythm metrics), understanding the former in terms of the latter. Thus, it could also be argued that our approach to the object of interest was too coarse and simplistic. We will return to this point in the general discussion.
7 L2 Rhythm in German Learner English: An LTD Perspective
7.1 Working Assumptions
Concerning the acquisition of prominence variation, the key aspect of James’ (1988) model is the bottom-up advancement of learners, who are assumed to build prosodic representation level by level. To apply the LTD to L2 data, the way in which speech is organized above the level of the segment must be specified. The system proposed by James (1986, 1988) will not be used in the present study for several reasons. For one, its application to the materials was in many cases not straightforward, which arguably questions the reliability with which surface structures can be mapped onto this set of hypothesized representations. Further, the strata constituting this model create a considerable level of complexity, with 5 levels of prosodic s/w-structure and 6 levels of rhythmic structure. Given these limitations, a simplified and largely theory-neutral template was chosen, which connects to the componential view of prominence variation outlined above. Specifically, prosodic strength asymmetries were coded at four levels:
• Lexical (level 1): At the level of the lexical/grammatical word, a simplified two-way distinction is made between unstressed syllables with a reduced vowel (w-marked) and stressed syllables with a full vowel (s-marked). Primary and secondary stress both receive s-marks and monosyllabic words are considered s-marked.
• Syntactic (level 2): At the post-lexical level, content words (s-marked) and grammatical words (w-marked) are distinguished. The s-mark is assigned to syllables with primary lexical stress in content words.
• Nucleus (level 3): At the level of the intonation phrase, the syllable carrying the nuclear accent is s-marked.
• Final (level 4): Intonation phrase boundaries at the right margin are s-marked. Specifically, the final syllable that carries strength at level 1 and subsequent units are s-marked.
While this coding scheme may draw legitimate criticism, it also offers benefits. Despite its rudimentary structure, it captures a number of phonological and prosodic components of durational prominence grading. Binary distinctions, which are adopted from James’ (1988) notion of s/w-marking, allow for a reliable encoding of strength asymmetries. Further, this template offers a parsimonious way of representing relevant supra-syllabic properties. As to the hierarchical organization of these strata, levels 1 to 3 are layered in the sense that higher-level s-marking requires structure at the lower level(s); that is, higher-level s-marks always rest on lower-level s-marks. The status of boundary effects, on the other hand, is less clear. James (1988) claims that rhythmic organization emerges last. Since final lengthening resembles the role of enclitics in his scheme, we will in the following assume that phrase-final s-marking is acquired last.
To illustrate the implications of this 4-level scheme for the acquisition of prominence grading, consider the sentence He’s from the north of Germany (the coding of the materials is documented in the online appendix, https://osf.io/b34p8). Figure 7 illustrates the bottom-up construction of durational variability, which proceeds from left to right. Prominence leveling, which we will consider as the default setting in L2 acquisition, is illustrated in the leftmost arrangement: At this pre-lexical level, all units are of equal prominence. As learners build up prosodic representation level by level, systematic variability emerges, starting with strength asymmetries at the ‘Lexical’ level. We will assume that the acquisition of s/w-marking at this level yields a backgrounding of w-marked units relative to s-marked units. Similarly, the emergence of s/w-marking at the ‘Syntactic’ level will be observable as a backgrounding of function words relative to lexical elements. At the ‘Nucleus’ level, s/w-marks will surface in a foregrounding of the syllable carrying the nuclear accent. Likewise, differentiation at the ‘Final’ level results in a lengthening of units in final position. Prominence grading in the right-most pattern then reflects the aggregation of s/w-marks across all four levels. The patterns in Fig. 7 then reflect different stages in L2 acquisition. Before we go further, we will consider these stages from a different perspective, as this will help us follow the methodological procedure. Let us assume that the last (‘Final’) stage
[Figure 7: four panels (Lexical, Syntactic, Nucleus, Final).]
Fig. 7 Acquisition of prosodic structure: Development of s/w-contrasts
[Figure 8: four panels (Stage 0, Stage 1, Stage 2, Stage 3).]
Fig. 8 Theoretical deviations at different developmental stages: Points (i.e. syllables) below the reference line show hypoarticulation (too short), those above hyperarticulation (too long).
approximates the type of prominence variation found in native speech. Relying on the LTD and the assumptions outlined above, we expect learners to deviate systematically from this target pattern. Based on the developmental stage a learner is at, the model makes predictions about whether a syllable will show hyperarticulation (surplus of prominence) or hypoarticulation (lack of prominence). In terms of timing patterns, this enables us to state whether a vowel is expected to be too long or too short. This is illustrated in Fig. 8, which shows four hypothetical stages. For ease of exposition, we will proceed ‘backwards’:
• Stage 3: Learners have advanced to the ‘Nucleus’ level. They deviate from native speakers in that they show no pre-boundary lengthening. In the example sentence, pre-boundary lengthening affects the three final syllables (Ger-ma-ny). As the learner has not progressed to the ‘Final’ stage, these three syllables will show hypoarticulation: they are too short.
• Stage 2: The learner utterance also lacks nuclear accentual lengthening and the third-to-last syllable (Ger-ma-ny) is therefore further hypoarticulated.
• Stage 1: The failure to background function words at this stage yields an overarticulation of syllables 1 (He’s), 2 (from), 3 (the) and 5 (of).
• Stage 0: Complete prominence leveling further yields an overarticulation of unstressed syllables in content words (Ger-ma-ny).
The patterns in Fig. 8 were arrived at by subtracting from the ‘Final’-level pattern the pre-lexical (stage 0), ‘Lexical’ (stage 1), ‘Syntactic’ (stage 2), and ‘Nucleus’ pattern (stage 3) shown in Fig. 7. As explained in more detail below, these deviations will be the key quantity in the following analyses.
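The stage-by-stage logic described above can be made concrete with a toy Python sketch for the example sentence. Several things here are assumptions for illustration only: the s/w-marks are reconstructed from the prose rather than taken from the online appendix, every backgrounding or foregrounding effect has unit size (−1 or +1), and deviations are signed as learner minus target, so that negative values mean ‘too short’ as in Fig. 8.

```python
SYLLABLES = ["He's", "from", "the", "north", "of", "Ger", "ma", "ny"]

# s/w-marks per level, one character per syllable (reconstructed from the prose).
MARKS = {
    "lexical":   "ssssssww",  # monosyllabic words and stressed syllables are s-marked
    "syntactic": "wwwswsww",  # primary stress in content words
    "nucleus":   "wwwwwsww",  # syllable carrying the nuclear accent
    "final":     "wwwwwsss",  # last lexically strong syllable and following units
}
LEVELS = ["lexical", "syntactic", "nucleus", "final"]

def effect(level, i):
    """Unit-sized change in prominence that acquiring `level` causes for syllable i."""
    lex, syn = MARKS["lexical"][i], MARKS["syntactic"][i]
    if level == "lexical":      # background unstressed syllables
        return -1 if lex == "w" else 0
    if level == "syntactic":    # background function words
        return -1 if (lex == "s" and syn == "w") else 0
    return 1 if MARKS[level][i] == "s" else 0  # foreground nucleus / final units

def pattern(n_levels):
    """Relative prominence once the first n_levels levels have been acquired."""
    return [sum(effect(level, i) for level in LEVELS[:n_levels])
            for i in range(len(SYLLABLES))]

def deviation(stage):
    """Learner pattern at `stage` minus the target ('Final') pattern."""
    target = pattern(len(LEVELS))
    return [learner - t for learner, t in zip(pattern(stage), target)]

for stage in range(4):
    print(f"Stage {stage}:", dict(zip(SYLLABLES, deviation(stage))))
```

In this toy version, Stage 3 leaves only the three final syllables too short, Stage 2 adds a further shortfall on Ger, and Stage 1 adds overlong function words. At Stage 0 the missing backgrounding and the missing final lengthening of ma and ny cancel out, which is a side effect of the unit-size assumption.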
7.2 Predictions
The four levels of prosodic representation allow us to distinguish five syllable types. Thus, the addition of s-marks at the lexical, syntactic, and nucleus level yields four levels of prominence:
• P1a: Unstressed syllables in lexical words
• P1b: Monosyllabic function words
• P2: Syllables carrying lexical stress
• P3: Syllables carrying lexical stress and the nuclear accent
These prominence levels can be discerned in the coding scheme, where they are reflected in the number of s-marks resting on a syllable (see online appendix at https://osf.io/b34p8). Besides these four prominence levels, we can distinguish final from non-final syllables. Table 6 summarizes the properties of these syllable types in terms of s/w-marking and position in the intonation phrase. Our focus will be on deviation patterns in these five syllable types. For each learner, we can determine, based on instrumental measurements, whether a certain type is too long or too short, on average. As illustrated in Fig. 8, the LTD makes predictions about the direction of these deviations and how they change over the course of L2 development. In Fig. 9, expected deviations are shown schematically. P2 serves as a baseline, since our assumption is that prominence levels P1a and P1b are affected by backgrounding (shortening) whereas prominence level P3 and final units are affected by foregrounding (lengthening). Thus, at stage 0, syllables of prominence P1a and P1b are overarticulated in contrast to those of prominence P3 and type ‘Final’, which are too short. Deviations disappear gradually and in accordance with the stages shown in Figs. 7 and 8. At stage 3, then, the only discrepancy that remains is that between final and non-final syllables: The former are too short in relative terms. Stage 4, which was not shown above, is characterized by no systematic deviations from target language timing patterns. In the following section, we discuss how, by means of a quantitative analysis, the present study aims to detect and describe the nature of these deviation patterns in learner speech.

Table 6 Description of syllable types
Syllable prominence   ‘Lexical’   ‘Syntactic’   ‘Nucleus’   ‘Final’
P1a                   w           w             w           No
P1b                   s           w             w           No
P2                    s           s             w           No
P3                    s           s             s           No
Final position        -           -             -           Yes
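The mapping in Table 6 amounts to a small lookup from s/w-marks to syllable type. The Python sketch below spells it out. The function name and the decision to treat any syllable in final position as ‘Final’, regardless of its marks, are assumptions based on the table rather than on the coding files in the online appendix.

```python
# Rows of Table 6: (lexical, syntactic, nucleus) marks for non-final syllables.
PROMINENCE_BY_MARKS = {
    ("w", "w", "w"): "P1a",
    ("s", "w", "w"): "P1b",
    ("s", "s", "w"): "P2",
    ("s", "s", "s"): "P3",
}

def syllable_type(lexical, syntactic, nucleus, final=False):
    """Return the syllable type for one syllable, following Table 6."""
    if final:
        return "Final"
    marks = (lexical, syntactic, nucleus)
    if marks not in PROMINENCE_BY_MARKS:
        # Higher-level s-marks must rest on lower-level s-marks.
        raise ValueError(f"marks {marks} violate the layering of levels 1 to 3")
    return PROMINENCE_BY_MARKS[marks]

print(syllable_type("s", "w", "w"))              # a monosyllabic function word
print(syllable_type("s", "s", "w"))              # a stressed syllable in a content word
print(syllable_type("w", "w", "w", final=True))  # an unstressed phrase-final syllable
```

Only four of the eight possible mark combinations are admissible, which reflects the layering of levels 1 to 3.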
7.3 Method and Data
Fig. 9 Schematic illustration of deviation patterns at different developmental stages with prominence level 2 (P2) as the baseline of comparison. The y-axis shows the schematic deviation.
The aim of the following analyses is to assess whether learners in our sample show a bottom-up progression, with low-proficiency subjects exhibiting the hypothesized deviation patterns for early stages (cf. Figure 9) and high-proficiency levels resembling those posited for later stages. To begin with, however, two methodological concerns must be addressed. First, we need to take into consideration that speech rate will affect vowel duration. Differences in tempo can be canceled out by centering durations, whereby the duration of each vowel is expressed relative to the speaker’s average vowel duration (a vowel may be, say, 50 ms longer than the speaker’s average vowel duration). Positive deviation scores then indicate relatively long vowels. This leads us to the second concern: If we take the average vowel duration as the within-speaker reference point, it should be representative of the distribution of measurements; that is, it should be roughly located at its center. The distribution of durational measurements, however, is typically not symmetric but skewed, as values are bounded at the lower but not the upper end of the scale. For the ensuing analyses, vowel durations were therefore rescaled using the square root transformation9 to more closely approximate normal, or at least symmetric, distributions at the speaker level. To cancel out differences in speech rate, these square root durations were then centered at the speaker mean. This rescaling aims for better comparability; we must accept, as a trade-off, the fact that we will be comparing rather abstract scores. Recall that we are interested in how learners deviate from target language timing patterns. It therefore makes sense to express vowel durations produced by learners relative to those of native speakers. For each vowel, this new measurement reflects whether it was longer or shorter than in native speech.
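The two rescaling steps (square root transformation, then centering at the speaker mean) can be sketched as follows. This is a minimal Python illustration, not the author's R code. The function name and the five vowel durations are hypothetical.

```python
import math
from statistics import mean

def center_sqrt(durations_ms):
    """Square-root-transform one speaker's vowel durations and center them
    at that speaker's mean on the square root scale."""
    roots = [math.sqrt(d) for d in durations_ms]
    speaker_mean = mean(roots)
    return [r - speaker_mean for r in roots]

# Hypothetical vowel durations (ms) for one speaker; note the long right tail.
durations = [36, 64, 100, 144, 256]
centered = center_sqrt(durations)

for d, c in zip(durations, centered):
    # Positive scores mark vowels that are long relative to this speaker's average.
    print(f"{d:4d} ms -> {c:+.1f}")
```

The resulting scores are on the square root scale, which is why the text describes them as rather abstract: a score of +1.6 no longer corresponds to a fixed number of milliseconds.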
To this end, the utterances of the 25 native speakers in the study were used to establish a ‘target’ durational profile for the 105 vowels. The construction of this target profile for sentence 1 is illustrated in Fig. 10, where the grey lines show individual durational profiles (i.e. 25 profiles, one for each native speaker). The target profile, which is shown in black, is based on
9 For this data set, the log transformation was less successful at establishing within-speaker symmetry.
[Figure 9: lines for P1a, P1b, P2, P3 and Final, ranging from hyperarticulation (too long) to hypoarticulation (too short) across developmental stages 0 to 4.]
Fig. 10 Construction of the target durational profile: Grey lines show the 25 patterns for the native speakers. The black line shows the target profile, which connects the medians.
[Figure 10: y-axis: centered duration (square root scale); x-axis: the syllables of He’s from the north of Germany.]
the median duration of each vowel. This median profile will serve as the baseline of comparison in the following analyses. Deviations from temporal prominence patterns in native speech can now be assessed by comparing vowel durations to this target profile. The difference between a learner vowel and the target profile will be referred to as a deviation score. Positive deviation scores reflect hyperarticulation (the vowel was too long), negative scores reflect hypoarticulation (the vowel was too short). These deviation scores form the basis of the following analyses. Our aim, then, is to determine whether the five syllable types show the expected deviations across proficiency levels. Translating the schematic representation in Fig. 9 into deviation scores yields the constellation shown in Fig. 11, where deviation patterns are not expressed relative to a certain prominence level (as in Fig. 9), but relative to the speaker’s average vowel duration (i.e. a deviation score of zero). While this may seem counterintuitive, we need to bear in mind that changes in timing patterns systematically affect the within-speaker reference point, that is, his or her average vowel duration. To illustrate, consider a learner progressing to Stage 1. The LTD states that syllables of type P1a will now be properly backgrounded, while no changes occur for the other prominence levels. The speaker’s average vowel duration therefore decreases. As a consequence, the remaining prominence levels artificially shift upwards, as they receive a new deviation score relative to the new reference point. Due to our methodological approach, then, deviation scores are re-centered at zero at each stage, which results in a relative shift of the other prominence levels. Note, however, that this adjustment of hypo- and hyperarticulation patterns does not yield qualitatively different predictions. 
In fact, the expected arrangement of deviation scores by syllable type remains very much the same as that shown in Fig. 9. The crucial question in the following analyses is whether the empirical deviation patterns in our sample resemble the theoretical values shown in Figs. 9 and 11. The remainder of this paragraph lays out the statistical procedures and can be skipped without losing the main thread of the argument. To extract the empirical deviation patterns from the sample of German learners, a hierarchical linear regression model was fitted, including random intercepts for subjects (n = 62) and syllables (n = 105). Subjects further received a random slope for each syllable type described in Table 6. These random slopes document learner-specific deviations from the target profile. The model included, as fixed effects, the five syllable types and their interaction with the foreign accent rating as a measure of proficiency. Thus, the random slopes
Fig. 11 Expected average deviation patterns by syllable type and developmental stage: Translation of the schematic patterns shown in Fig. 9 to the scale of deviation scores.
[Figure 11: lines for P1a, P1b, P2, P3 and Final; y-axis: average deviation (relative to zero), from positive deviation (hyperarticulation, too long) through the target at 0 to negative deviation (hypoarticulation, too short); x-axis: developmental stage (0 to 4).]
were modeled conditional on foreign accent rating to detect systematic changes in deviation patterns across proficiency levels. This allows the model to capture trends in the magnitude and direction of deviation scores for each syllable type. For the analyses presented in the following section, use was made of the resources outlined in Sect. 6.2. Details about the model and the complete R code for the analysis can be found in the online appendix (https://osf.io/25kq4/).
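The construction of the target profile and of the deviation scores described in this section can be sketched in Python as follows. The sketch only computes raw descriptive averages; the study itself estimates these quantities with a hierarchical regression model, which is not reproduced here. All names and numbers are hypothetical.

```python
from statistics import mean, median

def target_profile(native_profiles):
    """Per-vowel median across the native speakers' centered profiles."""
    return [median(values) for values in zip(*native_profiles)]

def deviation_scores(learner_profile, target):
    """Learner minus target: positive = too long, negative = too short."""
    return [l - t for l, t in zip(learner_profile, target)]

def mean_by_type(scores, syllable_types):
    """Average deviation score per syllable type for one learner."""
    grouped = {}
    for score, syl_type in zip(scores, syllable_types):
        grouped.setdefault(syl_type, []).append(score)
    return {syl_type: mean(values) for syl_type, values in grouped.items()}

# Hypothetical centered square-root durations: 3 native speakers x 4 vowels.
natives = [
    [-2, 1, 3, -2],
    [-3, 2, 4, -3],
    [-1, 0, 2, -1],
]
types = ["P1b", "P2", "Final", "P1b"]
learner = [-1, 1, 1, -1]

target = target_profile(natives)
scores = deviation_scores(learner, target)
print("target profile:", target)
print("deviation scores:", scores)
print("mean by syllable type:", mean_by_type(scores, types))
```

In this made-up case the learner's function words come out too long and the final vowel too short, which is the direction of deviation the LTD predicts for early stages.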
7.4 Results
Fig. 12 Empirical deviation patterns by syllable type and proficiency level. The y-axis shows the average deviation score.
Figure 12 provides a summary of the empirical deviation patterns, where each line represents a syllable type. The trends in these lines show how the direction and magnitude of deviation scores change across proficiency levels. The lines fan out to the left, which indicates that the timing patterns at lower proficiency levels correspond least to those of native speakers. At higher proficiency levels, there is convergence toward the target, which reflects alignment with the median profiles derived from the native speaker data.
[Figure 12: lines for P1a, P1b, P2, P3 and Final; positive deviation = hyperarticulation (too long), negative deviation = hypoarticulation (too short), with the target at 0; x-axis: foreign accent rating (z-score).]
[Figure 13: one panel per syllable type (P1a, P1b, P3, P2, Final); y-axis: deviation score; x-axis: foreign accent rating (z-score).]
Fig. 13 Empirical deviation patterns by syllable type and proficiency level, including information about the variation among learners and statistical uncertainty. Error bands denote 50% and 90% uncertainty intervals.
A comparison of the prominence levels reveals that unstressed syllables in lexical words (P1a), syllables carrying lexical stress (P2), and syllables carrying lexical stress and the nuclear accent (P3) show near-horizontal trend lines around the target baseline. This suggests that these types of units were, on average, close to the native speaker profile with no pronounced changes of deviation patterns across proficiency levels. Deviations from TL timing patterns are discernible for prominence levels P1b and in final contexts. Vowels in monosyllabic function words (P1b) show excess duration at the beginner stage, where they reflect a notable level of hyperarticulation. At high proficiency levels, on the other hand, we see target alignment. Final syllables show the greatest deviation from the target profile: Learners with low pronunciation ability exhibit a systematic lack of lengthening, which persists well into the intermediate stages. While the patterns in Fig. 12 allow for direct comparison between the five syllable types, more detail is provided in Fig. 13, which adds information about statistical uncertainty. For each syllable type, 50% and 90% uncertainty intervals are added to the trend lines. The uncertainty bounds suggest that there is indeed scarce evidence for a sensitivity of P1a, P2, and P3 deviation patterns to proficiency level. The trends for P1b and final syllables, on the other hand, appear to be more robust.
7.5 Discussion
To summarize, we applied James’ (1988) model to the acquisition of prominence variation by German learners to determine whether rhythmic timing patterns emerge in a bottom-up fashion. To this end, we adopted a simplified scheme of prosodic representation, which encodes strength asymmetries at the lexical, syntactic, and nucleus level, as well as between final and non-final syllables. These levels were assumed to form a hierarchy, which, in accordance with the LTD, allowed us to formulate
expectations about how learners establish prosodic s/w-marking level by level. The hypothesized universal progression from the lowest to the highest level corresponds to a step-by-step adoption of native speaker timing patterns. In empirical terms, this is reflected in a predictable decrease in local hypo- and hyperarticulation. Expected deviation trends were compared to the deviation scores recorded for the sample of German learners. A level-by-level emergence of prominence grading is consistent with some, but not all, patterns in the present data. At a broad level of comparison, the empirical patterns agree with expectations in two regards. First, the overall constellation is consistent with the fan-shaped predictions in Fig. 11. Second, there is a general match in the direction of deviation for syllable types P1a and P1b, which tend to be too long, and Final syllables, which are too short, on average. Lack of agreement between predicted and empirical profiles is found for P3 syllables, which carry lexical stress and the nuclear accent. While the statistical uncertainty depicted in Fig. 13 suggests that caution should be exercised in interpreting this pattern, it appears that this syllable type does not show the type of hypoarticulation predicted by the LTD. In line with the model, vowels at prominence level 1b were hyperarticulated by low-proficiency learners and converged with the native speaker target at the advanced stages. This suggests that monosyllabic function words are susceptible to overarticulation in GLE. P1a deviation patterns showed weak alignment with predictions. Bearing in mind the statistical uncertainty represented in Fig. 13, the early stages of L2 acquisition do show the expected divergence of unstressed syllables in lexical words: They are slightly overarticulated by lower-proficiency learners.
The patterns in the data are also consistent with the LTD prediction of a delayed acquisition of P1b relative to P1a prominence backgrounding, suggesting tentatively that the acquisition of the two lowest levels may conform to the sequential order postulated by James’ (1988) model.
Syllables carrying lexical and syntactic stress, that is, vowels at prominence level 2, are consistent with the hypothesized pattern in that they show no notable change across proficiency levels. Syllables carrying the nuclear accent, however, appear to violate LTD predictions. There is no evidence for hypoarticulation, that is, lack of lengthening, at lower proficiency levels. In fact, vowels in P3 contexts were relatively close to the target profile. Concerning the theoretical predictions, a delayed acquisition of P3 vowels therefore does not materialize in the present data set, indicating that P3 may not conform to the hypothesized level-by-level progression. We might be dealing with an instance of L1 transfer since German, like English, shows accentual lengthening.
The durational marking of final syllables yields the greatest discrepancy between learners and native speakers. The direction of deviation is in accordance with LTD predictions. In comparison to the four prominence levels, the magnitude of divergence from native speech is striking. Note that the theoretical predictions shown in Figs. 9 and 11 encode the simplistic assumption that prosodic heads and edges show the same extent of relative lengthening. More generally, these graphs suggest identical temporal effects of prominence grading across all levels. This arbitrary assumption merely served illustrative purposes. Upon reflection, the amount of accentual and
final lengthening could have relied on empirical evidence reported in Sect. 2. This is to say that specific values for durational fore- and backgrounding could have been chosen based on findings in earlier work (e.g. Delattre, 1965; Li & Post, 2014). What matters most, however, is the relative position of syllable types. Returning to the empirical results, the deviation patterns show that there is notable durational hypoarticulation in final syllables. It appears that L1 transfer may not be able to fully account for this finding, as empirical studies have observed comparable levels of final lengthening in English and German. Based on the logic of Major’s (2001) model, we would look to universal constraints as a possible explanation. James’ account of L2 acquisition in fact offers a U-perspective, as the model makes no allowance for L1 effects in the bottom-up emergence of timing patterns. The present data suggest that prosodic boundary marking may be considered an area of L2 prosody where universal forces toward prominence leveling operate.
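To make the logic of this comparison concrete, the following Python sketch computes one deviation score per syllable type from aligned learner and native vowel durations. It is an illustration only: the function name `deviation_scores`, the use of a log2 duration ratio as the deviation measure, and the toy durations are assumptions made for this example and are not taken from the analysis reported above. Positive values indicate vowels that are too long relative to the native reference (hyperarticulation), negative values vowels that are too short (hypoarticulation).

```python
import math
from collections import defaultdict
from statistics import mean


def deviation_scores(learner, native):
    """Mean log2(learner / native) vowel duration per syllable type.

    Both arguments are lists of (syllable_type, duration_in_seconds)
    tuples for the same utterance, in the same order. The log2 ratio
    is a hypothetical choice of deviation measure for this sketch.
    """
    if len(learner) != len(native):
        raise ValueError("learner and native token lists differ in length")
    by_type = defaultdict(list)
    for (l_type, l_dur), (n_type, n_dur) in zip(learner, native):
        if l_type != n_type:
            raise ValueError("learner and native tokens are not aligned")
        by_type[l_type].append(math.log2(l_dur / n_dur))
    return {syl_type: mean(values) for syl_type, values in by_type.items()}


# Invented durations (seconds) for one utterance; not data from the study.
native = [("P1b", 0.05), ("P2", 0.12), ("P1a", 0.06), ("Final", 0.20)]
learner = [("P1b", 0.10), ("P2", 0.12), ("P1a", 0.06), ("Final", 0.10)]

scores = deviation_scores(learner, native)
for syl_type, score in scores.items():
    print(f"{syl_type}: {score:+.2f}")
```

With these invented durations, the P1b vowel receives a score of +1 (twice the native duration) and the final vowel a score of -1 (half the native duration), which mirrors the direction of the deviations discussed in this section. In an actual analysis the native reference would presumably be averaged over several speakers rather than taken from a single token.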
8 General Discussion
Our exploration of speech rhythm in GLE from the viewpoint of Major’s (2001) OPM and James’ (1988) LTD has demonstrated that neither model offers a satisfactory account of rhythmic acquisition in this population of L2 speakers. Nevertheless, much can be learned from the above exercise, both from a methodological and a theoretical viewpoint.
Arguably, the inaccuracy of OPM predictions is rooted in the adopted conceptual approach to speech rhythm. Guided by existing quantitative research, the notion of rhythm was captured by means of rhythm metrics, which dictated a coarse approach to the subject matter. Thus, the line of argumentation proceeded along the broad notions of prominence variation, unequal timing, and durational variability as descriptive cover terms for a complex set of features. Accordingly, ‘stress-timing’ was treated as a category of speech that can, as a whole, be described in terms of markedness. This abstract level of analysis glossed over concrete, lower-level surface phenomena that can be described more transparently, both theoretically and empirically. This lack of transparency is carried forward to the metrics that were applied for an operationalization of rhythm in learner speech.
For the field of L2 speech rhythm, however, metrics offer only a limited amount of information. Thus, if such quantities suggest a mismatch between native and non-native speech, the next step, of course, is to explore the nature of this mismatch. This gives rise to questions of where learners show hyperarticulation and/or hypoarticulation and whether deviations from the target are systematic in nature. We should bear in mind that one of the unique features of our field of inquiry, L2 phonology, is that it can provide answers to these questions. In contrast to typological approaches, from which the use of rhythm metrics originates (e.g.
Ramus et al., 1999), L2 research can compare non-native to native speech directly, through controlled elicitation of the same utterances. This allows us to make targeted comparisons between segments, intervals, and syllables, which allow us to uncover the local components of prominence leveling. In short, it seems
that the questions that are of direct interest to our field cannot be answered by rhythm metrics; arguably, we must move beyond the abstract and general level of description they offer.10 Critics may correctly note that controlled data of the type elicited in this study is at odds with the investigation of more natural speaking styles. We thus face a trade-off between more vs. less natural speech on the one hand, and more vs. less informative descriptions on the other. It would seem that the latter dichotomy should receive greater weight as a criterion for methodological decisions. We have seen that the LTD shifts the researcher’s attention to the individual units of analysis and thereby points to analysis strategies that maximize the information that can be extracted from a set of measurements. This offers a more fine-grained account of prominence variation in speech and allows us to address the questions raised by differences in rhythm metrics. Thus, by opting for deviation scores as the quantities of interest, the present study shifted attention to local hyper- and hypoarticulation patterns in learner speech. Of course, this approach is very much in line with large parts of the current literature on speech rhythm, which adopts a componential view of this category of speech and would therefore proceed along similar lines. From a meta-theoretical viewpoint, a unique feature of James’ model is its predictive adequacy, that is, its ability to generate precise and informative predictions that can be falsified by data. This contrasts quite dramatically with our implementation of the OPM, which yielded fuzzy links between theory and predicted patterns. As for the LTD, the hypothesized universal bottom-up progression translates into quantitative predictions about local deviations from native speaker timing patterns. These predictions were highly informative in the sense that their partial falsification adds to our understanding of the L2 acquisition of rhythm. 
Thus, even though the LTD fails to account fully for the observed data, the weak spots we may have uncovered nevertheless offer valuable insights. They take us one step further by highlighting which parts of the assumed acquisition mechanism we may maintain as viable explanations and which ones may need revision or enrichment by other processes. After all, the LTD takes a bold stance in disregarding the possibility of L1 transfer in this bottom-up progression. In terms of the constraints underlying the acquisition of English speech rhythm by L1 German learners, our application of the OPM and the LTD has left us with conflicting evidence for L1 transfer in L2 timing patterns. On the one hand, we interpreted the global information provided by rhythm metrics as perhaps suggestive of a limited or non-existent facilitative effect of L1 transfer. In contrast, the LTD perspective led us to refine this view, as deviation patterns at specific levels of prosodic representation appear consistent with L1 transfer. While the role of L1 transfer in the acquisition of speech rhythm remains to be explored more fully in future studies, it seems that a combination of the two models may offer fruitful perspectives. Thus, future work could reconsider the interplay of L1, L2, and U at the level of different syllable types or different levels of prosodic representation.
10 The same is true, of course, for comparisons of accents or varieties of the same language. These remarks therefore also apply to the investigation of rhythm in World Englishes.
A final point that deserves to be mentioned concerns the added value of the necessarily broad notion of language universals for collaborative, cumulative efforts in the field of speech rhythm research. Both models stipulate the existence of constraints independent of the L1 and L2, that is, the notion of U (OPM) and that of a bottom-up construction of prosodic representation (LTD). Due to their very nature, such cross-lingual tendencies (or language universals) establish common ground for the study of speech rhythm across different languages. Theoretical and empirical work can contribute to a shared knowledge base about universal constraints on the acquisition of prominence grading. Research in the SLA and the World Englishes paradigms, for instance, can mutually inform each other and consequently draw on a larger body of theoretical knowledge and empirical evidence.
Appendix 1
Empirical evidence on vocalic timing patterns in English, German and Spanish: Rhythm metrics based on vocalic intervals and durational measurements
Language German
%V 46
nPVI-V
VarcoV
60
43
British English
Style
Reference
1 Reading passage
Grabe & Low 2002
7 Reading passage
Dellwo & Wagner, 2003
42
53
41
53
52
13 Free 8 Sentences1
Arvaniti, 2012
38
54
51
8 Reading passage
Arvaniti, 2012
42
54
55
8 Free
Arvaniti, 2012
42
45
41
5 Sentences
Li & Post, 2014
42
45
41
5 Sentences
Li & Post, 2014
38
73
64
6 Sentences
White & Mattys 2007
41
55
55
8 Sentences2
Prieto et al., 2012
3 Reading, retelling
Gut, 2005
38 60 41
41 American English
n
Russo & Barry, 2008
1 Reading, retelling
Gibbon & Gut 2001
62
61
9 Semi-free
Payne et al., 2011
76
61
10 Semi-free
57
1 Reading passage Sentences1
Ordin & Polyanskaya, 2015 Grabe & Low 2002
44
56
50
8
44
54
50
8 Reading passage
Arvaniti, 2012
48
59
66
8 Free
Arvaniti, 2012
46
52
47
5 Sentences
Li & Post, 2014
52
20 Free
Arvaniti, 2012
Thomas & Carter, 2006 (continued)
(continued) Language
%V
English (variety unspecified)
n
Style 10 Sentences
Reference Low et al., 2000
5 Reading passage
Dellwo & Wagner, 2003
40
4 Sentences
Ramus et al., 1999
44
57
44 48
Notes
VarcoV
42
43 Spanish
nPVI-V
38
36
41
7 Reading passage
Dellwo et al. 2009
4 Sentences
Ramus et al., 1999
6 Sentences
White & Mattys 2007
Sentences1
49
48
57
8
49
45
47
8 Reading passage
Arvaniti, 2012
50
47
66
8 Free
Arvaniti, 2012
48
44
50
6 Semi-free
Payne et al., 2011
47
37
36
8 Sentences2
Prieto et al., 2012
51
30
1 Reading passage
Grabe & Low 2002
Arvaniti, 2012
Notes: 1 Uncontrolled condition; 2 mixed condition (see Arvaniti, 2012 for details)
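For readers who wish to relate the tabulated values to raw measurements, the three vocalic metrics can be sketched in Python as follows. The sketch follows the commonly used definitions: %V as the vocalic share of total duration, VarcoV as the standard deviation of vocalic interval durations divided by their mean, and nPVI-V as the normalized pairwise variability index over successive vocalic intervals. The function names, the choice of the sample standard deviation, and the toy durations are assumptions made for illustration; individual studies in the table may differ in such details.

```python
from statistics import mean, stdev


def percent_v(vocalic, consonantal):
    """%V: vocalic share of the total duration, in percent."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def varco_v(vocalic):
    """VarcoV: standard deviation of vocalic intervals relative to their mean.

    The sample standard deviation is an assumption of this sketch.
    """
    return 100.0 * stdev(vocalic) / mean(vocalic)


def npvi_v(vocalic):
    """nPVI-V: mean normalized difference between successive vocalic intervals."""
    pairs = zip(vocalic, vocalic[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * mean(terms)


# Invented interval durations (seconds); not data from any cited study.
vocalic = [0.05, 0.15, 0.05, 0.15]
consonantal = [0.10, 0.10, 0.10, 0.10]

print(f"%V     = {percent_v(vocalic, consonantal):.1f}")
print(f"VarcoV = {varco_v(vocalic):.1f}")
print(f"nPVI-V = {npvi_v(vocalic):.1f}")
```

Because nPVI-V compares only neighbouring intervals, a strictly alternating short-long sequence such as the one above yields a high value, whereas a sequence of equal durations yields zero on both VarcoV and nPVI-V.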
Appendix 2
Materials used in the reading task. Utterances used for the analysis are printed in bold.
1. Do you want to drink something? Oh, yeah. Can I get another cup of tea, please?
2. Where is your friend Peter from? Peter? He’s from the north of Germany.
3. Is your brother home? No. He said that he’s going to be back at eight o’clock.
4. Oh no! We haven’t got any sugar. I want to bake a cake. No problem. I can get some sugar from the market.
5. Is Sally from England? She lives in England. But she was born in America.
6. Can I walk to the city centre from here? It’s too far. You must take the bus to the city centre.
7. How was your trip to England? Great! The weather was sunny and we took a lot of pictures.
8. Where is the car? I parked it at the end of the street.
9. What is your sister doing at the moment? Becky? She’s writing an article for the school magazine.
10. How can I help you? Can you tell me the way to the cinema, please?
11. Did you like the book? I did. But the second part of the book was better than the first.
References
Abercrombie, D. (1967). Elements of general phonetics. Edinburgh University Press.
Allen, G. D., & Hawkins, S. (1980). Phonological rhythm: Definition and development. In G. H. Yeni-Komshian, J. F. Kavanagh, & C. A. Ferguson (Eds.), Child phonology 1: Production (pp. 227–256). Academic Press.
Aoyama, K., & Guion, S. G. (2007). Prosody in second language acquisition: Acoustic analyses of duration and F0 range. In M. J. Munro & O.-S. Bohn (Eds.), Language experience in second language speech learning (pp. 281–297). John Benjamins.
Archibald, J. (1994). A formal model of learning L2 prosodic phonology. Second Language Research, 10, 215–240.
Arvaniti, A. (2012). The usefulness of metrics in the quantification of speech rhythm. Journal of Phonetics, 40, 351–373.
Barry, W. J. (2007). Rhythm as an L2 problem: How prosodic is it? In J. Trouvain & U. Gut (Eds.), Non-native prosody: Phonetic description and teaching practice (pp. 97–120). De Gruyter.
Best, C. T. (1995). A direct realist view of cross-language speech perception. In W. Strange (Ed.), Speech perception and linguistic experience: Theoretical and methodological issues in cross-language speech research (pp. 171–206). York Press.
Blasdell, R., & Jensen, P. (1970). Stress and word position as determinants of imitation in first-language learners. Journal of Speech and Hearing Research, 13, 193–202.
Boersma, P. (1998). Functional phonology. Holland Academic Graphics.
Boersma, P., & Weenink, D. (2014). Praat: Doing phonetics by computer. Version 5.3.68. http://www.praat.org/.
Bohn, O.-S. (1995). Cross-language speech perception in adults: First language transfer doesn’t tell it all. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 273–304). York Press.
Bolinger, D. L. (1965). Pitch accent and sentence rhythm. In D. Bolinger, I. Abe, & T. Kanekiyo (Eds.), Forms of English: Accent, morpheme, order (pp. 139–180). Harvard University Press.
Borzone de Manrique, A. M., & Signorini, A. (1983). Segmental duration and rhythm in Spanish. Journal of Phonetics, 11, 117–128.
Brown, C. A. (1998). The role of the L1 grammar in the acquisition of segmental structure. Second Language Research, 14, 139–193.
Bunta, F., & Ingram, D. (2007). The acquisition of speech rhythm by bilingual Spanish- and English-speaking 4- and 5-year-old children. Journal of Speech, Language, and Hearing Research, 50(4), 999–1014.
Burnham, K. P., & Anderson, D. R. (2002). Model selection and multimodel inference: A practical information-theoretic approach. Springer.
Colantoni, L., & Steele, J. (2008). Integrating articulatory constraints into models of second language phonological acquisition. Applied Psycholinguistics, 29(3), 489–534.
Colantoni, L., Steele, J., & Escudero, P. (2015). Second language speech: Theory and practice. Cambridge University Press.
Cruttenden, A. (1979). Language in infancy and childhood: A linguistic introduction to language acquisition. Manchester University Press.
Dauer, R. (1983). Stress-timing and syllable-timing reanalysed. Journal of Phonetics, 11, 51–62.
De Bot, K., Lowie, W., & Verspoor, M. (2007). A dynamic systems theory approach to second language acquisition. Bilingualism: Language and Cognition, 10(1), 7–21.
Delattre, P. C. (1965). Comparing the phonetic features of English, German, Spanish and French. Groos.
Dellwo, V., & Wagner, P. (2003). Relations between language rhythm and speech rate. In D. Recasens, M. Solé, & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS 2003) (pp. 471–474). Universitat Autònoma de Barcelona.
Dellwo, V., Gutiérrez Díez, F., & Gavaldà, N. (2009). The development of measurable speech rhythm in Spanish speakers of English. In Proceedings of XI Simposio Internacional de Comunicacion Social (pp. 594–597). Santiago de Cuba.
Dziubalska-Kołaczyk, K. (1990). A theory of second language acquisition within the framework of natural phonology. AMU Press.
Eckman, F. R. (1977). Markedness and the contrastive analysis hypothesis. Language Learning, 27, 315–330.
Eckman, F. R. (1991). The structural conformity hypothesis and the acquisition of consonant clusters in the interlanguage of ESL learners. Studies in Second Language Acquisition, 13, 23–41.
Fahrmeir, L., Kneib, T., Lang, S., & Marx, B. (2013). Regression: Models, methods and applications. Springer.
Fasold, R. W., & Preston, D. R. (2007). The psycholinguistic unity of inherent variability: Old Occam whips out his razor. In R. Bayley & C. Lucas (Eds.), Sociolinguistic variation: Theories, methods, and applications (pp. 45–69). Cambridge University Press.
Flege, J. E. (1995). Second language speech learning: Theory, findings, and problems. In W. Strange (Ed.), Speech perception and linguistic experience: Issues in cross-language research (pp. 233–277). York Press.
Flege, J. E., & Bohn, O.-S. (1989). An instrumental study of vowel reduction and stress placement in Spanish-accented English. Studies in Second Language Acquisition, 11, 35–62.
Fletcher, J. (2010). The prosody of speech: Timing and rhythm. In W. J. Hardcastle, J. Laver, & F. E. Gibbon (Eds.), The handbook of phonetic sciences (pp. 521–602). Wiley-Blackwell.
Fuchs, R. (2016). Speech rhythm in varieties of English: Evidence from educated Indian English and British English. Springer.
Gatbonton, E. (1978). Patterned phonetic variability in second-language speech: A gradual diffusion model. Canadian Modern Language Review, 34, 335–347.
Gibbon, D., & Gut, U. (2001). Measuring speech rhythm. In P. Dalsgaard, B. Lindberg, H. Benner, & Z. Tan (Eds.), Proceedings of Eurospeech 2001 (pp. 91–94). Aalborg, Denmark.
Giegerich, H. J. (1992). English phonology. Cambridge University Press.
Grabe, E., Gut, U., Post, B., & Watson, I. (1999). The acquisition of rhythm in English, French and German. In Current research in language and communication: Proceedings of the Child Language Seminar (pp. 156–62). London: City University.
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In N. Warner & C. Gussenhoven (Eds.), Papers in Laboratory Phonology 7 (pp. 515–546). Mouton de Gruyter.
Gussenhoven, C. (2004). The phonology of tone and intonation. Cambridge University Press.
Gut, U. (2005). Nigerian English prosody. English World-Wide, 26(2), 153–177.
Gut, U. (2006). Unstressed vowels in non-native German. In R. Hoffmann & H. Mixdorff (Eds.), Proceedings of the 3rd International Conference on Speech Prosody, Dresden, Germany.
Hancin-Bhatt, B. J. (1994). Segment transfer: A consequence of a dynamic system. Second Language Research, 10, 241–269.
Honikman, B. (1964). Articulatory settings. In D. Abercrombie, D. Fry, P. MacCarthy, N. Scott, & J. Trim (Eds.), In honour of Daniel Jones: Papers contributed on the occasion of his eightieth birthday, 12 September 1961 (pp. 73–84). London: Longmans.
James, A. R. (1986). Suprasegmental phonology and segmental form. Tübingen: Niemeyer.
James, A. R. (1988). The acquisition of a second language phonology. Tübingen: Narr.
James, A. L. (1940). Speech signals in telephony. Pitman & Sons.
Kaltenbacher, E. (1998). Zum Sprachrhythmus des Deutschen und seinem Erwerb. In H. Wegener (Ed.), Eine zweite Sprache lernen (pp. 21–38). Narr.
Kleber, F., & Klipphahn, N. (2006). An acoustic investigation of secondary stress in German. Arbeitsberichte des Instituts für Phonetik und digitale Sprachverarbeitung der Universität Kiel, AIPUK, 37, 1–18.
Kohler, J. (1995). Einführung in die Phonetik des Deutschen. Erich Schmidt.
König, E., & Gast, V. (2009). Understanding English-German contrasts. Erich Schmidt.
Lado, R. (1957). Linguistics across cultures. University of Michigan Press.
Li, A., & Post, B. (2014). L2 acquisition of prosodic properties of speech rhythm: Evidence from L1 Mandarin and German learners of English. Studies in Second Language Acquisition, 36, 223–255.
Low, E. L., & Grabe, E. (1995). Prosodic patterns in Singapore English. Proceedings of the 13th ICPhS, 636–639. Stockholm, Sweden.
Low, E. L., Grabe, E., & Nolan, F. (2000). Quantitative characterization of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43(4), 377–401.
Maack, A. (1959). Der Einfluss der Betonung auf die Lautdauer deutscher Sonanten. Zeitschrift für Phonetik, 3, 341–356.
Machač, P., & Skarnitzl, R. (2009). Principles of phonetic segmentation. Epocha Publishing House.
Maddieson, I. (2013). Syllable structure. In M. S. Dryer & M. Haspelmath (Eds.), The World Atlas of Language Structures Online. Leipzig: Max Planck Institute for Evolutionary Anthropology. (http://wals.info/chapter/12, accessed on 2016-12-14.)
Major, R. C. (2001). Foreign accent: The ontogeny and phylogeny of second language phonology. Erlbaum.
Major, R. C., & Kim, E. (1996). The similarity differential rate hypothesis. Language Learning, 46, 465–496.
McElreath, R. (2016). Statistical rethinking: A Bayesian course with examples in R and Stan. CRC Press.
Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Towards an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756–766.
Nishihara, T., & van de Weijer, J. M. (2012). On syllable-timed rhythm and stress-timed rhythm in World Englishes: Revisited. Bulletin of Miyagi University of Education, 46, 155–163.
Ordin, M., Polyanskaya, L., & Ulbrich, C. (2011). Acquisition of timing patterns in second language. In P. Cos, R. de Mori, G. di Fabbrizio, & R. Pieraccini (Eds.), Proceedings of Interspeech 2011 (pp. 1129–1132). Florence, Italy.
Ordin, M., & Polyanskaya, L. (2014). Development of timing patterns in first and second languages. System, 42, 244–257.
Ordin, M., & Polyanskaya, L. (2015). Perception of speech rhythm in second language: The case of rhythmically similar L1 and L2. Frontiers in Psychology, 6, 316.
Payne, E., Post, B., Astruc, L., Prieto, P., & Vanrell, M. M. (2011). Measuring child rhythm. Language and Speech, 55(2), 203–229.
Pike, K. L. (1945). The intonation of American English. Michigan University Press.
Prieto, P., Vanrell, M. M., Astruc, L., Payne, E., & Post, B. (2012). Phonotactic and phrasal properties of speech rhythm: Evidence from Catalan, English and Spanish. Speech Communication, 54(6), 681–702.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
Ramus, F., Dupoux, E., & Mehler, J. (2003). The psychological reality of rhythm classes: Perceptual studies. In D. Recasens, M. Solé, & J. Romero (Eds.), Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS 2003) (pp. 337–342). Universitat Autònoma de Barcelona.
Risley, T. R., & Reynolds, N. J. (1970). Emphasis as a prompt for verbal imitation. Journal of Applied Behavior Analysis, 3, 185–190.
Roach, P. (1982). On the distinction between ‘stress-timed’ and ‘syllable-timed’ languages. In D. Crystal (Ed.), Linguistic controversies: Essays in linguistic theory and practice (pp. 73–79). Arnold.
Russo, M., & Barry, W. J. (2008). Isochrony reconsidered: Objectifying relations between rhythm measures and speech tempo. In A. Barbosa, S. Madureira, & C. Reis (Eds.), Proceedings of Speech Prosody 2008 (pp. 419–422).
Schmid, S. (1997). The naturalness differential hypothesis: Cross-linguistic influence and universal preferences in interlanguage phonology and morphology. Folia Linguistica, 31(3–4), 331–348.
Sönning, L. (2020). Phonological variation in German Learner English. University of Bamberg Dissertation. https://doi.org/10.20378/irb-49135.
Sönning, L. (2022). Speech rhythm in German Lerner English: Dataset for Soenning 2022 “(Re-)viewing the acquisition of rhythm in the light of L2 phonological theories”, https://doi.org/10.18710/GTI2BR, Dataverse NO, DRAFT VERSION. V1.
Thomas, E. R., & Carter, P. M. (2006). Prosodic rhythm in African American English. English World-Wide, 27(3), 331–355.
Vanderslice, R., & Ladefoged, P. (1972). Binary suprasegmental features and transformational word accentual rules. Language, 48, 819–839.
Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27, 1413–1432.
Vihman, M. M., Nakai, S., & DePaolis, R. (2006). Getting the rhythm right: A cross-linguistic study of segmental duration in babbling and first words. In L. Goldstein, D. Whalen, & C. Best (Eds.), Laboratory Phonology 8: Phonology and phonetics (pp. 343–368). De Gruyter.
Wenk, B., & Wiolland, F. (1982). Is French really syllable-timed? Journal of Phonetics, 10, 193–216.
Wesener, T. (1999). The phonetics of function words in German spontaneous speech. In K. Kohler (Ed.), Phrase-level phonetics and phonology (pp. 327–377). Universität Kiel.
White, L., & Mattys, S. L. (2007). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
White, L., Mattys, S. L., & Wiget, L. (2012). Language categorization by adults is based on sensitivity to durational cues, not rhythm class. Journal of Memory and Language, 66, 665–679.
White, L. (2014). Communicative function and prosodic form in speech timing. Speech Communication, 63–64, 38–54.
Monolingual-Bilingual (Non-)convergence in L3 Rhythm
Christina Domene Moreno and Barış Kabak
Abstract This study examines the production of speech rhythm in Turkish-German bilinguals and German monolinguals in their L3 English and their L1/L2 German. A variety of durational and pitch-based rhythm metrics were calculated from read speech produced by the participants in both English and German. A comparison of rhythm metrics across groups and languages revealed that (a) the productions of bilinguals and monolinguals differ both in their English and in their German, (b) the groups vary more in their English than in their German productions, and (c) the source of variation cannot always be clearly attributed to CLI from the bilinguals’ L1 Turkish since unexpected results emerged in the L3 English. We take these findings as further evidence for a combined multilingual language system in which all background languages are interconnected and may be the source of CLI, which is not only conditioned by both universal and language-specific factors, but also manifested as property-by-property and bit-by-bit transfer.
Keywords Speech rhythm · Prosody · Phonology · Bilingualism · SLA · L3
C. Domene Moreno (✉) · B. Kabak
University of Würzburg, Würzburg, Germany
e-mail: [email protected]
B. Kabak
e-mail: [email protected]
© Springer Nature Singapore Pte Ltd. 2023
R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_7

1 Introduction
Languages differ in the way they exhibit rhythmic phenomena, which may stem from broad language-specific differences in both segmental and suprasegmental properties. Such differences are expected to lead to effects of crosslinguistic influence (CLI) among the languages of multilinguals, which is the basis for the present study. To investigate the way multiple rhythm systems interact in the multilingual mind, we examine the way the different rhythmic properties of the languages that bilinguals have mastered as their first and second languages (L1 and L2, respectively) may
influence the production of an additional target language that constitutes a third language (henceforth L3). To allow for a crosslinguistic comparison, we also test a second group of learners whose L1 is the L2 of the bilingual group. Our overarching aim is to explore whether rhythm in the shared language is realized differently by the monolinguals and the bilinguals, and to what extent the rhythmic properties of the additional language acquired by the two groups exhibit inter-group variability since the bilingual group has an additional language in the background with rhythmic properties that are different from the target language. Since the present study combines research issues in the domain of rhythm with those in the area of L3 acquisition, in the following we first review previous studies that reveal the complexity of CLI effects in L3, followed by a discussion of different approaches to the notion of linguistic rhythm and the empirical issues pertaining to the methodological utilities developed therein to identify crosslinguistic differences in Sect. 3. Section 4 will focus on previous research on the acquisition of rhythmic structure in an additional language by multilinguals. We postulate our research questions as well as the design features of our production study in Sect. 5 with a brief overview of the most important prosodic characteristics of the languages of the learners tested in our experiment. The methodological and empirical consequences of our findings for the investigation of rhythmic differences and rhythm acquisition, as well as overall theoretical implications of our research outcome for models of L3 acquisition are discussed in Sect. 6. We conclude in Sect. 7 with an outlook for future research.
2 Crosslinguistic Influence in Third Language Acquisition of Phonology and the Necessity for Global Measures of Prosodic Features
The field of L3 acquisition is replete with empirical issues surrounding the nature and dynamics of CLI. The following questions form the centerpiece of inquiry: (i) Which features and patterns are most likely to be transferred onto an L3 target language, (ii) which of the background languages is most likely to act as a supplier language, and (iii) in what way do the individual languages affect each other (i.e., is there a preferred directionality of CLI)? To that end, various models have been proposed in the last few decades, including, for example, those that assumed a uniform and unilateral transfer scenario conditioned by various factors like typological (Rothman, 2010) or psychotypological proximity (e.g., Bardel & Lindqvist, 2007), and the Second Language (L2) status of a language (Bardel & Falk, 2012), and variably modulated by other factors such as language dominance (Llama & López-Morelos, 2016) and proficiency (Sánchez, 2017). Additionally, some approaches have dwelled more on the positive aspects of the notion of linguistic transfer and assumed that structures are transferred exclusively from the background language that provides facilitating structures (Flynn et al., 2004). Empirical evidence has been found for each of these
models, albeit within the limits of the experimental setup chosen for the respective studies. At the same time, these models have also been shown to fall short of accounting for the complex nature of language acquisition in general as well as for the aggregate patterns visible in the L3 data amassed so far. Furthermore, multiple studies have clearly shown influence from more than one of the background languages. Accordingly, researchers have come up with new scenarios of CLI in an attempt to account for these sometimes contradictory findings and the dynamic nature of crosslinguistic influence. Slabakova’s Scalpel Model (Slabakova, 2016) and Westergaard’s Linguistic Proximity Model (Westergaard et al., 2016), for instance, propose potential CLI from all background languages that are present in the speaker’s mind but crucially conditioned by various intra- and extra-linguistic factors that work together (additively) or against each other (subtractively) to enhance or block CLI. Slabakova (2016) argues that the languages of a multilingual are not separate from each other but rather an “amalgamation of sub-grammars coming from the previously acquired languages” (p. 656) whose properties are tagged as belonging to either of the grammars in the multilingual mind. Consequently, transfer from only one of the background languages is not to be expected. Furthermore, she gives evidence from previous studies on morphosyntactic L3 acquisition to show that transfer does not necessarily have to be facilitative but can be detrimental. While Slabakova uses theoretical analyses to develop her model, Westergaard et al. base theirs on empirical evidence. Specifically, in an acceptability judgement study on word order in English with two groups of young monolingual (Norwegian and Russian, respectively) and one group of bilingual (Norwegian-Russian) children, Westergaard et al.
showed that ungrammatical inversion in English was judged to be more acceptable by Norwegian monolinguals than Russian monolinguals and Norwegian-Russian bilinguals since this pattern would fit in the default V2 order in Norwegian, but would be ungrammatical in Russian. However, the monolingual Russian participants outperformed the L1 Norwegian speakers on the grammatical sentences (83% as opposed to 55% correct), while the bilinguals achieved rates between the two monolingual groups (65%). According to the authors, these results show that facilitative and non-facilitative CLI can be active at the same time, and it can stem from abstract rather than concrete structures. Both these models (and, in fact, all others proposed before that) were developed in the light of the acquisition of morphosyntax and only later adapted for L3 speech, which is the focus of our current study. Recently, Kopečková et al. (2016) have used Dynamic Systems Theory, already adapted for and relatively well established in bilingualism studies and FLA and SLA (see, for instance, Van Geert, 2008 for an extensive review) to explain the highly diverse findings across speakers in a study on vowel systems in the different languages of their multilingual speakers. They found that across three groups of students who were learning Polish at school (one group with monolingual German parents, one group with one Polish parent but German as the family language, and one group with two Polish parents and Polish as the family language), there were differences in the quality of the produced vowels that held for all three languages, but they also found high intra-group variability. They thus assume a (potential) mutual influence of all variables in the system, including the speaker’s/learner’s languages.
C. Domene Moreno and B. Kabak
All in all, then, these empirical and theoretical findings have confirmed that CLI is an important factor in multilingual language acquisition. As such, given a sensible experimental design, CLI should be trackable despite being multilateral and conditioned by a plethora of overt as well as covert factors. Crucially, however, the studies focusing on phonetic and phonological CLI have mostly dealt either with the success in the production of single individual sounds, or with the specific acoustic realizations of phonetic features across the languages of speakers, like, for instance, Voice Onset Time (Wrembel, 2011) or the production of rhotics (Kopečková, 2016), often employing mirror image designs to establish which of the background languages in a multilingual learner by default supplies the feature in question. While this approach is useful to help spot instances of transfer and make observations about the way individual features are subject to CLI effects, it cannot necessarily aid in uncovering intricate interactions between phonological grammars as a whole since its view is limited in focus, which is especially evidenced by the highly diverse, and often contradictory, findings in the field. This, we think, can be remedied in multiple ways. Both Domene Moreno (2021) and Kopečková et al. (2016) have employed multi-feature analyses in their studies in an attempt to map the speech system of their subjects more fully. They have both found, perhaps unsurprisingly, a very complex relationship between the individual features and the way they are treated by the learner. Domene Moreno (2021) tested young monolingual (German) and bilingual (Turkish-German) learners of English on seven English speech sounds that she expected to either promote facilitative CLI from Turkish, from German, or from none of the background languages. 
She found that CLI effects were feature specific rather than global, and that non-facilitative CLI from one background language can override potential facilitative effects from the other. She concluded that CLI is both complex, i.e., conditioned by multiple factors, and takes place on an abstract level, i.e., may be conditioned by underlying rather than surface properties of either of the background grammars, thus confirming Westergaard et al.’s (2016) findings. Yet another step towards a thorough understanding of the processes involved in L3 CLI could be made by looking at prosodic structures rather than at segments, as prosodic structures are more readily divided into multiple layers with potentially different properties, and they can thus be considered multi-dimensional “by nature”. Still, they have only been explicitly considered in a few studies in the field of L3 acquisition. One of these is Louriz’s (2007) study on L3 word stress with speakers of Moroccan Arabic and French who were learning English as their L3. Louriz uses an Optimality Theory approach to show that, just as in the acquisition of an L2, both universal factors and language-specific misinterpretations of the L3 input contribute to the production of L3 stress. Such factors have also featured in Cabrelli Amaro’s (2017) study on vowel reduction as well as in Gabriel and Rusca-Ruths’s (2015) investigation into L3 speech rhythm (which will be discussed further in Sect. 4). More work in this domain is necessary, however, especially since even very subtle deviations from a native speaker norm have been shown to be perceived as potentially salient markers of a foreign accent in studies employing accentedness measures. Investigating attriters, Bergmann et al. (2016), for instance, have failed to show a correlation between perceived foreign accentedness and phonetic details on the
segmental level (specifically acoustic changes to vowel quality), which can lead to the assumption that the perception of a foreign accent cannot necessarily be attributed to segments, but must—at least in part—be due to deviations on the prosodic level. Here it is instructive to highlight the implications of another study that was concerned with L3 speech patterns: using accentedness judgments, Lloyd-Smith et al. (2017) found that German heritage speakers of Turkish (who are highly proficient daily users of German) were rated as German speakers in their English productions in only 60% of the ratings (and often taken for speakers of Turkish, Polish, Russian, and Ukrainian).1 However, the foreign accent of the German controls in the same study was mostly rated as German (80%) or as Swedish or Danish. What these findings imply is that, first and foremost, the perceived accent of the bilingual speakers is more variable than that of the monolingual speakers. Second, they suggest that phonological features that are beyond the level of segmental units are very likely to contribute to observed variability in accentedness and the differences between the two groups. Lloyd-Smith et al. did not perform phonetic analyses on their data to assess which specific factors play a role in the perception of one accent over the other(s). As such, in order to both unearth covert CLI effects and understand the underlying patterns that can potentially guide the perceptual evaluations of listeners, it is necessary to investigate non-native sound patterns beyond the phoneme.
3 Measuring Speech Rhythm
One possible domain that lends itself to this type of observation is speech rhythm. Very broadly, speech rhythm is the way recurring patterns in language are organized in time. Rhythm is intuitively perceived by people, not only in language, but also in other domains like poetry and music. Moreover, there is a tendency to perceive rhythmic patterns and to find recurring rhythmic structures even when they are not intended: Consider, for instance, the rhythm of footsteps in a hallway or the tapping of raindrops on a window pane. Even highly complex polyrhythms are interpreted as layers of patterns with an underlying metrical structure, although the concrete interpretations may not match across listeners (see an extensive study by Handel & Lawson, 1983). This universal bias towards interpreting acoustic events as possessing rhythmic structure can be expected to also extend to human language. While rhythmic properties are commonly assumed to differ crosslinguistically, there has been no empirical justification for a binary (or absolute) view of rhythmic classes (e.g., Abercrombie, 1967) across the languages of the world. Nevertheless, crosslinguistic differences in rhythmic structures have been shown to reflect underlying tendencies that can be evident at least in the psycho-perceptual domain.
1 One reviewer rightly noted that the subjects may have been speakers of the German multiethnolect Kiezdeutsch which is, among other things, associated with Turkish immigrant identity. This cannot be ruled out since their Turkish was not examined for accentedness in this case. We would then be dealing with an English accent associated with the variety of German that is spoken by the participants rather than with their HL. See Sect. 6 for more details.
This
assumption is primarily based on research findings on infants, who were shown to distinguish between languages on opposing ends of the rhythmic spectrum, but not between languages that are closer together rhythmically (Nazzi et al., 1998). Thus, it is reasonable to assume a spectrum along which languages display their rhythmic character, reflecting the phonetic and perceptual properties of a particular timing unit, be it the syllable or the stress interval, on which language users base their articulatory and perceptual behavior in speech. In recent studies, these crosslinguistic differences in rhythm have been largely attributed to the variability in the alternation of consonantal or vocalic intervals in speech. For instance, Ramus et al. (1999) measured the relative duration of vocalic intervals (%V) as well as the standard deviation of the duration of vocalic (ΔV) and consonantal (ΔC) intervals. Those languages that are considered to be at the stress-timed end of the assumed spectrum have been shown to cluster together with relatively low %V and high ΔV and ΔC, while those languages considered to be more syllable-timed yield inverse values for these metrics. Since studies showed ΔC and speech rate to be inversely correlated, Dellwo (2006) introduced metrics that normalize ΔV and ΔC for speech rate: VarcoV and VarcoC, respectively. These are calculated by multiplying the Δ of the segment in question (so, consonants in VarcoC and vowels in VarcoV) by 100 and dividing the result by the mean duration of that segment. In a slightly different approach, Grabe and Low (2002) developed the Pairwise Variability Index (PVI) to compare the relative durations of neighboring pairs of speech events and found that languages again are distributed on a continuum: PVI-V (the pairwise variability of vocalic durations), for instance, is low in Mandarin and Spanish, and relatively high in German, British English, and Malay. 
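The interval-based measures attributed above to Ramus et al. (1999) and Dellwo (2006) can be illustrated with a minimal Python sketch. It assumes that a sentence has already been reduced to a sequence of vocalic and consonantal intervals with known durations, and that Δ is a population standard deviation; the text does not state whether a sample or population formula is meant. The function name `interval_metrics` and the example durations are hypothetical.

```python
from statistics import mean, pstdev

def interval_metrics(intervals):
    """Compute %V, deltaV, deltaC, VarcoV and VarcoC for one utterance.

    intervals: list of (label, duration_in_seconds), label is "V" or "C".
    Assumption: delta is the population standard deviation (pstdev).
    """
    v = [dur for label, dur in intervals if label == "V"]
    c = [dur for label, dur in intervals if label == "C"]
    delta_v = pstdev(v)
    delta_c = pstdev(c)
    return {
        # share of vocalic material in the total duration, in percent
        "%V": 100 * sum(v) / (sum(v) + sum(c)),
        "deltaV": delta_v,
        "deltaC": delta_c,
        # Varco: delta times 100, divided by the mean interval duration
        "VarcoV": 100 * delta_v / mean(v),
        "VarcoC": 100 * delta_c / mean(c),
    }

# Hypothetical interval durations, for illustration only
sentence = [("C", 0.08), ("V", 0.12), ("C", 0.10),
            ("V", 0.06), ("C", 0.14), ("V", 0.18)]
metrics = interval_metrics(sentence)

# The same utterance spoken at half the rate: every interval doubles
slow = [(label, 2 * dur) for label, dur in sentence]
slow_metrics = interval_metrics(slow)
```

Under uniform slowing, `deltaV` and `deltaC` double while `VarcoV`, `VarcoC`, and `%V` stay the same, which is the sense in which the Varco measures are normalized for speech rate.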
Correcting for speech rate yielded a normalized variant of the PVI score, nPVI, introduced by Ling et al. (2000), as can be seen in (1) below.
(1) nPVI metrics
$$\mathrm{nPVI} = 100 \times \left[ \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right| \Big/ (m-1) \right]$$
m = total number of items; d_k = duration of the kth item
These four metrics, %V, ΔV/C, VarcoV/C, and the nPVI,2 constitute the core measures that are commonly used to quantify durational properties of speech and to compare them across languages.3
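The normalized index in (1) can be sketched in a few lines of Python. The function `npvi` follows (1) directly. The function `raw_pvi` is an assumption added for contrast: it takes the un-normalized PVI to be the mean absolute difference between successive durations, a definition the text does not spell out. The example durations are hypothetical.

```python
def npvi(durations):
    """Normalized Pairwise Variability Index, following (1).

    durations: successive interval durations d_1 ... d_m (m >= 2).
    """
    m = len(durations)
    if m < 2:
        raise ValueError("at least two intervals are required")
    total = 0.0
    for d_k, d_next in zip(durations, durations[1:]):
        # absolute difference of each pair, divided by the pair's mean
        total += abs(d_k - d_next) / ((d_k + d_next) / 2)
    return 100 * total / (m - 1)

def raw_pvi(durations):
    """Un-normalized PVI (assumed definition): mean absolute difference
    between successive durations."""
    m = len(durations)
    if m < 2:
        raise ValueError("at least two intervals are required")
    pairs = zip(durations, durations[1:])
    return sum(abs(a - b) for a, b in pairs) / (m - 1)

# Hypothetical vocalic durations in seconds
vowels = [0.12, 0.06, 0.18, 0.09]
# The same sequence at half the speech rate
vowels_slow = [2 * d for d in vowels]
```

Because every pairwise difference is divided by the mean of its pair, `npvi(vowels)` and `npvi(vowels_slow)` coincide, whereas `raw_pvi` doubles when all durations double.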
2 Usually, only PVI-V is normalized for speech rate in the field, but not PVI-C, since speech rate has only been found to correlate with the vocalic and not the consonantal variables. Since it is difficult to justify this practice in crosslinguistic studies involving languages with different vocalic and consonantal inventories as well as diverging phonotactic structures, we will use the normalized PVIs of both variables (i.e., nPVI-V as well as nPVI-C) to allow for a more objective measure of pairwise variability in speech.
3 Note that there are several other rhythm metrics, often slight mutations of each other, that are rarely used and are thus not mentioned here. See Fuchs (2016) for a review.
It should be noted that durational manifestations of rhythm can be reduced to phonetic reduction and syllable complexity in the language (Dasher & Bolinger, 1982; Dauer, 1983; Roach, 1982). Accordingly, in a language like Turkish, one of the background languages in the present study, where consonant clusters in general are rather restricted and there is no significant spectral concomitant of stress that would lead to a major quality opposition between stressed and unstressed vowels, we expect less durational variation than in a language like English or German, where reduction processes and phonological strings with complex consonant clusters are abundant. Hence, it is debatable whether rhythmic differences as measured by durational metrics are truly due to rhythm or an epiphenomenon of segmental and syllable structure. Furthermore, languages like Turkish and languages such as English or German are fundamentally different when it comes to the way they employ prosodic structures, especially in terms of intonation and prominence patterns (see Sect. 5.1). Consequently, qualitative and quantitative changes in pitch should also be considered when assessing crosslinguistic differences in the patterning of suprasegmental structures. Some indices that have been proposed along these lines are pitch range, the number of pitch peaks, and mean slope, which are all meant to quantify the way pitch changes contribute to the perception of rhythm in language (see, for instance, Vicenik & Sundara, 2013). Other indices that have been considered in the literature are pairwise measurements of f0, intensity, sonority, and loudness (see Fuchs, 2016 for an overview). Since reducing rhythm to any one of the concepts proposed so far is questionable, we employ a variety of approaches that, as a whole, reveal a picture of rhythmically relevant phenomena. 
In addition, since numerous studies have shown crosslinguistic differences on the basis of these metrics, we will assume, for now, that they reflect certain rhythmic differences pertaining to suprasegmental properties of languages. Thus, studying the way they interact in the learner can potentially uncover patterns in L2 and L3 speech irrespective of the fact that the metrics in question may only offer fragmentary answers to the true nature of “rhythm” in language.
4 Speech Rhythm in L2 and L3 Acquisition
By and large, in L2 speech, rhythmic properties have been shown to be influenced by the first language (L1) of the speaker (e.g., Lee & Jang, 2004). In their study on German and Mandarin learners of English, Li and Post (2014) tested both the acquisition of stress assignment through accentual lengthening and the overall proportion of the vocalic material used. They found a difference between the learner groups only in the proportion of vocalic material, which they interpreted as stemming from direct transfer of rhythmic properties from the L1. Furthermore, research has shown rhythmic values that are intermediate between the L1 and the target L2 (Carter, 2005), while Whitworth (2002) has found that simultaneous bilinguals manage to differentiate between the rhythmic patterns of their languages only if those (nativelike) differences are present in the linguistic input, and that they closely emulate
their parents’ values. Another study has shown non-convergence in the rhythm patterns between the English of Cantonese-English bilinguals and that of English monolinguals (Mok, 2011). Gut (2012) remarks that rhythm metrics might not be suitable at all to analyze L2 speech, mostly because of their correlation with overall speech rate (e.g., Dellwo & Wagner, 2003): In particular, since a learner’s speech rate can be expected to deviate from an L1 speaker’s speech rate, these measurements do not necessarily work to compare rhythmic features in learners vs. speakers. This, we feel, is an issue, however, only when trying to compare L2 speech to an L1 norm and can easily be avoided by analyzing speech samples of sufficient length and by carefully controlling for language proficiency. In what is, to our knowledge, the only full-scale study so far conducted on L3 rhythm,4 Gabriel and Rusca-Ruths (2015) tested a group of young Turkish Heritage Language (HL) speakers (n = 5) with German as their L2 and Spanish as an additional language (after English), and a control group of German monolingual students (n = 5) with the same language biography. They all were learning Spanish in secondary school and were being taught by a native speaker of Spanish, who was also included in the study as a control participant. Spanish is considered to be syllable-timed, and thus was assumed by the authors to pattern with the HL (Turkish) of the bilingual participants. They thus expected positive transfer from Turkish to Spanish, and therefore hypothesized the bilinguals to be more target-like in Spanish than the monolingual controls. All participants read sentences in all of their languages and %V and VarcoV were calculated. They did indeed find their hypothesis confirmed for most of their participants. 
However, this apparent benefit of a rhythmically closer background language did not hold for one of the participants, who performed relatively poorly, which the authors explained by a low degree of metalinguistic awareness on the part of that particular speaker. In their study, metalinguistic awareness was assessed by explicitly asking participants about the rhythmic properties of their languages. Gabriel & Rusca-Ruths conclude that a typologically close rhythm in one of the background languages facilitates the acquisition of an L3 rhythm, but that a lack of metalinguistic awareness can override this facilitating effect. The present study takes this line of research further and, instead of quantifying the degree to which speakers are able to produce target-like L3 rhythm (as facilitated by either one of the background languages), it sets out to examine the differences and similarities in the rhythmic properties of the languages that bilinguals have mastered as their second and third languages. In particular, we ask whether these “rhythms” converge with those of monolinguals when both groups produce the same two languages and thus extend the approach used in Gabriel & Rusca-Ruths by examining rhythmic values in both the L2 and the L3, and by employing a larger number and array of rhythm metrics (both durational and pitch-based) in order to account for multi-dimensional transfer phenomena.
4 But see Gut (2010) for a case study on four trilingual speakers with different background languages, which shows that these speakers produced rhythmic structures that were distinct from those of native speakers.
In other words, the present study
asks what influence the first and second languages already present in the learners’ mind may have on the production of rhythm in an L3.
5 Speech Rhythm in Monolingual and Bilingual Speakers of English
5.1 Research Objectives and Questions
We aim to examine the complex nature of crosslinguistic transfer in L3 by focusing on a speech pattern that adds to the global accentedness of an individual in an additional language. More specifically, we examine two groups of language learners who learn an additional target language either as their second or third language (L2 or L3 respectively). Both groups share a common language whose rhythmic properties overlap with those of the target language, henceforth the facilitating-background language. We ask whether rhythm in the additional language develops differently since one of the groups has another language in the background that has rhythmic properties that do not match those of the target language, henceforth the adverse-background language. To answer that question, we compare adult monolingual German speakers and adult Turkish-German early bilinguals in their acquisition of the rhythmic properties of English, an L2 for German monolinguals but an L3 for Turkish-German bilinguals. While both German and English are considered to be stress-timed languages, Turkish aligns more with syllable-timed languages due to the prevalence of syllable-based generalizations in its phonological system (e.g., Kabak, 2014, see Sect. 3). Similar to languages such as Hungarian and Basque, Turkish has also been shown to have a relatively higher %V score than, for example, English and Dutch (e.g., Nespor et al., 2011), resulting in its clustering with the syllable-timed languages. As such, as far as the rhythmic properties are concerned, German will be assumed to act as a facilitating-background language and Turkish as an adverse-background language when the target language is English. 
We hypothesize that group differences, if any, must stem from the adverse-background language, Turkish, but we also entertain the possibility that the adverse effects stemming from Turkish might have already been woven into the rhythmic structure of the bilinguals’ German. Therefore, we also compare both learner groups’ rhythmic structure in their German (L1 for monolinguals, L2 for bilinguals).5
5 Although age of acquisition in German differs between the groups, due to early immersion, the bilingual group is generally perceived as native-like, with presumably pervasive nuances that may emerge in highly scrutinized contexts such as laboratory settings.
More specifically, in our experimental study, we ask:
a. whether the rhythmic patterns in the shared facilitating-background language, German, differ significantly between the two groups, and if so, in what way, and
b. whether the two learner groups approach the task of acquiring an additional language differently and therefore produce English with different rhythmic structures due to the adverse-background language (Turkish in the bilinguals), or whether their rhythmic structures will converge due to the common facilitating-background language (German in both groups).
We also ask in precisely which suprasegmental domains crosslinguistic influence emerges, leading to variation in rhythmic patterns. While language-specific restrictions on syllable structure as well as temporal properties are reflected in the rhythmic patterns of individual languages, other prosodic properties such as stress and intonation may be additionally responsible for crosslinguistic differences in rhythmic structure. Indeed, Turkish differs from English and German not only in terms of the function and realization of word-level stress (e.g., Domahs et al., 2012; Kabak & Vogel, 2001; Kabak, 2016; Zora et al., 2016), but also concerning utterance-level accentual phenomena and the inventory of pitch-accents used therein (e.g., Güneş, 2015; Kamali, 2011). Concerning word-level stress, while f0 has been shown to be the most reliable cue to distinguish stressed vowels from unstressed ones in Turkish, the average differences in intensity and duration, although significant, have been argued not to yield perceptually robust cues for the language user (e.g., Levi, 2005; Pycha, 2006). Furthermore, as mentioned in Sect. 3, Turkish stress assignment does not render any phonemic opposition in terms of a vowel quality difference since each of the 8 vowel phonemes can theoretically be stressed or unstressed. Concerning the utterance-level accentual phenomena, it has been convincingly shown that Turkish patterns with what Féry (2010) calls phrase languages (Güneş, 2013). As such, Turkish can be expected to manifest rhythmic structure through different parameters than intonation languages like English or German. 
On the basis of these salient differences at the level of stress and intonation, we will not only examine durational concomitants of rhythm but also explore whether the differences in the pitch-based implementations of vowel alternations, which presumably also contribute to the percept of rhythm, can be responsible for crosslinguistic differences in the acquisition of rhythm.
5.2 Methodology
5.2.1 Participants
The data used in this study was collected from Turkish-German early bilinguals (n = 6, mean age = 22.17, sd = 2.67) and a control group of 9 German native speakers with L2 English (mean age = 24.56, sd = 2.63) with no other background language acquired in early childhood. Both groups included both male and female speakers. The subjects in the bilingual group grew up in Germany and went through the German educational system uninterruptedly. When tested on their overall German proficiency using a C-test (Schmid & Dusseldorp, 2010), the bilingual speakers did
not perform significantly differently from their monolingual peers, which also provides evidence of their native-like performance (as assumed above). All bilinguals self-reported to use Turkish in their family and close social circles. Using the Bilingual Language Profile (BLP, Birdsong et al., 2012), administered on the day of the data collection, all bilinguals were however shown to be German-dominant. Due to the unbalanced dominance of their L1 and L2 as well as the specialized use of the L1 in more private spheres of their lives, our bilinguals constitute prototypical heritage speakers of Turkish (see, for instance, Valdés, 2005) and Turkish is thus their heritage language (HL). All participants were undergraduate students of English language and literature at a German university at the time of testing and can thus be assumed to be highly proficient speakers and regular users of English, which they had all initially started acquiring between the ages of 8 and 10 in the German school system. Additional languages spoken by some of the participants were French, Spanish, and Italian. One participant knew Chinese and another one Arabic, and one participant knew no additional languages. Five participants had learned Latin at some point (either at school or at university), but since Latin is not taught as an actively spoken language, it cannot be expected to influence L3 phonology. A detailed list of the linguistic background of all the participants can be found in Tables 1 and 2.

Table 1 Monolingual participants
| Subject | Age | Gender | Additional languages |
|---|---|---|---|
| M01 | 27 | Male | French |
| M02 | 24 | Male | French, Spanish, (Latin) |
| M03 | 23 | Female | Italian, (Latin) |
| M04 | 28 | Male | French, (Latin) |
| M05 | 21 | Female | French, Spanish |
| M06 | 24 | Female | French, Italian |
| M07 | 22 | Male | French, Spanish, Chinese |
| M08 | 23 | Female | French |
| M09 | 29 | Female | French |

Table 2 Bilingual participants
| Subject | Age | Gender | Additional languages |
|---|---|---|---|
| B01 | 24 | Female | None |
| B02 | 24 | Female | French |
| B03 | 25 | Male | Arabic |
| B04 | 19 | Female | French, (Latin) |
| B05 | 18 | Female | (Latin) |
| B06 | 23 | Female | Spanish, (Latin) |
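The group statistics reported in Sect. 5.2.1 can be checked against the ages listed in Tables 1 and 2. The following Python sketch is only such a check. It assumes that the reported standard deviations are population standard deviations, which the text does not state explicitly. The variable and function names are hypothetical.

```python
from statistics import mean, pstdev

# Ages as listed in Table 1 (monolinguals) and Table 2 (bilinguals)
monolingual_ages = [27, 24, 23, 28, 21, 24, 22, 23, 29]
bilingual_ages = [24, 24, 25, 19, 18, 23]

def describe(ages):
    """Return (mean, population SD), both rounded to two decimals.

    Assumption: population SD (pstdev); a sample SD would give
    larger values for groups of this size.
    """
    return round(mean(ages), 2), round(pstdev(ages), 2)

monolingual_summary = describe(monolingual_ages)
bilingual_summary = describe(bilingual_ages)
```

Under this assumption the tabulated ages reproduce the values given in the running text (24.56 and 2.63 for the monolinguals, 22.17 and 2.67 for the bilinguals).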
5.2.2 Material and Speech Analysis
The speech elicitation data that we analyzed for this study stems from a reading task that was part of another empirical study on L3 acquisition. The stretch of English speech we analyzed for our current purpose (see Appendix 1) comprised a total of 331 words (454 syllables). Since the text was designed to elicit segmental features, variables like sentence length or stress/accent patterns were not controlled for (unlike in Gabriel & Rusca-Ruths, 2015). This is considered an advantage since it makes for a random sample of test sentences, which enhances the reliability of the rhythm metrics.6 In addition to the English text, data elicited for German consisted of readings of the text “Der Nordwind und die Sonne” (“The North Wind and the Sun”) which comprises 108 words (180 syllables) in the version we used (see Appendix 2). Audio recordings of both elicitations took place in a quiet room, using an Olympus LS-11 speech recorder. The language of instruction was exclusively English to put all the participants in target language (TL) mode. The reading passages in both languages were analyzed for the physical realization of speech rhythm using the following two approaches:
(a) A durational approach: For the sake of comparability, we employed a variety of durational rhythm metrics that are widely used in previous studies.
(b) A pitch-based approach: As far as English is concerned, the percept of rhythmic patterns is due to the alternation of stressed and unstressed syllables, and the segmental and durational concomitants of stress. Irrespective of segmental and durational properties of stressed vs. unstressed syllables, however, language users may employ different cue-weightings in the acoustic realization of stressed syllables, which may lead to additional differences in the perception of rhythm that are not directly related to duration. As discussed in Sect. 
5.1, Turkish word- and utterance-level prosody displays patterns that differ from those of languages such as English and German. These differences are again expected to emerge from language-specific realizations of stress and create a venue for potential CLI effects in producing the target language, albeit in terms of a truly pitch-based, rather than a duration-based, perceptual difference. Here we chose mean pitch and average slope as our main metrics since in Vicenik and Sundara (2013) those were the ones affected by language (American English vs. German), but not by variety (American vs. Australian English). Pitch range was tested as a supplementary metric. The speech material was automatically annotated and segmented using the WebMAUS service (Kisler et al., 2017), and the annotations were then checked and corrected (if needed) by hand using Praat (Boersma & Weenink, 2018).
6 See Arvaniti (2012), who showed that different types of rhythm can be elicited from one and the same language if the prosodic structure of words used in sentences is manipulated in a way to foster a specific type of rhythm.
Each sentence was analyzed in a separate file to obtain one data point per sentence per speaker for each measure (13 sentences per speaker in the English text and 6 sentences
per speaker in the German one). For the rhythm measures, each segment was either labeled C if it was an obstruent or V if it was a sonorant. Based on Vicenik and Sundara (2013), we chose a distinction between obstruents and sonorants rather than the classical vowel-consonant categorization since, as they argue, sonorant consonants often form syllable nuclei in Germanic languages and are thus, functionally, vocalic material. Furthermore, according to the authors, the rhythm metrics have been shown to be equally useful when using the obstruent-sonorant rather than the consonant-vowel distinction. We then calculated different durational rhythm indices that have been shown to yield significant differences in rhythm in previous L2 or bilingual studies, i.e., %V, ΔV, ΔC, VarcoV, and VarcoC (using a Praat script),7 as well as nPVI-V and nPVI-C using Matlab (The MathWorks, 2018), and compared the values across groups and languages. Additionally, overall speech rate was extracted for each speaker for each language in Praat. For the pitch-based approach (pitch-based metrics), we extracted values for mean pitch, pitch range, and mean slope for each sentence in Praat.
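The labeling step described above can be sketched as follows. This is an illustration under stated assumptions, not the Praat script used in the study: the obstruent inventory is a hypothetical SAMPA-like set, the input is assumed to contain only speech segments (no pauses), and adjacent segments of the same class are assumed to be merged into a single interval before any metric is computed.

```python
# Hypothetical SAMPA-like obstruent set; an assumption for illustration,
# not the inventory used by the authors.
OBSTRUENTS = {"p", "b", "t", "d", "k", "g", "f", "v", "s", "z",
              "S", "Z", "T", "D", "C", "x", "h", "tS", "dZ", "pf", "ts"}

def to_intervals(segments):
    """Map phones to C (obstruent) or V (sonorant) and merge neighbors.

    segments: list of (phone, duration_in_seconds) in temporal order.
    Returns a list of (class_label, summed_duration) intervals.
    """
    intervals = []
    for phone, dur in segments:
        label = "C" if phone in OBSTRUENTS else "V"
        if intervals and intervals[-1][0] == label:
            # same class as the previous segment: extend that interval
            intervals[-1] = (label, intervals[-1][1] + dur)
        else:
            intervals.append((label, dur))
    return intervals

# Hypothetical segmentation of the English word "plants"
plants = [("p", 0.07), ("l", 0.05), ("{", 0.11),
          ("n", 0.06), ("t", 0.05), ("s", 0.09)]
plants_intervals = to_intervals(plants)
```

In this example the sonorant consonants /l/ and /n/ join the vowel in one V interval, so the word yields three intervals (C, V, C) rather than the C, C, V, C, C, C sequence that a vowel-consonant split would give.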
5.3 Hypotheses
Based on Gabriel and Rusca-Ruths (2015, see Sect. 4), we expect the HL of the bilinguals (i.e., Turkish) to exert an influence on their productions of rhythm in the TL English, meaning that they will produce values that deviate significantly from the monolingual speakers, exhibiting a shift towards more syllable-timed values. If this is shown to be the case, it might be due to a potential early divergence in their production of German (i.e., their German is already influenced by Turkish) rather than their Turkish HL transferring directly onto their L3. In this case, we will find group differences in the German speech sample that overlap, at least in parts, with the differences found in the English production. If the bilinguals produce values similar to their monolingual peers, they might make use of what they know from their German, which is rhythmically closer to English than Turkish. This would point towards two separate rhythm systems in the bilinguals’ mind and contradict the idea of an integrated system (see Jared & Kroll, 2001, Boukrina & Mariam, 2006, Kopečková et al., 2016) since it would show that the learners draw from an L2 that provides them with untarnished native speaker values. In that case, the two speaker groups should also converge in their production of German. Table 3 shows the expected directions of differences between the monolinguals and the bilinguals in case of rhythm transfer from Turkish for the individual rhythm metrics, assuming Turkish to pattern with languages such as French and Spanish (i.e., those languages that are classically described as syllable-timed).
7 While we used a categorization into obstruents and sonorants, we will consider obstruents to be equivalent to consonantal material and sonorants equivalent to vocalic material. Thus, measures based on obstruents will be referred to as consonantal metrics (e.g., VarcoC instead of VarcoO), while those based on sonorants will be referred to as vocalic ones (e.g., VarcoV instead of VarcoS).
Table 3 Predictions of group differences based on expected CLI effects
Rhythm metrics    Predicted direction
%V                monolingual < bilingual
ΔC                monolingual > bilingual
VarcoV            monolingual > bilingual
VarcoC            monolingual > bilingual
nPVI-V            monolingual > bilingual
nPVI-C            monolingual > bilingual
It should be pointed out that utterance-level prosody is also expected to yield CLI effects in the languages in question, which may lead to group differences in pitch-based in addition to duration-based measures. It remains to be seen, however, whether these two types of differences will exert an influence in the same direction and magnitude.
5.4 Results
For each of the metrics, two Linear Mixed Effects models were fitted, one using the data of the German and one using the data of the English reading text, with the respective measure as a dependent variable and with participant as a random factor. Cohen’s d was calculated for all metrics as a measure of effect size. A higher Cohen’s d indicates a larger effect, but note that a large effect does not necessarily imply statistical significance. For the sake of completeness, Cohen’s d is given for each of the pitch and durational metrics, but is only deemed meaningful in those measures that have been shown to differ significantly between groups. Table 4 shows the results of the German reading text. While Group is not a significant predictor of any of the pitch-based metrics in German, there is a significant main effect of Group for %V, ΔC, VarcoV, and nPVI-C, but not for ΔV, VarcoC, and nPVI-V. Furthermore, speech rate is predicted by Group in the German data. Concerning the English reading text, the models reveal a significant main effect of Group for all the durational metrics we tested for, as can be seen in Table 5. At the same time, none of the pitch-based metrics were significantly predicted by Group, and, contrary to the German reading text, neither was speech rate. In the German text, %V and VarcoV were higher, while ΔC and nPVI-C were lower in the bilinguals than in the monolinguals. Similarly, the bilingual subjects produced higher %V, ΔV, VarcoV, and nPVI-V in the English text than the monolinguals. The same is true for VarcoC (i.e., higher values in the bilinguals), but not for nPVI-C and ΔC. We argue that the unexpected result with VarcoC is rather epiphenomenal: Essentially, VarcoC corresponds to ΔC normalized for the mean duration of consonantal intervals, as can be seen in (2). As such, we would not intuitively expect the direction of the
Table 4 Main effects in the German reading text
Metrics    Estimate    Std. error    t value    p value    |d|
%V         −1.1750     0.1851        −6.348
Fig. 1 Interaction Language*Group for VarcoC
significantly only for VarcoC (see Fig. 1): While this index does not differ between the two learner groups in the production of German, it does so in English, where the bilinguals produce higher values for VarcoC than the monolinguals. There were no significant interactions in any of the other metrics.
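The effect-size measure reported in this section, Cohen's d, can be illustrated with a short sketch. The chapter does not state which variant of d was computed, so the classical pooled-standard-deviation formula below is an assumption, and the function name `cohens_d` and the sample values are invented for the example.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Classical Cohen's d: mean difference divided by the pooled SD."""
    n_a, n_b = len(group_a), len(group_b)
    # Pooled variance weights each group's sample variance by its
    # degrees of freedom.
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical %V values per speaker, invented for illustration only.
monolingual = [52.1, 53.4, 51.8, 52.9, 53.0]
bilingual = [54.2, 55.1, 53.9, 54.8, 55.5]
print(f"|d| = {abs(cohens_d(monolingual, bilingual)):.2f}")
```

Because d expresses the group difference in units of standard deviation, it says how large a difference is but not whether it is statistically reliable, which is why the text treats it as meaningful only alongside a significant effect of Group.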
6 Discussion
We hypothesized that group differences that we might find would stem from the adverse-background-language, Turkish. This hypothesis could only be partially confirmed. Most Group differences found in the rhythm metrics in the German of the bilinguals that were analyzed here can indeed be explained by straightforward transfer effects: %V, ΔC, VarcoV, and nPVI-C all exhibit a significant effect of Group, with all but VarcoV matching the direction predicted by the transfer of rhythmic properties from the bilinguals’ L1 Turkish. The same effects surface in the English reading text. In particular, %V turns out to be higher in the bilingual group, while ΔC and nPVI-C are lower, both of which are on a par with the characteristics of syllable-timed languages. This is also the case in the German reading text. However, group differences also occur in all the other durational metrics, and some metrics, namely ΔV, VarcoV, VarcoC, and nPVI-V, showed rather unexpected results in the English productions. More specifically, they were higher in the bilinguals than in the monolinguals, which does not support a transfer effect from Turkish at first sight. However, this outcome finds a straightforward explanation from the point of view of both developmental and phonetic variability, as L2 grammars in particular are characterized by high variability. Furthermore, speech rate is significantly higher in the bilinguals in the production of the German text. Even though the trend towards a slightly elevated speech rate in English does not reach significance, it still seems to have an influence on ΔV. We therefore attribute the higher ΔV found in the bilinguals to an inverse relationship between vocalic durations and speech rate (as shown, for instance, by Dellwo & Wagner, 2003).
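The sensitivity of raw interval measures to tempo, which motivates rate-normalised indices such as VarcoV, can be illustrated with a small arithmetic sketch. It rests on a strong simplifying assumption, namely that a change in speech rate rescales every interval by the same factor, which real speech does not do; the durations and the helper names `delta` and `varco` are invented for the example and do not reproduce the study's data.

```python
from statistics import mean, pstdev


def delta(durs):
    """Raw standard deviation of interval durations (as in ΔV or ΔC)."""
    return pstdev(durs)


def varco(durs):
    """Standard deviation normalised by the mean duration, times 100."""
    return 100 * pstdev(durs) / mean(durs)


# Hypothetical vocalic interval durations in seconds.
slow = [0.06, 0.10, 0.14, 0.08, 0.12]
# Assumption: a faster rendition shortens every interval by 20%.
fast = [d * 0.8 for d in slow]

print(f"delta slow: {delta(slow):.4f}  delta fast: {delta(fast):.4f}")
print(f"varco slow: {varco(slow):.2f}  varco fast: {varco(fast):.2f}")
```

Under this assumption the raw standard deviation changes in proportion to the scaling factor while the Varco value stays the same, so a group difference in a Δ measure can reflect tempo as well as rhythm.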
Fig. 2 Rhythm metrics across groups in English and German
It should be noted that, even in metrics that are not predicted by Group in the German reading text, there is a trend towards group differences (as shown in Fig. 2 in Sect. 5.4) that consistently matches the direction of the same metrics found in the English production. In other words, the trends in the English data fit the effects found in German. Therefore, the differences between the speaker groups cannot be language-specific but rather hold for both the bilinguals’ L2 (German) and their L3 (English). However, with the exception of VarcoV, all group differences in German can be explained by transfer from Turkish, i.e., the bilingual speakers produce a type of German that is more syllable-timed than that produced by the monolinguals. In the productions of English, some metrics show the opposite effect, whereas those same metrics do not reveal any group differences in German. This leads us to suggest that Turkish rhythm has an influence on the bilinguals’ German, and that both Turkish-specific as well as additional, non-language-specific factors have an effect on the bilinguals’ L3 English. Finally, looking at the interactions between Language and Group in each index using the aggregate data showed that VarcoC is the only index that shows a significant interaction. This is largely on a par with the observation we made above that rhythm behavior is non-selective such that the groups were not only
different from one another on several counts, but they were also consistent in their rhythmic patterning irrespective of the language in which they performed. All in all, the bilinguals produce rhythmic patterns in their L2 German that are similar to those they produce in their L3 English and deviate from those found in the monolingual German speakers. Additionally, more group differences surface in their L3 English than in their L2 German. This gives further support to the notion of an integrated multilingual language system with interconnected linguistic subsystems, as also shown by Kopečková et al. (2016). The fact that the differences are less pronounced in the L1/L2 (i.e., German) than they are in the L2/L3 (i.e., English) might hint at a proficiency effect, i.e., the more proficient the multilingual speakers are in a language, the closer they are to performing on a par with their monolingual peers. Since the bilinguals’ proficiency in German is close to native-like, more differences between the monolinguals and bilinguals are bound to surface in the less proficient language, i.e., the target language English, than in the shared language. Unlike the study by Fuchs (2016), we did not find any group effects in any of the pitch-based indices we tested. This might be a methodological issue: We might simply have picked measurements that were not capable of revealing differences in the particular constellation of languages (e.g., German and English). A more likely reason, however, is that differences in the pitch correlates of rhythm do not surface in the speech produced by highly proficient but non-native L2/L3 speakers. Indeed, pitch differences may be more salient than slight deviations in durational values, and thus their acquisition is more likely to be complete before that of durational phenomena.
This is supported by the fact that infant-directed speech is characterized by higher variability in pitch, but also by an overall lower speech rate than adult-directed speech (see, for instance, McMurray et al., 2013). In summary, we are dealing with one perceptual entity in the study of rhythm (see Sect. 3), but on closer inspection and by using different metrics that have all been shown to reflect the same typological and perceptual categorization, three distinct patterns become apparent in the production of L3 rhythm in this study: (a) transfer from the L2 German (beneficial), (b) transfer from the HL Turkish (detrimental, either directly or via a Turkish-influenced German), and, crucially, (c) a multi-grammar effect that triggers higher variability in the bilinguals than in the monolingual controls. Moreover, the bilinguals exhibit values that differ significantly from those produced by the monolinguals in a higher number of the metrics in English than in German, which indicates an additional, non-CLI effect at play in the L3. Layers of the same perceptual entity, then, seem to be subject to different types of language-specific and non-language-specific influence, or CLI “bits” (i.e., transfer and non-transfer effects), simultaneously. As a next step, it will be important to re-assemble these bits and examine how speakers are perceived in terms of their accentedness in order to identify the ways in which these bits interact and then surface as perceived rhythmic structure. Finally, a few statements are in order concerning the potential influence of Kiezdeutsch, the German variety spoken by large groups of young immigrants, on the L2 and L3 productions of the bilinguals (Wiese, 2012). We regard this as unlikely in this particular speaker group for two reasons: All speakers were tested in an academic
environment that is associated by the participants with more formal speech, which is supported by the experimenter’s impression that the German produced by the participants is close to standard German with a slight regional influence in some of the speakers. Furthermore, if it were a German multiethnolect that was the source of the CLI we observed in our data, we would expect the influence on the (standard) German to be stronger than the influence on the L3, which was not the case in our results. Instead, we speculate that the skewing of speakers’ durational values (for any of the indices) towards Turkish in any of the languages learned later in life could be due to a universally preferred, unmarked type of rhythm that is retained for all other languages as the default. The observation of Grabe et al. (1999) that syllable timing is acquired earlier than stress timing and Polyanskaya and Ordin’s (2015) finding that, when acquiring a stress-timed L1, a child’s speech patterns tend to be syllable-timed at the outset and become increasingly stress-timed (i.e., target-like), accord with our speculation. Furthermore, the fact that both the inner and the outer-circle varieties of English are converging on a syllable-timed rhythm, as shown in studies on New Zealand English (Nokes & Hay, 2012), Hong Kong English (Setter, 2006), Nigerian English (Gut, 2002), Multicultural London English (Torgersen & Szakay, 2012), and Indian English (Fuchs, 2016), turns the notion of markedness into a promising test case for future studies on the L2/L3 acquisition of rhythm.
7 Conclusions
In this chapter, we asked whether the rhythmic properties of a syllable-timed heritage language would be traceable in a stress-timed L3 in speakers whose L2 is also stress-timed, and, if that was the case, whether these traces were already visible in the speakers’ L2. To this end, we tested two groups of highly proficient users of English, Turkish-German bilinguals and German monolinguals, on their production of speech rhythm in English and German. We found that the two groups indeed performed differently in both languages, albeit not in all the metrics they were tested on. Specifically, although no differences were attested in any of the pitch-based metrics, the groups differed in most of the durational metrics. Furthermore, they showed more differences in their English than in their German, although the general tendencies were robustly similar in both of these languages. These findings have important methodological and theoretical implications. As demonstrated in our study, one can make use of rhythm metrics to uncover differences in speech production when comparing populations that are not expected to differ in their proficiency (and thus in their speech rate). The quality and quantity of these differences can then inform us about the interaction of multiple languages in the multilingual mind. When we find group differences in the speakers’ L3, it is thus crucial to look at the speakers’ background languages as well, since, rather than assuming simple transfer from either of the languages, we might be dealing with transfer from one or more affected background language(s). Ideally, all of the
speakers’ languages should be considered, which was not possible in this study due to methodological limitations. For the field of L2 and L3 phonology in general, our findings give further evidence for language systems in the bilingual that are at least partly integrated. In the production of prosody, bilingual participants exhibited similar behavior in the two languages (German and English), and both seemed to be influenced to some degree by their L1 (Turkish). This influence may, among other factors, be determined by universal tendencies as the transfer of the unmarked structure may be preferred if this option is available in the multilingual grammar. Crucially, however, this type of HL transfer does not happen across the board, not even when various aspects of the same abstract entity (i.e., rhythm as it is manifested in different metrics) are examined. Rather, it is individual “data packets” or “bits” that can get transferred from either or both background languages. In the case of phonemic segments, these data packets could be articulatory gestures or distinctive features, while in the case of prosody they could be, as shown in this study, different concomitants of rhythmic structure. So, while some of the rhythm metrics reveal the predicted effect from the HL in the bilingual speakers, others give evidence to the contrary, suggesting a non-language-specific, bilingual effect. In summary, we are dealing with property-by-property and bit-by-bit transfer in L3 phonology wherein the source of transfer is selected due to universal as well as language-specific attributes. In further studies, it will be particularly interesting to see the concrete realization of rhythm in the bilinguals’ L1, especially since we can assume that their heritage language is different from that same language in a monolingual L1 setting. 
Furthermore, it is necessary to explore in more depth whether and how the differences we found between monolingual and bilingual speakers are perceived by native as well as by non-native speakers of the target language. The fact that it was the syllable-timed background (heritage) language that could explain the variation in the L3 speech patterns, and not the dominant stress-timed background language, can be taken to suggest that universal forces other than CLI may be at play. Future research should thus compare L1s with conflicting rhythm types in the acquisition of L3 speech patterns.
Acknowledgements We would like to thank Marlene Keßler for her assistance in the coding of the data, as well as the two anonymous reviewers for their helpful comments and suggestions. Needless to say, all errors are our own.
Appendix 1: Reading Text English
One day Catherine and Rose, two recent Caltech graduates who had been friends ever since they had been little children, decided to go on a three-week hiking trip to the Laprig hills together. They met at Rose’s place to make travel plans: to book their hotel room, decide on various places they wanted to visit for their breakfasts, lunches and dinners and find out about the rangers’ warnings they had to heed. Rose had
prepared the most wonderful meal—potato soup, cooked plums with root vegetables and cranberry sauce, and apple ice cream with sunflower seeds for dessert—and she had put a full bottle of expensive red wine into the fridge. Rose was extremely excited about their holiday and wanted this evening of planning to be perfect. They used to go to the theater together every Thursday, but the two young women hadn’t really spent quality time with each other for the last three years since they both had extremely busy schedules. At six o’clock sharp Catherine rang the doorbell and Rose opened her front door. After they had shooed away a rook from the front porch, the two friends hugged happily and went inside, where Catherine hung up her bag on the coat rack. Rose asked her to take a seat. Doing as she had been bid, Catherine sat down at the clean wooden table and Rose went into the kitchen to fetch drinks. While Catherine was sitting in the living room on her own contemplating the neat yellow pattern on Rose’s wallpaper she heard the strangest noise. It sounded as if a panther cub was trying to roar, and she could not place it anywhere: It might have originated outside the window just as well as right under the plush sofa that she could just see from where she was sitting. For a moment Catherine was worried, but then Rose came back with two crystal glasses and the wine, so Catherine forgot about the incident.
Appendix 2: Reading Text “Der Nordwind und die Sonne”
Einst stritten sich Nordwind und Sonne, wer von ihnen beiden wohl der Stärkere wäre, als ein Wanderer, der in einen warmen Mantel gehüllt war, des Weges daherkam. Sie wurden einig, dass derjenige für den Stärkeren gelten sollte, der den Wanderer zwingen würde, seinen Mantel abzunehmen. Der Nordwind blies mit aller Macht, aber je mehr er blies, desto fester hüllte sich der Wanderer in seinen Mantel ein. Endlich gab der Nordwind den Kampf auf. Nun erwärmte die Sonne die Luft mit ihren freundlichen Strahlen, und schon nach wenigen Augenblicken zog der Wanderer seinen Mantel aus. Da musste der Nordwind zugeben, dass die Sonne von ihnen beiden der Stärkere war.
References
Abercrombie, D. (1967). Elements of general phonetics. Aldine.
Arvaniti, A. (2012). The usefulness of metrics in the quantification of speech rhythm. Journal of Phonetics, 40(3), 351–373. https://doi.org/10.1016/j.wocn.2012.02.003
Bardel, C., & Falk, Y. (2012). The L2 status factor and the declarative/procedural distinction. In J. Cabrelli Amaro, S. Flynn, & J. Rothman (Eds.), Studies in bilingualism. Third language acquisition in adulthood (Vol. 46, pp. 61–78). John Benjamins. https://doi.org/10.1075/sibil.46.06bar
Bardel, C., & Lindqvist, C. (2007). The role of proficiency and psychotypology in lexical crosslinguistic influence: A study of a multilingual learner of Italian L3. Atti del VI Congresso Internazionale dell’Associazione Italiana di Linguistica Applicata, 123–145.
Bergmann, C., Nota, A., Sprenger, S. A., & Schmid, M. S. (2016). L2 immersion causes non-nativelike L1 pronunciation in German attriters. Journal of Phonetics, 58, 71–86. https://doi.org/10.1016/j.wocn.2016.07.001
Birdsong, D., Gertken, L. M., & Amengual, M. (2012). Bilingual language profile: An easy-to-use instrument to assess bilingualism. COERLL, University of Texas at Austin. Web. 20 Jan 2012. https://sites.la.utexas.edu/bilingual/
Boersma, P., & Weenink, D. (2018). Praat: Doing phonetics by computer [Computer program]. Version 6.0.40. Retrieved May 11, 2018, from http://www.praat.org/
Boukrina, O., & Marian, V. (2006). Integrated phonological processing in bilinguals: Evidence from spoken word recognition. In Proceedings of the Cognitive Science Society.
Cabrelli Amaro, J. (2017). Testing the phonological permeability hypothesis: L3 phonological effects on L1 versus L2 systems. International Journal of Bilingualism, 21(6), 698–717. https://doi.org/10.1177/1367006916637287
Carter, P. M. (2005). Quantifying rhythmic differences between Spanish, English, and Hispanic English. In R. S. Gess & E. J. Rubin (Eds.), Amsterdam studies in the theory and history of linguistic science Series 4, Current issues in linguistic theory: Vol. 272. Theoretical and experimental approaches to Romance linguistics: Selected papers from the 34th Linguistic Symposium on Romance Languages (LSRL), Salt Lake City, March 2004 (Vol. 272, pp. 63–75). Benjamins. https://doi.org/10.1075/cilt.272.05car
Dasher, R., & Bolinger, D. (1982). On pre-accentual lengthening. Journal of the International Phonetic Association, 12, 58–69.
Dauer, R. M. (1983). Stress-timing and syllable-timing reanalysed. Journal of Phonetics, 11, 51–69.
Dellwo, V. (2006). Rhythm and speech rate: A variation coefficient for deltaC. In P. Karnowski & I. Szigeti (Eds.), Language and language-processing (pp. 231–241). Peter Lang.
Dellwo, V., & Wagner, P. (2003). Relationships between rhythm and speech rate. Presented at the 15th International Congress of the Phonetic Sciences, Barcelona, August 3–9, 2003.
Domahs, U., Genç, S., Knaus, J., Wiese, R., & Kabak, B. (2012). Processing (un)-predictable word stress: ERP evidence from Turkish. Language and Cognitive Processes, 1–20.
Domene Moreno, C. (2021). Beyond transfer? The acquisition of an L3 phonology by Turkish-German bilinguals. Doctoral thesis, University of Würzburg.
Féry, C. (2010). The intonation of Indian languages: An areal phenomenon. In I. Hasnain & S. Chaudhury (Eds.), Festschrift in honour of Ramakant Agnihotri.
Flynn, S., Foley, C., & Vinnitskaya, I. (2004). The cumulative-enhancement model for language acquisition: Comparing adults’ and children’s patterns of development in first, second and third language acquisition of relative clauses. International Journal of Multilingualism, 1(1), 3–16. https://doi.org/10.1080/14790710408668175
Fuchs, R. (2016). Speech rhythm in varieties of English: Evidence from educated Indian English and British English. Springer.
Gabriel, C., & Rusca-Ruths, E. (2015). Der Sprachrhythmus bei deutsch-türkischen L3-Spanischlernern: Positiver Transfer aus der Herkunftssprache? In S. Witzigmann & J. Rymarczyk (Eds.), Mehrsprachigkeit als Chance (pp. 185–203). Peter Lang.
Grabe, E., Post, B., & Watson, I. (1999). The acquisition of rhythmic patterns in English and French. Proceedings of the International Congress of Phonetic Sciences, 1201–1204.
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In C. Gussenhoven & N. Warner (Eds.), Laboratory phonology VII (pp. 515–546). Mouton de Gruyter.
Gut, U. (2010). Cross-linguistic influence in L3 phonological acquisition. International Journal of Multilingualism, 7(1), 19–38.
Gut, U. (2012). Rhythm in L2 speech. In D. Gibbon, D. Hirst, & N. Campbell (Eds.), Rhythm, melody and harmony in speech: Studies in honour of Wiktor Jassem, special edition of speech and language technology (pp. 83–94). Poznan.
Gut, U. (2002). Prosodic aspects of Standard Nigerian English. In U. Gut & D. Gibbon (Eds.), Typology of African prosodic systems (pp. 167–178). Bielefeld.
Handel, S., & Lawson, G. R. (1983). The contextual nature of rhythmic interpretation. Perception & Psychophysics, 34(2), 103–120. https://doi.org/10.3758/BF03211335
Güneş, G. (2013). Limits of prosody in Turkish. In E. E. Taylan (Ed.), Dilbilim Araştırmaları Dergisi (the Journal of Linguistics Research), special issue “Updates in Turkish Phonology”, 133–169. Boğaziçi University Press, Istanbul.
Güneş, G. (2015). Deriving prosodic structures. Doctoral Dissertation. University of Groningen.
Jared, D., & Kroll, J. F. (2001). Do bilinguals activate phonological representations in one or both of their languages when naming words? Journal of Memory and Language, 44(1), 2–31. https://doi.org/10.1006/jmla.2000.2747
Kabak, B., & Vogel, I. (2001). The phonological word and stress assignment in Turkish. Phonology, 18, 315–360.
Kabak, B. (2014). Pervasive syllables. In J. C. Reina & R. Szczepaniak (Eds.), Syllable and word languages (pp. 112–139). Linguae & Litterae Series. DeGruyter.
Kabak, B. (2016). Refin(d)ing Turkish stress as a multifaceted phenomenon. Paper delivered at Second Conference on Central Asian Languages and Linguistics (ConCALL-2), October 7–9, 2016, Indiana University.
Kamali, B. (2011). Topics at the PF interface of Turkish. Doctoral thesis, Harvard University.
Kisler, T., Reichel, U. D., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326–347.
Kopečková, R. (2016). The bilingual advantage in L3 learning: A developmental study of rhotic sounds. International Journal of Multilingualism, 13(4), 410–425. https://doi.org/10.1080/14790718.2016.1217605
Kopečková, R., Marecka, M., Wrembel, M., & Gut, U. (2016). Interactions between three phonological subsystems of young multilinguals: The influence of language status. International Journal of Multilingualism, 13(4), 426–443. https://doi.org/10.1080/14790718.2016.1217603
Lee, J.-P., & Jang, T.-Y. (2004). A comparative study on the production of inter-stress intervals of English speech by English native speakers and Korean speakers. Proceedings of the 8th International Conference on Spoken Language Processing, 1245–8.
Levi, S. V. (2005). Acoustic correlates of lexical accent in Turkish. Journal of the International Phonetic Association, 35, 73–97.
Li, A., & Post, B. (2014). L2 acquisition of prosodic properties of speech rhythm: Evidence from L1 Mandarin and German learners of English. Studies in Second Language Acquisition, 36(02), 223–255. https://doi.org/10.1017/S0272263113000752
Ling, L. E., Grabe, E., & Nolan, F. (2000). Quantitative characterizations of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43(Pt 4), 377–401. https://doi.org/10.1177/00238309000430040301
Llama, R., & López-Morelos, L. P. (2016). VOT production by Spanish heritage speakers in a trilingual context. International Journal of Multilingualism, 13(4), 444–458. https://doi.org/10.1080/14790718.2016.1217602
Lloyd-Smith, A., Gyllstad, H., & Kupisch, T. (2017). Transfer into L3 English. Linguistic Approaches to Bilingualism, 7(2), 131–162. https://doi.org/10.1075/lab.15013.llo
Louriz, N. (2007). Alignment in L3 phonology. Langues Et Linguistique, 18(19), 129–160.
McMurray, B., Kovack-Lesh, K. A., Goodwin, D., & McEchron, W. (2013). Infant directed speech and the development of speech perception: Enhancing development or an unintended consequence? Cognition, 129(2), 362–378. https://doi.org/10.1016/j.cognition.2013.07.015
Mok, P. P. K. (2011). The acquisition of speech rhythm by three-year-old bilingual and monolingual children: Cantonese and English. Bilingualism: Language and Cognition, 14(04), 458–472. https://doi.org/10.1017/S1366728910000453
Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Toward an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756–766. https://doi.org/10.1037/0096-1523.24.3.756
Nespor, M., Shukla, M., & Mehler, J. (2011). Stress-timed vs. syllable-timed languages. In M. Van Oostendorp, C. J. Ewen, E. V. Hume, & K. Rice (Eds.), The Blackwell companion to phonology, Vol. II (pp. 1147–1157). Blackwell.
Nokes, J., & Hay, J. (2012). Acoustic correlates of rhythm in New Zealand English: A diachronic study. Language Variation and Change, 24(01), 1–31. https://doi.org/10.1017/S0954394512000051
Polyanskaya, L., & Ordin, M. (2015). Acquisition of speech rhythm in first language. The Journal of the Acoustical Society of America, 138(3), EL199–204. https://doi.org/10.1121/1.4929616
Pycha, A. (2006). A duration-based solution to the problem of stress realization in Turkish. UC Berkeley Phonology Lab Annual Report, 141–151.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73(3), 265–292. https://doi.org/10.1016/S0010-0277(99)00058-X
Roach, P. (1982). On the distinction between ‘stress-timed’ and ‘syllable-timed’ languages. In D. Crystal (Ed.), Linguistic controversies (pp. 73–79). Edward Arnold.
Rothman, J. (2010). L3 syntactic transfer selectivity and typological determinacy: The typological primacy model. Second Language Research, 27(1), 107–127. https://doi.org/10.1177/0267658310386439
Sánchez, L. (2017). An inquiry into the role of L3 proficiency on crosslinguistic influence in third language acquisition. ODISEA. Revista de estudios ingleses. Advance online publication. https://doi.org/10.25115/odisea.v0i15.282
Schmid, M. S., & Dusseldorp, E. (2010). Quantitative analyses in a multivariate study of language attrition. Second Language Research, 26(1), 125–160.
Setter, J. (2006). Speech rhythm in world Englishes: The case of Hong Kong. TESOL Quarterly, 40(4), 763. https://doi.org/10.2307/40264307
Slabakova, R. (2016). The scalpel model of third language acquisition. International Journal of Bilingualism, 21(6), 651–665. https://doi.org/10.1177/1367006916655413
The MathWorks. (2018). MATLAB user’s guide. The MathWorks, Inc.
Torgersen, E. N., & Szakay, A. (2012). An investigation of speech rhythm in London English. Lingua, 122(7), 822–840. https://doi.org/10.1016/j.lingua.2012.01.004
Valdés, G. (2005). Bilingualism, heritage language learners, and SLA research: Opportunities lost or seized? The Modern Language Journal, 89(3), 410–426. https://doi.org/10.1111/j.1540-4781.2005.00314.x
Van Geert, P. (2008). The dynamic systems approach in the study of L1 and L2 acquisition: An introduction. The Modern Language Journal, 92(2), 179–199. https://doi.org/10.1111/j.1540-4781.2008.00713.x
Vicenik, C., & Sundara, M. (2013). The role of intonation in language and dialect discrimination by adults. Journal of Phonetics, 41, 297–306.
Westergaard, M., Mitrofanova, N., Mykhaylyk, R., & Rodina, Y. (2016). Crosslinguistic influence in the acquisition of a third language: The linguistic proximity model. International Journal of Bilingualism, 21(6), 666–682. https://doi.org/10.1177/1367006916648859
Whitworth, N. (2002). Speech rhythm production in three German-English bilingual families. Leeds Working Papers in Linguistics and Phonetics, 9, 175–205.
Wiese, H. (2012). Kiezdeutsch. Ein neuer Dialekt entsteht. C.H. Beck.
Wrembel, M. (2011). Cross-linguistic influence in third language acquisition of voice onset time. In W.-S. Lee & E. Zee (Eds.), Proceedings of the 17th International Congress of Phonetic Sciences. 17–21 August 2011. Hong Kong (pp. 2157–2160). City University of Hong Kong.
Zora, H., Heldner, M., & Schwarz, I.-C. (2016). Perceptual correlates of Turkish word stress and their contribution to automatic lexical access: Evidence from early ERP components. Frontiers in Neuroscience, 10. https://doi.org/10.3389/fnins.2016.00007
Measuring Rhythm
Rhythm Metrics and the Perception of Rhythmicity in Varieties of English as a Second Language
Robert Fuchs
Abstract While rhythm metrics have been widely used to quantify speech rhythm, direct evidence of their perceptual validity is currently very limited. If it were to be shown that particular rhythm metrics reflect, at least to some degree, listeners’ perception of speech rhythm, this would substantially enhance the case for their use as accurate quantifications of speech rhythm. To this end, this chapter presents the results of a perception study harnessing listeners’ ability to conduct pairwise comparisons of the regularity of utterances. Twenty-two speakers of Indian English rated all 55 pairwise comparisons of 11 intonation phrases for regularity. A Multidimensional Scaling analysis reveals that 74% of the variation in the responses is accounted for by two dimensions, which are in turn explained in regression analyses by the speech rate normalised rhythm metrics nPVI-V, VarcoV and VarcoC (all with final intervals removed). Speech rate was also relevant as a separate independent variable. The results of the present perception study partially converge with previous evidence from production studies concerning which rhythm metrics are reliable and are likely to at least partially account for listeners’ perception of rhythmicity.
Keywords Speech rhythm · Perception · Production · Rhythm metrics · nPVI-V · VarcoV · Speech rate · Regularity · Timing
R. Fuchs, Department of English, University of Hamburg, Überseering 35, 22297 Hamburg, Germany
© Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_8
1 Introduction
The concept of speech rhythm is surrounded by vigorous theoretical and empirical debate (see, for example, Arvaniti, 2012; Nolan & Jeon, 2014; Gibbon, to appear) and continues to be considered a crucial means of understanding and analysing supra-segmental phonology, both in terms of the number of research studies (for example, Google Scholar lists 3,690 publications containing the phrase “speech rhythm” between 2017 and 2020) and in terms of the wide range of areas in which it finds application. Among these diverse fields are first language acquisition (Goswami,
2019; Post & Payne, 2018), non-verbal communication (Seifart et al., 2018), automatic language identification (Kim & Park, 2020), the management of social relations (Polyanskaya et al., 2019), speech disorders (Eijk et al., 2020), reading intervention for beginning readers (Harrison et al., 2018), neurological diseases such as Alzheimer’s and Parkinson’s disease (Martínez-Sánchez et al., 2017; Sztahó et al., 2017), lip-smacking in primates (Pereira et al., 2020) and music (Lee et al., 2017). Moreover, a core area of speech rhythm research remains the analysis of the speech rhythm of particular languages and dialects as well as comparisons between them, such as Arabic (Ibrahim et al., 2020), Ghanaian languages (Boll-Avetisyan et al., 2020), and also, more broadly, in subdomains of linguistics such as bi- and multilingualism (Aldrich, 2020; Law et al., 2020; White & Mok, 2019) and variational linguistics (Fuchs, 2016; Romano, 2020). Speech rhythm was defined by Kohler (2009: 41) as “the production, for a listener, of a regular recurrence of waxing and waning prominence profiles across syllable chains over time.” While there are many different ways of measuring speech rhythm (see, for example, Gibbon & Li, 2019; Goswami & Leong, 2013; Tilsen & Arvaniti, 2013), the most widely used methods rely on a definition of speech rhythm that is based on variability in duration, which may be considered an important contributor to the “waxing and waning” in prominence. In this operationalisation of speech rhythm, greater variability in duration is identified with stress-timing and lower variability in duration with syllable-timing. 
Such variability can be measured for syllables, vowels and consonants, although it is in fact not the durations of individual vowels and consonants that is measured, but that of so-called vocalic and consonantal intervals (where vocalic intervals are stretches of vowels uninterrupted by consonants or pauses, and consonantal intervals are stretches of consonants uninterrupted by vowels or pauses). A further distinction can be made regarding the calculation of these rhythm metrics in that durational variability can either be calculated as the average of pairwise differences between adjacent intervals, or as the variability of all intervals across an utterance, regardless of position. While such rhythm metrics have been widely used to quantify speech rhythm, direct evidence of their perceptual validity is currently lacking. At present, the available evidence for the perceptual relevance of speech rhythm metrics is mostly indirect (see Sect. 3). However, no study has shown to date that any particular rhythm metric, or a combination of two or more rhythm metrics, directly accounts for listeners’ perception of rhythmicity. Such evidence is urgently needed in order to test the validity of rhythm metrics (Do they account for what they are supposed to measure, i.e. rhythm?) and is also of interest because of the great variety of duration-based rhythm metrics (see Sect. 2), which are unlikely to all account equally well for listeners’ perception of rhythm. Furthermore, speech rhythm metrics are used across a great number of studies (see Sect. 2 and Introduction to this volume). If it were to be shown that particular rhythm metrics reflect, at least to some degree, listeners’ perception of rhythmicity, this would substantially enhance the case for their use as accurate quantifications of speech rhythm.
The remainder of this chapter is structured as follows. Section 2 provides a brief overview of existing speech rhythm metrics and Sect. 3 discusses the available evidence for their perceptual relevance. Subsequently, a new approach to the study of rhythm perception is introduced in general terms (Sect. 4), followed by its implementation in the present study (Sect. 5). Sections 6 and 7 present and discuss its results, followed by a conclusion (Sect. 8).
2 Measuring Speech Rhythm
Duration-based rhythm metrics generally try to quantify the notion that stress-timed languages use reduced vowels in unstressed syllables as well as longer consonant clusters than syllable-timed languages, whereas syllable-timed languages have no or very little vowel reduction and short or no consonant clusters (Dauer, 1983: 55–8; Ramus et al., 1999: 270; Schiering, 2007). While there are also rhythm metrics based on a more general concept of variability in prominence, rather than just duration, they are currently rarely used in research on speech rhythm compared to duration-based metrics (He, 2012; Cumming, 2011; Fuchs, 2014a, 2014b, 2016: 69–79; Low, 1998).¹ Three broad distinctions can be made among duration-based rhythm metrics, involving (1) the distinction between vowels and consonants, (2) the presence or absence of speech rate normalisation and (3) the way that variability is quantified (see Fig. 1). Consequently, a distinction can be made between (1) vocalic and consonantal metrics, (2) rhythm metrics that are normalised for speech rate and those that are not, and (3) global and local metrics. Of these, the global/local distinction requires further explanation. Global rhythm metrics are those that compute a measure of the variability of duration of all vocalic or consonantal intervals regardless of their position in the utterance. This can be achieved by computing the standard deviation, yielding the measures ΔV (read: “Delta V”) and ΔC for vocalic and consonantal intervals, respectively (Ramus et al., 1999). Their speech rate normalised equivalents are computed by taking the standard deviation, divided by the mean, multiplied by 100, and are known as VarcoV and VarcoC, respectively (also called coefficients of variation for vocalic and consonantal durations, respectively; White & Mattys, 2007a; Dellwo, 2006).
¹ Moreover, there are alternative approaches that conceptualise rhythm in ways that depart from the quantification of rhythm as encapsulated in metrics such as the PVI, and instead focus, for example, on the influence of speech rhythm on word segmentation strategies (Kim et al., 2008; Murty et al., 2007).
Fig. 1 Taxonomy of common duration-based rhythm metrics (theoretically possible but uncommon metrics shown in brackets)
Vocalic, local variability: nPVI-V (speech rate normalised), (rPVI-V) (not normalised)
Vocalic, global variability: VarcoV (speech rate normalised), ΔV (not normalised)
Global proportion: %V
Consonantal, local variability: (nPVI-C) (speech rate normalised), rPVI-C (not normalised)
Consonantal, global variability: VarcoC (speech rate normalised), ΔC (not normalised)
In contrast to global rhythm metrics, local metrics compute differences between adjacent pairs of vocalic intervals (for vocalic metrics) or consonantal intervals (for consonantal metrics) and then take the mean of all pairwise comparisons. These metrics are often referred to as Pairwise Variability Indices (PVI), with an initial lowercase “r” indicating the raw and “n” the speech rate normalised version (Low et al., 2000). Of the four theoretically possible PVIs, the raw vocalic (rPVI-V) and the normalised consonantal index (nPVI-C; both shown in brackets in Fig. 1) are used much more rarely than their counterparts, i.e. nPVI-V and rPVI-C (due to the assumption that variation in speech rate affects vowel duration more than consonant duration). Local and global as well as speech rate normalised and raw indices of durational variability can also be computed for syllable durations (e.g. VarcoS and nPVI-S; Rathcke & Smith, 2011; Gibbon & Gut, 2001) or voiced versus unvoiced durations or sonorant versus obstruent durations (Steiner, 2004, 2005; Dellwo et al., 2007; Fuchs, 2016: 39–52), but these measures are also used comparatively rarely and not considered in the present chapter. By contrast, a rhythm metric that does not fit into this taxonomy, but is widely used, quantifies the proportion of vocalic durations over total utterance duration, and is known as %V (Ramus et al., 1999). Finally, a methodological choice that needs to be made in any empirical analysis relying on these rhythm metrics is whether utterance-final vocalic and consonantal intervals are included or excluded because they may be subject to phrase-final lengthening (Fuchs, 2016: 94).
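The definitions in this section can be made concrete in a short Python sketch. It is an illustration only and not the Praat script used in this chapter. The function names are hypothetical, the use of the population standard deviation is an assumption (the text does not specify population versus sample), the nPVI divides each pairwise difference by the mean of the pair, following the usual formulation of the index, and %V is computed over vocalic plus consonantal durations only, ignoring pauses.

```python
from statistics import mean, pstdev

def delta(durations):
    """ΔV / ΔC: standard deviation of all interval durations (global, raw)."""
    return pstdev(durations)  # assumption: population standard deviation

def varco(durations):
    """VarcoV / VarcoC: standard deviation divided by the mean, times 100."""
    return 100 * pstdev(durations) / mean(durations)

def rpvi(durations):
    """rPVI: mean absolute difference between adjacent intervals (local, raw)."""
    return mean(abs(a - b) for a, b in zip(durations, durations[1:]))

def npvi(durations):
    """nPVI: like rPVI, but each difference is divided by the mean of the pair."""
    return 100 * mean(abs(a - b) / ((a + b) / 2)
                      for a, b in zip(durations, durations[1:]))

def percent_v(vocalic, consonantal):
    """%V: share of vocalic duration in the summed interval durations."""
    return 100 * sum(vocalic) / (sum(vocalic) + sum(consonantal))

def drop_final(durations):
    """Variant with the utterance-final interval excluded."""
    return durations[:-1]

# Hypothetical interval durations in seconds.
vocalic = [0.1, 0.2, 0.1, 0.2]
consonantal = [0.1, 0.1, 0.1, 0.1]
print(npvi(vocalic), varco(vocalic), percent_v(vocalic, consonantal))
print(npvi(drop_final(vocalic)))
```

A perfectly isochronous sequence yields 0 for all four variability measures, which is the sense in which lower values are identified with syllable-timing.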
3 Evidence on the Perceptual Relevance of Rhythm Metrics
As pointed out in Sect. 1, there is as yet no direct evidence that rhythm metrics can account for the human perception of rhythmicity. However, a number of studies have provided evidence for the perceptual relevance of rhythm metrics through experiments in which participants successfully identified sociolinguistic or social attributes
of speakers (such as their dialect or ethnicity) based on rhythmic differences in their speech. While these studies did not determine whether rhythm metrics actually measure speech rhythm, they attest more generally to the psychological relevance of rhythm metrics. This line of research requires researchers to isolate speech rhythm (as compared to other sources of phonetic information, such as intonation) as a cue to ethnic background or other social attributes, without which it would be difficult to determine what sort of information contained in the speech signal contributed to a listener’s responses (Levon, 2007: 536–537). Previous studies have mainly taken three approaches— they either explored statistical correlations between accent ratings and measurements of rhythm (Sect. 3.1), modified the acoustic signal in order to manipulate or suppress particular sorts of phonetic information such as intonation or speech rhythm (Sect. 3.2) or tried to assess directly which rhythm metrics can account for listeners’ perception of rhythmicity (Sect. 3.3).
3.1 Accent Ratings and Their Correlations with Rhythm Metrics
One way to determine to what degree speech rhythm influences accent ratings (i.e. ratings of how strong a speaker’s accent is) is to explore correlations between measurements of speech rhythm and accent ratings. This method was used by White and Mattys (2007b: 248–253), who had first language (L1) speakers of English rate English speech samples produced by Dutch, Spanish and L1 British English speakers. Several vocalic metrics showed high and significant correlations with the accent ratings, ranging from 0.74 for VarcoV to 0.65 for %V and 0.56 for nPVI-V. Speech rate (syllables/second) was another good predictor, but ΔV and the consonantal metrics (ΔC, VarcoC, rPVI-C) showed poor correlations of 0.26 or less. A regression analysis further revealed that VarcoV and speech rate together accounted for an even greater part of the observed variation (R² for a model comprising nPVI-V was 0.51, and for a model comprising nPVI-V and speech rate 0.63). Further analysis indicated that VarcoV was a good predictor only for the Spanish accent ratings, although this could arguably be simply due to the fact that the Spanish speakers of English varied much more in rhythm than the other groups (Spanish being syllable-timed while Dutch and English are stress-timed). Further adding to the evidence presented by White and Mattys (2007b), results from a study on Korean-speaking learners of Japanese also attested to the predictive power of rhythm metrics for accent ratings (Kinoshita & Sheppard, 2011). In this analysis, nPVI-V accounted for between 28 and 47% of the variation in accent ratings for the stimuli that differed from an L1 Japanese rhythm. While these studies provide evidence indicating that accent ratings correlate with variation in speech rhythm as measured by vocalic metrics, there are a number of provisos to this interpretation. First of all, this result only appears to obtain where
there is enough variation in speech rhythm in the speech samples, which typically occurs when speakers of one language learn and speak a second language that differs in rhythm from their first language. Secondly, speech rate also appears to correlate with accent ratings. This result might be explained by the fact that learners, as they become more proficient, also tend to speak faster, which in turn might be reflected in better accent ratings. A complication that arises from this explanation is that it is not clear whether the correlation between speech rate and accent ratings is due to a causal relationship (listeners rate faster speech as less accented) or whether speech rate is only indirectly related to perceived accent strength (more proficient speakers might have both less accented speech and a higher speech rate than less proficient speakers, without there necessarily being a causal relationship between speech rate and accent ratings). Similarly, correlations between measures of rhythm and accent ratings might be due to a causal link (the raters focus to a considerable degree on speech rhythm) or the relationship might be non-causal (as speakers become less accented, they modify both their rhythm and non-rhythmic phonetic variables such as intonation and the realisation of particular phonemes, and it is unclear how much attention raters truly devote to rhythm).
3.2 Isolating Rhythm as a Source of Acoustic Information on Speaker Origin
The question of what sort of information listeners rely on in accent ratings or judgements on speaker origin (speech rhythm, intonation or segmental/phoneme-level information) can be answered much more clearly if these sources of information are separated (Drager, 2010). If it were possible to present otherwise identical speech samples that vary only in rhythm, any differences between ratings would much more clearly point to a causal relationship between speech rhythm and accent ratings, rather than just a correlation between the two. Several studies have successfully managed to separate these distinct sources of information through the selective suppression and resynthesis of intonation, speech rhythm and/or segmental information.
Early research trying to demonstrate the usefulness of rhythm metrics focused on providing evidence of their perceptual relevance through rating tasks involving stimuli from different languages. The stimuli were lowpass filtered (i.e. only acoustic energy below a certain frequency threshold was retained in the stimuli), so that segmental information could be largely excluded as a source of information. The experimental paradigm relies on the assumption that listeners would be able to distinguish such stimuli from rhythmically different languages if and only if this rhythmic difference can be perceived by listeners. By contrast, rhythmically similar languages would be indistinguishable. This line of inquiry showed that infants can distinguish languages with a stress-timed rhythm (English and Dutch) from languages with a syllable-timed rhythm (Spanish and Italian), but do not differentiate between languages within these groups (Nazzi et al., 1998). While this and
related experiments point towards the perceptual relevance of speech rhythm, the evidence is not unambiguous. While lowpass filtering largely obscures segmental information, in fact, both speech rhythm and intonation remain unaffected. Listeners might thus rely on either or both of these sources of information and their ratings cannot be unambiguously traced back to rhythm alone. Other studies addressed this problem through the resynthesis of speech stimuli, where all consonants are replaced with one particular consonant (usually [s]) and all vowels with one particular vowel (usually [a]; sasasa resynthesis). In such an experiment, L1 French listeners were able to differentiate rhythmically different languages under experimental conditions where rhythm was preserved, but not in the intonation-only condition (Ramus & Mehler, 1999). The sasasa resynthesis technique (and related methods) was also used in other studies to isolate rhythm as a phonetic cue in accent discrimination. Vicenik (2011: 70–83) and Vicenik and Sundara (2013) found that listeners were unable to distinguish American and Australian English (which have similar rhythm) in the rhythm-only condition, and Kolly and Dellwo (2014) showed that English-accented German speech (stress-timed) and French-accented German speech (syllable-timed) can be distinguished by German-speaking listeners based on rhythm alone. Similarly, Szakay (2007, 2008) showed that relatively syllable-timed Maori (i.e. indigenous) New Zealand English can be distinguished from relatively stress-timed Pakeha (i.e. European) New Zealand English based on rhythm alone. Finally, Fuchs (2015, 2016) presented further results on the relevance of speech rhythm for accent recognition, using a special form of resynthesis that allowed the grafting of the speech rhythm of one speaker onto the speech of another speaker, while keeping all other phonetic information intact.
In these experiments, listeners were able to distinguish British English (more stress-timed) and Indian English (more syllable-timed) based on speech rhythm alone.² What these studies show is that speech rhythm appears to be a prosodic category that can be perceived by listeners, and that, based on this information, listeners can make judgements about the ethnicity of the speakers or the dialects they speak, as long as these ethnolects or dialects are rhythmically different. Nevertheless, while these studies provide general evidence on the perceptual relevance of speech rhythm defined on the basis of durational variability, they did not establish a direct link between particular rhythm metrics and listeners’ perception of rhythmicity.
² In addition to these studies, sasasa resynthesis was used by Ordin and Polyanskaya (2015) in order to test whether and which rhythm metrics could account for listeners’ perception of the proficiency level of low, intermediate and advanced German-speaking learners of English. Results indicated that speech rate broadly accounted for listener responses. This result appears to confirm that speech rate is a good measure of proficiency in learners, but the perception experiment does not contribute further information on the question of which rhythm metrics reflect the perception of rhythmicity.
3.3 Rhythm Metrics and the Perception of Rhythmicity
A limited number of studies have attempted to establish how listeners’ perception of rhythmicity could be conceptualised. Two avenues of investigation were followed, one accessing more abstract knowledge of rhythm on the part of the participants and the other accessing more concrete perceptions of rhythmicity. The former strategy, accessing more abstract knowledge, is faced with the challenge that a direct classification of languages as syllable- or stress-timed, or placing languages on a continuum between syllable- and stress-timing, might be difficult to realise with lay listeners, given that stress- and syllable-timing are concepts that are unknown to the general public. However, phoneticians are able to carry out this task with sufficient consistency, as shown by Benguerel (1999; building on work by Miller, 1984). In this study, lowpass filtered and spectrally inverted stimuli from 20 languages were labelled by participants as stress-, syllable- or mora-timed, with the overall choice largely (though not always) corresponding to the rhythmic classification that previous research suggested. These results suggest that phoneticians are able to access phonetic information on speech rhythm in the acoustic signal and that they can perform a rhythmic classification on this basis.
The second research strategy referred to at the start of this section, accessing more concrete perceptions of rhythmicity, was pursued by three studies investigating whether listeners’ ratings of the regularity of speech stimuli can be explained by particular rhythm metrics. These studies indicate that some of the variation in regularity perceived by listeners can be accounted for by some of the rhythm metrics, although they disagree regarding which metrics those are. For example, Dellwo (2008, 2010: 111–130) presented delexicalised French and German intonation phrases to English- and German-speaking listeners. Delexicalisation was based on replacing vowels with a complex harmonic waveform and consonants with white noise. Linear regression analysis showed that nPVI-V, %V and VarcoC were poor predictors of the ratings. However, speech rate (CV rate) was a much better predictor (R² = 0.65), but only for German (0.73) and not for French (0.16). By contrast, Ong et al. (2005), based on regularity ratings of Singapore English stimuli rated by speakers of this variety, found a correlation of 0.51 with nPVI-V and 0.37 with the syllable-based Variability Index (VI). Taken together, these results indicate that some rhythm metrics could potentially account for the perception of rhythmicity, but it is unclear which rhythm metric, or combination of several metrics, explains rhythmicity ratings best.
In summary, there is convincing evidence (i) that listeners can perceive speech rhythm, (ii) that they can use this information to determine the social background of a speaker and (iii) that the perception of rhythmicity is at least partially accounted for by duration-based rhythm. However, there is as yet only limited evidence as to which of the existing rhythm metrics can account directly for listeners’ perception of rhythmicity.
4 A New Approach to Testing Rhythm Perception
As the discussion above has shown, few studies have directly tested which, if any, rhythm metrics can account for the human perception of speech rhythm. Those studies that have investigated this question yielded contradictory results. The present study will take up this question, approaching it from a new angle. Instead of human judges rating individual speech samples for regularity, in this perception experiment they are asked to compare pairs of speech samples in terms of regularity. All possible pairwise comparisons of the speech samples are presented for assessment. Subsequently, the ratings are subjected to multidimensional scaling (MDS), a technique that might be better suited to exploring complex variables such as rhythm perception than simply exploring correlations, as previous research has done. This statistical method is well-suited to identifying the different dimensions underlying a complex response variable and to quantifying what proportion of the variance is accounted for by each of the dimensions (Cox & Cox, 2008). Among the statistical issues that MDS is able to address are scenarios where judges rely on different rhythmic dimensions in their regularity ratings, as captured by different rhythm metrics. MDS results can also be represented in intuitive graphic displays. If two dimensions account for a substantial part of the observed variance, a two-dimensional plot may be sufficient to provide an accurate representation of the data. After the identification of the dimensions underlying the ratings by the human judges is accomplished, the next step in the analysis assesses their possible association with particular rhythm metrics. The dimensions identified by MDS will only be of analytical interest in the context of the present chapter if it can be shown that one or more of the metrics accounts for them.
In principle, there is no reason to expect that each dimension will be linked to only one rhythm metric. It is equally conceivable that a combination of two or more rhythm metrics might provide a better quantitative explanation of a particular MDS dimension. Just relying on correlations, as several previous studies did, between particular rhythm metrics and the dimensions would make it impossible to determine how a combination of several rhythm metrics might potentially account for a particular dimension. Instead, the present analysis follows White and Mattys (2007b) and Dellwo (2008, 2010) in using linear regression in order to explore which rhythm metrics can explain the dimensions identified by the MDS analysis. Finally, another methodological innovation that this study adopts is a focus on varieties of English as a Second Language (ESL). Specifically, the study breaks with the bias of considering varieties of English as a Native Language (ENL), especially British and American English, as the explicit or implicit point of reference (Hansen, 2018: 49; Saraceni, 2015: 87; Westphal & Wilson, 2020: 51). Instead, the analysis explores how speakers of one ESL variety (Indian English) may perceive the rhythm of another ESL variety (Nigerian English).
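To illustrate how MDS recovers dimensions and their share of the variance, the following pure-Python sketch implements classical (metric) MDS, the variant provided by the R command cmdscale used in Sect. 5.2. It is a didactic approximation under stated assumptions rather than the analysis code of this study: the function names are hypothetical, eigenvectors are obtained by simple power iteration (which presupposes that the leading eigenvalues are positive and distinct), and the share of variance is taken relative to the trace of the centred matrix, which is only one of several conventions.

```python
import math

def double_centre(dist):
    """Turn a symmetric distance matrix into the centred inner-product matrix B."""
    n = len(dist)
    squared = [[d * d for d in row] for row in dist]
    row_means = [sum(row) / n for row in squared]
    grand_mean = sum(row_means) / n
    return [[-0.5 * (squared[i][j] - row_means[i] - row_means[j] + grand_mean)
             for j in range(n)] for i in range(n)]

def matvec(mat, vec):
    """Multiply a square matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vec)) for row in mat]

def top_eigenpair(mat, iterations=1000):
    """Power iteration: dominant eigenvalue and unit eigenvector of a symmetric matrix."""
    vec = [1.0 + 0.1 * i for i in range(len(mat))]  # arbitrary non-constant start
    for _ in range(iterations):
        product = matvec(mat, vec)
        norm = math.sqrt(sum(x * x for x in product))
        if norm == 0.0:
            return 0.0, vec
        vec = [x / norm for x in product]
    value = sum(v * p for v, p in zip(vec, matvec(mat, vec)))
    return value, vec

def classical_mds(dist, k=2):
    """Return (coordinates, share of variance per dimension) for k dimensions."""
    b = double_centre(dist)
    n = len(b)
    total = sum(b[i][i] for i in range(n))  # trace = sum of all eigenvalues
    remaining = [row[:] for row in b]
    values, vectors = [], []
    for _ in range(k):
        value, vec = top_eigenpair(remaining)
        values.append(value)
        vectors.append(vec)
        # Deflation: subtract the component just found before the next search.
        remaining = [[remaining[i][j] - value * vec[i] * vec[j]
                      for j in range(n)] for i in range(n)]
    coords = [[math.sqrt(max(val, 0.0)) * vec[i]
               for val, vec in zip(values, vectors)] for i in range(n)]
    shares = [val / total for val in values]
    return coords, shares

# Hypothetical example: four points forming a 3 x 4 rectangle.
points = [(0, 0), (3, 0), (0, 4), (3, 4)]
dist = [[math.dist(p, q) for q in points] for p in points]
coords, shares = classical_mds(dist, k=2)
print(shares)
```

In this toy example the two recovered dimensions reproduce the original distances exactly, so their shares sum to 1; with perceptual data the first dimensions typically capture only part of the variance.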
5 Methods
5.1 Experiment
Twenty-two normally hearing listeners (aged 21–34, median 23 years; 15 m, 7 f) took part in the experiment, run in the Praat MFC environment (Boersma & Weenink, 2014) on a laptop computer equipped with headphones. They were all university students and educated speakers of Indian English and reported as their first Indian languages Malayalam (10), Telugu (6), Bengali (4), Hindi (1) and Marathi (1). Listeners were presented with pairs drawn from eleven lowpass filtered intonation phrases (filtered at 400 Hz, with 100 Hz smoothing), each between 3 and 11 vocalic intervals long. The eleven intonation phrases yielded 55 pairwise combinations, disregarding the order of the elements in a pair (for a similar procedure in research on rhythm, see Barry et al., 2009: 87). Because order effects cannot be a priori disregarded, all 110 ordered combinations were included, but each listener heard only one order for each pair in order to limit the length of the experiment. All participants heard these 55 pairs in random order, with the replaying of stimuli allowed (Fig. 2 shows the user screen during the experiment). The samples were extracted from an academic talk given by an educated, male speaker of Nigerian English (file unsp_04 from the Nigerian component of the International Corpus of English, Wunder et al., 2010). In each trial, judges heard the two speech samples, with an initial silence of 0.5 s and an inter-stimulus interval of 1.5 s, and were asked to rate “whether utterance 1 or utterance 2 has a more regular rhythm”,³ with an optional break after blocks of 30 trials. Prior to the main part of the experiment, participants were introduced to the task with three pairs of synthetically created training stimuli, the first of which always had an isochronous rhythm, while the second had a very irregular rhythm. Participants were told that “Utterance 1 has a MORE REGULAR rhythm.” These training stimuli were lowpass filtered in the same way as the stimuli in the main part. No further explanation of rhythm was provided to participants in order to avoid any bias to their responses. In order to measure the rhythm of each speech sample with a range of rhythm metrics, the samples were annotated in Praat for phonemes as well as vocalic and consonantal intervals. A Praat script (available at https://osf.io/79qyg/) then calculated the rhythm metrics ΔV, ΔC, nPVI-V, VarcoV, rPVI-C, VarcoC and speech rate (phonemes per second), with each metric measured in two variants, one including the final vocalic or consonantal interval, respectively, and one excluding it. Table 4 in the Appendix provides an overview of a selected number of descriptors and rhythm metrics for the eleven speech samples.
³ Moreover, listeners were introduced to the experiment as follows: “This is a listening experiment. In each round you will hear two different utterances. Please rate which of the two has a more regular rhythm. The utterances have been filtered to make it easier to focus on the rhythm. Click to listen to some examples. You can adjust the volume whenever you like”.
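The pairing design described in this section can be sketched in a few lines of Python. The sketch is illustrative and not the Praat MFC configuration used in the experiment: the function name and the seed are hypothetical, the stimulus labels follow the naming visible in the figures, and the random choice of presentation order within each pair is an assumption, since the text states only that each listener heard one order per pair.

```python
import random
from itertools import combinations

def build_trials(n_stimuli=11, seed=0):
    """Return (all unordered pairs, one listener's randomised trial list)."""
    rng = random.Random(seed)  # hypothetical per-listener seed
    stimuli = [f"lowpass{i:02d}" for i in range(1, n_stimuli + 1)]
    unordered = list(combinations(stimuli, 2))
    # Assumption: the order within each pair is picked at random per listener.
    trials = [pair if rng.random() < 0.5 else pair[::-1] for pair in unordered]
    rng.shuffle(trials)  # pairs are presented in random order
    return unordered, trials

unordered, trials = build_trials()
print(len(unordered), 2 * len(unordered), len(trials))  # 55 110 55
```

Eleven stimuli give 11 × 10 / 2 = 55 unordered pairs and therefore 110 ordered pairs, of which each listener hears 55.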
Fig. 2 User interface during the main part of the experiment
5.2 Analysis In total, 1,210 responses were analysed (22 raters * 55 trials), where each trial consisted of a comparison of two different utterances. For each trial, a ratio was computed from the number of “utterance 1 more regular” versus “utterance 2 more regular” responses and subtracted from 0.5, yielding an overall agreement scale ranging from −0.5 (utterance 1 rated as more regular by all participants) to + 0.5 (utterance 2 rated as more regular by all participants), with values closer to 0 indicating disagreement among the participants. This agreement scale, comprising one data point for each of the 55 trials, was then subjected to MDS analysis in R (command cmdscale, R Core Team, 2020). The eigenvalues of the first three dimensions amounted to 0.171, 0.123 and 0.048, indicating that a three-dimension solution would capture only a limited amount of additional variation in comparison to a two-dimension solution. Consequently, two dimensions were retained for the next step of the analysis, which explored with linear regression models which of the rhythm metrics can account for the two dimensions identified by the MDS analysis. Model selection was based on the maximisation of Akaike’s Information Criterion (Bozdogan, 1987), using exhaustive screening with interactions of up to two independent variables (command glmulti from package glmulti, Calcagno & Mazancourt, 2010). Because of the collinearity between vocalic and consonantal metrics, respectively, multiple vocalic/consonantal metrics could not be included in a single regression model. The procedure was therefore run with, as independent variables (IVs), either (1) nPVI-V and rPVI-C, (2) VarcoV and VarcoC, or (3) /\V and /\C, with or without the inclusion of the final intervals of utterances. In addition, speech rate was included as a potential predictor in all models. Thus, the glmulti procedure was applied to (three * two = ) six different sets of three metrics each. 
In all six instances, a single model was selected. These six models were then ranked by AIC, and only the model with
the lowest AIC was retained. The two resulting models (one for each MDS dimension) were then pruned using the step-down method, i.e. removing IVs or interactions between IVs that were not significant at p < 0.05. The goodness of fit of the complete statistical model, comprising MDS and regression analysis, can then be computed based on the variance in the raw response data explained by the dimensions identified by the MDS, in conjunction with the variance inherent in each dimension explained by each of the regression models. That is, at each of the two steps of the analysis, there is a potential loss in terms of the variance explained.
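One possible reading of this two-step loss is multiplicative: the share of response variance captured by each MDS dimension is scaled by the share of that dimension explained by its regression model, and the products are summed. The sketch below works through that arithmetic with the figures reported in the Results section (49.6% and 21.8% for the dimensions, adjusted R² of 0.58 and 0.45 for the models). The function name, the product rule itself and the choice of adjusted rather than multiple R² are assumptions, since the chapter does not spell out the computation.

```python
def overall_explained(mds_shares, r_squared):
    """Sum over dimensions of (variance share of the dimension) x (R-squared of its model).

    ASSUMPTION: the two analysis steps combine multiplicatively per dimension.
    """
    if len(mds_shares) != len(r_squared):
        raise ValueError("one R-squared value is needed per MDS dimension")
    return sum(share * r2 for share, r2 in zip(mds_shares, r_squared))


mds_shares = [0.496, 0.218]   # variance accounted for by D1 and D2
adjusted_r2 = [0.58, 0.45]    # adjusted R-squared of the D1 and D2 models
total = overall_explained(mds_shares, adjusted_r2)  # roughly 0.39 under these assumptions
```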
6 Results
The MDS identified two dimensions, accounting for 49.6% and 21.8% of the variance in the data, respectively, i.e. 71.4% in total (see Fig. 3). For dimension 1 (D1), the regression analysis revealed that the best model consisted of a significant interaction of the normalised pairwise vocalic variability index (nPVI-V) and the raw pairwise consonantal variability index (rPVI-C), each with final vocalic/consonantal intervals omitted (henceforth nPVI-Vm1 and rPVI-Cm1). The model had a high adjusted R² of 0.78, but one utterance (number 2) turned out to exert undue influence, and this data point alone sustained the interaction in the model. After the removal of this data point, the best model involved only one independent variable, nPVI-Vm1, with an adjusted R² of 0.58 (see Fig. 4). Alternative models were inferior in terms of explained variation (adjusted R²) and the significance of the independent variables (see Table 1 for a summary of model statistics for selected alternative models).

Fig. 3 Dimensions 1 and 2 of the multidimensional scaling analysis of binary regularity ratings for eleven utterances (x-axis: Dimension 1; y-axis: Dimension 2; points labelled lowpass01 to lowpass11)

Fig. 4 Dimension 1 of MDS solution and nPVI-V (final intervals removed), with 95% confidence intervals (regression line and confidence interval computed by ggplot2, not lm) (x-axis: nPVI-Vm1; y-axis: Dimension 1)

Table 1 Statistics for selected regression models explaining dimension 1 of the MDS solution
         Independent variable  Est      Std. error  t    p     Model p  Adj. R²
Model 1  nPVI-Vm1              0.0039   0.064       3.6
                                                                0.1      −0.10
Model 4  VarcoV                −0.0014  0.003       0.5  >0.1  >0.1     −0.09
Model 5  rPVI-C                −0.0001  0.001       0.1  >0.1  >0.1     −0.12
Model 6  VarcoC                0.0046   0.003       1.6  >0.1  >0.1     0.15

For dimension 2 (D2), the best-performing model involved the independent variables VarcoCm1 (the variation coefficient for consonantal durations with the final consonantal interval removed) and a significant interaction between VarcoVm1 and speech rate (see Table 2). The model had a moderate adjusted R² of 0.45 (multiple R² 0.67). Alternative models were inferior in terms of explained variation (adjusted R²) and the significance of the independent variables (see Table 3 for a summary of model statistics for selected alternative models). Generally, all three independent variables in the best-performing model were positively associated with D2, i.e. utterances with a higher VarcoCm1 were associated with a higher score on D2 (see Fig. 5). Furthermore, the model contained an interaction between VarcoVm1 and speech rate (see Fig. 6). At high speech rates (>9.5 phonemes/second), higher VarcoVm1 is associated with a higher score on D2, but at low speech rates, the reverse relationship holds. Similarly, at moderate and high VarcoVm1 scores (>35), a higher speech rate is associated with a higher score on D2, while at low VarcoVm1 scores, the reverse is true (Table 3).

Table 2 Statistics for best-performing regression model explaining dimension 2 of the MDS solution
                     Est        Std. error  t value  Pr(>|t|)  Expl. variance
(Intercept)          2.673121   1.010223    2.646    0.0382*
VarcoCm1             0.005410   0.002163    2.501    0.0464*   32.8%
speechRate           −0.313314  0.115240    −2.719   0.0347*   7.0%
VarcoVm1             −0.086161  0.032753    −2.631   0.0390*   11.7%
speechRate:VarcoVm1  0.009313   0.003600    2.587    0.0414*   48.6%
Model p: 0.11; Adj. R² = 0.45
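The interaction described above can be made concrete with the coefficients reported in Table 2. The following sketch is an illustration rather than part of the original analysis, and its function and variable names are hypothetical. It computes the conditional slope of each predictor and the point at which that slope changes sign. The resulting crossover values (about 9.25 phonemes/second and a VarcoVm1 of about 33.6) lie close to, but do not coincide with, the thresholds of 9.5 and 35 given in the text, which may have been rounded or read off Fig. 6.

```python
# Coefficients copied from Table 2 (best-performing model for dimension 2).
COEF = {
    "intercept": 2.673121,
    "varco_c_m1": 0.005410,
    "speech_rate": -0.313314,
    "varco_v_m1": -0.086161,
    "rate_x_varco_v_m1": 0.009313,
}


def predict_d2(varco_c_m1, speech_rate, varco_v_m1, coef=COEF):
    """Predicted D2 score from the linear model with one interaction term."""
    return (
        coef["intercept"]
        + coef["varco_c_m1"] * varco_c_m1
        + coef["speech_rate"] * speech_rate
        + coef["varco_v_m1"] * varco_v_m1
        + coef["rate_x_varco_v_m1"] * speech_rate * varco_v_m1
    )


def slope_of_varco_v(speech_rate, coef=COEF):
    """Change in predicted D2 per unit of VarcoVm1 at a given speech rate."""
    return coef["varco_v_m1"] + coef["rate_x_varco_v_m1"] * speech_rate


def slope_of_rate(varco_v_m1, coef=COEF):
    """Change in predicted D2 per unit of speech rate at a given VarcoVm1."""
    return coef["speech_rate"] + coef["rate_x_varco_v_m1"] * varco_v_m1


# Values at which the conditional slopes change sign.
rate_crossover = -COEF["varco_v_m1"] / COEF["rate_x_varco_v_m1"]    # about 9.25
varco_crossover = -COEF["speech_rate"] / COEF["rate_x_varco_v_m1"]  # about 33.6
```

Above the crossover rate, `slope_of_varco_v` is positive, and below it the slope is negative, which is the reversal the text describes.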
Fig. 5 Dimension 2 of MDS solution and VarcoC (final intervals removed), with 95% confidence intervals (regression line and confidence interval computed by ggplot2, not lm) (x-axis: VarcoCm1; y-axis: D2)

Fig. 6 Dimension 2 of MDS solution and interaction between VarcoV (final intervals removed) and speech rate, with 95% confidence intervals (regression line and confidence interval computed by lm) (x-axis: speech rate in phonemes/second; y-axis: Dimension 2; separate lines for VarcoVm1 = 25, 40 and 55)

Table 3 Statistics for selected alternative regression models explaining dimension 2 of the MDS solution
         Independent variables  Est      Std. error  t    p     Model p  Adj. R²
Model 2  VarcoCm1               0.0036   0.098       1.4  >0.1  >0.1     0.08
Model 3  VarcoC                 0.0048   0.003       1.8  >0.1  >0.1     0.18
Model 4  VarcoVm1               −0.0020  0.003       1.0  >0.1  >0.1     −0.01
Model 5  VarcoV                 −0.0021  0.003       0.8  >0.1  >0.1     −0.04
Model 6  speechRate             −0.0122  0.022       0.6  >0.1  >0.1     −0.07
Model 7  nPVI-V                 −0.0009  0.002       0.5  >0.1  >0.1     −0.09
Model 8  rPVI-C                 0.0004   0.001       0.4  >0.1  >0.1     −0.09
7 Discussion
The aim of this study was to determine which, if any, rhythm metrics can account for the perception of rhythmicity. Previous research on speech rhythm in various languages, dialects and sociolects has relied on a variety of duration-based speech rhythm metrics (see, for example, Grenon & White, 2008; Sarmah et al., 2009). However, it is currently unclear which, if any, of these rhythm metrics has perceptual validity. In order to address this question, this study presented the results of a perception experiment in which listeners compared pairs of short, lowpass filtered phrases and rated which member of the pair has a more even rhythm. Their responses were subsequently subjected to Multidimensional Scaling, with two dimensions accounting for the bulk of the variation. Regression analysis for the two most important dimensions then showed which rhythm metrics potentially account for variation in the listeners’ perception of rhythm in the stimuli.
7.1 Implications for the Analysis of Speech Rhythm Production
These results have implications for the analysis of speech rhythm because they indicate that particular rhythm metrics and particular operationalisations of speech rhythm may have perceptual validity and could thus potentially be superior to their alternatives. Specifically, in this analysis, the metrics nPVI-Vm1, VarcoVm1, VarcoCm1 and speech rate were able to explain to a substantial degree how listeners rated the rhythmicity of the stimuli. Several implications emerge from these results. First of all, both local (e.g. nPVI-V) and global metrics (e.g. VarcoV) partially explained rhythmicity ratings. This finding corroborates previous research that found that both local and global metrics systematically account for rhythmic differences between languages and between accents (White & Mattys, 2007a, 2007b). Furthermore, the exact manner in which local and global metrics accounted for rhythm perception may provide a tentative indication of distinct psychological mechanisms of rhythm perception. While dimension 1 was explained by a local rhythm metric, dimension 2 was explained by global rhythm metrics. Conceivably, local and global metrics may account for different modes or subcomponents of rhythm perception, which in this analysis were revealed through their association with distinct dimensions. Secondly, both vocalic (i.e. nPVI-V and VarcoV) and consonantal metrics (VarcoC) partially accounted for rhythmicity ratings. The two normalised vocalic metrics are known to account for rhythmic differences between languages and between accents (White & Mattys, 2007a, 2007b). In the present analysis, it was precisely these metrics that accounted for the bulk of the explained variation in
rhythm perception, indicating that they partially account for the perception of rhythmicity. This result, moreover, provides support for the use of these metrics in production studies on speech rhythm. By contrast, previous research presents a mixed picture in its assessment of consonantal metrics. On the one hand, consonantal metrics were suggested alongside vocalic metrics in the foundational studies by Ramus et al. (1999) and Low et al. (2000), and they continue to be widely used in research on speech rhythm (e.g. Kawase et al., 2016; Li & Post, 2014). On the other hand, systematic comparisons found that consonantal metrics poorly account for rhythmic differences between languages and between accents (White & Mattys, 2007a, 2007b). In the present study, VarcoC was associated with one of the two dimensions, where it accounted for about a third of the variation. This result provides tentative support for the relevance of the variability of consonantal durations in accounting for the perception of rhythmicity. It is also conceivable that consonantal variability in duration, as measured by VarcoC, is associated with one particular mode or subcomponent of rhythm perception, similar to the point made above for local versus global metrics. The third implication emerging from the present study relates to the role of speech rate in the analysis of speech rhythm. Crucially, only rhythm metrics normalised for speech rate, but not non-normalised metrics, were able to account for the perception of rhythmicity in the present study. This result corroborates systematic comparisons between rhythm metrics in production studies (White & Mattys, 2007a, 2007b). However, speech rate itself (as opposed to the normalisation of rhythm metrics for speech rate) also turned out to contribute explanatory power in accounting for dimension 2 of the present analysis, where it was involved in an interaction with a vocalic metric (VarcoV).
A possible interpretation of this interaction is that speech rate has a moderating effect on the influence of VarcoV. When both speech rate and VarcoV are high, dimension 2 was also high, but when both are low, the interaction between the two factors moderated their joint effect on dimension 2. A tentative explanation of this result might be that the precise manner in which VarcoV is normalised for speech rate is imperfect and overcompensates at low speech rates, and that the interaction between the two factors in the regression analysis for dimension 2 adjusted for this overcompensation. A possible implication for studies on the production of rhythm emerging from this finding might be that speech rate has its own role to play in the analysis of rhythm production, in addition to normalised vocalic metrics. This view is also supported by previous research. Dellwo (2008) suggested that a high speech rate might be associated with syllable-timing and a low speech rate with stress-timing (see also Pettorino et al., 2013 and Pettorino & Pellegrino, 2016, who claimed that their speech rhythm index reflects the perception of speech rhythm in connection with variation in speech rate). The present results suggest that this relationship might be more complex, with speech rate perhaps playing a mediating role in the perception of rhythm. A final implication of the present investigation concerns the question of the inclusion or exclusion of the final intervals of phrases in the computation of rhythm metrics. Research on rhythm production has variously included or excluded final intervals or syllables (Fuchs, 2016: 94), with the argument in favour of exclusion
being that phrase-final lengthening might interfere with the assessment of the underlying rhythm of the phrase. The statistical analysis of the perception data in this study considered all rhythm metrics in two versions, i.e. including final intervals and excluding final intervals. The analysis indicated that it was exclusively measures that disregarded final intervals that proved capable of explaining rhythmicity ratings. This result suggests that the exclusion of final intervals in the computation of rhythm metrics might provide a superior way of measuring speech rhythm compared to their inclusion. In summary, the analysis tentatively indicates that speech rate normalised vocalic metrics (i.e. VarcoV and nPVI-V), with the exclusion of final intervals of phrases, may capture a substantial portion of listeners’ perception of speech rhythm. The normalised metric VarcoC and speech rate may also account, to some extent, for the perception of speech rhythm. These results support the view that no single metric can capture speech rhythm in its entirety, but that it is rather a multidimensional phenomenon.
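To make the distinction between including and excluding final intervals concrete, the sketch below computes the metrics discussed here from a list of interval durations, with an optional flag that drops the final interval (the "m1" variants). The formulas follow the usual definitions from the literature (rPVI as the mean absolute difference between successive intervals, nPVI as the same difference normalised by the pair mean and multiplied by 100, Varco as 100 times the standard deviation divided by the mean). The function names and example durations are hypothetical, and the use of the sample standard deviation is an assumption; the exact procedure used in the chapter may differ.

```python
from statistics import mean, stdev


def _intervals(durations, drop_final):
    """Return the durations as a list, optionally without the final interval."""
    d = list(durations)
    if drop_final:
        d = d[:-1]
    if len(d) < 2:
        raise ValueError("need at least two intervals")
    return d


def rpvi(durations, drop_final=False):
    """Raw pairwise variability index: mean absolute successive difference."""
    d = _intervals(durations, drop_final)
    return mean([abs(a - b) for a, b in zip(d, d[1:])])


def npvi(durations, drop_final=False):
    """Normalised PVI: successive differences divided by the pair mean, times 100."""
    d = _intervals(durations, drop_final)
    return 100 * mean([abs(a - b) / ((a + b) / 2) for a, b in zip(d, d[1:])])


def varco(durations, drop_final=False):
    """Variation coefficient (ASSUMPTION: sample SD), times 100."""
    d = _intervals(durations, drop_final)
    return 100 * stdev(d) / mean(d)


# Hypothetical vocalic durations in ms with a lengthened phrase-final vowel.
vowels_ms = [100, 100, 100, 300]
with_final = npvi(vowels_ms)                      # final lengthening inflates the index
without_final = npvi(vowels_ms, drop_final=True)  # perfectly even once it is dropped
```

In this toy case the only source of variability is the final vowel, so the two variants diverge as sharply as possible; in real utterances the difference will usually be smaller.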
7.2 Limitations and Implications for Future Research
This chapter presented an innovative approach to the study of the perception of speech rhythm. While this topic has been largely neglected in previous research, it is crucial for an assessment of the validity of speech rhythm metrics. However, the implications of the results are limited in several ways, which future research on speech rhythm perception should attempt to overcome. The most important limitation is arguably the limited number of utterances that listeners were presented with. This limitation partly stems from the experimental paradigm, which relied on a pairwise comparison of all utterances, such that the number of trials grows quadratically with the number of utterances (n(n − 1)/2 trials for n utterances, i.e. 55 for the eleven used here). Moreover, the experiment included stimuli from a single speaker only, some of which were very short. Future research should attempt to include a larger number of longer utterances from several speakers and should furthermore involve a large number of listeners from various backgrounds as well as speech from languages other than English. Moreover, future studies should seek to determine whether and to what extent rhythm perception is influenced by the rhythmic structure of listeners’ first languages. Finally, although a sizeable number of vocalic and consonantal duration-based metrics was considered in the analysis, syllabic duration-based metrics (Gibbon & Gut, 2001; Gut, 2003) as well as metrics based on other acoustic correlates of prominence, such as intensity, loudness and pitch (Cumming, 2010, 2011; Fuchs, 2016; He, 2012; Low, 1998), were not part of the analysis, but have been claimed to account for speech rhythm in production studies. Another potential limitation concerns the statistical analysis, which relied on multidimensional scaling.
An alternative method might consist of a logistic regression analysis on each trial, with the potential advantage that ratings would not need to be transformed through multidimensional scaling, a step that often leaves some variation unaccounted for, but could instead be analysed in a single step. Logistic regression might also account well for the nature of the task, where larger rhythmic differences might be expected to yield a broad agreement between listeners, while smaller rhythmic differences between pairs of stimuli might lead to disagreement between listeners. Finally, the experimental paradigm used in this study, i.e. the pairwise comparison of lowpass filtered stimuli, could be triangulated with other methods. A challenge to be overcome in this regard is the difficulty that lay listeners tend to encounter with direct ratings of rhythmicity, so that indirect methods are required.
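The logistic regression alternative mentioned above could look roughly as follows. This is a sketch on synthetic data, not an analysis of the experiment: the predictor (the difference in a rhythm metric between the two utterances of a trial), the generating coefficients and all names are hypothetical. A full analysis would probably also need to model the repeated measures per rater, for example with random effects, which is omitted here.

```python
import numpy as np


def fit_logistic(x, y, n_iter=25):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson; returns [b0, b1]."""
    design = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(design @ beta)))
        gradient = design.T @ (y - p)
        hessian = (design * (p * (1.0 - p))[:, None]).T @ design
        beta = beta + np.linalg.solve(hessian, gradient)
    return beta


# SYNTHETIC data for illustration only: 1,210 responses (22 raters x 55 trials).
rng = np.random.default_rng(0)
# Hypothetical predictor: metric of utterance 2 minus metric of utterance 1.
metric_diff = rng.uniform(-40.0, 40.0, size=1210)
true_b0, true_b1 = 0.0, -0.08  # hypothetical generating values
p_choose_2 = 1.0 / (1.0 + np.exp(-(true_b0 + true_b1 * metric_diff)))
# 1.0 if the simulated listener judged utterance 2 to be more regular.
chose_2 = (rng.random(1210) < p_choose_2).astype(float)

b0, b1 = fit_logistic(metric_diff, chose_2)
```

With a negative slope, large metric differences push the predicted probability towards 0 or 1 (broad agreement), while small differences leave it near 0.5 (disagreement), which mirrors the expectation stated in the text.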
8 Conclusion
Against the background of the widespread use of rhythm metrics in the study of speech rhythm, this chapter presented results from a study on the perception of rhythmicity and asked whether, and which, speech rhythm metrics can account for it. The results indicate that vocalic and consonantal metrics that are normalised for speech rate, as well as speech rate itself, may capture a substantial proportion of listeners’ perception of rhythmicity. These findings tentatively support the validity of these widely used metrics for production studies on speech rhythm, but need to be tested in future research on a broader empirical basis, with a wider selection of speech samples, representing multiple languages and dialects.
Appendix
Table 4 Selected descriptors and rhythm metrics for the eleven speech samples used in the perception experiment
Sample: lowpass08, lowpass09, lowpass10, lowpass11, lowpass06, lowpass05, lowpass07, lowpass03, lowpass04, lowpass01, lowpass02
Speech rate: 9.4, 8.5, 9.2, 12.7, 7.7, 9.4, 7.1, 8.9, 9.2, 10.5, 11.8
VarcoV: 47.3, 67.2, 47, 60.4, 23.5, 41.5, 34.5, 30.4, 34.4, 35.1, 29.5
VarcoVm1: 30.9, 78.5, 47, 61.9, 56.1, 22.4, 26.3, 28.9, 33.9, 37.3, 37.1
nPVI-V: 71, 58.5, 47.2, 34.5, 89.8, 29.4, 50, 48.3, 40.6, 52.4, 74.6
nPVI-Vm1: 51.2, 54.7, 48.4, 41.1, 112.1, 26.3, 52.6, 61.3, 41.7, 61.2, 74.1
%V: 0.44, 0.52, 0.34, 0.47, 0.36, 0.33, 0.32, 0.37, 0.33, 0.37, 0.41
%Vm1: 0.36, 0.5, 0.35, 0.49, 0.39, 0.38, 0.3, 0.41, 0.33, 0.4, 0.45
VarcoC: 30.4, 19.3, 52.9, 30.4, 39.5, 39.9, 40.9, 32, 38.2, 39.5, 62.9
VarcoCm1: 34.5, 18.1, 54.7, 34.2, 33.5, 24.2, 30.8, 22.6, 39.6, 38.5, 61.1
rPVI-C: 49.2, 39, 139.4, 69.2, 50.5, 60.7, 79.5, 62.4, 122.7, 76.1, 45.1
rPVI-Cm1: 36.5, 35.7, 148.6, 75.8, 51.3, 51.5, 48.8, 50.2, 129.0, 86.3, 41.1
No. vocalic int: 6, 6, 11, 7, 3, 5, 3, 5, 7, 7, 3
No. consonantal int: 6, 7, 11, 8, 4, 6, 4, 6, 7, 7, 4
References
Aldrich, A. (2020). Adult early-bilingual speech rhythm: Evidence from Spanish and English. In Proceedings of the 10th International Conference on Speech Prosody (pp. 528–532).
Arvaniti, A. (2012). The usefulness of metrics in the quantification of speech rhythm. Journal of Phonetics, 40(3), 351–373.
Barry, W., Andreeva, B., & Koreman, J. (2009). Do rhythm measures reflect perceived rhythm? Phonetica, 66(1–2), 78–94.
Benguerel, A. P. (1999). Stress-timing vs. syllable-timing vs. mora-timing. The perception of speech rhythm by native speakers of different language. Études & Travaux – Institut des Langues Vivantes et de Phonétique, (3), 1–18.
Boersma, P., & Weenink, D. (2014). Praat: doing phonetics by computer [Computer program]. Version 5.3.53, http://www.praat.org/
Boll-Avetisyan, N., Omane, P. O., & Kügler, F. (2020). Speech rhythm in Ghanaian languages: The cases of Akan, Ewe and Ghanaian English. In Proceedings of the 10th International Conference on Speech Prosody.
Bozdogan, H. (1987). Model selection and Akaike’s information criterion (AIC): The general theory and its analytical extensions. Psychometrika, 52(3), 345–370.
Calcagno, V., & Mazancourt, C. (2010). Glmulti: An R package for easy automated model selection with (generalized) linear models. Journal of Statistical Software, 34(12), 1–29.
Cox, M. A., & Cox, T. F. (2008). Multidimensional scaling. In C. Chen, W. Härdle, & A. Unwin (Eds.), Handbook of data visualization (pp. 315–347). Springer.
Cumming, R. E. (2011). Perceptually informed quantification of speech rhythm in pairwise variability indices. Phonetica, 68(4), 256–277.
Cumming, R. E. (2010). The language-specific integration of pitch and duration. PhD thesis. University of Cambridge.
Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11, 51–62.
Dellwo, V., Fourcin, A., & Abberton, E. (2007). Rhythmical classification of languages based on voice parameters. In J. Trouvain & W. J. Barry (Eds.), Proceedings of ICPhS XVI (pp. 1129–1132). Pirrot.
Dellwo, V. (2008). The role of speech rate in perceiving speech rhythm. In Proceedings of Speech Prosody 2008 (pp. 375–378).
Dellwo, V. (2010). Influences of speech rate on the acoustic correlates of speech rhythm. An experimental phonetic study based on acoustic and perceptual evidence (PhD thesis, University of Bonn). http://hss.ulb.uni-bonn.de:90/2010/2003/2003.htm
Drager, K. (2010). Sociophonetic variation in speech perception. Language and Linguistics Compass, 4(7), 473–480.
Eijk, L., Fletcher, A., McAuliffe, M., & Janse, E. (2020). The effects of word frequency and word probability on speech rhythm in dysarthria. Journal of Speech, Language, and Hearing Research, 63(9), 2833–2845.
Fuchs, R. (2014a). Integrating variability in loudness and duration in a multidimensional model of speech rhythm: Evidence from Indian English and British English. In N. Campbell, D. Gibbon & D. Hirst (Eds.), Proceedings of the 7th International Conference on Speech Prosody (pp. 290–294).
Fuchs, R. (2014b). Towards a perceptual model of speech rhythm: Integrating the influence of f0 on perceived duration. In H. Li, H. Meng, B. Ma, E. S. Chng & L. Xie (Eds.), Proceedings of Interspeech 2014 (pp. 1949–1953).
Fuchs, R. (2015). You’re not from around here, are you? A dialect discrimination experiment with speakers of British and Indian English. In E. Delais-Roussarie, M. Avanzi & S. Herment (Eds.), Prosody and language in contact (pp. 123–148). Springer.
Fuchs, R. (2016). Speech rhythm in varieties of English. Springer.
Gibbon, D., & Gut, U. (2001). Measuring speech rhythm. In Proceedings of Eurospeech 2001 (pp. 91–94).
Gibbon, D., & Li, P. (2019). Quantifying and correlating rhythm formants in speech. In Proceedings of the 3rd international symposium on linguistic patterns in spontaneous speech. Academia Sinica.
Gibbon, D. (to appear). The rhythms of rhythm. Journal of the International Phonetic Association.
Goswami, U. (2019). Speech rhythm and language acquisition: An amplitude modulation phase hierarchy perspective. Annals of the New York Academy of Science, 1453(1), 67–78.
Goswami, U., & Leong, V. (2013). Speech rhythm and temporal structure: Converging perspectives? Laboratory Phonology, 4(1), 67–92.
Grenon, I., & White, L. (2008). Acquiring rhythm: A comparison of L1 and L2 speakers of Canadian English and Japanese. In H. Chan, H. Jacob & E. Kapia (Eds.), Proceedings of the 32nd annual Boston university conference on language development (pp. 155–166). Cascadilla.
Gut, U. (2003). Non-native speech rhythm in German. In M.-J. S.D. Recasens & J. Romero (Eds.), Proceedings of the 15th international congress of phonetic sciences (ICPhS 2003) (pp. 2437–2440). Universitat Autónoma de Barcelona.
Hansen, B. (2018). Corpus linguistics and sociolinguistics: A study of variation and change in the modal systems of world Englishes. Leiden: Brill.
Harrison, E., Wood, C., Holliman, A. J., & Vousden, J. I. (2018). The immediate and longer-term effectiveness of a speech-rhythm-based reading intervention for beginning readers. Journal of Research in Reading, 41(1), 220–241.
He, L. (2012). Syllabic intensity variations as quantification of speech rhythm: Evidence from both L1 and L2. In Q. Ma, H. Ding & D. Hirst (Eds.), Proceedings of the 6th international conference on speech prosody (pp. 466–469). Tongji University Press.
Ibrahim, O., Asadi, H., Kassem, E., & Dellwo, V. (2020). Arabic speech rhythm corpus: Read and spontaneous speaking styles. In Proceedings of the 12th Language Resources and Evaluation Conference (pp. 5337–5342).
Kawase, S., Kim, J., & Davis, C. (2016). The influence of second language experience on Japanese-accented English rhythm. Proceedings of the 8th International Conference on Speech Prosody (pp. 746–750).
Kim, H., & Park, J. S. (2020). Automatic language identification using speech rhythm features for multi-lingual speech recognition. Applied Sciences, 10(7), 2225.
Kim, J., Davis, C., & Cutler, A. (2008). Perceptual tests of rhythmic similarity: II Syllable rhythm. Language and Speech, 51(4), 343–359.
Kinoshita, N., & Sheppard, C. (2011). Validating acoustic measures of speech rhythm for second language acquisition. In Proceedings of the 17th International Congress of Phonetic Sciences (pp. 1686–1689).
Kolly, M. J., & Dellwo, V. (2014). Cues to linguistic origin: The contribution of speech temporal information to foreign accent recognition. Journal of Phonetics, 42(1), 12–23.
Kohler, K. J. (2009). Rhythm in speech and language. Phonetica, 66(1–2), 29–45.
Law, W. L., Dmitrieva, O., & Francis, A. (2020). Convergence of L1 and L2 speech rhythm in Cantonese-English bilingual speakers. In Proceedings of the 10th International Conference on Speech Prosody (pp. 547–550).
Lee, C. S., Brown, L., & Müllensiefen, D. (2017). The musical impact of multicultural London English (MLE) speech rhythm. Music Perception: An Interdisciplinary Journal, 34(4), 452–481.
Levon, E. (2007). Sexuality in context: Variation and the sociolinguistic perception of identity. Language in Society, 36(4), 533–554.
Li, A., & Post, B. (2014). L2 acquisition of prosodic properties of speech rhythm: Evidence from L1 Mandarin and German learners of English. Studies in Second Language Acquisition, 36(2), 223–255.
Low, E. L., Grabe, E., & Nolan, F. (2000). Quantitative characterization of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43(4), 377–401.
Low, E. L. (1998). Prosodic prominence in Singapore English. PhD thesis. University of Cambridge.
Martínez-Sánchez, F., Meilán, J. J., Vera-Ferrandiz, J. A., Carro, J., Pujante-Valverde, I. M., Ivanova, O., & Carcavilla, N. (2017). Speech rhythm alterations in Spanish-speaking individuals with Alzheimer’s disease. Aging, Neuropsychology, and Cognition, 24(4), 418–434.
Miller, M. (1984). On the perception of rhythm. Journal of Phonetics, 12(1), 75–83.
Murty, L., Otake, T., & Cutler, A. (2007). Perceptual tests of rhythmic similarity: I Mora Rhythm. Language and Speech, 50(1), 77–99.
Nazzi, T., Bertoncini, J., & Mehler, J. (1998). Language discrimination by newborns: Towards an understanding of the role of rhythm. Journal of Experimental Psychology: Human Perception and Performance, 24(3), 756–766.
Nolan, F., & Jeon, H. S. (2014). Speech rhythm: a metaphor? Philosophical Transactions of the Royal Society B: Biological Sciences, 369(1658).
Ong, P. K. F., Deterding, D., & Low, E. L. (2005). Rhythm in Singapore and British English: A comparative study of indexes. In D. Deterding, A. Brown, & E. L. Low (Eds.), English in Singapore: Phonetics research on a corpus (pp. 74–85). McGraw-Hill.
Ordin, M., & Polyanskaya, L. (2015). Perception of speech rhythm in second language: The case of rhythmically similar L1 and L2. Frontiers in Psychology, 6, 316.
Pereira, A. S., Kavanagh, E., Hobaiter, C., Slocombe, K. E., & Lameira, A. R. (2020). Chimpanzee lip-smacks confirm primate continuity for speech-rhythm evolution. Biology Letters, 16(5).
Pettorino, M., Maffia, M., Pellegrino, E., Vitale, M., & De Meo, A. (2013). VtoV: A perceptual cue for rhythm identification. In P. Mertens & A. C. Simon (Eds.), Proceedings of the prosody-discourse interface conference 2013 (pp. 101–106).
Pettorino, M., & Pellegrino, E. (2016). %V and VtoV: An acoustic perceptual approach to the rhythmic classification of languages. In C. Bardel & A. De Meo (Eds.), Parler les langues romanes/Parlare le lingue romanze/Hablar las lenguas romances/Falando línguas românicas (pp. 13–28). Il Torcoliere.
Polyanskaya, L., Samuel, A. G., & Ordin, M. (2019). Speech rhythm convergence as a social coalition signal. Evolutionary Psychology, 17(3).
Post, B., & Payne, E. (2018). Speech rhythm in development: What is the child acquiring? In P. Prieto & N. Esteve-Gilbert (Eds.), The development of prosody in first language acquisition (pp. 125–144). John Benjamins.
R Core Team (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.
Ramus, F., & Mehler, J. (1999). Language identification with suprasegmental cues: A study based on speech resynthesis. Journal of the Acoustical Society of America, 105(1), 512–521.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
Rathcke, T., & Smith, R. (2011). Exploring timing in accents of British English. In Proceedings of the 17th International Congress of Phonetic Sciences (pp. 1666–1669).
Romano, A. (2020). Vowel reduction and deletion in Apulian and Lucanian dialects with reference to speech rhythm. Italian Journal of Linguistics, 32(1), 85–102.
Saraceni, M. (2015). World Englishes: A critical analysis. Bloomsbury.
Sarmah, P., Gogoi, D. V., & Wiltshire, C. (2009). Thai English. Rhythm and vowels. English World-Wide, 30(2), 196–217.
Schiering, R. (2007). The phonological basis of linguistic rhythm: Cross-linguistic data and diachronic interpretation. Sprachtypologie Und Universalienforschung, 60, 337–359.
Seifart, F., Meyer, J., Grawunder, S., & Dentel, L. (2018). Reducing language to rhythm: Amazonian Bora drummed language exploits speech rhythm for long-distance communication. Royal Society Open Science, 5(4).
Steiner, I. (2005). On the analysis of speech rhythm through acoustic parameters. In B. Fisseni, H. C. Schmitz, B. Schröder, & P. Wagner (Eds.), Sprachtechnologie, mobile Kommunikation und linguistische Ressourcen (pp. 647–658). Peter Lang.
Steiner, I. (2004). Zur Rhythmusanalyse mittels akustischer Parameter. MA thesis. Universität Bonn. http://www.coli.uni-saarland.de/~steiner/pdf/MA-Arbeit.pdf.
Szakay, A. (2007). Identifying Maori English and Pakeha English from suprasegmental cues: A study based in speech resynthesis (MA thesis, University of Canterbury). https://ir.canterbury.ac.nz/handle/10092/975
Szakay, A. (2008). Social networks and the perceptual relevance of rhythm: A New Zealand case study. University of Pennsylvania Working Papers in Linguistics, 14(2), article 18 (n.p.).
Sztahó, D., Tulics, M. G., Vicsi, K., & Valálik, I. (2017). Automatic estimation of severity of Parkinson’s disease based on speech rhythm related features. In Proceedings of the 8th IEEE International Conference on Cognitive Infocommunications (CogInfoCom) (pp. 11–16).
Tilsen, S., & Arvaniti, A. (2013). Speech rhythm analysis with decomposition of the amplitude envelope: Characterizing rhythmic patterns within and across languages. The Journal of the Acoustical Society of America, 134(1), 628–639.
Vicenik, C. J., & Sundara, M. (2013). The role of intonation in language and dialect discrimination by adults. Journal of Phonetics, 41(5), 297–306.
Vicenik, C. J. (2011). The role of intonation in language discrimination by infants and adults (PhD thesis, University of California at Los Angeles). http://phonetics.linguistics.ucla.edu/research/Vicenik_Diss.pdf
Westphal, M., & Wilson, G. (2020). New Englishes, new methods: Focus on corpus linguistics. Anglistik: International Journal of English Studies, 31, 47–65.
White, L., & Mattys, S. L. (2007b). Rhythmic typology and variation in first and second languages. In P. Prieto, J. Mascaró & M. J. Solé (Eds.), Segmental and prosodic issues in romance phonology (pp. 237–257). John Benjamins.
White, D., & Mok, P. (2019). L2 speech rhythm and language experience in new immigrants. In Proceedings of the 19th International Congress of Phonetic Sciences.
White, L., & Mattys, S. L. (2007a). Calibrating rhythm: First language and second language studies. Journal of Phonetics, 35(4), 501–522.
Wunder, E. M., Voormann, H., & Gut, U. (2010). The ICE Nigeria corpus project: Creating an open, rich and accurate corpus. ICAME Journal, 34, 78–88.
Novel Methods for Characterising L2 Speech Rhythm
Chris Davis and Jeesun Kim
Abstract In the current chapter, we are interested in ways of examining the general timing properties of L2 speech (with a focus on how these properties relate to the perception of foreign accent). We consider several novel measures that index the temporal patterns of speech that occur across a hierarchy of time scales. In our view, considering multiple time scales is important since a key property of speech energy is that it fluctuates and correlates across different grain-sizes. Furthermore, whereas traditional indices of L2 speech rhythm are based on measuring speech timing as it relates to the structural aspects of a language (e.g., syllable composition, segmental inventories), the measures we review relate more to aspects of the speech production style of the L2 talker.
Keywords Speech rhythm · Second language · Foreign accent · Rhythm metrics
C. Davis (B) · J. Kim, The MARCS Institute for Brain, Behaviour and Development, Western Sydney University, Locked Bag 1797, Penrith, NSW 2751, Australia. © Springer Nature Singapore Pte Ltd. 2023. R. Fuchs (ed.), Speech Rhythm in Learner and Second Language Varieties of English, Prosody, Phonology and Phonetics, https://doi.org/10.1007/978-981-19-8940-7_9
1 Foreign Accent When someone acquires a second language (L2) in adulthood, his/her speech typically deviates from that of native speakers. For example, the way that a speech sound is pronounced could be different, or the relative duration, fundamental frequency or intensity of sounds may differ. A person can be said to have a foreign accent when these differences go beyond the variations that normally occur when speaking a native language (L1). Note, however, that this way of defining foreign accent is based on a normative comparison and so does not specify particular features or properties that foreign accents possess. There is, of course, an extensive research literature on the perception of foreign accented speech and on what factors contribute to its production. Here, the simplest
account is that knowledge and use of an L1 have influenced the acquisition of the L2, with the strength of this influence modulated by a range of factors (e.g., proficiency). In determining the strength of the influence of L1 on L2 speech production, researchers have primarily paid attention to speech segments (e.g., Bohn & Flege, 1992; Kewley-Port et al., 1996). Experiments usually follow a procedure in which L2 speech is compared with that produced by L1 talkers (either by using an acoustic analysis or by having L1 speakers make perceptual judgments about L2 samples). Researchers have also investigated non-segmental contributions to foreign accent, although this remains a less well studied topic (e.g., Munro, 1995). Here too, a major focus is on the putative influence of L1 on L2 productions. Studying the extent to which non-segmental properties contribute to the perception of L2 speech calls for some ingenuity, because the influence of segmental and non-segmental properties must be distinguished. One way of doing this has been to minimize the contribution of segmental information. For instance, Munro (1995) low-pass filtered English sentences spoken by L1 English speakers and Mandarin-speaking L2 English speakers to render them unintelligible (i.e., no distinctive segmental information was available). Native English listeners then rated on a four-point scale how likely it was that each of these filtered sentences was spoken by a native English speaker (1 = definite foreign accent, 4 = no accent, English native speaker). It was found that listeners rated the stimuli produced by the L1 English speakers higher than those produced by Mandarin-speaking learners of English, demonstrating that listeners are able to detect foreign accent based only on rhythmic differences. The question of how the strength of a foreign accent relates to speech rhythm is largely unresolved.
The data from some studies suggest that the L2 rhythm difference is greater for stronger foreign accents. For example, Polyanskaya et al. (2017) examined the role of rhythm in the perception of foreign accent by using speech resynthesis. That is, stimuli were constructed that consisted of native English segments with the segment timing of English learners who had different levels of proficiency. These stimuli were then presented to English L1 raters. It was found that ratings of perceived foreign accent were influenced by the rated level of the talker’s L2 English proficiency for the speech used in the resynthesis; this was interpreted as showing that speech rhythm plays a role in the perception of foreign accent. However, the data from Sereno et al. (2016) do not support the view that rhythm has an influence on perceived accent. These authors conducted a similar study to Polyanskaya et al. (2017) using synthesized speech and employed a fully factorial design, i.e., native segments were given non-native rhythm, non-native segments were given native rhythm, etc. Participants made accent judgments on these sentences and transcribed them to assess intelligibility. The results showed that resynthesizing with non-native rhythm did not influence accent ratings even though it did influence intelligibility (although see Kawase et al., 2016a). Other studies of the rhythmic basis of foreign accent have used natural speech rather than filtered or resynthesized versions and examined speech production rather than perception. In this research, the idea is to determine if the rhythm of a talker’s L2 is different from that of a native talker, and whether the L2 rhythm is like that of the L2 talker’s L1 (Kawase et al., 2016b). To conduct such a study, it is necessary
to use a rhythm metric (see Fuchs, 2016), and for the L2 and L1 languages to have different rhythms since it is presumed that rhythm differences in L2 speech arise due to the influence of the different L1 rhythm. This assumption that rhythm differences in L2 speech are due to an L1 influence implicitly favours the use of rhythm metrics that are based on measuring vocalic and consonantal intervals, since it is a standard view that these intervals vary across languages that have different rhythms (see below). However, there is evidence that an L2 learner’s rhythm will differ from that of the target language even if the learner’s L1 and the L2 have similar rhythms (Ordin & Polyanskaya, 2015). One reason for this may be that difficulty in mastering the pronunciation of unfamiliar speech segments could affect spoken fluency (and hence rhythm) regardless of the rhythmic properties of the L1 and L2. Once it is acknowledged that L2 rhythm can be affected by other factors, it then seems appropriate to use a wider range of timing metrics to characterise L2 speech and foreign accent. In other words, the enterprise of determining the extent to which foreign accent is a legacy of a talker’s L1 may have overshadowed the task of simply characterising what L2 speech is like more generally. This is not to say that topics on such things as the fluency of L2 speech have been overlooked (they have not, see Trofimovich & Baker, 2006) but rather that the quest to characterise L2 speech based on the idea that foreign accent is caused solely by the L1 may have indirectly constrained the selection of research tools. Before outlining the novel measures that we have used and providing examples, we briefly discuss the concept of speech rhythm and patterns in speech, since these are fundamental concepts in what follows.
2 Speech Rhythm Broadly, speech rhythm is related to the perception that some properties of the signal repeat over time; these could be surface properties or ones that require a deeper analysis to recover (see Liberman & Prince, 1977). The story about speech rhythm is an old one that has roots in the oral traditions of poetry and prose. Sparked by the idea that some languages appear to differ in their rhythm, researchers attempted to define rhythm and to sort languages into different rhythm types (Abercrombie, 1967). This in turn led to the development of different rhythm metrics (e.g., Ramus et al., 1999) and debate about how to classify the rhythm of different languages (Dauer, 1983). Speech rhythm metrics have tended to mainly consist of durational properties and indices of variability. This focus on timing concords with the basic intuition that rhythm is a temporal phenomenon (although a better characterization is that it is a multidimensional one, see Kohler, 2009; Fuchs, 2014). Even when the investigation of rhythm is restricted to timing, several issues remain to be resolved, e.g., which properties to evaluate, and how to best characterise the temporal dimension. One straightforward set of measures of variability in speech timing was developed by Ramus et al. (1999). Ramus et al. took vowels and consonants as the units over which timing was quantised since the duration of these can vary greatly due to
language specific vowel reduction and consonant clustering phenomena. To capture changes in production duration they proposed measuring the variability in vowel and consonant durations within a sentence. This was operationalised in terms of the standard deviation of the vowel and consonant intervals respectively. Dellwo (2006) proposed that the Ramus et al. (1999) measures should be normalised by speaking rate (i.e., standard deviation divided by the mean duration), as this can strongly influence the measures (Dellwo & Wagner, 2003). Whereas the above rhythm metrics aimed to characterise the global characteristics of speech timing, Low and colleagues (Grabe & Low, 2002; Low, 1998; Low et al., 2000) proposed a measure to summarise local characteristics, i.e., changes that occur over adjacent intervals. The measure that Low and colleagues developed, the Pairwise Variability Index (PVI), took local variation into account by measuring the variability in pairs of successive speech units. In the normalised PVI (nPVI), differences are calculated as a proportion of the mean value within a pair and then the mean fractional PVI value is multiplied by 100. That is, the nPVI is derived by taking each pair of adjacent inter-onset intervals and calculating their difference and dividing this by their average duration; then the average of all these ratios is multiplied by 100. This measure thus captures temporal variability in terms of a single measure that only uses adjacent interval (local) information. Although the above metrics have been the standard ones used to characterise speech rhythm, they have attracted criticism. For example, some have pointed out practical problems; e.g., the time-consuming and potentially error-prone manual annotation of vowel and consonant intervals (Fuchs & Wunder, 2015). Others have highlighted problems in determining vowel length differences (Rathcke et al., 2015) or in the reliability of consonant measures (Knight, 2011). 
More importantly, these metrics only consider “flat” local or global timing measures, i.e., non-hierarchical relationships. For example, the nPVI provides a zeroth-order distributional statistical measure but it does not measure higher-order relationships. Here, Cummins (2002) makes the telling observation that rhythm metrics based on the linear phonological properties of a language (e.g., the relative proportion of vowels or consonants) only capture a part of what contributes to speech rhythm. He points out that the rhythmic patterns of speech also arise due to the coupling of nested prosodic units, a coupling that varies with factors like fluency, conversational intent, and so on. It is this latter aspect of rhythm that is the focus of the current chapter. In sum, the broad aim of the current chapter is to introduce three novel ways of characterising variation in the timing of L2 speech, and to examine how these relate to differences in foreign accent. These measures differ in two basic ways from those traditionally used. First, unlike traditional ways of quantifying speech rhythm, the measures index variation in speech properties over multiple hierarchical scales. Second, these measures are likely to be more sensitive to aspects of spoken performance than to structural differences between languages (see Kello et al., 2017). This last aspect highlights a difference in perspective about what is useful to measure. Typically, sensitivity to performance properties is considered a problem when measuring rhythm; in our view, the ability to measure such properties may provide an additional way of characterising patterns in L2 speech.
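As a point of comparison for the multiscale measures that follow, the traditional interval metrics described earlier in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the function names and the example durations are hypothetical, the population standard deviation is an arbitrary choice (the chapter does not say which estimator was used), and the nPVI takes the absolute difference within each pair.

```python
from statistics import mean, pstdev

def delta(durations):
    """Standard deviation of interval durations (the global variability measure).
    Assumption: population SD; a sample SD would be equally defensible."""
    return pstdev(durations)

def varco(durations):
    """Rate-normalised variability: standard deviation divided by mean duration."""
    return pstdev(durations) / mean(durations)

def npvi(durations):
    """Normalised Pairwise Variability Index over successive intervals:
    mean of |d_k - d_(k+1)| divided by the pair mean, multiplied by 100."""
    pairs = zip(durations, durations[1:])
    ratios = [abs(a - b) / ((a + b) / 2) for a, b in pairs]
    return 100 * mean(ratios)

# Hypothetical vocalic interval durations in milliseconds (not data from the chapter).
vowel_ms = [100, 200, 100, 200]
print(delta(vowel_ms), varco(vowel_ms), npvi(vowel_ms))
```

Each function returns one number per utterance, which is the sense in which these measures are "flat": none of them says how variability changes with the size of the time window.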
3 Three Multiscale Methods to Characterise Temporal Relations Below we briefly review three multiscale measures of temporal variability in speech: the Spectral-Amplitude Modulation Phase Hierarchy (S-AMPH) model (Leong, 2012), Allan Factor (AF) analysis (Falk & Kello, 2017), and the Multiscale coefficient of variation (MSCV) analysis (Abney et al., 2017). We chose three analysis methods for this exploration of novel timing measures to see whether the results converge or diverge; seeing which measures pattern together should provide clues about their sensitivities. In the results presented below, the S-AMPH was calculated over periodicities fitted to amplitude envelopes, the AF analysis over point events (based on the amplitude envelope), and the MSCV analysis on event durations. The following section provides some background about each method. Details of how each is calculated are given in Sect. 5.
3.1 The S-AMPH Analysis This method grew out of several ideas about which speech properties create our perception of rhythm. The first is that rhythm (at least in English) arises from the fluctuating pattern of strong and weak spoken elements conditioned by changes in amplitude and duration. The second idea is that important cues for rhythm are transmitted in relatively slow amplitude modulations. Two modulation rates are taken to be important, a rate that approximates the duration of syllables (~4 Hz) and a slower rate that captures supra-syllable (stress) phenomena (~2 Hz). The basic hypothesis that underlies the S-AMPH analysis is that the perception of strong–weak rhythm arises due to the synchrony of syllable and stress periodicities (i.e., the relationship of the two phases). In this regard, the measure provides an index of the extent to which the amplitude modulation (AM) of different frequency bands is synchronised (their phase relationship).
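The notion of phase synchrony between two modulation rates can be made concrete with a small sketch of the phase synchrony index, PSI = |⟨exp(i(nθ1 − mθ2))⟩|, the form given in Sect. 5. The sketch assumes the two phase series have already been extracted (it does not perform the filtering or envelope steps of the S-AMPH model), and the function name, sampling rate and synthetic 2 Hz and 4 Hz phases are hypothetical. Treating n and m as integers that match the ratio of the two rates is also an assumption made here for illustration.

```python
import cmath
import math

def phase_synchrony_index(theta1, theta2, n=1, m=1):
    """PSI = |mean(exp(i * (n*theta1 - m*theta2)))|, from 0 (none) to 1 (perfect)."""
    terms = [cmath.exp(1j * (n * a - m * b)) for a, b in zip(theta1, theta2)]
    return abs(sum(terms) / len(terms))

# Synthetic phases (assumption): a 2 Hz "stress" rate and a 4 Hz "syllable" rate,
# sampled at 100 Hz for 2 s, with a constant 0.5 rad offset on the faster rate.
fs = 100
t = [k / fs for k in range(2 * fs)]
stress_phase = [2 * math.pi * 2.0 * x for x in t]
syllable_phase = [2 * math.pi * 4.0 * x + 0.5 for x in t]

# With n:m = 2:1 the phase difference is constant, so the index is close to 1.
locked = phase_synchrony_index(stress_phase, syllable_phase, n=2, m=1)
# With n:m = 1:1 the phase difference drifts through whole cycles, so it is close to 0.
unlocked = phase_synchrony_index(stress_phase, syllable_phase, n=1, m=1)
print(locked, unlocked)
```

The index depends only on how stable the phase difference is over time, not on its value, which is why the constant 0.5 rad offset does not lower the locked result.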
3.1.1 Evidence that the S-AMPH Analysis Is Sensitive to Changes in Speech Rhythm/Style
Evidence that this measure may be sensitive to speech rhythm comes from studies run by Leong and colleagues. For example, one study examined whether the perception of speech rhythm (iambic and trochaic rhythms) was associated with hierarchical AM phase relationships in speech, a key assumption of the S-AMPH analysis (Leong, 2012). In this study, participants were first presented with four unaltered sentences (that presented two rhythm patterns, iambic or trochaic) that were the models for the response options. Then participants were presented one at a time with various vocoded versions of these sentences. The manipulations consisted of filtered versions
of different AM modulation rates (e.g., syllable rate, stress rate) either alone or in combination, or where the phase relationship between these rates had been manipulated. On each trial, participants were asked to judge, based on the rhythm of the vocoded stimuli, which of the sentences they had heard. The results showed that participants were more accurate in discriminating rhythms when presented with the combined stress and syllable AM rates (e.g., versus syllable only, stress only, etc.). Further, changing the phase relationship between the syllable and stress AM rates changed the rhythm that the participants perceived. Leong and colleagues have also shown that the S-AMPH analysis can differentiate between speech styles that use different rhythms (Leong et al., 2017). They tested mothers talking to their infants (infant directed speech, IDS) or to other adults (adult directed speech, ADS). Their approach consisted of analysing the modulation spectrum of IDS and ADS based on three modulation rates and then determining the degree of synchrony between pairs of these modulation rates. As above, the rates chosen approximated those of different types of speech cues: a phonemic rate of 12–40 Hz (also referred to as modulation in the beta/low gamma band), a syllable rate of 2.5–12 Hz (theta band), and a stress rate of 0.9–2.5 Hz (delta band). Consistent with the idea that the rhythmic properties of IDS differ from those of ADS, it was found that the phase synchrony of the syllable rate band (2.5–12 Hz) and the stress rate band (0.9–2.5 Hz) was greater for IDS than ADS for acoustic frequencies below 700 Hz.
3.1.2 What the S-AMPH Measures
To get an idea of what the S-AMPH analysis is sensitive to, it is instructive to consider how the results of a recent study of the conversational speech of literate and illiterate adults have been interpreted. In the study, Araújo et al. (2018) found that the coupling (phase synchronisation) between AM bands of literate adult speech was greater than that of the speech of illiterate adults. The greater phase synchronisation for literate talkers was interpreted as indicating more regular spacing between the different phonological elements of their utterances (as assessed by the AM rates tested). That is, the greater phase synchronisation for the utterances of the literates between the syllable (theta) and phonemic (beta/low gamma) rates was taken as evidence for a more regular spacing of phonemes within syllables, and the greater synchronisation between the stress (delta) and syllable (theta) rates as indicating a greater regularity in the spacing relation of syllables and stressed syllables.
3.2 Allan Factor Analysis The analysis of the timing of speech events using the Allan Factor (AF) analysis has similarities with the S-AMPH (e.g., both provide indices of how well properties measured at one rate nest within those measured at another rate). Unlike the S-AMPH analysis, the
AF does not pre-specify what property is measured, or the rates (time intervals) of interest (e.g., phoneme, syllable durations), as the ‘signatures’ of such time scales emerge due to the hierarchical temporal structure of the speech signal itself (Falk & Kello, 2017). In general, the AF analysis is based on a statistical method that can distinguish a Poisson process (where events occur independently and at random over time) from a process in which events occur non-randomly (Allan, 1966). The AF can be used as a measure of hierarchical temporal structure in speech, providing a measure of the coefficient of variation in the timing of a point event over multiple time scales. Since the AF measures the clustering of point events, the chosen event needs to be relevant to the phenomenon of interest. Recently, several studies have used the AF to characterise speech events over multiple time scales. Two types of point events have been used: acoustic onsets (Abney et al., 2014, 2015) and peaks in the sound amplitude envelope (Falk & Kello, 2017). Falk and Kello (2017) showed that the variance of the clustering of peaks in the AM envelope had a significant correlation with the variance in the duration of speech segments (e.g., vowels, syllables, words, and so on). Given this, it was suggested that using peak amplitude as a point process is a reasonable choice for characterising speech using AF analysis; the work described below used peak amplitude. The AF analysis consists of counting events in temporal windows and quantifying the variance of event counts between temporally adjacent windows of a given size. Figure 1 presents a schematic illustration of the clustering of point events (peaks in the amplitude envelope that exceed a threshold) over different time scales. The AF will be constant across different sized windows if there is no temporal clustering. See Sect. 5 below for more details on how the AF is calculated.
Fig. 1 An example of the basics of an AF analysis. The top panel shows a portion of the speech waveform (from the Speech Accent Archive); under which is a representation of the Hilbert envelope, below this are event counts across tiled windows of three different sizes (adapted from Kello et al., 2017)
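The window-counting procedure illustrated in Fig. 1 can be sketched as follows. This is a minimal illustration, not the analysis code used for the results in this chapter: it takes the standard definition of the Allan Factor, A(T) = ⟨(N_i − N_{i+1})²⟩ / (2⟨N_i⟩), where N_i is the event count in the i-th window of size T, and it assumes that event times (e.g., envelope peaks) have already been extracted. The function name and the two toy event trains are hypothetical.

```python
def allan_factor(event_times, window, duration):
    """Allan Factor for one window size.

    event_times: times of point events in seconds (assumed already extracted)
    window:      window size T in seconds
    duration:    total length of the recording in seconds
    """
    n_windows = int(duration // window)
    if n_windows < 2:
        raise ValueError("need at least two windows")
    # Count events in tiled, non-overlapping windows of size T.
    counts = [0] * n_windows
    for t in event_times:
        idx = int(t // window)
        if 0 <= idx < n_windows:
            counts[idx] += 1
    mean_count = sum(counts) / n_windows
    if mean_count == 0:
        raise ValueError("no events fall inside the windows")
    # Mean squared difference between counts in adjacent windows.
    sq_diffs = [(counts[i] - counts[i + 1]) ** 2 for i in range(n_windows - 1)]
    return (sum(sq_diffs) / len(sq_diffs)) / (2 * mean_count)

# Toy event trains (hypothetical): evenly spaced events versus bursts with gaps.
regular = [0.05 + 0.1 * k for k in range(100)]           # 10 events in every second
bursty = [0.1, 0.2, 0.3, 0.4, 2.1, 2.2, 2.3, 2.4]        # events in alternate seconds
print(allan_factor(regular, 1.0, 10.0))
print(allan_factor(bursty, 1.0, 4.0))
```

Repeating the calculation over a range of window sizes gives the AF function over time scales. In this toy case the evenly spaced train gives a value of 0 and the bursty train a value above 1, whereas events that occur at random give values near 1.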
3.2.1 Evidence that Allan Factor Analysis Is Sensitive to Changes in Speech Style
It is customary to describe speech in terms of different sized elements that are nested within each other, e.g., phonemes, syllables, words, phrases, sentences and so on. AF analysis is aimed at providing an index of this hierarchical temporal structure. To demonstrate that the AF may be sensitive to differences in speech style, Falk and Kello (2017) examined differences in IDS and ADS (see Leong et al., 2017). To do this, they used the AF to index the temporal distribution of amplitude peaks (events) in the AM envelope (as filtered into four frequency bands) for IDS and ADS. They found that these speech styles differed, with IDS showing greater event clustering across time scales compared to ADS. Falk and Kello (2017) suggested that this increased temporal clustering for IDS was due to increases in the variability of durations of AM peaks across time scales. It was proposed that such variability in IDS made this type of speech more interesting and increased infant listeners’ level of arousal. In addition to demonstrating that the AF analysis can distinguish between speech styles, the results of Ramirez-Aristizabal et al. (2018) indicate that the AF is sensitive to the changes that occur in the relationship of linguistic units as a function of speech rate. A faster speaking rate resulted in a shift in clustering to the shorter time scales, with lower AF values at longer time scales, and a slower rate resulted in a shift to longer time scales, i.e., higher AF values at longer time scales.
3.2.2 What the Allan Factor Measures
To get a feel for what the AF measures, consider the results of Kello et al. (2017), who used it to analyse many different types of auditory signals (e.g., speech produced in different communicative settings, music of different styles/genres, non-human vocalizations). Based on this work, Kello and colleagues suggested that the AF is most sensitive to the hierarchical temporal structure produced by what they called ‘prosodic exaggeration’. For example, it was shown that signals with impoverished prosodic cues (e.g., synthesized speech, animal vocalizations) contained fewer nested clusters (lower AF at longer time scales) compared with human natural speech. Importantly, for current concerns, they also pointed out that the AF may not be sensitive to rhythmic patterns associated with the so-called “rhythm classes”.
3.3 The Multi-scale Coefficient of Variation (MSCV) Analysis Of the three methods we review, the MSCV analysis is the most similar to what has been traditionally used to index speech rhythm in that it is a straightforward extension of a standard coefficient of variation measure (nPVI). However, unlike the nPVI, the MSCV provides a single-value estimate of variation across multiple time
scales. That is, the MSCV measure provides an estimate of the difference between a local coefficient of variation for a specific time sample and the overall coefficient of variation for all time samples. Abney et al. (2017) showed that by applying the analysis to different sized samples from time series where the structure of temporal variation was known, the MSCV provides an index of temporal variability across multiple time scales even for short time series (e.g., n = 25). Abney and colleagues point out that various measures can be computed using the MSCV analysis but to quantify the properties of a multiscale structure in a single value, the normalised MSCV measure (MSCVnorm) is useful. The MSCVnorm measure is the MSCV value divided by the value of the global coefficient of variation for the whole time series and normalised by the number of window sizes (see Sect. 5 for details of how this is calculated).
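The windowed calculation behind the MSCV can be sketched as below, following the description given in Sect. 5: for each window size T the series is tiled into non-overlapping windows, a coefficient of variation is computed in each window and averaged, and the normalised value divides each MSCV(T) by the global coefficient of variation and averages over the window sizes. This is not the Matlab implementation of Abney et al. (2017). The function names, the use of the population standard deviation, and the toy series are assumptions made for illustration.

```python
from statistics import mean, pstdev

def cv(values):
    """Coefficient of variation: SD divided by the mean (population SD assumed)."""
    return pstdev(values) / mean(values)

def window_sizes(length):
    """Powers of 2 from 2 up to a maximum of length/2 - 1."""
    sizes, size = [], 2
    while size <= length / 2 - 1:
        sizes.append(size)
        size *= 2
    return sizes

def mscv(series, size):
    """Mean CV over non-overlapping windows of the given size."""
    windows = [series[i:i + size] for i in range(0, len(series) - size + 1, size)]
    return mean([cv(w) for w in windows])

def mscv_norm(series):
    """Average over window sizes of MSCV(T) divided by the global CV."""
    global_cv = cv(series)
    sizes = window_sizes(len(series))
    return sum(mscv(series, s) / global_cv for s in sizes) / len(sizes)

# Toy duration series (hypothetical): the same values arranged in two ways.
alternating = [1, 3] * 8                    # variation looks the same at every scale
blocked = [1, 1, 1, 1, 3, 3, 3, 3] * 2      # no variation inside short windows
print(mscv_norm(alternating), mscv_norm(blocked))
```

The two toy series have the same global coefficient of variation, so any difference between their normalised values comes from how the variation is distributed across window sizes.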
3.3.1 Evidence that the MSCV Analysis Is Sensitive to Changes in Speech Rhythm
Abney et al. (2017) gauged the sensitivity of the MSCV to temporal structure within a time series in two ways. First, they artificially generated three types of time series: one that displayed Long Range Correlations (LRC), one that exhibited Short Range Correlations (SRC), and one that had no positive or negative autocorrelations across lags, a White Noise (WN) series. Abney and colleagues then analysed each of these time series with the MSCVnorm analysis. They found that the MSCVnorm measure was able to discriminate the SRC and LRC series from each other and from the WN series, with the MSCVnorm value near 1.0 for the WN series and decreasing from this value as multiscale structure increased, i.e., the value for the LRC series was lower than that for the SRC one. The second way that Abney et al. (2017) examined what the MSCV analysis could show was by applying it to data from the BonnTempo Corpus (BTC 1.0; Dellwo et al., 2004). This corpus was initially set up as a resource to study the variability in read speech between “stress-timed” (English; German) and “syllable-timed” (French; Italian) languages. To determine whether the MSCV analysis would be sensitive to putative differences in rhythm between stress- and syllable-timed languages, Abney et al. (2017) used data from 49 read phrases by native English and 42 read phrases by French talkers. The read speech consisted of talkers reading aloud, at their normal reading rate, an English version (77 syllables) or a French version (93 syllables) of a translated story by Bernhard Schlink (‘Selbs Betrug’). Abney et al. created an event series from the consonant and vowel durations of the files extracted from the corpus. They found that the MSCVnorm measure for the English talkers was lower than that of the French (controlling for local variability using the nPVI).
This result indicated that the MSCVnorm measure captured variation beyond that measured by the standard nPVI (since the variance associated with the nPVI was residualised out as a covariate).
4 Applying the Above Measures to L2 Speech To apply the above measures to L2 speech, we selected speech files from the Speech Accent Archive (Weinberger, 2015). Each recording consisted of the person reading the same 69-word passage that contained most of the consonants, vowels, and clusters of standard American English (see Weinberger, 2015). The files consist of recordings in mp3 format; Fuchs and Maxwell (2016) have shown that acoustic measures (i.e., f0) remain reliable with mp3 compressed recordings. We used a set of 35 records from Korean L2 English talkers; 27 from French L2 English talkers and 32 native talkers of English (Australian). The 35 Korean talkers had a mean age of 31.8 years (SD = 12.9) and consisted of 24 females and 11 males. These talkers began learning English at various ages (Mean = 13 years; SD = 5.9) and had resided for various lengths of time in English speaking countries (Mean = 8.3 years; SD = 8.4). The French L2 English talkers (Mean Age = 30.9 years; SD = 13.6) consisted of 13 female, and 14 male talkers. These speakers began learning English at various ages (Mean = 11.6 years; SD = 2.7) and had resided for various lengths of time in English speaking countries (Mean = 5.8 years; SD = 11.8). Recordings of native English speakers (14 Female; Mean Age = 29.4 years; SD = 10.1) were used for comparison.
4.1 Quantifying Foreign Accent As a measure of foreign accent, we had three L1 English raters listen to the L2 speech recordings and judge the extent of foreign accent on a 0 to 9 point scale (0 being no accent, 9 being strong accent). We did not specify what was meant by foreign accent but left that up to each rater to decide (i.e., we did not specifically mention speech rhythm, etc.). The raters listened to the Korean and French talkers on different days. Using these ratings, we selected two extreme accent groups for the Korean and French talkers, i.e., a group whose recordings attracted low accent ratings, the weak accent group, and a group whose recordings were rated as having a strong accent, the strong accent group.
5 Results 5.1 S-AMPH Analysis The synchrony index between amplitude modulation in the three rate bands used by Leong et al. (2017) was calculated using the S-AMPH model (Leong, 2012). This index represents how closely in phase the modulation envelopes of the selected rate bands are (0 = no synchrony, 1 = perfect synchrony). Details of the signal
processing steps involved are given in Leong et al. (2017). In brief, using adjacent finite impulse response filters, a waveform is band-pass filtered into five frequency bands and for each band three AM rates are extracted from each down-sampled Hilbert envelope. A phase synchrony index (PSI) between pairs of AM rates is calculated according to (1):

$$\mathrm{PSI} = \left| \left\langle e^{i(n\theta_1 - m\theta_2)} \right\rangle \right| \tag{1}$$
where (nθ1 − mθ2) is the phase difference between the two AMs, calculated by taking the distance between phase angles using circular distance (modulo 2π). The S-AMPH PSI results for the Korean talkers and French talkers are shown in Fig. 2 (data from Davis & Kim, 2018). Figure 2 shows that in the intermediate frequency band, 700–1950 Hz, the PSI for the strong accent was greater than for the weak accent. For this contrast between the PSI values of the strong and weak accented English, a repeated measures ANOVA was conducted. This analysis indicated that the strong accent had a higher PSI value than the weak accent, F(1,16) = 14.73, p = 0.002. As can be seen in Fig. 2, the PSI scores for the French L2 English talkers showed a different pattern from those of the Korean L2 talkers. In this case, the strong and weak accented talkers had very similar PSI scores across all the acoustic frequency bands. Indeed, the repeated measures ANOVA between the strong and weak accent PSI scores produced an F value that was <

5.2 Allan Factor Analysis

$$A(T) = \frac{\left\langle (N_i - N_{i+1})^2 \right\rangle}{2 \left\langle N_i \right\rangle} \tag{2}$$

where N_i is the number of events counted in the i-th window of size T.
Note that due to the shorter duration of the current recordings (~30 s) only time scales under a few seconds could be calculated, since the largest AF timescale is 1/16th of each recording’s length. Figure 3 shows the AF functions for the Korean and French talkers who had strong or weak L2 English accents and the English native talkers. In Fig. 3 (left panel) the curve for the strong L2 English accent starts to diverge from the other curves at about 15 ms and continues to diverge from there. For the key contrast, the difference between the two accent types was tested by repeated measures ANOVA run on the factors of accent type (strong accent; weak accent) and time. There was a significant overall effect of accent type (strong accent vs. weak accent), F(1,14) = 12.89, p < 0.01 and an interaction of this variable with time, F(11,154) = 6.43, p < 0.001. There was a difference between the English L1 values and those of the strong accent, F(1, 38) = 11.051, p < 0.001 and no difference between the English L1 values and the weak accent, F < 1. Figure 3 (right panel) shows that unlike the results for the Korean L1 talkers where the AF differed between those talkers who had a strong versus weak accent, for the French L1 talkers there was no significant difference. That is, the ANOVA for strong accent versus weak accent contrast was not significant, F(1,8) = 2.24, p = 0.165 and there was no interaction with accent type and time, F(11,88) = 1.56, p = 0.327. Also, the omnibus comparison between all three language groups (English L1, French strong and French weak accent) was not significant, F < 1.
Fig. 3 Mean AF values for Korean L1 strong and weak accented English L2 and English L1 speech (left panel) and the French L1 strong and weak accented English and English L1 (right panel). The timescale is in seconds; an Allan Factor of 1 (10⁰) indicates events occurred randomly
5.3 Multi-scaled Coefficient of Variation (MSCV) Analysis The MSCV analysis was conducted using the Matlab scripts referenced in Abney et al. (2017). The MSCV analysis is typically based on a time series of event durations and measures the difference between a local coefficient of variation for a specific time window size and the overall coefficient of variation for all the time samples. For a time series, a tiling of non-overlapping windows of size T is created and the coefficient of variation (CV) is computed for the elements within each window. For window size T, the CVs are averaged, as in (3):

$$\mathrm{MSCV}(T) = \frac{\sigma(T)}{\mu(T)} \tag{3}$$
where σ is the SD, and μ is the mean. The normalised version is divided by the global coefficient of variation and again divided by the number of window sizes (NT ), see (4) ET i=2
MSCVnorm =
MSCV(T ) CV
NT
(4)
In the example below, we report both MSCV and MSCVnorm and we set T as a power of 2, ranging between a minimum of 2 and a maximum of L/2 − 1, where L represents the number of measurements in the time series. To provide continuity with Abney et al. (2017), we used the time series of consonant and vowel intervals, respectively. The MSCVnorm measure reflects the extent to which variation in a time series is heterogeneous over a time scale. Series that are homogeneous across time scales tend to have an MSCVnorm value of about 1.0 (as determined in the simulation studies of Abney et al., 2017); values less than 1.0 indicate an increase in multiscale structure.
Fig. 4 Mean MSCVnorm values for Korean L1 strong and weak accented English L2, French L1 strong and weak accented English L2 and English L1 for consonants (top) and vowels (bottom)
Figure 4 shows the mean MSCVnorm scores as a function of talker group for the consonant and vowel durations (data from Davis & Kim, 2019). As can be seen in the figure, the mean values are lower for speech rated as having a strong accent than for speech having a weak accent or for L1 English. This indicates that the strongly accented speech had more heterogeneity of variance across the various window sizes. A linear mixed-effects model (with talker as a random effect) was fitted to the MSCVnorm scores. The analysis contrasted segment type (consonants vs. vowels), accent strength (strong vs. weak) and L1 (Korean vs. French); the English L1 scores were not included in this analysis. There was a significant difference in the MSCVnorm scores of consonants and vowels (vowels had a lower value), F = 25.31, p < 0.01. The difference between the MSCVnorm scores for weak and strong accented L2 speech was significant (strong accent had a lower value), F = 4.36, p < 0.05. There was no significant difference between the MSCVnorm scores for the Korean and French talkers, F = 0.14, p = 0.71, and there were no significant interactions between these variables: accent and segment type, F = 0.10, p = 0.75; accent and language, F = 0.0002, p = 0.99. A separate analysis indicated that there was no significant difference between the MSCVnorm values for the weakly accented and the L1 English speech, F = 0.002, p = 0.97. Abney et al. (2017) showed that MSCVnorm scores were lower when calculated over English vowel durations than over French ones (when nPVI was taken into account). This finding was interpreted as demonstrating that English (read) speech had more multiscale variability than French, and was thought to be related to English having more complex syllables.
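The MSCV computation described in this section can be sketched in a few lines of Python. This is a minimal illustration written from the verbal description (window sizes that are powers of 2 from 2 up to L/2 − 1, CVs averaged within each window size, then normalised by the global CV and averaged over window sizes); it is not the Matlab code of Abney et al. (2017). The function names, the use of the population SD and the dropping of a trailing partial window are assumptions.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """SD divided by mean. Using the population SD is an assumption."""
    return pstdev(values) / mean(values)


def window_sizes(length):
    """Powers of two from 2 up to at most length / 2 - 1."""
    sizes, t = [], 2
    while t <= length / 2 - 1:
        sizes.append(t)
        t *= 2
    return sizes


def mscv(series, window):
    """Average CV over non-overlapping windows of the given size
    (a trailing partial window is dropped)."""
    n_windows = len(series) // window
    return mean(
        coefficient_of_variation(series[i * window:(i + 1) * window])
        for i in range(n_windows)
    )


def mscv_norm(series):
    """Mean over window sizes of MSCV(T) divided by the global CV."""
    global_cv = coefficient_of_variation(series)
    return mean(mscv(series, t) / global_cv for t in window_sizes(len(series)))


# Toy duration series in arbitrary units (illustrative, not speech data).
alternating = [1.0, 3.0] * 8     # same variability at every window size
blocked = [1.0] * 8 + [3.0] * 8  # variability only at the longest scale

norm_alternating = mscv_norm(alternating)  # 1.0
norm_blocked = mscv_norm(blocked)          # 0.0
```

The two toy series have the same global CV, but the blocked series has no variability inside short windows, so its normalised score falls below 1.0, which is the pattern described above for series with more multiscale structure.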
The current results for L2 speech showed differences between strongly and weakly accented English for both sets of MSCVnorm values (consonants and vowels), with no interaction with whether the accent was from a Korean or French talker. Interestingly, the direction of these differences was opposite to that in the above studies of L1 speech, suggesting that the strongly accented speech had more multiscale structure than the weakly accented speech or the English L1. In this regard, it may be that the MSCVnorm measure indicates that those with a strong foreign accent are more variable in the timing of their spoken output. In such a case, L2 speech production can usefully be viewed in terms of a complex motor task for which the skill of talkers varies (rather than in terms of being part of the legacy of a talker’s L1).
6 Discussion
This chapter introduced three relatively novel ways of measuring the temporal properties of L2 and L1 speech in order to compare strong and weak foreign accents. Before considering the results, it is worth emphasising that several properties are likely to give rise to a foreign accent; we have focussed on speech timing not because it may be a particularly salient property, but because it is an often overlooked one. The measures we chose all index temporal properties at multiple time scales, including variation that occurs over a longer term (several seconds or more). That is, the S-AMPH measure is sensitive to the coherence of AM periodicities across syllable and stress durations, while the AF analysis is sensitive to the clustering of AM envelope peaks over a range of time scales (limited by the duration of the sample), and the MSCVnorm to the variability in segment duration over multiple time scales.
The MSCVnorm results showed that strongly accented speech had more temporal variability than weakly accented speech, and this effect was the same for the Korean and French L1 talkers. The results for the other two types of analysis were similar to each other in that the difference between strongly and weakly accented speech was clear only for the Korean talkers. What could explain why there was an effect of strong versus weak accent for the S-AMPH and AF analyses for the Korean but not for the French L2 speech? The difference is unlikely to be linked to a putative rhythm class difference, since (as standardly conceived) both French and Korean are syllable-timed languages (Kim et al., 2008). A clue to why these measures did not differentiate between the strong and weak French accented L2 talkers can be gleaned from previous studies where differences have been found using these measures. For example, as mentioned above, both measures showed a difference between IDS and ADS (Falk & Kello, 2017; Leong et al., 2017).
A key difference between IDS and ADS is that IDS has a slower speech rate (including pauses) and articulation rate (Fernald et al., 1989). Thus, it may be that these measures are influenced by properties of spoken output that vary with speech/articulation rate. Consistent with this proposal is the result that for the current L2 recordings there was a difference in how fast the strong and weak accented Korean talkers spoke (those with a strong accent spoke more slowly), but there was no such difference between the strong and weak accented French talkers.
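The distinction between the two rates mentioned above can be made concrete with a short Python sketch. It assumes the common convention that speech rate is computed over the whole recording (pauses included), whereas articulation rate is computed over speaking time only (pauses excluded). The use of syllables as the counting unit, the function names and the example numbers are hypothetical and are not taken from the recordings analysed here.

```python
def speech_rate(n_syllables, total_duration):
    """Syllables per second over the whole recording, pauses included."""
    return n_syllables / total_duration


def articulation_rate(n_syllables, total_duration, pause_durations):
    """Syllables per second over speaking time only, pauses excluded."""
    speaking_time = total_duration - sum(pause_durations)
    if speaking_time <= 0:
        raise ValueError("pauses must be shorter than the recording")
    return n_syllables / speaking_time


# Hypothetical 30 s recording: 60 syllables and three pauses (in seconds).
rate = speech_rate(60, 30.0)                             # 2.0 syllables/s
art_rate = articulation_rate(60, 30.0, [2.0, 3.0, 5.0])  # 3.0 syllables/s
```

Two talkers can therefore share an articulation rate while differing in speech rate if one of them pauses more often or for longer.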
C. Davis and J. Kim
Both Leong et al. (2017) and Falk and Kello (2017) examined the effects of speech rate on their measures. Leong et al. used a rate normalisation procedure that rescaled the sampling of data from the IDS speech to match the temporal rate of the ADS data and found that the S-AMPH measure still showed a difference between IDS and ADS. This manipulation suggests that the S-AMPH measure is sensitive to speech dynamics, i.e., the hierarchical changes that occur in the timing of slow and fast speech, rather than speech rate per se. Falk and Kello (2017) examined whether the difference between IDS and ADS shown by the AF analysis was due to longer pauses in IDS. They did this by removing pauses longer than 150 ms at phrase boundaries and found that the AF measure produced the same effect of speech style (IDS vs. ADS). Once again this suggests that the measure is sensitive to changes in temporal structure rather than speech rate per se (that includes pauses). To make this clearer, a recent study by Ramirez-Aristizabal et al. (2018) specifically examined the effect that slow and fast speech had on the AF measure. The results confirmed that slower speech had higher AF values at longer time scales. Interestingly, in the same study it was found that there was a decline in the AF at longer time scales when slow speaking rates were induced by using a teleprompter with a slow presentation rate. This drop in the AF was interpreted as being due to the even pace of the teleprompter leading to more isochronous renditions, thereby reducing clustering in spoken output, and hence to a reduction in the AF at longer time scales. The above analysis suggests that the speech of the strong accented Korean L2 talkers was not only slower but more variable than that of the strong accented French ones. One difference that might account for this is that for the Korean talkers, reading the English passage aloud required processing an L2 orthography. 
If it is assumed that the talkers with the clearest foreign accents were also those who had less fluency in reading, then this may help explain why these talkers produced a different speech rhythm. This effect may be quite subtle and could occur even though the talker may have a perfect declarative knowledge of English orthography (e.g., print to sound correspondences). That is, the real-time pressure of reading aloud may have resulted in these participants adopting a speech style that was more fluent for high frequency shorter words, and slightly more laboured for longer lower frequency ones. This mixed speaking style may have led to an increase in the distributional variation of the timing of peaks in the amplitude envelope (leading to a greater AF), and possibly increased the synchrony between syllables and stress due to more easily read parts being given more prominence. This hypothesis could be tested by conducting the same measurements as above with talkers of another language that does not use the same orthography as English/French. Of course, it might be that the key factor is not the overlap of the writing systems but rather the degree to which the phoneme inventories match. On this account, disfluencies would be due to uncertainty in pronunciation rather than in orthographic decoding. This could be tested by examining a language that had the same (or largely the same) orthography as English but a smaller phoneme inventory.
The above suggestion that the speech of the strongly accented Korean talkers had more multiscale variability than that of the weakly accented talkers is consistent with the finding of the MSCVnorm analysis. That is, using simulation and cross-language studies, Abney et al. (2017) showed that lower MSCVnorm scores were associated with time series that had relatively more multi-scaled structures. Thus, for the Korean talkers, all the measures produced a consistent outcome. However, neither the S-AMPH nor the AF analysis showed a difference between the strong and weak accented French talkers, and yet the MSCVnorm analysis showed a difference between strong and weak accented speech that did not interact with the talker’s L1 (Korean or French). At this stage, the precise reason for this apparent difference between the measures is unclear. One possibility is that the MSCV measure is more sensitive to durational variability since it is based directly on measures related to articulation (the duration of consonants and vowels). Another is that, as Fuchs and Wunder (2015) have pointed out, measurements of consonant and vowel durations are prone to error and it may be that measures based on these will be more variable due to measurement issues.
In summary, the work described in this chapter was motivated by the view that a range of measures is needed to provide an adequate characterization of phenomena like the timing (rhythm) of L2 speech and foreign accent. The measures and the results that we have described provide an impetus for investigating how factors like variation over different time scales, the structure and influence of pauses, and factors that potentially affect spoken fluency (e.g., orthography in the case of read speech), all contribute to foreign accent. In this endeavour, we believe that there is untapped potential for automatic measures that do not require extensive hand-labelling, e.g., metrics based on properties such as sonority (Fuchs & Wunder, 2015), or intensity measures that take account of hearing sensitivities and stimulus history (e.g., Cochlea-Scaled Entropy, Stilp & Kluender, 2010; Aubanel et al., 2018).
Acknowledgements The authors wish to thank Victoria Leong for her tutorial on conducting the S-AMPH analysis and acknowledge the support of an Australian Research Council grant DP150104600.
References
Abercrombie, D. (1967). Elements of general phonetics. Edinburgh University Press.
Abney, D. H., Paxton, A., Dale, R., & Kello, C. T. (2014). Complexity matching in dyadic conversation. Journal of Experimental Psychology: General, 143(6), 2304.
Abney, D. H., Kello, C. T., & Warlaumont, A. S. (2015). Production and convergence of multiscale clustering in speech. Ecological Psychology, 27(3), 222–235.
Abney, D. H., Kello, C. T., & Balasubramaniam, R. (2017). Introduction and application of the multiscale coefficient of variation analysis. Behavior Research Methods, 49(5), 1571–1581.
Allan, D. W. (1966). Statistics of atomic frequency standards. Proceedings of the IEEE, 54(2), 221–230.
Araújo, J., Flanagan, S., Castro-Caldas, A., & Goswami, U. (2018). The temporal modulation structure of illiterate versus literate adult speech. PLoS One, 13(10), e0205224.
Aubanel, V., Cooke, M., Davis, C., & Kim, J. (2018). Temporal factors in cochlea-scaled entropy and intensity-based intelligibility predictions. The Journal of the Acoustical Society of America, 143(6), EL443–EL448.
Bohn, O.-S., & Flege, J. E. (1992). The production of new and similar vowels by adult German learners of English. Studies in Second Language Acquisition, 14(02), 131–158.
Cummins, F. (2002). Speech rhythm and rhythmic taxonomy. In The proceedings of Speech Prosody 2002, International Conference, Aix-en-Provence, France, April 11–13, 2002.
Dauer, R. M. (1983). Stress-timing and syllable-timing reanalyzed. Journal of Phonetics, 11, 51–62.
Davis, C., & Kim, J. (2018). Characterizing rhythm differences between strong and weak accented L2 speech. In Proceedings of Interspeech 2018 (pp. 2568–2572).
Davis, C., & Kim, J. (2019). Temporal variability in strong versus weak foreign accented speech. In Proceedings of ICPhS 2019.
Dellwo, V. (2006). Rhythm and speech rate: A variation coefficient for ΔC. In Language and language-processing (pp. 231–241).
Dellwo, V., & Wagner, P. (2003). Relations between language rhythm and speech rate. In Proceedings of the 15th international congress of phonetic sciences (pp. 471–474). Barcelona.
Dellwo, V., Steiner, I., Aschenberner, B., Dankovičová, J., & Wagner, P. (2004). The BonnTempo-Corpus and BonnTempo-Tools: A database for the study of speech rhythm and rate. In Proceedings of the 8th ICSLP, Jeju Island, Korea.
Falk, S., & Kello, C. T. (2017). Hierarchical organization in the temporal structure of infant-direct speech and song. Cognition, 163, 80–86.
Fernald, A., Taeschner, T., Dunn, J., Papousek, M., de Boysson-Bardies, B., & Fukui, I. (1989). A cross-language study of prosodic modifications in mothers’ and fathers’ speech to preverbal infants. Journal of Child Language, 16(3), 477–501.
Fuchs, R. (2016). Speech rhythm in varieties of English. In Speech rhythm in varieties of English (pp. 87–102). Springer.
Fuchs, R. (2014). Integrating variability in loudness and duration in a multidimensional model of speech rhythm: Evidence from Indian English and British English. In N. Campbell, D. Gibbon & D. Hirst (Eds.), Social and Linguistic Speech Prosody. Proceedings of the 7th International Conference on Speech Prosody 2014, Dublin, Ireland (pp. 290–294).
Fuchs, R., & Maxwell, O. (2016). The effects of mp3 compression on acoustic measurements of fundamental frequency and pitch range. In Speech Prosody 2016, Boston, USA (pp. 523–527).
Fuchs, R., & Wunder, E. M. (2015). A sonority-based account of speech rhythm in Chinese learners of English. In U. Gut, R. Fuchs & E.-M. Wunder (Eds.), Universal or diverse paths to English phonology. Topics in English Linguistics [TiEL] (Vol. 86, pp. 165–184).
Grabe, E., & Low, E. L. (2002). Durational variability in speech and the rhythm class hypothesis. In Papers in laboratory phonology (Vol. 7, pp. 515–546).
Kawase, S., Kim, J., & Davis, C. (2016a). The relative contributions of duration and amplitude to the perception of Japanese-accented English as a function of L2 experience. In Proceedings of the Sixteenth Australasian International Conference on Speech Science and Technology (pp. 81–84).
Kawase, S., Kim, J., & Davis, C. (2016b). The influence of second language experience on Japanese-accented English rhythm. In Proceedings of the 8th International Conference on Speech Prosody 2016, Boston, USA (pp. 746–750).
Kello, C. T., Bella, S. D., Médé, B., & Balasubramaniam, R. (2017). Hierarchical temporal structure in music, speech and animal vocalizations: Jazz is like a conversation, humpbacks sing like hermit thrushes. Journal of the Royal Society Interface, 14(135), 20170231.
Kewley-Port, D., Akahane-Yamada, R., & Aikawa, K. (1996). Intelligibility and acoustic correlates of Japanese accented English vowels. Presented at the International Conference on Spoken Language Processing (Vol. 96, pp. 450–453).
Kim, J., Davis, C., & Cutler, A. (2008). Perceptual tests of rhythmic similarity: II. Syllable rhythm. Language and Speech, 51(4), 343–359.
Knight, R. A. (2011). Assessing the temporal reliability of rhythm metrics. Journal of the International Phonetic Association, 41(3), 271–281.
Kohler, K. J. (2009). Rhythm in speech and language. Phonetica, 66(1–2), 29–45.
Leong, V. (2012). Prosodic rhythm in the speech amplitude envelope: Amplitude modulation phase hierarchies (AMPHs) and AMPH models. Doctoral dissertation, University of Cambridge.
Leong, V., Kalashnikova, M., Burnham, D., & Goswami, U. (2017). The temporal modulation structure of infant-directed speech. Open Mind, 1(2), 79–90.
Liberman, M., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8(2), 249–336.
Low, E. L. (1998). Prosodic prominence in Singapore English. University of Cambridge.
Low, E. L., Grabe, E., & Nolan, F. (2000). Quantitative characterizations of speech rhythm: Syllable-timing in Singapore English. Language and Speech, 43(4), 377–401.
Munro, M. J. (1995). Nonsegmental factors in foreign accent: Ratings of filtered speech. Studies in Second Language Acquisition, 17(1), 17–34.
Ordin, M., & Polyanskaya, L. (2015). Acquisition of speech rhythm in a second language by learners with rhythmically different native languages. The Journal of the Acoustical Society of America, 138(2), 533–544.
Polyanskaya, L., Ordin, M., & Busa, M. G. (2017). Relative salience of speech rhythm and speech rate on perceived foreign accent in a second language. Language and Speech, 60(3), 333–355.
Ramirez-Aristizabal, A. G., Médé, B., & Kello, C. T. (2018). Complexity matching in speech: Effects of speaking rate and naturalness. Chaos, Solitons & Fractals, 111, 175–179.
Ramus, F., Nespor, M., & Mehler, J. (1999). Correlates of linguistic rhythm in the speech signal. Cognition, 73, 265–292.
Rathcke, T. V., & Smith, R. H. (2015). Speech timing and linguistic rhythm: On the acoustic bases of rhythm typologies. The Journal of the Acoustical Society of America, 137(5), 2834–2845.
Sereno, J., Lammers, L., & Jongman, A. (2016). The relative contribution of segments and intonation to the perception of foreign-accented speech. Applied Psycholinguistics, 37(2), 303–322.
Stilp, C. E., & Kluender, K. R. (2010). Cochlea-scaled entropy, not consonants, vowels, or time, best predicts speech intelligibility. Proceedings of the National Academy of Sciences, 107(27), 12387–12392.
Trofimovich, P., & Baker, W. (2006). Learning second language suprasegmentals: Effect of L2 experience on prosody and fluency characteristics of L2 speech. Studies in Second Language Acquisition, 28(1), 1–30.
Weinberger, S. (2015). Speech accent archive. George Mason University. Retrieved from http://accent.gmu.edu