English Pages 318 Year 2015
Ralf Vogel and Ruben van de Vijver (Eds.) Rhythm in Cognition and Grammar
Trends in Linguistics Studies and Monographs
Editor Volker Gast Editorial Board Walter Bisang Jan Terje Faarlund Hans Henrich Hock Natalia Levshina Heiko Narrog Matthias Schlesewsky Amir Zeldes Niina Ning Zhang Editors responsible for this volume Volker Gast and Hans Henrich Hock
Volume 286
Rhythm in Cognition and Grammar
A Germanic Perspective
Edited by Ralf Vogel and Ruben van de Vijver
ISBN 978-3-11-037792-7 e-ISBN (PDF) 978-3-11-037809-2 e-ISBN (EPUB) 978-3-11-039424-5 ISSN 1861-4302 Library of Congress Cataloging-in-Publication Data A CIP catalog record for this book has been applied for at the Library of Congress. Bibliographic information published by the Deutsche Nationalbibliothek The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.dnb.de. © 2015 Walter de Gruyter GmbH, Berlin/Munich/Boston Typesetting: PTP-Berlin Protago-TEX-Production GmbH, Berlin Printing: CPI books GmbH, Leck ♾ Printed on acid-free paper Printed in Germany www.degruyter.com
Table of contents
Ralf Vogel, Ruben van de Vijver: Introduction
Part 1: Concepts of rhythm
1 Laszlo Hunyadi: Grouping, symmetry, and rhythm in language
2 Dafydd Gibbon: Speech rhythms – modelling the groove
Part 2: Linguistic rhythm and cognition
3 Maren Schmidt-Kassow, Kathrin Rothermich and Sonja A. Kotz: The role of default stress patterns in German monolingual and L2 sentence processing
4 Gerrit Kentner: Stress clash hampers processing of noncanonical structures in reading
5 Ulrike Domahs, Richard Wiese and Johannes Knaus: Word prosody in focus and non-focus position: An ERP-study on the interplay of prosodic domains
6 Anne Zimmer-Stahl and Kamila Polisenska: Short-term memory in speech perception and prosodic structuring – A syllable span test
Part 3: Rhythm and grammar in Germanic
7 Julia Schlüter: Rhythmic influence on grammar: scope and limitations
8 Stephanie Shih, Jason Grafmiller, Richard Futrell, and Joan Bresnan: Rhythm’s role in genitive construction choice in spoken English
9 Anton Karl Ingason: Rhythmic preferences in morphosyntactic variation and the theory of loser candidates
10 Ralf Vogel, Ruben van de Vijver, Sonja Kotz, Anna Kutscher, Petra Wagner: Function words in rhythmic optimisation
11 Alexandra Schladebeck: Rhythm as a resource to generate phonetic and phonological coherence in lists
Index
Ralf Vogel, Ruben van de Vijver
Introduction
This volume grew out of the workshop “Rhythm beyond the word”, which was held at the 31st annual meeting of the Deutsche Gesellschaft für Sprachwissenschaft (German Society of Linguistics) at the University of Osnabrück, March 3–6, 2009. The workshop and this book aim at exploring linguistic rhythm from a broader perspective than just the phonological point of view. There is growing interest in rhythm as a fundamental property of human cognition and in understanding linguistic rhythm in terms of this cognitive universal. Furthermore, rhythm is explored as a principle of linguistic structure which is operative not only in phonology, but also in other linguistic domains such as morphosyntax and turn-taking management. A proper treatment of linguistic rhythm requires the integration of the phonetic/phonological perspective with the cognitive perspective while at the same time taking into account the function of rhythm in other domains of grammar. This book aims to foster such a general understanding of linguistic rhythm.
A wide range of phenomena from cognition, phonetics, phonology, morphology, syntax and conversational analysis are covered, with a multitude of methods and models. As the studies focus on the Germanic language family, the contributions of this book together give an impression of the relevance of rhythm for this language family, which serves as the paradigm case of so-called stress timing languages.
The book is divided into three parts. The first part, Concepts of Rhythm, includes two chapters (Hunyadi; Gibbon), which discuss rhythm from a general perspective. Part two, Rhythm and Cognition, contains four contributions (Schmidt-Kassow, Rothermich and Kotz; Kentner; Domahs, Wiese and Knaus; Zimmer-Stahl and Polisenska), which explore the cognitive function of rhythm and its role in phonological and syntactic processing.
Part three, Rhythm and Syntax, contains five studies (Schlüter; Shih, Grafmiller, Futrell and Bresnan; Ingason; Vogel, van de Vijver, Kotz, Kutscher and Wagner; Schladebeck), which explore the interaction of rhythm with morphosyntactic principles. In the remainder of this introduction, we introduce the topics of the three parts of this book.
1 Concepts of rhythm
The human mind is an adept pattern recognizer. A few lines assembled in a certain pattern will trigger the recognition of a face, as in, for example, ‘:)’ (Sandford & Burton, 2014). Apart from these geometrical patterns there are also temporal patterns – rhythm. The role of rhythm in language is the topic of this book. The focus will be on the influence of rhythm in cognition and various aspects of grammar. The role of rhythm in cognition is evolutionarily old: in a recent study, Comins & Gentner (2013) found that songbirds are able to learn new songs if they consist of regularly ordered pieces of known songs, but not if the new songs consist of irregularly ordered pieces of known songs. It is even very likely that rhythm is a general cognitive phenomenon extending beyond the auditory realm. We recognize visual patterns and rhythms just as we recognize auditory patterns.
This issue is explored by Hunyadi, who addresses the question of how rhythm is used in grouping in the visual mode before he turns to rhythm in language. He describes several experiments to assess the role of rhythm in grouping. Participants were shown four dots which were grouped by means of brackets; for example, there were dots organized like this ∙ ∙ ∙ ∙, like this (∙ ∙) (∙ ∙) or like this ∙ (∙ ∙) ∙. They were then asked to characterize the grouping by means of mouse clicks. Participants tended to use short intervals between clicks for dots that belong together visually – dots in a pair of brackets – and longer intervals between dots that do not belong together. Dots that are ungrouped were marked by clicks that are temporally evenly spaced. He then adduces subtle length differences in Hungarian rhythm as evidence for grouping in language and concludes that grouping in language follows much the same patterns as in the visual mode, which he takes as evidence for a general cognitive grouping mechanism.
The role of rhythm in language has been explored in the hypothesis that languages belong to one of three rhythmic classes, stress-timing, syllable-timing and mora-timing (Abercrombie, 1967; Pike, 1945). According to this hypothesis, all languages have a tendency for isochrony. In syllable timing languages, the duration of syllables is kept constant, whereas in stress timing languages the distance between stressed syllables is kept constant. Much phonetic research has been undertaken in order to explore this hypothesis, especially in the 1970s and 1980s. It turned out that it is difficult to characterize rhythm as many different factors play a role.¹
1 Detailed summaries of these studies are given by Auer (2001), Auer & Uhmann (1988).
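The isochrony hypothesis can be made concrete with a small calculation. The following is a minimal sketch, not a method taken from any of the cited studies: it assumes that an utterance is available as a list of (duration in milliseconds, stressed?) pairs and compares the variability of syllable durations with the variability of intervals between stressed syllables. The function names and the toy data are invented for illustration.

```python
from statistics import mean, pstdev


def coefficient_of_variation(durations):
    """Standard deviation divided by the mean; 0.0 means perfectly isochronous."""
    return pstdev(durations) / mean(durations)


def interstress_intervals(syllables):
    """Return the durations from each stressed syllable up to the next one.

    `syllables` is a list of (duration_ms, is_stressed) pairs. Material before
    the first stress and the final, unterminated interval are ignored.
    """
    intervals, current = [], None
    for duration, stressed in syllables:
        if stressed:
            if current is not None:
                intervals.append(current)
            current = 0
        if current is not None:
            current += duration
    return intervals


# Invented toy utterance: syllable durations vary, stress-to-stress intervals do not.
utterance = [(200, True), (100, False), (100, False),
             (250, True), (150, False),
             (400, True), (120, False)]

syllable_cv = coefficient_of_variation([d for d, _ in utterance])
interval_cv = coefficient_of_variation(interstress_intervals(utterance))
print(interstress_intervals(utterance))  # [400, 400]
print(syllable_cv > interval_cv)         # True: the toy data pattern as "stress-timed"
```

On this idealized measure, a syllable-timed utterance would show low variability in syllable durations and a stress-timed one low variability in inter-stress intervals. Real speech data rarely pattern this cleanly, which is the difficulty the text describes.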
The results from psycholinguistic studies and from acquisition studies paint a relatively clear picture. They show the relevance of isochrony and the existence of rhythm classes in language perception and acquisition (Couper-Kuhlen, 1993; Cutler & Mehler, 1993; Lehiste, 1977; Nespor et al., 1996; Ramus, 2002b). Several attempts have been made to reconcile these contradictory findings. Auer & Uhmann (1988) and Auer (2001) suggested a phonological reinterpretation of the proposed typology. Languages of the two types have a number of properties in common. These properties conspire to yield the impressions of syllable and stress timing.² Ramus et al. (1999) and Ramus (2002a) proposed a phonetic measure for the different rhythm classes, based on the relative frequencies of consonants and vowels. Typological studies on the validity of these claims lead to rather negative results (Schiering, 2007; Schiering et al., 2012). Schiering (2007) found that three logically independent aspects serve to characterize the rhythm of a language: the salience of stress phonology in the language, syllable complexity and the relevance of length contrasts typical of mora-timing languages. Each of these parameters is gradient: “The synchronic evidence suggests that rhythmic differences across languages are best modeled as a continuum with the prototypical representatives of stress- and mora-based rhythm as focal points. Syllable-based rhythm turns out to be negatively defined by the lack of stress- or mora-based phonology” (Schiering, 2007, 355). It further turned out that the rhythmic characteristics of particular languages are representative only of their language families. The rhythm classes proposed in the literature do not seem to be independent from this. The proposed characteristics of a stress-timing language, for instance, are the rhythmic characteristics of the Germanic languages, the primary example of stress timing languages. 
These issues raise the question as to how to properly characterize stress. This issue is taken up by Gibbon, who stresses the need to address the deviations from strict synchronicity – the groove. Rhythm in language is not just a matter of equal sequences, but the subtle variances in the sequences are just as important.
2 Syllable timing languages tend to have simpler syllable structures than stress timing languages, mainly CV structures. Gemination is found only in syllable timing languages, whereas ambisyllabicity is found only in stress timing languages. Vowel quality can be reduced in unstressed syllables of stress timing languages. Vowel harmony is only to be found in syllable timing languages. Stressed syllables tend to be heavier than unstressed syllables in stress timing languages, whereas no such asymmetry is found in syllable timing languages. Stress timing languages also often use stress to indicate grammatical properties like compound structure or syntactic category (English “rébel”, noun, vs. rebél, verb). This is rare in syllable timing languages.
These variances are part of rhythm and are necessary to define rhythm. However, modeling this groove is no easy task, as Gibbon argues. Gibbon proposes a Generic Rhythm Model which consists of several sub-models. In short, rhythm is an organizing force in cognition and its proper characterization is demanding, as deviations from strict rhythm contribute to the overall impression of rhythmicity. The organizing role of rhythm is further taken up in the next part of the book.
2 Rhythm and cognition
Cognitive scientists responded to the controversy over the proper characterization of the rhythm classes by focusing on the role of rhythm in perception rather than in production. The focus of this work is not so much on the characterization of rhythm classes but on the role of rhythm in processing language. It has been hypothesized that sensitivity to rhythm is what makes speech recognition so robust. This argument has been made by Dilley & Pitt (2010), who performed an experiment in which the speech rate around a target function word was either slowed down or sped up. If the speech rate was slowed down, the function word was not perceived. The explanation provided by Dilley & Pitt (2010) is that the speech rhythm provides expectations of word boundaries; speech that is slowed down may have more boundaries than one would expect on the basis of the earlier, faster, rhythm.
This hypothesis finds further support in the results of the experiments reported by Morrill et al. (2014). They ran an experiment with three rhythmic environments. In one, the rhythm was flat, in the sense that all syllables had the same pitch and duration. In another condition the rhythm was binary, expressed by alternating high and lower pitch on alternating syllables. In a third condition the rhythm was ternary, expressed by two low-pitched syllables followed by a syllable with a relatively higher pitch. It turns out that when listeners expect only two syllables on the basis of the global rhythm of the sentence, they tend to miss a function word in a binary stretch, since the function word would not fit the rhythm. When listeners expect three syllables on the basis of the global rhythm of the sentence, they tend to hear, in a stretch of two syllables, a function word that is not there. These results suggest that rhythm is used as a lattice onto which speech is processed.
These findings are a strong indication of the role of rhythm as a means to a structured perception of the world. The chapters in this part of the book share this focus.
Schmidt-Kassow, Rothermich and Kotz investigate the role of rhythm in auditory speech and language processing. They report on a study in which event-related potentials (ERP) were used to investigate language processing. They used test sentences in four conditions. In the first condition, the sentence, which consisted of seven bisyllabic words with initial stress, was realized correctly. In the second condition, the sixth word was realized with stress on the final syllable. This created what the authors call a metrical violation. In the third condition the sixth word was inflected incorrectly. This created a syntactic violation. In the fourth condition, finally, the sixth word was both stressed on the final syllable and inflected incorrectly. This created a double violation. The authors conclude that the rhythm of a sentence is used as a trellis for linguistic structure to grow on.
Kentner reports the results of an experiment in which he studied the influence of rhythm on the placement of stress in a cognitively straining task. The placement of stress depended on the syntactic and information-structural representations of the sentence. He recorded the reading of so-called right-node-raising constructions. Some of the sentences were created in such a way that proper placement of stress would lead to a clash and in other sentences it would not. He concludes that a stress clash hampers readers' performance.
Domahs, Wiese and Knaus investigate the relation between rhythm and information structure. They describe an ERP experiment in which they studied the processing of German trisyllabic words with correct or incorrect stress placement in two conditions. In one condition the target word is in focus and in another condition it is not. Some ERP components that are present when the target word is in focus are not present when the word is not in focus.
The authors argue that a pitch accent on the first syllable of the word is more difficult to process when the previous word in the sentence has a pitch accent than when the previous word has no pitch accent. Differences in duration in the syllables of a word are processed independently of the pitch accents in the sentence.
Zimmer-Stahl and Polisenska report on an experiment investigating the influence of rhythm on short-term memory. They asked listeners to remember as many nonsense syllables as they could in three conditions. In one condition the nonsense syllables all have the same duration and are not separated by a pause. In another condition a nonsense syllable of 300 milliseconds is followed by one of 200 milliseconds, which, in turn, is followed by a pause of 100 milliseconds. In the final condition a nonsense syllable of 300 milliseconds is followed by two syllables, each of 200 milliseconds, which are followed by a pause of 100 milliseconds. Participants tended to remember more items in the last condition and Zimmer-Stahl and Polisenska conclude that rhythm helps to group
items, which makes it easier to store them in short-term memory. The larger the units which are grouped, the more one can remember.
The chapters in this section have explored the organizing role of rhythm and found that rhythm allows for anticipation and memorization of linguistic structure. The chapters in the next section explore the role of rhythm in organizing the syntagmatic relations among constituents.
3 Rhythm and syntax in Germanic
3.1 Stress shift
The amount of research on the interaction of syntax and rhythm is still rather limited. With respect to the Germanic language family, the situation is a little bit better. The chapters in this part of the book contribute to this growing field of research. Perhaps the best understood instance of an interaction of syntax and rhythm is the phenomenon of stress shift. It can be characterised, adopting the framework of generative linguistics, as a postlexical phonological process that is triggered by a stress clash. Stress clash and subsequent stress shift typically occur when morphemes and words are combined into larger morphosyntactic units, for instance compounds and phrases. A simple case is the noun phrase ‘chínese réstaurant’ which consists of the adjective ‘chinése’ and the noun ‘réstaurant’. The combination of the two words leads to adjacency of the two syllables with word stress. The word stress on ‘chinése’ then shifts from the second to the first syllable to yield ‘chínese’. Stress shift is observed not only at syllable level, but also at foot level. In such a case, primary and secondary stress within a word are swapped. Consider the following case from German (cf. Wiese, 1996, 306), where brackets indicate foot structure.
(1)
a. (Pà.der) (bórn) – “Paderborn” (German town)
b. (Ú.ni) – short form of “Universität” (‘university’)
“Paderborn” has word stress on its final syllable, whereas “Uni” is stressed on the first syllable. In order to form a noun phrase out of the two nouns, the town name is suffixed with a derivational suffix, “-er”, that turns it into an adjective. The word (Pà.der)(bór.ner) still has the final foot as its strong foot. Thus, the resulting
phrase has a stress clash at foot level that is resolved by shifting primary stress to the first foot in “Paderborner”:³
(2)
                         ×
     ×                   ×
.   (×    .)  (×    .)  (×    .)
die  Pa   der  bor  ner  U    ni
“the Paderborn university”
However, when the two nouns are adjacent, but do not belong to the same phrase, stress shift does not have to occur:
(3)
dass Pàderbórner/?Páderbòrner Unis mögen
that Paderbornians universities like
“. . . that Paderbornians like universities.”
a. dass (Paderborner) (Unis) (mögen)
b. dass (Paderborner) (Unis mögen)
In (2), [Paderborner Uni] is a complex noun phrase that projects a single phonological phrase (PhP). In (3), [Paderborner] and [Unis] function as subject and object, respectively. The subject [Paderborner] projects its own phonological phrase, whereas [Unis] either projects its own phonological phrase (3-a) or is combined with the verb in one PhP (3-b). Stress shift in German is obviously structure dependent. This observation corroborates the findings by Hunyadi (this volume) on the relevance of grouping in studying linguistic rhythm. Rhythmic regularity needs to be sustained among elements that belong to the same group – it is constitutive for the prosodic encoding of syntactic grouping. This leads to stress shift as in (2). An unresolved stress clash between adjacent feet as in (3), therefore, is also indicative of a phrase boundary between those feet. As Hunyadi (this volume) shows, the temporal distance between adjacent units that are separated by a group (phrase) boundary is greater than in a case without group boundary.
Stress shift can be dealt with in the standard model of generative grammar where syntax and phonology are in a feedforward relation: syntax provides the input to phonology. A more interesting class of phenomena are those that challenge this standard view by suggesting an impact of rhythm on morphosyntax.
3 In (2), the format of bracketed grids proposed by Hayes (1995) is used. Brackets denote foot boundaries and dots stand for unstressed elements within feet.
Such studies usually are either studies on the impact of rhythm on morphosyntactic change, or studies on the contribution of rhythm in explanations of morphosyntactic phenomena. The papers presented here belong to the second type of studies. Before we introduce them, we will briefly consider a study of the first type.
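To make the clash-and-shift mechanism of this subsection concrete, the following is a minimal sketch under strong simplifying assumptions. Words are modelled as tuples of feet with one strong foot, a clash is defined as two adjacent strong feet across a word boundary, and the shift is applied only inside a single phonological phrase. The class `Word` and the functions `clashes`, `resolve_phrase` and `resolve` are hypothetical names introduced for illustration, not part of any published model. The sketch also treats the shift as obligatory within a phrase and simply absent across phrases, whereas the text only says that it "does not have to occur" across a phrase boundary.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Word:
    feet: tuple   # syllables of each foot, e.g. (("Pa", "der"), ("bor", "ner"))
    strong: int   # index of the foot that carries primary stress


def clashes(left, right):
    """Foot-level clash: the last foot of `left` and the first foot of `right` are both strong."""
    return left.strong == len(left.feet) - 1 and right.strong == 0


def resolve_phrase(words):
    """Shift primary stress to the first foot of a clashing word within ONE phonological phrase."""
    result = list(words)
    for i in range(len(result) - 1):
        left, right = result[i], result[i + 1]
        # A word with a single foot has nowhere to shift its stress to.
        if len(left.feet) > 1 and clashes(left, right):
            result[i] = replace(left, strong=0)
    return result


def resolve(phrases):
    """Apply stress shift phrase by phrase; clashes across phrase boundaries stay unresolved."""
    return [resolve_phrase(phrase) for phrase in phrases]


paderborner = Word(feet=(("Pa", "der"), ("bor", "ner")), strong=1)
uni = Word(feet=(("U", "ni"),), strong=0)

same_phrase = resolve([[paderborner, uni]])        # as in example (2)
separate_phrases = resolve([[paderborner], [uni]])  # as in example (3)
print(same_phrase[0][0].strong)       # 0: stress has shifted to (Pa.der)
print(separate_phrases[0][0].strong)  # 1: stress stays on (bor.ner)
```

The design choice worth noting is that the phrase grouping is an input to the procedure: the same two words yield different stress patterns depending on whether they are passed in as one group or two, which mirrors the structure dependence illustrated in (2) and (3).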
3.2 Effects of rhythm on morphosyntactic change
Szczepaniak (2007) described the shift of German from a syllable language (in Old High German, OHG) to a word language, in Auer’s (2001) terms.⁴ OHG was in many respects a typical syllable-timing language. It preferred simple syllable structures, it had inherited relatively free word stress from Proto-Germanic, it had equal vowel quality in all syllables, it showed vowel harmony and had geminates, to mention some of these properties. Contemporary German is quite the opposite of this: we find complex syllables, reduced vowel quality in unstressed syllables, fixed word stress, no more vowel harmony, and ambisyllabic consonants instead of geminates.
The starting point for this development was the fixation of word accent, which started in Proto-Germanic. Lexical words in German typically consist of one or two morphemes, forming a trochaic foot, the preferred form of inflected words, with stress on the lexical stem. The reduction of unstressed inflectional suffixes proceeded in several steps. Full vowels in suffixes were first reduced to schwa. Further phonetic reduction led to the loss of syllabicity in some suffixes, forming complex syllable codas. This had consequences in morphosyntax. Inflectional suffixes that once contrasted in their vowels became homophonous and thus lost their distinctiveness. This was then compensated for by analytic means in a still ongoing drift from synthetic to analytic inflection. For instance, the determiner – still expressing case distinctions – became an obligatory syntactic constituent of the noun phrase in Middle High German, which compensated for the loss of distinctive case suffixes on nouns.
4 See Schiering et al. (2012) for a critical evaluation of this notion.
3.3 Morphosyntactic effects of rhythm
The phenomena covered by the contributions in part 3 can be grouped into one of the following three categories:
1. Rhythm as the decisive factor in the choice among alternatives.
2. Rhythm as one factor in a multifactorial explanation of the choice among alternatives.
3. Rhythm as the trigger for the violation of a morphosyntactic constraint.
The contribution by Schlüter presents examples of each of these categories from historical and contemporary English. Morphological examples are provided by cases where two variants for the inflection of a word exist, as is the case for the comparative of ‘bad’ in 16th/17th century English (‘worse’ vs. ‘worser’). The alternation between two forms for the past participle of particular verbs like ‘strike’ (‘struck’ vs. ‘stricken’) or ‘drink’ (‘drunk’ vs. ‘drunken’) in Middle to contemporary English is similar. For both cases, Schlüter shows that the disyllabic versions predominantly occur when the following word is stressed on the first syllable. This is especially the case in attributive uses of ‘worse’ and the participles when they are directly followed by a – typically initially stressed – noun. A syntactic alternation of a similar kind is observed by Schlüter for infinitival complements of verbs like ‘make’ and ‘dare’ in 16th–19th century English. These complements could occur with and without the infinitival marker ‘to’. Schlüter shows that again the occurrence of ‘to’ is significantly higher when the following infinitival verb has initial stress. In these scenarios, the choice of a disyllabic variant or the infinitival marker avoids a stress clash.
Such situations of allomorphy usually do not persist for long. While for rhythmic purposes it is beneficial to have a choice between different variants at one’s disposal, constraints on morphological paradigm architecture disprefer allomorphy for particular grammatical words. As Schlüter also notes, a contrast like ‘worse’ vs.
‘worser’ usually triggers a differentiation in function for the two forms, because morphological paradigms follow the principle of iconicity: the more marked form (i.e., ‘worser’) usually realizes the more marked meaning or feature. The mere fact that such cases can be found at all and may persist at least over several centuries, therefore indicates the relevance of rhythm. Rhythm can be decisive for word order in those cases where order is not determined by syntactic constraints for constituent order. This is the case with syntactic objects like coordinations and lists, the constituents of which may be ordered freely. Müller (1997) found that rhythm is relevant for word order in binomials
like “bow and arrow” (instead of “arrow and bow”).⁵ Schlüter discusses the case of coordinations of the disyllabic color adjective ‘yellow’ with monosyllabic color adjectives and shows that the preference for ‘yellow’ in second position is reinforced if it is followed by a noun with initial stress. A case where a syntactic constraint seems to be outranked by rhythmic demands is the non-canonical placement of the intensifier ‘quite’ in English, which has developed since the 19th century, as in ‘quite a different view’. Determiners usually occupy the initial position of a noun phrase. The non-canonical placement of ‘quite’ is an exception to this and can again be shown to occur more frequently in contexts where the intensified adjective has initial stress. Again, the non-canonical order avoids a stress clash. The non-canonical order of ‘quite’ has also been correlated with semantic differences, so here rhythm is just one of the factors that together explain the observed syntactic variation.
Shih, Grafmiller, Futrell and Bresnan present a quantitative corpus study on the choice of genitive constructions in English. Their data are chosen from the Switchboard corpus of Spoken American English. Several factors are known to influence the choice between the s-genitive (e.g., ‘the teacher’s house’) and the of-genitive (e.g., ‘the house of the teacher’), among them animacy and the length of the possessor. Shih et al. establish a eurhythmy distance measure (ED) for the genitive constructions (roughly, the distance between stressed syllables of possessor and possessum). A high of-ED should favor the s-genitive, and vice versa for the s-ED.
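The idea behind such a distance measure can be illustrated with a small sketch. It is only an approximation built on the rough gloss given above and should not be read as Shih et al.'s actual definition: it assumes that ED is the number of unstressed syllables between the last stressed syllable of the first noun phrase and the first stressed syllable of the material that follows it. The function name `eurhythmy_distance` and the hand-coded stress patterns are assumptions made for this example.

```python
def eurhythmy_distance(first, second):
    """Count unstressed syllables between the last stress of `first` and the first stress of `second`.

    Each argument is a list of booleans, one per syllable (True = stressed).
    Both lists must contain at least one stressed syllable.
    """
    trailing = first[::-1].index(True)  # unstressed syllables after the last stress
    leading = second.index(True)        # unstressed syllables before the first stress
    return trailing + leading


# 'the head of the constrúction': possessum 'head', then 'of the con-STRUC-tion'
of_ed = eurhythmy_distance([True], [False, False, False, True, False])

# 'the constrúction's head': possessor 'con-STRUC-tion's', then possessum 'head'
s_ed = eurhythmy_distance([False, True, False], [True])

print(of_ed, s_ed)  # 3 1
```

On this simplified count, the of-genitive of ‘construction’ leaves three unstressed syllables between the two stresses, while the s-genitive leaves one, which is the kind of contrast that a high of-ED is meant to capture.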
The logistic regression analysis uncovers a small but significant contribution of the of-ED in explaining the variation in the data, but only in the interaction with the factor animacy: for inanimate possessors, a high of-ED (i.e., with non-initial stress on the possessor noun, as in ‘the head of the constrúction’) significantly increases the probability of the s-genitive.
Ingason discusses two cases of rhythmically conditioned variation in Icelandic and English. Certain psychological verbs display an interesting variation in contemporary spoken Icelandic. Their subject may occur either with dative or accusative case. Ingason explains this with the preference to form trochaic feet in Icelandic: dative case adds a syllabic suffix when attached to a proper name, whereas accusative case has zero inflection. Therefore, a monosyllabic proper name is preferred with dative case and a disyllabic proper name with accusative case in the relevant contexts. Ingason’s experimental study on these structures shows that this contrast is a contrast in preferences rather than in grammaticality.
5 Compare this with the German binomial ‘Pfeil und Bogen’ (‘arrow and bow’). The relative order of the words for bow and arrow does not seem to matter for this phraseologism, it is determined by the principle of rhythmic alternation.
Therefore, he reconstructs his findings in a grammatical model that on the one hand derives variable outputs, while on the other hand incorporates the proposed interaction of the relevant grammatical factors. Ingason shows that this is possible with an extended version of Optimality Theory, developed by Coetzee (2004, 2006), called Rank-Ordering model of Eval (ROE). Ingason modifies this model by exploiting what he calls the ranking span as a measure of the difference in grammatical well-formedness between two expressions. The ranking span is used to predict the size of contrasts in empirical studies. Ingason puts this model to a further test on relative clause extraposition in English. Vogel, van de Vijver, Kotz, Kutscher and Wagner present studies on the role of function words in rhythmic optimisation in German. The Germanic languages have fixed word stress on content words like nouns, verbs, adjectives and adverbials. Grammatical elements, like inflectional affixes and in particular function words, do not need to have stress. German function words are often monosyllabic and may occur in metrically weak or strong positions, as well as cliticize on adjacent words. Vogel et al. hypothesize that function words are therefore ideal candidates for the rhythmic optimization of sentences. This is reflected in their finding that function words show more variation in prominence than content words. In a production study with the neuter singular pronoun ‘es’ (‘it’), this result is corroborated. Vogel et al. also present two case studies on word order effects of rhythm. The first study focuses on variable orders in verbal complexes consisting of an auxiliary, a modal and a full verb. For these verb clusters, the preferred order is Aux-V-Mod, but another order, V-Aux-Mod, is also possible, though dispreferred. In a recall task, Vogel et al. show that the marked V-Aux-Mod order is reproduced with greater accuracy when it has a rhythmic advantage. 
In a second recall experiment, Vogel et al. show the same effect for the placement of pronominal adverbs.
Schladebeck in her contribution deals with lists, a particular kind of syntactic construct that displays a high degree of rhythmicity. In her analysis of a corpus of conversational data, Schladebeck shows that not only are lists organized in a rhythmic fashion, but the elements of lists tend to be ordered isochronously. Rhythm is used to constitute coherence within lists and to manage turn continuation. The analysis also raises the issue of whether speakers orient towards an underlying silent beat in conversation.
Can rhythm be considered to be a syntactic principle? The present collection of studies on rhythm and grammar gives a positive answer, but only under an appropriate conception of language and grammar. The following aspects are characteristic of this conception:
– a focus on spoken language, production data and language use,
– an attempt to integrate the synchronic with the diachronic perspective,
– analytical statistical methods that make it possible to uncover multiple sources of variation,
– grammar models that reconstruct the interaction of multiple factors, yielding the observed variation.
Such a conception certainly differs from the standard view of morphosyntax, which is based on the assumption of a phonology-free syntax and largely abstracts away from crucial aspects of spoken language. Research on the interaction of rhythm and grammar therefore also has the potential to provide an inspiring new perspective on morphology and syntax.
References
Abercrombie, David. 1967. Elements of general phonetics. Edinburgh: University Press.
Auer, Peter. 2001. Silben- und Akzentzählende Sprachen. In Martin Haspelmath, Eckehard König, Wulf Oesterreicher, & Wolfgang Raible (eds.), Sprachtypologie und sprachliche Universalien – Language Typology and Language Universals. Ein internationales Handbuch – An International Handbook, 1391–1399. Berlin: Mouton de Gruyter.
Auer, Peter & Susanne Uhmann. 1988. Silben- und akzentzählende Sprachen. Zeitschrift für Sprachwissenschaft 7. 214–259.
Coetzee, Andries W. 2004. What it means to be a loser: Non-optimal candidates in Optimality Theory. Doctoral Dissertation, UMass Amherst.
Coetzee, Andries W. 2006. Variation as accessing ‘non-optimal’ candidates. Phonology 23. 337–385.
Comins, Jordan A. & Timothy Q. Gentner. 2013. Perceptual categories enable pattern generalization in songbirds. Cognition 128. 113–118.
Couper-Kuhlen, Elizabeth. 1993. English Speech Rhythm. Amsterdam: John Benjamins.
Cutler, Anne & Jacques Mehler. 1993. The periodicity bias. Journal of Phonetics 21. 103–108.
Dilley, Laura C. & Mark A. Pitt. 2010. Altering context speech rate can cause words to appear or disappear. Psychological Science 21. 1664–1670.
Hayes, Bruce. 1995. Metrical Stress Theory. Principles and Case Studies. Chicago: University of Chicago Press.
Lehiste, Ilse. 1977. Isochrony reconsidered. Journal of Phonetics 5. 253–263.
Morrill, Tuuli H., Laura C. Dilley, J. Devin McAuley & Mark A. Pitt. 2014. Distal rhythm influences whether or not listeners hear a word in continuous speech: Support for a perceptual grouping hypothesis. Cognition 131. 69–74.
Müller, Gereon. 1997. Beschränkungen zur Binomialbildung im Deutschen. Zeitschrift für Sprachwissenschaft 16. 5–51.
Nespor, Marina, Maria T. Guasti & Anne Christophe. 1996. Selecting word order: The Rhythmic Activation Principle. In Ursula Kleinhenz (ed.), Interfaces in Phonology, 1–26. Berlin: Akademie Verlag.
Pike, Kenneth L. 1945. The intonation of American English. Ann Arbor: University of Michigan Press.
Ramus, Franck. 2002a. Acoustic correlates of linguistic rhythm: Perspectives. In Speech Prosody 2002, International Conference. ISCA.
Ramus, Franck. 2002b. Language discrimination by newborns: Teasing apart phonotactic, rhythmic, and intonational cues. Annual Review of Language Acquisition 2. 85–115.
Ramus, Franck, Marina Nespor & Jacques Mehler. 1999. Correlates of linguistic rhythm in the speech signal. Cognition 75.
Sandford, Adam & A. Mike Burton. 2014. Tolerance for distorted faces: Challenges to a configural processing account of familiar face recognition. Cognition 132. 262–268.
Schiering, René. 2007. The phonological basis of linguistic rhythm: cross-linguistic data and diachronic interpretation. Sprachtypologie und Universalienforschung 60. 337–359.
Schiering, René, Balthasar Bickel & Kristine Hildebrandt. 2012. Stress-timed=word-based? Testing a hypothesis in Prosodic Typology. Sprachtypologie und Universalienforschung 65. 157–168.
Szczepaniak, Renata. 2007. Der phonologisch-typologische Wandel des Deutschen von einer Silben- zu einer Wortsprache. Berlin: de Gruyter.
Wiese, Richard. 1996. The Phonology of German. Oxford: Clarendon Press.
Part 1: Concepts of rhythm
Laszlo Hunyadi
1 Grouping, symmetry, and rhythm in language¹
1 Introduction
When using the word rhythm we often think of a sequence of pulses of some sort that contribute to the structural sensation of an event. We often associate it with the progress of patterns of sound as represented in music or speech, but rhythm can also be interpreted in a context beyond sound, in the arrangement of visual objects. It can also be associated with events involving non-physical objects such as the sequence of meetings, the appearance and disappearance of certain states of the mind and many other phenomena. All this suggests that what is common in what we associate rhythm with is the systematic occurrence of elements or features in general at distinct spaces, intervals or qualities (the latter may include various shades of color, degrees of intensity, illumination and probably any quality that is a basis for distinction between kinds of elements of objects or stages of the same object).
Rhythm can be essential both in its structural and perceptual/performative sense. The succession of constituting elements of an object (including physical objects and events alike) should be arranged in such a way that is characteristic of the given object, i.e. the intrinsic structural rhythm of elements can define the object itself. On the other hand, the relation between such elements has its significance in representation as well: perceiving a given structure one interprets it through some performative act (in music by musical performance, in language by speech, etc.). Whereas structural rhythm in this sense is an abstract phenomenon, in surface performance it needs physical means of representation. Whereas in music structural rhythm is most straightforwardly conceived to be associated with the temporal organization of successive elements and in language rhythm is usually understood similarly, time is only one means of representation.
Even in music and language intensity also plays an important role: the succession of strong and weak accents as well as beats equally contribute to the perception of the underlying rhythmic structure (see the findings of metrical phonology: Liberman 1975, Liberman and Prince 1977, Nespor 1982, 1986, Selkirk 1984, Hayes 1984, 1995; its relation to music: Lerdahl and Jackendoff 1983, and later work). Again, both in music and language further means of representation include the change of pitch values: playing the same musical tone or speaking at the same pitch level for a longer stretch of time gives the sense of monotony, and one loses the perception of structure by being unable to capture the structure-building rhythm behind the signal. (For a discussion of the cognitive relation between rhythm, timing and tempo in musical performance and perception as contrasted to musical notation cf. Clark 1999, Honing 2001, 2002.) Although time appears to be a fairly general means of representation for rhythm, arrangement of elements across space can also be an important choice, such as in the case of architecture or visual art in general. With certain abstract, non-physical objects, like states of mind, emotions etc., systematic variation of degrees of other characteristic properties is a possible form of structural representation, too; consider, for example, cases of bipolar disorder. Finally, more than one means of representation often contribute to structural representation in combination: in the case of speech alone, timing, change of F0 and intensity often (but not in a mandatory manner) join in this performative function.
1 This paper is based on the results of a series of earlier experiments and theoretical generalizations aimed at identifying the underlying cognitive basis of the faculty of language (cf. Hunyadi 2009, 2010, Hunyadi to appear), further new analyses of these earlier data as well as new experiments to offer an understanding of rhythm in language as a function of the cognitive principles of the faculty of language.
In this work on rhythm we will follow the main assumptions of our earlier proposal (Hunyadi 2009, 2010, and to appear) regarding the cognitive basis of the representation of structure across modalities. According to this proposal, structure is represented in terms of groups, grouping itself follows certain cognitive principles, and the representation of grouping is structure dependent. Whatever the physical nature of the means of representation of structural rhythm, rhythm can be conceived as the product of grouping.
Namely, rhythm is always about the succession of more than one, actually more than two elements of some sort, and this succession needs to be structured. Structure means that there must be some boundary between the elements; consequently, a rhythmic structure always involves at least two groups of elements. Since, accordingly, the minimal number of elements involving two groups is three, the minimal rhythmic pattern is such that one group consists of two elements and the other group consists of just one element (this single-element group in mathematics is called the trivial group) and there is a boundary of some sort between the two groups. This boundary can then be represented on the surface (alone or in combination) by a change in time, change in pitch, change of spatial coordinates or other relevant physical and measurable features.
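The counting argument above can be illustrated with a small enumeration. This is only a sketch under one stated assumption that the text implies but does not spell out: a grouping counts as rhythmic only if it has at least two groups and at least one of them contains more than one element. The function names `groupings` and `rhythmic_groupings` are hypothetical.

```python
from itertools import combinations


def groupings(n):
    """All ways to cut a sequence of n elements into two or more contiguous groups.

    Each grouping is returned as a tuple of group sizes, e.g. (2, 1).
    """
    result = []
    for boundary_count in range(1, n):
        # Choose the positions (between elements) where boundaries fall.
        for cuts in combinations(range(1, n), boundary_count):
            edges = (0, *cuts, n)
            result.append(tuple(b - a for a, b in zip(edges, edges[1:])))
    return result


def rhythmic_groupings(n):
    """Groupings with at least one group of more than one element (assumption)."""
    return [sizes for sizes in groupings(n) if max(sizes) > 1]


# Smallest sequence length that admits a rhythmic grouping under this assumption.
smallest = next(n for n in range(1, 10) if rhythmic_groupings(n))
print(smallest, rhythmic_groupings(smallest))
```

Under this assumption a two-element sequence only allows 1 + 1, so the first admissible patterns appear with three elements, as a group of two next to a group of one.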
Considering grouping as a general principle of representing structural rhythm across various modalities (music, language, dance, architecture etc.)², we believe that in order to capture the basic underlying and at the same time meaningful properties of linguistic rhythm we need to study how grouping in language is related to properties of grouping in general. To do so, we first need to identify the general properties of grouping in a modality abstracted away from language as far as possible and then identify these, supposedly language-independent, general underlying properties of grouping in language as well, i.e. in linguistic patterns that are generally believed to have an underlying rhythmic structure.
2 Grouping of abstract visual patterns
In order to capture some of the basic and possibly modality-independent properties of grouping we carried out production experiments with stimuli involving sequences consisting of uniform graphical forms of dots (•).³ The modality of a sequence of dots in various combinations was chosen because it was believed that an object taking the shape of a sequence of dots only is abstract enough to represent the underlying structure of virtually any kind of objects: a dot possibly has the simplest geometric form such that within a sequence these dots will only have a single kind of relation (linear sequence), further, it is semantically quite vague (in a usual context it is semantically empty), so there is no context-dependent interpretation that would influence its surface representation. In addition, the expected means of representation was also very simple: mouse clicks for each of the dots of the pattern to be represented. What was measured was also a single feature – time, which in turn can only be captured quantitatively, by absolute values. By choosing patterns of such an abstract kind and their representation by time only we wished to reduce the complexity of the experiment as compared to patterns in more complex modalities involving the measurement of such qualitative features as ‘intensity-change’ or ‘F0-change’ for which intervals, threshold values beyond absolute ones should be determined to be considered significant. The experiments were based on sequences of dots “•” grouped by means of bracketing, such as (1) through (3):
2 For the neurological basis of cross-modality of rhythm cf. Sacks 2007.
3 Below we are referring to some relevant details of a series of earlier experiments we described and evaluated more extensively in Hunyadi 2009, 2010 and Hunyadi to appear.
(1) ••••
(2) (••)(••)
(3) •(••)•
in which the brackets indicated grouping. The subjects (50 university students) were presented with these and further (altogether seven) patterns on the computer screen, were given the minimal instruction about the general function of the brackets (that they indicated grouping of the dots) and were asked to represent each individually displayed pattern by a sequence of mouse clicks for each dot. No further instruction was given. Patterns were presented in random order. Both the onset time for the first mouse click (which records the preparation time needed for starting the sequence of clicks) and the onset time of each click were recorded using the Macintosh program PsyScope and various Macintosh computers. In some later runs of experiments and in the case of more complex patterns a ButtonBox with 1 ms accuracy of measurement was also used. As expected, the subjects chose individually variable tempo for performance. In order to exclude the factor of the variability of individual tempo, the received data were normalized. What we calculated was the interval between the individual clicks. Accordingly, out of the four values corresponding to the four clicks we calculated three values, each representing the duration of an interval (called ‘segment’) between two subsequent clicks (segment “a” for the interval between the clicks for the first two dots, segment “b” for the interval between clicks for the second and the third dot and segment “c” for the interval between clicks for the third and the fourth dot). We were interested in the distribution of these intervals as a function of the displayed pattern. Namely, we wished to know a. whether the three intervals were equal in length in case no structural boundary was overtly present in the displayed pattern, as in (1), and b. whether there was any change in the length of the same intervals as a reflection of a structural boundary present in the stimulus (as for the segment b in pattern (2) and for the segments a and c in pattern (3)). Furthermore, we also wished to know whether any change at a boundary or elsewhere was regular, i.e. whether it was structure dependent. The results are shown in Figure 1 below:
[Figure 1 plot. Title: Segment length in abstract visual patterns. Vertical axis: segment length in % of total pattern length (ticks 0–75). Horizontal axis: segments a, b, c. Series: (1) ••••, (2) (••)(••), (3) •(••)•.]
Figure 1: Basic rhythmic patterns of four abstract visual elements. Structural variation vs. average means of segments
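As an illustration of the measure plotted in Figure 1, the conversion from click onset times to segment lengths could be sketched as follows. The exact normalization procedure is not given in the text, so expressing each interval as a percentage of the total pattern length (as on the vertical axis of the figure) is an assumption, as are the function name `normalized_segments` and the click times, which are invented for the example and are not experimental data.

```python
def normalized_segments(click_times_ms):
    """Return inter-click intervals as percentages of the total pattern length.

    click_times_ms: onset times of successive clicks, in milliseconds.
    """
    if len(click_times_ms) < 2:
        raise ValueError("at least two clicks are needed to form a segment")
    intervals = [
        later - earlier
        for earlier, later in zip(click_times_ms, click_times_ms[1:])
    ]
    total = sum(intervals)
    if total <= 0:
        raise ValueError("click times must be increasing")
    # Dividing by the total removes the effect of individual tempo.
    return [100.0 * interval / total for interval in intervals]


# Hypothetical click onsets for a four-dot pattern (not measured values).
clicks = [0, 200, 500, 700]
segments = normalized_segments(clicks)
for name, share in zip("abc", segments):
    print(f"segment {name}: {share:.1f}% of pattern length")
```

With these invented values the middle segment comes out longer than the two outer ones, which is the kind of profile the text associates with a group boundary between the second and third dot.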
Data shown in the figure reveal two important properties of representation to be discussed below: inherent grouping and symmetry as related to the surface representation of structural relations.
a. Inherent grouping
Looking first at the segment data of pattern (2) consisting of isomorphic groups we see that segment a and segment c are equal in length and that segment b is longer. We suggest that this lengthening is due to the group boundary between the first and the second pair of dots. Accordingly, underlying structure as denoted in the form of the displayed pattern and its representation by mouse clicks are in full agreement: boundary is denoted by segment lengthening. As the representation of pattern (1) reveals, there is a noticeable difference between the structure of the pattern and its representation. As for the structure, the pattern itself does not contain any brackets; accordingly, the structure the pattern stands for consists of a single group of four equal elements. As for representation, we find that the subjects perceived the pattern as consisting of two groups: the intervals shown by segment a as well as segment c are equal in length and shorter than segment b. This fact suggests that the subjects interpreted a group boundary between the first two and the last two dots. Since there was no overt indication of a group boundary at that location in the pattern itself but, still, the subjects did assign it such an interpretation, we suggest that inherent grouping is a cognitive principle that acts upon a set of objects (in our case: a sequence
of dots) letting subjects interpret a group boundary within it at some place even if there is no overt boundary marker to denote it. As the comparison of (1) with the analysis of pattern (2) above suggests, the place where inherent grouping found in (1) is interpreted is regular: between the first and the last two dots, cutting the pattern into two equal halves. This is the case of inherent grouping. The analysis of pattern (3) suggests that the contrast between structural position (an element (in our case a dot) belonging to a group or being outside it) and segment length are closely related, too. In (3) the shorter segment is b, the one between the first and the fourth dot with a group boundary on both sides. Its length, in addition, is very similar to the length of the two short segments in pattern (2), suggesting an important feature of grouping: this temporal relation between the representation of segments within and outside of groups is not only similar in length but is independent of the surface order of the given long and short segments. Namely, a similar relative group length is found in patterns with different linear arrangement but with the same grouping feature. As for the denotation of group boundary, in general we can observe that there is a lengthening at group boundary and shortening within the group itself. We also notice that boundary length of the same type of segment is longer in (3) (segments a and c) than in (2) (segment b). Later, in Section 2c we will suggest that this observation points to a regularity that is structure dependent. Back to pattern (1): the inherent grouping present in pattern (1) has the same character but a different quantity (with a smaller value of relative lengthening at the group boundary) than in (2) and (3). We suggest that this quantitative difference holds for inherent grouping in general.
That inherent grouping is general enough to be manifest in longer sequences of elements as well is demonstrated in a separate experiment described in more detail in Hunyadi 2009. In that experiment involving 29 subjects, all university students, the task was similar to the one discussed above but the number of dots ranged between 3 and 9. The patterns were different in the sense that no bracketing was used to indicate structural grouping. We were interested to know if inherent grouping was manifest in such longer sequences of dots as well. Using the paired one-tail t-test the results convincingly (p < 0.5) support this assumption. It appears that in the case of patterns with more than 4 elements three groups are formed with respective group boundaries significantly different; cf. (4): (4)
3 dots: 2 + 1
4 dots: 2 + 2
5 dots: 2 + 2 + 1
6 dots: 3 + 2 + 1
7 dots: 3 + 3 + 1
8 dots: 3 + 3 + 2
9 dots: 4 + 3 + 2
b. Symmetry in the surface representation of structural relations
Looking at Figure 1 again we notice that inherent grouping in (1) and overt grouping in (2) and (3) have in common three important properties: first, group boundaries are represented by segment lengthening (the distance between clicks for two adjacent dots is longer when there is a group boundary between them), second, the average durational difference between long and short segments appears to be essentially the same without regard to their structural position in the pattern, and third, grouping is done in a way that results in a quantitatively symmetrical structure as well. Accordingly, (1), (2) and (3) are grouped into a sequence of segments where the first and the third segments are of the same length with the middle segment differing. This symmetry lends a noticeable rhythm to the representation of the pattern. This rhythm is, however, already present in the underlying structure of the pattern as well so that symmetry in representation is mapped onto symmetry in the underlying structure. In this sense pattern (1) is a special but not radically different case: the surface representation of symmetry is based on the similar, conceived but not overtly marked symmetrical structure, whereas in (2) and (3) it follows overt marking of symmetry.
c. The role of hierarchy in surface representation
Whereas symmetry appears to be an essential property of grouping, we do not always find symmetrical segments on the two sides of a boundary. Since segment length is structure dependent so that the length of a segment depends on whether it includes a structural boundary, we may expect that segment length will also vary according to the type of the given boundary.
Our experiments with embedded arrangement of groups of dots show that the boundary length (or length of a segment) also depends on whether sequences of two groups are at the same or different level of the given structural hierarchy. Consider example (5) below:
(5) •(•(••))•
As Figure 2 indicates, the boundaries between the first and the second, and between the second and the third dot, i.e. the length of segments a and b, respectively, are not of the same size: a is longer than b. Furthermore, we also notice that the boundary between the last two dots, i.e. segment d, is of yet another length: it is actually longer than a. Checking segment length against structural position we can see that embedding involves shortening of a boundary in such a way that recursive application of embedding results in recursive shortening. Conversely, de-embedding (the case of stepping out of embedding after the fourth dot in our example) is represented as relative lengthening of the given boundary. Accordingly, for the same hierarchical level, de-embedding proves to be longer than embedding. Namely, as shown in Figure 2, de-embedding being represented by lengthening and the fifth dot structurally belonging to the same hierarchical level as the first dot, the boundary of de-embedding represented by segment d is not only longer than the preceding segment of embedding b, but also longer than segment a with the first embedding.
[Figure 2 plot. Title: Segment length vs. hierarchical position. Vertical axis: segment length in % of total pattern length (ticks 0–35). Horizontal axis: segments a, b, c, d. Series: (5) •(•(••))•.]
Figure 2: Embedding in patterns with abstract visual elements: The shortening/lengthening of segment duration according to structural position
On the basis of our examples (1)–(5) we can generalize that grouping is rhythmical, i.e. the establishment of boundaries between adjacent groups lends a certain pulse to a given sequence of groups, and that grouping tends to create symmetry. However, grouping is structure-dependent and this property has its effect on the realization of the actual symmetrical relation as well. Whereas the default case of symmetry is when two groups occupy the same level in a hierarchical structure (coordination relation, such as (1), (2) and (3)) with virtually equal boundary length between them, a sequence of groups at different hierarchical levels (subordination, embedding/de-embedding, such as (5)) requires shortening/lengthening of their boundaries. Accordingly, symmetry of two groups is determined for coordination as a default, represented by equal segment length, and subordination applied to it by requiring a modification of this segment length according to hierarchical position (shortening for embedding, lengthening for de-embedding).
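To make the notion of structural position concrete, the bracketing of a pattern can be turned into a per-segment description. The sketch below is an illustration only: it tabulates, for each segment between two adjacent dots, the nesting depth of the dot on its left, the nesting depth of the dot on its right, and the number of brackets crossed. The function name `segment_profile` and this three-value encoding are assumptions made for the example; the code describes positions and does not predict durations.

```python
def segment_profile(pattern, dot="•"):
    """Describe each segment between adjacent dots in a bracketed pattern.

    Returns a list of (depth_left, depth_right, brackets_crossed) tuples,
    one per segment, in left-to-right order.
    """
    depth = 0            # current nesting depth
    crossed = 0          # brackets seen since the previous dot
    previous_depth = None
    profile = []
    for char in pattern:
        if char == "(":
            depth += 1
            crossed += 1
        elif char == ")":
            depth -= 1
            crossed += 1
            if depth < 0:
                raise ValueError("unbalanced brackets")
        elif char == dot:
            if previous_depth is not None:
                profile.append((previous_depth, depth, crossed))
            previous_depth = depth
            crossed = 0
        elif not char.isspace():
            raise ValueError(f"unexpected character: {char!r}")
    if depth != 0:
        raise ValueError("unbalanced brackets")
    return profile


for name, entry in zip("abcd", segment_profile("•(•(••))•")):
    print(name, entry)
```

For pattern (5) the depth rises across segments a and b (embedding), stays constant in c (group-internal), and drops by two levels in d (de-embedding), which matches the positions the text associates with recursive shortening and with lengthening.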
Although, as for timing, the prototype of rhythm is based on the equal distribution of pulses based on default symmetrical relations of coordination, we suggest that pulses in which symmetry is modified according to hierarchical position also lend the sense of rhythm. This rhythm will, however, be the result of certain calculations, the calculation of the relative position of pulses within a hierarchy. Accordingly, we may see a difference between two kinds of rhythm depending on the structure they are generated by: the default rhythm of a sequence of groups at equal hierarchical levels (the case of coordination) is based on the default case of the symmetry of equal groups with equal spacing of pulses, whereas the derived rhythm of a sequence of groups at different hierarchical levels (the case of subordination, embedding/de-embedding) is the result of calculation by modifying the default values for symmetry as a function of the given position in the hierarchy. The output is a modified spacing of pulses but, since structurally determined, regular enough to be perceived as rhythmical. Since in general calculation is expensive, we can expect that the representation of symmetrical patterns with default rhythm requiring less calculation is less expensive than that of patterns with derived rhythm. This is what we see when looking at the processing time required to represent symmetrical vs. nonsymmetrical patterns of the kinds (6) vs. (7) and (8), respectively:
(6) •(••)(••)
(7) •(•••)(••)
(8) •(••)(•••)
We measured the onset times for the representation of patterns (6)–(8) in the experiment involving 50 university students we referred to earlier (cf. Hunyadi 2010) and found that the average processing time to represent (6) was 1156.92 ms as compared to 1334.47 ms for (7) and 1834.04 ms for (8). Without listing all other combinations here, let it suffice to mention that we found that the processing time for the representation of symmetrical structures was significantly shorter than for that of non-symmetrical ones. This observation allows us to suggest that when determining the proper representation of a pattern first we look for the presence of rhythm in the given pattern manifested by symmetrical grouping. When such a rhythm is found, its representation does not need any further calculation, but when no such rhythmic pattern is present, we first need to create the symmetry by adjusting the length of the shorter segment to that of the longer one. Since this latter case requires a process computationally more expensive, it can account for the longer processing time for (7) and (8) as compared to the one for (6). The
fact that (8) needs more processing than (7) can be accounted for by the following. Similarly to (6) with the last two groups overtly symmetrical, the underlying principle of symmetry should also be met in both (7) and (8) with no overt symmetrical structure. While looking for the shorter segment whose length should be adjusted to the length of the longer one, in the case of patterns like (8) forward looking is required (the advance calculation of the duration of the group with more elements on the right so that the duration of the group on the left should be matched with it), whereas in the inverse case of (7) no such planning in the form of forward looking is required. It is this complexity of forward looking (first look forward, calculate, then go back and adjust the length on the left) that requires more processing time. Whereas (6), (7) and (8) all follow the principle of symmetry, this complexity of forward looking explains why pattern (8) needs the longest processing time.
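The three cases distinguished above can be restated as a small classification over the bracketed groups of a pattern. This is a sketch, not a model of the experiment: the function names and the three labels are hypothetical, and the only data used are the average onset times quoted in the text for patterns (6)–(8).

```python
import re

# Average processing times (ms) as reported in the text for patterns (6)-(8).
REPORTED_ONSET_MS = {
    "•(••)(••)": 1156.92,
    "•(•••)(••)": 1334.47,
    "•(••)(•••)": 1834.04,
}


def bracketed_group_sizes(pattern, dot="•"):
    """Number of dots in each bracketed (non-nested) group, left to right."""
    return [group.count(dot) for group in re.findall(r"\(([^()]*)\)", pattern)]


def planning_case(pattern):
    """Classify a pattern by the relation between its last two bracketed groups."""
    sizes = bracketed_group_sizes(pattern)
    if len(sizes) < 2:
        raise ValueError("pattern needs at least two bracketed groups")
    left, right = sizes[-2], sizes[-1]
    if left == right:
        return "symmetrical"
    if left > right:
        # The longer group comes first, so its length is known in advance.
        return "adjust without look-ahead"
    # The longer group comes second and must be anticipated.
    return "adjust with look-ahead"


for pattern, onset in REPORTED_ONSET_MS.items():
    print(f"{pattern}: {planning_case(pattern)}, {onset} ms")
```

Listed this way, the ordering of the reported times follows the ordering of the three cases: no adjustment, adjustment without look-ahead, adjustment with look-ahead.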
3 Grouping of abstract prosodic patterns
As mentioned earlier in Section 1, there are a number of means of representation for grouping and most may occur in combination. Any modality has a default means of grouping specific to the given modality. Whereas timing appears to participate in representation across many modalities including language, we suggest that, depending on the modalities, there can be other candidates to be the default as well. Speech being about sound, and the most prevalent feature of sound in speech being F0-variation, we suggest F0-variation to be the default means of the representation of grouping in speech. In earlier experiments (cf. Hunyadi 2010) we showed that whereas timing (variation of segment length) is not obligatorily used for grouping, pitch (F0-variation) is always present. We also showed that group boundary is also denoted by a specific change, this time the change of the direction of pitch movement. Consider (9) and (10):
(9) ABCD
(10) (AB) (CD)
In these patterns the capital letters stand for dots in abstract visual patterns (1) and (2), respectively. Pronounced using their names as words, they form abstract prosodic patterns whose benefit is that, similarly to the previously studied abstract visual patterns, they are void of semantic content that could potentially have an effect on the representation of grouping by pitch variation. Being prosodic in
nature, however, they allow us to study the role of certain sound-related characteristics in the representation of various kinds of structure involving sound. Also, they can function as a bridge between two worlds: the study of abstract prosodic patterns is abstract enough to allow us to relate them to the even more abstract world of visual patterns and specific enough to be related to the more specific world of human speech alike. As for the representation of structure, our observation was that group boundary was represented by an opposite direction of pitch movement at the two sides of the boundary, whereas elements within a single group were represented by a combination of tonal elements resulting in a continuous pitch movement. As for (9) with no overt marking of grouping using brackets, 3 out of the 23 subjects denoted inherent grouping by pitch variation, the rest used a sequence of equal pitch movement (here and through pattern (11) below the number of cases for each kind of pitch variation is shown in brackets):
(9) ABCD
inherent grouping:
rise fall rise fall (2)
fall rise fall rise (1)
no surface grouping:
rise rise rise fall (14)
fall fall fall fall (6)
As for (10), all subjects used some pitch distinction to denote belonging to the same or different group:
(10) (AB) (CD)
fall rise fall fall (15)
rise fall fall fall (5)
high_level high_level low_level low_level (2)
rise fall rise fall (1)
The next pattern is the case of hierarchical embedding:
(11) A (B(CD))E
As a first approximation, pitch movement reflected group boundaries similarly to the above two patterns:
rise rise fall rise fall (12)
rise rise fall fall fall (6)
rise fall rise fall fall (6)
fall rise fall fall fall (2)
Recursive embedding was represented by recursive lowering of “fall” or raising of “rise”. Accordingly, we found that recursive shortening for embedding characteristic of abstract visual patterns has a similar counterpart as pitch movement in abstract prosodic patterns. Also, the final “rise” for de-embedding, applied in the majority of the cases, rose higher than the “rise” of the first and second embedding, thus showing the same gradual change as found in shortening/lengthening at boundaries. The remaining cases, with a “fall” for de-embedding, used timing, i.e. lengthening, for the same function. Similarly to the function of timing in denoting rhythm we found that a systematic change of pitch can also contribute to the sense of rhythm (‘tonal rhythm’) when applied to group formation. Accordingly, default rhythm is the effect of the succession of sequences of pairs (groups) of “rise” and “fall” or “fall” and “rise”, whereas derived rhythm (the case of embeddings) is represented as the effect of calculations of relative pitches involving pitch movements of the same kind (either “rise” or “fall”). We also found that variation by pitch movement to represent structure is the default means of grouping in prosody: timing found across modalities as means of grouping is not specific to prosody alone and, as such, is not mandatory in the representation of grouping. Our observations give support to Schenker’s principle of musical analysis (cf. Schenker 1935), in which he assumes priority of pitch in denoting rhythm in tonal music. According to him meter and pitch are two distinct phenomena: meter is associated with durational patterning, whereas tonal variation (the sequence of tones) itself contributes to rhythm independent of meter.
However, we need to add here the following. Rhythm has at least two distinct functions: one is context-independent and is structural, the other is context-dependent and is contextual. In our experiments we are dealing with rhythm being related to formal structure, and from this point of view timing (meter) and pitch both contribute to the representation of the rhythm of the underlying structure. When representing individual contextual features, however, timing and pitch variation can have different functions both in speech and music.
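As a small worked check of the response counts quoted earlier in this section for patterns (9) and (10), the shares of each kind of pitch realization can be tallied as follows. The counts are copied from the text; the dictionary layout, the helper name `share` and the set of contours treated as marking inherent grouping in (9) simply restate the labels given there.

```python
# Counts of pitch realizations as reported in the text for patterns (9) and (10).
RESPONSES = {
    "(9) ABCD": {
        "rise fall rise fall": 2,
        "fall rise fall rise": 1,
        "rise rise rise fall": 14,
        "fall fall fall fall": 6,
    },
    "(10) (AB) (CD)": {
        "fall rise fall fall": 15,
        "rise fall fall fall": 5,
        "high_level high_level low_level low_level": 2,
        "rise fall rise fall": 1,
    },
}

# Contours listed in the text under "inherent grouping" for pattern (9).
INHERENT_GROUPING_9 = {"rise fall rise fall", "fall rise fall rise"}


def share(counts, selected):
    """Proportion of responses whose contour is in the selected set."""
    total = sum(counts.values())
    return sum(counts[contour] for contour in selected) / total


for pattern, counts in RESPONSES.items():
    print(pattern, "responses:", sum(counts.values()))

inherent_share = share(RESPONSES["(9) ABCD"], INHERENT_GROUPING_9)
print(f"inherent grouping marked by pitch in (9): {inherent_share:.1%}")
```

Both patterns add up to the 23 subjects mentioned in the text, and the contours labelled as inherent grouping in (9) account for 3 of them.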
4 Grouping in actual speech utterances
a. the role of pitch variation: the bookmark effect and tonal continuity
Undoubtedly, the fundamental structure of an utterance strongly builds on tonality. E.g., according to Beckman and Pierrehumbert (1986) the edge of an intonation phrase is marked by a phrase tone and a boundary tone, whereas that of the intermediate phrase by a phrase tone only. As for the role of tonality in grouping, we have also found that, similarly to using pitch variation for grouping in abstract prosodic patterns, the same pitch variation is a means of representation of group structure in speech utterances as well. Consider the following utterance (12):
(12) Why don’t you – I said – go and find it?
Figures 3 and 4 below show the pitch contour of (12) with the inserted material (I said) and without it (the latter produced by removing I said from the original recording), respectively (the arrow showing the location of the removal)⁴:
Figure 3: Why don’t you – I said – go and find it?
4 The Hertz values in the y-axes of figures 3–7 have to be read as the real values divided by 10.
Figure 4: Why don’t you [..........] go and find it?
The sentence itself consists of three distinct segments: (a) Why don’t you, (b) I said and (c) go and find it. In Figure 3 we can notice that segments a and c have similar pitch excursions both in shape and tonal range, whereas segment b is distinct both by shape and tonal range. Figure 4 with segment b removed shows that segment c continues exactly where segment a ends. This is the case of tonal continuity as the result of the so-called bookmark effect as described in Hunyadi 2010. According to this effect, two discontinuous prosodic segments forming a structural unit but separated in their linear sequence by some other material (embedding or insertion) join prosodically in such a way that the second segment continues where the first left off. That it involves a certain memory effect as well is shown by the fact that regardless of the length of the intervening material the effect of downdrift, characteristic of the flow of speech, does not take place: the tonal contours of the two discontinuous segments match exactly. We assume that tonal continuity is based on the principle of symmetry, too. Namely, as symmetry is a relation between groups of elements that are equal in terms of some characteristic parameter(s) such as duration for the relation between equal pulses, we can also assume that pitch contour is a similar parameter with the role of establishing a symmetrical relation between two tonal groups (sequences of tonal elements). The effect of this kind of symmetry produced by the tonal similarity of two such tonal groups is the perception of tonal rhythm. Accordingly, similarly to the role of timing in denoting symmetrical groups in
such a way that two segments of equal length are divided by a segment of a different length (cf. the rhythmic representation of pattern (2) (••)(••) in Section 1), the insertion of a segment whose tonal properties differ from those of the two tonally similar segments on either side of it has a symmetrical rhythmic effect, too. The rhythm produced by such a tonal configuration is an example of default rhythm between two (or more) coordinated segments. Applying the principle of tonal continuity to cases of embedding we get, as expected, an instance of derived rhythm; cf. (13): (13)
The cat that the fox that was rabid bit ran away.
Figure 5 shows the pitch contour of (13):
Figure 5: The cat that the fox that was rabid bit ran away.
Figures 6 and 7 show the pitch contour of (13) with the first as well as the first and second embeddings removed, respectively:
Figure 6: The cat that the fox [........] bit ran away.
Figure 7: The cat [........] ran away.
These figures demonstrate the effect of embedding on the tonal organization of the utterance: recursive embedding is represented by recursive lowering of the pitch contour of the embedded segment, whereas the principle of tonal continuity ensures that two discontinuous prosodic segments, such as those before and after an embedding, join to form a single tonal unit. The effect of tonal continuity is the tonal symmetry of the two discontinuous
segments, the basis for rhythmic grouping (default rhythm), whereas the effect of the recursive lowering of the pitch contour in combination with tonal symmetry involves a calculation on this default rhythm to produce the final derived rhythm of the utterance as a whole.
b. the role of timing in the representation of group boundaries
As we saw earlier, there is a clear role of tonality in the representation of grouping in abstract prosodic patterns and speech utterances. Duration expressed by pauses is actually a more general means of grouping, not specific to language. Pauses are also considered to be associated with phrase boundaries (cf. Cooper and Paccia-Cooper (1980)). In this view, the duration of a pause is even determined by the hierarchical position of the given boundary. Moreover, according to Selkirk (1984), Taglicht (1998) and Downing (1970), intonation phrase boundaries are themselves considered to be determined on the basis of possible or obligatory pauses. Accordingly, we wanted to find out how pauses (and, more generally, timing) contribute to the representation of grouping in speech utterances. In our reading experiments with 50 students we examined how embedding in sentences was represented by duration. Sentences included the Hungarian equivalents of sentence (13) above, containing recursive embedding. Since in the case of abstract visual patterns we found segment lengthening as an indication of a group boundary, and on the assumption that grouping in speech would follow this effect of timing, we expected to find a pause of some length at group boundaries. However, as a general observation we found that pausal timing was secondary to pitch variation in the representation of recursive embedding: whereas we had assumed that the most obvious time-related marker of a boundary before and after embedding would be a separate additional pause at the boundary, a pause was only clearly manifested at the last de-embedding.
In all other cases timing took the form of the less articulated final shortening on the segment before embedding and final lengthening on the segment before de-embedding (for details on various aspects of lengthening/shortening in phonetics, phonology and syntax cf. Beckman 1992, Beckman and Edwards 1990, Beckman et al. 2002, Cooper 1976, Hayes 1989, 1995, Hockey and Fagyal 1998, Ladd and Campbell 1991). At the same time, the structural conditions of lengthening/shortening clearly corresponded to the lengthening and shortening of segments observed in the representation of abstract visual patterns thus indicating that lengthening/shortening was also supported by general principles beyond linguistic structure.
5 Considerations on computing the rhythm of speech utterances
We have arrived at the point where we have seen that rhythm is determined by the inherent principle of grouping, that grouping in its turn is based on the most probably also inherent principle of symmetry, and, finally, that the actual values of this symmetry are modified according to the hierarchical position of the given structural units to produce the derived rhythm of the representation of a pattern, including that of a speech utterance. Although we are not in a position to determine the exact content of the calculations leading to the final parametrization of the rhythmic pattern of a concrete speech utterance, we assume that these calculations are not language-specific either; rather, they can be applied to any modality, including language. In order to approach this problem with the intent to capture at least some of the modality-independent properties of such a calculation, our starting point will again be the observation of the representation of abstract visual patterns, which may lead to generalizations desirably applicable to rhythm in language as well. The main question that we will ask in this respect is what individual factors or relations determine the timing of rhythmic groups, both in its absolute and relative sense. Namely, we assume that each of us follows an internal tempo that is individually determined (both locally, event-dependently, and globally, across events) but that the relative timing of pattern representations is also structure dependent, following certain universal, subject-independent rules. An example: as we have seen, embedding is represented by the shortening of segment length at the boundary of embedding as compared to boundary length between groups of coordination (cf. Figure 1 and Figure 2 above), and we have a strong intuition that the length of the corresponding boundaries or the relative difference between them is not arbitrary.
We may suggest that this length is relative to the length of other segments within the pattern, be it the length of the pattern as a whole or that of another segment, whether they are adjacent or not. Another dimension of the question is that, even if the total length of a pattern of any complexity is known, we might want to know if the length of the constituting segments of its groups is distributed evenly and, if not, what determines their actual value. The complexity of these issues and the subject matter of this paper only permit us to restrict our attention to a single question: given a sequence of two groups of elements, and given that the principle of symmetry is a determining factor behind the temporal arrangement of these groups and of the elements within each of the groups, does the structure of either of the groups (especially the number of the constituting elements) have an effect on the temporal representation of the other, including the pause (the boundary length) between the groups? Since we know that different patterns have different onset times for representation and that onset time is structure dependent (see the relevant discussion in Section 2), i.e. representation is the output of some calculations, this question implies that representation involves planning. A possible answer to this question should then include the description of this planning mechanism. Let us now examine which principles underlie the planning of the temporal arrangement of groups, including abstract visual patterns and speech utterances, and under what conditions.
a. Principles of planning: inherent grouping, symmetry, and hierarchy
As we saw in Section 2 and Section 3 above, there is a clear correspondence between underlying structure and surface representation. As was shown in previous sections, the fundamental structural distinction between elements forming the same group or belonging to different groups is essentially denoted by the distinction between short vs. long segment duration, representing the lack or presence of a boundary between the constituting elements, respectively. Another fundamental aspect of comparing elements within a structure is whether they are at the same or different level in the given structural hierarchy. The simple case of coordination is the one between elements at the same level of hierarchy, denoted by the simple contrast of long and short segments. Hierarchical embedding builds on this contrast but also differs from it. Namely, embedding also takes place between two groups; accordingly, their representation also includes the contrast of long and short segments, long standing for the boundary segment. However, the hierarchical difference between the same two groups also needs to be represented, and this distinction is again made by a variation of duration, i.e. shortening for embedding and lengthening for de-embedding.
The fact that the relative durational difference between short and long segments is structure dependent, i.e. that it is a function of the underlying structure of a pattern rather than of its (individual) surface representation, is shown by the estimation of correlations between the kinds of segments in the same pattern. This is what we see in the following comparison, using the pairwise method of estimating correlations (here and in all further estimations of correlation, the significance probability is < 0.05). Cf. the case of coordination in (2) and (3), repeated here as (14) and (15), respectively, and the case of embedding in (16): (14)
(••)(••) (= (2))
Table 1: Estimation of correlations in (14)
14a–14b   14b–14c   14a–14c
0.3097    0.4227    0.8830
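As an illustration of what a pairwise estimation of this kind involves, the following minimal Python sketch correlates every pair of segments across subjects. It is only a sketch under stated assumptions: the duration values are invented for illustration (they are not the experimental data), and the use of Pearson's r as well as the names `pearson` and `pairwise_correlations` are assumptions, since the text does not specify the estimator beyond "the pairwise method".

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def pairwise_correlations(durations):
    """Correlate every pair of segments; keys are (segment, segment) tuples."""
    return {
        (s1, s2): pearson(durations[s1], durations[s2])
        for s1, s2 in combinations(durations, 2)
    }


# HYPOTHETICAL durations in ms, one value per subject, for the three
# segments of a pattern like (••)(••): a and c are the within-group
# intervals, b is the boundary interval. Not the data behind Table 1.
durations = {
    "a": [300, 320, 350, 290, 410],
    "b": [900, 780, 850, 800, 890],
    "c": [310, 330, 340, 280, 400],
}

for pair, r in pairwise_correlations(durations).items():
    print(pair, round(r, 4))
```

With data of this shape, a subject who is slow on segment a tends to be slow on segment c as well, which is what a high a–c coefficient expresses; the boundary segment b varies more independently.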
As we see, there is a strong correlation between the two 2-dot segments (14a and 14c, 0.8830) and weak or moderate correlation between each of the 2-dot segments and the boundary segment (14a and 14b, 0.3097 and 14c and 14b, 0.4227). This difference in correlation is what we expect as the temporal representation of two coordinated groups of equal size. A similar example involving coordination but with opposite arrangement of long and short groups is (15): (15)
•(••)• (= (3))
Table 2: Estimation of correlations in (15)
15a–15b   15b–15c   15a–15c
0.4199    0.3947    0.8566
Data show that groups that include a group boundary have strong correlation (cf. 15a and 15c, 0.8566) and the same groups have weak to moderate correlation with the groups that have no boundary inside (cf. 15a and 15b: 0.4199 as well as 15c and 15b: 0.3947). Below, we’ll see that a structure with linear arrangement similar to that of (14) consisting of two adjacent groups of two dots each but with different hierarchical positions as a result of embedding yields a modified representation; cf. (16): (16)
•(••(••))
Table 3: Overall Means of individual segment durations in (16)
16a      16b      16c      16d
932.04   286.56   779.02   274.78
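The shortening visible in Table 3 can be made explicit with a small arithmetic sketch. The Python snippet below takes the four overall means from Table 3 and expresses each more deeply embedded segment as a proportion of its shallower counterpart. The ratio is an assumed, illustrative measure and the variable names are hypothetical; the text itself reports only the means.

```python
# Overall mean durations of the segments of (16), taken from Table 3.
means = {"16a": 932.04, "16b": 286.56, "16c": 779.02, "16d": 274.78}

# Ratio of the more deeply embedded segment to its shallower counterpart.
# A value below 1 indicates shortening. (Illustrative measure only; the
# source reports the means, not these ratios.)
long_ratio = means["16c"] / means["16a"]
short_ratio = means["16d"] / means["16b"]

print(f"16c / 16a = {long_ratio:.3f}")
print(f"16d / 16b = {short_ratio:.3f}")
```

Both ratios come out below 1, with the 'long' (boundary) pair showing the larger proportional reduction.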
The comparison of the overall means shows that first, the two ‘short’ segments, 16b and 16d are not equal in duration and, second, the two ‘long’ segments (representing the boundaries between two adjacent dots), 16a and 16c are also not equal. In both cases we observe shortening as an effect of embedding, the representation of groups similar in linear composition but different in hierarchical
position. Accordingly, we do not find strong correlation between the respective pairs of groups in question:
Table 4: Estimation of correlations in (16) = •(••(••))
16a–16b   16a–16c   16a–16d   16b–16c   16b–16d   16c–16d
0.2902    0.5949    0.1053    0.2415    0.4774    0.2837
What we notice first is that this representation lacks the strong correlation between segments 16b and 16d (actually, it is weak to moderate, 0.4774). That is, although these two groups consist of two dots each, their similarity in temporal representation (i.e. structural interpretation) is affected by some effect stronger than the linear composition of these groups. This effect is that of hierarchy. Hierarchy determines the relatively strongest correlation among all possible pairs between segments 16a and 16c, both ranging over a group boundary. What they share in common is not just the equal number of dots within their respective groups but also that they both denote structural embedding. Since embedding is denoted by the shortening of segment length, the recursive application of embedding (16c related to 16a) also results in recursive shortening. Thus the “story” of the respective length and correlation of 16a and 16c is this: being equal in length (by the number of dots in their respective groups), 16a and 16c are expected to have strong correlation (similarly to what we get in case of 14a and 14c above). Their structural position (across group boundary) suggests that the duration between them is ‘long”, such as the case of 15a and 15c. We do not, however, get the strong correlation we find in the linear structures of either (14) or (15): it is due to their relative position in the structural hierarchy. Accordingly, we witness here the result of a calculation starting with identifying compositional similarity (similarity in the number of dots involved in each of the groups) and followed by identifying structural difference (difference in their position in the hierarchy). The former requires similarity in representation (equal duration), whereas the latter requires shortening with respect to the calculated default duration. 
As a result, the correlation of the two groups with equal composition but different structural position will necessarily weaken, but only to the extent that their correlation does not turn negative. The fact that individual pairs with similar linear arrangement have different correlations ranging from weak to strong suggests that there is a certain calculation behind these resulting values. They cannot be the result of some constant, context-free relation (such as one’s tempo), since then, when normalized, structural differences across pairs of patterns would be lost in the estimation of correlations. Accordingly, calculations are structure dependent, i.e. dependent on the actual structure of the pattern to be represented. It is apparent that any representation is based on groups. Accordingly, when we are looking for cues of the planning process of the representation of structure, we can suggest that the first step is to identify groups. The default way of identifying groups is by identifying boundaries between elements. This seems to be straightforward when there are overt boundary markers in the description of the structure (such as brackets in our examples involving abstract visual and prosodic patterns), resulting in overt grouping. However, as we saw earlier, one seeks to identify groups even if there are no overt boundary markers; this is the case of inherent grouping. This is what we see observing cases like (1), repeated here as (17) below: (17)
••••
Table 5: Estimation of correlations in (17)
17a–17b   17b–17c   17a–17c
0.2530    0.1402    0.9365
Although we would expect a relatively even correlation between subsequent 2-dot segments in a sequence of dots with no overt boundary markers, example (17) above shows that this is not the case. As we see, there is a very strong correlation between segments 17a and 17c (0.9365) and weak or very weak correlation between 17a and 17b (0.2530) and 17c and 17b (0.1402). Since identifying groups is the precondition for the identification (and, subsequently, representation) of any structure, the analysis of (17) suggests that, indeed, the application of the principle of inherent grouping is expected to be the basis of any further calculations determining the actual surface representation of a structure. The following two examples reveal the condition for inherent grouping as well; cf. (18) and (19): (18)
(•••)•
Table 6: Estimation of correlations in (18)
18a–18b   18b–18c   18a–18c
0.8083    0.5574    0.6283
(19)
•(•••)
Table 7: Estimation of correlations in (19)
19a–19b   19b–19c   19a–19c
0.2812    0.8655    0.3045
Tables 6 and 7 show that, if an overt group consists of three elements, no further grouping takes place. Namely, both 18a and 18b in (18) and 19b and 19c in (19) have a strong correlation (0.8083 and 0.8655, respectively). This suggests that inherent grouping only takes place when the resulting groups are compositionally similar. Thus, the next task after establishing groups by identifying overt group boundaries (between groups of any kind of composition) is to identify (further) groups by compositional similarity. The principle behind it is that of symmetry. We suggest that symmetry in its default form is simply determined quantitatively, involving the size of the groups in question (the number of constituting elements and/or their other eventual one-dimensional characteristics) without regard to the actual structural (hierarchical) characteristics of the groups. The output of the calculation of symmetry will then be the input to the calculation of values of representation according to the actual position of groups in the given structure. This second principle is that of hierarchy. The default case is coordination, i.e. the lack of hierarchical difference, in which case no further calculation on the output of symmetry is needed. In the presence of hierarchical differences, such as embedding, however, the operation of shortening/lengthening (depending on the position to be represented) is applied to the output of symmetry. Considering symmetry as a principle rather than some concrete measurable quantitative characteristic, we believe that, if applied, symmetry will not be canceled by a subsequent operation; instead, such a sequence of operations will result in a derived symmetry. Accordingly, the symmetry between segments 14a and 14c, or 15a and 15c, all above, is default symmetry, while the comparison of 9a and 9c, as well as that of 16b and 16d, shows derived symmetry.
Interestingly, we may have the impression that pairs of derived symmetry are equally “symmetrical”, even though in terms of duration or other measurable characteristics they are more different than pairs of default symmetry. To account for this observation we may suggest that this “adjustment mechanism” applied to the default symmetry is built into our cognition in the form of principles and rules, and that we take it into consideration when identifying and interpreting underlying hierarchical structure on the basis of surface representation. The place in cognition and the functioning of this adjustment mechanism are similar to those of interpreting phonemes through allophones as their surface representations: based on rules of mapping allophones onto phonemes, we “hear” the phonemes beyond the actual physical sign. As the above analyses show, when planning the surface representation of a structure we follow the principles of inherent grouping, symmetry, and hierarchy, in this very order. Next, we will find out if there is any directionality also involved in such planning. Namely, since representation is linear and sequential, we wish to find out whether there is any indication of forward or backward looking.
b. Directionality as a condition underlying the planning of the temporal arrangement of groups
Consider the examples in Table 8 and Table 9 below, having one group with four dots across patterns and another with a variable number of dots⁵. The position of the constant four-dot group in the patterns in Table 8 is on the left, whereas in Table 9 it is on the right. Below find the estimated correlations between pairs of groups: (20)
•(••••(•))
(21)
•(••••(••))
(22)
•(••••(•••))
(23)
•(••••(••••))
Table 8: Estimation of correlations between similar segments of (20)–(23)
              21b+21c+21d   22b+22c+22d   23b+23c+23d
20b+20c+20d   0.8445        0.8326        0.9019
21b+21c+21d                 0.8223        0.7765
22b+22c+22d                               0.8274
(24)
•(•(••••))
(25)
•(••(••••))
(26)
•(•••(••••))
(27)
•(••••(••••))
Table 9: Estimation of correlations between similar segments of (24)–(27)
              25d+25e+25f   26e+26f+26g   27f+27g+27h
24c+24d+24e   0.6457        0.4679        0.3686
25d+25e+25f                 0.7916        0.6421
26e+26f+26g                               0.6689
5 The patterns we are considering are, to some extent, the abstraction of patterns of rhythmic succession in music as described in Narmour (1990): groups with variable number of dots suggest duration moving cumulatively. What we wish to find, however, is how this variable duration suggested by the number of dots relates to ‘meter’, i.e. how a symmetrical relation can be established between the two groups.
Comparing the two sets of patterns that only differ in the position of the constant 4-dot group in the given pattern, we see an important difference. Namely, Table 8, with the constant 4-dot group on the left, shows strong correlation of all instances of this group across all occurrences with a variable number of dots in the right group, suggesting that the composition of the right group does not have a significant effect on the representation of the left group. On the other hand, Table 9, with the opposite linear arrangement of groups (the constant 4-dot group on the right), shows that there is mostly only moderate or weak correlation between the instances of this constant 4-dot group. Accordingly, the representation of the group on the right appears to be under the effect of the composition of the group on the left, whose representation is, as Table 8 showed, highly systematic on its own. From these data we can then conclude that a. there is a directionality for the planning of structural representation and b. this direction is backward looking, i.e. the representation of the right group follows the quantitative characteristics (in our case timing) of the group on the left. (That backward looking is the default direction of planning is also supported by the comparison of onset times for planning we found in Section 2: according to this measurement, forward looking proved to be computationally expensive.) What can be the basic mechanism of backward looking? We may suggest that it is symmetry again. The comparison of the left and right groups inside the same pattern supports this suggestion: whereas, according to Table 9, the duration of the same kind of group (with the same number of dots) on the right has weak or moderate correlation with its similar instances in different patterns, the correlation of the left and the right group inside the same pattern increases as they become closer to each other in terms of the number of dots.
This can be accounted for by following the principle of symmetry: a relative similarity of group durations can
be achieved more easily if there is less compositional difference between the prospective pairs of the symmetry. This is how we can account for the relative difference between the corresponding short and the corresponding long segments in patterns analysed in Section 1: The overall least square means (in ms) for segments in (2) – repeated here again as (14) and (3) – repeated here as (28) – are shown in Table 10: (28)
•(••)• (= (3))
Table 10: Least square means for segments in (14) and (28)
(••)(••).14a   (••)(••).14b   (••)(••).14c   •(••)•.28a   •(••)•.28b   •(••)•.28c
317.229167     862.020833     308.770833     847.520833   273.3125     718.0625
Accordingly, the short segments a and c of pattern (14) (••)(••) are somewhat longer than the short segment b of pattern (28) •(••)•, and the long segment of (14) is also somewhat longer than the long segments of (28). The segments have the partial correlations given in Table 11.
Table 11: Estimation of correlations between segments of (14) and (28)
               (••)(••).14b   (••)(••).14c   •(••)•.28a   •(••)•.28b   •(••)•.28c
(••)(••).14a   0.5828         0.8609         0.5268       0.8720       0.4576
(••)(••).14b                  0.4352         0.7459       0.4643       0.7199
(••)(••).14c                                 0.3160       0.8292       0.2933
•(••)•.28a                                                0.4042       0.8527
•(••)•.28b                                                             0.3777
The above correlation estimates further clarify that strong correlation only exists between segments of the same type (either both ‘long’ or both ‘short’), without regard to their position in the pattern. However, looking at the effect of the order of groups in the same pattern (i.e. whether one group precedes or follows the other in the same pattern) we observe the following. Using the Manova LS Means method and comparing the two short segments 14a and 14c occurring in the same pattern by the Within Subjects Contrast, we find that the effect of GroupOrder (position of the short segments in the pattern) is not significant with regard to 14a and 14c. Accordingly, the position of the short groups in pattern (14) does not have an effect on their length. However, comparing by the same method the short segments 14a and 28b as well as 14c and 28b, which belong to different patterns, the effect of GroupOrder (position of the short segments in the pattern) proves to be significant both between 14a and 28b and between 14c and 28b. Accordingly, long and short segments have a different relation in patterns like (14) and in patterns like (28). Assuming that representation follows the same principles in the case of both patterns, the observed difference in the significance of the effect of GroupOrder allows us to suggest that the two patterns have a structural difference that plays a role in their surface representation. We see this difference in the following: pattern (14) (••)(••) is perceived as consisting of two symmetrical groups (14a and 14c) and a boundary between them. The principle of symmetry requires the boundary to be as long as the two symmetrical groups. However, 14b is longer than 14a or 14c, due to the fact that, in addition to applying symmetry to the representation, which requires equal length (appearing as equal beat), this segment 14b is lengthened to indicate the group boundary as well. Accordingly, the representation of 14b is a case of derived symmetry: the length of segment 14b, determined, as a default, by symmetry, is increased following the rule of boundary formation by lengthening. In contrast, in the case of (28) boundary formation (lengthening) in segment 28a is not performed on the output of the default symmetry operation (and, accordingly, does not have the effect of additional lengthening). That is why the ‘long’ segment 28a proves to be shorter than the corresponding ‘long’ segment 14b. (That the second ‘long’ segment in (28), 28c, is even shorter may be the effect of final shortening.)
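The size of the differences discussed here can be retraced from the least square means in Table 10. The following Python sketch simply subtracts the relevant means; the helper name `diff` and the choice of comparisons are illustrative assumptions, and the sketch says nothing about statistical significance, which the text establishes by a separate method.

```python
# Least square means (ms) from Table 10.
lsm = {
    "14a": 317.229167, "14b": 862.020833, "14c": 308.770833,
    "28a": 847.520833, "28b": 273.3125, "28c": 718.0625,
}


def diff(x, y):
    """Difference in ms between the means of segments x and y."""
    return lsm[x] - lsm[y]


# 'Long' segments: 14b vs. 28a, and 28a vs. the final long segment 28c.
# 'Short' segments: 14a and 14c vs. 28b.
comparisons = [("14b", "28a"), ("28a", "28c"), ("14a", "28b"), ("14c", "28b")]
for x, y in comparisons:
    print(f"{x} - {y} = {diff(x, y):.2f} ms")
```

All four differences are positive, in line with the observation that the segments of (14) are somewhat longer than their counterparts in (28) and that 28c is the shortest of the 'long' segments.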
The assumed role of symmetry and the preference for backward looking in performing the surface representation of the various patterns by mouse clicks is supported by the comparison of onset times already referred to (the time subjects needed for processing before starting the representation of the respective patterns). In all cases processing time was shorter for patterns with the right group consisting of fewer dots than the number of dots in the left group, and longer in all other cases. If we assume that planning again involved the principle of symmetry, then, it appeared computationally less expensive to follow backward looking in planning, i.e. to adjust the duration of the right group to that of the left group by virtually adding the difference to the duration of the (shorter) right group in form of a pause than to eventually follow forward looking by first calculating the duration of the (shorter) right group, and then shortening the estimated default duration of the (longer) left group in order to match duration. (This observation supports an earlier finding according to which planning both in speech and music is facilitated by the events’ metrical similarity and serial/temporal proximity and by developmental changes in short-term memory (cf. Palmer & Pfordresher, 2003)).
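The backward-looking adjustment described above can be pictured with a deliberately simple toy model in Python. This is a sketch under strong assumptions, not the planning mechanism itself: the function name `backward_looking_pause` and the constant per-dot duration `MS_PER_DOT` are hypothetical, and real durations are of course not a fixed multiple of the number of dots.

```python
def backward_looking_pause(left_ms, right_ms):
    """Pause (ms) to add after a shorter right group so that it matches
    the left group; zero if the right group is not shorter."""
    return max(left_ms - right_ms, 0.0)


# HYPOTHETICAL per-dot duration, used only to turn dot counts into ms.
MS_PER_DOT = 300.0

left_dots = 4
for right_dots in (1, 2, 3, 4):
    pause = backward_looking_pause(left_dots * MS_PER_DOT,
                                   right_dots * MS_PER_DOT)
    print(f"left={left_dots} right={right_dots} pause={pause:.0f} ms")
```

In this toy version the only operation needed is an addition to the right group; the left group is never recomputed, which is one way of picturing why backward looking would be the cheaper strategy.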
c. Planning the temporal representation of speech utterances
Let us now look at examples with actual speech utterances to find out if the realization of certain kinds of rhythmic units in speech follows the abstract cognitive principles of the planning of structural representation. We carried out a reading experiment with 21 subjects (19 of them Hungarian university students, 2 further adults), involving 15 Hungarian sentences with the following structure (A = first segment, B = second segment, C = third segment of a sentence):
A: one of 3 verbs, with 2, 3 or 4 syllables, + an article
B: one of 5 nouns, with 4, 5, 6, 7 or 8 syllables, + an article
C: invariably the same noun with 3 syllables
The words were:
A: álltam ‘I stood’, beszéltem ‘I spoke’, beszélgettem ‘I talked [to]’ + a ‘the’
B: rajztanárral ‘with the arts teacher’, énektanárral ‘with the music teacher’, fizikatanárral ‘with the physics teacher’, történelemtanárral ‘with the history teacher’, matematikatanárral ‘with the math teacher’ + a ‘the’
C: a folyosón ‘in the corridor’
The 15 sentences were the result of combining each choice of A with each choice of B, followed by C, and followed the patterns below: (29)
Álltam a rajztanárral a folyosón. ‘I stood with the arts teacher in the corridor.’
(30)
Beszéltem a fizikatanárral a folyosón. ‘I spoke with the physics teacher in the corridor.’
(31)
Beszélgettem a matematikatanárral a folyosón. ‘I talked with the mathematics teacher in the corridor.’
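The design of the sentence set can be reproduced with a short Python sketch that combines every verb with every noun and appends the constant final segment. The function names `syllables` and `article` are illustrative; counting vowel letters is used here as a simple stand-in for syllable counting, and the article rule (az before a vowel-initial word, a otherwise) follows the forms shown in the test sentences.

```python
from itertools import product

VOWELS = set("aáeéiíoóöőuúüű")


def syllables(word):
    """Count syllables as the number of vowel letters in the word."""
    return sum(ch in VOWELS for ch in word.lower())


def article(next_word):
    """Definite article: 'az' before a vowel-initial word, else 'a'."""
    return "az" if next_word[0].lower() in VOWELS else "a"


verbs = ["Álltam", "Beszéltem", "Beszélgettem"]
nouns = ["rajztanárral", "énektanárral", "fizikatanárral",
         "történelemtanárral", "matematikatanárral"]
final = "folyosón"

# Every verb combined with every noun, followed by the constant segment C.
sentences = [
    f"{verb} {article(noun)} {noun} {article(final)} {final}."
    for verb, noun in product(verbs, nouns)
]

print(len(sentences))
print([syllables(v) for v in verbs])
print([syllables(n) for n in nouns])
```

Three verbs times five nouns give the 15 test sentences, and the vowel counts reproduce the syllable ranges stated for segments A and B.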
Since all these sentences were neutral (context-independent), it was expected that subjects would read out each of the 15 sentences with emphasis on the first syllable of each group, a property of Hungarian, as e.g. with emphasis on álltam, rajztanárral and folyosón. We intended to measure the duration of each of the segments and find out if there was an either forward or backward looking effect
between adjacent segments in determining segment duration. Since we wished to exclude the effect of the established temporal structure of the first utterance on the next ones, the subjects were presented with the 15 sentences in random order and, in addition, the reading of each of the sentences was followed by a short interval in which mostly, but not exclusively, the content of the sentence just read was briefly discussed with the subject. We were interested in the following: a. does the length of a particular word in a segment (measured in number of syllables) affect the duration of the given segment (measured in ms) and b. does the length of a particular word in a segment affect the duration of a word in another segment. If there were such an effect, we also wished to find out the directionality of planning. By answering these questions we expected to identify some quantifiable cues to the perceived rhythm of utterances where the succession of emphasis on segment-initial syllables offers the sense of pulsation. The same kinds of measurement were carried out on the speech material as in the case of abstract visual patterns with dots, giving us the opportunity to check if the general principles identified in the latter also underlie temporal organization in speech. We have to note that the number of subjects participating in these experiments was smaller than what, following a rule of thumb, would be expected to satisfactorily exclude some biases, yet we hope that these data will at least suggest some tendencies pertaining both to grouping in general and to temporal grouping in speech.
Constant: segment A – partial correlation with variable segment B: Compare the following correlation data in Tables 12–14:
Table 12: Estimation of correlations of first segments of five test sentences
        A           B                      C           (46A)    (43A)    (35A)    (38A)
(32A)   Álltam a    rajztanárral a         folyosón.   0.6036   0.8399   0.8299   0.5962
(46A)   Álltam az   énektanárral a         folyosón.            0.5046   0.6581   0.4942
(43A)   Álltam a    fizikatanárral a       folyosón.                     0.7436   0.6381
(35A)   Álltam a    történelemtanárral a   folyosón.                              0.5596
(38A)   Álltam a    matematikatanárral a   folyosón.
Table 13: Estimation of correlations of first segments of five test sentences
        A                B                       C           (34A)    (41A)    (44A)    (40A)
(37A)   Beszéltem a      rajztanárral a          folyosón.   0.6537   0.7378   0.6697   0.6288
(34A)   Beszéltem az     énektanárral a          folyosón.            0.4793   0.5438   0.6237
(41A)   Beszéltem a      fizikatanárral a        folyosón.                     0.585    0.6181
(44A)   Beszéltem a      történelemtanárral a    folyosón.                              0.6742
(40A)   Beszéltem a      matematikatanárral a    folyosón.
Table 14: Estimation of correlations of first segments of five test sentences
        A                B                       C           (39A)    (33A)    (42A)    (36A)
(45A)   Beszélgettem a   rajztanárral a          folyosón.   0.7725   0.5173   0.8566   0.5794
(39A)   Beszélgettem az  énektanárral a          folyosón.            0.6174   0.8368   0.8106
(33A)   Beszélgettem a   fizikatanárral a        folyosón.                     0.6286   0.8985
(42A)   Beszélgettem a   történelemtanárral a    folyosón.                              0.7419
(36A)   Beszélgettem a   matematikatanárral a    folyosón.
Constant: segment B – partial correlation with variable segment A: Compare the following correlation data in Tables 15–19:
Table 15: Estimation of correlations of second segments of three test sentences
        A                B                       C           (37B)    (45B)
(32B)   Álltam a         rajztanárral a          folyosón.   0.8307   0.6624
(37B)   Beszéltem a      rajztanárral a          folyosón.            0.717
(45B)   Beszélgettem a   rajztanárral a          folyosón.
Table 16: Estimation of correlations of second segments of three test sentences
        A                B                       C           (34B)    (39B)
(46B)   Álltam az        énektanárral a          folyosón.   0.7035   0.8788
(34B)   Beszéltem az     énektanárral a          folyosón.            0.6467
(39B)   Beszélgettem az  énektanárral a          folyosón.
Table 17: Estimation of correlations of second segments of three test sentences
        A                B                       C           (41B)    (33B)
(43B)   Álltam a         fizikatanárral a        folyosón.   0.8229   0.8094
(41B)   Beszéltem a      fizikatanárral a        folyosón.            0.8304
(33B)   Beszélgettem a   fizikatanárral a        folyosón.
Table 18: Estimation of correlations of second segments of three test sentences
        A                B                       C           (44B)    (42B)
(35B)   Álltam a         történelemtanárral a    folyosón.   0.8585   0.8024
(44B)   Beszéltem a      történelemtanárral a    folyosón.            0.8051
(42B)   Beszélgettem a   történelemtanárral a    folyosón.
Table 19: Estimation of correlations of second segments of three test sentences
        A                B                       C           (40B)    (36B)
(38B)   Álltam a         matematikatanárral a    folyosón.   0.7614   0.7827
(40B)   Beszéltem a      matematikatanárral a    folyosón.            0.6998
(36B)   Beszélgettem a   matematikatanárral a    folyosón.
First of all, we observed no pairs of patterns where the correlation was weak; rather, it was mostly moderate and in some cases strong. From this it already follows that the directionality (the computational preference of backward looking) we observed in abstract visual patterns could not be identified. Accordingly, there was a significant probability of correlation in terms of duration in segment A for each of the selected verbs as compared to variable nouns in segment B, suggesting that the length of B did not have a significant effect on the duration of A (the case of backward looking). But we also found a significant probability of correlation in terms of duration in segment B for each of the selected nouns regardless of what verbs they were preceded by (the case of forward looking). So at first glance it seems as if the subjects’ planning of the temporal structure of the utterance does not involve either backward or forward looking. Examining the data a bit closer, however, we find that the correlation estimates of the duration of segment B are higher than those of the duration of segment A across all combinations of corresponding words. This allows us to suggest that establishing the correlation of the duration of segment B has preference over that of segment A while maintaining the correlation of duration within each of the segments. But why? We suggest the following answer: in all cases words in segment B are longer than words in segment A, and, similarly to what we found in abstract visual patterns with more dots in the right group than in the left group, the principle of symmetry is in effect. Namely, there is a tendency to establish a symmetry between the duration of segment A and segment B and, segment B being longer, the computationally more expensive solution comes into effect: the lengthening of the duration of the shorter segment A.
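The comparison between the two sets of estimates can be checked with a short script. The following is only a descriptive sketch: it pools the values given in Tables 12–14 (segment A) and Tables 15–19 (segment B) and reports their range and arithmetic mean. Averaging correlation coefficients directly is a simplification assumed here for illustration; it is not a significance test, and the names `SEGMENT_A`, `SEGMENT_B` and `summarise` are introduced only for this example.

```python
from statistics import mean

# Correlation estimates transcribed from Tables 12-14
# (segment A, compared across sentences sharing the same verb).
SEGMENT_A = [
    0.6036, 0.8399, 0.8299, 0.5962, 0.5046, 0.6581, 0.4942, 0.7436, 0.6381, 0.5596,
    0.6537, 0.7378, 0.6697, 0.6288, 0.4793, 0.5438, 0.6237, 0.585, 0.6181, 0.6742,
    0.7725, 0.5173, 0.8566, 0.5794, 0.6174, 0.8368, 0.8106, 0.6286, 0.8985, 0.7419,
]

# Correlation estimates transcribed from Tables 15-19
# (segment B, compared across sentences sharing the same noun).
SEGMENT_B = [
    0.8307, 0.6624, 0.717,
    0.7035, 0.8788, 0.6467,
    0.8229, 0.8094, 0.8304,
    0.8585, 0.8024, 0.8051,
    0.7614, 0.7827, 0.6998,
]


def summarise(values):
    """Return (minimum, mean, maximum) of a list of correlation estimates."""
    return min(values), mean(values), max(values)


a_min, a_mean, a_max = summarise(SEGMENT_A)
b_min, b_mean, b_max = summarise(SEGMENT_B)

print(f"segment A: min={a_min:.4f}  mean={a_mean:.4f}  max={a_max:.4f}")
print(f"segment B: min={b_min:.4f}  mean={b_mean:.4f}  max={b_max:.4f}")
```

On these values the pooled mean is roughly 0.66 for segment A and 0.77 for segment B, which is in line with the observation that the estimates for segment B tend to be higher.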
However, this is, in the end, a two-way process: while segment A is lengthened (by adding to the length of the final syllable of the verb), there is also an attempt to “speed up” the utterance of the word in segment B. The latter can be accounted for by the observation that mispronunciation happened exclusively with the longer segment B words, and even there with the longest one, matematikatanárral ‘with the mathematics teacher’. Summing up, our observations (even if further data from a larger number of subjects are still welcome) suggest that the temporal representation of speech utterances also follows the mechanism of planning identified in abstract visual patterns. Inherent grouping, the basis for any kind of grouping (visual and non-visual), is present in speech, among other things, in the metrical assignment of secondary and main stress (even though this observation is beyond our present data and discussion). Symmetry is clearly attested in our data through the identification of the tendency to establish relatively equal durations of subsequent segments. Since this symmetry is the result of further calculations, we can identify the case of derived symmetry in speech as well. Finally, directionality is also relevant and is dependent on the configuration of shorter vs. longer segments, and there is an attempt to establish (derived) symmetry even in computationally more expensive cases. That is what is suggested by the fact that correlations within words in segment B are systematically stronger than correlations within words in segment A. This is the case of looking both backward and forward to establish (derived) symmetry for a given pattern. Our perception interprets such pulses of derived symmetry, introduced by initial emphasis on segments of different durations, as rhythmic succession due to this very mechanism of calculation. Being aware of these calculations and applying them to the surface data, one can “reverse engineer” them and identify the underlying default symmetry that is hidden from purely quantitative surface measurements. This is how calculations to produce derived symmetry produce derived rhythm, that surface form behind which we identify the underlying rhythm of a pattern of any complexity.
6 Summary
In this paper we presented data and analyses of experiments directed to the identification of general, modality-independent properties of rhythm in language based on the cognitive foundations of the language faculty. It was pointed out that rhythm has two main aspects, structural and perceptual, such that the latter is the mapping of the former through representation. It was found that the perception of rhythm is based on the cognitive principle of grouping, which is, in its turn, based on the cognitive principle of symmetry. It was suggested that grouping is structure dependent and that hierarchical grouping requires calculations affecting the rhythm of the representation of a given pattern. In this connection two kinds of rhythm were distinguished: default rhythm as the product of default symmetry, and derived rhythm as the product of derived symmetry, the latter being the result of calculations on the default rhythm as an effect of the hierarchical structural position of the groups involved in the given symmetrical relation. It was suggested that there is a planning mechanism for representation in which backward looking is computationally less expensive and should therefore be considered as default. In speech utterances, depending on the respective length of adjacent groups (each group with a single emphatic element), a combination of both directions is possible. The duration of adjacent groups being potentially different, the perceived sense of rhythm is the effect of the principle of symmetry on derived rhythm.
References
Beckman, Mary. 1992. Evidence for speech rhythms across languages. In: Yohichi Tohkura, Eric Vatikiotis-Bateson & Yoshinori Sagisaka (eds.), Speech Perception, Production and Linguistic Structure, 457–463. Oxford: IOS Press.
Beckman, Mary E. & Janet Pierrehumbert. 1986. Intonational structure in English and Japanese. Phonology Yearbook 3, 255–310.
Beckman, Mary & Jan Edwards. Lengthenings and shortenings and the nature of prosodic constituency. In: John Kingston & Mary E. Beckman (eds.), Papers in Laboratory Phonology I, 179–200. Cambridge, UK: Cambridge University Press.
Beckman, Mary, Manuel Díaz-Campos, Julia McGory & Terrell Morgan. 2002. Intonation across Spanish, in the Tones and Break Indices framework. Probus 14, 9–36.
Clarke, Eric F. 1999. Rhythm and Timing in Music. In: Diana Deutsch (ed.), Psychology of Music, second edition, 473–500. San Diego: University of California.
Cooper, William Edwin. 1976. Syntactic control of timing in speech production: A study of complement clauses. Journal of Phonetics 4, 151–171.
Cooper, William & Jeanne Paccia-Cooper. 1980. Syntax and Speech. Cambridge, MA: Harvard University Press.
Downing, Bruce T. 1970. Syntactic structure and phonological phrasing in English. Ph.D. diss., University of Texas Austin.
Hayes, Bruce. 1984. The phonology of rhythm in English. Linguistic Inquiry 15, 33–74.
Hayes, Bruce. 1989. The Prosodic Hierarchy in Meter, Phonetics and Phonology. In: Paul Kiparsky & G. Youmans (eds.), Rhythm and Meter (Phonetics and Phonology 1), 201–260. San Diego: Academic Press.
Hayes, Bruce. 1995. Metrical stress theory. Chicago: The University of Chicago Press.
Hockey, Beth Ann & Zsuzsanna Fagyal. 1998. Pre-boundary lengthening: Universal or language-specific? The case of Hungarian. U. Penn Working Papers in Linguistics 5.1, 71–82.
Honing, Henkjan. 2001. From time to time: The representation of timing and tempo. Computer Music Journal 35(3), 50–61.
Honing, Henkjan. 2002. Structure and Interpretation of Rhythm and Timing. Tijdschrift voor Muziektheorie 7/3, 227–232.
Hunyadi, László. 2009. Experimental evidence for recursion in prosody. In: Marcel den Dikken & Robert M. Vago (eds.), Approaches to Hungarian, Vol. 11, 119–144. Amsterdam: John Benjamins.
Hunyadi, László. 2010. Grouping, the cognitive basis of recursion in language. In: Harry van der Hulst (ed.), Recursion and human language (Studies in Generative Grammar 104), 343–370. Berlin: Mouton de Gruyter.
Hunyadi, László. To appear. Frázisstruktúra és prozódia. Interfész vagy közös kognitív alapok? [Phrase structure and prosody. Interface or common cognitive foundations?]. To appear in Általános Nyelvészeti Tanulmányok.
Ladd, Robert & Nick Campbell. 1991. Theories of prosodic structure: Evidence from syllable duration. Proceedings of the XII ICPhS. Aix-en-Provence, France.
Lerdahl, Fred & Ray Jackendoff. 1983. A generative theory of tonal music. Cambridge, MA: MIT Press.
Liberman, Mark. 1975. The intonational system of English. PhD thesis, MIT. Distributed 1978 by IULC.
Liberman, Mark & Alan Prince. 1977. On stress and linguistic rhythm. Linguistic Inquiry 8, 249–336.
Nespor, Marina & Irene Vogel. 1982. Prosodic domains of external sandhi rules. In: Harry van der Hulst & N. Smith (eds.), The structure of Phonological Representations. Part I, 225–255. Dordrecht: Foris Publications.
Narmour, Eugene. 1990. The Analysis and Cognition of Basic Melodic Structures: The Implication Realization Model. Chicago: University of Chicago Press.
Nespor, Marina & Irene Vogel. 1986. Prosodic phonology. Dordrecht: Foris Publications.
Sacks, Oliver. 2007. Musicophilia: Tales of Music and the Brain. Revised and Expanded (2008). New York: Vintage Books.
Selkirk, Elizabeth. 1984. Phonology and Syntax: The relation between sound and structure. Cambridge, MA: MIT Press.
Schenker, Heinrich. 1935. Neue musikalische Theorien und Phantasien, vol. III: Der freie Satz (Vienna: Universal Edition; ed. and rev. Oswald Jonas 2/1956). English translation: Oster, Ernst, ed. & transl., Free Composition (Der freie Satz), 2 vols (New York: Longman, 1979; repr. edn Hillsdale, NY: Pendragon, 2001).
Taglicht, Joseph. 1998. Constraints on intonational phrasing in English. Journal of Linguistics 34, 181–211.
Dafydd Gibbon
2 Speech rhythms – modelling the groove
1 Rhythm – listening closely
Thesis: There is no speech rhythm. There are only speech rhythms.
1.1 Background
The rhythms of speech are an intensely debated topic, and have attracted increasing attention since the early studies by Pike (1945) and Jassem (1949, 1952). Detailed overviews of previous approaches have been provided at different times (Gibbon & Richter 1984; Gibbon 2006; Gibbon, Hirst & Campbell 2012). Solutions to the problem of characterising rhythms, ‘modelling the groove’, are many, yet none is definitive. Consequently, more intensive ‘listening’ is called for. The objective of this study is not to apply a priori models or simple ‘rhythm metrics’, but to take a fresh view of the nature of speech rhythms from intuitive, phonetic and formal points of view. On this basis, approaches to capturing the rhythms themselves will be outlined, rather than particular properties such as isochrony, regularity or ‘smoothness’, with the aim of developing a more comprehensive and integrative view of rhythms than has been available so far. The present study has a methodological focus, rather than reporting on specific descriptive issues. Binary, ternary and other rhythm models are considered, first from a pre-theoretical point of view, and in some detail, with reference to the ternary rhythm model of Jassem and the binary rhythm model of Abercrombie, and a generic pre-peak peak post-peak basic alternation template for rhythm patterns is proposed. On the basis of the discussion of rhythm models, the physical properties of speech rhythms are investigated, and a generic model of syllables and feet is proposed, permitting the integration of notions such as syllable-timed and foot-timed rhythms into a single framework: the Alternating Syllable Model (ASM) and the Syllable Extension Model (SEM). Some popular quantitative models are queried in respect of their validity as rhythm metrics, and the outlook for the future development of new ‘genuine rhythm models’ based on recently proposed oscillator systems is sketched.
1.2 Rhythm schema, rhythm interpretation and rhythm performance
The problem is, though, that rhythms are elusive, in speech as in music. In music, pedantic iterations are not rhythm: whatever the style, there is a ‘feeling’, ‘swing’ or ‘groove’, with subtle accelerandi and rallentandi, with anacruses, ‘grace notes’ and syncopations which conspire against an exact definition. It is useful to make a three-way distinction which is important for disentangling different perspectives on the rhythms of speech and music. First, there are the conventional types of metre in rhetoric and in traditional poetry, such as the iambic pentameter, and the times and metric patterns in music and dance, such as the waltz and the samba. Second, there are specific interpretations of these metres and structures in context, such as syncopations in music which modify the underlying beat, or adaptations to grammatical patterns in speech (the ‘THIRteen MEN rule’, as opposed to ‘there were thirTEEN’). Third, there are the perceivable, and measurable, spontaneously varied rhythms of the performance of speech and music, which I will refer to as the groove. Figure 1 shows the three distinctions in music. The first and second lines in each example correspond. The basic rhythmic and melodic template of Gershwin’s I Got Rhythm (Figure 1 top) does not reflect the syncopations noted in the lead sheet (Figure 1 centre) of a jazz musician’s interpretation and even less the melodic and rhythmical swing of Ella Fitzgerald’s performance (Figure 1 bottom). The categorial properties of standard musical notation mirror the categorial properties of more abstract linguistic notations, particularly phonological and prosodic notations, while the pitch trace is a more appropriate expression of individual melismata and glissandi. The following discussion will deal with similar issues in speech. Speech rhythms, like rhythms in music, are hard to capture.
But rhythms evidently have a physical and physiological reality, and are not illusions or cognitive constructs alone. If they were, we would not be able to identify ‘wrong rhythms’, but we can do this. Rhythms have a physical component, even though this component is notoriously hard to pin down, in distinction to general principles of structured timing which may or may not be rhythmic. We have to start somewhere, though. And starting points have been very different, so that rhythm studies have varied greatly over the years in the methods used, and consequently phoneticians have come to very different conclusions.
Figure 1: “I got rhythm”, Gershwin (1930): schema (top), syncopation (middle, Guy Bergeron, 2010), performance (bottom, Ella Fitzgerald, 1996).
Earlier studies were based on perceptual impressions of rhythmically relevant prominence relations, such as ‘primary stress’ and ‘secondary stress’, by pedagogical phoneticians, but also by academic phoneticians (Pike 1945; Jassem 1949, 1952). Later phonological ‘rhythm’ studies combined these with data structures such as trees and histograms (‘grids’, i.e. visualisations of numerical vectors) to express relative prominence relations (Chomsky & Halle 1968; Liberman & Prince 1977) and to derive the relative prominence relations systematically from word and sentence structure. Later still, phoneticians introduced quantitative ‘rhythm metrics’ in order to quantify the physical basis of perceived rhythms by means of measurements of the acoustic signal. A classification of approaches to such rhythm models and metrics was proposed by Gibbon (2006), and some of these will be referred to in the present study. The central insight in the present context is, first, that by concentrating on isochrony, the evenness of event durations, the essential character of rhythms, that ‘dum-de-de-dum-de’ factor which distinguishes rhythms from other sequences, has been lost, and second, that models must be developed which take ‘genuine rhythms’ into account.
The structure of the study is as follows. In Section 2, ‘Grooves and models’, the intuitive empirical starting points are outlined in terms of the Jassem Rhythm Model (JRM) and the Abercrombie Rhythm Model (ARM). In Section 3, ‘The syllable is the mother of the groove’, an integrative model for relating syllable-timed rhythms and foot-timed rhythms is developed, with a model of syllables as rhythmically alternating units, the Alternating Syllable Model (ASM), and a model of feet as extensions of syllable Onsets and Codas, the Syllable Extension Model (SEM). In Section 4, ‘Groovy phonetics’, the phonetic properties of rhythms are discussed using a specific example as an illustration and source of experimental hypotheses, relating the timing patterns of phone sequences, syllable sequences, and sequences of ternary (Jassem type) and binary (Abercrombie type) feet. In Section 5, ‘Measuring the groove’, the validity of a number of popular ‘rhythm metrics’ as rhythm models is questioned, and in Section 6, ‘Rocking the groove’, the outlook for developing ‘genuine rhythm models’ on the basis of recent oscillator rhythm systems is briefly discussed, before a summary and outlook is presented in Section 7, ‘The future of the groove’.
2 Grooves and models
2.1 Definition and explicandum
This intuitive definition is taken to be consensual:
Rhythms are temporally regular iterations of events which embody alternating strong and weak values of an observable parameter.
The concept of ‘observable parameter’ is neutral between rhythm as an epiphenomenon in human perception and cognition on the one hand, and physical measurements on the other. The alternating values of the observable parameter are commonly referred to as strong-weak, light-dark, loud-soft, stressed-unstressed, conspicuous-inconspicuous, prominent-nonprominent, consonant-vowel, hand raised vs. hand lowered, and in many other ways. The parameter may be a single feature type or a complex combination of many, it may be hierarchical in structure, it may appear in any modality, perhaps even the gustatory and olfactory modalities. An essential distinction must be made between physical rhythms and semiotic rhythms.
Physical rhythms may be natural rhythms (waves, regular limb movements during locomotion) or artefactual rhythms such as the rhythms of motors, the ticking of clocks, or the visual ‘rhythms’ (of fences or moiré silk). The basic function of semiotic rhythms is to mark cohesion in event sequences, whether in speech or in other modes of behaviour. Semiotic rhythms may be aesthetic rhythms (as in metre, the rhythms of music and dancing, the patterns of abstract art) or communicative rhythms (the patterns of speech and gesture). Physical rhythms (like other physical events) may be interpreted, for example in religious or poetic contexts, as semiotic, and any of these rhythm types may be ‘reconstructed’ on the basis of cognitive expectations and superimposed on physiological sensations and percepts. Starting with the intuitive definition of rhythms as temporally regular alternations, the form of a rhythm (whether physical or semiotic) may be analysed in terms of several key properties, being (a) a time series of (b) rhythm events, with (c) each event containing (at least) a pair of different observable values of a parameter over (d) intervals of time of relatively fixed perceived duration. Finally, (e) ‘it takes (at least) two to make a rhythm’: one alternation of parameter values is not yet a rhythm. The key properties of rhythm form are visualised as a basic Generic Binary Rhythm Model (GBRM) in Figure 2, which shows rhythms as prominent peak events followed by less prominent post-peak events. The GBRM visualises an intuitive understanding of rhythms according to the initial definition, as a regular iteration of alternating strong and weak values of a parameter. The definition of event as a pair of an interval and a property is taken from event logic.
Figure 2: Visualisation of a Generic Binary Rhythm Model (GBRM).
However, speech rhythms are much more complex than the intuition-based Generic Binary Rhythm Model (GBRM) permits, as plausible pronunciations of examples (1)–(6) demonstrate (the comments are in the terminology of poetic metre, with no implication that the lines are poetic).
(1) This | fine | bear | swam | fast | near | Jane’s | boat. (Singlets, syllable-timed, dum dum …)
(2) And then | a car | arrived. (Iambs, de-dum …)
(3) This is | Johnny’s | sofa. (Trochees, dum-de …)
(4) Jonathan | Appleby | carried it | awkwardly. (Dactyls, dum-de-de …)
(5) It’s a shame | that he fell | in the pond. (Anapaests, de-de-dum …)
(6) A lady | has found it | and Tony | has claimed it. (Amphibrachs, de-dum-de …)
In fast speech, the numbers of unstressed syllables surrounding a stressed syllable may be greater than the one or two illustrated here. These speech rhythm variants suggest that a model of basic speech rhythm units should not have a purely binary structure, i.e. peak post-peak, but a ternary structure, i.e. pre-peak, peak, post-peak, as a more comprehensive Generic Rhythm Model. A descriptively adequate model of speech rhythms in general needs to go beyond simple binary foot structures, and to be sufficiently flexible to capture not only trochaic rhythms but the other types, too.
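The ternary template can be made concrete with a small sketch. It assumes a toy notation in which a foot is written as hyphen-separated beats, `dum` for the peak and `de` for each weak syllable, following the glosses attached to the examples above; the function names `split_foot` and `name_foot` are hypothetical and serve only this illustration. Each foot is split into pre-peak, peak and post-peak parts, and the traditional metrical name is looked up from the two counts.

```python
# Traditional foot names keyed by (pre-peak count, post-peak count).
FOOT_NAMES = {
    (0, 0): "singlet",
    (1, 0): "iamb",
    (0, 1): "trochee",
    (0, 2): "dactyl",
    (2, 0): "anapaest",
    (1, 1): "amphibrach",
}


def split_foot(pattern):
    """Split a pattern such as 'de-dum-de' into (pre-peak, peak, post-peak)."""
    beats = pattern.split("-")
    if beats.count("dum") != 1 or any(b not in ("dum", "de") for b in beats):
        raise ValueError(f"expected exactly one 'dum' among 'de' beats: {pattern!r}")
    peak = beats.index("dum")
    return beats[:peak], beats[peak], beats[peak + 1:]


def name_foot(pattern):
    """Name the foot type, or return 'unnamed' for longer fast-speech patterns."""
    pre, _peak, post = split_foot(pattern)
    return FOOT_NAMES.get((len(pre), len(post)), "unnamed")


for pattern in ["dum", "de-dum", "dum-de", "dum-de-de",
                "de-de-dum", "de-dum-de", "de-de-dum-de-de"]:
    print(f"{pattern:16} -> {name_foot(pattern)}")
```

A purely binary peak post-peak template corresponds to the entries whose first count is zero; the remaining entries need the pre-peak slot, which is the point made in the paragraph above.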
2.2 The Jassem Rhythm Model and the Abercrombie Rhythm Model
Two classic speech rhythm types have been proposed: syllable-timed rhythms and foot-timed (or stress-timed) rhythms. The basic units of each type, syllable and foot, are claimed to be ‘isochronous’: the syllable and the foot, respectively, are said to be spoken in relatively constant temporal intervals. There are two prominent phonetic models for foot timing: the Jassem Rhythm Model and the Abercrombie Rhythm Model (Jassem 1949, 1952; Abercrombie 1967). Phonetic properties of the models have been discussed in detail by Hirst & Bouzon (e.g. 2005). The ARM is binary, and its Rhythm Unit (RU) contains two constituents, the Ictus (stressed syllable) followed by the optional Remiss (sequence of unstressed syllables up to but not including the next Ictus). The patterns captured are just the trochee and dactyl types. Where a sequence starts with unstressed syllables, a ‘silent Ictus’ is postulated. The JRM is hierarchical, with two hierarchical divisions: the Total Rhythm Unit (TRU) is a ternary sequence of Anacrusis (ANA), Narrow Rhythm Unit (NRU) and Rhythmical Juncture (RJ). The Anacrusis is a rapidly pronounced sequence of unstressed syllables between the last RJ and the next NRU, does not have the property of isochrony, and is largely determined by the coincidence of a grammatical break with the preceding RJ. The definition of Jassem’s NRU is the same as the definition of Abercrombie’s RU: a stressed syllable followed by a sequence of unstressed syllables. Jassem does not use the terms Ictus and Remiss for the constituents of the NRU, but they will be used here for convenience of reference. The difference between the JRM and the ARM is that the syllables in the JRM sequence of ANA and NRU constitute a ternary structure of ANA, Ictus and Remiss, while in the ARM the syllables constitute a simpler binary structure of Ictus and Remiss. In each case, only the Ictus is obligatory. Given the rhythm patterns documented in (1)–(6), it appears prima facie that the ARM is inadequate to describe most of the patterns, while the JRM captures all of them, and also the longer stretches of unstressed syllables found in fast speech. The properties of the two models are summarised and discussed in detail elsewhere (Gibbon, Hirst & Campbell 2012).
Without prejudicing the issue of whether these two types are systemically and phonetically valid or not, it is straightforward to interpret the Generic Rhythm Model in terms of the JRM and ARM rhythm models (Table 1), with a syllabification of the sequence it was terrifying to see.
Table 1: Comparison of Generic Rhythm Model (GRM), Jassem Rhythm Model (JRM) and Abercrombie Rhythm Model (ARM).
                   | –     | it was   | TER   | ri fy ing   | to       | SEE   |
GRM:               | –     | pre-peak | peak  | post-peak   | pre-peak | peak  |
Jassem (JRM):      | –     | ANA      | Ictus | Remiss + RJ | ANA      | Ictus |
Abercrombie (ARM): | Ictus | Remiss   | Ictus | Remiss                 | Ictus |
Grammar:           | –     | it was terrifying              | to see           |
The direct comparison in Table 1 shows clear structural and functional differences between the ternary Jassem model and the binary Abercrombie model. First, the binary Ictus-Remiss model forces Abercrombie to postulate a silent Ictus in cases where a sequence begins with unstressed syllables. Jassem’s model obviates such multiplication of entities praeter necessitatem by introducing the ANA category, which is independently empirically motivated by having its own timing pattern which differs from timing in the NRU (Jassem, Hill & Witten 1984), and thus is not introduced praeter necessitatem. With the ternary ANA Ictus Remiss pattern, the Jassem model fulfils the requirements for a GRM pre-peak peak post-peak pattern. Second, the Jassem model differs from the Abercrombie model by introducing an explicit RJ boundary category, which is motivated phonetically as (a) ending an NRU (for example by ‘final lengthening’ at the end of a rhythm sequence), and (b) separating an NRU from a following ANA, and thus is also not introduced praeter necessitatem. Third, Jassem’s examples show that the JRM differs in systemic motivation from the ARM by implicitly recognising grammatical properties of rhythm patterns (cf. also Pike 1945: 33ff.), in two ways: (a) in the tendency for ANA to mark proclitic or premodifying grammatical items in English right-headed constructions, and (b) in the tendency for the RJ to correspond with grammatical boundaries. Correspondence of grammatical and prosodic boundaries is neither a necessary nor a sufficient condition for a rhythm model, since grammatical boundaries may occur within rhythm units in fast speech, and rhythm units may occur within larger grammatical units in deliberative discussion styles (cf. contributions to Dechert & Raupach 1980). A priori there is no necessity for grammatical and prosodic units to coincide at the grammatical and rhythmical junctures.
Nevertheless, the relation between grammatical and prosodic boundaries is a frequently noticed (and actually rather obvious) tendency, and inclusion of this well-motivated systemic correspondence property, together with phonetic motivation for the ternary pattern, and without introducing silent entities, weighs in favour of the Jassem model.
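The two segmentations compared in Table 1 can be sketched as a pair of small parsers. This is an illustration under stated assumptions, not an implementation of either author's procedure: syllables are given with a stress flag, and a hypothetical `break_after` flag stands in for the Rhythmical Juncture, which in the JRM is a phonetic category that only tends to coincide with grammatical boundaries. The names `Syllable`, `Unit`, `parse_arm` and `parse_jrm` are introduced here for the example.

```python
from typing import List, NamedTuple, Optional


class Syllable(NamedTuple):
    text: str
    stressed: bool = False
    break_after: bool = False  # assumption: stands in for a Rhythmical Juncture


class Unit(NamedTuple):
    ana: List[str]        # Anacrusis; always empty in the ARM
    ictus: Optional[str]  # None: 'silent Ictus' (ARM) or stranded Anacrusis (JRM)
    remiss: List[str]


def parse_arm(syllables):
    """Binary Ictus + Remiss units; initial unstressed syllables get a silent Ictus."""
    units = []
    for syl in syllables:
        if syl.stressed:
            units.append(Unit([], syl.text, []))
        else:
            if not units:
                units.append(Unit([], None, []))  # silent Ictus
            units[-1].remiss.append(syl.text)
    return units


def parse_jrm(syllables):
    """Ternary ANA + Ictus + Remiss units; a juncture closes the current NRU."""
    units = []
    ana = []
    nru_open = False
    for syl in syllables:
        if syl.stressed:
            units.append(Unit(ana, syl.text, []))
            ana = []
            nru_open = True
        elif nru_open:
            units[-1].remiss.append(syl.text)
        else:
            ana.append(syl.text)
        if syl.break_after:
            nru_open = False  # later unstressed syllables join the next Anacrusis
    if ana:
        units.append(Unit(ana, None, []))  # unstressed syllables with no following Ictus
    return units


EXAMPLE = [
    Syllable("it"), Syllable("was"), Syllable("TER", stressed=True),
    Syllable("ri"), Syllable("fy"), Syllable("ing", break_after=True),
    Syllable("to"), Syllable("SEE", stressed=True, break_after=True),
]

for label, parser in (("ARM", parse_arm), ("JRM", parse_jrm)):
    print(label)
    for unit in parser(EXAMPLE):
        print("   ", unit)
```

For the example sequence the ARM parser needs a unit with no audible Ictus for it was and assigns to to the Remiss of TER, whereas the JRM parser places it was and to in Anacrusis slots, matching the rows of Table 1.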
3 The syllable is the mother of the groove
3.1 The case of the missing Ictus and the missing Remiss
There is a puzzling feature of foot-timed rhythms in English and prosodically similar languages, which has not been adequately discussed in the extensive literature so far. If rhythm is alternation between stressed and unstressed syllables, then how are cases such as (1) accounted for, in which only stressed syllables occur? The Abercrombian model postulates that in sequences which begin with unstressed syllables, there is a ‘silent Ictus’, and sequences which end with an Ictus have a ‘silent Remiss’, yielding yet another entity praeter necessitatem. But in the Jassem model, too, there is no account of what is rhythmical in the pattern of the missing Remiss: to state that ANA and Remiss are optional is a descriptive statement with no systemic or phonetic explanatory value. The ‘puzzle of the absent Remiss’ which is shared by both the JRM and the ARM models is representative of the entire literature.
3.2 The Alternating Syllable Model: syllables as feet
More generally, what are the units which alternate in rhythms of languages (and corpus events in general) which are syllable-timed? The answer to this question is the key to resolving the puzzle of the missing Remiss in both the Abercrombie and the Jassem models, with its apparent lack of alternation and therefore lack of rhythm. The solution to this problem is surprisingly simple and, once stated, surprisingly obvious, but leads through a detailed discussion of syllable phonotactics and morphonotactics. First, in cases such as (1), and in syllable timing in general, the pre-peak, peak, post-peak alternation is between vowels and consonants. The CVC alternation is evidenced most clearly in languages like English which have complex syllables, such as splints /splɪnts/, with complex consonant clusters in both Onset (O) and Coda (C) of the syllable, which alternate with the sonorant Nucleus (N) in the pattern (O N C)+ (the superscript ‘+’ indicates a sequence consisting of at least one occurrence of the pattern in parentheses). Interactions between the Nucleus and the Coda within the Kernel (the term ‘rhyme’ is a misleading metaphor in view of its specific and different meaning in poetry) are discussed below. English has many closely interrelated syllable templates; the splints /splɪnts/ type is just one. The sequence /s/+/p/+/l/+/ɪ/+/n/+/t/+/s/ can be categorised as s+O+L+V+S+O+s, i.e. /s/ followed by obstruent, liquid, vowel, sonorant, obstruent, and terminated by /s/, with each subsequence subject to complex co-occurrence constraints (Gibbon 2001). The category sequence s O L V S O s is parsed on distributional grounds as s O L (Onset), V S (Nucleus), O s (Coda). Distributionally, the Nucleus is /ɪn/ with the pattern V S, a short vowel followed by a sonorant consonant. Likewise, the Coda is distributionally relatively simple, with the template O s for obstruent followed by the morphologically determined /s/. The parse /spl/ + /ɪn/ + /ts/ is done differently in traditional syllable analyses, but is justified at the lexical level by distributional constraints: the V S pattern of /ɪn/ is substitutable by a long vowel pattern V V, a vowel plus approximant V A, or a short vowel V, retaining the same Onset and Coda structures. Syllable models which use an a priori phonetic categorisation of sonorants as consonants rather than a systemic distributional motivation assign the S component to the Coda, which conflicts with the distributional facts about English syllable structure. However, using a criterion of perceptual prominence, which is relevant for rhythm, the distributionally motivated parse /spl/ + /ɪn/ + /ts/ has to give way to /spl/ + /ɪ/ + /nts/: /ɪ/ is a prominence peak and /n/ has intermediate prominence between /ɪ/ and /t/ on sonority scales. There is no ‘right’ and ‘wrong’ between distribution-based analysis and sonority-based analysis. There is no single syllable structure: in systemic terms, there are simply two perspectives on syllable structure, justified by independent empirical criteria of distribution and sonority, respectively. The lexical distributional pattern /spl/ + /ɪn/ + /ts/ is thus reanalysed as /spl/ + /ɪ/ + /nts/ in terms of sonority. The sonority alternation criterion provides valuable further information: it explains, rather than just describes, the preference of the languages of the world for CV patterns over V syllable patterns (Jakobson 1941 et multi): the CV pattern guarantees a rhythmic alternation of vowels and consonants. In cases where both Onset and Coda are missing, as in sequences of V-only syllables, this is not the end of all alternation: other low sonority phonetic means intervene to preserve the alternation, for example glottal stops, approximants, or amplitude and pitch modulations.
In English, the second mora of long vowels, or the glide of diphthongs, serves as the weak element. Where vowels are adjoined with liaison, a transition between sonority levels occurs. In interjections such as A-a-a-a! [aʔaʔaʔa], a glottal stop intervenes and preserves alternation.
3.3 The Syllable Extension Model

The basis for distinguishing foot-timed rhythms and syllable-timed rhythms has now been explained. But how is the co-occurrence of syllable-timed cases such as example (1) in English with various different kinds of foot-timed rhythms accounted for? The solution to relating syllable-timed and foot-timed events is, as with the justification of the JRM, partly systemic: the Onset and the Coda of the basic
Speech rhythms – modelling the groove
syllable event are extended by extrametrical ‘weak syllables’ with very different phonotactic and often morphotactic status from the ‘strong’ stressed syllable. Some weak syllables are morphologically determined (affixes), some are historically opaque and purely phonological (e.g. in latinate words such as complaint or solid), and others are short grammatical words such as determiners, pronouns, auxiliary verbs, prepositions and conjunctions, which have clitic-like behaviour in informal and fast speech styles. The word splints is already an example of morphological extension: the sequence /ts/ has a morphonotactic juncture {t+s}. Words like solidly /ˈsɔlɪdlɪ/ have post-Coda syllables (the first /l/ is ambisyllabic, i.e. both Coda and Onset); these are distributionally highly constrained ‘weak syllables’, in one case morphologically opaque, /lɪd/, in the other morphologically transparent, /lɪ/, a derivational morpheme. The morphonotactic structure is {ˈsɔlɪd+lɪ}. Other words like complaint /kəmˈpleɪnt/ have a pre-Onset syllable such as /kəm/ whose original Latin morphological status is opaque in English. In cases such as unlikely /ənˈlaɪklɪ/ {ən+ˈlaɪk+lɪ} the pre-Onset extension is morphologically transparent as a derivational morpheme. Typically, the weak Syllable Extensions are either unstressed or acquire alternating secondary stresses.
3.4 Syllables as feet, feet as syllable extensions

Having characterised syllables as alternating units with the ASM analysis, and as Onset extensions and Coda extensions in the SEM analysis, the remaining step to linking syllable-timed and foot-timed rhythms is a very small one: the basic rhythm event type is the ASM. In the SEM, the Onset and the pre-Onset extension join as the Anacrusis, the pre-peak component of the GSM pattern, and the Coda and the post-Coda extension join as the Remiss. Consequently, where there is no Remiss in the traditional sense as a sequence of unstressed syllables, the Coda steps in as the minimal Remiss. The principles of syllable timing and foot timing therefore no longer need to be seen as mutually exclusive categories: they are the ends of a continuum which, at the ‘foot’ end, may be extended arbitrarily by segment and syllable reduction in informal and fast speech. The problem of showing the relationship between syllable-timing and foot-timing, both when they co-occur in a given language and typologically when they occur in corpora of different languages, is therefore solved in a coherent and explanatory way by combining the Alternating Syllable Model (ASM) with the Syllable Extension Model (SEM): differences arise from differing phonotactic and timing constraints on Onset Extension and Coda Extension sequences.
It is tempting to search for further similarities between syllables and their components and feet and their components: syllables and TRUs share a similar hierarchical structure; the Onset has properties of distributional independence from the Kernel and the Anacrusis has properties of temporal independence from the NRU; the peak property is shared by Nucleus and Ictus (which for stressed vowels are in any case identical by definition); the Nucleus and Coda are distributionally interdependent and Ictus and Remiss are temporally interdependent; both the Coda and the Remiss are variable in respect of timing, lenitions, assimilations and reductions. On the basis of the preceding argumentation, the conclusion is drawn that the syllable is at the core of all rhythmical timing patterns. But rhythms are not necessarily based only on durational relationships at one level: there are other contributory factors to peak prominence such as timing properties of constituents, as well as pitch patterning. An obvious relation to look for is the patterning of durations between hierarchically close constituents of a rhythmic pattern. Accordingly, discussion of phonetic timing properties of syllables and their constituents is the next step.
4 Groovy phonetics

4.1 Timing patterns: durations and duration differences

In the preceding sections, discussion has been purely on phonotactic and morphonotactic lines. A detailed quantitative empirical analysis is not feasible in the present context, but the main lines of ASM-SEM based research can be outlined straightforwardly, starting by pointing out structural similarities between ternary syllable structure and the ternary JRM as the basis for rhythmic patterning (Figure 3). Next, the timing patterns of syllables and syllable components are discussed, followed by discussion of the phonetic consequences of the JRM and ARM timing patterns. The methods start with an annotated speech recording (Figure 4).
Figure 3: Structural similarity between syllable and rhythm unit patterns.
Figure 4: Recording of “a tiger and a mouse were walking in a field”, manually annotated on 9 tiers (syllable structure, the Jassem Rhythm Model and the Abercrombie Rhythm Model).
The procedure involves manual annotation of the data, and automatic processing of the manually annotated data:
1. Manual annotation of recorded speech on separate tiers, by syllable and syllable constituent labels, and by the categories of the JRM and the ARM.
2. Extraction of the durations of intervals on each tier.
3. For the n intervals on each tier, calculation of the n-1 differences between the duration of each interval k and that of the following interval k+1, starting with the first interval.
4. Calculation of the absolute difference between neighbouring durations (i.e. conversion of negative difference values into positive difference values).
5. Calculation of linear regression over duration patterns.
6. Display of the resulting patterns.
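Steps 2 to 5 of this procedure can be sketched in a few lines of Python. This is a minimal illustration, not the tooling used for the figures: the function names, the sign convention for the differences (following interval minus current interval) and the example durations are all assumptions made for the sketch.

```python
def duration_differences(durations):
    """Return the n-1 signed differences between neighbouring durations."""
    # Assumed sign convention: following interval minus current interval,
    # so a positive value marks a short-to-long transition.
    return [later - earlier for earlier, later in zip(durations, durations[1:])]


def absolute_differences(durations):
    """Return the n-1 absolute differences (boundaries of either direction)."""
    return [abs(diff) for diff in duration_differences(durations)]


def linear_regression(values):
    """Least-squares line over values against positions 0..n-1.

    Returns (slope, intercept).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values for a regression line")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical interval durations in milliseconds (not measured data).
durations_ms = [100, 360, 110, 350, 105, 365]

print(duration_differences(durations_ms))
print(absolute_differences(durations_ms))
print(linear_regression(durations_ms))
```

Plotting the four series (durations, signed differences, absolute differences and the regression line) against interval position would give a display of the kind named in step 6.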
This procedure, with normalised durations, was designed for quantitative studies. In the present context, no normalisation of durations or duration values is performed because the displays are intended for ‘eyeballing’ visual interpretation rather than for quantitative analysis. The rationale behind these visualisations of timing patterns is that popular ‘rhythm metrics’ rely on these kinds of empirical data as a starting point for measurement (cf. Section 6).
4.2 Syllable timing patterns

Using the method outlined above, the duration relations in phoneme sequences and in syllable sequences in the extract from a read-aloud spoken narrative shown in Figure 4 were calculated. The ultimate constituents of syllables are segmental phones (corresponding to phonemes); the durations and duration differences in the phone pattern are shown in Figure 5.¹ Inspection of the phone duration sequence shows, as expected, that the durations peak on the stressed long vowels, i.e. the diphthongs /aɪ/ and /aʊ/, and the long vowels /ɔː/ and /iː/. The Onset consonants /t/, /m/ and /f/ are approximately the same length as the following vocalic segment (the /f/ is even slightly longer); the Sonorant /l/ in /fiːld/ is also longer than the /iː/. The difference function is an ‘edge detection’ function, and effectively indicates boundaries or transitions between longer and shorter units; the plain difference function shows the weak-strong or strong-weak directionality of the transition, and the absolute difference function simply shows boundaries of whatever direction. Prima facie, the absolute difference function shows boundaries which occur at relatively even distances in terms of segment counts (not necessarily in terms of time); in this case the regularity may be an accident of the phonotactic structure of the utterance.
1 Interpretation of lines in Figure 5, Figure 6 and Figure 7: dark line: durations; dark grey line extending below zero: differences between neighbouring durations; light grey line above zero, meeting the difference line on positive peaks: absolute – positive – difference between durations; straight line: linear regression over durations.
Figure 5: Phoneme and syllable sequence durations and duration differences.
Another obvious regularity is found in the durations of the stressed vowels, which are between about 110 and 120 milliseconds. The flat regression line shows an unexpected property: the duration distributions of the phones tend to remain constant throughout the utterance. The syllable timing patterns, on the other hand, show a different picture: there is no obvious regularity in the syllable boundary distribution, unlike the phone distribution. Further, syllable lengths increase over the length of the utterance, both overall (as shown by the slightly rising regression line) and in the stressed syllables: /taɪ/ and /maʊs/ increase in length, a pattern which is repeated in the second half, also overall slightly higher, with the syllables /wɔːk/ and /fiːld/, marking a hierarchical timing structure (cf. Campbell 1992; Gibbon 2006). This increase in the duration of syllables in general and of the stressed syllables in particular, within the first part and again within the second part, reflects a grammatical Subject-Predicate structure (‘a tiger and a mouse’, ‘were walking in a field’), a systemic correspondence already noted in the cases of the Pike (1945) and Jassem (1949, 1952) approaches. Before moving on to an analysis of the foot timing patterns, a higher level patterning can thus already be discerned. However, there may be a very simple
explanation for these increases in length: final syllable lengthening, shorter in the initial group, longer in the final group.²
4.3 Jassem timing patterns

The same data collection, analysis and duration display method was applied for the JRM, looking first at the ternary sequences ANA-Ictus-Remiss (top) and then at the two types of rhythmical unit ANA and NRU (Figure 6).
Figure 6: Foot and foot constituent durations and duration differences – Jassem Rhythm Model.
The regression lines in both the displays of the Jassem model confirm the impression given by the syllable pattern (though not the phone pattern), that the units
2 A strict caveat is in order: the example was chosen with the aim of illustrating a method, and therefore represents only an informal demonstration of hypotheses leading to possible generalisations. Further, the read-aloud data may not be generalisable to other, more spontaneous speech registers. For the generalisations themselves, quantitative analysis is required, and this is the topic of separate studies (Gibbon 2006).
increase in length over the course of this utterance in each case. In the top display of Figure 6, the lengths of the Ictus instances reflect the increase in length of the stressed syllables which has already been noted, unsurprisingly, since the Ictus is a stressed syllable by definition. There is no obvious pattern over the entire ternary sequence (top). However, the bottom display, which just relates the two rhythm unit types ANA and NRU, shows a very striking distribution indeed: the durations show an unexpected, conspicuous alternating ‘zig-zag’ pattern, with NRU lengths of about 360 milliseconds and ANA lengths of just over 100 milliseconds (including the initial ANA ‘a’), with an overall fairly even TRU length of about 460 milliseconds. Obviously, if both ANA and NRU sequences separately tend to isochrony, then the TRU also tends to isochrony and there is at first glance no clear advantage to regarding the ANA as a separate category in this example: duration(ANA+NRU) = duration(NRU+ANA) = duration(TRU). What does come out very clearly, though, is that – unlike the length of the Anacrusis – the length of the NRU is not a function of the number of syllables: the NRU retains its length whether there is a Remiss (as in ‘tiger’, ‘walking’) or not (‘mouse’, ‘field’); in the latter cases, the ASM-SEM model of ‘syllables as feet’ comes to bear. In consequence, what also comes out is that there is a clear difference between the Remiss sequence of unstressed syllables, whose length is interdependent with the length of the Ictus, and the ANA sequence of unstressed syllables, whose length is independent of that of the Ictus. It is an interesting new possibility that Anacrusis and NRU may be equi-durational. These are useful hypotheses for quantitative studies, though of course they may be refuted. Again, the read-aloud data are used with illustrative, not quantitative, intention.
4.4 Abercrombie timing patterns

The ARM was investigated (Figure 7) using the same criteria as were used for the JRM. As with the syllable and the JRU analyses, the regression lines show a slight increase in durations during the utterance. However, the ARM bundles Jassem’s ANA together with the Remiss and this does not result in any obvious kind of evenness or alternation. The bottom display, on the other hand, repeats the result of the JRM investigation: when ANA is bundled together with NRU (= Ictus + Remiss), as expected, very even patterns result. The ‘odd-man-out’ is the initial unstressed syllable ‘a’, which does not fit comfortably into the ARM, while it is easily handled by the JRM.
Figure 7: Foot and foot constituent durations and duration differences – Abercrombie Rhythm Model.
4.5 Conspectus of syllable and foot models

A number of interesting observations have emerged as a source of possible generalisations and hypotheses for quantitative modelling:
1. The distribution of phone segment durations tends to remain constant over an utterance.
2. The distribution of durations of syllables and foot based units tends to increase over an utterance (which may be accidental and due to the chance distribution of phone segments within syllables).
3. Ictus durations tend to increase hierarchically over grammatical units such as Subject sequences, Predicate sequences and Subject-Predicate sequences.
4. Anacrusis sequences tend to be equal in length, i.e. isochronous (a hypothesis which is contrary to Jassem’s idea that they are not isochronous).
5. Narrow Rhythm Units tend to be equal in length, i.e. isochronous (which is Jassem’s hypothesis).
6. Anacrusis unstressed syllable sequences tend to be as long as the entire Narrow Rhythm Unit (and a fortiori longer than Remiss sequences of unstressed syllables).
7. The JRM produces a range of very clear hypotheses, while the ARM only produces one hypothesis, which is a subset of the Jassem Rhythm Model hypotheses, namely that Abercrombie’s RU and Jassem’s TRU both tend to be isochronous in this particular example, but that Abercrombie’s model fails to integrate the initial unstressed syllable into the pattern.
5 Measuring the groove

Starting with a clear understanding of the nature of rhythm, and in particular of the relationship between the syllable-timed and foot-timed rhythm styles, a number of popular ‘rhythm metrics’ of recent years can be briefly investigated. These (and many other) metrics and their empirical significance have been discussed in detail elsewhere (Gibbon 2006; Gut 2012), so discussion in the present context can be restricted to just a few representative metrics and their formal validity for ‘modelling the groove’.
5.1 Mean Foot Length (MFL) and Percentage Foot Deviation (PFD) metric

The metric proposed by Roach (1982) is a little difficult to extract from the textual description. A plausible interpretation is that the MFL-PFD metric defines the Mean Foot Length in the obvious way by averaging foot lengths, and then derives the Percentage Foot Deviation as a simplified analogy for standard deviation. While for standard deviation the square root of the mean squared difference between each value and the mean is calculated, in the Roach formula the sum of the absolute differences between each length and the mean length is divided by the overall length of the sequence and converted to a percentage:

$$\mathrm{MFL} = \frac{\sum_{i=1}^{n} |\mathrm{foot}_i|}{n}$$

$$\mathrm{PFD} = 100 \times \frac{\sum_{i=1}^{n} |\mathrm{MFL} - \mathrm{len}(\mathrm{foot}_i)|}{n \times \mathrm{MFL}}$$

The expression $|\mathrm{foot}_i|$ here means the length of $\mathrm{foot}_i$, not its absolute value. The total differences from the mean length are expressed as a fraction of the overall length. Evidently, if all feet are equal in length, the differences are zero, the sum of differences is zero, and the fraction of the overall length is zero. Therefore if
PFD approaches zero, the feet are more perfectly isochronous, and the more the PFD differs from zero, the more irregular the timing is. That is the theory. But it does not work as advertised, for several reasons:
1. The measure is not normalised for speech rate: if the speech rate varies over the utterance, this can lead to an artificially high PFD, even though the local duration differences are rather small.
2. More seriously: the more the PFD differs from zero, the less we know about what it is actually measuring, because it is a global measure, averaging over the lengths of all feet (or whatever unit is used) within the entire unit. The same duration values can be scattered over the utterance in different, even random, orders and still yield the same PFD.
Evidently, the PFD is measuring something like ‘smoothness’ of durations averaged over a whole utterance. It says nothing about the actual alternating rhythmic structure. Consequently, while for very low values the PFD can be a useful indicator of ‘smoothness’ of sequences of a particular type of unit such as the foot, higher values are uninterpretable, meaning simply ‘roughness’. As a measure of relative isochrony, i.e. of the relative ‘smoothness’ of syllable and foot timing, the PFD metric has successfully and consistently discriminated between corpora of different languages, but it has neither measured rhythms nor explained them.
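The order-insensitivity of the PFD can be checked with a small sketch. It follows the interpretation given above (mean of the foot lengths, then the summed absolute deviations from that mean as a percentage of the overall length); the function names and the foot durations are assumptions chosen for illustration, not measured values.

```python
import random


def mean_foot_length(foot_lengths):
    """Mean Foot Length: the plain average of the foot durations."""
    return sum(foot_lengths) / len(foot_lengths)


def percentage_foot_deviation(foot_lengths):
    """Summed absolute deviation from the mean, as a percentage of total length."""
    mfl = mean_foot_length(foot_lengths)
    total_deviation = sum(abs(mfl - length) for length in foot_lengths)
    return 100 * total_deviation / (len(foot_lengths) * mfl)


# Hypothetical foot durations in milliseconds: a strict short-long alternation.
feet_ms = [300, 500, 300, 500, 300, 500]

# The same values in a different order (fixed seed so the run is repeatable).
shuffled_ms = feet_ms[:]
random.Random(0).shuffle(shuffled_ms)

print(percentage_foot_deviation(feet_ms))
print(percentage_foot_deviation(shuffled_ms))
print(percentage_foot_deviation(sorted(feet_ms)))
```

All three calls give the same value, although only the first sequence alternates: the index registers how far the durations spread around their mean and nothing about their order.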
5.2 Rhythmic Irregularity Measure (RIM) metric

The Rhythmic Irregularity Measure, RIM (Scott et al. 1986), calculates the sum of the logarithms of the ratios between all durations of non-identical intervals in the utterance:

$$\mathrm{RI} = \sum_{i \neq j} \log \frac{I_i}{I_j}$$

Although the ratio looks like a very different kind of measure, it is also a global measure which can also only differentiate along a scale between ‘smooth’ and ‘rough’. As it stands, it is also dependent on the length of the utterance: the log of the ratios is summed for each pair of intervals. Crucially, as with the PFD, the ordering of the intervals does not matter: a random ordering of the same values yields the same index. Consequently, the RIM, like the PFD, measures relative isochrony, i.e. relative ‘temporal smoothness’ as opposed to ‘temporal roughness’ or inequality, and not rhythm. Nevertheless, in this capacity, the RIM has consistently succeeded in discriminating between corpora of languages with different temporal patterning, but it has neither measured nor explained rhythms.
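The two properties just noted for the RIM, dependence on utterance length and indifference to ordering, can be illustrated as follows. One interpretive assumption is needed: read literally, a sum of log ratios over all ordered pairs cancels to zero, since log(a/b) = -log(b/a), so the sketch sums the absolute value of each log ratio. That choice, the function name and the interval values are assumptions of this sketch.

```python
import math
from itertools import permutations


def rhythmic_irregularity(intervals):
    """Sum |log(I_i / I_j)| over all ordered pairs of distinct positions i != j."""
    # Assumption: the absolute value is taken; without it the terms cancel.
    return sum(abs(math.log(a / b)) for a, b in permutations(intervals, 2))


# Hypothetical interval durations in milliseconds.
intervals_ms = [100, 200, 400]

print(rhythmic_irregularity(intervals_ms))        # some order
print(rhythmic_irregularity([400, 100, 200]))     # same values, reordered
print(rhythmic_irregularity(intervals_ms * 2))    # same pattern, twice as long
print(rhythmic_irregularity([200, 200, 200]))     # perfectly even intervals
```

Reordering leaves the index unchanged, whereas repeating the same pattern raises it, so values from utterances of different lengths are not directly comparable without further normalisation.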
5.3 The normalised Pairwise Variability Index metric

One of the most popular ‘rhythm metrics’ (Low, Grabe & Nolan 2000) is the normalised Pairwise Variability Index (nPVI), which averages the normalised duration differences between neighbouring intervals and multiplies the result by 100. Normalisation is carried out by dividing the difference between durations by their mean duration:

$$\mathrm{nPVI} = 100 \times \sum_{k=1}^{m-1} \left| \frac{d_k - d_{k+1}}{(d_k + d_{k+1})/2} \right| / (m - 1)$$
The nPVI ranges from 0, for totally equal durations, towards an asymptote of 200 for ever ‘noisier’ sets of unequal durations. The limit of 200 arises from normalisation by average (here: division by 2). If normalisation were by the sum of durations, then the asymptote would be 100, a percentage. The nPVI has also been used successfully to classify corpora of different languages relatively consistently, but cf. Gut (2012) for discussion of inconsistencies. Like the other metrics, the nPVI has its problems as a model of rhythm, though in principle the model looks as though it would work for binary rhythms, where neighbours alternate. But this turns out to be a vain hope:
1. It was noted in cases (1) to (6) that speech rhythm in English is not binary.
2. Taking the absolute values of differences destroys the strong-weak versus weak-strong directionality of duration change which characterises rhythms: there is no alternation any more.
3. There is another formal problem with taking the absolute difference: many different kinds of duration sequence can produce the same index. It is easy to check that a regular, rhythmical alternation of two durations in the ratio 1:2 produces an nPVI value of 66.66'. It is also easy to check that monotonically increasing or decreasing geometrical series with the same ratio between neighbours yield the same nPVI value of 66.66'.
4. Also any combination of such series yields the same value of 66.66'. Similarly, an alternation in the ratio 1:3 yields an nPVI of 100, and so do the corresponding geometrical series.
Normalisation for speech rate evidently has a serious down side. The nPVI is thus also a measure of ‘smoothness’ and ‘roughness’, either of alternation or of geometrical progression, though it implicitly embodies a constraint against random orderings, unlike the other metrics, and it normalises for speech tempo changes. Like the other metrics, the low values, which indicate evenness, may be related to rhythm, but for the high values it is not clear what is being measured, apart from a degree of unevenness in the duration set. Again, like the other metrics, the nPVI has been used successfully in discriminating different timing patterns in corpora of different languages, though it has been neither measuring rhythms nor explaining them.
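The point that alternations and geometrical series are indistinguishable to the nPVI can be checked directly. The sequences below are illustrative choices whose neighbouring durations stand in the ratio 1:2 (and, for the last case, 1:3), picked to reproduce the index values mentioned above; they and the function name are assumptions, not data.

```python
def npvi(durations):
    """Normalised Pairwise Variability Index over neighbouring durations."""
    pairs = list(zip(durations, durations[1:]))
    # Each absolute difference is normalised by the mean of the two durations.
    total = sum(abs(a - b) / ((a + b) / 2) for a, b in pairs)
    return 100 * total / len(pairs)


alternating = [2, 4, 2, 4, 2, 4]          # regular short-long alternation
increasing = [2, 4, 8, 16, 32, 64]        # geometrical series, no alternation
decreasing = list(reversed(increasing))   # the same series, falling
combined = [2, 4, 2, 4, 8, 16, 8, 4]      # a mixture of both kinds

for name, seq in [("alternating", alternating), ("increasing", increasing),
                  ("decreasing", decreasing), ("combined", combined)]:
    print(name, round(npvi(seq), 2))

print("ratio 1:3 alternation", round(npvi([1, 3, 1, 3]), 2))
print("ratio 1:3 series", round(npvi([1, 3, 9, 27]), 2))
```

Because each term depends only on the ratio between two neighbours, and the absolute value discards the direction of change, any sequence whose neighbours all stand in the same ratio receives the same index.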
5.4 ΔC, %C; ΔV, %V segmental sequence ratio metrics

A number of descriptive statistical measures were introduced by Ramus and associates (2002; 1999), and were successfully used to discriminate between corpora of different languages. However, a close look shows that the measures essentially reflect the phonotactic structure of the language:
1. The standard deviation of consonantal interval durations (ΔC) relates directly to the complexity of consonant clusters: in CV languages (often associated with ‘syllable timing’), ΔC may be predicted to be low; in CCCVCCC languages like English (often associated with ‘foot timing’ or ‘stress timing’), ΔC may be predicted to be very high.
2. The standard deviation of vocalic interval durations (ΔV) also relates directly to syllable phonotactics: if a language has a vowel length contrast (English, German), then ΔV may be predicted to be higher than if a language does not have a vowel length contrast (Polish).
3. The percentage of consonantal intervals (%C) in relation to vocalic intervals, or the converse (%V), is a function of the complexity of the phonotactics of the language.
Like the other metrics, the Ramus metrics are ‘smoothness’ metrics, but in addition, being easily relatable to the phonotactics of languages, they can potentially reflect the genuinely rhythmical property of alternation, since we know – a priori, as it were – that consonants and vowels alternate with each other: syllable-timed corpora will evidently tend to cluster towards the low ends of the ΔC and ΔV scales for reasons which are independent of any phonetic measurements, and to have lower %C and higher %V ratios. And this is indeed what the studies find by phonetic measurements. Like the other metrics, the segment sequence ratio metrics have also been successfully used to discriminate between corpora of different languages. But, like the other metrics, the segment sequence ratio metrics also do not measure rhythm.
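The measures listed above can be computed from a sequence of labelled consonantal and vocalic intervals, as in the following sketch. Several choices are assumptions made here rather than taken from the text: the population standard deviation is used for ΔC and ΔV, %C and %V are computed as shares of total duration, and the interval labels and durations are invented for illustration.

```python
import statistics


def segmental_metrics(intervals):
    """Compute deltaC, deltaV, %C and %V from (label, duration) pairs.

    Labels are 'C' for consonantal and 'V' for vocalic intervals.
    """
    consonantal = [dur for label, dur in intervals if label == "C"]
    vocalic = [dur for label, dur in intervals if label == "V"]
    total = sum(consonantal) + sum(vocalic)
    return {
        "deltaC": statistics.pstdev(consonantal),  # assumed: population SD
        "deltaV": statistics.pstdev(vocalic),
        "percentC": 100 * sum(consonantal) / total,
        "percentV": 100 * sum(vocalic) / total,
    }


# Hypothetical interval durations in milliseconds, alternating C and V.
utterance = [("C", 80), ("V", 120), ("C", 120), ("V", 80)]

# The same intervals with consonants and vowels grouped, so nothing alternates.
regrouped = [("C", 120), ("C", 80), ("V", 80), ("V", 120)]

print(segmental_metrics(utterance))
print(segmental_metrics(regrouped))
```

Both calls return identical values: the measures summarise the two sets of durations and carry no information about whether consonants and vowels actually alternate in the signal.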
5.5 What do we do with the ‘smoothness metrics’?

The ΔC, ΔV, %C and %V ratio metrics reflect the syllable patterns in speech corpora and hence, in principle, they can be used to detect the presence or absence of different kinds of alternating pattern or rhythm, though in practice they are not used in this way: if all the consonants are put together in random order, and all the vowels are put together in random order, the metrics still yield the same values.

The situation with the other metrics which are concerned with ‘smoothness’ versus ‘roughness’ measures is a little different. They are general metrics, and can in principle be applied to any kind of flow, whether units of speech or distances between cars on the road. They can provide a measure for whether units at a given level are more or less isochronous, but they cannot tell what non-isochrony, i.e. temporal inequality, means, and are therefore ignorant about rhythmic alternations or lack of rhythmic alternations. In order to obtain information about rhythmic alternations, the temporal properties of the constituents of the units measured must be considered (Asu & Nolan 2006; Nolan & Asu 2009). So if, for example, only one size of interval, such as the foot, is investigated, not much can be said about the internal structure of the interval. But if metrics are used to compare the ‘smoothness’ of interval sequences at different levels, such as syllables as well as feet, then the results can at least be used to determine whether the data are located on a scale between syllable-timed and foot-timed rhythms: if the syllable sequence has the lower PFD, RIM or nPVI, then the utterance is more syllable-timed, and if the foot has the lower PFD, RIM or nPVI, then the utterance is more foot-timed. By triangulating different locations in the architecture of prosody in this way, from phones through syllables and feet to larger units, general statements about the ‘syllableness’ or the ‘footness’ of timing can be proposed.
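The comparison across levels described here amounts to applying one metric to two tiers and seeing which is smoother. A minimal sketch follows, using the nPVI as the metric; the function names, the returned labels and the duration values are assumptions made for illustration, and any of the other ‘smoothness’ metrics could be passed in instead.

```python
def npvi(durations):
    """Normalised Pairwise Variability Index over neighbouring durations."""
    pairs = list(zip(durations, durations[1:]))
    total = sum(abs(a - b) / ((a + b) / 2) for a, b in pairs)
    return 100 * total / len(pairs)


def timing_tendency(syllable_durations, foot_durations, metric=npvi):
    """Report which tier is smoother under the given metric."""
    syllable_score = metric(syllable_durations)
    foot_score = metric(foot_durations)
    if syllable_score < foot_score:
        return "more syllable-timed"
    if foot_score < syllable_score:
        return "more foot-timed"
    return "undecided"


# Hypothetical tiers (milliseconds): even syllables, uneven feet.
print(timing_tendency([200, 205, 195, 200], [400, 800, 300, 900]))

# Hypothetical tiers: uneven syllables, even feet.
print(timing_tendency([100, 300, 100, 300], [400, 400, 400]))
```

The result locates the data on the scale between the two timing types but, as with the metrics themselves, says nothing about the mechanism that produces the pattern.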
The ASM-SEM model of syllable alternation and extension developed in the present study predicts that the position of a speech corpus on a scale between syllable-timed events and foot-timed events is a function of the number of Syllable Extensions and their duration. Nevertheless, the metrics still do not explain, in the sense in which, for example, the Jassem model together with the ASM-SEM explains, the mechanisms by which syllable-timed or foot-timed rhythms operate.
6 Rocking the groove

The remaining open issue is, then: if the descriptive ‘smoothness metrics’ cannot capture rhythm in any transparent fashion, what can? What would actually count as a model of an alternating, ‘rocking’ rhythmic sequence, as opposed to a model of temporal regularity and irregularity? The answer lies in recognising the underlying rhythm mechanism as an oscillator. A number of oscillator models of rhythm have been proposed (O’Dell & Nieminen 1999; Cummins 2001; Barbosa 2002; Barbosa & da Silva 2012; cf. also Wachsmuth 2002; Inden et al.). Barbosa’s Coupled Oscillator Model will be singled out for brief mention (Figure 8).
Figure 8: Two-level oscillator model (after Barbosa 2002).
In Barbosa’s model, two basic rhythms, the phrase rhythm and the syllable rhythm are postulated (for Brazilian Portuguese – different constructions must be provided for English), and the syllable rhythm is influenced, ‘entrained’, by the higher level phrase rhythm. It is this interaction of rhythms at different levels which accounts for some of the complexity of the ‘groove’ of naturally performed speech rhythm. A similarly structured two-level model has been postulated by Fujisaki (1988) for the pitch patterning of accent and intonation. The oscillator models still have a long way to go, as they are too simple to account for many of the patterns already discussed in this study, but at least they model ‘genuine rhythm’, and have a well-defined formal, mathematical basis. The converse issue of oscillator models for analysing rather than generating rhythm is dealt with by Tilsen & Johnson (2008), who propose a ‘Rhythm Comb’ for identifying the low frequencies in speech which represent rhythms (Figure 9). The Rhythm Comb Model is essentially a spectrum analysis of long term properties of the speech signal in order to determine not the frequencies of the vocal spectrum, but the frequencies involved in rhythm variation. The Rhythm Comb Model of Tilsen & Johnson and Barbosa’s Coupled Oscillator Model are mathematically closely related.
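The idea of looking for rhythm in the low-frequency spectrum of the signal can be illustrated with a toy example. This is not the Tilsen & Johnson procedure: a plain discrete Fourier transform is used as a stand-in, the ‘amplitude envelope’ is a synthetic sine modulation, and the 4 Hz modulation rate, the sampling rate, the frequency ceiling and the function name are all assumptions of the sketch.

```python
import cmath
import math


def dominant_low_frequency(envelope, sample_rate_hz, max_hz=10.0):
    """Return the frequency (Hz) with the largest DFT magnitude up to max_hz."""
    n = len(envelope)
    mean = sum(envelope) / n
    centred = [x - mean for x in envelope]  # remove the constant component
    best_freq, best_magnitude = 0.0, 0.0
    for k in range(1, n // 2 + 1):
        freq = k * sample_rate_hz / n
        if freq > max_hz:
            break
        coefficient = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(centred)
        )
        if abs(coefficient) > best_magnitude:
            best_freq, best_magnitude = freq, abs(coefficient)
    return best_freq


# Synthetic envelope: 4 seconds sampled at 50 Hz, modulated at 4 Hz
# (a stand-in for a syllable-rate oscillation, not real speech).
sample_rate_hz = 50
envelope = [
    1 + 0.5 * math.sin(2 * math.pi * 4 * t / sample_rate_hz)
    for t in range(4 * sample_rate_hz)
]

print(dominant_low_frequency(envelope, sample_rate_hz))
```

With real speech the envelope would first have to be extracted from the waveform, and several spectral peaks at different rates would be expected rather than a single one.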
Figure 9: Tilsen & Johnson Rhythm Comb Model (2008).
But the issue of how to relate these oscillator models to explanatory models of prosodic structure is still open. A first step along the road to integrating different approaches to modelling and measuring into an explanatory model has been developed by Wagner (2001), using Finite State Machine (FSM) models. A related approach in the form of an extension of the Jassem Rhythm Model to permit the required iterations of syllables and feet, also using FSM models (Figure 10) is discussed in detail elsewhere (Gibbon 2006, 2012).
Figure 10: The Rhythm Oscillator Model as an extension of the Jassem Rhythm Model.
An empirical basis for FSM models of rhythmic oscillation may be expected in statistically enhanced FSMs, which are widely used for many purposes, for instance in the form of Hidden Markov Models (HMMs) in speech technology (and many other fields). The standard HMMs used in speech technology require modification for use in studying duration, however, since in general they factor duration out in favour of generalising over phone intervals.
7 The future of the groove

In this study, methodologies used in the study of rhythm have been queried. It turns out that many of the metrics and models used are not really concerned with ‘modelling the groove’, that is, the alternating beats of rhythm, but with the temporal evenness or ‘smoothness’ of sequences, focussing on the criterion of isochrony. Modelling conventions have been explored in some detail, with the aim of developing systemically and phonetically explanatory methods for integrating perspectives which have often been seen as irreconcilable. Starting with a simple model of alternating binary rhythm, the Jassem and Abercrombie models of foot-based or stress-based English rhythm were investigated. Moving beyond the limitations of these models, an integrative model of syllable-based and foot-based rhythms was proposed, the Alternating Syllable Model (ASM) coupled with the Syllable Extension Model (SEM), in which the syllable functions as the minimal foot, with a pre-peak, peak, post-peak alternating pattern. The foot-timed patterns are derived from syllable-timed patterning by Onset Extension (Anacrusis), stressed Nucleus (Ictus) and Coda Extension (Remiss), in the pre-peak, peak and post-peak positions of the Generic Rhythm Model, respectively. With the ASM-SEM approach, a coherent and comprehensive explanation of how variation between syllable-timed and foot-timed rhythms takes place, both within one and the same language and among languages, is available for the first time. The field is still wide open. Results of rhythm analysis do not only vary by the methods used and by the modelling conventions followed. Speech rhythm is highly complex and has semiotic functions of cohesion creation, in addition to its formal features of isochrony and alternation. Variation of these forms and semiotic functions by language and dialect, by social formality of style and by functional register of use has contributed to many of the inconsistencies found in rhythm analyses to date, and is as yet a largely unexplored avenue. There is much still to do in ‘modelling the groove’.
References

Abercrombie, David. 1967. Elements of General Phonetics. Edinburgh: Edinburgh University Press.
Asu, Eva Liina & Francis Nolan. 2006. Estonian and English rhythm: a two-dimensional quantification based on syllables and feet. In Proceedings of Speech Prosody 2006, Dresden, Germany.
Barbosa, Plínio A. 2002. Explaining Cross-Linguistic Rhythmic Variability via a Coupled Oscillator Model of Rhythm Production. In Proceedings of Speech Prosody 2002, Aix-en-Provence, 163–166.
Barbosa, Plínio & Wellington da Silva. 2012. A New Methodology for Comparing Speech Rhythm Structure between Utterances: Beyond Typological Approaches. In Helena Caseli, Aline Villavicencio, António Teixeira & Fernando Perdigao, eds., Proceedings of Computational Processing of the Portuguese Language: 10th International Conference, PROPOR 2012, Coimbra, Portugal, April 17–20, 2012. Berlin: Springer.
Campbell, Nick. 1992. Multi-level timing in speech. Ph.D. thesis, University of Sussex.
Chomsky, Noam & Morris Halle. 1968. The Sound Pattern of English. New York: Harper & Row.
Cummins, Fred. 2002. Speech Rhythm and Rhythmic Taxonomy. In Proceedings of Speech Prosody 2002, Aix-en-Provence, 121–126.
Dechert, Hans W. & Manfred Raupach, eds. 1980. Temporal variables in speech. Studies in Honour of Frieda Goldmann-Eisler. The Hague: Mouton.
Fujisaki, Hiroya. 1988. A note on the physiological and physical basis for the phrase and accent components in the voice fundamental frequency contour. In Osamu Fujimura, ed., Vocal physiology: voice production, mechanisms and functions, 347–355. New York: Raven.
Gibbon, Dafydd. 2001. Preferences as defaults in computational phonology. In Katarzyna Dziubalska-Kołaczyk, ed., Constraints and Preferences (Trends in Linguistics, Studies and Monographs 134), 143–199. Berlin: Mouton de Gruyter.
Gibbon, Dafydd. 2006. Time Types and Time Trees: Prosodic Mining and Alignment of Temporally Annotated Data. In Stefan Sudhoff, Denisa Lenertová, Roland Meyer, Sandra Pappert, Petra Augurzky, Ina Mleinek, Nicole Richter & Johannes Schließer, eds., Methods in Empirical Prosody Research, 281–209. Berlin: Walter de Gruyter.
Gibbon, Dafydd, Daniel Hirst & Nick Campbell, eds. 2006. Rhythm, Melody and Harmony in Speech. Studies in Honour of Wiktor Jassem, 83–94. Poznań: Polskie Towarzystwo Fonetyczne (Polish Phonetics Association).
Gibbon, Dafydd & Helmut Richter, eds. 1984. Intonation, Accent and Rhythm: Studies in Discourse Phonology. Berlin: Mouton de Gruyter.
Gut, Ulrike. 2012. Rhythm in L2 speech. In Dafydd Gibbon, Daniel Hirst & Nick Campbell, eds., Rhythm, Melody and Harmony in Speech. Studies in Honour of Wiktor Jassem, 83–94. Poznań: Polskie Towarzystwo Fonetyczne (Polish Phonetics Association).
Hirst, Daniel & Caroline Bouzon. 2005. The effect of stress and boundaries on segmental duration in a corpus of authentic speech British English. In Proceedings of Interspeech 2005, Lisbon, 29–32.
Inden, Benjamin, Zofia Malisz, Petra Wagner & Ipke Wachsmuth. 2012. Rapid entrainment to spontaneous speech: A comparison of oscillator models. In Naomi Miyake, David Peebles & Richard P. Cooper, eds., Proceedings of the 34th Annual Conference of the Cognitive Science Society. Austin, TX: Cognitive Science Society.
Jakobson, Roman. 1940. Kindersprache, Aphasie und allgemeine Lautgesetze. Uppsala: Almqvist & Wiksells.
Jassem, Wiktor. 1949. indikeiʃn əv spiːtʃ riðm in ðə traːnskripʃn əv edjukeitid sʌðən inɡliʃ (Indication of speech rhythm in the transcription of educated Southern English). Le Maître Phonétique, III/92, 22–24. [Republished in Journal of the International Phonetic Association with the original IPA version and a new orthographic version by D. Hirst].
Jassem, Wiktor. 1952. Intonation of Conversational English (Educated Southern British). Prace Wrocławskiego Towarzystwa Naukowego (Travaux de la Société des Sciences et des Lettres de Wrocław). Seria A. Nr. 45. Wrocław: Nakładem Wrocławskiego Towarzystwa Naukowego.
Jassem, Wiktor, D. R. Hill & I. H. Witten. 1984. Isochrony in English Speech: its Statistical Validity and Linguistic Relevance. In Dafydd Gibbon & Helmut Richter, eds., Intonation, Accent and Rhythm: Studies in Discourse Phonology, 203–225. Berlin: Mouton de Gruyter.
Liberman, Mark Y. & Alan Prince. 1977. On stress and linguistic rhythm. Linguistic Inquiry 8, 249–336.
Low, Ee Ling, Esther Grabe & Francis Nolan. 2000. Quantitative characterisations of speech rhythm: Syllable-timing in Singapore English. Language and Speech 43, 4, 377–401.
Nolan, Francis & Eva Liina Asu. 2009. The Pairwise Variability Index and coexisting rhythms in language. Phonetica 66, 64–77.
O’Dell, Michael L. & Tommi Nieminen. 1999. Coupled oscillator model of speech rhythm. In Proceedings of the International Congress of Phonetic Sciences, San Francisco 1999.
Pike, Kenneth L. 1945. The Intonation of American English. Ann Arbor, Michigan: University of Michigan Press.
Ramus, Franck. 2002. Acoustic correlates of linguistic rhythm: Perspectives. In Proceedings of Speech Prosody 2002, Aix-en-Provence, 115–120.
Ramus, Franck, Marina Nespor & Jacques Mehler. 1999. Correlates of linguistic rhythm in the speech signal. Cognition 73, 3, 265–292.
Roach, Peter. 1982. On the distinction between ‘stress-timed’ and ‘syllable-timed’ languages. In D. Crystal, ed., Linguistic Controversies: Essays in Linguistic Theory and Practice, 73–79. London: Edward Arnold.
Scott, Donia R., Stephen D. Isard & Bénédicte de Boysson-Bardies. 1986. On the measurement of rhythmic irregularity: a reply to Benguerel. Journal of Phonetics 14, 327–330.
Tilsen, Sam & Keith Johnson. 2008. Low-frequency Fourier analysis of speech rhythm. Journal of the Acoustical Society of America 124, 2, 34–39.
Wachsmuth, Ipke. 2002. Communicative rhythm in gesture and speech. In P. McKevitt, C. Mulvihill & S. O’Nuallain, eds., Language, Vision and Music, 117–132. Amsterdam: John Benjamins.
Wagner, Petra. 2001. Rhythmic alternations in German read speech. In Proceedings of Prosody 2000, Poznan, 237–245.
Part 2: Linguistic rhythm and cognition
Maren Schmidt-Kassow, Kathrin Rothermich and Sonja A. Kotz
3 The role of default stress patterns in German monolingual and L2 sentence processing

One potato, two potato,
Three potato, four;
Five potato, six potato,
Seven potato more.
1 Introduction

Nursery rhymes such as this one can be found in many languages of the world and share one feature: they are recited in a strictly metric and repetitive manner. However, meter is not only present in nursery rhymes; it is a phenomenon which we encounter daily: we dance in step to music, we clap in time to a beat, when we ride a bicycle we pedal in a certain tactus (1–2–1–2), and metric structure facilitates the learning of sequences such as telephone numbers. We even try to “rhythmatize” monotonous, regular sounds such as drops of water from a faucet, with the first of two or more drops being perceived as more stressed than the others, a phenomenon called subjective rhythmization (Bolton 1894, Brochard et al. 2003). Without doubt, our ability to perceive rhythmic, or rather metric, structures means that rhythm and meter shape our daily life. In the following chapter we want to take this incidental evidence one step further and reflect upon the relevance of metric patterns in auditory speech and language processing. In contrast to the phenomena cited above, the role of meter in speech perception is not self-evident at first glance. Intuitively, speech seems to be unsystematic (unless one thinks of poems) and to lack the kind of underlying meter found in a piece of music. However, there is compelling evidence that languages differ according to their inherent speech rhythms and metric structures. About fifty years ago, Abercrombie (1967) claimed that languages of the world should be assigned to one of two rhythmic classes: stress-timed languages and syllable-timed languages. In stress-timed languages, stress pulses should recur at equal intervals, while syllable-timed languages are characterized by isochronously distributed syllables. However, this isochrony hypothesis has been refuted, largely because most researchers focused on production data and failed to find measurable indications of isochrony in speech. Nonetheless,
it is indisputable that languages of the world have different underlying speech rhythms (Auer 1993, Nazzi and Ramus 2003). Hence, linguists usually adhere to the classification of languages as either stress- or syllable-timed. However, the distinction between the two language groups on the basis of physical parameters seems to be more complex than initially assumed (see Auer 1993; Roach 1982). Rhythm classes may have a huge impact on how we perceive a given speech input: Coping with a continuous speech stream requires that a listener can segment the incoming acoustic signal, i.e., separate the ongoing input into single words. The successful segmentation of speech is not only necessary to access single words and their respective meanings, it is also a precursor of sequencing an utterance, that is, predicting the order of future signals (“What next?”: Large and Kolen 1994). According to a rhythm-based segmentation account (Nazzi et al. 2006), segmentation strategies in languages differ depending on the corresponding rhythm class. German, as a stress-timed language, offers a prominent alternation of stressed and unstressed syllables: the so-called trochaic meter. The “trochee” is considered to be the default meter in German (Eisenberg 1991; Féry 1997) and plays a significant role in grouping speech into smaller units. Hence, trochaic meter guides speech segmentation and has an impact on auditory speech and language comprehension in stress-timed languages (Lee and Todd 2004; Cutler and Norris 1988). Here we will address the question of how trochaic meter influences speech perception beyond speech segmentation. We will review neurolinguistic evidence supporting the role of meter extraction during auditory sentence processing, mainly focussing on the interplay between metric and syntactic processing. 
The chapter is organized as follows: First, we will introduce the terminology used in the overview, taking into account the different definitions of the terms meter and rhythm. We then provide some background on the interaction of prosody and syntax, before we finally report a series of experiments suggesting a high dependence of syntactic on metric processing.
2 Rhythm and meter: definition and delineation

“Every investigator… in his definition of rhythm is very certain and unequivocal as to its complexity. This is practically the only point that all are agreed on.” (Isaacs 1920)
Isaacs’s statement is more than 80 years old, but regrettably, still applies today. A precise terminological delineation is, therefore, essential in order to investigate the impact of meter and rhythm on speech perception. For instance, Cummins
and Port (1998) showed in the speech cycling task that language is subject to the same rhythmic constraints as motor activities such as finger tapping. Subjects were asked to repeat a short phrase, such as “big for a duck”, in time with a regular series of metronome beeps. The authors looked for regularities in speech timing as a function of the inter-beep interval, beep rate, and other factors. Targeted Speech Cycling is a development of this basic idea, in which the metronome series consists of alternating high and low beeps. Subjects try to align the beginning of a phrase with a high beep, and the onset of the second stress (“duck”) with the low beep. By varying the relative timing of the high and low beeps, the authors examined how subjects are constrained in the forms of speech timing they can produce. For English speakers three patterns were found, each of which corresponds to a ‘simple’ rhythmic pattern in which the stress foot (the interval between the onsets of stressed syllables) is neatly nested an integral number of times within the Phrase Repetition Cycle. Cummins and Port use the term speech rhythm to describe temporal patterns or grids in which speech is produced. However, other authors use the term speech rhythm to describe a part of prosody, namely an interplay of pause duration, transitions, vowel reductions, stress successions, etc., which differentiates the sound of one language from that of another. This definition is also the basis for the different rhythm classes described above. Stress-timed languages such as German or English seem to have a different duration pattern of vocalic and intervocalic intervals and a greater degree of vowel reduction (Ramus et al. 1999), as well as more complex syllables, compared to syllable-timed languages such as French. Some authors (Thiessen and Saffran 2003; Jusczyk et al. 1993) investigated the importance of rhythm in language acquisition and demonstrated that even 5-day-old newborns (Nazzi and Ramus 2003) are able to discriminate their native language from other languages solely on the basis of rhythm. Clearly, different linguistic aspects that are associated with temporal processing are labeled speech rhythm. However, concrete parameters defining speech rhythm, or a single agreed definition, are lacking. This has also led to the fact that, as noted by Handel (1989), meter is what most people characterize as rhythm. In the following section, a working definition for this chapter will be given.
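The timing measure behind the targeted speech cycling task described above can be sketched in a few lines: the onset of the second stress is expressed as a relative phase within the Phrase Repetition Cycle and compared with the nearest simple fraction. The onset times, the function names and the maximum denominator are illustrative assumptions, not values or procedures taken from Cummins and Port (1998).

```python
from fractions import Fraction

def relative_phase(phrase_onset, stress_onset, next_phrase_onset):
    """Position of the second stress within the phrase repetition cycle (0 to 1)."""
    cycle = next_phrase_onset - phrase_onset
    if cycle <= 0:
        raise ValueError("next_phrase_onset must be later than phrase_onset")
    return (stress_onset - phrase_onset) / cycle

def nearest_simple_phase(phase, max_denominator=4):
    """Closest fraction with a small denominator (assumed notion of 'simple')."""
    return Fraction(phase).limit_denominator(max_denominator)

# Hypothetical onset times in seconds for one repetition of the phrase.
phase = relative_phase(phrase_onset=0.0, stress_onset=0.52, next_phrase_onset=1.5)
simple = nearest_simple_phase(phase)
print(round(phase, 3), simple)
```

A measured phase that lies close to such a fraction corresponds to a cycle that divides evenly into that many equal parts, which is one way of reading the nesting of the stress foot within the cycle.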
2.1 Rhythm, beat, and meter

Our definitions of rhythm and meter in language are rooted in the music theory of Lerdahl and Jackendoff (1983), who defined rhythm as “the whole feeling of motion in time which compasses pulse, phrasing, harmony
and meter”. Hence, the term ‘rhythm’ designates the temporal pattern of event durations in an auditory sequence, i.e., a psychological construction consisting of beat, meter, timing and random variations. Beats are perceived pulses which mark regularly distributed points in time, either in the form of sounds (i.e., stressed syllables) or as purely hypothetical points in time, i.e., accents that are perceived without an existing physical correlate. The beat of a succession of auditory events arises from points of focused attention in a temporal sequence. Beats form the metrical grid of an auditory sequence, in music as well as in language, as they group single events into perceptual chunks (e.g., two successive stressed elements could never belong to one phrase in a language). Thus, meter is an abstract structure that is based upon a regular succession of strong and weak beats. The simplest meter is an isochronous regular metronome pulse. However, there are also meters with two or three nested periodicities in integer frequency ratios. An example is the waltz meter, in which individual beats are formed into groups of three, with every third beat being more strongly accentuated than the others. This meter contains one faster periodic cycle (every beat) and a slower one (at the level of the measure). Metric groups with an initial prominent element are preferred even if the physical stimulus is a regular succession of identical elements (Hyman 1977). We assume that, in speech, regularities are based on perceptual metric patterns. We define perceptual patterns as recurring formal events (stressed syllables or beats) that may or may not be aligned with the temporal event structure. Everyday speech usually displays dynamic variations and is not as temporally regular as classical Western music. 
However, default metric patterns and distributions (like the trochaic unit in German) are used to create implicit predictions about when the next stressed syllable should occur in an acoustic signal. To sum up, we define meter in speech as the perceptually regular recurrence of stressed and unstressed syllables (i.e., strong and weak beats), while speech rhythm describes the whole arrangement of phonemes, syllables, and prosodic phrases in time.
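The notion of a meter built from nested periodicities, as in the waltz example above, can be made concrete with a small sketch of a metrical grid. The representation is an assumption made for illustration: each beat receives one mark for every level whose period divides its position, so that group-initial beats are the strong ones. Function names and the two-level periods are hypothetical.

```python
def metrical_grid(n_beats, periods=(1, 3)):
    """Beat strengths: one mark per level whose period divides the beat index."""
    return [sum(1 for period in periods if i % period == 0) for i in range(n_beats)]

def render_grid(strengths):
    """Draw the grid with the highest (slowest) level on top."""
    rows = []
    for level in range(max(strengths), 0, -1):
        rows.append("".join("x" if s >= level else "." for s in strengths))
    return rows

waltz = metrical_grid(6, periods=(1, 3))     # beat level and measure level
trochee = metrical_grid(4, periods=(1, 2))   # strong-weak alternation
print("\n".join(render_grid(waltz)))
print("\n".join(render_grid(trochee)))
```

The second call shows the same construction for a strong-weak alternation of the trochaic kind, where the slower level has a period of two beats instead of three.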
2.2 Meter and predictability

Based on these working definitions, we interpret the speech cycling experiment (Cummins and Port 1998), described above, as providing evidence for the relevance of meter rather than rhythm in speech production. Liberman and Prince (1977) stated that rhythm constitutes an organizational principle and that every temporally ordered behavior is metrically organized. This means that rhythm, or rather meter, becomes manifest in the temporal binding of events to specific and
predictable phases of a higher-order periodic cycle. Furthermore, Cummins and Port advanced the idea that meter in speech is functionally determined and therefore emerges in those linguistic conditions which require a tight temporal coordination of events encompassing more than one syllable. As a coordinative strategy, meter has the advantage that those parts which group together to form metric structure are determined in their relative timing, which in turn reduces the number of degrees of freedom in the motor program. In this context, we assume that speech is not completely unpredictable, but comprises perceptual metric patterns that lead to prediction, and hence to facilitation of information processing. Moreover, it is well known that regular patterns of stressed and unstressed syllables are perceptually beneficial (Cutler and Foss 1977) and guide the listener’s attention to important points in a sentence (Pitt and Samuel 1990). As a result, it is not incidental that metric stress takes on a functional role in language acquisition (Jusczyk et al. 1999), speech segmentation (Mattys and Samuel 1997), and lexical grouping (Dilley and McAuley 2008). However, perceptual regularity in speech has received much less attention than perceptual regularities in music. This is probably due to at least two reasons: (i) most speech rhythm researchers focus on speech production and report little or no evidence of periodicity, and (ii) speech perception data primarily focus on the relevance of stressed syllables in segmentation (e.g., Jusczyk et al. 1999). Therefore, the investigation of perceptual regularities in longer linguistic sequences has taken a back seat because speech is considered to be temporally irregular. 
However, the investigation and analysis of rhythmic or metric structure has reentered the limelight in psycholinguistic research over the past years and has become more and more focused on perceptual aspects of meter perception in speech.
2.3 Meter and syntax

We conducted a series of experiments to investigate the impact of metric structure, in the form of predictable syllable stress, on auditory syntactic processing. Why should meter interact with syntax during auditory language processing? We argue that meter and syntax share similar structural properties (Patel et al. 1998). Although metric and syntactic structure cannot be reduced to or derived from one another (Jackendoff 2002), both processes are based on predictions concerning the continuation of a given speech input. For instance, in German the listener predicts the next incoming element to be a nominal phrase (NP) as soon as they hear an article (Art). Hence, the rule “An article is followed by a nominal phrase” elicits syntactic predictions. In other words, such syntactic structures allow
predictions on the basis of normative rules (Chang, Dell and Bock 2006). It is particularly important that syntactic predictions are not dependent on the individual’s world knowledge, as is the case with semantics. For example, semantic predictions mainly rely on inter-individual differences and social contexts, while syntactic predictions do not, because syntactic rules are automatically acquired during first language acquisition. Concerning meter perception, we assume that the processing of metric structure guides the listener’s attention to important points in a sentence, for example, time points at which salient stressed syllables are expected to occur. This is in line with the attentional bounce hypothesis (Pitt and Samuel 1990), where attention is thought of as moving from one stressed syllable to the next. Hence, metric predictions allow the listener to focus attention on relevant aspects of the continuous speech stream, and thus lead to efficient information processing. We conclude that both meter and syntax allow the continuous speech stream to be structured and help generate predictions about the timing and form of an upcoming event. A metric pattern serves as a “framework” that enables the listener to structure linguistic input and to build up syntactic hierarchies (Large and Kolen 1994).
3 Method: Event-Related Potentials

Before we report some previous evidence on syntax-prosody interaction, we will give a brief introduction to the method most frequently used in related research, namely event-related potentials (ERPs). ERPs are patterned voltage changes in the ongoing electroencephalogram (EEG) which are time-locked to specific cognitive processing events. They occur prior to, during, or after a motor, sensory or mental event. Event-related voltage fluctuations are very subtle compared to the spontaneous EEG. They are invisible to the naked eye, and hence many signal segments related to events of the same type have to be averaged in order to obtain an ERP waveform. This is based on the assumption that the underlying mental process remains constant across several stimuli of the same type while background activity is random. The ERP waveform is a sequence of negative-going or positive-going peaks and valleys that are also labeled components, deflections, peaks, or waves. They are described in terms of their polarity, distribution, latency and experimental variability. Traditionally, ERP components are subdivided into endogenous and exogenous components. Exogenous components are reactions to simple physical stimuli occurring between 0 and 100 ms after stimulus onset, which vary with the physical features of a stimulus. All components arising 50 or more milliseconds after stimulus onset are reactions related to mental changes and thus do not only occur because of physical differences. These components are called endogenous. Up to 300 ms after stimulus onset, processing steps are thought to be non-attended. One great advantage of the non-invasive ERP method is its high temporal resolution. Speech processing is characterized by its high speed as well as its complexity: many types of information, namely phonetic, prosodic, semantic, and syntactic information, have to be processed within a very short space of time. Hence, ERPs are a convenient method for studying the sub-processes of language, which can be separated all the way from the decoding of the acoustic signal to the understanding of an utterance. Time plays an important role in separating processes, especially with regard to a potential interaction between two language levels (in the current case meter and syntax). Using ERPs makes it possible to study the interaction between meter and syntax and to look at both processes separately in order to examine differences in latency or distribution, respectively. In the following sections, relevant ERP components for syntactic and prosodic processing will be introduced, but the focus of this overview is directed towards the P600, a late positive component.
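The averaging logic described above, in which the event-related response is assumed to be constant across trials while background activity is random, can be illustrated with simulated data. Everything in this sketch is an assumption made for demonstration: the Gaussian-shaped positivity peaking at 600 ms, its 2 µV amplitude, the noise level, the 250 Hz sampling rate and the number of trials. It is not an analysis pipeline used in the studies reviewed here.

```python
import numpy as np

def average_erp(epochs):
    """Average time-locked epochs (n_trials x n_samples) into one waveform."""
    return np.asarray(epochs, dtype=float).mean(axis=0)

def rms(values):
    """Root-mean-square of a 1-D array."""
    return float(np.sqrt(np.mean(np.square(values))))

rng = np.random.default_rng(0)
times = np.arange(-50, 250) * 0.004          # -0.2 s to about 1.0 s at 250 Hz

# Simulated event-related component (assumption): a positivity peaking at 0.6 s.
component = 2.0 * np.exp(-((times - 0.6) ** 2) / (2 * 0.08 ** 2))

# Each trial = the same component plus much larger random background activity.
n_trials = 400
epochs = component + rng.normal(0.0, 10.0, size=(n_trials, times.size))

erp = average_erp(epochs)
single_trial_error = rms(epochs[0] - component)
averaged_error = rms(erp - component)
print(single_trial_error, averaged_error)
```

Because the simulated noise is independent across trials, its contribution to the average shrinks roughly with the square root of the number of trials, which is why the component becomes visible only after averaging.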
3.1 Early Left Anterior Negativity (ELAN) and Left Anterior Negativity (LAN)

Both components, the ELAN as well as the LAN, are connected to the detection of syntactic violations. Several studies reported an ELAN in response to phrase structure violations (e.g., Hahne and Friederici 1999, Neville et al. 1991). Consequently, based on ideas of Frazier (1987), initial sentential structure building is assumed to be based on word category information. The ELAN starts approximately 100–200 ms after stimulus onset and is followed by a late posterior positivity (P600), making it part of a biphasic pattern (ELAN-P600). This component seems to be automatic in nature, since neither the predictability of a violation nor the context or task demands influence the amplitude of the early negativity (Hahne and Friederici 1999; Hahne and Friederici 2002). Thus, the ELAN is independent of attentional factors (Friederici et al. 1999). The LAN, a component elicited about 300 to 500 ms after onset of the critical item (Friederici 1999), has been observed in response to morphosyntactic violations, such as gender, number and tense disagreement (Coulson et al. 1998; Deutsch and Bentin 2001; Gunter et al. 2000). Comparable to the ELAN, it is almost always frontally distributed and more pronounced over left than right electrode sites. This component often occurs in a biphasic pattern, as it is followed by a late posterior positivity (P600). The LAN has been argued to be sensitive to working
memory load (Kluender and Kutas 1993; King and Kutas 1995) as well as being a pure morphosyntactic component (Friederici 1995). A study by Hoen and Dominey (2000) revealed a LAN in response to an abstract sequence violation. This indicates that a LAN is not necessarily language specific, but rather a reflection of a general rule-based sequencing mechanism related to error detection in a hierarchically structured sequence (whether the sequence is linguistic or non-linguistic).
3.2 N400

The N400 is a negativity that occurs around 200 to 700 ms after the presentation of a critical stimulus (Coulson 2004). The N400 is broadly distributed over the scalp; however, it is most pronounced over centro-parietal electrode sites, with a slight rightward shift (Kutas and Hillyard 1980). This component is affected by a large number of manipulations: it is sensitive to word frequency, since less frequent words elicit a larger N400 than frequent words (Smith and Halgren 1989); it is sensitive to repetition (van Petten and Kutas 1991) and to the cloze probability of words (Kutas and Hillyard 1984); and its amplitude decreases with increasing probability of a certain word in a given context. Furthermore, an amplitude decrease of the N400 has been reported for semantic priming (Holcomb and Neville 1990): the N400 amplitude becomes smaller when the target word is preceded by a related word. Additionally, pseudowords also evoke an N400 while nonwords do not (Kutas and Hillyard 1980). More recent studies reveal that the N400 is sensitive to syntactic contexts; this negativity can be evoked when thematic role assignment is impossible (Frisch et al. 2004). Thus, the N400 has been interpreted as an index of processing difficulty: “the more demands a word poses on lexical integration processes, the larger the N400 component will be” (Coulson 2004).
3.3 P600

The P600 is a late positivity with centro-parietal distribution and with a maximal peak latency of 600 ms. This positivity has been connected to several types of syntactic violations, namely subject-verb-agreement (Vos et al. 2001), verb inflection (Gunter et al. 1997), case inflection (Münte et al. 1998), wrong pronoun inflection (Coulson et al. 1998), and phrase structure violations (Hahne and Friederici 1999). However, such a positivity is not only evoked by syntactic violations; it has also been found in response to ambiguities (Frisch et al. 2002; Mecklinger et al.
1995), garden-path sentences, and sentences with increased syntactic complexity (Kaan et al. 2000). Friederici et al. (2002) further distinguish between a violation-P600, which is centro-parietally distributed, and a complexity-P600, which is most pronounced at frontal electrode sites. Accordingly, van Herten and colleagues (2005) summarize the reported syntactic functions of the P600 as follows: “The P600 has been claimed accordingly to reflect various kinds of syntactic processing difficulties, such as the inability of the parser to assign the preferred structure to incoming words, syntactic reanalysis or syntactic integration difficulty.” However, the P600 has recently been found in response to semantic violations as well (Kuperberg et al. 2003; Hoeks et al. 2004), e.g., in Dutch sentences like The cat that fled from the mice… (Kolk et al. 2003; van Herten et al. 2005). Notably, the authors found only a P600 and no N400. They concluded that the reader used a plausibility strategy which resulted in two different theta-role assignments. Whereas the first assignment is due to plausibility (cats: agent, mice: patient), the second assignment is due to the syntactic parsing algorithm (mice: agent, cats: patient). Consequently, there is a conflict between these two possibilities, and this conflict resulted in a P600 as soon as the reader detected the error and refocused their attention on the unexpected event in order to reanalyze the sentence. Additionally, a P600 has been found in response to music violations (Patel et al. 1998; Besson and Faita 1995). Patel et al. (1998) looked at the P600 in response to manipulated target chords. The target was either within the key of a phrase or out of key. Targets that were out of key elicited a positivity around 600 ms post-target onset (P600). 
This result prompted the authors to interpret the P600 as a domain-general response to “knowledge-based structural integration”. Beyond the question of the domain specificity of the P600, there has been ample discussion of whether the P600 is simply a delayed P300. Proponents of this view (Coulson et al. 1998; Gunter et al. 1997) argue that the distribution of the P600 is very similar to that of the P3b, because both are most pronounced at centro-parietal sites. Furthermore, both components are sensitive to the probability with which a certain target stimulus occurs, and both behave in an additive fashion (Osterhout et al. 1994). The latency differences are thought to result from differences in complexity, as linguistic stimuli are much more complex than the single tones of a classical P3-oddball paradigm. However, Frisch and colleagues (2003) provide data that call this possibility into question. In an auditory ERP experiment, patients with focal lesions of the basal ganglia were tested in a syntactic violation paradigm as well as in a P3-oddball task. While healthy controls showed a P600 in response to the syntactic violations and a P300 in response to the deviant stimuli in the oddball paradigm, the patients
showed no P600 but only a P300. Thus, the authors stated that the basal ganglia play a crucial role in the modulation of the P600, but not of the P300, and concluded that this neural and functional dissociation speaks for two distinct components.
4 Experiments
Since meter, in contrast to intonational or phonological phrases (Selkirk 1984), is a low-level aspect of speech prosody, we will first review some ERP evidence on the prosody-syntax interface before presenting our findings regarding the interplay between meter and syntax. In the last decade, results from several ERP studies have indicated a strong interaction of syntax and prosody during auditory sentence processing. For instance, Steinhauer et al. (1999) directly tested the prosody-syntax interface and reported that prosodic boundaries are actively used during sentence parsing. In their experiments, the prosodic structure induced a syntactic garden path (i.e., a direct object was presented although the syntactic structure built so far did not lead listeners to expect one). This mismatch led to a syntactic reanalysis of the sentence, reflected in a biphasic ERP pattern (N400-P600 complex). Astésano et al. (2004) showed that a prosodic mismatch (i.e., a statement that ends with a pitch contour typical of questions) can elicit a positive deflection that peaks around 800 ms (P800), comparable to the P600 elicited by syntactic violations. Eckstein and Friederici (2005), in turn, reported a P600 effect for sentence-final, prosodically incongruent words. Their data indicate that syntactic and prosodic information interact during sentence processing: they reported an over-additive P600 effect in a combined prosodic–syntactic violation condition, compared to the sum of the single syntactic and prosodic P600 effects. In a follow-up study, Eckstein and Friederici (2006) replicated this late interaction of syntax and prosody, but also reported an early interaction. They found an early left-lateralized negativity in the syntax-only condition, but a bilateral early negativity when both prosody and syntax mismatched.
Thus, the data provide strong electrophysiological evidence that the syntactic parser can be directly influenced by prosodic information. In this chapter, we focus on the influence of a lower prosodic level, that is, the metric pattern of a given sentence (i.e., the succession of stressed and unstressed syllables), on syntactic processing. We argue that the underlying metric structure of a given language is a prime candidate to facilitate syntactic sequencing. This
should be the case if sequencing goes hand in hand with the segmentation that precedes it, as has been previously discussed (e.g., Quené and Koster 1998; Cutler 1994; Quené 1993; Slowiaczek 1990; Cutler and Carter 1987). To examine whether the trochee, as the default metric pattern in German, plays a significant role in speech processing that goes beyond speech segmentation, we recorded the electroencephalogram of 24 healthy young German adults from 59 Ag/AgCl electrodes mounted in an elastic cap (Electro-Cap Inc., Eaton, Ohio, USA) (Schmidt-Kassow and Kotz 2009). While participants listened to metrically regular sentences, they performed either a syntactic task (grammaticality judgment) or a metric task (metric homogeneity judgment). All sentences had a trochaic pattern (regular succession of stressed and unstressed syllables). However, in the metrically incorrect condition, stress was shifted from the first to the second syllable in the critical item. In the syntactic condition, syntactically violated sentences were included in which the expected infinitive was replaced by an inflected verb. In an additional double violation condition, both violations were combined (see Table 1).

Table 1: Experimental Conditions

Condition            Example
Correct              ’Vera ’hätte ’Christoph ’gestern ’morgen ’schubsen ’können.
                     Vera could have hustled Christoph yesterday morning
Metric violation     ’Vera ’hätte ’Christoph ’gestern ’morgen schub’SEN ’können.
                     Vera could have hustled Christoph yesterday morning
Syntactic violation  ’Vera ’hätte ’Christoph ’gestern ’morgen ’schubste ’können.
                     Vera could have hustle Christoph yesterday morning
Double violation     ’Vera ’hätte ’Christoph ’gestern ’morgen schubs’TE ’können.
                     Vera could have hustle Christoph yesterday morning
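The metric manipulation in Table 1 can be made concrete with a small sketch. It assumes a deliberately simplified representation that is not part of the original study: each syllable carries a binary stress flag, and a sentence counts as trochaic if stressed and unstressed syllables alternate strictly, starting with a stressed one. The function name `first_trochaic_deviation` and the boolean encoding are hypothetical; in the experiment, metric homogeneity was judged by listeners, not computed.

```python
from typing import List, Optional


def first_trochaic_deviation(stress: List[bool]) -> Optional[int]:
    """Return the index of the first syllable that breaks a strict
    strong-weak alternation, or None if the sequence is fully trochaic.

    Assumption (not from the source): stress is binary and every foot
    is exactly one stressed syllable followed by one unstressed one.
    """
    for i, is_stressed in enumerate(stress):
        expected = (i % 2 == 0)  # even positions strong, odd positions weak
        if is_stressed != expected:
            return i
    return None


# One flag per syllable (True = stressed) for the 14 syllables of
# Ve-ra hät-te Chris-toph ges-tern mor-gen schub-sen kön-nen.
correct = [True, False] * 7

# Metric violation: stress shifted from the first to the second syllable
# of the critical item (schub'SEN), i.e. syllables 10 and 11.
metric_violation = correct.copy()
metric_violation[10], metric_violation[11] = False, True

print(first_trochaic_deviation(correct))           # None
print(first_trochaic_deviation(metric_violation))  # 10
```

Under this encoding the deviation is located on the first syllable of the critical verb, although, as the gating data in Section 4.1 suggest, listeners may need the following syllable before they can perceive the shift.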
Under both task conditions accuracy rates were above 80 %, indicating that all participants understood both tasks. ERPs were quantified and time-locked to the onset of the critical item (e.g., schubsen). As shown in Figure 1 and Figure 2, all violation conditions elicited a similar ERP pattern: that is, an early negativity at frontal electrode sites followed by a late positivity at parietal electrode sites.
Figure 1: ERPs elicited by syntactically, metrically, and doubly violated sentences in the metric task condition.
Figure 2: ERPs elicited by syntactically, metrically, and doubly violated sentences in the syntactic task condition.
4.1 Early Negativity
Both components were visible in both task conditions: the syntactic task and the metric task. However, we found differences in the latency of the early negativity. Specifically, the critical verb elicited an anterior negativity whose latency varied as a function of violation type. Metrically evoked negativities deflect between 250 and 400 msec earlier, compared to the syntactically evoked negativities (LAN). This finding indicates that metric deviations are processed prior to syntactic violations, independent of the task demands. One could argue that the critical cue for the syntactic violation (-te) may not be detected until the onset of the second syllable, while the critical cue for the metric violation is already available in the first syllable. We addressed this open issue in a follow-up study in which we conducted a gating experiment with the same material. In a gating experiment, spoken words are usually presented in segments of successively increasing duration. The first segment usually starts at the word beginning and the last one corresponds to the entire word (Grosjean 1980). Here, we presented the whole sentence up to the critical item and successively increased the number of segments¹ of the critical item (e.g., sch, schu, schub, schubs, schubse, schubsen). Subjects were asked to indicate either whether the critical item was correctly stressed or whether it was grammatically correct. Interestingly, neither violation was detected until the onset of the second syllable (see Figure 3). This is not surprising, since stress is perceived relative to
Figure 3: Detection point of syntactic (left figure) and metric (right figure) errors.
1 As we were not interested in the recognition point but in the number of phonemes that are necessary to detect one or the other violation, we decided to successively increase the number of segments instead of increasing duration of a word.
another event; that is, a metric unit always consists of two syllables, a strong and a weak one. The follow-up study confirms that the earlier onset of the metrically evoked negativity is not due to critical cues for one condition being available at an earlier time point than those for the other. Clearly, metric cues are processed prior to syntactic cues at an early processing stage, as reflected in an early frontally distributed negativity. Thus, the ERP results support the relevant role of metric cues in auditory sentence processing. We interpret this metrically induced negativity as a correlate of the ongoing segmentation process (Schmidt-Kassow and Kotz 2009). However, based on the reported experiment, this component may also be an N400: incorrect stress assignment may result in effortful lexical access, as reflected by an N400 component. To resolve this open issue, we conducted a second ERP experiment utilizing jabberwocky sentences (Rothermich et al. 2010). We created opaque pseudoword sentences with the same CV-syllable and metric structure as the sentences described above (see Table 2).

Table 2: Stimulus examples

Condition            Example
Correct              ’Trogeumpf ’hätte ’Knäpfeint ’peile ’dögent ’napen ’sollen
                     Trogeumpf have Knäpfeint peile dögent napen should
Metric violation     ’Schlopfzu ’hätte ’Fligme ’peile ’dögent na’PEN ’sollen
Syntactic violation  ’Schrüpsi ’hätte ’Knolpfe ’peile ’dögent ’napte ’sollen
Double violation     ’Flutfü ’hätte ’Schrubli ’peile ’dögent nap’TE ’sollen
As in the previous experiment, participants were asked to listen carefully to the sentences and performed either a grammaticality or a metric homogeneity judgment. Participants performed with accuracy rates above 84 %. ERP results showed that metric violations evoke an early negativity in both tasks, as plotted in Figure 4. Therefore, the early negativity does not seem to rely on lexico-semantic content and is, consequently, unlikely to be an N400. We suggest that the early negativity reflects an error detection mechanism that comes into play during the processing of rule-based sequences in language, music and arithmetic, where similar negativities have been reported. Common to these processes is predictive behavior and a shared neurophysiological basis (Dominey 1997). For example, Hoen and Dominey (2000), who provide evidence for a similar negativity in a non-linguistic sequencing context, argued that the “negativity is not specific to language, but is rather a more general neurocomputational capability for treating structural complexity in
Figure 4: ERPs elicited by metrically violated Jabberwocky-sentences under metric and syntactic task conditions.
rule-governed sequences”. Additional support for a domain-general prediction hypothesis comes from Abecasis et al. (2005) and Brochard et al. (2003), who report an early negativity in response to metric deviations in tone sequences that can be interpreted as a correlate of metric error detection. Given that negativities with similar latency and topography have been found in other domains, we propose that the metric negativity is not language specific per se, but rather reflects a general error detection mechanism (Rothermich et al. 2010).
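The gating procedure used in the follow-up study of this section can be sketched as follows. This is only an illustration under stated assumptions: the actual stimuli were spoken, so the orthographic strings below stand in for audio fragments, and the segmentation of schubsen into the units `sch, u, b, s, e, n` is inferred from the example gates given in the text. The names `build_gates`, `segments`, and `carrier` are hypothetical.

```python
from itertools import accumulate
from typing import List


def build_gates(segments: List[str]) -> List[str]:
    """Return successively longer fragments of a word, adding one
    segment per gate (segment-based rather than duration-based gating)."""
    # accumulate() concatenates the strings cumulatively: a, a+b, a+b+c, ...
    return list(accumulate(segments))


# Assumed segmentation of the critical item "schubsen".
segments = ["sch", "u", "b", "s", "e", "n"]
gates = build_gates(segments)
print(gates)  # ['sch', 'schu', 'schub', 'schubs', 'schubse', 'schubsen']

# Each trial presents the full sentence up to the critical item,
# followed by one gate of the critical item.
carrier = "Vera hätte Christoph gestern morgen"
stimuli = [f"{carrier} {gate}" for gate in gates]
print(stimuli[0])  # Vera hätte Christoph gestern morgen sch
```

Building gates by segment count rather than by duration mirrors the rationale given in the footnote: the measure of interest is how many phonemes a listener needs before a violation can be detected.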
4.2 P600
The findings concerning the late positivity (P600) were particularly interesting. Meter and syntax not only elicit P600 effects that are comparable in morphology and distribution, they also interact. We included the double violation condition to investigate this issue. If two processes operate independently, the amplitude of the P600 in the double violation condition should equal the sum of the P600s in the single violation conditions (Gondan and Röder 2006; Barth et al. 1995), because independent neural generators have additive effects on the amplitude of an ERP component. If, on the other hand, the two processes interact, the amplitude in the double violation condition should be over- or under-additive. To test for the expected under-additivity of the P600 component, we summed the difference waves of the syntactic and the metric P600. Under independence, this calculated difference wave should equal the difference wave actually evoked in the double violation condition. Instead, the amplitude of the calculated P600 was larger than the amplitude of the evoked P600. This finding suggests that meter and syntax interact in the processes underlying the P600 (Schmidt-Kassow and Kotz 2009).
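The additivity logic described above can be illustrated with a minimal sketch. All amplitude values below are invented toy numbers chosen only to show the arithmetic; they are not data from the study. The function names, the four-sample waveforms, and the analysis window are likewise assumptions, and a real analysis would compare the calculated and evoked waves statistically across participants rather than by a single subtraction.

```python
from typing import List


def difference_wave(violation: List[float], correct: List[float]) -> List[float]:
    """Violation condition minus correct condition, sample by sample."""
    return [v - c for v, c in zip(violation, correct)]


def mean_amplitude(wave: List[float], start: int, stop: int) -> float:
    """Mean amplitude over the sample window [start, stop)."""
    return sum(wave[start:stop]) / (stop - start)


def classify_additivity(evoked: float, calculated: float, tol: float = 0.0) -> str:
    """Compare the evoked double-violation effect with the sum of the
    single-violation effects (the prediction under independence)."""
    if evoked > calculated + tol:
        return "over-additive"
    if evoked < calculated - tol:
        return "under-additive"
    return "additive"


# Toy grand-average waveforms in microvolts (invented for illustration).
correct = [1.0, 1.0, 1.0, 1.0]
syntactic = [1.0, 3.0, 5.0, 3.0]
metric = [1.0, 2.0, 4.0, 3.0]
double = [1.0, 3.0, 6.0, 4.0]

syn_diff = difference_wave(syntactic, correct)
met_diff = difference_wave(metric, correct)
evoked_diff = difference_wave(double, correct)

# Calculated double-violation wave: sum of the two single difference waves.
calculated_diff = [s + m for s, m in zip(syn_diff, met_diff)]

window = (1, 4)  # assumed P600 window, in sample indices
result = classify_additivity(
    mean_amplitude(evoked_diff, *window),
    mean_amplitude(calculated_diff, *window),
)
print(result)  # under-additive
```

With these toy values the calculated wave exceeds the evoked one, which is the direction of the effect reported in the text: an under-additive P600 in the double violation condition.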
4.3 Bilinguals
Further evidence for a meter-syntax interaction comes from recent bilingual data (Schmidt-Kassow et al. 2010). We tested French late learners of German with the same paradigm as described above. All participants were very proficient in evaluating the syntactic correctness of the sentences (syntactic task) and showed a native-like P600 in response to syntactic violations. However, the French bilinguals’ performance broke down when they were asked to evaluate the metric pattern of the respective sentences. This metric evaluation poses no challenge for German native speakers, whose performance was above 85 percent correct. In contrast, French–German bilinguals had great difficulty judging metric homogeneity and seem to have relied on syntactic cues instead: they showed neither a P600 in response to metric violations, nor a syntactic P600 under metric task instructions. We looked at the data in more detail and found that half of the participants performed well, while the other half had great difficulty with the metric task. That is to say, good performers performed nearly native-like (> 80 % correct), while poor performers were far below chance level (