studia grammatica. Edited by Manfred Bierwisch in collaboration with Hubert Haider (Stuttgart), Paul Kiparsky (Stanford), Angelika Kratzer (Amherst), Jürgen Kunze (Berlin), David Pesetsky (Cambridge, Massachusetts), and Dieter Wunderlich (Düsseldorf)
studia grammatica 69
Adriana Hanulíková
Lexical segmentation in Slovak and German
Akademie Verlag
Bibliographic information of the Deutsche Nationalbibliothek: The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data are available on the Internet at http://dnb.d-nb.de.
ISBN 978-3-05-004632-7
eISBN 978-3-05-006227-3 ISSN 0081-6469
© Akademie Verlag GmbH, Berlin 2009. The paper used is age-resistant in accordance with DIN/ISO 9706. All rights reserved (including those of translation into other languages). No part of this book may be reproduced in any form (by photoprinting, microfilm, or any other means) or transmitted or translated into a machine language without written permission from the publisher. Printing and binding: MB Medienhaus Berlin. Printed in the Federal Republic of Germany.
Table of contents

Acknowledgments
1 Introduction
  1.1 Speech comprehension
  1.2 Basic unit of speech perception
  1.3 Explicit speech segmentation
  1.4 Speech segmentation through word recognition
  1.5 Models of spoken-word recognition
    1.5.1 Cohort model
    1.5.2 Trace
    1.5.3 Shortlist
  1.6 A universal principle in speech segmentation
  1.7 Slovak and German
  1.8 Methodologies
    1.8.1 Word-spotting
    1.8.2 Auditory lexical decision
    1.8.3 Syllabification reversal task
  1.9 Structure of the thesis
2 Segmentation of Slovak speech
  2.1 Introduction
  2.2 Experiment 1
    2.2.1 Method (Participants; Materials and design; Procedure)
    2.2.2 Results
      2.2.2.1 Experiment 1A: Word-spotting
      2.2.2.2 Experiment 1B: Word-spotting
      2.2.2.3 Experiment 1C: Lexical decision
    2.2.3 Discussion
  2.3 Experiment 2
    2.3.1 Method (Participants; Materials and design; Procedure)
    2.3.2 Results and discussion
  2.4 General discussion
3 Native and non-native segmentation
  3.1 Introduction
  3.2 Experiment 1
    3.2.1 Method (Participants; Materials and design; Procedure)
    3.2.2 Results
      3.2.2.1 Experiment 1A: Word spotting
      3.2.2.2 Experiment 1B: Word spotting
      3.2.2.3 Experiment 1C: Lexical decision
    3.2.3 Discussion
  3.3 Experiment 2
    3.3.1 Method (Participants; Materials and design; Procedure)
    3.3.2 Results and discussion
  3.4 General discussion
4 The role of syllabification in speech segmentation
  4.1 Introduction
  4.2 Experiment 1
    4.2.1 Method (Participants; Materials and design; Procedure)
    4.2.2 Results and discussion
  4.3 Experiment 2
    4.3.1 Method (Participants; Materials and design; Procedure)
    4.3.2 Results and discussion
  4.4 General discussion
5 Summary and conclusions
  5.1 Summary
  5.2 Conclusions
Resümee
Appendices
  Appendix A: Stimulus material
    Appendix A.1: Chapter 2
    Appendix A.2: Chapter 3
    Appendix A.3: Chapter 4
  Appendix B: Questionnaires
    Appendix B.1: Chapter 2, Chapter 4.2
    Appendix B.2: Chapter 3 (Experiments 1A, 1B, and 1C), Chapter 4.1
    Appendix B.3: Chapter 3 (Experiment 2: Slovak learners in Slovakia)
    Appendix B.4: Chapter 3 (Experiment 2: Slovak learners in Germany)
References
List of tables

Table 1. Slovak and German minimal words and possible syllables in comparison
Table 2. Experiments 1A, 1B, 1C, and 2 (Chapter 2)
Table 3. Experiment 1B (Chapter 2): Paired t-test comparisons
Table 4. Splicing technique example
Table 5. Experiment 2 (Chapter 2): Paired t-test comparisons
Table 6. Experiments 1A, 1B, and 1C (Chapter 3)
Table 7. Experiment 2 (Chapter 3)
Table 8. Experiment 2 (Chapter 3): Paired t-test comparisons
Table 9. Experiment 2 (Chapter 3): Regression analysis
Table 10. Experiments 1B and 2 (Chapter 3)
Table 11. Experiment 2 (Chapter 3): Paired t-tests for nouns and verbs
Table 12. Experiment 2 (Chapter 3): Main effects of context and interactions
Table 13. Experiments 1 and 2 (Chapter 4)
List of figures

Figure 1a. Waveform and spectrogram of the Slovak word vrane
Figure 1b. Waveform and spectrogram of the English phrase she sells sea shells
Figure 2. Pattern of inhibitory connections between candidate words
Figure 3. Experimental procedure: Presentation of stimuli and timing
Figure 4. Experiments 1A and 1B (Chapter 2)
Figure 5. Experiment 1C (Chapter 2)
Figure 6. Experiment 2 (Chapter 2)
Figure 7. Experiments 1A and 1B (Chapter 3)
Figure 8. Experiment 1C (Chapter 3)
Figure 9. Experiments 1B and 2 (Chapter 3)
Figure 10. Experiments 1 and 2: Nouns and verbs comparison (Chapter 3)
Figure 11. Experiments 1 and 2 (Chapter 4)
Acknowledgments
This book is dedicated to Anna Hanulíková.
I am grateful to numerous people for their willingness to support this work professionally and personally. From the very beginning, Rainer Dietrich accompanied this work, which for me meant diving into a completely new field. He never doubted that I would quickly learn to swim, and during this learning process he gave me plenty of freedom to experiment, while at the same time both encouraging and challenging me. I am very grateful to him for creating the best working conditions a doctoral student could wish for. I also owe it to him that I was able to spend several months in Anne Cutler's Comprehension Group at the Max Planck Institute for Psycholinguistics in Nijmegen. There I changed my topic and, thanks to the good working conditions, was able to complete my experiments within a short time. James McQueen took on the official role of second supervisor, even though his agenda hardly left room for a new student; I am all the more grateful to him for it. I thank all members of the Institut für deutsche Sprache und Linguistik, and especially Felix Golcher, Silke Hamann, Katja Kühn, Max Möller, and Katja Suckow, for being there for me at decisive moments, without necessarily noticing it. For further personal, professional, or technical support within and outside the university in Berlin, I thank in particular Mascha Averintseva, Olinka Blauová, Jana Brunner, Yasmin Dalati, Martin Gebauer, Ingrid Häfner, Chung-Shan Kao, Guido Kicker, Manfred Krifka, Anke Lüdeling, Blanka Mongu, Birgit Monreal, Bernd Pompino-Marschall, Daniel Pape, Eva Poll, Jessica Rosenberg, and Benny Weiß, as well as the fellows of the KAS. Marcela Adamíková and Katja Kühn lent their voices to my experimental materials; many thanks for that! I also thank all the students in Berlin who took part in my experiments. It is hardly possible to write a dissertation within a foreseeable time when there is no money. Fortunately, I was spared this worry thanks to the financial support of the Konrad-Adenauer-Stiftung. Many thanks to Karsten Sinner for encouraging me to apply. I owe the financial help in carrying out some of the experiments to Anne Cutler and Rainer Dietrich. In addition, many of my conference trips were financed by the Konrad-Adenauer-Stiftung, the Deutsche Forschungsgemeinschaft, and the Institut für deutsche Sprache und Linguistik in Berlin. Without their help, hardly anyone would have learned of my work, and I would never have been able to benefit from the numerous ideas, suggestions, and comments that repeatedly made me think and ultimately also improved my work.
I want to thank James McQueen for taking over the role of second supervisor and for his unfailing support in the last stage of my PhD. His comments and advice were inspiring. His comments on my drafts considerably improved this dissertation and clarified the project to myself. I was very lucky to have the opportunity to learn from James not only how to work and think scientifically but also about the psychological approach to language. I also want to thank Holger Mitterer for all his support throughout the project and for his patience in explaining complicated statistical and programming issues, and Anne Cutler for her enthusiasm and her constructive feedback. For friendship and wonderful evenings in Nijmegen I want to thank Bettina, Carmel, Jenny, Silke and Paul, Tilman, Claudia, Katja and Victor, and Aneta. For spotting errors in parts of my manuscript I am grateful to Bettina Braun, Felix Golcher, Jenny Green, Silke Hamann, Holger Mitterer, Claire Plarre, Carmel O'Shannessy, and Kevin Russell. I owe the biggest thanks to Doug Davidson and James McQueen, who read the whole manuscript and provided me with many invaluable comments. Any remaining errors are mine alone. I also want to thank all participants in my experiments and the PhD students in Trenčín and Bratislava. I am grateful to the departments where I tested for allowing me to do so, especially to prof. Mária Vajícková, prof. Dušan Maga, prof. Stanislav Tóth, prof. Ján Garaj, and to the secretaries Katarína Hrnčárová and Alexandra Kerndlová. For unconditional support in carrying out the experiments in Trenčín I am grateful to the Kuchta family, especially to Marika for her amazing hospitality, and to Majka and Sebinko for brightening up the evenings. The days and years would have been harder without those close to me. For a realistic view of the world and a beautiful view of the High Tatras I thank Andrea Gemzová. Doug walked tirelessly alongside me and encouraged me whenever I was losing patience. Thank you, Doug, for all the wonderful time. And finally, I thank my family, Anna, Mirko, and Peťo, for their support today, years ago, and always. Without their trust and good spirits I would not be where I am.
1 Introduction
Communication is essential for most living organisms such as birds, tigers, plants, and humans (Hauser, 1997). The system for human communication evolved over time with specific features and with neurobiological and physiological constraints that are similar across users of any language from any culture. For example, all humans are equipped with perceptual and articulatory mechanisms which (under healthy circumstances) allow them to learn to perceive and produce speech. There also seems to be a common need among humans to use these mechanisms to communicate, and most often they do so using speech, as indicated by the quote from Jakobson, Fant, and Halle (1976: 13): "[...] we speak to be heard in order to be understood". But how do humans understand speech? Do they share similar underlying language-comprehension mechanisms, or are these fundamentally different due to the diversity of languages and speakers? To understand the nature of these mechanisms of speech comprehension, it is necessary not only to study an individual language and its users, but also to compare features of several languages. This thesis provides a cross-linguistic examination of human speech comprehension by investigating processing in users of different languages. Humans normally perceive and comprehend continuous speech effortlessly. Comprehension of speech involves the transformation of a physical speech signal into meaningful units. The conversion of sound to meaning works so well in one's native language that most people take its efficiency for granted. It is only when comprehension fails, for example during exposure to a foreign language, that one senses the complexity of the cognitive achievement. This dissertation examines one aspect of the speech comprehension process: the segmentation of continuous speech. Segmentation in this dissertation refers to the detection of word boundaries and the initiation of lexical access at putative word onsets. Several experiments are presented that investigate a proposed universal segmentation principle, the Possible-Word Constraint (PWC; Norris, McQueen, Cutler, & Butterfield, 1997), in two languages: Slovak and German. This thesis further addresses a) the influence of Slovak prosodic structure on the segmentation of a native and a non-native language, and b) the role of syllables in the speech segmentation of German and Slovak.
1.1 Speech comprehension

For spoken information transfer to be successful, the listener needs to understand the meaning of the spoken utterance, that is, the message. The message a listener receives is encoded in an acoustic signal, which results from the physiological movements involved in speech production (see, e.g., Ladefoged, 1975). The process of listening begins once this signal reaches the ear. After the initial psychoacoustic processing of the input, the listener separates speech from other sensory input that might reach the ear (see Bregman, 1990, for a review). The acoustic signal is sent to the auditory cortex via the auditory nerves and is then converted into an abstract representation used to access the mental lexicon, the stored representations of words. The next processing stage is called word recognition. At this stage, the listener has to segment the signal into meaningful discrete units. Once the words are recognized, the following processing stages are concerned with integration: listeners determine the syntactic and semantic properties of individual words and the syntactic and semantic relationships among them, and use this knowledge as well as pragmatic and world knowledge to understand and interpret the utterance (see Dietrich, 2007, for an overview).

The conversion of sound to meaning is, however, not as simple as it might appear from this brief description (see Cutler & Clifton, 1999, for a detailed description). One factor that challenges any simple description of speech processing is the nature of the speech signal itself. While written language is distributed in space, so that all letters of a word are available to the reader for processing, speech information unfolds over time. Spoken language does not remain available over time, so one cannot re-listen to an earlier chunk of speech. Only a short sequence of speech can be retained in the auditory system, for example, in an echoic form (e.g., Crowder & Morton, 1969; Huggins, 1975). The word recognition and comprehension processes must therefore work quickly and efficiently in order to extract the message from the signal before it fades from memory. In addition to its temporal features, the speech signal is continuous and highly variable. There are no audible breaks after every spoken word that would be comparable to the white spaces after every printed word. Moreover, phonetic segments are not produced as discrete units (e.g., Liberman, Cooper, Shankweiler, & Studdert-Kennedy, 1967). They are produced in context and hence coarticulated with neighboring segments. For example, the acoustic properties of the phoneme /d/ in the syllable /di/ differ from those of /d/ in /du/ (Delattre, Liberman, & Cooper, 1955), and reflect not only the segment being produced but also the following (or preceding) segments. Surrounding speech units therefore affect the quality of a particular phoneme. Nonetheless, listeners recognize different variants as tokens of the same phoneme. To further complicate matters, just as individual samples of handwriting differ from each other, the acoustic realization of phonetic segments and words also varies across speakers, and changes with age, gender, emotional state, dialect, or speech rate. Crucially, the spectral characteristics of the same phoneme or word change even within one speaker. This can be seen in the acoustic analysis of speech provided by the waveform and the spectrogram in Figure 1a. It shows the Slovak word [vraɲe] ('crow' + singular dative inflection) pronounced in a sentence twice in a row by the same speaker. The horizontal axis indicates at what point in time a certain sound occurred. The waveform indicates the air pressure variation over time, i.e., the amplitude (plotted on the vertical axis). A spectrogram is the translation of the sound and its component frequencies into a visual representation. The vertical axis of the spectrogram shows the frequencies present at the times indicated on the horizontal axis, and the amplitudes of the component frequencies are plotted in lighter or darker shades of grey. Noticeably, the spectral characteristics of the two tokens do not match perfectly.

Figure 1a. Waveform (upper panel) and spectrogram (lower panel) of the Slovak word [vraɲe] spoken twice by a female native speaker.

An additional property of spontaneous speech is the occurrence of deletion, reduction, or assimilation of segments (and even words). The realization of a word therefore rarely resembles its canonical form, i.e., the form provided by pronunciation dictionaries. Despite all this variability, a listener usually does not have great difficulty recognizing words, whether produced by the same or by different speakers.
1.2 Basic unit of speech perception

Before auditory input can be segmented into words, listeners are faced with the problem of how to map the variable acoustic signal onto a lexical entry in the mental lexicon. Previous research clearly shows that listeners are sensitive to the variability of the speech input and are able to cope with the lack of invariance (e.g., Repp & Liberman, 1987; Remez, 1987). What remains challenging, however, is to understand the complex mechanism of mapping the acoustic signal to linguistic units such as phonemes or words. Speech perception research over the last fifty years has devoted much attention to this question, resulting in several different approaches (for reviews, see Diehl, Lotto, & Holt, 2004; Jusczyk & Luce, 2002; Miller & Eimas, 1995).
One solution is provided by episodic theories (e.g., Goldinger, 1998; Klatt, 1979), which suggest that people rely on stored traces of every heard word, memorized in the mental lexicon. Word recognition is achieved directly by accessing the auditory representation from stored traces that cover the variability occurring in the signal. It is not clear, however, how such theories can explain recognition of speech produced, for instance, with an unfamiliar accent. In addition, several studies have shown that listeners can quickly accommodate unusual pronunciations of a specific phoneme and, importantly, that this knowledge generalizes to the perception of new, previously unheard words (e.g., Norris, McQueen, & Cutler, 2003; McQueen, Cutler, & Norris, 2006). In contrast to direct acoustic-lexical mapping theories, other proposals assume that a transformation of the speech signal into some abstract representation is first accomplished. Such prelexical representations are then used in lexical access. The advantage of such an abstract phonological representation is that the invariance problem could be solved at the prelexical level. An abstract unit could serve as a mediating representation for the mapping process and would allow, for example, matching of the same phoneme uttered in different contexts (e.g., the phoneme /s/ in one spade, once paid, etc.). Most models of speech recognition assume some kind of prelexical level but differ in the fundamental unit of perception. The following suggestions for mental representations have been made: acoustic-phonetic features (e.g., Eimas & Corbit, 1973; Stevens, 2002, 2005), articulatory gestures (e.g., Liberman et al., 1967; Liberman & Mattingly, 1985), phonemes (e.g., Foss & Gernsbacher, 1983; Marslen-Wilson & Welsh, 1978; Nearey, 1997; Norris, 1994; Pisoni & Luce, 1987), syllables (Massaro, 1974; Mehler, 1981; Mehler, Dommergues, Frauenfelder, & Seguí, 1981; Savin & Bever, 1970; Seguí, 1984), stressed units (e.g., Grosjean & Gee, 1987), demisyllables (e.g., Fujimura, 1976; Fujimura & Lovins, 1978; Samuel, 1989), or features and phonemes (e.g., McClelland & Elman, 1986). Yet other researchers propose more abstract, underspecified representations and claim that listeners process only certain information relevant for the phonetic details that are stored with a specific word (e.g., Lahiri & Marslen-Wilson, 1991). The question about the exact nature of the fundamental unit of perception will not be answered in this dissertation.¹ However, as most recent approaches and models of spoken-word recognition assume a prelexical level as an intermediate stage between auditory and lexical processing (e.g., McClelland & Elman, 1986; Norris, 1994; Norris, McQueen, & Cutler, 2000), this thesis will make the same assumption without specifying what the unit of perception is. Note, however, that in this thesis I will distinguish between prelexical units of perception (such as phonemes or syllables, as described above) and the segmentation of speech. The outcome of speech processing at the prelexical level is an abstract copy of the acoustic signal in the form of, for example, a phoneme or syllable. Lexical segmentation, which is the focus of this dissertation, operates on this outcome and includes processes assisting lexical access, such as those which locate word boundaries in the signal. However, a clear separation is not always possible. The segmentation process might also include prelexical units such as syllables as possible segmentation points at which lexical search can be initiated. Chapter 4 will address the possible role of syllables in speech segmentation.

¹ A consensus about whether a mediating perceptual unit exists and, if so, which form it has, has not yet been reached. Norris and Cutler (1985) argued that the question whether the syllable or the phoneme plays a role in speech perception and word recognition might not be the right one to ask. It could well be that both are important and neither is primary. They further point out a necessary distinction between the classification and the segmentation of the input into such units. This will be discussed further in Chapter 4.
1.3 Explicit speech segmentation

After the decoding of the acoustic signal, the listener still has to segment the stream of sounds and hence recognize the words that have been produced. But how do listeners know where in the signal one word ends and the next begins? A look at a waveform and a spectrogram of a phrase such as she sells sea shells (Figure 1b, spoken by Peter Ladefoged) reveals that there are no invariant markers for the discrete linguistic units that the listener hears (though there are some possible cues; e.g., Nakatani & Dukes, 1977). As mentioned earlier, the acoustic signal is continuous, and words are coarticulated with neighboring words without the insertion of pauses. Nonetheless, several attempts have been made to provide an account of the segmentation problem at the sublexical level (by explicit cues) and at the lexical level (by word recognition), even though these accounts are not necessarily mutually exclusive.
Figure 1b. Waveform (upper panel) and spectrogram (lower panel) of the English phrase she sells sea shells. Dashed lines show approximate word boundaries.
The main idea of an explicit segmentation approach is that the word recognition process would work more efficiently if likely word boundaries could be identified prelexically, hence prior to lexical access. Analyses of production data have shown that the speech signal contains several physical properties that can correlate with word boundaries (Lehiste, 1960; Nakatani & Dukes, 1977), for example, the aspiration of word-initial plosives (e.g., the phoneme /k/ is aspirated in the English word onset of this card but not in word-medial position in discard) or durational differences (as in the earlier example, /s/ in one spade is longer in its phonetic realization than in once paid; e.g., Shatzman & McQueen, 2006). Listeners could thus exploit such distributional regularities to determine word boundaries at which lexical access could be initiated. Indeed, several studies have shown that listeners can use various cues to guide segmentation. For example, listeners rely on cues such as stress, phonotactics, and allophonic details (see McQueen, 2007, for an overview); they exploit information about the duration of segments or syllables (e.g., Beckman & Edwards, 1990; Gow & Gordon, 1995; Klatt, 1974, 1975; Oller, 1973; Quené, 1992) and vowel harmony (e.g., Suomi, McQueen, & Cutler, 1997; Vroomen, Tuomainen, & de Gelder, 1998), and they also use silence as a cue (e.g., Norris et al., 1997). These cues are, however, not always present and can be exploited only probabilistically. In addition, as languages differ in their phonological structures, all these sublexical cues contribute to segmentation in a language-specific manner. For example, knowledge about phonotactics, the combinations and restrictions of phoneme sequences within words or syllables, is exploited prelexically in a language-specific manner. Every language has a set of possible phonemes and combines them to form possible words. Which phonemes can be combined depends on the language in question and its phonological rules and constraints. For instance, in German, syllable onsets or codas cannot consist of the consonant cluster /km/. Thus, if German listeners hear the sound combination /km/, they can reliably locate a word or syllable boundary between these two sounds (see the sketch at the end of this section). Segmentation strategies based on rhythm also differ across languages and are subject to the language-specific prosodic structure of a given language. English, just like Dutch or German, is often classified as a stress-timed language.² In an analysis of a spoken English corpus, Cutler and Carter (1987) found that over 90% of English content words begin with a strong syllable (i.e., a syllable with a full vowel). For English listeners, as proposed by the Metrical Segmentation Strategy (MSS; Cutler & Norris, 1988), strong syllables are convenient segmentation points at which lexical initiation is likely to be successful (Cutler & Butterfield, 1992). A similar strategy appears to apply in Dutch (Vroomen & de Gelder, 1995; but see also Zwitserlood, Schriefers, Lahiri, & Van Donselaar, 1993, on the role of syllables, and Quené & Koster, 1998, showing no clear evidence for the MSS). In languages with a syllable-timed rhythm such as French, Catalan, and Spanish, speech segmentation seems to be based on syllables (e.g., Bradley, Sánchez-Casas, & García-Albea, 1993; Kolinsky, 1992; Mehler et al., 1981; Sebastián-Gallés, Dupoux, Seguí, & Mehler, 1992). In yet other languages with differing metrical properties, such as Japanese, listeners have been shown to segment speech at mora boundaries (Otake, Hatano, Cutler, & Mehler, 1993; Otake, Hatano, & Yoneyama, 1996), because the rhythmic unit in Japanese is the mora (i.e., a subsyllabic unit; see Vance, 1987). While there is ample evidence for how different stress patterns can contribute to speech segmentation in several languages (see Cutler, 2005, for a review), less is known about how stress can be exploited in languages with word-initial fixed stress such as Slovak, Czech, Hungarian, or Finnish. Such languages have a potential advantage over free-stress languages such as German and English, because segmentation at stressed syllables would provide reliable markers for word boundaries.³ The few studies on languages with fixed stress, such as French with word-final stress and Finnish with word-initial stress, do not offer a unitary view on this issue (Dahan, 1996; Suomi et al., 1997; Vroomen et al., 1998). One of the questions raised in Chapter 2 of this dissertation is therefore whether Slovak listeners use the location of fixed stress in Slovak speech segmentation. Related to this issue is a second question, namely, what role does the (stressed) syllable play in the segmentation process? Chapter 4 therefore investigates syllabification in Slovak. A comparison with a language with a different prosodic structure is achieved by testing German listeners on similar tasks and with comparable materials in Chapter 3 and Chapter 4.

² This classification of languages according to their rhythm into syllable-timed and stress-timed goes back at least to Pike (1945). Since then, this classification has been partly refined (e.g., Abercrombie, 1967; Ladefoged, 1975), partly criticized (e.g., Dasher & Bolinger, 1982; Dauer, 1983), and new proposals have been developed for rhythmic structure typology (e.g., Ramus, Nespor, & Mehler, 1999). For reviews, see also Auer (1993) and Pompino-Marschall (1990).

³ Stress placement in free-stress languages is not completely arbitrary, however. It depends on syllable weight or morphological factors, but its location within a word varies (see Van der Hulst, 1999, for a detailed description of the word prosodic systems of European languages).
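To make the phonotactic cue discussed above concrete, here is a minimal sketch in Python, assuming a small hand-picked set of consonant clusters that cannot occur within a German syllable (the cluster list and the phonemic input are illustrative only, not an exhaustive description of German phonotactics). A segmenter can posit a likely word or syllable boundary wherever such a cluster spans two adjacent sounds:

```python
# Minimal sketch of phonotactic boundary detection. The cluster
# inventory below is a hypothetical, illustrative subset.
ILLEGAL_WITHIN_SYLLABLE = {"km", "pn", "tl", "nm"}

def likely_boundaries(phonemes):
    """Return positions at which a word/syllable boundary is likely."""
    boundaries = []
    for i in range(len(phonemes) - 1):
        cluster = phonemes[i] + phonemes[i + 1]
        if cluster in ILLEGAL_WITHIN_SYLLABLE:
            # /km/ cannot be an onset or coda cluster in German,
            # so a boundary must fall between /k/ and /m/.
            boundaries.append(i + 1)
    return boundaries

# A toy phonemic string containing the cluster /km/:
print(likely_boundaries(list("rokmuzik")))  # -> [3]: boundary after /k/
```

In a full model, such prelexically detected boundaries would only bias, not determine, lexical access, since the cues are present only probabilistically.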
1.4 Speech segmentation through word recognition

Besides explicit segmentation strategies, another widely accepted approach suggests that lexical segmentation of the speech input results from word recognition itself (see Cutler & Clifton, 1999, for an overview). All the words that a person knows are stored in memory or are created during comprehension using morphological knowledge. This storage system is called the mental lexicon, and it consists of lexical representations that include information about the meaning and the form of a given unit. All models of word recognition assume that such a storage system exists. When a person listens to speech, various words in the mental lexicon that match the input are activated simultaneously. Besides the words that the speaker uttered (e.g., the phrase ship inquiry), other similar or embedded words (e.g., shipping, choir, pink, etc.) can be activated as well (e.g., Zwitserlood, 1989). All these activated words enter into a competition process for recognition (e.g., shipping would compete, among others, with ship but also with both inquiry and choir; see Figure 2). Only those words that can account for the entire input will win the competition. For instance, the word candidate choir will be inhibited by the word inquiry, because only the latter can account for the final syllable /ri/ and hence receives more support. The candidate shipping might also win the competition over ship, but once inquiry is recognized, the candidate shipping will lose the competition due to the overlap of the syllable /in/. In the end, ideally those words are recognized that were intended by the speaker. Once a word is recognized, the speech input is segmented. Abundant work from different paradigms has shown that the activation and competition of word candidates are central processes of spoken-word recognition (see McQueen, 2005, for a review), a proposal on which most researchers in the field agree. In addition, segmentation of the speech input achieved by word recognition is presumably a universal process, because all languages have word-like units. However, the phonological form of words differs across languages, as shall be discussed further below.
Figure 2. Pattern of inhibitory connections between (a subset of) candidate words that match the input, activated by the presentation of the speech string [ʃɪpɪŋkwaɪəri].

The competition process described above can be time-consuming in specific situations, especially when a large number of activated words makes the recognition of the intended word harder (McQueen et al., 1994; Norris et al., 1995). Therefore, to achieve the highest possible efficiency in the word recognition process, some recent models combine the use of explicit segmentation strategies with the segmentation-by-word-recognition approach. Sublexical cues such as stress and phonotactics can constrain and modulate the activation and the competition of word candidates (for an overview, see McQueen, 2005). The importance of such models shall be discussed in the following section.
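The claim that only word sequences accounting for the entire input can win the competition can be illustrated with a toy exhaustive parser. This is not the algorithm of any particular model; the lexicon and the SAMPA-like transcriptions below are simplified stand-ins chosen for the ship inquiry example:

```python
# Toy illustration of segmentation by word recognition: only candidate
# sequences that exhaust the entire input count as viable parses.
LEXICON = {
    "ship":     ("S", "I", "p"),
    "shipping": ("S", "I", "p", "I", "N"),
    "ink":      ("I", "N", "k"),
    "choir":    ("k", "w", "aI", "@"),
    "inquiry":  ("I", "N", "k", "w", "aI", "@", "r", "i"),
}

def parses(segments):
    """Return all ways to cover the segment sequence with lexicon words."""
    if not segments:
        return [[]]
    results = []
    for word, pron in LEXICON.items():
        if tuple(segments[: len(pron)]) == pron:
            for rest in parses(segments[len(pron):]):
                results.append([word] + rest)
    return results

# The speech string [SIpINkwaI@ri] for 'ship inquiry':
print(parses(["S", "I", "p", "I", "N", "k", "w", "aI", "@", "r", "i"]))
# -> [['ship', 'inquiry']]: 'shipping' and 'choir' are activated by
#    parts of the input but cannot head a parse that accounts for all
#    of it, so they lose the competition.
```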
1.5 Models of spoken-word recognition

Models of spoken-word recognition try to explain how speech comprehension works by modeling its likely architecture. Much of the empirical work across the field has been driven by these models. The experiments in this thesis also derive from the theoretical frameworks provided by such models. It is therefore important to understand how these theories inspired past and present research, and how the questions of this thesis are embedded in them. Many models of spoken-word recognition have their roots in visual-word recognition and were therefore initially based on reading (e.g., Forster, 1976; Morton, 1969). Because reading has spatial properties, it differs tremendously from listening: speech is distributed in time, and this temporal aspect of spoken language makes it very different from written language, as described earlier. A number of models of spoken-word recognition have been developed since the 1970s. All these models assume two basic processes: the activation and competition of several word candidates (e.g., Gaskell & Marslen-Wilson, 1997; Marslen-Wilson & Welsh, 1978; McClelland & Elman, 1986; Norris, 1994). The degree to which a lexical candidate is activated depends on the degree of match between the speech signal and the lexical representation. These models differ in their architecture: 1) with respect to the perceptual unit used in the mapping process, and 2) in the flow of information they allow (e.g., top-down and bottom-up). For example, processing is data-driven and bottom-up when it proceeds from the sensory input to the mental lexicon, hence from lower to higher levels. Top-down or knowledge-driven processes are those where higher levels, say those representing lexical, syntactic, or conceptual knowledge, influence word recognition. Autonomous models allow only a bottom-up flow of information, while interactive models allow both bottom-up and top-down processes. Certain models are autonomous but allow some kind of interaction at later stages. Either assumption can lead to different predictions as to which candidates compete with each other (for recent reviews of word-recognition models, see Dahan & Magnuson, 2006; Gaskell, 2007; Luce & McLennan, 2006; McQueen, 2005). The three most influential models of spoken-word recognition are Cohort, TRACE, and Shortlist. All these models include the central feature of spoken-word recognition, that is, the competition of simultaneously activated words. However, they also illustrate different types of approaches.
1.5.1 Cohort model

The first model to suggest the concurrent activation of lexical candidates was the Cohort model (Marslen-Wilson & Welsh, 1978; and further updates, Marslen-Wilson, 1987, 1989, 1990; Marslen-Wilson & Tyler, 1980; Marslen-Wilson & Warren, 1994). In this model, processing is divided into three phases. The first phase is the activation phase, when the initial cohort (a set of candidate words) is set up based on the unfolding acoustic-phonetic representations. The second phase is word recognition, which occurs as soon as only one word is left in the cohort. In the third phase, this word is integrated into the context. For example, if a sentence or a word starts with the phoneme /t/, an initial cohort is set up that contains all possible words beginning with this phoneme (e.g., tent, tub, tennis, etc.). As new acoustic information comes in (e.g., the phoneme /ε/), the cohort is reduced to candidates that are consistent with the input (to tent, tennis, and other words starting with /tε/); mismatching words are excluded from the cohort (e.g., tub). This continues until there is only one word left. The Cohort model assumes that many words can be recognized before their offset due to an early uniqueness point, i.e., a point at which there are no other competing words. For example, no other German words start with /hufl/, and hence the cohort would be reduced to the word Huflattich 'coltsfoot' after only four speech sounds. This procedure enables listeners to determine the end of a word and, more importantly, the onset of the following word. The activation of a cohort of word candidates happens serially, in a strict left-to-right processing manner. Moreover, cohort membership operates with an all-or-none rule, i.e., once a word has been excluded from the cohort, it cannot re-enter. Despite ample supporting evidence (e.g., Cole & Jakimik, 1980; Grosjean, 1980; Tyler, 1984; Tyler & Wessels, 1983; Zwitserlood, 1989), researchers have pointed out several weak points of the model (e.g., Grosjean, 1985; Norris, 1982; Salasoo & Pisoni, 1985). The model relies on the listener's knowledge about where words start, but does not propose any explicit mechanism for how word beginnings are found. Moreover, there is no implementation of an explicit procedure that would allow lexical candidates to re-enter the cohort after they have been discarded. The process of recovery from missed or misheard word onsets is also not further specified. Furthermore, a large number of words do not possess early uniqueness points and are recognized only after their offset (e.g., short words that may or may not be parts of longer words, such as the word ship in the earlier example ship inquiry; see Luce, 1986). Therefore, the beginning of the next word cannot be predicted in such a simple way. In a modified version of the Cohort model (Marslen-Wilson, 1987, 1989), the all-or-none exclusion of a word candidate from the cohort was changed in such a way that a word is not completely deactivated and can still be recognized or reactivated when, for example, a pronunciation mistake has occurred. This revised model also allows features, rather than only phonemes, to play a role in building up an initial cohort. Even though the Cohort model has been very influential and many subsequent models implemented some of its characteristics, it cannot fully explain how a word is recognized if the initial cohort was built on the wrong onset.
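The winnowing of the cohort can be sketched in a few lines. The toy lexicon below uses orthographic strings as a stand-in for acoustic-phonetic input, which is of course a simplification of the model:

```python
# Sketch of the Cohort model's winnowing process over a toy lexicon.
LEXICON = ["tent", "tennis", "tub", "ten"]

def cohort_trace(word):
    """Show how the cohort shrinks as each segment arrives."""
    for i in range(1, len(word) + 1):
        prefix = word[:i]
        cohort = [w for w in LEXICON if w.startswith(prefix)]
        print(f"after '{prefix}': cohort = {cohort}")
        if len(cohort) == 1:
            # In the classic model the word can be recognized here,
            # possibly before its acoustic offset.
            print(f"uniqueness point reached at segment {i}")
            break

cohort_trace("tennis")
```

Running cohort_trace("ten") instead shows the complementary problem noted above: a short word embedded in longer ones never reaches a uniqueness point before its offset.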
1.5.2 Trace

Some of the weaknesses of the Cohort model have been addressed by the TRACE model (Elman & McClelland, 1988; McClelland, 1979; McClelland & Elman, 1986), the first connectionist model that made clear predictions about how the speech stream is segmented into words. TRACE is based on the interactive activation model of visual-word recognition (McClelland & Rumelhart, 1981; Rumelhart & McClelland, 1982). It allows both bottom-up and top-down flow of information and is a computer-implemented model. In this account, competition between word candidates proceeds via lateral inhibition between lexical competitors. The input representations are more complex than in most other models: they are based on seven acoustic-phonetic features, and each feature has a nine-point scale. The distinctive features are abstracted from the sensory input, and they then activate phonemes. The activation of a phonemic unit takes place as a function of its fit with the activated distinctive features, so that for a given input several phonemes can be activated. Activated phonemes raise the activation level of the words that include them. Words that do not match in onset, but only in parts of the perceived input, can also become part of the set of candidates. TRACE allows bi-directional facilitatory, but no inhibitory, connections between the units of adjacent levels. All connections between units within a level are inhibitory. Once a feature or a word has been activated, it inhibits other competitors within its level. The lexical context plays an important role in this model; it can provide direct support for acoustic-phonetic processing. Moreover, higher-level information can influence lexical activity levels, although this has not been explicitly implemented in the model. Even though TRACE has been directly or indirectly supported by a large number of studies and has certainly been one of the most influential models (e.g., Foss & Blank, 1980; Ganong, 1980; Mehler & Segui, 1987; Morton & Long, 1976; Rubin, Turvey, & Van Gelder, 1976; Samuel, 1987, 1996; see also McClelland, Mirman, & Holt, 2006, for an overview), several features of the model have been criticized (e.g., Burton, Baum, & Blumstein, 1989; Cairns, Shillcock, Chater, & Levy, 1995; Frauenfelder, Seguí, & Dijkstra, 1990; Massaro, 1989; Massaro & Cohen, 1991; Marslen-Wilson, Moss, & Van Halen, 1996; Marslen-Wilson & Warren, 1994; McQueen, 1991; McQueen, Norris, & Cutler, 1999; Norris, 1993, 1994; Pitt & McQueen, 1998). The main weakness of the model appears to be the coding of the acoustic signal as abstract phonetic features or phonemes, which does not capture variation due to speech rate, phonological context, or different speakers. TRACE assumes that each phoneme has the same duration, but this is not a feature of natural speech. Moreover, information units (such as phonemes or words) in TRACE become active as the speech signal unfolds over time, but the time dimension is implemented as a parallel duplication of information units at each time step. This results in separate representations of the same linguistic unit occurring at different points in time. Such a computation appears rather inefficient and implausible in terms of real-time processing (Norris, 1994). A further point that has been questioned concerns the assumption of top-down processing (see also Norris, McQueen, & Cutler, 2000).
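The core dynamics of competition via lateral inhibition can be illustrated with a drastically simplified update rule. The activation values and the inhibition weight below are arbitrary toy numbers, not TRACE's actual parameters, and unlike the real model this sketch lets all candidates inhibit each other regardless of whether they overlap in time:

```python
# One simplified iteration scheme for competition via lateral
# inhibition, in the spirit of interactive-activation models.
INHIBITION = 0.3

def update(activations):
    """Each unit is suppressed by the summed activation of its rivals."""
    new = {}
    for word, act in activations.items():
        rivals = sum(a for w, a in activations.items() if w != word)
        new[word] = max(0.0, act - INHIBITION * rivals)
    return new

words = {"ship": 0.6, "shipping": 0.5, "inquiry": 0.7, "choir": 0.4}
for _ in range(5):
    words = update(words)
print({w: round(a, 3) for w, a in words.items()})
# Strongly supported candidates suppress weaker ones over iterations;
# a full model would inhibit only candidates that overlap in time.
```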
1.5.3 Shortlist

The Shortlist model (Norris, 1994; Norris, McQueen, Cutler, & Butterfield, 1997; with extensions, Norris, McQueen, & Cutler, 2000) is a connectionist and autonomous computer-implemented model. It is very similar to TRACE in that it is also based on competition via lateral inhibition between competitors. However, unlike TRACE, Shortlist does not allow higher-level influences on lower-level processing. Activation spreads from the phoneme level to the word level, but not vice versa. Facilitation and inhibition are possible only within a level. The model consists of two stages. First, in a strict bottom-up manner, likely word candidates are activated at each phoneme in the speech input and become part of the shortlist. Only the shortlisted words enter a small interactive activation network and compete with each other. If there are too many word candidates in the shortlist, those with the least bottom-up support are discarded, so that candidates with more bottom-up support can be included. Only those (combinations of) words that can account for the entire input are potential winners of the competition (as in the example ship inquiry in Figure 2). Like every model, Shortlist has also been criticized (e.g., Connine, Titone, Deelman, & Blasko, 1997; McClelland et al., 2006; Newman, Sawusch, & Luce, 1997). The debate has mainly focused on the question whether an autonomous approach is necessary if most of the experimental data can also be explained by interactive models. However, Norris and colleagues have posed the counter-question whether there is experimental evidence showing that top-down feedback is really required; Shortlist would be challenged if such evidence existed (for an exhaustive discussion, see Norris et al., 2000, and the accompanying responses). The resolution of the debate about whether a flow of information from higher processing levels back to lower levels is necessary in speech perception (i.e., whether top-down information affects phonemic decisions) may be provided by future research. In contrast to the previous models, however, Shortlist attempts to implement newer empirical findings in order to achieve better predictive power. Sublexical cues such as stress and phonotactics are already implemented in the Shortlist model and effectively modulate word competition by reducing the number of possible candidates. In addition, Norris and colleagues implemented in the model a proposed universal principle of segmentation which does not hinge on structural differences between languages: the Possible-Word Constraint (PWC; Norris et al., 1997; Norris, McQueen, Cutler, Butterfield, & Kearns, 2001). It is based on the assumption that a minimal possible word or syllable should contain a vowel. Following this proposal, listeners make use of knowledge about possible words, which can help them dismiss spurious word candidates. The central issue of this dissertation is to examine this universal segmentation proposal.
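The first, bottom-up stage of Shortlist can be caricatured as maintaining a capped candidate list ranked by bottom-up support. The scores and the list size below are invented for illustration; the real model derives support from the match between candidates and the phonemic input:

```python
# Sketch of Shortlist's first stage: a small, capped candidate list
# ranked by bottom-up support. Scores are illustrative assumptions.
SHORTLIST_SIZE = 3

def shortlist(candidates):
    """Keep the best-supported candidates; drop the rest."""
    ranked = sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:SHORTLIST_SIZE])

bottom_up_support = {"inquiry": 0.9, "ship": 0.8, "shipping": 0.7, "choir": 0.4}
print(shortlist(bottom_up_support))
# -> {'inquiry': 0.9, 'ship': 0.8, 'shipping': 0.7}; 'choir' is discarded.
# Only the shortlisted words enter the interactive competition network.
```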
1.6 A universal principle in speech segmentation

A large number of empirical studies using a wide variety of experimental paradigms confirm that the segmentation problem can be solved by interword competition. Models of spoken-word recognition such as TRACE and Shortlist, reviewed in the previous sections, make clear predictions about the competition mechanism. Both models suggest that the candidates that best match the input reduce the activation of competing words. This is presumably a universal mechanism. The activation and competition can be further modulated by explicit language-specific segmentation cues. Shortlist simulates spoken-word recognition by combining the competition process with explicit cues such as stress and phonotactics. While such explicit segmentation cues are language-specific, the Shortlist model also implements the universal PWC (Norris et al., 1997), which can modulate the competition in a uniform way. For example, on hearing the German word kleiden 'to clothe', leiden 'to suffer' (among others) is also activated. This candidate, however, loses the competition because its recognition would leave an unaccounted-for stretch of speech /k/, which is not an existing German word and, according to the PWC, not a viable lexical parse. The PWC modulates the competition process in such a way that lexical parses including impossible words (i.e., non-syllabic sequences) are disfavored. The PWC imposes different processing constraints on consonants versus vowels during lexical segmentation. It treats stretches of speech according to whether they can be viable parts of a lexical parse. The universal assumption behind the PWC is that any stretch of speech that contains at least one vowel is a viable parse. Single consonants, because they do not constitute syllables, are not viable, and words (e.g., leiden) in parses that contain vowelless residues (e.g., in kleiden) are penalized. This applies in a universal manner; the language-specific constraints on well-formed words are not crucial. The PWC is used to predict where likely word boundaries will and will not occur; they are unlikely to occur at points that would leave single consonants stranded. The evidence for this constraint comes from experiments using the word-spotting task (see section 1.8.1). For example, in Norris et al.'s (2001) study, listeners were required to spot English words embedded at the end of nonsense strings. A word such as perturb was presented in three contexts: a tense-vowel CV syllable (e.g., dahperturb), a lax-vowel syllable (e.g., dEperturb), and a single consonant (e.g., sperturb). None of these contexts are existing English words, but dah could be a word. The PWC predicts that English listeners would treat both syllables alike and spot words in syllabic contexts faster than in single-consonant contexts. This is what Norris and colleagues found: listeners' detection of the target word was significantly faster and more accurate in both syllable contexts than in the consonant context. The crucial result was that there was no difference between the two syllables dE and dah, although only the latter is a well-formed English syllable. Moreover, there was a significant difference between dE and s, both of which are impossible English words. To assess the universality of the PWC, cross-linguistic investigations are necessary. Languages differ in what constitutes a possible word. Whereas some Bantu languages such as Sesotho (Doke & Mofokeng, 1957) prohibit monosyllabic stand-alone words, other languages, such as the Salish language Nuxálk (also known as Bella Coola; Bagemihl, 1991; Nater, 1984) or Berber (Dell & El Medlaoui, 1985; Ridouane, 2002), allow vowelless sequences to be words. Yet other languages allow words that consist of only one consonant, for example Slavic languages such as Slovak, Russian, and Czech. The basic principle of the PWC has therefore been tested in several languages that exhibit different structural constraints on possible words and syllables, such as Sesotho (Cutler, Demuth, & McQueen, 2002), Japanese (McQueen, Otake, & Cutler, 2001), Dutch (McQueen & Cutler, 1998), Cantonese (Yip, 2004), and British Sign Language (Orfanidou, Adam, McQueen, & Morgan, 2007). Even in a language such as Sesotho, in which possible stand-alone words must consist of at least two syllables, the effect of the PWC has been replicated. If the PWC were sensitive to this language-specific constraint on word structure, a monosyllabic sequence should be penalized in Sesotho speech segmentation. However, Cutler et al. (2002) showed that the recognition of a word such as alafa 'to prescribe' was as fast in a monosyllabic context (roalafa) as in a bisyllabic context (pafoalafa), but slower in a single-consonant context (halafa), in line with the universal PWC. Japanese, a language whose phonological structure is defined in terms of a subsyllabic unit called the mora, provided another interesting test case for the generality of the PWC (McQueen et al., 2001).
Previous studies (Cutler & Otake, 1994; Otake et al., 1993) had shown that Japanese listeners are sensitive to moraic structure and that the mora is the basic rhythmic unit in the segmentation of Japanese speech. A mora can consist, for instance, of a vowel (V), of one or two consonants plus a vowel (CV or CCV), or of a single nasal coda consonant (C) (see also Otake et al., 1993; Vance, 1987). The crucial question was whether Japanese listeners treat moraic nasals (which are not possible words, but are well-formed subsyllabic units) as possible or impossible residues in the segmentation of continuous speech. Japanese listeners spotted a word such as saru 'monkey' as fast in vowel contexts (sarua) as in moraic nasal contexts (saruN), and they spotted words such as agura 'to sit cross-legged' faster in vowel contexts (oagura) than in non-moraic consonant contexts (tagura). Despite the fact that single consonants are impossible words in Japanese, moraic nasals were viable segmentation contexts because they signal likely word (or mora) boundaries. Taken together, all these studies have one point in common: they show that any syllable (or subsyllabic unit), but not a single consonant, is considered an appropriate parsing unit. What counts as a possible word in terms of the PWC, as summarized by McQueen and colleagues (2001: 127),

...is not whether a sequence of sounds constitutes a phonologically acceptable word in the native language of any one listener, but whether that sound sequence, irrespective of the listener's native language, consists of only consonantal material.

The PWC therefore operates on the language-universal constraint that no chunk in a lexical parse may contain only consonants. None of the languages on which the PWC has been tested so far allows single consonants in its lexical inventory. It is therefore not clear how the PWC applies in languages such as Slovak. Are parses with single consonants disfavored in Slovak speech segmentation or not? This dissertation hence investigates how segmentation proceeds in Slovak, a language that allows single-consonant words. A comparison with German, a language that does not allow such consonantal minimal words, is carried out. In the context of this work, single-consonant words do not refer to the variable phonetic realization of words influenced by phonological processes such as reduction or deletion (which are possible in German and in many other languages), but rather to words that have the underlying form of a single consonant, as in Slovak. One of the main questions, namely whether the PWC really is a universal segmentation principle, shall be addressed in Chapter 2 with native listeners of Slovak. In Chapter 3, the PWC is investigated with native listeners of German as well as with Slovak learners of German.
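The viability check at the heart of the PWC is simple enough to sketch directly: a residue left over next to a candidate word is viable only if it contains a vowel. The vowel inventory and the penalty value below are illustrative assumptions, not parameters from the published model:

```python
# Minimal sketch of the Possible-Word Constraint: a residue is a
# viable part of a lexical parse only if it contains a vowel.
VOWELS = set("aeiouE@")  # toy SAMPA-like vowel set
PWC_PENALTY = 0.5

def viable_residue(residue):
    """A residue is viable if it is empty or contains a vowel."""
    return residue == "" or any(seg in VOWELS for seg in residue)

def score(candidate_score, residue):
    return candidate_score if viable_residue(residue) else candidate_score - PWC_PENALTY

# Spotting 'perturb' in the nonsense strings of Norris et al. (2001):
for context in ["dah", "dE", "s"]:
    print(context, score(1.0, context))
# 'dah' and 'dE' both contain vowels and are viable residues;
# the single consonant 's' is penalized, as the word-spotting
# results described above would predict.
```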
1.7 Slovak and German

There are several crucial differences between Slovak and German. Slovak belongs to the West Slavic language group and is most closely related to Czech and Polish. German, on the other hand, is a West Germanic language, closely related to English and Dutch. Unlike German, Slovak allows words consisting solely of one consonant (k 'to') or of a group of consonants (prst 'finger', where /r/ forms the syllabic nucleus). Slovak has four prepositions consisting of single consonants: k 'to', z 'from', s 'with', and v 'in' (see Nilsson, 1998, for a short description). Prepositions are proclitic, thus combining phonologically with the following word (e.g., v rane [vraɲe] 'in the wound').⁴ Orthographically, however, they are always separated by a blank from the following noun or adjective (or other word classes) to avoid ambiguity in a phonetically otherwise ambiguous sequence such as [vraɲe], which represents either the prepositional phrase v rane ('in the wound') or the noun vrane ('crow' + singular dative inflection).

⁴ Examples throughout the dissertation are provided in citation form (in italics) or in broad phonemic transcriptions (between slashes); narrow phonetic transcriptions are given only where appropriate (between square brackets). Note that all vowels in Slovak are lax. Orthographically, vowel quantity is spelled with an acute (e.g., á), and most of the palatal consonants in Slovak are marked by a háček (e.g., ď, č). However, if the consonants d, t, n, l are followed by the vowels e or i, they are mostly also palatal but not marked by a háček (e.g., deti 'children').

Table 1. Slovak and German minimal words and possible syllables in comparison. Adapted from Hanulíková & Dietrich (2008).

  Slovak example   Structure   |  German example   Structure
  k                C           |  –                –
  a                V           |  e:               (C)V
  ma:              CV          |  du:              CV
  pre              CCV         |  flo:              CCV
  sklo             CCCV        |  ʃtro:            CCCV
  ak               VC          |  in               (C)VC
  mak              CVC         |  bal              CVC
  vlak             CCVC        |  pral             CCVC
  vstatʲ           CCCVC       |  ʃtram            CCCVC
  pstrux           CCCCVC      |  –                –
  u:st             VCC         |  ast              (C)VCC
  jestʲ            CVCC        |  lust             CVCC
  pri:stʲ          CCVCC       |  frust            CCVCC
  striastʲ         CCCVCC      |  ʃpriçt           CCCVCC
  –                –           |  o:pst            (C)VCCC
  tekst            CVCCC       |  markt            CVCCC
  –                –           |  krampf           CCVCCC
  –                –           |  ʃtrumpf          CCCVCCC
  –                –           |  ɛrnst            (C)VCCCC
  –                –           |  hɛrpst           CVCCCC
  –                –           |  ʃvaŋkst          CCVCCCC
  –                –           |  pflantst         CCCVCCCC
  –                –           |  impfst           (C)VCCCCC
  –                –           |  kɛmpfst          CVCCCCC
  –                –           |  krampfst         CCVCCCCC

  V = vowel, diphthong, or syllabic consonant; C = consonant; (C) = glottal stop.
Both languages have relatively complex syllable structures (see Table 1). While Slovak phonology allows consonant clusters of up to four consonants in onset position (e.g., pstruh 'trout'), German syllables mainly exhibit complex codas, with up to five consonants (e.g., du kämpfst 'you fight'). More than two consonants in the Slovak syllable coda are rare and occur only in loanwords (e.g., [tekst] 'text'; see Rubach, 1993). Furthermore, the Slovak coda exhibits fewer possible combinations than the German coda (see Hanulíková, 2003; Hanulíková & Dietrich, 2008, for more details). The German vowel system can be characterized by vowel quantity and vowel quality (there is no consensus about which is the primary phonological feature). In contrast to German, Slovak exhibits phonemic quantity distinctions but not quality distinctions. One language-specific characteristic of syllable structure in German is that German syllables should not end with a short lax vowel. In Slovak, no such restriction holds. Slovak is a fixed-stress language (stress falls on the word-initial position and is not contrastive at the word level), while German has lexical stress (i.e., stress is less predictable with regard to position, and it is contrastive, e.g., überSETZen 'to translate' versus ÜBERsetzen 'to ferry across the river', where capital letters indicate main stress). In Slovak, stress has a demarcative function, because its position correlates with word onsets, similar to Czech, Hungarian, Finnish, and Icelandic (e.g., Sabol & Zimmermann, 1994; Zimmermann, 1990). In German, stress is culminative: each word or phrase has a single strongest syllable bearing the main stress (Hayes, 1995; see also Cutler, 2005). Further discussion of stress will be provided in Chapter 2 and Chapter 3. The differing metrical structures as well as the constraints on possible words and syllables will be crucial in the experiments conducted in this thesis.
1.8 Methodologies

When a scientist wants to find an answer to the question of how speech processing works, several methods can help her on the way. A scientist from Utrecht, F. C. Donders (1869/1969), was among the first to suggest that mental processes take time and can therefore be measured chronometrically. These so-called reaction-time (RT) measurements have since been widely used in various areas of psychology and linguistics in the hope of gaining more insight into online mental processes. Speech processing research is largely based on such empirical evidence. This means that it is often based on some kind of measurement, for example in terms of precise time such as milliseconds or in terms of proportions of specific verbal responses. Such measurements can be achieved by using various experimental tools and experimental designs. In some cases, a relative difference of only a few milliseconds between the entities under investigation might already reveal different ongoing processes. Most of the work reviewed throughout this introductory chapter was based on empirical methods with response-time measurements. In the present thesis, several experiments with such measurements are reported. Two different RT tasks were used: the word-spotting task and the auditory lexical-decision task. In the syllabification reversal task no
response times but rather the quantity and the quality of specific verbal responses were measured.
1.8.1 Word-spotting

The word-spotting task was developed by Cutler and Norris (1988) to study the role of segmentation cues in spoken-word recognition. Since then, the task has proven useful in testing the role of cues such as stress (e.g., Banel & Bacri, 1994; Cutler & Norris, 1988; Suomi et al., 1997; Vroomen, Van Zon, & de Gelder, 1996; Vroomen et al., 1998), phonotactics (McQueen, 1998; Van der Lugt, 2001; Weber & Cutler, 2006), allophonic variation (Dumay, Content, & Frauenfelder, 1999; Kirk, 2000; Yerkey & Sawusch, 1993), and lexical competition (e.g., McQueen et al., 1994; Norris et al., 1995). In this task, listeners hear nonsense words or phrases over headphones and are required to detect any real word that might be embedded in these nonsense words. The relative difficulty of spotting a word in a given context as compared to another should be reflected in differences in reaction times and error rates. It is assumed that there is a link between the activation of a word and the response: If the activation level is low, the RT is longer and the error rate is higher. The dependent variables are hence the detection latencies and the error rates (see also McQueen, 1996, for further details of the task).
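As an illustration of these two dependent variables, the following minimal sketch computes the mean detection latency (over hits) and the error rate per context condition; the trial records and field layout are invented for the example, not taken from the study:

    from collections import defaultdict

    # (participant, condition, hit, rt_ms); a miss has no button press, hence no RT
    trials = [
        ("p01", "preposition", True, 612.0),
        ("p01", "syllable", False, None),
        ("p02", "preposition", True, 540.0),
        ("p02", "syllable", True, 798.0),
    ]

    def summarize(trials):
        rts, misses, n = defaultdict(list), defaultdict(int), defaultdict(int)
        for _, condition, hit, rt in trials:
            n[condition] += 1
            if hit:
                rts[condition].append(rt)
            else:
                misses[condition] += 1
        return {c: {"mean_rt": sum(rts[c]) / len(rts[c]) if rts[c] else None,
                    "error_rate": misses[c] / n[c]} for c in n}

    print(summarize(trials))
    # -> preposition: mean_rt 576.0, error 0.0; syllable: mean_rt 798.0, error 0.5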
1.8.2 Auditory lexical decision

The auditory lexical-decision task is an established method used to measure the time needed to decide whether a string is or is not a real word in a given language. Initially applied mainly in reading research (i.e., visual lexical decision), it has also been widely used in research on audition and speech comprehension to study basic processes in word recognition (McCusker, Hillinger, & Bias, 1981; and see Goldinger, 1996, for an overview). It has also often been used as a control task in combination with word-spotting experiments. The speed with which a word can be recognized varies as a function of variables such as the task properties (auditory or visual presentation), the frequency and length of words, or interpersonal differences.
1.8.3 Syllabification reversal task

In the syllabification reversal task, listeners reverse the syllables of a word that they have heard (e.g., monkey), and produce the resulting string (e.g., [ki.mʌŋ]). This task was developed by Treiman and Danis (1988) to investigate the affiliation of intervocalic consonants (e.g., to which syllable the consonant /t/ in letter belongs). Since then it has been used to investigate people's internal representation of words and to address syllabification as well as speech segmentation issues in various languages (e.g., Berg & Niemi, 2000; Content, Kearns, & Frauenfelder, 2001; Schiller, Meyer, & Levelt, 1997). Participants are usually required to respond quickly in order to minimize reasoning about syllable boundaries. In such a metalinguistic task, listeners are forced to make explicit choices about syllable onsets.
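The mechanical part of the task is trivial once a syllabification is chosen; what the experiment measures is precisely where listeners place the boundary. A minimal sketch (Python; dot-separated transcriptions are an assumed input format, not the thesis's notation):

    def reverse_syllables(syllabified):
        """Reverse dot-separated syllables, e.g. 'mVN.ki' -> 'ki.mVN'."""
        return ".".join(reversed(syllabified.split(".")))

    # If a listener syllabifies letter as le.tter, the reversal differs from let.ter:
    print(reverse_syllables("le.tter"))  # tter.le
    print(reverse_syllables("let.ter"))  # ter.let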
1.9 Structure of the thesis

The thesis is structured as follows. In Chapter 2, three word-spotting experiments are presented that investigate how segmentation proceeds in Slovak, a language with single-consonant words and fixed stress. The universality of the PWC is assessed by measuring Slovak native listeners' recognition of words preceded either by syllables or by single consonants. Furthermore, the role of fixed stress in speech segmentation of Slovak is examined by comparing recognition of words with or without main stress on their initial syllables. Chapter 3 reports three experiments that employed the same design as the studies in Chapter 2. These studies address the segmentation of native German listeners and native Slovak learners of German with respect to the PWC and stress. Moreover, the aim of this chapter is to examine how native knowledge of Slovak influences the segmentation of non-native speech. Chapter 4 describes two syllabification experiments with German and Slovak speakers. These studies were conducted to investigate the role of syllabification in speech segmentation. In the final Chapter 5, a summary of all experiments is provided and future directions are discussed.
2 Segmentation of Slovak speech
The Possible-Word Constraint (PWC; Norris, McQueen, Cutler, & Butterfield, 1997) is a universal segmentation principle: lexical candidates are disfavored if the resulting segmentation of continuous speech leads to vowelless residues in the input, for example, single consonants. Three word-spotting experiments investigate how segmentation proceeds in Slovak, a language with single-consonant words and fixed stress.
2.1 Introduction

The English phrase play tennis, when spoken, also contains the unintended words lay, late and eight among others. Similarly, the Slovak word kvety 'flowers' contains for example the embedded words k 'to', v 'in', vety 'sentences' and ty 'you'. Users of any language, when listening to running speech, are faced with the same task: They have to extract the right words from the speech signal, and not these unintended words, in order to understand the message. Given the continuous nature of the speech signal, how do listeners across languages with differing phonological constraints know where one word ends and another begins? One proposal that has been made is that listeners have at their disposal a segmentation procedure in which vowelless sequences are disfavored as parsing units (Norris et al., 1997). The goal of this study was to investigate whether this segmentation mechanism is language-universal. Specifically, I tested whether the segmentation of continuous speech in Slovak, a language with single-consonant words (e.g., k 'to' or v 'in'), follows the same principles as other languages, most of which do not allow single consonants to be words. Previous research has suggested several solutions to how the segmentation problem can be solved across languages (for reviews, see Dahan & Magnuson, 2006; Mattys, 1997; McQueen, 2007). The present work investigated a supposedly universal principle by which the continuous speech signal is segmented into words in a similar way across languages, known as the Possible-Word Constraint (PWC; Norris et al., 1997; Norris, McQueen, Cutler, Butterfield, & Kearns, 2001). The PWC is based on a central mechanism of word recognition, according to which several words that match the acoustic input are activated and compete with each other (e.g., Gaskell & Marslen-Wilson, 1997; McClelland & Elman, 1986; Norris, 1994). The PWC proposes that the activation of a word will be reduced if between this word and any likely word boundary, vowelless
sequences such as single consonants are left over. A likely word boundary can be signaled by a pause or by language-specific cues such as stress, allophonic detail or phonotactic probabilities. Hence, the recognition of a word such as lay in play or vety in kvety will be unlikely because this would leave a single consonant (p or k) as a residue. In this way the word is misaligned with the likely word boundary before the first consonant (cued, e.g., by silence if the input word is utterance initial). The PWC account thus suggests that viable residues can be vowels but not single consonants, and that this is true for all languages. The PWC unifies two popular approaches. The first is that segmentation is modulated by language-specific, signal-driven cues. The second is that segmentation is a by-product of lexical competition. Note that these two approaches are not necessarily mutually exclusive. The PWC in fact attempts to explain how signal-based and lexical information can be integrated in the segmentation process. In support of the first approach, previous research has shown that the speech signal contains several regularly occurring physical properties, such as the aspiration of word-initial plosives or durational cues, which correlate with word boundaries (Lehiste, 1960; Nakatani & Dukes, 1977) and are exploited by listeners (e.g., Dumay, Content, & Frauenfelder, 1999; Gow & Gordon, 1995; Mattys & Melhorn, 2007; Quené, 1987, 1992; Shatzman & McQueen, 2006; Spinelli, McQueen, & Cutler, 2003). However, languages differ in how word boundaries are marked by physical cues and hence the exact nature of these cues depends on language-specific phonology (Lehiste, 1964). For example, all languages have specific restrictions on sequential probabilities and listeners can use these cues for word boundary location (e.g., Church, 1987; McQueen, 1998; Vitevitch & Luce, 1999; Van der Lugt, 2001). Listeners across languages also rely on information provided by the rhythmic structure of a given language. The use of metrical structure was demonstrated in a stress-timed language such as English and is known as the Metrical Segmentation Strategy (Cutler & Norris, 1988). Cutler and Norris showed that English listeners segment speech more easily at strong syllables (containing a full vowel) than at weak syllables (containing a reduced vowel). The listeners hence rely on distributional probabilities, as most lexical words in English indeed start with a strong syllable (Cutler & Carter, 1987). In a syllable-timed language such as French, syllables seem to be used for segmentation (e.g., Cutler, Mehler, Norris, & Segui, 1986; Dumay, Frauenfelder, & Content, 2002; Mehler, Dommergues, Frauenfelder, & Seguí, 1981), while in Japanese, segmentation is based on the mora (Cutler & Otake, 1994; Otake, Hatano, Cutler, & Mehler, 1993). The segmentation procedures based on stress information therefore vary depending on the metrical structure of a specific language, as has been demonstrated in studies in Spanish, Catalan, French, and Dutch (Pallier, Sebastián-Gallés, Felguera, Christophe, & Mehler, 1993; Sebastián-Gallés, Dupoux, Seguí, & Mehler, 1992; Vroomen & de Gelder, 1995; see also Cutler, 2005, for a review). Not enough is known about how stress information is used in segmentation in languages where stress regularly falls on the first position in a word and thus demarcates word boundaries.
This applies for example in Slovak, where the primary stress always falls on the first syllable of a word (e.g., Kráľ & Sabol, 1989; Pauliny, 1979; Sabol, 1977; Sabol & Zimmermann, 1994). This demarcative property of fixed stress is potentially
useful for the segmentation of continuous speech, as has been partly demonstrated in Finnish, a language that also exhibits word-initial fixed stress (Suomi, McQueen, & Cutler, 1997; Suomi, Toivanen, & Ylitalo, 2003; Suomi & Ylitalo, 2004; Vroomen, Tuomainen, & de Gelder, 1998). These signal-based cues, however, are probabilistic, and none of them by itself is sufficient to solve the segmentation problem entirely. The second approach to the segmentation problem, based on lexical competition, proposes a more general solution: Words are recognized through the competition of alternative word candidates, and segmentation is a by-product of word recognition. Through the competition process, the word recognition system can settle on an optimal parse of the speech input, even if signal-based cues are not present. The activation and competition of multiple lexical candidates are core mechanisms implemented by most models of spoken-word recognition (e.g., TRACE, McClelland & Elman, 1986; and Shortlist, Norris, 1994; see McQueen, 2005, for a review) and have received a great deal of empirical support (e.g., Allopenna, Magnuson, & Tanenhaus, 1998; Cluff & Luce, 1990; Connine, Blasko, & Titone, 1993; Connine, Blasko, & Wang, 1994; McQueen, Norris, & Cutler, 1994; Norris, McQueen, & Cutler, 1995; Tabossi, Burani, & Scott, 1995; Vitevitch & Luce, 1998, 1999; Vroomen & de Gelder, 1995; Zwitserlood & Schriefers, 1995). In the English phrase play tennis, the words play, lay, late and any, for example, would compete with each other for recognition. The correct segmentation falls out of the competition process as play and tennis win out over their competitors. Various sources of information can thus be exploited to solve the segmentation problem. Recent research, however, has emphasized the importance and interaction of these different sources of information (Mattys, 2004; Mattys, White, & Melhorn, 2005). Mattys et al. (2005) proposed a hierarchy of the relative importance of various speech segmentation cues depending on the listening conditions. In a series of English cross-modal fragment priming experiments, Mattys et al. showed that (with an optimal-quality input) listeners rely strongly on lexical knowledge as compared to sublexical cues such as segmental acoustics and word stress, the latter being at the end of a weighted hierarchy of different sources of information. A critical question is how these multiple information sources are integrated. The PWC offers an answer to this question. The PWC unifies the use of signal-based cues with the use of lexically-based cues, and adds a general viability constraint based on simple information about whether a vowel is present or absent in the speech input. The primary support for the PWC comes from studies using the word-spotting paradigm developed by Cutler and Norris (1988), in which listeners were asked to respond whenever they found an English target word embedded at the beginning or at the end of a nonsense sequence. For example, in Norris et al. (1997), the English word apple was embedded in a single-consonant context fapple or in a CVC syllable context vuffapple. Neither of these contexts is an existing English word, but only vuff is a possible well-formed word in English. The results showed that if the target word apple was preceded by a single consonant, it was recognized more slowly and less accurately than when it was preceded by a nonsense syllable.
This result was interpreted as evidence that a single consonant, which itself cannot be an English word, is not a viable residue of the input and thus will be disfavored as a parsing unit. According to the PWC, the activation of lexical candidates
is reduced in this way. Norris et al. (1997) implemented this principle in the Shortlist model of spoken-word recognition (Norris, 1994) and specified the nature of a viable residue as one containing a vowel. To test whether the model was indeed universal, Norris et al. (2001) investigated how English listeners use their knowledge about the well-formedness of a syllable in their native language. The question was whether they would treat any syllable as a possible parsing unit (i.e., as a viable residue), independently of the phonological constraints in English. A word such as perturb was therefore presented in three contexts: a tense-vowel CV syllable (e.g., dahperturb), a lax-vowel CV syllable (e.g., dEperturb) and a single consonant (e.g., sperturb). None of these contexts are existing English words, but dah could be a word. While there are English words with tense vowels and an open coda (e.g., car [ka] in British English), there are none with a lax vowel without the coda being closed (e.g., deck [dɛk], but not dE [dɛ]). The PWC predicts that English listeners would treat both syllables alike and spot words in syllabic contexts faster than in single-consonant contexts. Replicating the results from the previous experiment, listeners' detection of the target word was significantly faster and more accurate in both syllable contexts as compared to the consonant context. The crucial result was that there was no difference between the two syllables dE and dah, although only the latter is a well-formed English syllable. Moreover, there was a significant difference between dE and s, both of which are impossible English words. Given this evidence, Norris et al. (2001) specified the PWC more clearly: A possible residue does not have to be a phonologically acceptable word in a specific language; however, it cannot consist of only consonants. According to Norris and colleagues, the potential lexical status of words in a specific language is not relevant, and the PWC is a universal principle, a claim that has been supported by studies with further languages including Japanese (McQueen, Otake, & Cutler, 2001), Dutch (McQueen & Cutler, 1998), Sesotho (Cutler, Demuth, & McQueen, 2002), and Cantonese (Yip, 2004). These languages exhibit differing surface constraints on possible syllables, differing phonological processes as well as differing language-specific cues to word boundaries. For example, possible stand-alone words in Sesotho must consist of minimally two syllables. If the PWC were sensitive to this language-specific constraint on word structure, a monosyllabic sequence should be penalized in segmentation. However, Cutler et al. (2002) showed that the recognition of a word such as alafa 'to prescribe' was as fast in a monosyllabic context as in a bisyllabic context, but slower in a single-consonant context, in line with the universal PWC. The effect of the PWC was observed in all these languages, and thus the conclusion was drawn that segmenting a word from a vowelless sequence is always harder. This suggests, taking the earlier Slovak example, that the word vety in kvety is misaligned with a possible word boundary and therefore will be penalized: a single consonant is a non-viable string because it is not a syllable. The status of the Slovak single consonant k in kvety, however, differs from other languages. Whereas in many languages single consonants are not possible words, Slovak does allow single-consonant words such as the preposition k.
Thus, one important issue is this: How are single consonants processed in languages in which they are meaningful grammatical elements and well-formed words? Will these be considered as non-viable residues, even though they constitute lexical units and hence are part of the candidate
set? If so, the word vety in kvety would be penalized, making kvety easier to recognize. This benefit of the PWC would however come with a cost for other input: vete in k vete 'to the sentence' would be penalized, so that recognition of such sequences would then have to be based on grammatical knowledge. Alternatively, if the PWC is not a simple universal, prepositions could be treated as viable residues. In consequence, vety in kvety is not disfavored. The recognition of kvety would then not benefit from the PWC and hence might be slower because more candidates would be in competition (e.g., vety). But recognition of vete given k vete would not be made more difficult. In a language such as Slovak, therefore, it is not obvious whether prepositions should or should not count as viable residues in speech segmentation. Languages differ in what constitutes a possible word. Whereas some Bantu languages such as Sesotho (Doke & Mofokeng, 1957) prohibit one-syllable stand-alone words, other languages such as the Salish language Nuxálk (also known as Bella Coola; Nater, 1984; Bagemihl, 1991), or Berber (Dell & El Medlaoui, 1985; Ridouane, 2002) allow vowelless sequences to be words. Yet other languages allow words that contain only one consonant, as for example Slavic languages such as Slovak, Russian and Czech. None of the languages on which the PWC has been tested so far allows single consonants in its lexical inventory. The present study was thus intended to fill this gap and investigates how segmentation proceeds in the Slovak language. Slovak provides a direct test of the universality of the PWC and of its claim that single consonants are universally non-viable residues. Slovak belongs to the West Slavic language group (together with Czech, Polish, Sorbian and Kashubian), and is most closely related to Czech. Slovak phonology allows the occurrence of clusters of up to four consonants in onset position (e.g., pstruh 'trout'; Rubach, 1993). It also allows words consisting solely of a group of consonants, which however always contain at least one sonorant as a syllabic nucleus (e.g., /r/ in the famous tongue twister strč prst skrz krk 'stick the finger down the throat'). In addition, Slovak has four prepositions consisting of single consonants: k 'to', z 'from', s 'with', and v 'in'. Each has a voiceless and a voiced positional allophone and each also has a vocalized form, e.g., /k/, /g/ and /ku/; /v/, /f/ and /vo/; etc. The vocalized form occurs when the following word has a similar place of articulation (e.g., zo zeme 'from the earth'; e.g., Nilsson, 1998). Out of one million word forms in the Slovak national corpus (Slovenský národný korpus, 2007), 3% are these single-consonant prepositions (vocalized units form an additional 0.4% and are thus considerably less frequent). Prepositions are proclitic, thus combining phonologically with the following word (e.g., v rane [vraɲe] 'in the wound'). However, orthographically they are always separated by a blank from the following noun or adjective (or other word classes) to avoid ambiguity in a phonetically otherwise ambiguous sequence such as [vraɲe], representing v rane ('in the wound') or vrane ('crow' + Singular Dative inflection). Three word-spotting experiments (Experiments 1A, 1B and 2) were conducted to explore how single consonants are processed during segmentation of continuous speech in Slovak depending on whether they are words (e.g., k 'to') or not (e.g., t).
The processing cost for target words preceded by single consonants was compared to words preceded by nonsense syllables (e.g., dug), which could be words in Slovak. If the PWC really is universal, it should apply to Slovak as it applies to other languages. Slovak listeners
should treat single consonants as non-viable units because they do not constitute syllables. The PWC should hence apply irrespective of the lexical status of prepositions or of the potential lexical status of any single consonant in Slovak. Slovak listeners should be slower at spotting a word in a single-consonant context as compared to a syllable context. However, if the PWC is sensitive to some language-specific characteristics of the viability of speech input, Slovak listeners should treat single-consonant prepositions as viable residues. Hence, spotting a word in a prepositional context (a real Slovak word) should be as fast as in a syllable context (a possible Slovak word). Spotting a word in a non-prepositional context should be slower because this kind of context is not a Slovak word and contains no vowel.
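These two predictions can be made concrete with a small sketch (Python; the vowel set, function names and the binary penalty are my own simplification of the graded activation penalty implemented in Shortlist, not code from the thesis):

    VOWELS = set("aeiouy")  # toy stand-in for a full vowel inventory

    def viable_universal(residue):
        """Universal PWC: a residue is viable iff it contains a vowel;
        its lexical status in the language is irrelevant."""
        return any(seg in VOWELS for seg in residue)

    # Hypothetical language-sensitive alternative tested in this chapter:
    # Slovak single-consonant prepositions count as viable because they are words.
    SLOVAK_PREPOSITIONS = {"k", "z", "s", "v"}

    def viable_slovak(residue):
        return viable_universal(residue) or residue in SLOVAK_PREPOSITIONS

    def penalized(candidate, carrier, viable):
        """A candidate is penalized if a non-viable residue is stranded between
        its onset and the likely word boundary at the start of the carrier."""
        residue = carrier[: carrier.index(candidate)]
        return residue != "" and not viable(residue)

    print(penalized("vety", "kvety", viable_universal))    # True: residue 'k' is vowelless
    print(penalized("vety", "kvety", viable_slovak))       # False: 'k' is a Slovak word
    print(penalized("ruka", "dugruka", viable_universal))  # False: 'dug' contains a vowel

Under the universal version, the competitor vety is suppressed in kvety but vete is also suppressed in k vete; under the language-sensitive version the reverse trade-off holds. This is exactly the contrast that the word-spotting experiments below probe.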
2.2 Experiment 1

I adopted a design similar to that of Norris et al. (1997) and used the word-spotting task, in which listeners respond when they detect any real word embedded in nonsense strings (Experiments 1A and 1B; 1A was a pilot study to test the material and procedure with a small number of participants). This task allows us to compare detection of the same word in different contexts. Target words were embedded in three preceding contexts: a single prepositional consonant, a single non-prepositional consonant, and a syllable. Given that naturally produced items were used, the same target could exhibit acoustic differences over conditions. To control for this variation, a very useful heuristic for word-spotting experiments is to use the auditory lexical-decision task, presenting targets which have been excised from their contexts. I therefore used this task and tested whether the words taken from each of the word-spotting contexts were equally recognizable (Experiment 1C).
2.2.1 Method

Participants

Sixty-three native speakers of Slovak, students at the Faculty of Mechatronics (mechanical engineering) and the Department of Political Science at the Alexander Dubček University of Trenčín, volunteered or received monetary compensation for their participation. Thirty-six students participated in Experiment 1B and 27 in Experiment 1C. They were recruited on the basis of written advertisements or from classes. An additional nine native Slovak speakers, students from various universities and employees of the Slovak cultural institute or the Slovak embassy in Berlin, took part in the pilot Experiment 1A. None of them reported any hearing difficulties.5
5 Participants in this study and all following experiments filled in a questionnaire (see Appendix B).
Materials and Design
Seventy-two Slovak bisyllabic words (nouns and verbs) were selected as targets. All started with a consonant and had no other words embedded in them (with the exception of unavoidable single-vowel or single-consonant words such as a 'and' or v 'in'; but this applied to all material and all three conditions equally).6 Each word was embedded in three preceding contexts to yield three nonsense sequences per target. For example, the target word ruka 'hand' was embedded in a syllabic context (e.g., /dugruka/), in a single-consonant context that also functions as a preposition in Slovak (e.g., /gruka/), or a single consonant that has no meaning (e.g., /truka/). The syllabic context (e.g., /dug/) is not an existing Slovak word, but it could be one, as the Slovak vocabulary contains monosyllabic words with short lax vowels (e.g., zub 'tooth'). For the prepositional context, two existing prepositions k 'to' and v 'in' were chosen (for each only one allophone, /g/ and /f/ respectively, was used). Verb targets were embedded only in /g/ contexts because v is also a verbal prefix (just as s and z) and hence real words would have emerged. It is important to note that all combinations of prepositional consonants and targets always resulted in nonsense sequences (e.g., /gruke/ 'to the hand' is a meaningful sequence in Slovak, but /gruka/ is not). Further consonants, /p, ʃ, t/, which are not possible words in Slovak, were used in the non-prepositional context. The syllable context consisted of CVC syllables with the short vowels [u], [ɛ], [o], [a], [i] as nuclei. The final consonants of the CVC syllables were balanced so as to end equally often either with a prepositional or with a non-prepositional consonant. The consonant clusters that emerged through the combination of consonantal onsets of target words and the added preceding consonantal context (e.g., /tr/ in truka) were all phonotactically legal in Slovak. The material was controlled for frequency estimates of the lemma and onset consonant clusters, which were taken from the Slovak national corpus (Slovenský národný korpus, 2007) and logarithmically transformed. The average log lemma frequency per million for the target words was 1.9120. The mean log frequency of the onset consonant clusters over all items was 2.390 in the prepositional condition and 2.351 in the non-prepositional condition. All experimental items are listed in Appendix A.1. Further, 133 fillers were constructed so as to match the form of the target-bearing strings. None of the filler items contained existing Slovak content words (again, as previously mentioned, it was unavoidable that single-segment words were embedded). In addition to the author, two Slovak native speakers checked the materials for possible embeddings. Three experimental lists were then created with all the fillers in each list. Each target appeared in all lists, but in only one preceding context in a given list. Type of context was counterbalanced over lists so that each list had 24 targets in each type of context. For the syllable condition, stimuli were chosen in such a way that half of the syllable contexts per list ended with a prepositional consonant and half with a non-prepositional consonant.
6 In previous studies employing the word-spotting task, short words such as determiners (e.g., a, an) were present but never spotted, presumably because of a preference for longer words. Therefore, it is unlikely that single-segment words will be spotted in the present study (including the prepositional context such as k). In addition, the instructions state to spot words at the end and not at the onset of the nonsense word, making the detection of the prepositions k or v improbable.
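The Latin-square counterbalancing described above (each target once per list, rotating through the three context types) can be sketched as follows; the target identifiers and the rotation scheme are illustrative, not the actual list-building procedure:

    from collections import Counter

    CONTEXTS = ["preposition", "non-preposition", "syllable"]

    def build_lists(targets, n_lists=3):
        """Rotate context assignment so each list pairs every target with
        exactly one context and receives 24 targets per context type."""
        return [[(t, CONTEXTS[(i + l) % len(CONTEXTS)])
                 for i, t in enumerate(targets)]
                for l in range(n_lists)]

    lists = build_lists([f"target{i:02d}" for i in range(72)])
    print(Counter(context for _, context in lists[0]))
    # Counter({'preposition': 24, 'non-preposition': 24, 'syllable': 24})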
Chapter 2
37
The order of stimuli was randomized, with the restriction that at least one filler occurred between two target-bearing items. A set of four additional target-bearing items, along with nine fillers, was constructed and used for a practice session. The materials were read by a phonetically trained female native speaker of Slovak who was not aware of the aim of the study. She received instructions to read the material at a normal speech rate. The main stress was always on the first syllable of the whole string and intervocalic consonants were produced ambisyllabically. The speaker read the items one by one, separated by a pause, in a clear citation style three times in a row. The list of items was randomized. The recordings were made in a sound-proof booth on a Digital Audio Tape (DAT) at a 48 kHz sampling rate with 16-bit resolution. They were then re-digitized onto a computer and down-sampled to 22.05 kHz. The stimuli were measured, labeled and spliced into single speech files using the Praat speech editor (Boersma, 2001). None of the stimuli contained a schwa between the consonant clusters (e.g., /tr/ in truka or /gr/ in dugruka). All speech files were normalized so that their mean amplitude was comparable.

For the lexical-decision task used in Experiment 1C, all target words were carefully excised from their preceding contexts using the Praat speech editor. For example, the target word ruka was removed from its preceding contexts /dug/, /t/, and /g/ respectively. Visual and auditory criteria were used to determine the onset of the first segment of the target, cutting at positive zero-crossings. The same procedure was applied to the fillers. The same three lists were used as in Experiments 1A and 1B, but without the preceding contexts.

Procedure

The participants were tested separately in a quiet room. For Experiments 1A-B, they received written instructions that they would hear nonsense strings over headphones. Their task was on each trial to press a button whenever they detected a real word embedded at the end of a nonsense string. For Experiment 1C, the written instructions stated that they would hear real Slovak words and nonsense words over headphones. They were asked to press a response button if they thought the presented item was a Slovak word. In both sub-experiments, participants were asked to respond both as fast and as accurately as possible and to say aloud the word they found. Both experiments started after a short practice session. Participants heard the stimuli one at a time over headphones at a comfortable listening level. Figure 3 shows the basic organization of the experimental lists and the presentation order of the stimuli in Experiments 1A-B; in Experiment 1C only the timing and the materials differed (this organization applies for the following chapters as well). Note that examples of stimuli in Figure 3 and in the following sections are not transcribed at every mention but are often provided in orthographic form for the sake of simplicity, such that the grapheme {g} refers to the voiced allophone of the preposition {k}. The presentation of the stimuli, the timing, and the response time (RT) measurements were controlled by NESU (Nijmegen Experiment Set-Up), experimental software developed at the Max Planck Institute for Psycholinguistics in Nijmegen.
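The down-sampling and amplitude-normalization steps in the stimulus preparation above might look as follows in a modern toolchain; this is a sketch assuming the soundfile and scipy packages (not the tools used for the original study), with invented file names, and equating "mean amplitude" with RMS level:

    import numpy as np
    import soundfile as sf
    from scipy.signal import resample_poly

    audio, rate = sf.read("gruka_48k.wav")          # DAT recording at 48 kHz
    audio = resample_poly(audio, up=147, down=320)  # 48000 * 147/320 = 22050 Hz

    target_rms = 0.1                                # common level for all stimuli
    audio *= target_rms / np.sqrt(np.mean(audio ** 2))
    sf.write("gruka_22k.wav", audio, 22050)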
Each trial started with a 500 ms silence, after which the stimulus was presented. The time interval between the onsets of two successive trials was 4000 ms in Experiments 1A-B and 3000 ms in Experiment 1C. The subjects' spoken responses were recorded on tape as a control. All responses were monitored during the experiment and checked for correctness a second time using the recordings. Button-press responses accompanied by spoken responses that were not the intended target words were discarded and counted as errors. The RTs in these and the following experiments were recorded from stimulus onset, but prior to the analysis were adjusted so as to measure from word offset by subtracting the total sequence duration.

[Figure 3 shows the trial sequence for the three experimental lists, each beginning with a different context version of the same target (list 1: gruka, list 2: truka, list 3: dugruka), followed by fillers (e.g., snúha) and further target-bearing items (zugrisa, grisa, trisa); each stimulus is preceded by 500 ms of silence, with 3500 ms until the next trial onset.]

Figure 3. Experimental procedure: Presentation of stimuli and timing.
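The offset-based adjustment just described is simple arithmetic; a one-line sketch (function and variable names invented):

    def rt_from_word_offset(rt_from_onset_ms, sequence_duration_ms):
        # negative values would indicate a response before the target word ended
        return rt_from_onset_ms - sequence_duration_ms

    print(rt_from_word_offset(1450.0, 890.0))  # 560.0 ms after word offset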
2.2.2 Results

If an item was missed by more than 2/3 of all participants in one of the conditions, it was excluded from the analysis. Five items (liga, lúštiť, nudiť, suma, and kačka) were therefore excluded from Experiment 1A, four items (ríša, suma, sebec, and liga) from Experiment 1C, and one item (liga) was excluded from Experiment 1B. Similarly, if a participant missed more than 50% of all items per condition, her or his data were also excluded from the analysis. The data of one subject were hence excluded from Experiment 1B. The data of one subject in Experiment 1A were lost due to a technical failure. The mean RTs and the mean error rate (no response or response other than the intended target) were then calculated for each condition and analyzed.
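In code, the two exclusion rules amount to simple threshold checks; the data layout below is hypothetical:

    def items_to_exclude(misses_per_item, n_participants, threshold=2 / 3):
        """Items missed by more than 2/3 of participants in a condition."""
        return {item for item, n_missed in misses_per_item.items()
                if n_missed / n_participants > threshold}

    def participants_to_exclude(miss_rates, threshold=0.5):
        """Participants missing more than 50% of items in any condition."""
        return {p for p, per_condition in miss_rates.items()
                if any(rate > threshold for rate in per_condition.values())}

    print(items_to_exclude({"liga": 30, "ruka": 2}, n_participants=36))  # {'liga'}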
2.2.2.1 Experiment 1A: Word-spotting

Mean RTs and mean error rates for each context are displayed in Figure 4 and Table 2. Responses to words in the prepositional context were faster and more accurate than in either of the other two contexts. Analyses of Variance (ANOVAs) for both participants (F1) and items (F2) were carried out. There was a main effect of context in both the RT analysis (F1(2, 14) = 60.13, p < .001; F2(2, 132) = 16.49, p < .001) and in the error analysis (F1(2, 14) = 21.99, p < .001; F2(2, 132) = 7.77, p = .001).7 Paired t-tests between the three conditions showed that responses to strings like gruka were significantly faster (t1(7) = 9.97, p < .001; t2(66) = 3.70, p < .001) than to truka, but only marginally more accurate (t1(7) = 1.82, p = .11; t2(66) = 1.78, p = .08). Responses to a syllable context like dugruka were both slower (t1(7) = 8.34, p < .001; t2(66) = 5.97, p < .001) and less accurate (t1(7) = 6.83, p < .001; t2(66) = 3.94, p < .001) than to gruka. Moreover, ruka in truka was detected faster (t1(7) = 4.24, p = .004; t2(66) = 2.08, p = .042) and more accurately (t1(7) = 4.63, p = .002; t2(66) = 2.12, p = .038) than in dugruka. In summary, the recognition of a word in the prepositional condition was easier than in the syllable and the non-prepositional condition. This result does not follow the PWC because word detection should be easier in the syllable context and should be equally hard in both single-consonant contexts. Further interpretation will be possible with more data collected in Experiment 1B.
[Figure 4 displays, for Experiments 1A and 1B, bar charts of mean RTs (left axis, in ms) and mean error percentages (right axis) for the prepositional, non-prepositional, and syllable contexts.]

Figure 4. Experiments 1A and 1B: Word-spotting. Mean reaction times (RTs, measured in ms from word offset) and mean percentage of errors, as a function of type of context. Error bars show standard errors. Prep = prepositional consonant, non-prep = non-prepositional consonant.
7 Greenhouse-Geisser values are reported throughout the dissertation, but the degrees of freedom are uncorrected.
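The dual reporting convention used throughout (t1/F1 computed over participant means, t2/F2 over item means) can be sketched as follows; the arrays are random placeholder data, not the experiment's:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    prep = rng.normal(560, 80, size=35)         # one mean RT per participant
    nonprep = prep + rng.normal(100, 60, size=35)

    t1, p1 = stats.ttest_rel(prep, nonprep)     # by participants: t1(34)
    print(f"t1(34) = {abs(t1):.2f}, p = {p1:.3g}")
    # the same test over the 71 item means yields t2(70)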
Table 2. Experiments 1A, 1B, 1C, and 2: Mean RTs (in ms measured from word offset) and mean error rates for each condition. C = Consonant, V = Vowel.

Type of Context       C preposition   C non-preposition   CVC syllable
Example               gruka           truka               dugruka
Experiment 1A
  Mean RT             563             767                 880
  Mean Error          3%              7%                  15%
Experiment 1B
  Mean RT             561             668                 744
  Mean Error          6%              8%                  18%
Example               (g)ruka         (t)ruka             (dug)ruka
Experiment 1C
  Mean RT             364             368                 376
  Mean Error          3%              4%                  5%
Example               g/ruka          t/ruka              dug/ruka
Experiment 2
  Mean RT             569             700                 631
  Mean Error          7%              14%                 9%
2.2.2.2 Experiment 1B: Word-spotting
Mean RTs and mean error rates for the three preceding contexts are shown in Figure 4 and Table 2. Responses to target words in the prepositional condition (e.g., gruka) were faster and exhibited fewer errors than in the non-prepositional condition (e.g., truka). Responses to a syllable context (e.g., dugruka) were both slower and less accurate than to the other two conditions. There was a significant main effect of context for both the RT analysis (F1(2, 68) = 35.49, p < .001; F2(2, 140) = 25.26, p < .001) and the error analysis (F1(2, 68) = 27.89, p < .001; F2(2, 140) = 22.91, p = .001). All pairwise t-tests (see Table 3) showed significant differences between the conditions, except for the error rate between preposition and non-preposition.8 An additional analysis was conducted to rule out that the main pattern of results could be attributed to differing frequencies of consonant clusters. A correlation analysis was conducted with the RTs for items in the prepositional and non-prepositional conditions as the dependent variable and the log frequency of the onset clusters for those items as the independent variable. There was no effect of frequency that could have explained the difference obtained between those two conditions (R = .01, p = .91).
8 The main pattern of results was consistent across conditions independent of the nature of the prepositional consonant (k versus v).
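The frequency control above is a plain item-level correlation; a sketch with invented values:

    from scipy.stats import pearsonr

    item_rts = [540, 612, 583, 701, 655]          # mean RT per item (ms)
    log_cluster_freq = [2.1, 2.6, 2.3, 2.4, 2.2]  # log onset-cluster frequency
    r, p = pearsonr(item_rts, log_cluster_freq)
    print(f"R = {r:.2f}, p = {p:.2f}")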
Table 3. Experiment 1B: Paired t-test comparisons between the three types of contexts.

                                    Dependent variables
Comparisons                         RT                 Error
preposition vs. non-preposition     t1(34) = 5.91**    t1(34) = 1.55
                                    t2(70) = 4.72**    t2(70) = 1.58
preposition vs. syllable            t1(34) = 7.21**    t1(34) = 6.30**
                                    t2(70) = 6.83**    t2(70) = 6.28**
non-preposition vs. syllable        t1(34) = 3.55**    t1(34) = 5.43**
                                    t2(70) = 2.82**    t2(70) = 4.42**

Note. **p < .01.