SPRINGER BRIEFS IN LINGUISTICS
Shu-chen Ou
Perceptual Training on Lexical Stress Contrasts A Study with Taiwanese Learners of English as a Foreign Language
More information about this series at http://www.springer.com/series/11940
Shu-chen Ou
National Sun Yat-sen University
Kaohsiung, Taiwan

ISSN 2197-0009 ISSN 2197-0017 (electronic)
SpringerBriefs in Linguistics
ISBN 978-3-030-51132-6 ISBN 978-3-030-51133-3 (eBook)
https://doi.org/10.1007/978-3-030-51133-3

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2020

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Acknowledgements
This book would not have been possible without the generous help from many people. First of all, I am indebted to my former research assistants—Ms. Rong-ting Yeh, Ms. Hsiao-wen Cheng, and Ms. Ya-ling Lin—for their invaluable assistance with subject recruitment, data collection, and various aspects of the experiments. I also appreciate the time and patience of all the subjects, who participated in the study between 2010 and 2012 and provided much interesting data. I owe special thanks to Mr. Zhe-chen Guo, whose assistance with statistical analyses has led me to insights that might have escaped my attention. My gratitude extends to my parents and my sister Jenny for their wholehearted support throughout my career. Finally, I am grateful to the National Science Council (NSC, now known as the Ministry of Science and Technology) of Taiwan for financial support. The present research was funded by two NSC grants to the author (grant numbers: NSC 99-2410-H-110-055; NSC 100-2410-H-110-047).
Contents
1 Introduction
  1.1 Background
  1.2 Theories of Second-Language Speech Perception
  1.3 Term Clarification
  1.4 Organization of the Book
  References

2 Perceptual Training: A Literature Review
  2.1 Perceptual Training: The Methodology and Design
  2.2 Applications of Perceptual Training
    2.2.1 Training Listeners to Perceive Non-native Segmental Contrasts
    2.2.2 Training Listeners to Perceive Non-native Suprasegmental Contrasts
  2.3 Methodological Factors That Can Influence Training Outcomes
    2.3.1 Training Method: Identification Versus Discrimination
    2.3.2 Talker and Context Variability
  2.4 English Lexical Stress and Mandarin-Speaking Listeners
  2.5 The Current Study
  References

3 Training to Perceive English Lexical Stress in Rising Intonation: The Immediate Effects
  3.1 Introduction
  3.2 Materials
  3.3 Acoustic Analysis
    3.3.1 Falling Intonation
    3.3.2 Rising Intonation
  3.4 Training and Testing Procedures
    3.4.1 Pre-test
    3.4.2 Perceptual Training
    3.4.3 Post-test
  3.5 Participants
  3.6 Response Accuracy Analysis
  3.7 Results
    3.7.1 Trainee Group
    3.7.2 Control Group
  3.8 Summary and Discussion
    3.8.1 Enhanced Performance Under the Rising Intonation Context
    3.8.2 Reduced Performance Under the Falling Intonation Context
  3.9 Conclusion
  References

4 Training to Perceive English Lexical Stress in Rising Intonation: Generalizability and Retainability
  4.1 Introduction
  4.2 Materials
  4.3 Acoustic Analysis
    4.3.1 Falling Intonation
    4.3.2 Rising Intonation
  4.4 Testing Procedures
    4.4.1 Generalization Test
    4.4.2 Retention Test
  4.5 Participants
  4.6 Response Accuracy Analysis
  4.7 Results
    4.7.1 Generalization Test
    4.7.2 Retention Test
  4.8 Summary and Discussion
    4.8.1 Generalizability of the Training Effects
    4.8.2 Retainability of the Training Effects
  4.9 Conclusion
  References

5 General Discussion and Conclusion
  5.1 Summary of the Findings
  5.2 Implications for Second-Language Speech Perception Models and the Perceptual Training Paradigm
  5.3 Further Issues
    5.3.1 Stimuli Variability
    5.3.2 The Effects of Perceptual Training on Production
    5.3.3 Perceptual Learning of Lexical Stress in a More Ecologically Valid Environment
  5.4 Conclusion
  References

Appendix A
Appendix B
Appendix C
Appendix D
Chapter 1
Introduction
Abstract Perceiving sound contrasts in a second language (L2) can be challenging. One problem that confronts Mandarin-speaking learners of English as a foreign language is perceiving English lexical stress contrasts embedded in rising intonation. The present book reports a training program that aims to help the learners overcome such a perceptual challenge. As an introduction, this chapter sketches how difficulty with L2 speech perception arises from the perspectives of some major models or theories. It is then pointed out that the difficulty may be effectively mitigated with a technique known as perceptual training. Before this technique is discussed in detail in the next chapter, we clarify a few terms that will be used throughout the rest of the chapters and present the organization of the book.
1.1 Background

Learning a second language (L2) as an adult is widely believed to be difficult. The difficulty usually stems from various sources, and the specific challenges may vary from one learner to another. Yet, no matter how different these challenges are, there are a number of common goals that have to be achieved for a learner to be considered proficient in the L2. One of them is effective speech comprehension. This ability is vital for communication with native speakers of the target language and other L2 learners, and problems with comprehension can occur at several different stages of speech processing. For example, learners may be unable to parse the heard utterance into syntactically appropriate units or recognize the implied messages. They may also experience lower-level problems such as failures to identify the intended words or perceive the L2 phonemes. Perceiving phonemes, or more generally, contrastive sound categories that serve to distinguish one word from another, is presumably a task at the lowest-level stage of speech processing. However, for L2 learners, this seemingly basic skill can prove challenging and may not be mastered even by those with extensive experience with the L2. Thus, there has been a vast array of studies that seek to help learners improve their perception of L2 sound distinctions. This book reports a study that represents one of these research endeavors and centers around a perceptual problem experienced by Mandarin-speaking listeners who have
been learning English as a foreign language (EFL) in Taiwan—their difficulty with perceiving English lexical stress contrasts embedded in rising intonation. In the current study, a technique known as “perceptual training” was applied to train Taiwanese EFL learners’ perception of English lexical stress in rising intonation. The following chapter will introduce the technique and describe the learners’ problem in greater detail. To set the scene for the discussions in later chapters, it would be useful to first review some major models of L2 speech perception and examine how they account for why perceiving certain L2 sounds is difficult in the first place. As it turns out, it is generally assumed that the challenge is, at least in part, caused by learners’ prior experience with their first language (L1). Next, a few terms that will be used recurrently are explained, followed by an overview of this book.
1.2 Theories of Second-Language Speech Perception

Adult L2 learners are often beset with troubles such as failure to differentiate between two contrastive L2 sounds in perception or production. In contrast, infants and young children seem to be in a better position to acquire L2 speech (Lamendella 1977; Singleton 1995), as reflected in their ability to discriminate even sound distinctions that are phonemically irrelevant in the L1 (Aslin et al. 1998; Streeter 1976; Werker and Lalonde 1988; Werker and Tees 1984). The relative difficulty associated with adult L2 speech learning is sometimes mentioned in support of one of the earliest claims about language acquisition—the Critical Period Hypothesis (CPH). It holds that the ability to acquire a new language is greatly reduced after puberty because of cerebral maturation and the concomitant loss of neural plasticity (Penfield and Roberts 1966; Lenneberg 1967). Nevertheless, while the CPH provides a bioneurologically-based explanation for challenges encountered in adults’ learning of L2 sound contrasts, it makes no specific empirical predictions about what exactly the challenges may be and does not explicitly address the role of the L1, which has been well documented in the L2 acquisition literature. More recent models of L2 speech perception have suggested that a number of problems with L2 perception can be explained and predicted by considering the differences and relations between sound categories in the L1 and L2. Although this section does not aim to provide a comprehensive review of the models or to compare their postulates or predictions, it would be insightful to discuss some of the major ones. One such model is the Perceptual Assimilation Model (PAM; Best 1994, 1995), which was originally proposed to account for perception of non-native speech by naïve listeners, namely, perception of sounds from a non-L1 by listeners without prior contact with the language.
Founded on the view of direct realism, PAM assumes that the listener directly perceives the gestural events involved in the articulation of the speech signal. Based on the perceived gestural information, a speech sound may or may not be assimilated to an L1 phoneme, and the patterns of assimilation of two sounds determine the ease or difficulty with which they can be discriminated. If two sounds are perceptually assimilated into two different L1 phonemes
(i.e., “two-category” assimilation), they should be differentiated accurately. On the other hand, if they are assimilated equally well to a single L1 phoneme (i.e., “single-category” assimilation), the discriminability is predicted to be poor. Between these two extreme cases is an assimilation type called “category goodness,” which occurs when both members of a sound pair are assimilated to an L1 phoneme, but one is perceived as a better exemplar than the other. Depending on the relative goodness of the two sounds as exemplars of the phoneme, discrimination in this case can range from poor to excellent. In addition, it is also possible that while one member of a pair is assimilated to an L1 phoneme, the other is unassimilable; when this occurs, the two sounds can be told apart easily. More recently, the framework of PAM has been extended to L2 speech perception and learning (PAM-L2; Best and Tyler 2007). Unlike naïve listeners, L2-learning listeners are in the process of acquiring the phonological system of a new language. Therefore, PAM-L2 is complemented with hypotheses regarding the emergence of the L2 phonological system and its interaction with the L1. However, the possible assimilation types described by PAM also hold for L2 speech perception. For example, learners would show “single-category” assimilation if the two sounds of an L2 contrast are perceived as equally good or bad exemplars of an L1 phoneme. Perceptual differentiation between the sounds in this case is predicted to be inaccurate and difficult. Likewise, L2 learners may also perceive one member of the contrast as a better exemplar of the phoneme or perceive it as unassimilable to any L1 category, showing varying degrees of discrimination accuracy. PAM-L2 further assumes that with extensive experience with the target language, learners can possibly establish robust categories for L2 sounds, which allow them to successfully perceive contrasts that were difficult at the initial stage of acquisition.
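The assimilation taxonomy just outlined can be summarized as a simple lookup from assimilation type to predicted discrimination accuracy. The sketch below is our own shorthand summary of the description above, not terminology or code from PAM itself:

```python
# Toy summary of PAM's assimilation types and the discrimination
# accuracy each predicts, per the description above. The dictionary
# keys and values are our shorthand, not PAM's formal terms.
PAM_PREDICTIONS = {
    "two-category": "excellent",               # members map to two L1 phonemes
    "single-category": "poor",                 # both map equally to one phoneme
    "category-goodness": "poor to excellent",  # one is the better exemplar
    "categorized-uncategorized": "very good",  # one member is unassimilable
}

def predicted_discrimination(assimilation_type: str) -> str:
    """Return the discrimination accuracy PAM predicts for a contrast."""
    return PAM_PREDICTIONS[assimilation_type]
```

This framing makes explicit that PAM's predictions are categorical: what matters is not the acoustic distance between two L2 sounds per se, but how each maps onto the listener's L1 phoneme inventory.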
Another influential model of L2 speech perception is the Speech Learning Model (SLM; Flege 1995, 1999, 2002). Unlike PAM-L2, SLM does not adopt a direct-realist perspective and thus is not framed in terms of assimilation based on perceived articulatory gestures; rather, it assumes that speech sounds are classified on the basis of perceived phonetic (dis)similarity. Yet, there are some commonalities between these two models in their predictions. SLM posits that the phonetic categories making up the phonological systems of L1 and L2 exist in a common space and that an L2 sound may be perceived as “new” or “similar” as compared with the existing L1 categories in this space. In the case of new L2 sounds, learners would not classify them as instances of an L1 category and may be able to perceive them accurately. On the other hand, problems with L2 speech typically arise when L2 sounds are so perceptually similar to some L1 category that they are classified as instances of that category, resulting in the so-called “equivalent classification.” Perception of two contrastive L2 sounds that undergo equivalent classification into the same L1 category is predicted to be challenging, as in the case of single-category assimilation of PAM. SLM also hypothesizes that precisely because of equivalent classification, it is especially difficult to form new categories for L2 sounds that are similar to certain L1 sounds.
However, as with PAM-L2, SLM does not reject the idea that learners can ultimately perceive L2 sounds accurately. One key assumption of SLM is that the mechanisms underlying L1 and L2 speech learning are fundamentally the same. As in the case of L1, an L2 phonological system takes time to develop and is guided by the received input. Sufficient L2 input over time may lead learners to distinguish between L1 sounds and similar L2 sounds. This possibility is supported by empirical findings demonstrating listeners’ sensitivity to fine-grained phonetic differences between L1 and non-L1 speech. For example, English-speaking listeners can discern even the subtle differences between the phoneme sequence /tu/ produced by English speakers and that produced by French speakers (Flege 1984). Such sensitivity may prevent equivalent classification of L2 sounds into L1 categories and facilitate the creation of new L2 categories. Problems with distinguishing between L2 sounds can also be understood from the perspective of other speech perception theories, such as the Native Language Magnet (NLM) model (Kuhl 1993, 2004). The model holds that experience with L1 warps the perceptual space, leading to the emergence of “prototypes” for native speech categories. The prototypes are the members of the speech categories that are considered to be the best exemplars. Just as a magnet attracts nearby objects, a prototype attracts similar sounds in such a way that those sounds are perceived as the same as the prototype. This is known as the “perceptual magnet effect,” which can make it difficult to distinguish between sounds attracted to the same prototype. As a consequence, listeners may not be able to accurately perceive L2 sounds due to attraction to some native-language prototypes.
Moreover, the perceptual magnet effect is found not only in adults but also in infants as young as six months old (Kuhl 1991), suggesting that the emergence of prototypes is ontogenetically early and that problems with L2 speech learning can therefore arise soon after initial exposure to the L1. Although the aforementioned models differ in how they theorize about the principles and mechanisms underlying speech perception, one consensus is that prior linguistic experience plays a crucial role in perceiving L2 sounds. Problems with L2 speech are most likely to occur when contrastive sounds are treated as instances of an L1 category, whether by virtue of perceptual assimilation, equivalent classification, or attraction to a prototype. Nevertheless, it is not asserted that the problems will necessarily persist and can never be overcome. Models such as PAM-L2 and SLM recognize the possibility of perceptual learning: that is, as one gains more L2 input, new categories for challenging L2 sounds may eventually be developed, making accurate perception possible. For L2 learners, instructors, and researchers, a practical question that may be of interest then concerns what can be done to facilitate perceptual learning. One conclusion from past studies on this issue has been that L2 speech perception can be effectively improved with a simple laboratory-based training method known as “perceptual training.” We will discuss this method in greater detail in the following chapter. For now, it is useful to clarify terms that have been used and will continue to be used throughout this book.
1.3 Term Clarification

As mentioned, there are well-documented differences between adults and children or infants in terms of their aptitude for learning the sounds of a new language. Since the current research and the studies to be reviewed are concerned with the adult population, the term “(L2) learners” refers specifically to adult learners. In addition, while being a listener to a language does not imply being a learner of that language, “learners” will sometimes be used interchangeably with “listeners,” as this book is about training L2 learners’ perception. Unless otherwise stated, it is assumed that those who are referred to as “listeners” have been learning the language that they are perceiving. “Native language” and “L1” are synonymous in our usage; similarly, a “non-native language” is assumed to be an “L2,” although the former clearly does not entail the latter. Again, it will be explicitly stated when “non-native language” refers to a non-L1 with which one has no prior experience.
1.4 Organization of the Book

We have presented in this chapter a brief sketch of L2 speech perception problems as viewed from different models and pointed out that they can possibly be mitigated with perceptual training. Chapter 2 will describe the training technique, examine various cases of its application, and identify a number of factors that may influence its efficacy. Then, we will discuss the perceptual challenge that forms the central theme of this book—Mandarin-speaking EFL learners’ difficulty with perceiving lexical stress patterns in rising intonation. Chapter 3 details the perceptual training program conducted to help the learners overcome the challenge and reports the results of a pre-test and a post-test. Performance on these tests was compared to evaluate how trainees’ perception of the same set of stimuli had altered shortly after the completion of the training. Chapter 4 reports the results of two further tests that assessed whether changes in perceptual patterns after the training generalized to new stimuli and whether they were retained for an extended period of time. Finally, Chap. 5 presents a general discussion of all the findings and concludes this book.
References

Aslin, R.N., P.W. Jusczyk, and D.B. Pisoni. 1998. Speech and auditory processing during infancy: Constraints on and precursors to language.
Best, C.T. 1994. The emergence of language-specific influences in infant speech perception. In Development of Speech Perception: The Transition from Recognizing Speech Sounds to Spoken Words, ed. J. Goodman and H.C. Nusbaum, 167–224. Cambridge, MA: MIT Press.
Best, C.T. 1995. A direct realist view of cross-language speech perception. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research, ed. W. Strange, 171–204. Timonium, MD: York Press.
Best, C.T., and M.D. Tyler. 2007. Nonnative and second-language speech perception: Commonalities and complementarities. In Language Experience in Second Language Speech Learning: In Honor of James Emil Flege, ed. O.-S. Bohn and M.J. Munro, 13–34. Amsterdam: John Benjamins.
Flege, J.E. 1984. The detection of French accent by American listeners. The Journal of the Acoustical Society of America 76 (3): 692–707.
Flege, J.E. 1995. Second language speech learning: Theory, findings, and problems. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research, ed. W. Strange, 233–276. Timonium, MD: York Press.
Flege, J.E. 1999. Age of learning and second language speech. In Second Language Acquisition and the Critical Period Hypothesis, ed. D. Birdsong, 101–131. Mahwah, NJ: Lawrence Erlbaum Associates.
Flege, J.E. 2002. Interactions between the native and second-language phonetic systems. In An Integrated View of Language Development: Papers in Honor of Henning Wode, ed. P. Burmeister, T. Piske, and A. Rohde, 217–243. Trier, Germany: Wissenschaftlicher Verlag Trier.
Kuhl, P.K. 1991. Human adults and human infants show a “perceptual magnet effect” for the prototypes of speech categories, monkeys do not. Perception and Psychophysics 50 (2): 93–107.
Kuhl, P.K. 1993. Innate predispositions and the effects of experience in speech perception: The Native Language Magnet theory. In Developmental Neurocognition: Speech and Face Processing in the First Year of Life, ed. B. de Boysson-Bardies, S. de Schonen, P. Jusczyk, P. McNeilage, and J. Morton, 259–274. Dordrecht: Springer.
Kuhl, P.K. 2004. Early language acquisition: Cracking the speech code. Nature Reviews Neuroscience 5: 831–843.
Lamendella, J.T. 1977. General principles of neurofunctional organization and their manifestation in primary and nonprimary language acquisition. Language Learning 27 (1): 155–196.
Lenneberg, E.H. 1967. The biological foundations of language. Hospital Practice 2 (12): 59–67.
Penfield, W., and L. Roberts. 1966. Speech and Brain Mechanisms. New York: Atheneum.
Singleton, D. 1995. Introduction: A critical look at the critical age hypothesis in second language acquisition research. In The Age Factor in Second Language Acquisition, ed. D. Singleton and Z. Lengyel, 1–29. Clevedon, Avon, UK: Multilingual Matters Ltd.
Streeter, L.A. 1976. Language perception of 2-month-old infants shows effects of both innate mechanisms and experience. Nature 259 (5538): 39–41.
Werker, J.F., and C.E. Lalonde. 1988. Cross-language speech perception: Initial capabilities and developmental change. Developmental Psychology 24 (5): 672–683.
Werker, J.F., and R.C. Tees. 1984. Phonemic and phonetic factors in adult cross-language speech perception. The Journal of the Acoustical Society of America 75 (6): 1866–1878.
Chapter 2
Perceptual Training: A Literature Review
Abstract This chapter provides a review of the perceptual training literature and presents the goals of the current research. We begin by describing in detail the methodology of perceptual training and its basic design and review several studies that apply this technique to improve learners’ perception of L2 segmental and, particularly, suprasegmental contrasts. As studies differ from one another in methodological aspects such as the task used to assess participants’ performance and the variability of the stimuli, some factors that can possibly impact the outcomes and efficacy of a training intervention are identified and discussed. Next, we summarize the findings from previous research on lexical stress perception and then touch on the issue of central interest in this book—the difficulty with perceiving English word stress patterns in rising intonation on the part of Mandarin-speaking learners of English as a foreign language. Finally, we outline a perceptual training program that aimed to help the learners overcome the difficulty and explain the rationale for its design.
2.1 Perceptual Training: The Methodology and Design

As shown in the previous chapter, speech perception models and theories generally assume that learners’ problems with discriminating between second-language (L2) sound contrasts partly arise from the tendency to perceive them in terms of native-language categories. Yet, their perceptual patterns may not be immutable, as there has been behavioral and neurophysiological evidence for the flexibility of the adult perceptual system (Bradlow and Bent 2008; Clarke and Luce 2005; Kraus et al. 1995; Lively et al. 1993; Norris et al. 2003; Nygaard and Pisoni 1998). Therefore, with proper training or instructions, learners can possibly recalibrate their perceptual criteria, thereby improving their perception of L2 contrasts (see Samuel and Kraljic 2009, for a review). A practical question that would immediately arise concerns what training methods are effective. Although introducing modifications to perceptual patterns of non-native speech might appear to involve long-term interventions and sophisticated instructions, it turns out that this can be readily achieved by using a method as simple
as auditory training in a laboratory setting, which we will refer to as “perceptual training.” Due to advantages such as ease of implementation, this method has been widely used to facilitate learning of a non-native contrast. In its basic form, a perceptual training program contains a pre-test, a series of training sessions, and a post-test, as illustrated in Fig. 2.1.

Fig. 2.1 Basic design of a perceptual training program

The pre-test assesses trainees’ initial performance on sound contrasts of interest to the researchers. This is normally done by asking them to complete an identification or discrimination task. In an identification task, they hear only one stimulus at a time and are given two or more options which represent the categories associated with the members of the sound contrasts. They then identify the stimulus by selecting the option corresponding to the category they think it belongs to. As for discrimination tasks, there are at least three types: AX discrimination, ABX discrimination, and oddity discrimination. In the first type, the participants are presented with two stimuli (referred to as A and X) and indicate whether X is of the same category as A. In the second type, they hear three stimuli (A, B, and X) and determine whether X is the same as A or as B. Or, as in an oddity discrimination task, they may hear three (or perhaps more) stimuli and choose the one that is different from the rest in terms of category affiliation.

After the pre-test, the trainees proceed to the training sessions, in which they receive repeated practice on the identification or discrimination task they have just completed. The stimuli used in these sessions may be the same as or different from those in the pre-test, but in either case, they contain the sound contrasts under investigation.
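As a concrete illustration, a single identification trial of the kind described above might be sketched as follows. This is a minimal sketch, not the study's actual software: `present` and `respond` are placeholder stand-ins for the audio-playback and response-collection code, and the stress labels are illustrative:

```python
# Illustrative two-alternative forced-choice identification trial for a
# lexical stress contrast (e.g., 'REcord' noun vs. 'reCORD' verb).
STRESS_LABELS = ("initial stress", "final stress")

def run_identification_trial(stimulus, correct_label, present, respond):
    """Run one identification trial and return (is_correct, feedback).

    `present` plays the auditory stimulus; `respond` collects the
    trainee's choice among the candidate labels. Both are placeholders
    for real audio/interface code.
    """
    present(stimulus)                 # play the spoken token
    answer = respond(STRESS_LABELS)   # trainee picks one label
    is_correct = (answer == correct_label)
    # Trial-by-trial feedback is what makes this a *training* trial;
    # in the pre- and post-tests, no feedback would be given.
    feedback = "correct" if is_correct else f"incorrect ({correct_label})"
    return is_correct, feedback
```

A training session would simply loop this function over a list of stimuli, while the pre- and post-tests would run the same loop with the feedback suppressed and only the accuracy recorded.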
Importantly, after responding to each trial during training, participants receive immediate feedback, which tells them whether the response was correct and gradually leads them to discover, implicitly, the perceptual cues to the contrasts. Some studies (e.g., Wang et al. 1999) have also included additional procedures, such as asking participants to orally repeat an auditory stimulus, to expedite learning. The total length of the sessions can range from six or seven days to several months. After the training is finished, the participants take the post-test, which is the same as the pre-test. Significant performance changes from the pre-test to the post-test are attributed to the training intervention. Finally, to help evaluate these effects, a control group, which takes the two tests but does not receive the training, is often included. Note that the procedures described above represent only a rudimentary design. As one of the essential goals of L2 learning is to develop robust new categories for non-native contrasts, researchers commonly also examine whether the training-induced modifications persist over time and whether they generalize beyond the contexts or situations in which the contrasts were learned. These two questions can be addressed with "retention" and "generalization" tests, respectively. Usually
identical to the post-test, a retention test is administered after an extended period of time, such as three or six months, has elapsed since the completion of the training (e.g., Bradlow et al. 1997; Lively et al. 1994). A generalization test is normally given at roughly the same time as the post-test. It assesses whether training effects extend to novel stimuli; the source of novelty can be the speaker as well as the word and phonetic context in which the contrasts are embedded (e.g., Aliaga-Garcia 2010; Lively et al. 1993; Logan et al. 1991; Strange and Dittmann 1984). Moreover, since it has long been suggested that there is a close link between perception and production (e.g., Fowler 1981, 1986; Goldstein and Fowler 2003; Liberman and Mattingly 1985; Liberman and Whalen 2000; Mitterer and Ernestus 2008), some studies (e.g., Wang 2008; Wong 2013) include a production task in the training and testing to see whether changes in perception transfer to production (see Sakai and Moorman 2018 for a review of the perception-production link in the perceptual training literature). There have also been attempts to train participants entirely on production and examine how their pronunciation and/or perception of L2 sounds change after the intervention (e.g., Akahane-Yamada et al. 1998; Dowd et al. 1998; Kartushina et al. 2015; Linebaugh and Roche 2015). The following section reviews several applications of perceptual training to non-native speech learning. We will describe the difficulties experienced by adult L2 listeners in each case and discuss important findings from the training interventions implemented to help them overcome these problems.
2.2 Applications of Perceptual Training

2.2.1 Training Listeners to Perceive Non-native Segmental Contrasts

We begin this section by considering adult listeners' learning of L2 contrasts at the segmental level. Languages differ from one another in the size and make-up of their consonant inventories, and one of the oft-cited problems arising from such cross-linguistic differences is Japanese listeners' difficulty with perceiving /r/ and /l/ in American English (AE). In AE, /r/ and /l/ are phonetically realized as [ɹ] and [l], but this phonemic distinction is absent in Japanese, which has only one /r/ phoneme, typically realized as [ɾ] (Price 1981; Vance 1987). The Perceptual Assimilation Model (PAM; Best 1995), for example, predicts that English [ɹ] and [l] would be perceptually assimilated into the /r/ (or /w/) category in Japanese, although they may be perceived as poor exemplars of that category (Best and Strange 1992). Due to this single-category assimilation, Japanese listeners' discrimination of AE /r/ and /l/ would often be inaccurate, as has been widely reported in the literature (e.g., Goto 1971; Miyawaki et al. 1975; Sheldon and Strange 1982; Yamada and Tohkura 1992). Moreover, this problem appears to be a persistent one, since Japanese listeners,
even with years of experience living in an AE-speaking environment, may still be unable to discriminate the contrast with native-like accuracy (MacKain et al. 1981). One of the earliest attempts to improve Japanese listeners' perception of AE /r/ and /l/ via perceptual training was made by Logan et al. (1991). In their study, native speakers of Japanese were trained to identify English minimal word pairs contrasting /r/ and /l/ in various phonetic environments, including word-initial singleton position (e.g., rice vs. lice), word-initial consonant clusters (e.g., pray vs. play), intervocalic position (e.g., arrive vs. alive), word-final consonant clusters (e.g., hoard vs. hold), and word-final singleton position (e.g., rear vs. real). Minimal pairs like these (68 pairs in total) were recorded by five AE talkers, and the resulting stimuli served as the training materials. In each training session, subjects were presented with the stimuli from one of the talkers and trained using two-alternative forced-choice identification: they heard one stimulus per trial and identified it by choosing the member of the minimal pair it belonged to. The pair was shown on a screen and responses were made by button presses. Incorrect responses were followed by feedback. Subjects were trained on the stimuli from each talker three times, yielding 15 training sessions in total. They identified heard stimuli in a similar way in the pre-test and post-test, which used a different set of /r/-/l/ minimal pairs (adopted from Strange and Dittmann 1984) produced by a talker not included in the training. The results showed that the Japanese-speaking participants' overall identification accuracy for /r/ and /l/ improved significantly (from 78.1% in the pre-test to 85.9% in the post-test). Moreover, the magnitude of the improvement was contingent on the phonetic context in which the contrast was embedded.
For example, /r/ and /l/ in word-initial clusters showed the greatest increase in identification accuracy: in the pre-test, the average accuracy rate for /r/ and /l/ in this context (below 60%) was the lowest among all the phonetic environments, but it increased markedly (to almost 80%) in the post-test. Clearly, Japanese listeners' perception of /r/ and /l/ is malleable and amenable to improvement, even when the pair of sounds occurs in the most challenging phonetic context. In a later study, Lively et al. (1993) conducted a training program identical to that of Logan et al. (1991) in terms of training and testing procedures. Their Japanese-speaking listeners' identification accuracy increased progressively during the training, and their performance in the final training session was comparable to that in two generalization tests which presented new words recorded by a familiar talker and by a new talker. This suggests that the training-induced improvement transferred to novel stimuli and that the listeners had learned to exploit the essential cues to the /r/-/l/ contrast. To further examine whether such improvement could be robustly preserved, Lively et al. (1994) carried out a perceptual training study that included two retention tests. Overall, their training procedures were identical to those of Logan et al. (1991): trainees learned to identify English minimal pairs contrasting /r/ and /l/ in different phonetic contexts in a two-alternative forced-choice identification task. Assessment of performance immediately after the training replicated the results of Logan et al. That is, Japanese-speaking trainees' overall identification accuracy significantly
increased (from 65% in the pre-test to 77% in the post-test). Three months after the completion of the training, some of the trainees returned for a follow-up test, which was the same as the post-test. Although no extra training was offered during the three-month interval, there was no significant decrease in response accuracy from the post-test to the three-month follow-up test. This finding is noteworthy given that the participants were living in a Japanese-speaking environment without much exposure to spoken English during the course of the study. Even without an environment that encourages the use of English outside the laboratory, information about the /r/-/l/ contrast acquired during short-term perceptual training can thus be retained for some time. In sum, it has been demonstrated that for Japanese listeners, perceptual training brings about an improved ability to distinguish between English /r/ and /l/ which not only generalizes to novel stimuli but also persists over time. Such laboratory-based intervention has also been shown to be effective in other cases of consonant perception, at least when pre-test and post-test results are compared. These cases include English /θ/ and /ð/ perceived by French-speaking learners of English (e.g., Jamieson and Morosan 1986), Hindi dental and retroflex stops perceived by native AE and Japanese listeners (Pruitt et al. 2006), three-way voicing distinctions perceived by AE listeners (e.g., McClaskey et al. 1983; Pisoni et al. 1982), and so on. In addition to consonants, vowels are another segmental dimension along which new categories need to be formed during the course of L2 acquisition. However, there are a number of intrinsic differences between consonants and vowels in terms of their psycholinguistic nature. It has long been recognized that perception of vowels tends to be less categorical than that of consonants (Pisoni 1971, 1973; Stevens et al. 1969).
Information carried by vowels has also been found to be perceptually more mutable and less secure than that carried by consonants, and this seems to be true regardless of the relative sizes of the consonant and vowel repertoires in the listener's native language (e.g., Cutler et al. 2000; Marks et al. 2002; Van Ooijen 1996). In fact, it has been hypothesized that, relative to consonants, vowels convey less information for distinguishing one word from another and thus contribute less to lexical processing (e.g., Nespor et al. 2003). Nevertheless, despite the less categorical nature of vowels relative to consonants, it is still possible to establish or reinforce L2 vowel categories via perceptual training. A case in point is European Portuguese listeners' learning of the /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ contrasts in AE. European Portuguese has only /i/, /e/, and /ɛ/ in the front region of the vowel space and does not distinguish between /ɛ/ and /æ/ (Escudero et al. 2009). It has only /u/ in the high back region, and thus does not use a tense-lax distinction for high vowels (or for vowels in general). It is of interest to note that in addition to vowel quality differences, AE /i/-/ɪ/ and /u/-/ʊ/ also involve durational differences: tense vowels tend to be realized with a longer duration than their lax counterparts, although this is only a secondary cue to the contrasts for native English listeners (Ainsworth 1972; Hillenbrand et al. 2000). European Portuguese does not exploit vowel length phonemically either (Mateus et al. 2005). The spectral and temporal cues to the three pairs of AE vowels are simply irrelevant for identifying vowel categories in European Portuguese. Thus, European Portuguese EFL learners'
discrimination between the two members of each vowel pair would be expected to be poor or, at least, less accurate than that of native English listeners. This problem was indeed observed for the European Portuguese EFL learners in a study by Rato (2014) prior to their participation in a perceptual training program. The learners first received a pre-test in which they identified the vowels in English words with a CVC structure (where C = consonant and V = vowel), such as heed. A word was presented in each trial, and its vowel was one of the following seven: /ɛ/, /æ/, /i/, /ɪ/, /u/, /ʊ/, and /ʌ/. The learners were shown seven words representing the seven vowels (e.g., heed, hid, head, had, who'd, hood, and hud) and asked to select the one that corresponded to the heard word. The pre-test results indicated that for the learners who participated as trainees, the overall correct identification rates for words containing the /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ contrasts were around 60% (/ɛ/-/æ/: 66%; /i/-/ɪ/: 56%; /u/-/ʊ/: 65%). While these accuracy rates were clearly much better than chance (14.3%, given seven alternatives), they were still relatively low compared with those of a group of native listeners, whose overall performance was near-perfect (above 95%). The training program carried out to improve the learners' perception of /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ consisted of five sessions and included discrimination and identification tasks. The first three sessions presented an AX discrimination task and a two-alternative forced-choice identification task; each trial in both tasks involved just the two vowels of one contrast. The final two sessions included an oddity discrimination task (in which listeners heard three stimuli and chose the categorically different one) and a seven-alternative forced-choice task similar to the pre-test.
Feedback was provided immediately after each trial in all tasks. A series of assessments after the training suggests that the intervention was generally effective in helping the European Portuguese EFL learners develop robust AE vowel categories. The trained learners' results in the post-test, which was identical to the pre-test, revealed significant performance gains for all three vowel contrasts: the mean identification accuracy rates for /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ rose to about 80% (/ɛ/-/æ/: 81%; /i/-/ɪ/: 85%; /u/-/ʊ/: 79%). A retention test administered two months after the completion of the training further showed that the trained learners' overall accuracy did not decline for any vowel pair, indicating that learning was retained. Finally, there was partial generalization of training. Compared with the post-test, the trainees' performance in a generalization test (a seven-alternative forced-choice task presenting new words produced by five novel talkers) was even better for /ɛ/-/æ/ (90%) and /i/-/ɪ/ (89%), but for /u/-/ʊ/ accuracy significantly decreased (67%). However, their performance on /u/-/ʊ/ in the generalization test was still significantly higher than that of a control group who did not receive any training, suggesting that the training was somewhat beneficial even for the back vowel pair. The benefits of perceptual training for European Portuguese listeners are notable considering that the AE tense-lax vowel distinctions seem to be a pervasive problem. Non-native listeners can find tense and lax vowels confusable even if part of the cues to tense-lax contrasts has a distinctive function in their native language. An
example would be listeners of Cantonese. This language uses vowel length phonologically (Lee 1983), and one might assume that such use of vowel length would assist Cantonese listeners in distinguishing /i/ and /u/ from /ɪ/ and /ʊ/. Yet, they may still have trouble with tense-lax vowel contrasts (Hung 2000). Fortunately, the problem is not insurmountable, as demonstrated in a training study by Wang (2008) with EFL learners whose native language was either Cantonese or Mandarin. As in European Portuguese, the contrasts /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ do not exist in Mandarin, and distinguishing between the two vowels of each pair is indeed challenging for Mandarin-speaking listeners (Wang 1998). Before the training phase, Wang's EFL learners received a pre-test that included a two-alternative forced-choice task in which they identified minimal pairs contrasting the three pairs of vowels (e.g., heed vs. hid, head vs. had, who'd vs. hood, etc.), produced by two native English speakers. Next, during the training, the learners were presented with synthesized stimuli of relevant minimal pairs as well as natural stimuli produced by another four native speakers. Their task was the same as in the pre-test, except that this time immediate feedback was provided. Comparisons of pre-test and post-test performance indicated that the trainees' overall identification accuracy for the three vowel contrasts all increased significantly (to above 75%), with average accuracy gains for /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ of 16%, 14%, and 32%, respectively. Their performance in a generalization test using new words produced by familiar and novel talkers was above 75%, significantly better than that of a control group. A retention test administered three months later showed that while the trainees' accuracy rates for the three vowel contrasts declined slightly, they were still significantly higher than those in the pre-test.
Interestingly, the findings also suggest that Wang's (2008) Mandarin and Cantonese EFL learners did pick up the appropriate spectral cues to the vowel contrasts. The evidence comes from an identification task that used synthesized vowel continua and was also included in the training and testing phases. For each vowel contrast, a synthetic vowel continuum varying in spectral and durational steps was created. There were six steps along both the spectral and the temporal dimension, and all possible combinations of the spectral and temporal steps yielded a total of 36 stimuli for each contrast. Each stimulus had two possible identification responses (e.g., /i/ or /ɪ/), and the percentage of one response (e.g., /i/) was calculated. The results revealed that compared with the pre-test, the trainee group in the post-test gave more /i/ responses to stimuli at the spectral steps closer to the /i/ end of the continuum and fewer /i/ responses to those at the steps closer to the /ɪ/ end. This pattern was found for /ɛ/-/æ/ and /u/-/ʊ/ as well, suggesting that the trainees perceived the vowel contrasts more categorically in the post-test and had become more sensitive to the spectral cues. Yet, the identification patterns across the six temporal steps provide little evidence that the trainees exploited duration in perceiving the vowels after the training. Generally speaking, the above results were found not only for the Mandarin-speaking learners but also for the Cantonese-speaking ones, whose language uses vowel length phonologically. Modifications introduced by training in a laboratory setting can therefore enable learners to attend to the most important cues for distinguishing non-native vowel contrasts.
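The computation behind such an identification function can be illustrated with a minimal sketch. The data below are hypothetical (they are not Wang's 2008 results), and the function name and data layout are my own; the point is simply how responses over a 6 × 6 spectral-by-temporal stimulus grid are collapsed into percent-/i/ scores per spectral step.

```python
# Illustrative sketch with invented data: an identification function over a
# 6x6 synthetic /i/-/I/ continuum. responses[(spectral, temporal)] is the
# list of labels given to that stimulus across trials.

def percent_i_by_spectral_step(responses, n_steps=6):
    """Percent of /i/ responses at each spectral step, collapsed over the
    temporal dimension."""
    result = []
    for s in range(1, n_steps + 1):
        labels = [lab for (spec, temp), labs in responses.items()
                  if spec == s for lab in labs]
        result.append(100 * labels.count("i") / len(labels))
    return result

# Hypothetical post-test-like pattern: consistent /i/ responses at the /i/
# end of the continuum (steps 1-3) and consistent /I/ responses at the /I/
# end (steps 4-6), i.e., a sharp, categorical identification function.
fake = {(s, t): (["i"] * 5 if s <= 3 else ["I"] * 5)
        for s in range(1, 7) for t in range(1, 7)}
print(percent_i_by_spectral_step(fake))  # [100.0, 100.0, 100.0, 0.0, 0.0, 0.0]
```

A shallower slope across the spectral steps (e.g., intermediate percentages at every step) would correspond to the less categorical pre-test pattern described above.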
Perceptual training has also been shown to be effective in improving the perception of English vowels by other non-native populations, such as Japanese listeners (e.g., Nishi and Kewley-Port 2007) and French listeners (e.g., Iverson et al. 2012). In general, the findings from the literature point to the conclusion that L2 learners' perceptual patterns are to some extent malleable by laboratory-based training as far as perception of segmental contrasts is concerned. In what follows, we will examine whether such malleability is also evident in the learning of suprasegmental features. Although training studies on this issue are relatively few (compared with those on the learning of consonants and vowels), we will survey a number of them, focusing on two cases: training on the perception of lexical tones and of length contrasts.
2.2.2 Training Listeners to Perceive Non-native Suprasegmental Contrasts

In this section, we consider how perceptual training can be applied to the learning of lexical tone as well as length contrasts, and begin by discussing a specific case of application: training native AE listeners to perceive Mandarin lexical tones. Lexical tones are distinctive pitch patterns over a syllable that serve the function of distinguishing one word from another. Mandarin has four lexical tones, referred to as Tone 1, Tone 2, Tone 3, and Tone 4 and characterized by high-level, rising, low-dipping, and falling pitch patterns, respectively. These pitch patterns are realized as changes in fundamental frequency (F0). Table 2.1 shows how combining the four tones with the same CV syllable results in different words, and Fig. 2.2 provides an illustration of the F0 manifestation of the tones. It has long been observed that in addition to F0, the four tones differ in other acoustic properties such as duration. For example, Tone 3 is intrinsically the longest while Tone 4 is the shortest (Lin 1965; Tseng 1990). Nevertheless, such non-F0 cues are only secondary acoustic correlates for native Mandarin listeners (e.g., Blicher et al. 1990; Howie 1976; Moore and Jongman 1997). Perceiving Mandarin tones poses a challenge to AE listeners because of the different uses of pitch in the two languages. Pitch shapes over a syllable are not

Table 2.1 Mandarin lexical tones
Tone category   Pitch pattern   Tone numeral   Example
Tone 1          High-level      55             ma55 'mother'
Tone 2          Rising          35             ma35 'hemp'
Tone 3          Low-dipping     214            ma214 'horse'
Tone 4          Falling         51             ma51 'scold'
Note: the tone numeral column gives transcriptions of the four lexical tones in Chao's (1930) tone numeral system.
Fig. 2.2 F0 contours of the four Mandarin lexical tones. Adapted from Xu (1997: 67)
exploited in English to encode lexical distinctions; rather, English words are distinguished suprasegmentally by lexical stress (e.g., pérmit vs. permít, where the acute accent marks the stressed syllable), which involves the relative prominence of more than one syllable and is conveyed by multiple correlates, including not only pitch but also duration and intensity (e.g., Fry 1955, 1958; Gay 1978; Kochanski and Orphanidou 2008; Lieberman 1960; Sluijter 1995). One major problem confronting AE listeners is distinguishing between Tone 2 and Tone 3. Although only Tone 2 is called a rising tone, both Tone 2 and Tone 3 acoustically contain a rise in F0, as can be seen in Fig. 2.2. It has been suggested that a salient cue to these two tones lies in the timing of the "F0 turning point," that is, the point at which the F0 rise occurs (Moore and Jongman 1997; Shen et al. 1993). This point is earlier for Tone 2 but later for Tone 3, and using such a cue is presumably difficult for listeners whose native language does not contrast lexical items with syllable-level pitch variations. A perceptual training program was carried out by Wang et al. (1999) to help improve tone perception by AE-speaking learners of Mandarin. Its overall design was the same as that of Logan et al. (1991) except for a few differences in the training procedure. Specifically, the task in the pre-test was four-alternative forced-choice identification: the learners heard 100 Mandarin monosyllables (25 for each tone) produced by a single Mandarin speaker and indicated the tone category of each. Then, those who participated as trainees were trained on tone identification with response options presented pairwise: in each trial, they were presented with a Mandarin monosyllable (e.g., bei with Tone 3) and given two options (e.g., bei with Tone 3 and bei with Tone 1). Feedback was provided immediately after they responded.
The training used stimuli provided by another four Mandarin speakers and consisted of eight sessions. The monosyllables used in the training and testing varied in syllabic structure (e.g., V, CV, CVN, CGVN, etc., where N = nasal and G = glide) to create contextual variability. Comparisons of the trainees' performance in the pre-test and the post-test indicated that tone perception improved significantly (from 69% in the pre-test to 90% in the post-test, a gain of more than 20 percentage points). Such improvement is comparable to or even greater than the performance gains observed in previous studies on non-native segmental
perception (e.g., AE /r/-/l/ perceived by Japanese listeners: Logan et al. 1991; Lively et al. 1994; AE /ɛ/-/æ/, /i/-/ɪ/, and /u/-/ʊ/ perceived by Mandarin listeners: Wang 2008). A closer inspection of the identification accuracy of individual tones revealed that perception improved significantly, and to a similar degree, for all four tones. Further, analysis of the trainees' tone confusion patterns suggests that the training substantially benefited perception even for the most confusable tone pair, Tone 2 and Tone 3. The overall rate of errors for this pair (i.e., misperceiving Tone 2 as Tone 3 and vice versa) was 25% in the pre-test, the highest among all the tone pairs. Yet, it dropped substantially (to only 8%) in the post-test. Significant decreases in error rate were also noted for most of the other tone pairs. Finally, Wang et al. (1999) found no significant difference in accuracy among the post-test, two generalization tests (one presenting new items produced by a familiar talker and the other new items produced by a novel talker), and a retention test (administered six months after the training). It is therefore possible to introduce robust and long-lasting modifications to tone perception with auditory training under laboratory conditions, as in the case of segmental perception. A further question worth asking is whether the efficacy of perceptual training differs between learners whose native language is a tonal language and those whose native language is not. One obvious difference between AE listeners' learning of Mandarin tones and the cases of segmental training discussed in the previous section lies in the fact that while consonant and vowel phonemes are universal, lexical tones are simply non-existent in some languages, including English. Given that AE listeners' tone perception can be substantially improved via perceptual training, as shown in Wang et al.
(1999), one may expect the training benefit to be even greater for native listeners of a tone language who are learning the tonal contrasts of a non-native language. However, one may also expect the opposite: perceptual training could be less effective for listeners of tonal languages than for those of non-tonal ones. This alternative possibility is not improbable. Flege's (1995) Speech Learning Model, for example, assumes that new categories for non-native contrasts cannot be formed if the listener is unable to perceive phonetic differences between native and non-native sounds. Establishing new tone categories can thus be difficult for tone language listeners because they may classify contrastive non-native tones as exemplars of a single native tone category. Conversely, although non-tonal language listeners may perceive tones in terms of native lexical prosody (e.g., lexical stress), they could more easily form new categories for tones if they notice their phonetic differences from native suprasegmental features and treat them as "novel" sounds. With respect to these two possibilities, empirical studies in the non-native speech perception literature currently do not provide a verdict: while some report that tone language experience is advantageous for the learning of non-native tones (e.g., Wayland and Guion 2004), others (e.g., So 2005) find no evidence for this. Nevertheless, it should be noted that the participants in Wayland and Guion (2004) and So (2005) were naïve listeners with no prior experience with the target tonal languages and should be regarded as non-learners. To investigate perceptual training effects on L2 tone learning and first-language influence, Wang (2013)
conducted a study with learners of Mandarin whose native language was English, Japanese, or Hmong. Among these three languages, only Hmong is a lexical tone language. With seven lexical tones, it has a more complicated tonal inventory than Mandarin, and Hmong-speaking learners might be expected to benefit from training on Mandarin tone perception to a greater extent than English- or Japanese-speaking learners. In the pre-test, the three learner groups took a four-alternative forced-choice identification task in which they identified the tone categories of Mandarin monosyllabic words. Interestingly, it turned out that the mean identification accuracy of the Hmong group (61%) was significantly lower than that of the English group (78%) and the Japanese group (80%). It thus seems that tone language experience did not give the Hmong learners any advantage in Mandarin tone identification. To examine whether Hmong listeners' Mandarin tone perception can be improved, Wang (2013) implemented a training intervention using two different paradigms: perception-only training (with only auditory input) and perception-and-production training (with auditory plus visual input). Trainees were free to choose either paradigm. As in Wang et al. (1999), the task in the perception-only training was four-alternative forced-choice identification: the trainees were presented with real monosyllabic words (produced by four native Mandarin speakers) and identified their tone categories. Feedback was provided immediately after each trial. The perception-and-production training used the same auditory stimuli, with the differences being that every time a stimulus was presented, a real-time visualization of its pitch contour was shown on the screen, and that instead of identifying the tone, trainees orally repeated the stimulus. The feedback was a display of the pitch contour of the trainee's repetition, shown together with the display of the target stimulus.
This allowed them to visually compare their own tone productions with those of the native speakers. Comparisons of performance in the pre-test, a post-test (same as the pre-test), and a generalization test (using new words produced by a new speaker) revealed no significant difference between the trainees receiving the perception-only training and those receiving the perception-and-production training. Importantly, there was no interaction effect suggesting that the training was more or less effective for a particular learner group. English, Japanese, and Hmong trainees improved from the pre-test to the post-test to a similar extent; they also showed no significant difference between the post-test and the generalization test. Consequently, although Hmong-speaking learners seem less able to identify Mandarin tones than learners with a non-tonal language background, possibly due to first-language interference, there is no evidence that this prevents them from benefiting from perceptual training. Another case of perceptual training that we will consider here is training AE listeners to perceive length contrasts in Japanese. In Japanese, the length distinction (short vs. long) is phonemic for vowels (e.g., /biɾu/ 'building' versus /biːɾu/ 'beer', where /i/ is short while /iː/ is long). This distinction is primarily temporal: long vowels are about two to three times longer than their short counterparts, with little difference in their spectral properties (Hirata 2004; Hirata and Tsukada 2004; Kondo 1995). Japanese also has a short versus long contrast for intervocalic consonants (e.g., /bagu/ 'bug' vs. /bagːu/ 'bag', where /g/ is short while /gː/ is long), with long (or
2 Perceptual Training: A Literature Review
geminate) consonants being about three times longer than the corresponding short (or singleton) consonants (Idemaru and Guion 2008). English does not employ such contrasts, and although tense vowels are phonetically longer than lax vowels, such a durational cue is, as mentioned, secondary. Consequently, perceiving the Japanese length distinctions has been a challenge for English-speaking listeners (Oguma 2000; Tajima et al. 2002; Yamada et al. 1995). The training program in Hirata (2004) was conceived with the goal of helping AE listeners overcome this challenge. Note that unlike those in most of the training interventions discussed so far, the AE participants in this study were non-learners, with no experience with the target language (i.e., Japanese). They were first given a pre-test in which they counted the number of “moras” in Japanese words. The mora in Japanese is a prosodic unit that determines the timing of segments. A short vowel, a nasal coda, and a geminate consonant each count as one mora; a long vowel is bimoraic. For instance, while /biRu/ and /biːRu/ are both disyllabic words, the former contains two moras and the latter three. Similarly, /bagu/ and /bagːu/ both consist of two syllables, but the former has two moras whereas the latter has three. The AE-speaking participants were given a handout explaining the concept of mora prior to the pre-test. The words in the pre-test had one to six moras and were presented in isolation as well as in carrier sentences. The task was a six-alternative forced-choice task in which the participants indicated the mora count of each word. The pre-test results showed that their overall identification accuracy was only 39.0%. Inspection of their accuracy rates for different types of words revealed that identifying the number of moras was particularly difficult when the words involved length distinctions.
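The mora-counting rules just described can be stated compactly in code. The following is a minimal illustrative sketch (not part of Hirata’s study); it assumes a simplified romanization in which “ː” marks a long vowel, a geminate is written as a doubled consonant letter, and a moraic nasal coda is written “N”:

```python
# A minimal sketch of the Japanese mora-counting rules described above.
# Transcription conventions here are our own assumptions: "ː" = vowel length,
# doubled consonant letters = geminate, "N" = moraic nasal coda.
VOWELS = set("aiueo")

def count_moras(word: str) -> int:
    moras = 0
    prev = ""
    for ch in word:
        if ch in VOWELS:
            moras += 1        # every short vowel counts as one mora
        elif ch == "ː":
            moras += 1        # a long vowel is bimoraic: length adds a mora
        elif ch == "N":
            moras += 1        # a nasal coda is moraic
        elif ch == prev:
            moras += 1        # the first half of a geminate is moraic
        prev = ch
    return moras

# The contrasts cited in the text:
assert count_moras("biru") == 2    # 'building': two syllables, two moras
assert count_moras("biːru") == 3   # 'beer': two syllables, three moras
assert count_moras("bagu") == 2    # 'bug'
assert count_moras("baggu") == 3   # 'bag': geminate /gː/ adds a mora
```

This makes explicit why the disyllabic pairs differ in mora count even though they match in syllable count.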
For example, their mean accuracy for words with a long vowel (37.9%) and for those with a geminate consonant (41.3%) was significantly lower than the accuracy rate for words that did not contain such segments (65.8%). Hirata’s (2004) training used the same identification task as the pre-test, but the stimuli were a different set of words produced by different talkers. Another difference was that the trainees were randomly assigned to one of two groups: one was presented with the words spoken in isolation (i.e., word training) and the other with the words spoken in carrier sentences, specifically in either sentence-initial or sentence-medial position (i.e., sentence training). This group division was intended to investigate whether learning length contrasts in isolated words and in sentences would lead to different amounts of improvement. Again, as in past studies, immediate feedback was provided after each trial. The post-test presented the same stimuli as the pre-test, and the results indicated that overall identification accuracy had improved significantly (from 39.0% in the pre-test to 53.7% in the post-test). Analysis of accuracy by word type revealed that correct identification rates for words containing long vowels or geminate consonants also increased. Specifically, the average accuracy rates for words with a long vowel and for those with a geminate consonant increased to 47.0% (from 37.9%) and 48.7% (from 41.3%), respectively, although they were still lower than that for words without these segments (68.9%). There was no significant difference in overall improvement between the two groups that received word and sentence training. These findings show that training using laboratory tasks can improve perception of length contrasts
2.2 Applications of Perceptual Training
in a non-native language for listeners without previous experience with that language. Such improvement is likewise reported in a later study by Hirata et al. (2007) on vowel length contrasts. The findings recounted above point to the conclusion that auditory training in a laboratory can effectively enhance perception of various non-native segmental and suprasegmental contrasts by listeners of various native languages. The method has also been used and proved effective for learning aspects of speech beyond individual segments and phonemic contrasts, such as syllable structure (Huensch and Tremblay 2015) and contrastive focus (Putri et al. 2019). Its seemingly universal benefit makes it interesting to consider in detail what the discussed studies have in common in terms of the design of the training intervention. In fact, it may already be apparent that nearly all of those studies use identification rather than discrimination tasks as a means to train and assess perception, and that they use stimuli recorded by several different talkers instead of just one. Such a preference for identification and stimulus variability is also evident in the training literature in general. In the following section, we will discuss how two methodological factors—the type of task (for training and assessment) and stimulus variability (due to talkers and phonetic contexts)—are thought to affect the effectiveness of perceptual training. Consideration of these factors will lend insight into how a training program can be devised to help overcome the perceptual problem of interest in this book.
2.3 Methodological Factors That Can Influence Training Outcomes

While perceptual training is widely reported to be successful in improving perception of non-native sounds, a number of methodological variables are believed to affect its efficacy. Among them, two that are particularly relevant to the current study are the type of task used in training and the variability of stimuli due to different talkers and phonetic environments. We consider these two factors in this section.
2.3.1 Training Method: Identification Versus Discrimination

Compared with studies that train perception with identification tasks, those that adopt discrimination tasks such as AX or ABX discrimination (e.g., Strange and Dittmann 1984; Wayland and Guion 2004) are fewer. This is not surprising given that identification has long been argued to be more effective than discrimination in facilitating perceptual learning (e.g., Jamieson and Morosan 1986, 1989). Presented with only one stimulus at a time, listeners in identification tasks cannot respond based on a comparison of stimuli. Such a procedure is thought to encourage the formation of robust
representations because it forces the listeners to discover the essential acoustic–phonetic properties that distinguish one sound from another. Conversely, discrimination paradigms present listeners with stimuli that are minimally contrastive, allowing them to respond based on stimulus comparison. However, such stimuli are seldom available in real-life communicative situations, and it is possible that listeners attend to acoustic–phonetic differences irrelevant to the target contrasts. For example, listeners in an AX discrimination task may tend to respond “different” even to stimuli that are just different tokens of the same category (Ou 2016). Therefore, it is commonly assumed that, in comparison to identification training, discrimination training is less effective for establishing new L2 categories. The advantage of identification over discrimination is verified by, for example, Carlet and Cebrian (2015), who directly compared the effects of the two training paradigms on the learning of British English vowels by L2 English learners with Spanish and Catalan as their native languages. The learners participating as trainees were assigned to two groups: one receiving identification training and the other receiving AX discrimination training. The training materials were CVC nonsense words recorded by four British English speakers, each containing one of the seven vowels selected for investigation (i.e., /æ/, /ʌ/, /ɪ/, /iː/, /ɜː/, /e/, and /ɑː/). For those who were trained on AX discrimination, the stimuli were presented in a pairwise manner and the task was to decide whether the two stimuli in each trial were the same or different. For those trained by means of identification, the same stimuli were presented but the task was seven-alternative forced-choice identification (with the seven alternatives representing the seven trained vowels). The learners were tested on English vowel identification.
Comparing the pre-test and post-test performance revealed that while both groups improved, the gain for those who had the identification training (about 25%) was greater than that for those who had the discrimination training (about 10%). This provides evidence for the superiority of the former training method over the latter, which is likewise reported in more recent work (e.g., Carlet 2019; Law et al. 2019). A cautionary note, however, must be made concerning the use of identification training. In order to accomplish identification tasks, participants must have knowledge of the correspondence between graphemes and phonemes, or of the association between the “labels” given as options and the abstract linguistic categories those labels represent (e.g., “Tone 1” for the high-level pitch pattern in Mandarin). If they lack such knowledge, additional instruction is required. An example is Hirata (2004), which, as mentioned, provided a handout introducing the notion of the Japanese mora to AE listeners. Identification tasks may be inappropriate, or difficult to implement, if participants cannot learn to associate the labels with the categories that they are trained to identify.
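The procedural difference between the two paradigms can be sketched as a pair of trial loops. This is an illustrative sketch under our own simplifying assumptions (the `stimuli` list of (token, category) pairs and the `respond` callback are hypothetical), not the software used in the cited studies:

```python
import random

# Identification: one stimulus per trial, forced choice among category labels.
# With no second stimulus to compare, the listener must map the token to an
# abstract category; the boolean result can drive trial-by-trial feedback.
def identification_trial(stimuli, labels, respond):
    token, category = random.choice(stimuli)
    answer = respond(token, options=labels)
    return answer == category

# AX discrimination: two stimuli per trial. Listeners may succeed by raw
# acoustic comparison of the pair, without ever forming a category label.
def ax_discrimination_trial(stimuli, respond):
    (tok_a, cat_a), (tok_x, cat_x) = random.sample(stimuli, 2)
    answer = respond((tok_a, tok_x), options=["same", "different"])
    return answer == ("same" if cat_a == cat_x else "different")
```

The contrast in what `respond` receives (a single token versus a pair) is exactly the methodological difference at issue: only the identification trial forces a category decision.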
2.3.2 Talker and Context Variability

To ensure a sufficient number of training stimuli and to encourage generalization, the stimuli are usually varied in some way. The variability has multiple sources; for instance, the stimuli can comprise different words. The discussion in this section focuses on two major sources of variability that are of particular relevance to our training program: variability due to different talkers and variability due to different phonetic contexts. One common feature of the studies reviewed above is that they all include multiple talkers for the purpose of creating high acoustic–phonetic variability in the stimuli. Evidence has accumulated that this design is conducive to the formation of robust categories, allowing learners to generalize the learned patterns to novel items (Barcroft and Sommers 2005; Logan et al. 1991; Wang et al. 1999; Ryu and Kang 2019). The reason for including multiple talkers is similar to that for choosing identification tasks over discrimination tasks. Exposing learners to exemplars produced by different talkers forces them to focus on cues essential to the target contrasts and to ignore idiosyncrasies in the talkers’ speech. Exposing them to exemplars produced by a single talker, by contrast, can lead to the learning of talker-specific details which are not crucial or which cannot be generalized to novel talkers. Multi-talker stimuli also better resemble the speech input that learners encounter in real-world communicative settings and are therefore more ecologically valid. In fact, it has recently been suggested that the benefit of high variability is so robust that as long as it is present in the training stimuli, identification and discrimination training produce comparable gains (Shinohara and Iverson 2018). Nevertheless, despite the widely reported advantage of high talker-variability training, some research reveals that such training is not necessarily helpful for all listener populations.
Specifically, listeners with weak perceptual abilities may not benefit from high talker-variability, as demonstrated in a study by Perrachione et al. (2011), who trained AE listeners to identify tone-like pitch contours. The listeners varied in their pre-training aptitude in pitch perception and were trained to identify the pitch contours (i.e., level, rising, and falling) of a number of nonsense words (e.g., /dôi/) under two conditions: a high-variability condition in which they were exposed to stimuli produced by four different talkers and a low-variability condition in which they were exposed to stimuli produced by only one talker. The results of a post-training test, which presented stimuli from four new talkers, indicated that only those with stronger pitch perception abilities in the first place benefited from high talker-variability training; those who were less able to perceive pitch contours were adversely affected by high talker-variability. This finding can be linked to the fact that the lower acoustic–phonetic predictability of high-variability listening conditions increases processing cost (Creel et al. 2008; Mullennix and Pisoni 1990; Wong et al. 2004). Neger et al. (2014) demonstrate that perceptual learning depends at least in part on the ability to detect implicit statistical regularities. Using multi-talker stimuli could thus make it harder to track these regularities and hence impair perceptual learning, suggesting that the benefit of high variability is not as well-founded as one might have assumed.
An additional reason to rethink the use of high talker-variability is that individual differences exist not only in non-linguistic attributes such as the speaker’s voice, but possibly also in the production patterns of the very sound contrasts that learners are trained to identify. Speech production involves a complex coordination of articulatory gestures, which typically overlap in time to some extent and result in what is commonly referred to as coarticulation. Much work (e.g., Beddor 2009; Grosvald 2009; Noiray et al. 2011; Yu 2016) has revealed that speakers of even the same language exhibit differing coarticulatory patterns. One example is a recent study by Beddor et al. (2018) on anticipatory vowel nasalization produced and perceived by native AE speakers. In an experiment examining the production of English words with a CVNC structure (e.g., bent), which contain a vowel followed by a nasal coda, it was found that these speakers employed two different strategies. Some tended to produce the words with an earlier onset of coarticulatory vowel nasalization, resulting in a more nasalized vowel, while others tended to produce them with a later onset of coarticulatory nasalization, resulting in a longer duration for the nasal consonant. Furthermore, a second experiment investigating the time course of perception with the eye-tracking technique showed that those who produced an earlier onset of coarticulatory nasalization were more efficient in using vowel nasality during on-line speech processing, suggesting that speakers who produce a particular pattern are inclined to attend to that pattern in perception. One implication of these findings for multi-talker perceptual training research is that different talkers may in fact produce different acoustic–phonetic cues to the same contrasts.
It is conceivable that if learners are trained on stimuli from talkers producing one pattern but tested on stimuli from talkers producing other patterns, the training outcome may be suboptimal or may not reflect learning of the cues that the researchers believe to have been learned. Unfortunately, there are currently no training studies that perform detailed acoustic analyses of individual talkers’ patterns in producing the sound contrasts of a real spoken language and investigate the potential consequences of such talker-variability for perceptual learning. As an interim summary, it can be said that despite the mounting evidence for the efficacy of high talker-variability training, recent findings reveal that the picture is in fact complicated and that care needs to be exercised in using such a training paradigm. Substantial acoustic–phonetic variability across stimuli due to multiple talkers may be overwhelming for listeners with weak perceptual abilities and therefore impair their perception. Talkers may also employ different strategies in producing the target sound contrasts, and it is preferable to analyze their production patterns individually whenever possible. In the following, we consider another crucial source of stimulus variability—variability that arises from different phonetic contexts. Exposure to a large number of contextual variants of a target sound category is thought to be important for the formation of robust representations. The importance of such exposure relates to the fact that while some variants may be easy to perceive and
identify, others can be quite challenging and constitute the major source of learners’ perceptual difficulty. A case in point is Japanese EFL learners’ perception of AE /r/ and /l/, which are realized differently across pre-vocalic and post-vocalic environments (Lehiste 1964). Perceiving the /r/–/l/ contrast has been shown to be harder in the former environments than in the latter for Japanese listeners (Goto 1971; Lively et al. 1993; Logan et al. 1991; Sheldon and Strange 1982). For example, as mentioned, the pre-test results of Lively et al.’s (1993) training study indicated that the listeners’ average accuracy in identifying the contrast carried in a pre-vocalic consonant cluster (e.g., breed vs. bleed) was the lowest (below 60%) among all phonetic environments. On the other hand, they performed relatively well (with a mean accuracy rate of about 95%) when the contrast occurred post-vocalically. This suggests that learners’ pre-training ability to perceive a non-native contrast can depend crucially on the phonetic contexts in which the contrast is embedded. Although it is common practice in perceptual training studies to use stimuli elicited in different phonetic contexts, few of them explore the extent to which learners’ initial perception of the target sound categories is contextually dependent. For instance, Wang et al. (1999) trained AE listeners to identify the four Mandarin lexical tones realized on monosyllabic words with different segmental structures (e.g., CV, CVN, CGVN, etc.), but did not examine whether the listeners’ performance varied across segmental contexts.
One can of course reasonably argue that, except perhaps for some fine-grained F0 perturbations due to different consonants or vowels (Chen 2011; Jun 1996; Kohler 1982; Ladd and Silverman 1984; Ohala 1978; Whalen and Levitt 1995), variation in segmental strings is unlikely to influence the F0 realizations of the tones in such a way as to bias the listeners’ responses and completely prevent them from identifying the tonal categories. Yet such an argument does not apply to all sound contrasts and contexts. An example is Mandarin-speaking EFL learners’ perception of English lexical stress contrasts carried in rising intonation, which is of central interest to this study. As will be discussed in greater detail, the cues to English lexical stress differ considerably across intonational contexts, and this causes a problem for the learners. We have shown in this section that identification tasks are preferred as a means of successfully modifying perceptual patterns. To this end, researchers also tend to include stimuli produced by multiple talkers, although recent evidence suggests caution in using high talker-variability training. It is also important to consider the phonetic contexts of the target categories or contrasts that learners are trained to perceive. The next section describes how the phonetic manifestation of English lexical stress is conditioned by intonational context and discusses findings from previous studies on Mandarin listeners’ perception of lexical stress.
2.4 English Lexical Stress and Mandarin-Speaking Listeners

Lexical stress is a phonologically distinctive feature that defines the relative prominence of syllables within a word. In a polysyllabic word, one or more syllables receive lexical stress and typically sound more prominent than the unstressed ones. In English, although such stress-bearing syllables are predominantly word-initial (Cutler and Carter 1987), their occurrence is variable: there can be, for instance, disyllabic words with stress on the first syllable (e.g., cámpus, where the stressed syllable is indicated by the acute mark), i.e., a trochaic stress pattern, as well as disyllabic words with stress on the second syllable (e.g., campáign), i.e., an iambic stress pattern. Compared with that of lexical tones, the acoustic–phonetic manifestation of lexical stress in English is rather complex. First, it involves differences along a combination of phonetic dimensions, including pitch, duration, intensity, and vowel quality (Fry 1958; Gay 1978; Sluijter 1995). For the majority of English words, lexical stress is signaled simultaneously by differences in all these dimensions. For example, other things being equal, the stressed first syllable of áddress contains a full vowel [æ] and tends to have higher pitch, longer duration, and greater intensity than the unstressed first syllable of addréss, which has a reduced vowel (i.e., the schwa [ə]). There are also words in which the presence or absence of lexical stress is cued only suprasegmentally (i.e., by differences in pitch, duration, and intensity), but word pairs that contrast in this way (e.g., pérmit vs. permít) are quite few in the English lexicon (Cutler 1986; Cutler and Pasveer 2006). Second, the suprasegmental cues to stressed syllables—particularly the pitch-related ones—can differ across intonational contexts.
Stressed syllables produced in non-constraining sentences (e.g., statements), which have a falling intonation, are associated with a high nuclear pitch accent (denoted by H* in the ToBI (Tones and Break Indices) prosodic annotation system). Followed by a low phrase accent (L-) and a low boundary tone (L%), this high nuclear pitch accent results in what listeners perceive as higher pitch on stressed syllables. Yet, when produced in sentences with a rising intonation, such as yes/no questions, stressed syllables bear a low nuclear pitch accent (L*), followed by a high phrase accent (H-) and a high boundary tone (H%). In the case of disyllabic words, these high phrase accents and boundary tones cause the second syllable to always have higher overall pitch than the first. This can be seen in Fig. 2.3, which shows pitch curves of pérmit and permít elicited in falling and rising intonation. In the rising intonation context, the cues to different stress patterns no longer include pitch height; rather, as will be seen in the next chapter, they include the relative durations of the vowels within a word as well as the shape of the pitch curve (Ou 2010). Much research concerned with lexical stress perception has focused on the issue of “stress deafness”—a term originally coined to describe French listeners’ inability to perceive stress contrasts (Dupoux et al. 1997; Dupoux et al. 2008; Peperkamp and
Fig. 2.3 Pitch contours of pérmit and permít produced in falling (upper panels) and rising (lower panels) intonation by a female American English speaker, with annotations using the ToBI system (stimuli from Ou 2010)
Dupoux 2002). While the stress deafness that French listeners experience is presumably due to the fact that their language has fixed phrasal prominence associated with final syllables and does not use contrastive word stress, a series of studies conducted by the author of this book has revealed a special type of stress deafness on the part of Mandarin-speaking listeners learning English as a foreign language (henceforth EFL) in Taiwan. That is, they have considerable difficulty identifying the location of stress in rising intonation. For example, although they are able to identify disyllabic noun–verb stress minimal pairs (e.g., pérmit vs. permít) with near-perfect accuracy when stressed syllables are produced in falling intonation and consistently marked with higher pitch, Taiwanese EFL learners, including those with intermediate to high English proficiency, generally fail to do so when stressed syllables are produced in rising intonation, under which it is always the second syllable of the word that has the higher pitch value (Ou 2016, 2019). The problem is reflected not only in low overall response accuracy in the rising intonation context, but also in a large accuracy difference between disyllabic words with a trochaic stress pattern and those with an iambic stress pattern. When the intonation is rising, the learners’ accuracy is extremely low for trochaically stressed words but unusually high for iambically stressed ones. This should be interpreted as an artifact of the learners’ strong inclination to give a “verb” or “second syllable stressed” response to any disyllabic stimulus with higher pitch on the second syllable. For this reason, their superior performance on iambically stressed items should be regarded as spurious. These findings cannot be attributed to a lack of lexical knowledge (e.g., knowing only the verb permít but not its noun counterpart pérmit), as similar
results were obtained in comparable experiments using nonsense words (Ou 2010). Taiwanese EFL learners clearly show a bias toward using higher pitch as the cue to stress, which causes their deafness to stress contrasts in rising intonation. The learners’ heavy reliance on pitch height is not surprising considering the primacy of pitch cues in the perception of lexical tones. It has long been observed that Mandarin-speaking listeners are predisposed to interpreting stress distinctions as tonal contrasts (Archibald 1997; Cheng 1968; Juffs 1990). This was demonstrated in a classic study by Cheng (1968) on Mandarin speakers’ production of Mandarin–English code-switching speech. One of the well-known phonological processes in Mandarin tonal phonology is the Tone 3 Sandhi Rule, whereby a Tone 3 (low-dipping tone) changes into a Tone 2 (rising tone) when followed by another Tone 3. Cheng found that when Mandarin speakers were instructed to say a Mandarin word with Tone 3 (e.g., hao214 ‘good’) followed by an English word beginning with an unstressed syllable (e.g., proféssor), Tone 3 Sandhi was triggered, changing the Tone 3 word into Tone 2 (e.g., hao35). The sandhi did not occur when a Tone 3 word preceded an English word beginning with a stressed syllable. This suggests that Mandarin listeners perceive unstressed syllables as bearing a low tone (and stressed ones as bearing a non-low tone). The Taiwanese EFL learners’ failure to discern stress patterns in rising intonation, as reported in Ou’s studies, is thus likely due to influence from the lexical prosody of their native language. A question of interest is then whether Taiwanese EFL learners’ perception of word stress can be improved via perceptual training. In the following section, we conclude this chapter by stating the specific goals of the present research and the rationale for the design of our training intervention.
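The Tone 3 Sandhi Rule invoked above can be stated compactly in code. This is an illustrative sketch of the rule itself, not a full model of Mandarin sandhi; representing tones as the integers 1–4 and scanning left to right are our own simplifying assumptions:

```python
# Tone 3 Sandhi: a Tone 3 surfaces as Tone 2 when the next tone is also Tone 3.
def apply_tone3_sandhi(tones):
    """tones: underlying tone categories (ints 1-4); returns surface tones."""
    out = list(tones)
    for i in range(len(out) - 1):
        if out[i] == 3 and out[i + 1] == 3:
            out[i] = 2
    return out

# Cheng's (1968) observation, with the English word's initial unstressed
# syllable heard as low (Tone 3): hao214 + proféssor -> hao35 (sandhi applies).
assert apply_tone3_sandhi([3, 3]) == [2, 3]
# Before a syllable heard as non-low (here represented as Tone 4): no sandhi.
assert apply_tone3_sandhi([3, 4]) == [3, 4]
```

That the sandhi fires only when the following syllable is perceived as Tone 3 is precisely what makes Cheng’s code-switching data diagnostic of how Mandarin listeners categorize English unstressed syllables.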
2.5 The Current Study

We recruited Taiwanese EFL learners to participate in a perceptual training program which aimed to address the following three main questions:

1. When assessed immediately after the training, will the learners show an improved ability to identify English lexical stress contrasts in rising intonation?
2. Will the improvement (if any) generalize to new words?
3. Will it be maintained over an extended period of time (i.e., three months) after the training?

In addition to gains in the accuracy of stress pattern identification in the rising intonation context, we also examined whether there were any other changes—positive or negative—in the learners’ perceptual behavior. The participants were divided into a control group and a trainee group; the former completed an irrelevant activity while the latter received the training. The training stimuli included disyllabic noun–verb stress minimal pairs in which stress location is only cued suprasegmentally (i.e., not cued by vowel quality differences), such as pérmit
and permít. The trainees learned the stress contrasts in a two-alternative forced-choice task, which presented one stimulus at a time and asked them to identify its morphosyntactic category (i.e., stimuli with the first syllable stressed should be “nouns” and those with the second syllable stressed should be “verbs”). Such an identification task might be more effective for the development of robust categories than a discrimination task, as discussed above. In addition, we assumed that no additional instruction would be needed, as concepts such as noun and verb should be familiar to most EFL learners in Taiwan. However, to minimize the possibility that some word pairs were unknown to the participants, a few selection criteria, described in the following chapter, were imposed on the word pairs. Finally, we also included noun–verb pairs in which stress differences are accompanied by differences in vowel quality (e.g., áddress vs. addréss), as is often the case in the English vocabulary. These items could help determine whether the trainees would still rely on segmental quality to identify stress patterns in rising intonation even after the training. This is not impossible, since segmental information has been found to outweigh suprasegmental information in word recognition (Cutler and Chen 1997; Cooper et al. 2002). Alternatively, given the accumulated evidence for the efficacy of perceptual training in improving suprasegmental perception, the trainees might indeed learn to exploit purely suprasegmental differences between stress patterns in rising intonation. Unlike most previous studies, the current training program used stimuli from only one talker. This choice was made for the following reasons. First, as shown in the preceding section, the cues to English lexical stress are rather complicated and variable across intonational contexts.
Including multiple talkers could cause the trainees to be distracted by irrelevant acoustic properties (e.g., the speaker’s voice) and prevent them from successfully learning the appropriate cues to stress in each context. Moreover, as suggested by the findings of Perrachione et al. (2011), high talker-variability training may be deleterious for trainees with weak initial perceptual ability. We have seen that, insofar as rising intonation is concerned, Taiwanese EFL learners’ perception of stress minimal pairs is impaired and considerably biased toward the pitch height cue. They can thus be regarded as low-aptitude perceivers in this intonational context. Together with the acoustic–phonetic complexity of stress cues, multi-talker stimuli could make the training too overwhelming to be beneficial. Another reason for using only one talker is that it allowed for detailed acoustic analyses of the single talker’s production patterns, which are seldom carried out in previous work. As will be seen in the acoustic analysis sections of the next two chapters, there were indeed systematic cues to the stress patterns of the noun–verb pairs in rising intonation, including the timing of the pitch rise and the ratio of the duration of the second vowel to that of the first. Chapter 3 describes the training program in greater detail and presents the findings from the pre-test and post-test. A generalization test and a retention test were administered to answer the second and third questions listed above; their results are reported in Chap. 4.
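As a concrete illustration of the second cue mentioned above, the vowel-duration ratio can be computed directly from a segmental annotation. The interval times below are invented for illustration only; they are not measurements from this study:

```python
# Sketch of the V2/V1 duration-ratio cue: the ratio of the second vowel's
# duration to the first vowel's, computed from annotated time intervals.
# The interval values below are hypothetical, not data from this study.

def v2_v1_ratio(v1_interval, v2_interval):
    """Each interval is a (start_s, end_s) tuple from a segmental annotation."""
    v1 = v1_interval[1] - v1_interval[0]
    v2 = v2_interval[1] - v2_interval[0]
    return v2 / v1

# A trochaic token (stressed, longer first vowel) yields a ratio below 1;
# an iambic token (stressed, longer second vowel) yields a ratio above 1.
trochaic = v2_v1_ratio((0.10, 0.24), (0.30, 0.37))   # 0.07 s / 0.14 s
iambic   = v2_v1_ratio((0.10, 0.17), (0.25, 0.39))   # 0.14 s / 0.07 s
assert trochaic < 1 < iambic
```

Because this ratio is defined word-internally, it remains informative in rising intonation, where overall pitch height no longer distinguishes the two stress patterns.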
2 Perceptual Training: A Literature Review
References

Ainsworth, W.A. 1972. Duration as a cue in the recognition of synthetic vowels. The Journal of the Acoustical Society of America 51 (2B): 648–651.
Akahane-Yamada, R., E. McDermott, T. Adachi, H. Kawahara, and J.S. Pruitt. 1998. Computer-based second language production training by using spectrographic representation and HMM-based speech recognition scores. In Proceedings of the Fifth International Conference on Spoken Language Processing, ed. R.H. Mannell and J. Robert-Ribes, 1747–1750. Sydney, Australia.
Aliaga-Garcia, C. 2010. Measuring perceptual cue weighting after training: A comparison of auditory vs. articulatory training methods. In Proceedings of the Sixth International Symposium on the Acquisition of Second Language Speech, New Sounds 2010, ed. K. Dziubalska-Kołaczyk, M. Wrembel, and M. Kul, 12–18. Poznan, Poland.
Archibald, J. 1997. The acquisition of English stress by speakers of nonaccentual languages: Lexical storage versus computation of stress. Linguistics 35 (1): 167–181.
Barcroft, J., and M.S. Sommers. 2005. Effects of acoustic variability on second language vocabulary learning. Studies in Second Language Acquisition 27 (3): 387–414.
Beddor, P.S. 2009. A coarticulatory path to sound change. Language 85: 785–821.
Beddor, P.S., A.W. Coetzee, W. Styler, K.B. McGowan, and J.E. Boland. 2018. The time course of individuals’ perception of coarticulatory information is linked to their production: Implications for sound change. Language 94 (4): 931–968.
Best, C.T. 1995. A direct realist view of cross-language speech perception. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research, ed. W. Strange, 171–204. Timonium, MD: York Press.
Best, C.T., and W. Strange. 1992. Effects of phonological and phonetic factors on cross-language perception of approximants. Journal of Phonetics 20 (3): 305–330.
Blicher, D.L., R.L. Diehl, and L.B.
Cohen. 1990. Effects of syllable duration on the perception of the Mandarin Tone 2/Tone 3 distinction: Evidence of auditory enhancement. Journal of Phonetics 18 (1): 37–49.
Bradlow, A.R., and T. Bent. 2008. Perceptual adaptation to non-native speech. Cognition 106 (2): 707–729.
Bradlow, A.R., D.B. Pisoni, R. Akahane-Yamada, and Y.I. Tohkura. 1997. Training Japanese listeners to identify English /r/ and /l/: IV. Some effects of perceptual learning on speech production. The Journal of the Acoustical Society of America 101 (4): 2299–2310.
Carlet, A. 2019. Different high variability procedures for training L2 vowels and consonants. In Proceedings of the 19th International Congress of Phonetic Sciences, 944–948.
Carlet, A., and J. Cebrian. 2015. Identification vs. discrimination training: Learning effects for trained and untrained sounds. In Proceedings of the 18th International Congress of Phonetic Sciences.
Chao, Y.R. 1930. A system of tone letters. Le Maître Phonétique 30: 24–27.
Chen, Y. 2011. How does phonology guide phonetics in segment–f0 interaction? Journal of Phonetics 39 (4): 612–625.
Cheng, C.-C. 1968. English stresses and Chinese tones in Chinese sentences. Phonetica 18 (2): 77–88.
Clarke, C., and P. Luce. 2005. Perceptual adaptation to speaker characteristics: VOT boundaries in stop voicing categorization. In Proceedings of the ISCA Workshop on Plasticity in Speech Perception, 23–26. London, U.K.
Cooper, N., A. Cutler, and R. Wales. 2002. Constraints of lexical stress on lexical access in English: Evidence from native and non-native listeners. Language and Speech 45 (3): 207–228.
Creel, S.C., R.N. Aslin, and M.K. Tanenhaus. 2008. Heeding the voice of experience: The role of talker variation in lexical access. Cognition 106 (2): 633–664.
Cutler, A. 1986. Forbear is a homophone: Lexical prosody does not constrain lexical access. Language and Speech 29 (3): 201–220.
Cutler, A., and D.M. Carter. 1987. The predominance of strong initial syllables in the English vocabulary. Computer Speech and Language 2 (3–4): 133–142.
Cutler, A., and H.C. Chen. 1997. Lexical tone in Cantonese spoken-word processing. Perception and Psychophysics 59 (2): 165–179.
Cutler, A., and D. Pasveer. 2006. Explaining cross-linguistic differences in effects of lexical stress on spoken-word recognition. In Proceedings of the Third International Conference on Speech Prosody, ed. Rüdiger Hoffmann and Hansjörg Mixdorff, 250–254. Dresden: TUD Press.
Cutler, A., N. Sebastián-Gallés, O. Soler-Vilageliu, and B. Van Ooijen. 2000. Constraints of vowels and consonants on lexical selection: Cross-linguistic comparisons. Memory and Cognition 28 (5): 746–755.
Dowd, A., J. Smith, and J. Wolfe. 1998. Learning to pronounce vowel sounds in a foreign language using acoustic measurements of the vocal tract as feedback in real time. Language and Speech 41 (1): 1–20.
Dupoux, E., C. Pallier, N. Sebastian, and J. Mehler. 1997. A destressing “deafness” in French? Journal of Memory and Language 36 (3): 406–421.
Dupoux, E., N. Sebastián-Gallés, E. Navarrete, and S. Peperkamp. 2008. Persistent stress ‘deafness’: The case of French learners of Spanish. Cognition 106 (2): 682–706.
Escudero, P., P. Boersma, A.S. Rauber, and R.A. Bion. 2009. A cross-dialect acoustic description of vowels: Brazilian and European Portuguese. The Journal of the Acoustical Society of America 126 (3): 1379–1393.
Flege, J.E. 1995. Second language speech learning: Theory, findings, and problems. In Speech Perception and Linguistic Experience: Issues in Cross-Language Research, ed. W. Strange, 233–276. Timonium, MD: York Press.
Fowler, C.A. 1981. Production and perception of coarticulation among stressed and unstressed vowels.
Journal of Speech, Language, and Hearing Research 24 (1): 127–139.
Fowler, C.A. 1986. An event approach to the study of speech perception from a direct-realist perspective. Journal of Phonetics 14 (1): 3–28.
Fry, D.B. 1955. Duration and intensity as physical correlates of linguistic stress. The Journal of the Acoustical Society of America 27 (4): 765–768.
Fry, D.B. 1958. Experiments in the perception of stress. Language and Speech 1 (2): 126–152.
Gay, T. 1978. Physiological and acoustic correlates of perceived stress. Language and Speech 21 (4): 347–353.
Goldstein, L., and C.A. Fowler. 2003. Articulatory phonology: A phonology for public language use. In Phonetics and Phonology in Language Comprehension and Production: Differences and Similarities, 159–207.
Goto, H. 1971. Auditory perception by normal Japanese adults of the sounds “L” and “R”. Neuropsychologia 9 (3): 317–323.
Grosvald, M. 2009. Interspeaker variation in the extent and perception of long-distance vowel-to-vowel coarticulation. Journal of Phonetics 37 (2): 173–188.
Hillenbrand, J.M., M.J. Clark, and R.A. Houde. 2000. Some effects of duration on vowel recognition. The Journal of the Acoustical Society of America 108 (6): 3013–3022.
Hirata, Y. 2004. Effects of speaking rate on the vowel length distinction in Japanese. Journal of Phonetics 32 (4): 565–589.
Hirata, Y., and K. Tsukada. 2004. The effects of speaking rates and vowel length on formant movements in Japanese. In Proceedings of the 2003 Texas Linguistics Society Conference: Coarticulation in Speech Production and Perception, ed. A. Agwuele, W. Warren, and S.H. Park, 73–85. Somerville, MA: Cascadilla Proceedings Project.
Hirata, Y., E. Whitehurst, and E. Cullings. 2007. Training native English speakers to identify Japanese vowel length contrast with sentences at varied speaking rates. The Journal of the Acoustical Society of America 121 (6): 3837–3845.
Howie, J.M. 1976. Acoustical Studies of Mandarin Vowels and Tones. Cambridge, England: Cambridge University Press.
Huensch, A., and A. Tremblay. 2015. Effects of perceptual phonetic training on the perception and production of second language syllable structure. Journal of Phonetics 52: 105–120.
Hung, T.N. 2000. Towards a phonology of Hong Kong English. World Englishes 19 (3): 337–356.
Idemaru, K., and S.G. Guion. 2008. Acoustic covariants of length contrast in Japanese stops. Journal of the International Phonetic Association 38 (2): 167–186.
Iverson, P., M. Pinet, and B.G. Evans. 2012. Auditory training for experienced and inexperienced second-language learners: Native French speakers learning English vowels. Applied Psycholinguistics 33 (1): 145–160.
Jamieson, D.G., and D.E. Morosan. 1986. Training non-native speech contrasts in adults: Acquisition of the English /ð/-/θ/ contrast by francophones. Perception and Psychophysics 40 (4): 205–215.
Jamieson, D.G., and D.E. Morosan. 1989. Training new, nonnative speech contrasts: A comparison of the prototype and perceptual fading techniques. Canadian Journal of Psychology 43 (1): 88–96.
Juffs, A. 1990. Tone, syllable structure and interlanguage phonology: Chinese learners’ stress errors. International Review of Applied Linguistics in Language Teaching 28 (2): 99–118.
Jun, S.-A. 1996. Influence of microprosody on macroprosody: A case of phrase initial strengthening. UCLA Working Papers in Phonetics 92: 97–116.
Kartushina, N., A. Hervais-Adelman, U.H. Frauenfelder, and N. Golestani. 2015. The effect of phonetic production training with visual feedback on the perception and production of foreign speech sounds. The Journal of the Acoustical Society of America 138 (2): 817–832.
Kochanski, G., and C. Orphanidou. 2008. What marks the beat of speech? The Journal of the Acoustical Society of America 123 (5): 2780–2791.
Kohler, K.J. 1982. F0 in the production of lenis and fortis plosives. Phonetica 39: 199–218.
Kondo, Y. 1995.
Production of schwa by Japanese speakers of English: A cross-linguistic study of coarticulatory strategies (Doctoral dissertation). Edinburgh, UK: University of Edinburgh.
Kraus, N., T. McGee, T.D. Carrell, C. King, K. Tremblay, and T. Nicol. 1995. Central auditory system plasticity associated with speech discrimination training. Journal of Cognitive Neuroscience 7 (1): 25–32.
Ladd, D.R., and K.E. Silverman. 1984. Vowel intrinsic pitch in connected speech. Phonetica 41 (1): 31–40.
Law, I.L.G., I. Grenon, C. Sheppard, and J. Archibald. 2019. Which is better: Identification or discrimination training for the acquisition of an English coda contrast. In Proceedings of the 19th International Congress of Phonetic Sciences.
Lee, T. 1983. The vowel system in two varieties of Cantonese. UCLA Working Papers in Phonetics 57: 97–114.
Lehiste, I. 1964. Acoustic characteristics of selected English consonants. International Journal of American Linguistics 30: 10–115.
Liberman, A.M., and I.G. Mattingly. 1985. The motor theory of speech perception revised. Cognition 21 (1): 1–36.
Liberman, A.M., and D.H. Whalen. 2000. On the relation of speech to language. Trends in Cognitive Sciences 4 (5): 187–196.
Lieberman, P. 1960. Some acoustic correlates of word stress in American English. The Journal of the Acoustical Society of America 32 (4): 451–454.
Lin, M.C. 1965. The pitch indicator and the pitch characteristics of tones in Standard Chinese. Acta Acoustica (China) 2: 8–15.
Linebaugh, G., and T.B. Roche. 2015. Evidence that L2 production training can enhance perception. Journal of Academic Language and Learning 9 (1): A1–A17.
Lively, S.E., J.S. Logan, and D.B. Pisoni. 1993. Training Japanese listeners to identify English /r/ and /l/ II: The role of phonetic environment and talker variability in learning new perceptual categories. The Journal of the Acoustical Society of America 94 (3): 1242–1255.
Lively, S.E., D.B. Pisoni, R.A. Yamada, Y. Tohkura, and T. Yamada. 1994. Training Japanese listeners to identify English /r/ and /l/. III. Long-term retention of new phonetic categories. The Journal of the Acoustical Society of America 96 (4): 2076–2087.
Logan, J.S., S.E. Lively, and D.B. Pisoni. 1991. Training Japanese listeners to identify English /r/ and /l/: A first report. The Journal of the Acoustical Society of America 89: 874–886.
McClaskey, C.L., D.B. Pisoni, and T.D. Carrell. 1983. Transfer of training of a new linguistic contrast in voicing. Perception and Psychophysics 34 (4): 323–330.
Mateus, M.H.M., I. Falé, and M. Freitas. 2005. Fonética e fonologia do português (Portuguese Phonetics and Phonology). Lisbon: Universidade Aberta.
MacKain, K.S., C.T. Best, and W. Strange. 1981. Categorical perception of English /r/ and /l/ by Japanese bilinguals. Applied Psycholinguistics 2 (4): 369–390.
Marks, E.A., D.R. Moates, Z.S. Bond, and V. Stockmal. 2002. Word reconstruction and consonant features in English and Spanish. Linguistics 40: 421–438.
Mitterer, H., and M. Ernestus. 2008. The link between speech perception and production is phonological and abstract: Evidence from the shadowing task. Cognition 109 (1): 168–173.
Miyawaki, K., J.J. Jenkins, W. Strange, A.M. Liberman, R. Verbrugge, and O. Fujimura. 1975. An effect of linguistic experience: The discrimination of [r] and [l] by native speakers of Japanese and English. Perception and Psychophysics 18 (5): 331–340.
Moore, C.B., and A. Jongman. 1997. Speaker normalization in the perception of Mandarin Chinese tones. The Journal of the Acoustical Society of America 102: 1864–1877.
Mullennix, J.W., and D.B. Pisoni. 1990. Stimulus variability and processing dependencies in speech perception. Perception and Psychophysics 47 (4): 379–390.
Neger, T.M., T. Rietveld, and E. Janse. 2014. Relationship between perceptual learning in speech and statistical learning in younger and older adults. Frontiers in Human Neuroscience 8: 628.
Nishi, K., and D. Kewley-Port. 2007. Training Japanese listeners to perceive American English vowels: Influence of training sets. Journal of Speech, Language, and Hearing Research 50 (6): 1496–1509.
Noiray, A., M.A. Cathiard, L. Ménard, and C. Abry. 2011. Test of the movement expansion model: Anticipatory vowel lip protrusion and constriction in French and English speakers. The Journal of the Acoustical Society of America 129 (1): 340–349.
Norris, D., J.M. McQueen, and A. Cutler. 2003. Perceptual learning in speech. Cognitive Psychology 47 (2): 204–238.
Nespor, M., M. Peña, and J. Mehler. 2003. On the different roles of vowels and consonants in speech processing and language acquisition. Lingue e Linguaggio 2 (2): 203–230.
Nygaard, L.C., and D.B. Pisoni. 1998. Talker-specific learning in speech perception. Perception and Psychophysics 60 (3): 355–376.
Ohala, J. 1978. Production of tone. In Tone: A Linguistic Survey, ed. V. Fromkin, 5–39. New York: Academic Press.
Oguma, R. 2000. Perception of Japanese long vowels and short vowels by English-speaking learners. Japanese-Language Education Around the Globe 10: 43–55.
Ou, S.-C. 2010. Taiwanese EFL learners. Concentric: Studies in Linguistics 36 (1): 1–23.
Ou, S.-C. 2016. Perception of English lexical stress with a marked pitch accent by native speakers of Mandarin. Taiwan Journal of Linguistics 14 (2): 1–31.
Ou, S.-C. 2019. The role of lexical stress in spoken English word recognition by listeners of English and Taiwan Mandarin. Language and Linguistics 20 (4): 569–600.
Peperkamp, S., and E. Dupoux. 2002. A typological study of stress “deafness”. Laboratory Phonology 7: 203–240.
Perrachione, T.K., J. Lee, L.Y. Ha, and P.C. Wong. 2011. Learning a novel phonological contrast depends on interactions between individual differences and training paradigm design. The Journal of the Acoustical Society of America 130 (1): 461–472.
Pisoni, D.B. 1971. On the nature of categorical perception of speech sounds (Doctoral thesis).
University of Michigan.
Pisoni, D.B. 1973. Auditory and phonetic memory codes in the discrimination of consonants and vowels. Perception and Psychophysics 13 (2): 253–260.
Pisoni, D.B., R.N. Aslin, A.J. Perey, and B.L. Hennessy. 1982. Some effects of laboratory training on identification and discrimination of voicing contrasts in stop consonants. Journal of Experimental Psychology: Human Perception and Performance 8 (2): 297.
Price, P. 1981. A Cross-linguistic Study of Flaps in Japanese and in American English. Ph.D. dissertation, University of Pennsylvania.
Pruitt, J.S., J.J. Jenkins, and W. Strange. 2006. Training the perception of Hindi dental and retroflex stops by native speakers of American English and Japanese. The Journal of the Acoustical Society of America 119 (3): 1684–1696.
Putri, A.S., H. Ge, A. Hart, V. Yip, and A. Chen. 2019. The effect of explicit training on comprehension of English focus-to-prosody mapping by Indonesian learners of English. In Proceedings of the 19th International Congress of Phonetic Sciences, 1937–1941.
Rato, A. 2014. Effects of perceptual training on the identification of English vowels by native speakers of European Portuguese. In Proceedings of the International Symposium on the Acquisition of Second Language Speech, vol. 5, 529–546.
Ryu, N.-Y., and Y. Kang. 2019. Web-based high variability phonetic training on L2 coda perception. In Proceedings of the 19th International Congress of Phonetic Sciences, 482–486.
Sakai, M., and C. Moorman. 2018. Can perception training improve the production of second language phonemes? A meta-analytic review of 25 years of perception training research. Applied Psycholinguistics 39 (1): 187–224.
Samuel, A.G., and T. Kraljic. 2009. Perceptual learning for speech. Attention, Perception, and Psychophysics 71 (6): 1207–1218.
Sheldon, A., and W. Strange. 1982. The acquisition of /r/ and /l/ by Japanese learners of English: Evidence that speech production can precede speech perception.
Applied Psycholinguistics 3 (3): 243–261.
Shen, X.S., M. Lin, and J. Yan. 1993. F0 turning point as an F0 cue to tonal contrast: A case study of Mandarin tones 2 and 3. The Journal of the Acoustical Society of America 93 (4): 2241–2243.
Shinohara, Y., and P. Iverson. 2018. High variability identification and discrimination training for Japanese speakers learning English /r/–/l/. Journal of Phonetics 66: 242–251.
Sluijter, A.M.C. 1995. Phonetic Correlates of Stress and Accent. The Hague: Holland Academic Graphics.
So, C.K. 2005. The effect of L1 prosodic backgrounds of Cantonese and Japanese speakers on the perception of Mandarin tones after training. The Journal of the Acoustical Society of America 117 (4): 2427.
Stevens, K.N., A.M. Liberman, M. Studdert-Kennedy, and S.E.G. Öhman. 1969. Cross-language study of vowel perception. Language and Speech 12 (1): 1–23.
Strange, W., and S. Dittmann. 1984. Effects of discrimination training on the perception of /r–l/ by Japanese adults learning English. Perception and Psychophysics 36 (2): 131–145.
Tajima, K., A. Rothwell, and K.G. Munhall. 2002. Native and non-native perception of phonemic length contrasts in Japanese: Effect of identification training and exposure. The Journal of the Acoustical Society of America 112 (5): 2387.
Tseng, C.-Y. 1990. An Acoustic Phonetic Study on Tones in Mandarin Chinese. Taipei, Taiwan: Institute of History and Philology, Academia Sinica.
Van Ooijen, B. 1996. Vowel mutability and lexical selection in English: Evidence from a word reconstruction task. Memory and Cognition 24 (5): 573–583.
Vance, T.J. 1987. An Introduction to Japanese Phonology. Albany, NY: State University of New York Press.
Wang, X. 1998. Effects of first language on native Mandarin speakers’ perception of English vowels. In Proceedings of the 14th Northwestern Linguistics Conferences, ed. K. Lee and M. Oliverie, 7–8.
Wang, X. 2008.
Perceptual Training for Learning English Vowels: Perception, Production, and Long-term Retention. Saarbrücken: VDM Verlag Dr. Müller.
Wang, X. 2013. Perception of Mandarin tones: The effect of L1 background and training. The Modern Language Journal 97 (1): 144–160.
Wang, Y., M.M. Spence, A. Jongman, and J.A. Sereno. 1999. Training American listeners to perceive Mandarin tones. The Journal of the Acoustical Society of America 106 (6): 3649–3658.
Wayland, R.P., and S.G. Guion. 2004. Training English and Chinese listeners to perceive Thai tones: A preliminary report. Language Learning 54 (4): 681–712.
Whalen, D.H., and A.G. Levitt. 1995. The universality of intrinsic F0 of vowels. Journal of Phonetics 23 (3): 349–366.
Wong, P.C., H.C. Nusbaum, and S.L. Small. 2004. Neural bases of talker normalization. Journal of Cognitive Neuroscience 16 (7): 1173–1184.
Wong, J.W.S. 2013. The effects of perceptual and/or productive training on the perception and production of English vowels /ɪ/ and /iː/ by Cantonese ESL learners. In Proceedings of the 14th Annual Conference of the International Speech Communication Association, ed. F. Bimbot, 2113–2117. Lyon, France.
Xu, Y. 1997. Contextual tonal variations in Mandarin. Journal of Phonetics 25 (1): 61–83.
Yamada, R.A., and Y.I. Tohkura. 1992. The effects of experimental variables on the perception of American English /r/ and /l/ by Japanese listeners. Perception and Psychophysics 52 (4): 376–392.
Yamada, T., R.A. Yamada, and W. Strange. 1995. Perceptual learning of Japanese mora syllables by native speakers of American English: Effects of training stimulus sets and initial states. In Proceedings of the 14th International Congress of Phonetic Sciences, 322–325.
Yu, A.C. 2016. Vowel-dependent variation in Cantonese /s/ from an individual-difference perspective. The Journal of the Acoustical Society of America 139 (4): 1672–1690.
Chapter 3
Training to Perceive English Lexical Stress in Rising Intonation: The Immediate Effects
Abstract This chapter describes the perceptual training program and reports the findings from the pre-test and post-test. The training stimuli were disyllabic noun–verb pairs that contrasted in stress position [e.g., pérmit (n.) vs. permít (v.)] and were produced in both falling and rising intonation. Acoustic analyses revealed two cues to stress patterns in rising intonation: (i) the timing of the pitch elbow, the point at which pitch begins to rise, and (ii) the relative durations of the first and second vowels. The stimuli were presented in an identification task to train and test our participants, who were Mandarin-speaking learners in Taiwan with an upper-intermediate to advanced English proficiency level. Statistical analyses indicated that the trainees’ stress identification in rising intonation improved significantly from pre-test to post-test, suggesting that the training intervention was effective at least for stimuli used in both training and testing. Surprisingly, it was found that their stress identification in falling intonation declined from pre-test to post-test for the stimuli with trochaic stress. Further correlation analyses suggested that this might be attributed to a bias toward using relative vowel durations in identifying stress patterns.
3.1 Introduction

A perceptual training program was conducted with the goal of improving Taiwanese EFL learners’ perception of English lexical stress contrasts in falling intonation and, particularly, in rising intonation. It consisted of six sessions to be completed within a six-day period, preceded by a pre-test and followed by a post-test. The flow of the training and testing is outlined in Fig. 3.1. The post-test used the same stimuli as the pre-test and was conducted shortly after the completion of the program. Performance changes from the pre-test to the post-test, if any, are therefore interpreted as what we refer to as the “immediate” effects of the training. We report the findings from the two tests and discuss them in relation to these effects.
© The Author(s), under exclusive license to Springer Nature Switzerland AG 2020 S. Ou, Perceptual Training on Lexical Stress Contrasts, SpringerBriefs in Linguistics, https://doi.org/10.1007/978-3-030-51133-3_3
3 Training to Perceive English Lexical Stress …
Fig. 3.1 Flowchart of the perceptual training program
3.2 Materials

Twelve pairs of disyllabic English words served as the training stimuli. Each pair comprised a noun and a verb. The noun had primary stress on the first syllable and therefore a trochaic stress pattern, whereas the verb had primary stress on the second syllable and thus an iambic stress pattern. Among the 12 pairs, six were so-called stress minimal pairs; that is, the noun and verb were segmentally identical and differed only in the location of primary stress (e.g., pérmit (n.) vs. permít (v.), where the acute mark indicates primary stress). These stress minimal pairs are listed in the first column of Table 3.1 and will be referred to as “non-reduction” pairs because the vowels in the unstressed syllables were not reduced (e.g., to the schwa [ə]). However, such pairs are not common in the English lexicon, as there is a strong tendency for unstressed syllables to undergo vowel quality reduction (Cutler 1986; Cutler and Pasveer 2006). For this reason, the remaining six pairs were included as the “reduction” pairs, in which the unstressed syllables were realized with a reduced vowel. They are shown in the second column of Table 3.1. To minimize the possibility of confounds due to unfamiliarity with the materials, all the nouns and verbs had to be included in the 7000-word English vocabulary list compiled by the College Entrance Examination Center of Taiwan for Taiwanese high school students. The words all had a frequency of at least one per million words in the COBUILD corpus of English, according to the lexical statistics from the CELEX English database (Baayen et al. 1993). The nouns and verbs were inserted into the final positions of two sentential contexts: (i) an affirmative statement (i.e., Yes, I said ____.) and (ii) a yes–no question (i.e., Did you say ____?). The statement carrier sentence was used to elicit productions of the words in falling intonation.
Table 3.1 Pairs of nouns and verbs used in the perceptual training

Non-reduction            Reduction
pérmit, permít           áddress, addréss
súrvey, survéy           éxploit, explóit
résearch, reséarch       récord, recórd
tránsplant, transplánt   próject, projéct
ínsert, insért           súspect, suspéct
ímport, impórt           próduce, prodúce

In this intonational context, stressed syllables receive a high nuclear pitch accent (H*), followed by a low phrase pitch accent (L-) and a low boundary tone (L%). Compared with unstressed syllables, they would
be realized with higher pitch—a cue that Mandarin-speaking EFL learners were reported to heavily rely on in identification or discrimination of stress patterns (Ou 2010). The yes–no question was intended to elicit productions in rising intonation, a context in which stressed syllables bear a low nuclear pitch accent (L*), followed by a high phrase pitch accent (H-) and a high boundary tone (H%). Successful stress perception in such a context has proved challenging for Taiwanese EFL learners and may require cues other than pitch height, such as the duration of vowels and the shape of pitch curve (e.g., Ou 2010, 2016). A trained female phonetician and native speaker of English with a North American accent was instructed to read each word in each of the two sentences four times into a SONY Hi-MD recorder in a sound-attenuated room. The recorded speech was digitized at a sampling rate of 44 kHz (16 bits) and stored as a single WAV file on a flash drive. Since the speaker produced four repetitions, there were four tokens for each word in each sentence. These four tokens were checked for breathiness, hoarseness, creakiness, and trembling voice and the two with the least voice quality disruptions were selected for use. The selected tokens were excised from the sentential context and saved as separate WAV files. As a result, there were a total of 96 stimuli (24 words (12 pairs of nouns and verbs) × 2 intonation contexts × 2 tokens), used in the training, the pre-test, and the post-test. All the sound files used in this study and acoustic measurements performed on them (see Sect. 3.3) are available at https://osf.io/bq7ez/?view_only=40326cfecdfd450e8189f06a368d1b38.
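The stimulus inventory arithmetic above (24 words × 2 intonation contexts × 2 tokens = 96) can be made concrete with a short enumeration. This is an illustrative sketch only, not the study's actual preparation script; the word spellings are given without accent marks and the dictionary fields are our own naming:

```python
from itertools import product

# The 12 noun-verb pairs of Table 3.1 (unaccented spellings);
# each pair contributes a noun and a verb.
PAIRS = ["permit", "survey", "research", "transplant", "insert", "import",
         "address", "exploit", "record", "project", "suspect", "produce"]

words = [(w, pos) for w in PAIRS for pos in ("noun", "verb")]  # 24 words
intonations = ("falling", "rising")  # statement vs. yes-no question carrier
tokens = (1, 2)                      # the two best repetitions were kept

stimuli = [
    {"word": w, "category": pos, "intonation": i, "token": t}
    for (w, pos), i, t in product(words, intonations, tokens)
]
print(len(stimuli))  # 24 x 2 x 2 = 96
```

The same cross-product structure was used for the training, pre-test, and post-test materials.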
3.3 Acoustic Analysis

3.3.1 Falling Intonation

To identify the cues to the stress patterns of the word pairs, an acoustic analysis was performed on the 96 stimuli, focusing on the three suprasegmental dimensions associated with the implementation of lexical stress—pitch, duration, and intensity (e.g., Bolinger 1958; Fry 1955, 1958; Gay 1978; Kochanski and Orphanidou 2008; Lieberman 1960; Sluijter 1995; Sluijter and Van Heuven 1996). The results for the 48 stimuli produced in falling intonation are presented first. We begin with the dimension of pitch. Before acoustic measurements were taken, all the stimuli were subjected to manual annotation using Praat (Boersma and Weenink 2016). The pitch contours of the two syllables in each stimulus were labeled, and the portions corresponding to the labels were extracted and saved as separate files. These files were then submitted to a Python script, which calculated the average pitch value of each contour as well as the pitch values at five equidistant points in normalized time (i.e., at 0, 25, 50, 75, and 100% of the contour). The means and standard deviations (SDs) of the average pitch values of the two syllables in the stimuli are displayed in Table 3.2.

Table 3.2 Means (with SDs in parentheses) for the average pitch values (in Hz) for the first and second syllables of the stimuli in falling intonation by presence/absence of vowel reduction and by stress pattern

                Stress pattern   First syllable   Second syllable
Non-reduction   Trochaic         236.67 (44.93)   141.13 (26.86)
                Iambic           176.69 (40.89)   216.54 (21.19)
Reduction       Trochaic         201.36 (28.75)   147.24 (44.31)
                Iambic           174.88 (32.43)   197.38 (20.63)

A series of paired t-tests was carried out to compare the two syllables in the same stimulus in terms of average pitch value. An alpha level of 0.05 (two-tailed) was used for all statistical tests. It was found that when the stress pattern was trochaic, the first syllables had significantly higher pitch than the second ones, whether or not the stimuli belonged to the pairs involving vowel reduction (non-reduction: t(11) = 7.304, p < 0.001; reduction: t(11) = 3.254, p = 0.008). When the stress pattern was iambic, it was the second syllables that had significantly higher pitch (non-reduction: t(11) = −3.128, p = 0.010; reduction: t(11) = −3.037, p = 0.011). These results agree with Fig. 3.2, which shows the mean pitch values at the five time points for stimuli with trochaic and iambic stress. Clearly, the pitch peak occurs on the stressed syllable, regardless of the stress pattern. This is consistent with the analysis of stressed syllables as bearing a high nuclear pitch accent (H*) in the falling intonation context. In addition to pitch, the durations of vowels are potential cues to lexical stress as well. Therefore, during the aforementioned annotation process, vowels were also labeled, with their boundaries determined by inspection of the waveform and spectrogram in Praat. In general, the vocalic parts of the stimuli were the portions with periodic vibration in the waveform. However, in cases where the vowels were preceded or followed by sonorants with clear voicing, vowel boundaries were also determined by examining spectral properties. For nasals such as [n] and [m], the boundaries of the vowels were the points where the formants began to be dampened or changed from dampened to undampened. For the approximant [r], which has been found to lower the third formant frequency (e.g., Boyce and Espy-Wilson 1997; Delattre and Freeman 1968; Westbury et al. 1998), the vowel boundaries were defined as the midpoints of the formant transition.
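The Python-based pitch measurement described earlier in this section (mean F0 plus values at five equidistant points in normalized time) is not reproduced in the book; a minimal sketch of that time-normalization step might look as follows, assuming each contour has already been extracted (e.g., with Praat) as parallel arrays of times and F0 values. The function name is ours:

```python
import numpy as np

def contour_summary(times_s, f0_hz, n_points=5):
    """Return the mean F0 of a pitch contour and its values at
    n_points equidistant points in normalized time (i.e., at
    0, 25, 50, 75, and 100% of the contour when n_points = 5)."""
    times = np.asarray(times_s, dtype=float)
    f0 = np.asarray(f0_hz, dtype=float)
    # Rescale the contour's time axis to 0..1, then sample it at
    # equally spaced proportions by linear interpolation.
    norm_t = (times - times[0]) / (times[-1] - times[0])
    points = np.interp(np.linspace(0.0, 1.0, n_points), norm_t, f0)
    return float(f0.mean()), points
```

Because sampling is done in normalized time, contours of different absolute durations become directly comparable, which is what allows averaging across stimuli as in Fig. 3.2.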
Fig. 3.2 Mean pitch values (in Hz) at the 0, 25, 50, 75, and 100% time points for the pitch contours of the stimuli produced in falling intonation. The values were calculated separately for the non-reduction and reduction stimuli and subdivided by stress pattern. The error bars represent 95% confidence intervals

Shown in Table 3.3 are the means for the durations of the first vowels (V1s) and the second vowels (V2s) of the stimuli and for the duration ratios of V2 to V1. Paired t-tests indicated that when the stress pattern was trochaic, there were no significant durational differences between V1s and V2s, regardless of the presence or absence of vowel reduction (non-reduction: t(11) = 1.783, p = 0.102; reduction: t(11) = −0.768, p = 0.459). This could be because the words were embedded in the final position of the carrier sentence, where the durations of V2s were increased by final lengthening. On the other hand, when the stress pattern was iambic, V2s were significantly longer than V1s for both types of stimuli (non-reduction: t(11) = −5.583, p < 0.001; reduction: t(11) = −12.992, p < 0.001). These results suggest that vowel duration can be exploited to distinguish between the stress patterns: if V1 and V2 are durationally comparable, the stimulus is likely to have trochaic stress; if V2 is considerably longer than V1, chances are that the stimulus has iambic stress. Finally, measurements of intensity were made. We calculated the average intensity values of V1s and V2s, the means of which are presented in Table 3.4. Comparisons using paired t-tests revealed that when produced with trochaic stress, V1s had significantly greater intensity than V2s for both the non-reduction (t(11) = 4.965, p < 0.001) and reduction (t(11) = 6.176, p < 0.001) stimuli. The
Table 3.3 Means (with SDs in parentheses) for the durations (in milliseconds) of the first and second vowels of the stimuli in falling intonation and for the duration ratios of the second vowels to the first vowels. The values are subdivided by presence/absence of vowel reduction and by stress pattern

                 Stress pattern   First vowel      Second vowel     Duration ratio
Non-reduction    Trochaic         127.50 (21.43)   163.67 (65.06)   1.33 (0.55)
                 Iambic           102.42 (26.94)   217.75 (69.84)   2.25 (0.76)
Reduction        Trochaic         146.33 (37.23)   136.42 (29.76)   0.99 (0.35)
                 Iambic            82.33 (24.08)   221.67 (35.43)   2.88 (0.79)
Table 3.4 Means (with SDs in parentheses) for the average intensity values (in dB) of the first and second vowels of the stimuli in falling intonation by presence/absence of vowel reduction and by stress pattern

                 Stress pattern   First vowel    Second vowel
Non-reduction    Trochaic         76.19 (2.49)   68.08 (4.97)
                 Iambic           71.97 (2.95)   75.81 (1.84)
Reduction        Trochaic         76.09 (2.22)   70.12 (3.32)
                 Iambic           70.12 (3.11)   73.31 (3.30)
reverse was true when the stress pattern was iambic: this time it was the V2s that showed significantly greater intensity (non-reduction: t(11) = −3.763, p = 0.003; reduction: t(11) = −3.047, p = 0.011).

In summary, when the disyllabic words were spoken in falling intonation, the stressed syllables were realized with higher pitch and greater vowel intensity than the unstressed ones. V2s were significantly longer than V1s in the stimuli with iambic stress but not in those with trochaic stress. Potential cues for distinguishing between the two stress patterns in falling intonation can therefore be found in all three suprasegmental dimensions. A list of all the acoustic measurements performed on each individual stimulus in falling intonation is provided in Appendix A.
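The durational cue just summarized can be stated as a simple decision rule. The sketch below is only illustrative: the 1.8 ratio threshold is an arbitrary value chosen for the example, not a parameter estimated in this study; the sample durations are the reduction-pair means from Table 3.3.

```python
def guess_stress(v1_ms, v2_ms, threshold=1.8):
    """Guess the stress pattern of a disyllabic token from its vowel durations.

    The threshold on the V2-to-V1 ratio is purely illustrative.
    """
    ratio = v2_ms / v1_ms
    return "iambic" if ratio > threshold else "trochaic"

# Mean durations of the reduction stimuli from Table 3.3:
print(guess_stress(146.33, 136.42))  # near-equal vowels -> "trochaic"
print(guess_stress(82.33, 221.67))   # much longer V2 -> "iambic"
```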
3.3.2 Rising Intonation

The remaining 48 stimuli, which were elicited in rising intonation, were analyzed in the same way. The same procedures and criteria for pitch contour and vowel boundary labeling were followed. Table 3.5 presents the means of the average pitch values of the two syllables in the stimuli, and Fig. 3.3 the mean pitch values at the five
Table 3.5 Means (with SDs in parentheses) for the average pitch values (in Hz) for the first and second syllables of the stimuli in rising intonation by presence/absence of vowel reduction and by stress pattern

                 Stress pattern   First syllable   Second syllable
Non-reduction    Trochaic         162.61 (14.45)   267.07 (22.36)
                 Iambic           160.86 (11.17)   216.54 (14.05)
Reduction        Trochaic         170.91 (37.44)   255.39 (17.86)
                 Iambic           163.75 (6.28)    176.94 (15.84)
equidistant time points. As is clear from the figure, pitch peaks were associated with the second syllables. This is consistent with the results of paired t-tests, which indicated that the second syllables had significantly higher average pitch than the first ones for the stimuli with the trochaic stress pattern (non-reduction: t(11) = −14.570, p < 0.001; reduction: t(11) = −8.800, p < 0.001) and also for those with the iambic stress pattern (non-reduction: t(11) = −9.637, p < 0.001; reduction: t(11) = −2.970, p = 0.012). Consequently, the relative overall pitch heights of the syllables do not seem to be a reliable correlate of lexical stress in rising intonation. Nevertheless, it does not follow that there were no other pitch cues to stress at all. Figure 3.3 reveals that while the stimuli with trochaic and iambic stress were both produced with a rising pitch contour, the loci at which the rise started differed. For trochaic stress, pitch began to change from slightly falling to rising about halfway through the first syllable. For iambic stress, by contrast, the change did not occur until around 20% of the way into the second syllable. These observations are compatible with the view that stressed syllables in the rising intonation context bear a low nuclear pitch accent (L*), after which movement toward a high phrase accent (H-) and a high boundary tone (H%) unfolds. Following some previous studies (e.g., Herman et al. 1996; Welby 2007), we refer to the inflection point where the pitch value begins increasing toward the high tone targets as a pitch "elbow." The location of the elbow could thus signal the location of primary stress.

As for vowel duration, the means of the durations of V1s and V2s and the ratios of the latter to the former are provided in Table 3.6.
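One minimal way to operationalize the pitch "elbow" is to take the minimum of a sampled contour as the point where the fall turns into a rise. This is a deliberate simplification of the elbow-detection procedures in the studies cited above, and the ten-point contours below are hypothetical (points 0–4 span the first syllable, points 5–9 the second).

```python
def elbow_index(contour):
    """Index of the contour minimum, taken here as the pitch elbow."""
    return min(range(len(contour)), key=lambda i: contour[i])

# Hypothetical rising-intonation contours (Hz), ten equidistant samples:
trochaic = [170, 166, 162, 180, 205, 225, 240, 252, 261, 267]  # rise begins mid-syllable 1
iambic   = [168, 165, 163, 162, 161, 160, 158, 185, 205, 217]  # rise begins in syllable 2

print(elbow_index(trochaic))  # early elbow, consistent with trochaic stress
print(elbow_index(iambic))    # late elbow, consistent with iambic stress
```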
Comparisons using paired t-tests showed that when produced with trochaic stress, V2s were significantly longer than V1s only for the stimuli that involved vowel reduction (t(11) = −4.688, p < 0.001), not for those that did not (t(11) = −1.567, p = 0.145). On the other hand, when produced with iambic stress, V2s were significantly longer than V1s for both the non-reduction (t(11) = −9.157, p < 0.001) and reduction (t(11) = −11.098, p < 0.001) stimuli. Although these results appeared mixed, there was again a difference between the two stress patterns in terms of V2-to-V1 duration ratios. On average, the ratio values for trochaic stress, which ranged roughly between 1.1 and 1.4, were much closer to one than those for iambic stress, which reached about 2.5–3.0. For this reason, we averaged the duration ratios
Fig. 3.3 Mean pitch values (in Hz) at the 0, 25, 50, 75, and 100% time points for the pitch contours of the stimuli produced in rising intonation. The values were calculated separately for the non-reduction and reduction stimuli and subdivided by stress pattern. The error bars represent 95% confidence intervals
of the two tokens of the same noun or verb and compared them across the stress patterns (e.g., pérmit vs. permít). A single comparison, which included both the non-reduction and reduction pairs, was performed. We found that the ratio values were indeed significantly higher for the words with iambic stress than for their trochaically stressed counterparts (t(11) = −9.226, p < 0.001). Therefore, the extent to which V2s were longer than V1s might be a cue to the location of primary stress: the greater the extent, the more likely the stimulus was to have iambic stress. The results of the vowel intensity measurements are summarized in Table 3.7. Paired t-tests revealed that the V1s and V2s of the stimuli with trochaic stress showed no
Table 3.6 Means (with SDs in parentheses) for the durations (in milliseconds) of the first and second vowels of the stimuli in rising intonation and for the duration ratios of the second vowels to the first vowels. The values are subdivided by presence/absence of vowel reduction and by stress pattern

                 Stress pattern   First vowel      Second vowel     Duration ratio
Non-reduction    Trochaic         114.92 (24.27)   156.33 (40.73)   1.37 (0.30)
                 Iambic            81.75 (14.70)   208.25 (59.31)   2.53 (0.42)
Reduction        Trochaic         126.92 (31.34)   137.58 (14.32)   1.13 (0.23)
                 Iambic            68.83 (11.96)   186.58 (41.06)   2.98 (0.66)
Table 3.7 Means (with SDs in parentheses) for the average intensity values (in dB) of the first and second vowels of the stimuli in rising intonation by presence/absence of vowel reduction and by stress pattern

                 Stress pattern   First vowel    Second vowel
Non-reduction    Trochaic         71.88 (2.58)   73.57 (3.22)
                 Iambic           71.40 (1.74)   72.47 (2.34)
Reduction        Trochaic         73.47 (3.51)   74.94 (2.67)
                 Iambic           72.97 (3.45)   72.87 (2.27)
significant difference in average intensity value (non-reduction: t(11) = −1.461, p = 0.172; reduction: t(11) = −1.375, p = 0.197). The same was true for the iambically stressed stimuli (non-reduction: t(11) = −1.512, p = 0.159; reduction: t(11) = 0.085, p = 0.934). It thus seems that the overall intensity of vowels might not reliably signal stress position.

When the words were embedded in a sentence with rising intonation, there were two potential suprasegmental cues to lexical stress: the timing of the pitch elbow and the relative durations of V1 and V2. Trochaically stressed words were realized with a pitch elbow in the initial syllable and thus an earlier pitch rise; in contrast, iambically stressed words were realized with a pitch elbow in the second syllable and a later pitch rise. In addition, compared with trochaic stress, iambic stress was associated with a greater V2-to-V1 duration ratio. A complete list of the acoustic measurements taken on the stimuli produced in rising intonation can be found in Appendix B.

To conclude, the acoustic analysis demonstrated that the two stress patterns were not suprasegmentally identical in rising intonation. The question was whether Taiwanese EFL learners' perception of lexical stress in such an intonation context would improve after our perceptual training, the procedure of which is reported in the following section.
3.4 Training and Testing Procedures

3.4.1 Pre-test

Before the training intervention, all participants received the pre-test, which assessed their initial performance on identifying the stress patterns of the 96 stimuli. They were asked to wear headphones and complete the test on a desktop computer. The test was a two-alternative forced-choice task. In each trial, the participants saw two words on the computer screen at the same time: one was the noun from one of the noun–verb pairs, with primary stress (indicated by an acute accent) on the first syllable (e.g., pérmit); the other was its verb counterpart, with primary stress on the second syllable (e.g., permít). They were then presented with an auditory stimulus and prompted to select the word that matched it by clicking on it. The test was partitioned into 12 blocks, each of which presented all the stimuli of one noun–verb pair, one per trial, and thus consisted of eight trials (2 stress patterns × 2 intonation contexts × 2 tokens). As a result, there were 96 trials in total. Pauses were interposed between the blocks, and the participants could decide to take a break or move on at their discretion. The pre-test lasted about 10 min. The presentation of all the stimuli and the recording of responses were controlled by E-Prime 2.0 software.
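The structure of the pre-test (12 blocks × 8 trials = 96 trials) can be sketched as follows. The pair labels are hypothetical placeholders, and the within-block shuffling is an assumption added for illustration; the description above does not specify the trial order.

```python
import itertools
import random

pairs = [f"pair{i:02d}" for i in range(1, 13)]  # 12 noun-verb pairs (placeholder labels)
conditions = list(itertools.product(
    ("trochaic", "iambic"),                     # 2 stress patterns
    ("falling", "rising"),                      # 2 intonation contexts
    (1, 2),                                     # 2 tokens
))                                              # 2 x 2 x 2 = 8 trials per block

blocks = []
for pair in pairs:
    trials = [(pair,) + cond for cond in conditions]
    random.shuffle(trials)                      # assumed randomization within a block
    blocks.append(trials)

total_trials = sum(len(block) for block in blocks)
print(total_trials)
```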
3.4.2 Perceptual Training

Starting on the day after the pre-test, the training program used the same 96 stimuli and was conducted on a desktop computer in a quiet room. It ran for six consecutive days and comprised six sessions. The training method consisted of two phases: a self-practice phase followed by a task phase. In the self-practice phase, the stimuli with different stress patterns but from the same word pair were presented to the trainees in a contrastive fashion. The trainees saw a pair of words (e.g., pérmit and permít) on a PowerPoint slide, each with a sound icon below it, as shown in Fig. 3.4. Once an icon was clicked, a stimulus corresponding to the word above it was played. The trainees could click on the icons as many times as they wished to listen to the stimuli and familiarize themselves with the differences in the pronunciations of the two words. Once they felt they had practiced enough, the trainees proceeded to the task phase, which was implemented with the PowerPoint slide shown in Fig. 3.5. The slide asked them to complete the following six steps:

Fig. 3.4 Self-practice slide
Fig. 3.5 Task slide
1. Listen to an auditory stimulus, which was one of the two words from the preceding practice phase (e.g., pérmit),
2. Determine whether it was a noun or a verb and say the answer aloud,
3. Immediately listen to a pre-recorded sound file giving the correct answer (e.g., It's a noun.),
4. Listen to the same stimulus again and orally repeat it at least once,
5. Listen to a sentence with the final word omitted (e.g., The noun is _____.) and complete it by reading out the stimulus,
6. Check their own pronunciation by listening to a pre-recorded sound file in which the sentence was read out in full (e.g., The noun is pérmit).

These two phases were repeated to train perception of each word of the 12 noun–verb pairs in both falling and rising intonation. Each training session covered four pairs (two non-reduction pairs and two reduction pairs) and was divided into a falling intonation block and a rising intonation block. Both blocks contained stimuli from all four pairs, but the former contained only those produced in falling intonation and the latter only those produced in rising intonation. A short break was placed between the two blocks. Figure 3.6 shows the structure of one training session.
Fig. 3.6 Structure of a training session
From the 12 word pairs, three different training sessions were constructed. These sessions used one of the two tokens of each word and were completed within the first three days of the training. The remaining three sessions, which were completed within the last three days, were essentially repetitions of the first three, except that the other tokens of the words were used and the order of the two blocks was reversed (i.e., the rising intonation block came first). On average, it took about 20 min to complete one session and three hours to finish the whole training program.

In addition to the perceptual training, an activity was prepared for the participants who served as a control group in this study. The activity also comprised six sessions but was unrelated to the perception of the lexical stress of the word pairs. In each session, the participants listened to an English song while reading its lyrics, some words of which were left blank for them to fill in. They were then required to sing the song and record it. The average duration of each session of the activity was roughly 20 min, similar to that of the training sessions.
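The assembly of the six sessions described above can be sketched like this. The pair labels are hypothetical placeholders, and the grouping of pairs into sessions is an assumption; which pairs were trained together is not specified in the text.

```python
pairs = [f"pair{i:02d}" for i in range(1, 13)]      # placeholder labels for the 12 pairs
groups = [pairs[i:i + 4] for i in range(0, 12, 4)]  # four pairs per session (grouping assumed)

sessions = []
for token in (1, 2):                                # token 1: days 1-3; token 2: days 4-6
    for group in groups:
        blocks = ["falling", "rising"]
        if token == 2:
            blocks.reverse()                        # block order reversed in the repeated sessions
        sessions.append({"pairs": group, "token": token, "blocks": blocks})

print(len(sessions))
```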
3.4.3 Post-test

The post-test was identical to the pre-test and was administered on the day after the training ended.
3.5 Participants

At the outset, 41 non-English-major undergraduate students were recruited. They were studying at a university in Southern Taiwan and had learned English as a foreign language in Taiwan for 7.2 years on average (SD = 0.8). None of them had lived in an English-speaking country before. However, they all had a TOEIC score of at least 800, which was required for participation in this study. According to the TOEIC (2016) score descriptors, learners who score 800 or higher have a proficiency level corresponding to the B2 or C1 level of the Council of Europe's Common European Framework of Reference and may be considered upper-intermediate to advanced learners. We imposed the score requirement to ensure that our participants had sufficient metalinguistic knowledge of English vocabulary and lexical stress. Group assignment was random, with 21 participants in the trainee group and 20 in the control group. Both groups took the pre-test and post-test, but only the trainee group received the training. Two participants from the trainee group and one from the control group withdrew from the study, and their results are not presented here. No participants reported hearing or speech problems. All received an honorarium of 450 NT dollars (around 15 US dollars) for their participation.
3.6 Response Accuracy Analysis

This section describes the statistical analysis performed to examine whether and how the participants' performance changed immediately after the training intervention and whether other factors might have influenced their perception of English lexical stress. Specifically, their response accuracy in the pre-test and the post-test was analyzed by fitting mixed-effects logistic regression models using the glmer function from the lme4 package (Bates et al. 2015) in R (R Core Team 2015). The dependent variable was the participants' response, either correct or incorrect, coded as 1 and 0, respectively. The predictors of the model, all effect-coded and each containing two levels, were Phase, Intonation, and Stress Pattern. Phase represented the contrast between the pre-test and the post-test (coded as −0.5 and 0.5, respectively), Intonation the contrast between falling intonation (−0.5) and rising intonation (0.5), and Stress Pattern the contrast between trochaic stress (−0.5) and iambic stress (0.5). For each set of data to be analyzed, the best-fitting model was obtained by backward elimination of variables. Following Barr et al.'s (2013) suggestions, we constructed a full model with the maximal random-effects structure, which included a by-subject random intercept and all possible random slopes, and entered the main effects and interactions of the three predictors as fixed effects. We then removed terms from the random-effects structure one at a time, comparing each simpler model with the previous one via likelihood-ratio tests (the anova function in R), until the more parsimonious model yielded a significantly (p < 0.05) worse fit than the less parsimonious one. The same variable exclusion process was repeated for the fixed effects. Since there were two groups (i.e., trainee and control) and two types of stimuli (i.e., non-reduction and reduction), the data were partitioned into four subsets, and the procedure described above was applied to find the best-fitting model for each subset.
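The ±0.5 effect coding of the three predictors can be illustrated with a small sketch (in Python rather than R, purely for illustration; the actual models were fitted with lme4's glmer):

```python
# Effect coding of the three two-level predictors, as described above.
CODING = {
    "Phase": {"pre-test": -0.5, "post-test": 0.5},
    "Intonation": {"falling": -0.5, "rising": 0.5},
    "StressPattern": {"trochaic": -0.5, "iambic": 0.5},
}

def encode(trial):
    """Map a trial's factor levels to its effect-coded predictor values."""
    return [CODING[factor][trial[factor]]
            for factor in ("Phase", "Intonation", "StressPattern")]

trial = {"Phase": "post-test", "Intonation": "rising", "StressPattern": "trochaic"}
print(encode(trial))  # → [0.5, 0.5, -0.5]
```

With this coding, each fixed-effect estimate reflects the difference between the two levels of a factor, averaged over the levels of the other factors.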
The results for the trainee group are reported first.
3.7 Results

3.7.1 Trainee Group

The best-fitting model for the trainees' responses to the non-reduction stimuli contained the main effects and interactions of Phase, Intonation, and Stress Pattern as fixed effects, with a by-subject random intercept and random slopes for Intonation and Stress Pattern. Table 3.8 shows the fixed-effects output, and Fig. 3.7 the mean percentages of correct responses, subdivided according to the three predictors. The model revealed a significant main effect of Phase, indicating that performance in the post-test was generally better than that in the pre-test. The main effect of Stress Pattern was also significant, indicating that, regardless of the other factors, the iambically stressed stimuli were identified more accurately than the trochaically
Table 3.8 Fixed effects of the best-fitting model for the trainees' responses to the non-reduction stimuli

                β       SE(β)   z       p
(Intercept)     1.297   0.165   7.852